Determining from a byte stream whether content is UTF-8 encoded

Reposted article. 2015-10-17 00:26. Tags: java, UTF-8

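Problem:

All we have is a piece of text content with no BOM. How can we tell whether that text was written out in UTF-8?

Approach:

Chinese characters encoded as UTF-8 take several bytes each, and the conversion from a Unicode code point to UTF-8 follows fixed rules. So can we decide whether content is UTF-8 by checking whether the byte stream contains byte sequences that satisfy those rules? The answer is yes, but not perfectly.

According to https://en.wikipedia.org/wiki/UTF-8, UTF-8 encodes Unicode code points using the rules below. If consecutive bytes in the stream satisfy these rules, can we conclude that the content is the result of UTF-8 encoding? (The table shows the original 1 to 6 byte design; since RFC 3629, valid UTF-8 stops at U+10FFFF and uses at most 4 bytes.)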

Bits of code point	First code point	Last code point	Bytes in sequence	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
7	U+0000	U+007F	1	`0xxxxxxx`
11	U+0080	U+07FF	2	`110xxxxx`	`10xxxxxx`
16	U+0800	U+FFFF	3	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
21	U+10000	U+1FFFFF	4	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
26	U+200000	U+3FFFFFF	5	`111110xx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
31	U+4000000	U+7FFFFFFF	6	`1111110x`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
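To make the bit layout in the table concrete, here is a minimal sketch that encodes a single code point by hand and prints the resulting bit patterns. The class name `Utf8EncodeSketch` and its helpers are hypothetical names chosen for illustration, not part of the original article. It assumes a code point in the range U+0000..U+10FFFF (the 1 to 4 byte forms) and does not reject surrogates.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8EncodeSketch {

    /** Encodes one code point (assumed U+0000..U+10FFFF) following the table's bit layout. */
    static byte[] encode(int cp) {
        if (cp < 0x80) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    /** Renders each byte as 8 binary digits, separated by spaces. */
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String s = Integer.toBinaryString(b & 0xFF);
            sb.append("00000000".substring(s.length())).append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x767E; // the character "百"
        byte[] manual = encode(cp);
        byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(toBits(manual));             // 11100111 10011001 10111110
        System.out.println(Arrays.equals(manual, jdk)); // true
    }
}
```

The lead byte's prefix (`1110` here) announces how many `10xxxxxx` continuation bytes follow, which is the property a detector can test for.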

Implementation and verification:

package cn.com.demo.util;

import java.io.UnsupportedEncodingException;

public class Utf8Util {
    
    /**
     * UTF-8 encoding rules
    Bits of code point    First code point    Last code point    Bytes in sequence    Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
    7    U+0000    U+007F    1    0xxxxxxx
    11    U+0080    U+07FF    2    110xxxxx    10xxxxxx
    16    U+0800    U+FFFF    3    1110xxxx    10xxxxxx    10xxxxxx
    21    U+10000    U+1FFFFF    4    11110xxx    10xxxxxx    10xxxxxx    10xxxxxx
    26    U+200000    U+3FFFFFF    5    111110xx    10xxxxxx    10xxxxxx    10xxxxxx    10xxxxxx
    31    U+4000000    U+7FFFFFFF    6    1111110x    10xxxxxx    10xxxxxx    10xxxxxx    10xxxxxx    10xxxxxx
    */

    public static boolean isUtf8(byte[] bytes) {
        if (bytes != null && bytes.length > 0) {
            boolean foundStartByte = false;
            int requireByte = 0;
            for (int i = 0; i < bytes.length; i++) {
                byte current = bytes[i];
                // 0xxxxxxx: value below 128, standard ASCII range
                if ((current & 0x80) == 0x00) {
                    if (foundStartByte) {
                        foundStartByte = false;
                        requireByte = 0;
                    }
                    continue;
                // 110xxxxx: start of a 2-byte sequence, 1 continuation byte (10xxxxxx) must follow
                } else if ((current & 0xE0) == 0xC0) {
                    foundStartByte = true;
                    requireByte = 1;
                // 1110xxxx: start of a 3-byte sequence, 2 continuation bytes must follow
                } else if ((current & 0xF0) == 0xE0) {
                    foundStartByte = true;
                    requireByte = 2;
                // 11110xxx: start of a 4-byte sequence, 3 continuation bytes must follow
                } else if ((current & 0xF8) == 0xF0) {
                    foundStartByte = true;
                    requireByte = 3;
                // 111110xx: start of a 5-byte sequence, 4 continuation bytes must follow
                } else if ((current & 0xFC) == 0xF8) {
                    foundStartByte = true;
                    requireByte = 4;
                // 1111110x: start of a 6-byte sequence, 5 continuation bytes must follow
                } else if ((current & 0xFE) == 0xFC) {
                    foundStartByte = true;
                    requireByte = 5;
                // 10xxxxxx: continuation byte, check whether it completes a sequence
                } else if ((current & 0xC0) == 0x80) {
                    if (foundStartByte) {
                        requireByte--;
                        // Enough continuation bytes seen: one complete UTF-8 sequence found, return immediately
                        if (requireByte == 0) {
                            return true;
                        }
                    // A continuation byte that does not follow a start byte cannot be UTF-8
                    } else {
                        return false;
                    }
                // 0xFE or 0xFF: these byte values never occur in UTF-8
                } else {
                    return false;
                }
            }
        }
        // No complete multi-byte sequence was found (for example empty or pure ASCII input)
        return false;
    }
    
    public static void main(String[] args) throws UnsupportedEncodingException {
        String str = "<a href=\"http://www.baidu.com\">百度一下</a>";
        System.out.println(Utf8Util.isUtf8(str.getBytes("utf-8"))); // prints true
        System.out.println(Utf8Util.isUtf8(str.getBytes("gbk")));   // prints false
    }

}

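Result:

This identifies UTF-8 reasonably well in most cases, but it has one problem: it returns as soon as the rule is satisfied once. An arbitrary byte stream (image data or anything else) may happen to contain such a sequence without actually being UTF-8 encoded text.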
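One way to reduce that risk is to require the entire stream to be well-formed instead of stopping at the first match. The sketch below is not the article's method; it swaps in the JDK's own UTF-8 decoder, configured to report errors rather than replace them. The class name `StrictUtf8Check` and the method `isValidUtf8` are hypothetical names used for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {

    /** Returns true only if every byte of the input decodes as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        if (bytes == null) {
            return false;
        }
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xE7, (byte) 0x99, (byte) 0xBE };              // one 3-byte sequence
        byte[] mixed = { (byte) 0xE7, (byte) 0x99, (byte) 0xBE, (byte) 0xFF }; // valid sequence, then garbage
        System.out.println(isValidUtf8(utf8));  // true
        System.out.println(isValidUtf8(mixed)); // false
    }
}
```

Two differences from the hand-written check are worth noting. Empty and pure ASCII input count as valid here, since ASCII is a subset of UTF-8. The decoder also rejects sequences that merely fit the bit pattern, such as the bytes C1 AA, which match `110xxxxx 10xxxxxx` but can only be an overlong encoding. Even so, a passing result means the bytes are well-formed UTF-8, not proof that the author intended them as UTF-8.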
