Using Python to check whether a file is encoded as UTF-8 without a BOM

This article is a repost. 2015-11-09 20:56

First, some background:

1. BOM: Byte Order Mark

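The BOM signature tells an editor which encoding the current file uses, which makes the encoding easy to recognize. Although editors do not display the BOM, it is still part of the file and shows up in output, where it can look like an extra blank line.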

  Byte order mark   Encoding
  EF BB BF          UTF-8
  FF FE             UTF-16 / UCS-2, little-endian
  FE FF             UTF-16 / UCS-2, big-endian
  FF FE 00 00       UTF-32 / UCS-4, little-endian
  00 00 FE FF       UTF-32 / UCS-4, big-endian

So for UTF-8, checking whether the file starts with the bytes EF BB BF is enough to tell whether it has a BOM.
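As a rough sketch of the signature table above, the BOM constants in Python's standard codecs module can be used to name the signature at the start of a byte string. The helper name detect_bom and the sample inputs are assumptions made for illustration; they are not part of the original code. The UTF-32 entries are tested first because the UTF-32 little-endian signature begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(head):
    """Return the name of the BOM at the start of `head` (bytes), or None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    print(detect_bom(b"\xef\xbb\xbfhello"))  # file content with a UTF-8 BOM
    print(detect_bom(b"hello"))              # plain ASCII, no BOM
```

Note that this is only a heuristic: a UTF-16 little-endian file whose first character is U+0000 would be reported as UTF-32 LE.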

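2. Next, the UTF-8 encoding format itself. UTF-8 is a variable-length encoding: each character takes between one and four bytes, and the one-byte form is compatible with ASCII.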

Unicode (U+)	UTF-8
U+00000000 - U+0000007F:	0xxxxxxx
U+00000080 - U+000007FF:	110xxxxx 10xxxxxx
U+00000800 - U+0000FFFF:	1110xxxx 10xxxxxx 10xxxxxx
U+00010000 - U+0010FFFF:	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

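In other words, UTF-8 defines a byte layout on top of the Unicode code points. Because every byte starts with a fixed bit pattern, we can check those leading bits to judge whether data is UTF-8 encoded.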
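To make the bit patterns concrete, here is a minimal sketch that encodes a single code point by hand and compares the result with Python's built-in encoder. It assumes Python 3; the function name encode_code_point and the sample characters are illustrative choices, and the sketch does not validate its input (surrogates and values above U+10FFFF are not rejected).

```python
def encode_code_point(cp):
    """Encode one Unicode code point as UTF-8 bytes, following the table."""
    if cp < 0x80:       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


if __name__ == "__main__":
    # One sample per row of the table: 1, 2, 3 and 4 bytes.
    for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
        encoded = encode_code_point(ord(ch))
        # The last column shows whether the built-in encoder agrees.
        print("U+%04X" % ord(ch), encoded.hex(), encoded == ch.encode("utf-8"))
```

Every continuation byte carries six payload bits behind the fixed 10 prefix, which is exactly what the detection code relies on.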

OK, that covers the basics. Now let's look at the implementation.

#!/usr/bin/env python
# coding: utf-8
import sys
import codecs


def detectUTF8(file_name):
    # Read raw bytes; text mode would decode the data before we can inspect it.
    with open(file_name, 'rb') as file_obj:
        all_lines = file_obj.readlines()
    state = 0  # number of continuation bytes still expected
    line_num = 0
    for line in all_lines:
        line_num += 1
        # bytearray yields integers on both Python 2 and Python 3
        for byte in bytearray(line):
            if state == 0:
                if (byte & 0x80) == 0x00:    # first row of the table above
                    state = 0
                elif (byte & 0xE0) == 0xC0:  # second row
                    state = 1
                elif (byte & 0xF0) == 0xE0:  # third row
                    state = 2
                elif (byte & 0xF8) == 0xF0:  # fourth row
                    state = 3
                else:
                    print("%s isn't a utf8 file, line:\t%d" % (file_name, line_num))
                    sys.exit(1)
            else:
                if (byte & 0xC0) != 0x80:    # not a 10xxxxxx continuation byte
                    print("%s isn't a utf8 file in line:\t%d" % (file_name, line_num))
                    sys.exit(1)
                state -= 1
    if state != 0:  # the file ends in the middle of a multi-byte sequence
        print("%s isn't a utf8 file, truncated sequence at end of file" % file_name)
        sys.exit(1)
    if existBOM(file_name):
        print("%s isn't a standard utf8 file, include BOM header." % file_name)
        sys.exit(1)


def existBOM(file_name):
    with open(file_name, 'rb') as file_obj:
        code = file_obj.read(3)
    return code == codecs.BOM_UTF8  # does the file start with EF BB BF?


if __name__ == "__main__":
    file_name = 'code.txt'
    detectUTF8(file_name)
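
The state machine above only looks at the leading bits of each byte, so it accepts some sequences that a strict decoder rejects, for example the overlong form C0 80 or an encoded surrogate such as ED A0 80. As a cross-check, here is a minimal sketch that relies on Python's built-in strict UTF-8 decoder instead. It assumes Python 3 and a file small enough to read into memory; the names is_utf8_without_bom and file_is_utf8_without_bom are made up for this example.

```python
import codecs


def is_utf8_without_bom(data):
    """Return True if `data` (bytes) is valid UTF-8 and has no leading BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return False
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8_without_bom(file_name):
    # Binary mode, so Python does not decode the bytes before we check them.
    with open(file_name, "rb") as file_obj:
        return is_utf8_without_bom(file_obj.read())


if __name__ == "__main__":
    print(is_utf8_without_bom("\u4e2d\u6587".encode("utf-8")))  # valid UTF-8 text
    print(is_utf8_without_bom(b"\xef\xbb\xbfhello"))            # starts with a BOM
    print(is_utf8_without_bom("\u4e2d\u6587".encode("gbk")))    # GBK bytes
```

Unlike the hand-written version, this sketch returns a boolean rather than exiting, which makes it easier to reuse from other scripts.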

OK, that's about it. Once you are familiar with the encoding format, the Python implementation is not hard.

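PS: Dealing with encodings in Python is really painful, and the behavior differs between versions. Importing other modules can also lead to encoding problems...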
