The chardet library: detecting a file's encoding

Reposted article, 2017-11-09 21:44

chardet documentation

http://chardet.readthedocs.io/en/latest/usage.html

Detecting the encoding of small files

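The detect function takes a single argument, a non-Unicode string (a bytes object in Python 3), and returns a dictionary. The dictionary contains the detected encoding and the confidence of that guess.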

import chardet

# Open in binary mode: chardet works on raw bytes, not decoded text.
with open('test1.txt', 'rb') as f:
    result = chardet.detect(f.read())
print(result)

Result

{'encoding': 'utf-8', 'confidence': 0.99}

In other words, chardet is 99% confident that the file is UTF-8 encoded.
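The confidence value can be used to decide whether to trust the guess at all. The following is a minimal sketch of one way to do that. The helper names `pick_encoding` and `detect_with_fallback`, the 0.5 threshold, and the UTF-8 fallback are all assumptions made for illustration and are not part of chardet. The sketch relies only on the `encoding` and `confidence` keys shown above, and on the fact that `encoding` can be `None` when chardet cannot make a guess.

```python
def pick_encoding(result, min_confidence=0.5, fallback='utf-8'):
    """Return result['encoding'] if it looks trustworthy, else the fallback.

    `result` is a dict shaped like the return value of chardet.detect().
    The threshold and the fallback are arbitrary choices for this sketch.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


def detect_with_fallback(data, **kwargs):
    """Run chardet.detect() on bytes and apply pick_encoding() to the result."""
    import chardet  # imported here so pick_encoding() works without chardet
    return pick_encoding(chardet.detect(data), **kwargs)
```

For example, `pick_encoding({'encoding': 'utf-8', 'confidence': 0.99})` returns `'utf-8'`, while a result whose encoding is `None` or whose confidence is below the threshold yields the fallback.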

As a test, build a function that takes a file path, detects the encoding of any small file, and prints its contents:

import chardet

path1 = '/home/ifnd/下載/oracle1.asc'
path2 = '/home/ifnd/下載/oracle.asc'

def load_date(file_path):
    # Read the raw bytes first so chardet can guess the encoding.
    with open(file_path, 'rb') as f:
        str_cod = chardet.detect(f.read())['encoding']
    # Reopen in text mode with the detected encoding and return the contents.
    with open(file_path, 'r', encoding=str_cod) as f:
        return f.read()

print(load_date(path1))

Detecting the encoding of large files

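Some files are very large, and reading the whole file before detecting its encoding, as above, becomes very inefficient. Instead, read the data piece by piece and feed each piece to a detector. Once the detector has seen enough data to make a high-confidence guess, detector.done becomes True, and the file's encoding can be read from detector.result.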

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open('test2.txt', 'rb') as bigdata:
    # Iterate lazily; readlines() would load the whole file into memory.
    for line in bigdata:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)

Result

{'encoding': 'utf-8', 'confidence': 0.99}
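Feeding the detector line by line works for ordinary text, but a file with few or no newline bytes can produce a single very large "line". A possible variant, sketched below, reads fixed-size chunks instead so that memory use stays bounded. The function names `feed_in_chunks` and `detect_file_encoding` and the 64 KiB chunk size are assumptions chosen for illustration. The only chardet features used are the ones shown above: `feed()`, `done`, `close()`, and `result`.

```python
def feed_in_chunks(stream, detector, chunk_size=64 * 1024):
    """Feed a binary stream to a detector chunk by chunk; return its result.

    `detector` is expected to behave like chardet's UniversalDetector:
    feed(bytes), a boolean `done` attribute, close(), and `result`.
    """
    while not detector.done:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of file reached before the detector was sure
            break
        detector.feed(chunk)
    detector.close()
    return detector.result


def detect_file_encoding(path, chunk_size=64 * 1024):
    """Detect the encoding of the file at `path` without reading it all."""
    from chardet.universaldetector import UniversalDetector
    with open(path, 'rb') as f:
        return feed_in_chunks(f, UniversalDetector(), chunk_size)
```

Calling `detect_file_encoding('test2.txt')` would return the same kind of dictionary as the example above. Passing the detector in as a parameter keeps the reading loop separate from chardet itself, which makes the loop easy to exercise with a stand-in object.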

Detecting the encoding of multiple large files

To detect the encoding of several files, a single UniversalDetector object can be reused. Just call detector.reset() before starting each file; everything else is the same as above.

import os
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
dirpath = '/Users/suosuo/Desktop/Test'
# os.listdir returns the names of the entries in the directory.
for name in os.listdir(dirpath):
    # os.path.join picks the right separator on macOS, Linux, and Windows.
    path = os.path.join(dirpath, name)
    detector.reset()
    with open(path, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)

Output

{'encoding': 'utf-8', 'confidence': 0.99}
{'encoding': 'gb2312', 'confidence': 0.99}
......
{'encoding': 'utf-8', 'confidence': 0.99}
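The output above does not say which file each result belongs to, and the loop fails if the directory contains a subdirectory. The sketch below is one way to address both points while still reusing a single detector. The names `iter_files` and `detect_many` are hypothetical helpers introduced for illustration; the detector is passed in as a parameter and is assumed to offer the `reset()`, `feed()`, `done`, `close()`, and `result` members used earlier.

```python
import os


def iter_files(dirpath):
    """Yield paths of regular files directly inside dirpath, sorted by name."""
    for name in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, name)
        if os.path.isfile(path):  # skip subdirectories and other non-files
            yield path


def detect_many(paths, detector):
    """Return a {path: encoding} dict using one reusable detector."""
    encodings = {}
    for path in paths:
        detector.reset()  # forget everything learned from the previous file
        with open(path, 'rb') as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        encodings[path] = detector.result['encoding']
    return encodings
```

With chardet installed, `detect_many(iter_files('/Users/suosuo/Desktop/Test'), UniversalDetector())` would map each file path in that directory to its detected encoding.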
