The chardet library: detecting a file's encoding

This article is a repost; originally posted 2017-11-09 21:44.

chardet documentation

http://chardet.readthedocs.io/en/latest/usage.html

Detecting the encoding of a small file

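The detect function takes a single argument, a non-Unicode string (a bytes object in Python 3), and returns a dictionary containing the detected encoding and the confidence of that guess.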

import chardet

with open('test1.txt', 'rb') as f:
    result = chardet.detect(f.read())
print(result)

Result:

{'encoding': 'utf-8', 'confidence': 0.99}

In other words, chardet is 99% confident that the file is UTF-8 encoded.
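The result dictionary is easiest to use when the confidence is checked before the encoding is trusted. The following is a minimal sketch and not part of chardet: `pick_encoding`, `min_confidence` and `fallback` are hypothetical names, and the 0.5 threshold is an arbitrary assumption. It relies only on the `encoding` and `confidence` keys shown above; note that chardet may report `None` as the encoding when it cannot make a guess.

```python
def pick_encoding(result, min_confidence=0.5, fallback='utf-8'):
    """Return result['encoding'] if it looks trustworthy, else the fallback."""
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    # No guess at all, or a guess below the (assumed) threshold: use the fallback.
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# The result shown above is confident enough to be used directly.
print(pick_encoding({'encoding': 'utf-8', 'confidence': 0.99}))
# A result with no guess falls back to the caller's default.
print(pick_encoding({'encoding': None, 'confidence': 0.0}, fallback='latin-1'))
```

The right threshold depends on the data; short inputs in particular tend to give low confidence values, so the default here should be treated as a starting point only.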

A small test helper: given a file path, it detects the encoding of any small file and prints the decoded contents:

import chardet

path1 = '/home/ifnd/下载/oracle1.asc'
path2 = '/home/ifnd/下载/oracle.asc'

def load_data(file_path):
    # Read the raw bytes once and let chardet guess the encoding.
    with open(file_path, 'rb') as f:
        encoding = chardet.detect(f.read())['encoding']
    # Reopen in text mode using the detected encoding.
    with open(file_path, 'r', encoding=encoding) as f:
        return f.read()

print(load_data(path1))

Detecting the encoding of a large file

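Some files are very large, and reading the whole file before detecting its encoding, as in the method above, is very inefficient. Instead, read the data piece by piece and feed each piece to the detector. Once the detector has been fed enough data to make a high-confidence guess, detector.done becomes True, and the file's encoding can then be retrieved.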

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open('test2.txt', 'rb') as bigdata:
    # Iterate lazily; readlines() would load the whole file into memory.
    for line in bigdata:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)

Result:

{'encoding': 'utf-8', 'confidence': 0.99}
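Feeding line by line works well for text with regular line breaks, but a file with few or no newline bytes would arrive as one very large line. One possible variation is to feed fixed-size blocks instead. This is a sketch under assumptions: `feed_in_chunks` and `chunk_size` are hypothetical names, the 64 KiB default is arbitrary, and the `detector` argument is only expected to behave like a UniversalDetector (`feed`, `done`, `close`, `result`).

```python
def feed_in_chunks(fileobj, detector, chunk_size=64 * 1024):
    """Feed a binary file object to a detector in fixed-size blocks."""
    while not detector.done:
        chunk = fileobj.read(chunk_size)
        if not chunk:  # end of file reached before the detector was sure
            break
        detector.feed(chunk)
    detector.close()  # finalizes the guess, as in the example above
    return detector.result


# Usage with chardet (assumes the package is installed):
# from chardet.universaldetector import UniversalDetector
# with open('test2.txt', 'rb') as f:
#     print(feed_in_chunks(f, UniversalDetector()))
```

Because the loop stops as soon as `detector.done` is set, at most one extra block is read beyond what the detector needed.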

Detecting the encoding of multiple large files

To detect the encoding of several files, a single UniversalDetector object can be reused. Just call detector.reset() before processing each file; everything else is the same as above.

import os
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
directory = '/Users/suosuo/Desktop/Test'
for name in os.listdir(directory):
    # os.path.join picks the right separator on macOS, Linux and Windows.
    path = os.path.join(directory, name)
    detector.reset()
    with open(path, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)

Output:

{'encoding': 'utf-8', 'confidence': 0.99}
{'encoding': 'gb2312', 'confidence': 0.99}
......
{'encoding': 'utf-8', 'confidence': 0.99}
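When many files are scanned, it can help to keep each result instead of printing it, and then group the file names by encoding. Here is a minimal sketch, assuming the results were collected into a dictionary that maps file name to chardet result; `group_by_encoding` and the sample file names are hypothetical.

```python
from collections import defaultdict


def group_by_encoding(results):
    """Map each detected encoding to the sorted list of files that use it."""
    groups = defaultdict(list)
    for name, result in results.items():
        # chardet may report None when it cannot guess; label those 'unknown'.
        groups[result['encoding'] or 'unknown'].append(name)
    return {encoding: sorted(names) for encoding, names in groups.items()}


# Hypothetical results, shaped like the output above.
sample = {
    'a.txt': {'encoding': 'utf-8', 'confidence': 0.99},
    'b.txt': {'encoding': 'gb2312', 'confidence': 0.99},
    'c.txt': {'encoding': 'utf-8', 'confidence': 0.99},
}
print(group_by_encoding(sample))
```

A grouping like this makes it easy to spot the few files in a directory that do not share the dominant encoding.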
