1. Detecting the encoding with chardet

import chardet

# Read the raw bytes; chardet works on bytes, not on decoded text
with open('a.txt', 'rb') as f:
    text = f.read()
info = chardet.detect(text)
print(info)

{'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
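
A confidence of 1.0 for UTF-16 is what one would expect when the file starts with a byte order mark (BOM). As a rough illustration using only the standard library, a BOM check can look like the sketch below. This is a simplified stand-in and not chardet's actual implementation; `sniff_bom` is a hypothetical helper name.

```python
import codecs

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    # UTF-32 is checked first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32"
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-SIG"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    return None

# Python's "utf-16" codec prepends a BOM when encoding.
sample = "1.新出".encode("utf-16")
bom_encoding = sniff_bom(sample)
print(bom_encoding)
```

Files without a BOM return None here, which is where a statistical detector such as chardet is still needed.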

2. Reading the text by encoding, then decoding

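# Open the file with the UTF-16 encoding that chardet reported
with open('a.txt', encoding='UTF-16') as f:
    text = f.read()
# str -> UTF-8 bytes -> decode the \uXXXX escapes into characters
print(text.encode("utf-8").decode("unicode_escape"))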

'1.新出吐魯番文書及其研究'

Encoding the string to bytes and then decoding it with unicode_escape yields the Chinese text.
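
This round trip only works when the text holds literal \uXXXX escape sequences rather than the characters themselves. The self-contained sketch below reproduces the behavior with a temporary file; the file name and the sample string are made up for illustration and are not taken from the original a.txt.

```python
import os
import tempfile

# Literal backslash-u escapes (raw string), standing in for the file's content.
escaped = r"1.\u65b0\u51fa"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "w", encoding="UTF-16") as f:
        f.write(escaped)
    with open(path, encoding="UTF-16") as f:
        text = f.read()  # a str that still contains the escaped form

# str -> bytes, then interpret the escape sequences while decoding.
decoded = text.encode("utf-8").decode("unicode_escape")
print(decoded)

# Real (unescaped) Chinese characters do not survive the same round trip:
# unicode_escape maps each non-ASCII byte to a Latin-1 character.
mangled = "新".encode("utf-8").decode("unicode_escape")
print(mangled == "新")
```

If the file already contains real characters, reading it with the correct `encoding=` is enough and the extra encode/decode step should be skipped.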

3. Unicode handling in BERT

import six
def convert_to_unicode(text):
    """
    Converts `text` to Unicode (if it's not already).
    The upstream docstring assumes UTF-8 input; this variant decodes
    bytes with unicode_escape instead.
    """
    # six_ensure_text is copied from https://github.com/benjaminp/six
    def six_ensure_text(s, encoding="unicode_escape", errors="strict"):
        if isinstance(s, six.binary_type):
            print('true')  # debug output: the bytes branch was taken
            return s.decode(encoding, errors)  # bytes: decode with the given encoding
        elif isinstance(s, six.text_type):  # already text: return it unchanged
            return s
        else:
            raise TypeError("not expecting type '%s'" % type(s))

    return six_ensure_text(text, encoding="unicode_escape", errors="ignore")

with open('a.txt', encoding='UTF-16') as f:
    text = f.read()
print(convert_to_unicode(text.encode("utf-8")))

true
1.新出吐魯番文書及其研究

Note:

>>> type(text.encode("utf-8"))  # after encode(), the result is of type bytes
<class 'bytes'>

>>> type(text)  # encoding= in open() is the file's encoding; the text read is a str
<class 'str'>

https://six.readthedocs.io/

The binary type above (six.binary_type) is simply the bytes type in Python 3.
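
Because six.binary_type is bytes and six.text_type is str on Python 3, a Python 3-only version of the helper does not need six at all. A minimal sketch, assuming Python 3 only; the name `to_text` is hypothetical and is not part of BERT or six.

```python
def to_text(s, encoding="unicode_escape", errors="ignore"):
    """Return `s` as str, decoding it first if it is bytes."""
    if isinstance(s, bytes):
        # bytes: decode with the given encoding
        return s.decode(encoding, errors)
    if isinstance(s, str):
        # already text: return it unchanged
        return s
    raise TypeError("not expecting type '%s'" % type(s))

# Bytes holding literal \uXXXX escapes are decoded into characters.
print(to_text(rb"1.\u65b0\u51fa"))
# A str passes through untouched.
print(to_text("already text"))
```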
