string and unicode

Reposted article, 2017-09-15 10:30, tagged python

A string object is a sequence of characters (in Python 2, effectively a sequence of bytes), whereas a unicode object is a sequence of Unicode code units.

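The characters in a string can be encoded in many ways: single-byte ASCII, double-byte GB2312, variable-length UTF-8, and so on. Clearly, to interpret a string you must first know which encoding its characters use.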
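As a rough illustration of why the encoding matters, the sketch below encodes the same two characters (the ones used in the chardet example later in this article, written here as escapes) with two codecs and compares the resulting bytes. The variable names are made up for this example, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The same two-character text used in the chardet example, as escapes.
text = u'\u54c8\u54c8'

utf8_bytes = text.encode('utf-8')  # 3 bytes per character for this text
gbk_bytes = text.encode('gbk')     # 2 bytes per character for this text

print(len(utf8_bytes), len(gbk_bytes))  # 6 4

# Decoding with the matching codec recovers the original text.
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text

# Decoding with the wrong codec fails (or, for other inputs, may
# silently produce different text).
try:
    wrong = gbk_bytes.decode('utf-8')
except UnicodeDecodeError:
    wrong = None
print(wrong == text)  # False
```

The bytes alone do not say which codec produced them, which is why a guessing tool such as chardet is sometimes needed.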

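So what is a Unicode code unit? It is a 16-bit or 32-bit value, and each value generally represents one Unicode symbol (on 16-bit builds, symbols outside the Basic Multilingual Plane take two code units). In Python 2, the 16-bit form corresponds to UCS-2 and the 32-bit form to UCS-4, depending on how the interpreter was built. Doesn't that sound much like the encodings used for the characters in a string? My working impression is this: in Python, data held as UCS-2 or UCS-4 is called a unicode object (perhaps that is why people say unicode can also be "decoded"?), and data in any other encoding (UTF-8, GBK, and so on) is called a string. Strictly speaking, a string is decoded to get a unicode object, and a unicode object is encoded to get a string.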
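To make the idea of code units a little more concrete, here is a minimal sketch that counts how many 16-bit and 32-bit units a character needs. The helper names utf16_code_units and utf32_code_units are hypothetical, introduced only for this example; it uses the UTF-16 and UTF-32 codecs as stand-ins for the interpreter's internal storage, which is an approximation rather than a look inside Python itself.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys


def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode('utf-16-le')) // 2


def utf32_code_units(text):
    """Number of 32-bit code units needed to store text as UTF-32."""
    return len(text.encode('utf-32-le')) // 4


bmp_char = u'\u54c8'         # inside the Basic Multilingual Plane
astral_char = u'\U0001F600'  # outside the Basic Multilingual Plane

print(utf16_code_units(bmp_char), utf32_code_units(bmp_char))
print(utf16_code_units(astral_char), utf32_code_units(astral_char))

# 0xffff on a 16-bit (narrow) Python 2 build; 0x10ffff on 32-bit (wide)
# Python 2 builds and on Python 3.
print(hex(sys.maxunicode))
```

A character outside the Basic Multilingual Plane needs two 16-bit units but only one 32-bit unit, so "one code unit per symbol" holds only for the 32-bit form.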

Using chardet to detect a string's encoding

Install: pip install chardet

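# -*- coding:utf-8 -*-
import chardet

a = '哈哈'   # str: the raw UTF-8 bytes from this source file (Python 2)
b = u'哈哈'  # unicode object
print type(a)
print type(b)
print chardet.detect(a)
print chardet.detect(b.encode('gbk'))  # GBK bytes; a.encode('gbk') would first attempt an implicit ASCII decode and fail
print chardet.detect(b)  # raises TypeError: detect() expects bytes, not unicode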

Output:

<type 'str'>
<type 'unicode'>
{'confidence': 0.75249999999999995, 'language': '', 'encoding': 'utf-8'}
{'confidence': 0.72999999999999998, 'language': '', 'encoding': 'ISO-8859-1'}
Traceback (most recent call last):
  File "C:\Users\admin\Desktop\ad.py", line 10, in <module>
    print chardet.detect(b)
  File "C:\Python26\lib\site-packages\chardet\__init__.py", line 34, in detect
    '{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <type 'unicode'>
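The traceback shows that chardet.detect only accepts byte strings. One possible way to guard against this is sketched below: check the type first and skip detection for text that is already decoded. The names detect_bytes_only and fake_detector are hypothetical, introduced only for this example; fake_detector stands in for chardet.detect so that the sketch runs without chardet installed.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def detect_bytes_only(value, detector):
    """Call detector(value) for byte strings; return None for decoded text.

    detector is any callable that takes a byte string, such as
    chardet.detect.
    """
    if isinstance(value, (bytes, bytearray)):
        return detector(value)
    # Already a unicode object: there is no byte encoding left to guess.
    return None


def fake_detector(byte_str):
    # Hypothetical stand-in for chardet.detect, used only for this sketch.
    return {'encoding': 'unknown', 'length': len(byte_str)}


print(detect_bytes_only(u'\u54c8\u54c8'.encode('utf-8'), fake_detector))
print(detect_bytes_only(u'\u54c8\u54c8', fake_detector))  # None
```

In real use, chardet.detect would be passed as the detector argument instead of fake_detector.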
