string and unicode

This article is a repost. 2017-09-15 10:30, python

In Python 2, a string (str) object is a sequence of characters, each of which is a single byte, whereas a unicode object is a sequence of Unicode code units.

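The characters in a string can be encoded in many ways: single-byte ASCII, double-byte GB2312, UTF-8, and so on. Clearly, a string can only be interpreted correctly once you know which encoding its characters use.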
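To make the point about encodings concrete, here is a minimal sketch that encodes the same two characters in two ways and then decodes the bytes with the right and the wrong codec. The variable names are illustrative only, and the text is written with escapes so the snippet does not depend on the source file's encoding.

```python
from __future__ import print_function

text = u'\u54c8\u54c8'  # the two characters used in the chardet example

utf8_bytes = text.encode('utf-8')  # 3 bytes per character
gbk_bytes = text.encode('gbk')     # 2 bytes per character
print(len(utf8_bytes), len(gbk_bytes))

# Decoding with the matching encoding recovers the original text.
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text

# Decoding with the wrong encoding can fail outright...
try:
    gbk_bytes.decode('utf-8')
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

# ...or quietly produce different characters (mojibake).
# errors='replace' keeps the sketch from raising on an unmapped byte pair.
mojibake = utf8_bytes.decode('gbk', 'replace')
print(wrong_decode_failed, mojibake != text)
```

The bytes themselves carry no label saying which encoding produced them, which is why the encoding has to be known or guessed before decoding.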

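So what is a Unicode code unit? A Unicode code unit is a 16-bit or 32-bit value. In Python 2, 16-bit code units correspond to a UCS-2 (narrow) build and 32-bit code units to a UCS-4 (wide) build. In a wide build each code unit represents one Unicode character; in a narrow build, characters above U+FFFF take two code units (a surrogate pair). Does that feel no different from the encoding of the characters in a string? My working impression for now is this: in Python 2, text held in UCS-2 or UCS-4 form is what we call a unicode object, and bytes in any other encoding (UTF-8, GBK and so on) are what we call a string. That is also why a string is decoded to get a unicode object, and a unicode object is encoded to get a string.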
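One way to see the code unit width in practice is to check sys.maxunicode and measure a character above U+FFFF. This is only a rough sketch: code_unit_width is a hypothetical helper name, and on Python 3.3 and later sys.maxunicode is always 0x10FFFF because the interpreter uses a flexible internal representation rather than a fixed UCS-4 one.

```python
from __future__ import print_function
import sys


def code_unit_width():
    """Return 16 on a narrow (UCS-2) build, 32 otherwise."""
    return 16 if sys.maxunicode == 0xFFFF else 32


# A code point above U+FFFF: one code unit on a wide build,
# two code units (a surrogate pair) on a narrow build.
astral = u'\U0001F600'
units = len(astral)

print(code_unit_width(), units)
```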

Using chardet to detect a string's encoding

Install: pip install chardet

# -*- coding:utf-8 -*-
import chardet

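a = '哈哈'    # str: bytes in the source file's encoding (UTF-8 here)
b = u'哈哈'   # unicode
print type(a)
print type(b)
print chardet.detect(a)
print chardet.detect(b.encode('gbk'))  # encode the unicode object; a.encode('gbk') would try an implicit ASCII decode first and fail
print chardet.detect(b)  # raises TypeError: detect() expects bytes, not unicode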

Output:

<type 'str'>
<type 'unicode'>
{'confidence': 0.75249999999999995, 'language': '', 'encoding': 'utf-8'}
{'confidence': 0.72999999999999998, 'language': '', 'encoding': 'ISO-8859-1'}
Traceback (most recent call last):
  File "C:\Users\admin\Desktop\ad.py", line 10, in <module>
    print chardet.detect(b)
  File "C:\Python26\lib\site-packages\chardet\__init__.py", line 34, in detect
    '{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <type 'unicode'>
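
The traceback above shows that chardet.detect only accepts byte strings, so a unicode object has to be encoded before it is passed in. A minimal sketch of one way to guard against this follows. Both to_bytes and detect_encoding are hypothetical helper names, not part of chardet, and encoding to UTF-8 by default is an assumption: detection on text you encoded yourself will simply tend to report the encoding you chose.

```python
def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding it first if it is text."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    # Text (unicode on Python 2, str on Python 3) must be encoded first.
    return value.encode(encoding)


def detect_encoding(value):
    """Run chardet.detect on value, converting text to bytes beforehand."""
    import chardet  # third-party; imported lazily so to_bytes works without it
    return chardet.detect(to_bytes(value))
```

With chardet installed, detect_encoding(b) should return a result dict instead of raising TypeError. Note also that the GBK bytes in the output above were reported as ISO-8859-1: detection on very short inputs is a guess and can be wrong.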
