1 Module Introduction

One of the biggest changes in Python 3 is the removal of the separate unicode type. Python 2 had both a str type and a unicode type, for example:

Python 2.7.6 (default, Oct 26 2016, 20:30:19) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "blah"
>>> type(x)
<type 'str'>
>>> y = u"blah"
>>> type(y)
<type 'unicode'>

If we enter the same code in Python 3, both values come back as the same string type.

Python 3.4.3 (default, Nov 17 2016, 01:08:31) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "blah"
>>> type(x)
<class 'str'>
>>> y = u"blah"
>>> type(y)
<class 'str'>
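
As a small sketch of what this means in practice (the variable names here are illustrative, not from the original session), the u prefix in Python 3 is accepted only for compatibility, and byte data lives in a separate bytes type:

```python
# Python 3: the u"" prefix is accepted for compatibility but changes nothing.
text = "blah"
also_text = u"blah"
raw = b"blah"   # a bytes literal: a sequence of integers in range 0-255

print(type(text) is type(also_text))   # True: both are str
print(type(raw).__name__)              # bytes
print(text == also_text)               # True
print(text == raw)                     # False: str and bytes never compare equal
```

The text/bytes split is what replaces the old str/unicode pair: text is always str, and anything read from disk or the network starts out as bytes.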

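Python 3 defaults to UTF-8 as its source encoding. This means you can use Unicode characters in strings and in variable names. Let's see how this works in practice.

In Python 2, entering the following code, which uses Unicode characters in a variable name, unsurprisingly raises a SyntaxError.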

>>> 中國 = "china"
  File "<stdin>", line 1
    中國 = "china"
    ^
SyntaxError: invalid syntax

Entering the same code in Python 3 and then echoing the variable to the console shows that Unicode variable names work fine in Python 3.

>>> 中國 = "china"
>>> 中國
'china'
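
If you want to check whether a given string would be accepted as a Python 3 name, one option is str.isidentifier. This is a minimal sketch; the namespace dictionary is just a helper introduced for the illustration:

```python
# Python 3
name = "中國"
print(name.isidentifier())      # True: a valid identifier in Python 3
print("2abc".isidentifier())    # False: identifiers cannot start with a digit

# Run the assignment from the session above in an isolated namespace.
namespace = {}
exec('中國 = "china"', namespace)
print(namespace["中國"])        # china
```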

In Python 2, I often ran into baffling encoding problems when reading a file or web page that was not ASCII-encoded. You might see output like the following example:

#Python 2
>>> "abcdef" + chr(255)
'abcdef\xff'

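You will notice something odd at the end of the string. It should be the character ÿ rather than \xff (\xff is the hexadecimal escape for that byte). In Python 3, you get the output you expect: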

#Python 3
>>> "abcdef" + chr(255)
'abcdefÿ'
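
The difference comes from what chr returns. The following sketch (variable names are illustrative) shows that in Python 3 chr(255) is a one-character string, while the closest equivalent of the Python 2 result is a bytes object built explicitly:

```python
# Python 3
ch = chr(255)
print(ch)               # ÿ
print(ord(ch))          # 255
print(hex(ord(ch)))     # 0xff

# The Python 2 result was a raw byte; in Python 3 that is spelled explicitly.
raw = bytes([255])
print(raw)              # b'\xff'

# As UTF-8, the single character ÿ takes two bytes.
print(len(ch.encode("utf-8")))   # 2
```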

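In Python 2 I used to try to fix this by calling the built-in unicode function, which converts a string to Unicode. What goes wrong in the code below? The unicode function decodes with the ASCII codec by default, and the byte 0xff is outside the ASCII range.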

#Python 2
>>> unicode('abcdef' + chr(255))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: ordinal not in range(128)
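
For comparison, here is a sketch of the same situation in Python 3, where the bytes must be decoded explicitly. It assumes the byte 0xff was meant as Latin-1 text, which is a guess made for the illustration; real data needs whatever encoding it was actually written in:

```python
# Python 3
data = b"abcdef" + bytes([255])

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    # The exception records where decoding failed and why.
    print(exc.start)     # 6
    print(exc.reason)    # ordinal not in range(128)

# Assumption: the data is Latin-1, where byte 0xff maps to ÿ.
text = data.decode("latin-1")
print(text)              # abcdefÿ
```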

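UnicodeDecodeError is arguably the biggest headache in Python 2. I have spent many hours on some projects tracking it down, and I look forward to not dealing with it in Python 3. The Python Package Index (PyPI) has a library called Unidecode that can take most Unicode characters and convert them to ASCII. I have used it to work around specific problems with input.

2 Encoding / Decoding

In Python 2 you quickly learn that you can neither decode a unicode string nor encode a str. The error names look swapped because of Python 2's implicit conversions: calling decode on a unicode object first encodes it to ASCII, and calling encode on a str first decodes it from ASCII. So if you try to decode a unicode string to ASCII (that is, convert it to a byte string), you get a UnicodeEncodeError, as shown below: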
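
When installing a third-party package is not an option, a rough approximation is possible with the standard library alone. To be clear, this is not how Unidecode works and it covers far fewer characters: it only strips accents that Unicode can decompose, and it silently drops everything else. The function name to_ascii is made up for this sketch:

```python
# Python 3
import unicodedata


def to_ascii(text):
    """Crude ASCII conversion: split accented letters, drop what is left over."""
    # NFKD separates base letters from their combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" discards every remaining non-ASCII code point.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("café naïve"))   # cafe naive
print(repr(to_ascii("中國")))    # '' (nothing survives, unlike with Unidecode)
```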

# Python 2
# Decoding
>>> u"\xa0".decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
# Encoding
>>> "\xa0".encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

If you enter the same code in Python 3, you get an AttributeError instead:

# Python 3
>>> u"\xa0".decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

The reason is that strings in Python 3 have no decode method. Byte strings do have one, so let's use a byte string instead:

# Python 3
>>> b"\xa0".decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

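But the ASCII codec still doesn't know how to handle the byte we passed in. Fortunately, you can pass an extra argument that specifies how decoding errors are handled, as shown below: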

# Python 3
>>> b"\xa0".decode("ascii","replace")
'�'
>>> b"\xa0".decode("ascii","ignore")
''

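When we tell the decoder to either replace the offending byte or ignore it, we get a decoded result instead of an exception.

Let's work through an example from the official Python documentation to learn how to encode a string.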
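
To compare the error handlers side by side, here is a small sketch on a slightly longer byte string. The sample bytes are invented for the illustration, and the backslashreplace handler works for decoding only on Python 3.5 or later:

```python
# Python 3 (backslashreplace for decoding needs 3.5+)
data = b"caf\xe9 \xa0ok"   # made-up sample containing two non-ASCII bytes

results = {}
for handler in ("replace", "ignore", "backslashreplace"):
    results[handler] = data.decode("ascii", handler)
    print(handler, ascii(results[handler]))

# replace          -> each bad byte becomes U+FFFD, the replacement character
# ignore           -> bad bytes are silently dropped
# backslashreplace -> bad bytes are kept as visible \xNN escapes
```

Of the three, ignore loses data without leaving a trace, so replace or backslashreplace is usually easier to debug later.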

# Python 3
>>> u = chr(40960) + "abcd" + chr(1972)
>>> u
'ꀀabcd\u07b4'
>>> u.encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode("ascii","ignore")
b'abcd'
>>> u.encode("ascii","replace")
b'?abcd?'

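In this example we define a string with a non-ASCII character added at the beginning and at the end. We then call encode to try to convert the string into its bytes representation. The first attempt fails and raises an error. The next attempt uses the ignore flag, which drops every non-ASCII character from the string. The last attempt uses the replace flag, which replaces every character that cannot be encoded with a question mark.

If you have a lot of encoding-related work to do, Python also provides the codecs module, which is worth a look.

Summary

At this point you should have a good understanding of how to work with Unicode. Unicode lets your application support other languages, both in the code itself and in its output. You have also had a first look at encoding and decoding strings in Python. The official Python documentation covers this topic in depth; consult it if you need to know more.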
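
Besides ignore and replace, encode accepts other standard error handlers that keep the information instead of discarding it. The sketch below reuses the same string; the variable names xml_refs, escaped and encoded are introduced only for the illustration:

```python
# Python 3
u = chr(40960) + "abcd" + chr(1972)

# Keep the characters as XML/HTML numeric character references.
xml_refs = u.encode("ascii", "xmlcharrefreplace")
print(xml_refs)      # b'&#40960;abcd&#1972;'

# Keep them as Python-style escape sequences.
escaped = u.encode("ascii", "backslashreplace")
print(escaped)       # b'\\ua000abcd\\u07b4'

# Or sidestep the problem: UTF-8 can represent every character in the string.
encoded = u.encode("utf-8")
print(encoded)                        # b'\xea\x80\x80abcd\xde\xb4'
print(encoded.decode("utf-8") == u)   # True: the round trip is lossless
```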
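
As one small taste of codecs, here is a sketch of incremental decoding, which helps when bytes arrive in chunks that may split a multi-byte character. The chunk boundaries are invented for the illustration:

```python
# Python 3
import codecs

# Look up a codec by any of its aliases; the canonical name comes back.
print(codecs.lookup("utf8").name)    # utf-8

# An incremental decoder buffers incomplete byte sequences between calls.
decoder = codecs.getincrementaldecoder("utf-8")()
chunks = [b"\xea\x80", b"\x80abcd", b"\xde", b"\xb4"]   # split mid-character
text = "".join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b"", final=True)   # flush; raises if bytes are left over

print(text == chr(40960) + "abcd" + chr(1972))   # True
```

Calling bytes.decode on each chunk separately would fail here, because the first chunk ends in the middle of a three-byte sequence.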

3 Reference

Python 201
