Chinese encoding problems in Python lists

Reposted article. 2017-12-08 00:14, tag: python

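In Python 2, printing a list that contains Chinese text sometimes shows escape sequences such as [u'\u4e2d\u6587'] instead of the characters themselves.

The reason is as follows.

When print is given an object, it outputs the form designed for people (end users) to read, so escape sequences in a string are rendered as the actual characters and the quotes that mark the string boundaries are omitted.

Printing a single item of the list, for example list[0], therefore displays the Chinese characters correctly. A list object, however, is a data structure: when it is converted for display it formats each element with repr() rather than str(), so it does not polish the data inside for end users.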
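The difference between printing an item and printing the list that contains it can be reproduced with a small sketch. The Item class below is hypothetical and exists only to reveal which method each path calls; the snippet is written so that it should behave the same on Python 2.7 and Python 3.

```python
class Item(object):
    """Hypothetical class, used only to reveal which method gets called."""

    def __str__(self):
        # The end-user form, used when the object itself is printed.
        return "friendly"

    def __repr__(self):
        # The unambiguous form, used by containers for their elements.
        return "Item(debug)"


item = Item()
single = str(item)      # the text print produces for the item alone
in_list = str([item])   # the text print produces for a list holding it

print(single)
print(in_list)
```

A Python 2 unicode string plays the role of Item here: its repr() is the escaped form, and that is what the list shows.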

For solutions, see: https://www.zhihu.com/question/20413029

This raises some further questions:

1. When we define a string, what does the u in u"中文" mean?

string = u"中文"
string.decode('utf8')

Running this raises an exception:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-41-b3abdaf47d60> in <module>()
      1 string = u"中文"
----> 2 string.decode('utf8')

C:\ProgramData\Anaconda2\lib\encodings\utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

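This shows that string is not a UTF-8 encoded byte string. It is a unicode object, so Python 2 first has to convert it to str implicitly with the default ASCII codec before it can decode anything, and that implicit step is what fails with UnicodeEncodeError.

I used to think that u stood for the UTF-8 encoding, but it does not.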
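A minimal sketch of what the u prefix actually produces, assuming a UTF-8 source file. The variable names are illustrative, and the snippet should run unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u"中文"                      # a unicode object: a sequence of code points
code_points = [hex(ord(ch)) for ch in text]
utf8_bytes = text.encode("utf-8")   # encoding is what turns it into UTF-8 bytes

print(len(text))          # number of characters
print(code_points)        # the code point of each character
print(len(utf8_bytes))    # number of bytes after encoding
```

The text has two characters but six UTF-8 bytes, which is why a unicode object and its UTF-8 encoding are not the same thing.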

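2. What is the difference between # -*- coding: utf-8 -*- and sys.setdefaultencoding("utf-8")?

# -*- coding: utf-8 -*-: applies to the source file. It tells the interpreter which encoding the source code is written in; without it, Python 2 assumes ASCII and the source cannot contain Chinese characters. See https://www.python.org/dev/peps/pep-0263/

sys.setdefaultencoding("utf-8"): sets the default encoding that Python 2 uses at run time when it converts implicitly between str and unicode.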
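To illustrate that the coding declaration only tells the parser how to read the bytes of the source file, the sketch below compiles the same two-line program from two differently encoded byte strings. The in-memory sources and the names run_source, utf8_source, and gbk_source are hypothetical stand-ins for real files on disk.

```python
# -*- coding: utf-8 -*-

def run_source(source_bytes):
    """Compile and run source bytes, returning the resulting namespace."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace


program = u'# -*- coding: {0} -*-\nvalue = u"中文"\n'

# The same program "saved" in two different file encodings.
utf8_source = program.format("utf-8").encode("utf-8")
gbk_source = program.format("gbk").encode("gbk")

utf8_value = run_source(utf8_source)["value"]
gbk_value = run_source(gbk_source)["value"]

print(utf8_source == gbk_source)   # whether the bytes on disk are identical
print(utf8_value == gbk_value)     # whether the decoded text is identical
```

The declaration affects only how the file is read. It has no effect on run-time conversions, which is the separate job of the default encoding.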

3. Specifying the codec explicitly with decode/encode

# -*- coding: utf-8 -*-
import sys 
# Python 2 removes sys.setdefaultencoding after startup (site.py deletes it), so sys has to be reloaded to get it back
reload(sys)
sys.setdefaultencoding('utf-8') 

string = "中文"
print repr(string.decode('utf-8'))

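4. Unicode

Python 2 has two string types, str and unicode. A str is a sequence of bytes in some particular encoding. Inside a Python program, text is usually handled as unicode, which is an in-memory representation. When such data is written to a file or a log, the unicode string has to be encoded into a concrete storage encoding such as UTF-8 or GBK.

Unicode assigns numbers (code points) from 0 up to a little over 1.1 million (0x10FFFF), which fits in three bytes. Each range of code points corresponds to a script or language, and the standard now covers most of the world's characters. Every character has a unique number, so Unicode is in essence a character set. The numbering by itself does not say how the numbers are serialized to bytes for transmission and decoded again. That is the job of encoding schemes such as UTF-8, UTF-16, and UTF-32. They mainly address transmission efficiency: sending every character as a fixed three-byte number would be wasteful for text that is mostly ASCII.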
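The size trade-off between the encoding schemes can be checked directly. This is a small sketch with illustrative names; the little-endian variants are used so that no byte order mark is counted.

```python
# -*- coding: utf-8 -*-
samples = [u"A", u"中"]
codecs_to_compare = ["utf-8", "utf-16-le", "utf-32-le"]

# Map (character, codec) to the number of bytes the encoding needs.
byte_counts = {}
for ch in samples:
    for codec in codecs_to_compare:
        byte_counts[(ch, codec)] = len(ch.encode(codec))
        print("{0} {1} {2}".format(hex(ord(ch)), codec, byte_counts[(ch, codec)]))
```

UTF-8 spends one byte on the ASCII letter and three on the Chinese character, while UTF-32 always spends four.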

str to unicode

string = "asdf"
string.decode("utf-8")

So u simply means unicode.

unicode to str

string = u"asdf"
string.encode("utf-8")
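The two conversions above use ASCII-only text, so any codec would give the same result. A sketch with Chinese text, using illustrative names, shows why the codec passed to encode and decode has to match.

```python
# -*- coding: utf-8 -*-
text = u"中文"

utf8_bytes = text.encode("utf-8")   # unicode -> bytes in one concrete encoding
gbk_bytes = text.encode("gbk")      # same text, different bytes

# Decoding with the codec that produced the bytes restores the text.
round_trip_ok = (utf8_bytes.decode("utf-8") == text and
                 gbk_bytes.decode("gbk") == text)

# Decoding with a different codec does not restore it; here it fails outright.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(round_trip_ok)
print(wrong_codec_failed)
```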

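5. unicode-escape

When storing unicode in a text file there is another option: instead of encoding to a real storage character set, store the code point values themselves as escape sequences and convert them back when the file is read. This is what the unicode-escape codec does.

unicode to unicode-escape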

string = u"中文"  # a unicode object; a plain str would first be decoded implicitly with the default encoding
string.encode("unicode-escape")

unicode-escape to unicode

string = "中文"  
string.decode("unicode-escape")

　　>> u'\xe4\xb8\xad\xe6\x96\x87'
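The result above is not the original text, because the input was raw UTF-8 bytes rather than escape sequences. A sketch of the full round trip next to that pitfall, with illustrative names, written to run on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
text = u"中文"

escaped = text.encode("unicode-escape")      # ASCII-only bytes holding \uXXXX escapes
restored = escaped.decode("unicode-escape")  # back to the original text

# Pitfall: raw UTF-8 bytes contain no escape sequences, so the codec maps
# each byte to the code point with the same value (as Latin-1 would).
misread = text.encode("utf-8").decode("unicode-escape")

print(escaped)
print(restored == text)
print(len(misread))
```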

6. string-escape

A UTF-8 encoded string is usually stored as is, but there is also a way to store its UTF-8 byte values as escape sequences: string-escape.

str (UTF-8) to string-escape

string = "中文"  
string.encode("string-escape")

　　>> '\\xe4\\xb8\\xad\\xe6\\x96\\x87'

string-escape to str (UTF-8)

string = "中文"  
string.decode("string-escape")

　　>>'\xe4\xb8\xad\xe6\x96\x87'

//------------- Further analysis, based on the above:

a = "中文"
print repr(a.decode("utf-8"))
a = "中文"
print repr(a.decode("unicode-escape"))
print repr(u"中文")
print repr(a)

This shows the difference between converting str to unicode and converting unicode-escape to unicode. Another example:

string = '\u4e2d\u6587'
print repr(string.decode("unicode-escape"))
print repr(string.decode("utf8"))

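This makes the difference clearer. Decoding with unicode-escape interprets each \uXXXX sequence in the byte string as a code point, so the result is u'\u4e2d\u6587', the two Chinese characters. Decoding the same bytes as utf8 takes every byte literally, backslashes included, so the result is u'\\u4e2d\\u6587', twelve ordinary ASCII characters.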
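The contrast can be verified by comparing lengths. This is a sketch with illustrative names; the b prefix makes the same literal a byte string on both Python 2.7 and Python 3.

```python
raw = b"\\u4e2d\\u6587"   # twelve ASCII bytes: backslash, "u", four hex digits, twice

as_escape = raw.decode("unicode-escape")   # escape sequences are interpreted
as_utf8 = raw.decode("utf-8")              # every byte is taken literally

print(len(raw))
print(len(as_escape))
print(len(as_utf8))
```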

Explanation for Chinese text in a list:

arr = [u"中文"]
print arr
print repr(arr)
pp = str(arr).decode("unicode-escape")
print pp
print repr(pp)
tt = str(arr).decode("utf-8")
print tt
print repr(tt)

>>[u'\u4e2d\u6587']

>>[u'\u4e2d\u6587']

>>[u'中文']

>>u"[u'\u4e2d\u6587']"

>>[u'\u4e2d\u6587']

>>u"[u'\\u4e2d\\u6587']"

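So the approach to printing Chinese text inside a list is:

Turn the list into a str with str(), which leaves the unicode elements as \uXXXX escape sequences, and then decode that str with unicode-escape so the escape sequences become the Chinese characters again. Decoding with utf-8 instead would treat the escapes as literal text and the Chinese would stay hidden, which is why unicode-escape is chosen.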
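As an alternative that is not taken from the article, the standard json module can produce a readable rendering without an escape and decode round trip. This is only a sketch: the output uses JSON syntax (double quotes, no u prefix), and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import json

arr = [u"中文"]

escaped = json.dumps(arr)                       # non-ASCII escaped, much like the list repr
readable = json.dumps(arr, ensure_ascii=False)  # characters kept as they are
joined = u"[{0}]".format(u", ".join(arr))       # plain join, no quoting at all

print(escaped)
```

Printing readable on a terminal that can display UTF-8 shows ["中文"], and joined shows [中文].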
