【292】Working with Chinese Strings in Python

This article is a repost (see the original). 2018-01-18 16:03 · Python & ArcPy / Python Learning

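1. Background

In Python 2, a plain str is a byte string, so string operations behave as expected only for ASCII (English) text. Applied to Chinese characters, they give wrong results or raise errors. The techniques below show how to handle this.

Before we begin: how do you get the system's default encoding?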

>>> import sys
>>> print sys.getdefaultencoding()
ascii

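The code below uses chardet to check which encoding different strings use. For details, see "用chardet判斷字符編碼的方法" (How to detect character encodings with chardet).

As the output shows, a pure-ASCII string is reported as ascii, while any string containing Chinese characters is reported as utf-8. A Chinese string therefore has to be decoded to Unicode before character-level operations give correct results.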

>>> import chardet
>>> print chardet.detect("abc")
{'confidence': 1.0, 'language': '', 'encoding': 'ascii'}
>>> print chardet.detect("我是中國人")
{'confidence': 0.9690625, 'language': '', 'encoding': 'utf-8'}
>>> print chardet.detect("abc-我是中國人")
{'confidence': 0.9690625, 'language': '', 'encoding': 'utf-8'}

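Decoding a Chinese string with decode('utf-8') makes it behave as expected. Whenever a string function is applied to Chinese text, convert the string as follows.

decode converts a string in some other encoding to Unicode. For example, str1.decode('utf-8') converts the UTF-8 encoded string str1 to a unicode object.
encode converts a unicode object to a string in some other encoding. For example, str2.encode('utf-8') converts the unicode string str2 to a UTF-8 encoded string.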

>>> m = "我是中国人"
>>> m
'\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe5\x9b\xbd\xe4\xba\xba'
>>> print m
我是中国人
>>> # Before decoding: a UTF-8 byte string, length 15
>>> len(m)
15

>>> n = m.decode('utf-8')
>>> n
u'\u6211\u662f\u4e2d\u56fd\u4eba'
>>> print n
我是中国人
>>> # After decoding: a Unicode string, length 5, so string operations work as expected
>>> len(n)
5
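
To make the difference between byte-level and character-level operations concrete, here is a minimal sketch that should run on both Python 2.7 and Python 3. It uses a b"..." literal for the same 15 bytes shown above. The variable names (raw, text, broken and so on) are illustrative only and do not come from the original article.

```python
# The 15 UTF-8 bytes from the REPL session above (a 5-character Chinese string).
raw = b"\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe5\x9b\xbd\xe4\xba\xba"

text = raw.decode("utf-8")        # Unicode text: one item per character

byte_len = len(raw)               # counts bytes
char_len = len(text)              # counts characters

# Slicing the raw bytes can cut a 3-byte character in half,
# leaving a fragment that is no longer valid UTF-8.
broken = raw[:4]
try:
    broken.decode("utf-8")
    byte_slice_is_valid = True
except UnicodeDecodeError:
    byte_slice_is_valid = False

# Slicing the decoded text always falls on character boundaries.
first_two = text[:2].encode("utf-8")

# Encoding the decoded text gives back the original bytes.
round_trip = text.encode("utf-8")
```

Each Chinese character here occupies three bytes in UTF-8, which is why the first two characters correspond to the first six bytes.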

Helper functions for converting between UTF-8 and Unicode:

def decodeChinese( string ):
	"Decode a UTF-8 encoded Chinese byte string to Unicode"
	tmp = string.decode('utf-8')
	return tmp

def encodeChinese( string ):
	"Encode a Unicode string to UTF-8"
	tmp = string.encode('utf-8')
	return tmp
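
Both helpers assume the argument already has the right type. In Python 2, calling decode on a unicode object, or encode on a non-ASCII byte string, triggers an implicit ASCII conversion first and fails. The following is a hedged sketch of a more defensive variant that checks the type before converting. The names text_type, to_text and to_bytes are hypothetical and are not part of the original article.

```python
import sys

# The Unicode text type: `unicode` on Python 2, `str` on Python 3.
# Only the selected branch is evaluated, so `unicode` is never looked up on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding only if it is a byte string."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding only if it is Unicode text."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value
```

With these, passing an already-converted value through a second time is a no-op rather than an error.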

2. Slicing Mixed Chinese and English Strings

The code is as follows:

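def cutChinese(string, start=0, end=None):
	"Slice a UTF-8 string by characters: from start (default 0) to end, or to the end of the string if end is omitted"
	# Decode to Unicode so that indices count characters rather than bytes,
	# slice, then encode the result back to UTF-8
	tmp = string.decode('utf-8')[start:end].encode('utf-8')
	return tmp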

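Call it as follows (win_diy is the author's own module, which is assumed here to expose the function as win.cutChinese):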

>>> from win_diy import *
>>> print win.cutChinese("我是一個abc", 2)
一個abc
>>> print win.cutChinese("我是一個abc", 2, 4)
一個
>>> print win.cutChinese("我是一個abc", 2, 5)
一個a
>>> print win.cutChinese("我是一個abc", 2, 6)
一個ab
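
The same decode, slice, encode idea can be written so that it runs unchanged on Python 2.7 and Python 3. This is only a sketch: cut_utf8 is a hypothetical name, and it assumes the input is always a UTF-8 encoded byte string.

```python
# -*- coding: utf-8 -*-

def cut_utf8(data, start=0, end=None):
    """Slice UTF-8 encoded bytes by character index and return UTF-8 bytes."""
    # A slice end of None means "up to the end of the string".
    return data.decode("utf-8")[start:end].encode("utf-8")


# The sample string from the calls above, as UTF-8 bytes.
sample = u"我是一個abc".encode("utf-8")

tail = cut_utf8(sample, 2)        # from the third character to the end
middle = cut_utf8(sample, 2, 4)   # the third and fourth characters only
```

Because the slice is taken on the decoded text, negative indices and out-of-range ends behave exactly as they do for ordinary Python slicing.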

Reference: "python截取中文字符串" (Slicing Chinese strings in Python)

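3. Checking a Variable's Type and Encoding

Use the isinstance function or the type function to check which string type a variable has.
Use the chardet.detect function to check which encoding a byte string uses.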

>>> import chardet
>>> a = "abc"
>>> isinstance(a, str)
True
>>> chardet.detect(a)['encoding']
'ascii'
>>> isinstance(a, unicode)
False

>>> b = "中國"
>>> isinstance(b, str)
True
>>> chardet.detect(b)['encoding']
'utf-8'
>>> isinstance(b, unicode)
False

>>> # Passing a unicode object to chardet.detect raises an error
>>> c = b.decode('utf-8')
>>> isinstance(c, unicode)
True
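
Since chardet.detect only accepts byte strings, it helps to check the type before trying to detect an encoding. The sketch below is a rough, standard-library-only illustration of that check. It is not how chardet works: it simply tries strict ASCII and UTF-8 decoding in turn and cannot identify any other encoding. The function name describe and its return labels are hypothetical.

```python
def describe(value):
    """Classify a string as Unicode text, ASCII bytes, UTF-8 bytes, or other bytes."""
    # `bytes` is an alias of `str` on Python 2 and a separate type on Python 3.
    # Anything that is not a byte string is assumed to be Unicode text.
    if not isinstance(value, bytes):
        return "unicode text"
    # Try the stricter encoding first: every ASCII string is also valid UTF-8.
    for encoding in ("ascii", "utf-8"):
        try:
            value.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "unknown bytes"


kind_ascii = describe(b"abc")
kind_utf8 = describe(u"\u4e2d\u56fd".encode("utf-8"))
kind_text = describe(u"\u4e2d\u56fd")
kind_other = describe(u"\u4e2d\u56fd".encode("gbk"))
```

A byte string that happens to decode as UTF-8 is not guaranteed to have been written as UTF-8, so the result should be treated as a hint rather than a fact.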

Reference: "Python 字符編碼判斷" (Detecting character encodings in Python)
