Cleaning up strings with unicodedata.normalize()

# The first argument to normalize() selects the normalization form, such as NFC or NFD

>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> import unicodedata
# NFC means characters should be fully composed (using a single code point where possible)
>>> t1 = unicodedata.normalize('NFC', s1)
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t1 == t2
True
# NFD means characters should be fully decomposed into a base character plus combining characters
>>> t1 = unicodedata.normalize('NFD', s1)
>>> t2 = unicodedata.normalize('NFD', s2)
>>> t1 == t2
True

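Note: Python also supports NFKC and NFKD. They are called the same way, but additionally apply compatibility decomposition (for example, replacing a ligature with its component letters).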
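To make the difference between the canonical and compatibility forms concrete, here is a minimal sketch. The sample character (the "fi" ligature, U+FB01) and the variable names are chosen for illustration and are not part of the notes above.

```python
import unicodedata

ligature = '\ufb01'  # LATIN SMALL LIGATURE FI, a single code point

# Canonical decomposition (NFD) leaves the ligature untouched...
canonical = unicodedata.normalize('NFD', ligature)

# ...while compatibility decomposition (NFKD) replaces it with 'f' + 'i'.
compat = unicodedata.normalize('NFKD', ligature)

print(canonical == ligature)  # True
print(compat)                 # fi
```

Compatibility forms are lossy in this sense, so they are usually applied for searching or comparison rather than for text that must round-trip unchanged.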

combining() tests whether a character is a combining character

>>> s1
'Spicy Jalapeño'
>>> t1 = unicodedata.normalize('NFD', s1)
>>> ''.join(c for c in t1 if not unicodedata.combining(c)) # drop the combining characters
'Spicy Jalapeno'
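The two steps above (decompose, then filter) can be wrapped in a small helper. This is only a sketch; the function name `strip_accents` is hypothetical and does not come from the standard library.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # Decompose first so that accents become separate combining characters.
    decomposed = unicodedata.normalize('NFD', text)
    # combining() returns 0 for non-combining characters, so keep only those.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Spicy Jalape\u00f1o'))   # Spicy Jalapeno
print(strip_accents('Spicy Jalapen\u0303o'))  # Spicy Jalapeno
```

Because the input is normalized inside the function, both the composed and the decomposed spelling give the same result.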

Using strip(), rstrip(), and lstrip()

>>> s = ' hello world \n'
# Strip whitespace from both ends
>>> s.strip()
'hello world'
# Strip whitespace from the right end
>>> s.rstrip()
' hello world'
# Strip whitespace from the left end
>>> s.lstrip()
'hello world \n'
>>> t = '-----hello====='
# Strip the given characters ('-') from the left end
>>> t.lstrip('-')
'hello====='
# Strip the given characters ('=') from the right end
>>> t.rstrip('=')
'-----hello'
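One detail worth spelling out: the argument to strip(), lstrip(), and rstrip() is treated as a set of characters, not as a prefix or suffix string. A short sketch (the variable names and the second sample string are illustrative):

```python
t = '-----hello====='

# Passing both characters strips either of them from both ends.
both_ends = t.strip('-=')
print(both_ends)  # hello

# The characters are matched in any order, not as the literal text 'w.com'.
domain = 'www.example.com'.strip('w.com')
print(domain)  # example
```

Stripping stops at the first character that is not in the set, which is why the inner dots of the second string are not reached.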

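# Note that strip() and its variants never touch whitespace in the middle of a string; to deal with inner whitespace, use one of the approaches below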

>>> s = ' hello world \n'
# replace() removes every space, leaving none between the words
>>> s.replace(' ', '')
'helloworld\n'
# Use a regular expression to collapse each run of whitespace into a single space
>>> import re
>>> re.sub(r'\s+', ' ', s)
' hello world '
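As the output above shows, re.sub() alone still leaves one space at each end. A minimal sketch of two ways to collapse inner whitespace and trim the ends in one go (the sample string with extra spaces is made up for illustration):

```python
import re

s = '  hello     world \n'

# Collapse whitespace runs to one space, then trim both ends.
collapsed = re.sub(r'\s+', ' ', s).strip()

# split() with no arguments splits on whitespace runs and drops empty
# pieces at the ends, so joining with a single space gives the same text here.
joined = ' '.join(s.split())

print(collapsed)  # hello world
print(joined)     # hello world
```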

About translate()

# Handling combining characters

>>> s = 'pýtĥöñ\fis\tawesome\r\n'
>>> remap = {ord('\r'): None, ord('\t'): ' ', ord('\f'): ' '} # translation table: delete \r, turn \t and \f into a space
>>> a = s.translate(remap) # apply the table
>>> a
'pýtĥöñ is awesome\n'
>>> import unicodedata
>>> import sys
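# Collect every combining character as a dict key; dict.fromkeys() maps each key to None, which translate() treats as "delete"
>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))
# Normalize the input to its fully decomposed form
>>> b = unicodedata.normalize('NFD', a)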
>>> b
'pýtĥöñ is awesome\n'
>>> b.translate(cmb_chrs)
'python is awesome\n'
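The whitespace table `remap` above is written with ord() keys. As an alternative sketch, the same mapping can be built with str.maketrans(), either from a dict with single-character keys or from its three-argument form. The variable names here are illustrative.

```python
# Same sample text as above, written with escapes for the accented letters.
s = 'p\u00fdt\u0125\u00f6\u00f1\fis\tawesome\r\n'

# Dict form: keys may be single characters instead of ord() values.
table_from_dict = str.maketrans({'\r': None, '\t': ' ', '\f': ' '})

# Three-argument form: each character in the first string maps to the
# character at the same position in the second; the third is deleted.
table_from_args = str.maketrans('\t\f', '  ', '\r')

cleaned_dict = s.translate(table_from_dict)
cleaned_args = s.translate(table_from_args)
print(cleaned_dict == cleaned_args)  # True
```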

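# Map every Unicode decimal digit character to its ASCII equivalent

# unicodedata.digit(chr(c)) returns the character's digit value as an int; adding ord('0') gives the code point of the matching ASCII digit '0'-'9'
# Category 'Nd' (Number, decimal digit) selects the Unicode decimal digit characters
>>> digitmap = {c: ord('0')+unicodedata.digit(chr(c)) for c in range(sys.maxunicode) if unicodedata.category(chr(c)) == 'Nd'}
>>> len(digitmap) # the exact count depends on the Unicode version bundled with your Python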
610
>>> x = '\u0661\u0662\u0663'
>>> x.translate(digitmap)
'123'
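For reuse, the digit table can be built once and wrapped in a function. This is a sketch under the same approach as above; the name `to_ascii_digits` and the second sample string (fullwidth digits) are illustrative assumptions.

```python
import sys
import unicodedata

# Build the table once; +1 so the range also covers sys.maxunicode itself.
digitmap = {
    c: ord('0') + unicodedata.digit(chr(c))
    for c in range(sys.maxunicode + 1)
    if unicodedata.category(chr(c)) == 'Nd'
}

def to_ascii_digits(text):
    """Replace Unicode decimal digits with ASCII '0'-'9' (hypothetical helper)."""
    return text.translate(digitmap)

print(to_ascii_digits('\u0661\u0662\u0663'))  # 123
print(to_ascii_digits('room \uff14\uff12'))   # room 42
```

If the goal is only to parse a number, int() already accepts Unicode decimal digits directly, so int('\u0661\u0662\u0663') gives 123 without any table.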

About the I/O encoding and decoding functions

>>> a
'pýtĥöñ is awesome\n'
>>> b = unicodedata.normalize('NFD', a)
>>> b.encode('ascii', 'ignore').decode('ascii')
'python is awesome\n'
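The encode/decode trick works because NFD splits each accented letter into an ASCII base letter plus a combining mark, and the 'ignore' error handler then drops the mark. A caveat, sketched below: any non-ASCII character that has no decomposition is dropped entirely. The helper name `to_ascii` and the second sample word are illustrative.

```python
import unicodedata

def to_ascii(text):
    """Decompose, then discard everything outside ASCII (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Accented letters keep their base letter.
print(to_ascii('p\u00fdt\u0125\u00f6\u00f1 is awesome'))  # python is awesome

# U+00F8 has no decomposition, so the whole letter disappears.
print(to_ascii('sm\u00f8rrebr\u00f8d'))  # smrrebrd
```

This makes the approach suitable only when losing such characters is acceptable.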


Notes on unicodedata.normalize(), strip()/rstrip()/lstrip(), and encode/decode (for details, see "Python Cookbook", 3rd Edition, sections 2.9 to 2.11)
