Handling Chinese text in Python regular expressions (repost)

Reposted article, 2013-07-27 23:27. Tags: python (repost) / regular expressions

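When matching Chinese text, the regular expression pattern and the target string must use the same encoding: both unicode, or both byte strings in the same encoding.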

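    import sys
    print sys.getdefaultencoding()     # default codec of the interpreter
    text = u"#who#helloworld#a中文x#"
    print isinstance(text, unicode)    # True: text is a unicode object
    print text                         # fails on an ASCII-only console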

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 18: ordinal not in range(128)

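The print text statement raises this error.
Explanation: the console output window writes its output as ASCII (the default encoding on an English-language system is ascii), while the string in the code above is Unicode, so printing it fails.
Changing the statement to print(text.encode('utf8')) fixes it.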
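
For readers on Python 3, where str is already Unicode, the same failure and fix can be sketched as below. This is only an illustration under stated assumptions, not the original author's code: the variable names are invented here, and the ascii codec stands in for an ASCII-only console.

```python
# Python 3 sketch: encoding Chinese text with an ASCII-only codec fails,
# while UTF-8 can represent every character in the string.
text = "#who#helloworld#a中文x#"

try:
    text.encode("ascii")             # stands in for an ASCII-only console
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False                 # the Chinese characters have no ASCII form

utf8_bytes = text.encode("utf-8")    # explicit UTF-8 encoding succeeds
print(ascii_ok, len(text), len(utf8_bytes))
```

Each of the two Chinese characters takes three bytes in UTF-8, which is why the byte string is longer than the text.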

# Check the system default encoding
import sys
print sys.getdefaultencoding()
# 'ascii'

# Check whether a string is unicode
print isinstance(text, unicode)
# True
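
On Python 3 these checks look slightly different, because the unicode type no longer exists and str takes its place. The sketch below assumes Python 3 and uses invented variable names. It also reads sys.stdout.encoding, the encoding that print uses for console output, which may be missing or None when output is redirected.

```python
import sys

# Codec used for implicit str/bytes conversions inside the interpreter.
default_encoding = sys.getdefaultencoding()

# Encoding of the console stream; may be None if stdout has been replaced.
stdout_encoding = getattr(sys.stdout, "encoding", None)

text = "#who#helloworld#a中文x#"
is_text = isinstance(text, str)      # Python 3 counterpart of the unicode check

print(default_encoding, stdout_encoding, is_text)
```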

Converting between unicode and Python str

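# -*- coding: utf-8 -*-
__author__ = 'medcl'
unistr = u'a'
pystr = unistr.encode('utf8')       # unicode -> UTF-8 byte string (str)
unistr2 = unicode(pystr, 'utf8')    # UTF-8 byte string -> unicode

value = pystr                       # stands in for any incoming string

# Where unicode is required
if not isinstance(value, unicode):
    temp = unicode(value, 'utf8')
else:
    temp = value

# Where a Python str (byte string) is required
if isinstance(value, unicode):
    temp2 = value.encode('utf8')
else:
    temp2 = value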
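
The two branches above can be wrapped in small helpers. The following is a hedged Python 3 sketch: the function names to_text and to_bytes are hypothetical, UTF-8 is assumed as the default encoding, and str and bytes play the roles that unicode and str play in Python 2.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


raw = "a中文x".encode("utf-8")
round_trip = to_bytes(to_text(raw))
print(round_trip == raw)
```

Normalizing at the boundary this way lets the rest of the code assume a single string type.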

Extracting non-ASCII text with a regex

Content (as a UTF-8 byte string):
"#who#helloworld#a中文x#"

Pattern:
r"[\x80-\xff]+"

Output:
中文
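
A minimal Python 3 sketch of this byte-level match follows, assuming the content is UTF-8 encoded; the variable names are invented for illustration. Because the target is bytes, the pattern is written as a bytes literal as well, in line with the rule that pattern and target must share an encoding.

```python
import re

data = "#who#helloworld#a中文x#".encode("utf-8")   # UTF-8 bytes

# Every byte of a multi-byte UTF-8 character is >= 0x80, so this
# pattern picks out runs of non-ASCII bytes.
runs = re.findall(rb"[\x80-\xff]+", data)
decoded = [run.decode("utf-8") for run in runs]

# Mixing a str pattern with bytes data is rejected by the re module.
try:
    re.findall(r"[\x80-\xff]+", data)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(len(runs), mixed_ok)
```

This approach finds any non-ASCII run, not only Chinese; it cannot tell Chinese from other multi-byte scripts.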

# -*- coding: utf-8 -*-
__author__ = 'medcl'
import re

def findPart(regex, text, name):
    # Print every match of regex found in the unicode string text
    res = re.findall(regex, text)
    if res:
        print "There are %d %s parts:\n" % (len(res), name)
        for r in res:
            print "\t", r.encode("utf8")
        print

text = "#who#helloworld#a中文x#"         # UTF-8 byte string
usample = unicode(text, 'utf8')          # decode so the unicode pattern applies
findPart(u"#[\w\u2E80-\u9FFF]+#", usample, "unicode chinese")

Output:

	#who#
	#a中文x#
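
The same search can be sketched in Python 3, where both the pattern and the text are str and no explicit decode step is needed. The helper name find_parts and the second pattern are assumptions added for illustration.

```python
import re


def find_parts(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


text = "#who#helloworld#a中文x#"

# Same pattern as above: word characters or code points U+2E80..U+9FFF
# enclosed in a pair of # signs.
tagged = find_parts(r"#[\w\u2E80-\u9FFF]+#", text)

# Only the run of CJK unified ideographs (U+4E00..U+9FFF).
han_only = find_parts(r"[\u4E00-\u9FFF]+", text)

print(len(tagged), len(han_only))
```

The segment #helloworld# is not reported because re.findall returns non-overlapping matches, and its leading # was already consumed by the match #who#.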

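Main character ranges for several non-English scripts

2E80-33FFh: CJK symbols area. Holds the Kangxi dictionary radicals, the CJK radicals supplement, Bopomofo (zhuyin) symbols, Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized letters, numbers and months, and Japanese kana combinations, units, era names, months, days, times, and so on.
3400-4DFFh: CJK Unified Ideographs Extension A, holding 6,582 CJK ideographs in total.
4E00-9FFFh: CJK Unified Ideographs, holding 20,902 CJK ideographs in total.
A000-A4FFh: Yi script area, holding the syllables and radicals of the Yi script of southern China.
AC00-D7FFh: Hangul syllables area, holding the characters composed from Korean jamo.
F900-FAFFh: CJK Compatibility Ideographs, holding 302 CJK ideographs in total.
FB00-FFFDh: presentation forms area, holding Latin ligatures, Hebrew and Arabic presentation forms, CJK vertical punctuation, small form variants, and halfwidth and fullwidth forms.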
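
The ranges listed above can be turned into a small lookup. This is a sketch under stated assumptions: the table name BLOCKS, the function block_of, and the short English labels are invented here, and the boundaries are copied from the list rather than taken from the Unicode block definitions.

```python
# (start, end, label) triples copied from the ranges listed above.
BLOCKS = [
    (0x2E80, 0x33FF, "CJK symbols"),
    (0x3400, 0x4DFF, "CJK unified ideographs extension A"),
    (0x4E00, 0x9FFF, "CJK unified ideographs"),
    (0xA000, 0xA4FF, "Yi"),
    (0xAC00, 0xD7FF, "Hangul syllables"),
    (0xF900, 0xFAFF, "CJK compatibility ideographs"),
    (0xFB00, 0xFFFD, "presentation forms"),
]


def block_of(ch):
    """Return the label of the listed range containing ch, or None."""
    code_point = ord(ch)
    for start, end, label in BLOCKS:
        if start <= code_point <= end:
            return label
    return None


# Latin a, a Chinese ideograph, hiragana, a Hangul syllable, fullwidth A.
labels = [block_of(ch) for ch in "a中あ한Ａ"]
print(len(labels))
```

A regex character class such as [\u2E80-\u9FFF] spans the first three ranges at once, which is why the earlier pattern also accepts kana and CJK punctuation.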

REF:http://www.blogjava.net/Skynet/archive/2009/05/02/268628.html

http://iregex.org/blog/python-chinese-unicode-regular-expressions.html
