I ran into a small bug and am recording it here.
1. A regular expression applied to English (ASCII) text works as expected:
>>> import re
>>> b = 'adfasdfasf<1safadsaf>23wfsa<13131>'
>>> pat = re.compile('<.*?>')
>>> pat.findall(b)
['<1safadsaf>', '<13131>']
2. With Chinese text, the same pattern seemed to match nothing
>>> msg="<Fault warning -- \xb4\xed\xce\xf3!\n\n\xb2\xfa\xc6\xb7\xb1\xe0\xba\xc5\xb1\xd8\xd0\xe8\xce\xa8\xd2\xbb,\xd0\xc2\xb1\xe0\xba\xc53123\xb6\xd4\xd3\xa6\xb5\xc4\xb2\xfa\xc6\xb7\xd2\xd1\xbe\xad\xb4\xe6\xd4\xda!\n\xc8\xe7\xb9\xfb\xc4\xfa\xca\xd4\xcd\xbc\xcd\xa8\xb9\xfd\xb8\xb4\xd6\xc6\xc0\xb4\xc9\xfd\xbc\xb6\xb2\xfa\xc6\xb7\xd4\xf2\xcb\xb5\xc3\xf7\xb4\xcb\xb2\xfa\xc6\xb7\xd2\xd1\xbe\xad\xb4\xe6\xd4\xda\xc9\xfd\xbc\xb6\xb0\xe6\xa3\xac\xc7\xeb\xc1\xf4\xd2\xe2\xa1\xa3: ''>"
>>> pat.findall(msg)
[]
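A closer look suggests the cause is the \n characters inside the string rather than the Chinese text!
Still puzzled, I tried a few more cases: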
>>> msg = '<\r>asdasf<asdfaf>'
>>> pat.findall(msg)
['<\r>', '<asdfaf>']
>>> msg = '<\n>adf<afd>'
>>> pat.findall(msg)
['<afd>']
>>> msg = '<\s>adaf<asdfa>'
>>> pat.findall(msg)
['<\\s>', '<asdfa>']
>>> msg = '<\n>asdfasf<asfa>'
>>> pat.findall(msg)
['<asfa>']
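Indeed, the dot cannot match the special character '\n'!
I found an explanation here.
. | Matches any single character except "\n". To match any character including '\n', use a pattern such as '[.\n]'. |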
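The default behaviour of the dot can be checked in isolation. The following is only a minimal sketch; the helper name `find_tags` is hypothetical and not part of the original session.

```python
import re

# Default mode: '.' matches any character except a newline.
TAG = re.compile(r'<.*?>')

def find_tags(text):
    """Return every non-greedy <...> span found in text (hypothetical helper)."""
    return TAG.findall(text)

# A carriage return is an ordinary character for '.', so '<\r>' is found.
print(find_tags('<\r>asdasf<asdfaf>'))  # ['<\r>', '<asdfaf>']

# A newline is not matched by '.', so '<\n>' is skipped.
print(find_tags('<\n>adf<afd>'))        # ['<afd>']
```

Only '\n' is excluded here; other control characters such as '\r' still match.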
3. The awkward case of [.\n]
>>> pat = re.compile('<[.\n]*?>')
>>> pat.findall(msg)
['<\n>']
>>> msg
'<\n>asdfasf<asfa>'
>>> msg = '<\nasdfs>adaf<adaf>'
>>> pat.findall(msg)
[]
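The reason, as far as I can tell, is that in Python's re module a dot inside square brackets is a literal '.', not a wildcard, so '[.\n]' only accepts dots and newlines. A small sketch to illustrate (the variable name `literal_class` is my own):

```python
import re

# Inside [...] the dot loses its special meaning and is a literal '.'.
literal_class = re.compile('<[.\n]*?>')

# Only dots and newlines are allowed between '<' and '>'.
print(literal_class.findall('<\n>asdfasf<asfa>'))    # ['<\n>']
print(literal_class.findall('<..\n.>'))              # ['<..\n.>']

# Letters are not in the class, so neither tag matches.
print(literal_class.findall('<\nasdfs>adaf<adaf>'))  # []
```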
After some googling I found the answer here: add the DOTALL flag, as follows:
>>> pat = re.compile('<.*?>', re.DOTALL)
>>> pat.findall(msg)
['<\nasdfs>', '<adaf>']
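For comparison, here are a few other spellings that should give the same result on this input. This is a sketch rather than a recommendation; the variable names are mine, and the negated class '[^>]' is a different technique that simply stops at the first '>'.

```python
import re

msg = '<\nasdfs>adaf<adaf>'

# 1. The DOTALL flag, as in the session above.
with_flag = re.compile(r'<.*?>', re.DOTALL).findall(msg)

# 2. The same flag written inline at the start of the pattern.
inline_flag = re.compile(r'(?s)<.*?>').findall(msg)

# 3. A character class that really does cover every character.
any_char_class = re.compile(r'<[\s\S]*?>').findall(msg)

# 4. A negated class: anything except '>', which includes newlines.
negated_class = re.compile(r'<[^>]*>').findall(msg)

print(with_flag)  # ['<\nasdfs>', '<adaf>']
print(with_flag == inline_flag == any_char_class == negated_class)  # True
```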