[Python Regular Expressions] Matching XML tags in a string

Reposted article, 2016-12-17 14:08. Tags: python, regular expressions

Suppose we have the following requirement. Given data like this:

0-0-0 0:0:0 #### the 68th annual golden globe awards ####  the king s speech earns 7 nominations  ####  <LOCATION>LOS ANGELES</LOCATION> <ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION> historical drama British king stammer beat competitors Tuesday grab seven nominations Golden Globe Awards nominations included best film drama nod contested award organizers said films competing best picture <ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION> earned nominations best performance actor olin <PERSON>Firth</PERSON> best performance actress <PERSON>Helena Bonham</PERSON> arter best supporting actor <PERSON>Geoffrey Rush</PERSON> best director <PERSON>Tom Hooper</PERSON> best screenplay <PERSON>David Seidler</PERSON> best movie score <ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION> earned nods apiece Black Swan Inception Kids Right tied place movie race nominations best motion picture comedy musical category <ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION> compete Nominated best actor motion picture olin <ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION> best actress motion picture nominees <PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON> Rabbit Hole <PERSON>Jennifer Lawrence</PERSON> <ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION> categories Glee nominee nods followed Rock Boardwalk Empire Dexter Good Wife Mad Men Modern Family Pillars Earth Temple <PERSON>Grandin</PERSON> tied nods apiece awards announced Jan

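The task is to process the data line by line: replace the spaces inside each <></> tag pair with underscores (_), and convert the form of the data. For example, <X>A B C</X> must become A_B_C/X.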

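Regex quantifiers are greedy by default, that is, they match as far to the right as possible, which makes this awkward. Adding ? to make the quantifier lazy is not enough on its own either, because it does not guarantee that the closing tag found is the one that pairs with the opening tag. So the tag name matched earlier has to be given a name in the pattern, so that the closing tag can refer back to it.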

In Python, the final regex looks like this:

LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
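To see what the named backreference buys, here is a minimal sketch comparing a greedy pattern, a lazy pattern, and a lazy pattern with a `(?P=label)` backreference. The sample strings `repeated` and `nested` are made up for illustration and do not come from the data above.

```python
import re

# Hypothetical sample strings, not taken from the article's data.
repeated = '<A>x</A> and <A>y</A>'
nested = '<A>x <B>y</B> z</A>'

# Greedy: .+ runs to the last closing tag, swallowing both elements.
greedy = re.search(r'<\S+>.+</\S+>', repeated).group(0)

# Lazy: .+? stops at the first closing tag, whatever its name is.
lazy = re.search(r'<\S+>.+?</\S+>', nested).group(0)

# Lazy plus a named backreference: the closing tag must repeat the opening name.
named = re.search(r'<(?P<label>\S+)>.+?</(?P=label)>', nested).group(0)

print(greedy)  # <A>x</A> and <A>y</A>
print(lazy)    # <A>x <B>y</B>
print(named)   # <A>x <B>y</B> z</A>
```

Only the last pattern pairs `<A>` with `</A>`; the lazy pattern alone stops at `</B>`.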

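Next we find the corresponding substrings in the original string and record them, then precompute what each one should be replaced with. One more regex takes care of that: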

LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

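Map over the whole collection of strings and match each one, then zip the two sets of match results together. This yields a tuple of (original tagged substring, replacement) for each match, roughly like this: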

('<LOCATION>LOS ANGELES</LOCATION>', 'LOS_ANGELES/LOCATION')
('<ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION>', 'Dec_Xinhua_Kings_Speech/ORGANIZATION')
('<ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION>', 'Social_Network_Black_Swan_Fighter_Inception_Kings_Speech/ORGANIZATION')
('<PERSON>Firth</PERSON>', 'Firth/PERSON')
('<PERSON>Helena Bonham</PERSON>', 'Helena_Bonham/PERSON')
('<PERSON>Geoffrey Rush</PERSON>', 'Geoffrey_Rush/PERSON')
('<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')
('<PERSON>David Seidler</PERSON>', 'David_Seidler/PERSON')
('<ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION>', 'Alexandre_Desplat_Social_Network_Fighter/ORGANIZATION')
('<ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION>', 'Alice_Wonderland_Burlesque_Kids_Right_Red_Tourist/ORGANIZATION')
('<ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION>', 'Firth_Kings_Speech_James_Franco_Hours_Ryan_Gosling_Blue_Valentine_Mark_Wahlberg_Fighter_Jesse_Eisenberg_Social_Network/ORGANIZATION')
('<PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON>', 'Halle_Berry_Frankie_Alice_Nicole_Kidman/PERSON')
('<PERSON>Jennifer Lawrence</PERSON>', 'Jennifer_Lawrence/PERSON')
('<ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION>', 'Winters_Bone_Natalie_Portman_Black_Swan_Michelle_Williams_Blue_Valentine_TV/ORGANIZATION')
('<PERSON>Grandin</PERSON>', 'Grandin/PERSON')
('<LOCATION>BEIJING</LOCATION>', 'BEIJING/LOCATION')
('<ORGANIZATION>Xinhua Sanlu Group</ORGANIZATION>', 'Xinhua_Sanlu_Group/ORGANIZATION')
('<LOCATION>Gansu</LOCATION>', 'Gansu/LOCATION')
('<ORGANIZATION>Sanlu</ORGANIZATION>', 'Sanlu/ORGANIZATION')
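The pairing step that produces tuples like these can be sketched on a single short line. The `line` value below is a made-up sample; the two patterns are the ones given in the article. `findall` returns one tuple per match with one item per capture group, which is why the first pattern yields (whole match, label) and the second yields (label, content).

```python
import re

LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

# Hypothetical one-line sample, shortened for illustration.
line = 'visit <LOCATION>LOS ANGELES</LOCATION> with <PERSON>Tom Hooper</PERSON>'

# One tuple per match: (whole tagged substring, label).
whole = LABEL_PATTERN.findall(line)
# One tuple per match: (label, content).
parts = LABEL_CONTENT_PATTERN.findall(line)

# Pair each tagged substring with its replacement form.
pairs = [(w[0], content.replace(' ', '_') + '/' + label)
         for w, (label, content) in zip(whole, parts)]

for pair in pairs:
    print(pair)
# ('<LOCATION>LOS ANGELES</LOCATION>', 'LOS_ANGELES/LOCATION')
# ('<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')
```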

The processing code is as follows:

def read_file(path):
    # Return the file's lines, or an empty list if the file is missing.
    if not os.path.exists(path):
        print("path: '" + path + "' not found.")
        return []
    # The with-block closes the file, so no explicit close() is needed.
    with open(path, 'r') as fp:
        return fp.read().split('\n')

def get_label(each):
    # Full tagged substrings, e.g. '<X>A B C</X>' (first group of each match).
    originals = [m[0] for m in LABEL_PATTERN.findall(each)]
    # Replacement strings, e.g. 'A_B_C/X', built from (label, content) pairs.
    replacements = [content.replace(' ', '_') + '/' + label
                    for label, content in LABEL_CONTENT_PATTERN.findall(each)]
    return list(zip(originals, replacements))

src = read_file(FILE_PATH)
pattern = [get_label(line) for line in src]

After that, a simple pass finishes the job:

for i in range(len(src)):
    for pat in pattern[i]:
        # Plain string replacement: the matched text must not be read as a regex.
        src[i] = src[i].replace(pat[0], pat[1])
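As an alternative to collecting pairs and then replacing them, the whole conversion can be done in one pass, because `re.sub` accepts a function as its replacement argument. The sketch below is not the article's code: the names `TAG_PATTERN`, `convert`, and `convert_line` are hypothetical, and the pattern is the article's content pattern with the content group given a name.

```python
import re

# The article's content pattern, with the content group named (an assumption
# made here for readability).
TAG_PATTERN = re.compile(r'<(?P<label>\S+)>(?P<content>.*?)</(?P=label)>')

def convert(match):
    # Build 'A_B_C/X' from the content and label of one matched element.
    return match.group('content').replace(' ', '_') + '/' + match.group('label')

def convert_line(line):
    # re.sub calls convert() once per match and splices in its return value.
    return TAG_PATTERN.sub(convert, line)

print(convert_line('<X>A B C</X> and <PERSON>Tom Hooper</PERSON>'))
# A_B_C/X and Tom_Hooper/PERSON
```

Since the replacement is computed directly from each match object, the matched text is never reused as a pattern, so characters that are special in a regex cause no trouble.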

The full code:

# -*- coding: utf-8 -*-
import re
import os

# FILE_PATH = '/home/kirai/workspace/sina_news_process/disworded_sina_news_attr_handled.txt'
FILE_PATH = '/home/kirai/workspace/sina_news_process/test.txt'
LABEL_PATTERN = re.compile(r'(<(?P<label>\S+)>.+?</(?P=label)>)')
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\S+)>(.*?)</(?P=label)>')

def read_file(path):
    # Return the file's lines, or an empty list if the file is missing.
    if not os.path.exists(path):
        print("path: '" + path + "' not found.")
        return []
    # The with-block closes the file, so no explicit close() is needed.
    with open(path, 'r') as fp:
        return fp.read().split('\n')

def get_label(each):
    # Full tagged substrings, e.g. '<X>A B C</X>' (first group of each match).
    originals = [m[0] for m in LABEL_PATTERN.findall(each)]
    # Replacement strings, e.g. 'A_B_C/X', built from (label, content) pairs.
    replacements = [content.replace(' ', '_') + '/' + label
                    for label, content in LABEL_CONTENT_PATTERN.findall(each)]
    return list(zip(originals, replacements))

src = read_file(FILE_PATH)
pattern = [get_label(line) for line in src]

for i in range(len(src)):
    for pat in pattern[i]:
        # Plain string replacement: the matched text must not be read as a regex.
        src[i] = src[i].replace(pat[0], pat[1])
