[Python] Common Regular Expressions for Data Cleaning

Reposted article. 2018-10-09 17:07

A few regular expressions I have used when cleaning crawled data for natural language processing.

Match all attributes in a tag (excluding specified ones such as src and href)

# \b(?!src|href)\w+=[\'\"].*?[\'\"](?=[\s\>])
# Matches attributes of the form id="..."
# \b(?!...) excludes the listed attribute names; the zero-width lookahead (?=[\s\>]) confirms that the attribute has ended
# Tip: in Python, a pattern containing \b must be written as a raw string (r'...'); otherwise \b is read as a backspace character
import re

str1 = '''
<div class="concent" id="zoomcon" style="padding:15px;">
<img border="0" src="/xcsglj/zyhd/201802/f5492c1752094f44bcebae4a68480c64/images/9a900610afc54ee3b468780785a2ecec.gif">
<img border="0" src="/xcsglj/zyhd/201802/f5492c1752094f44bcebae4a68480c64/images/4b802f5d2d8c4ecd9a0525e0da7d886e.gif">
<img href="0" src="/xcsglj/zyhd/201802/f5492c1752094f44bcebae4a68480c64/images/4b802f5d2d8c4ecd9a0525e0da7d886e.gif">
'''
# Only src is excluded in this call, so href="0" still shows up in the result
print(re.findall(r'\b(?!src)\w+=[\'\"].*?[\'\"](?=[\s\>])', string=str1))
# result: ['class="concent"', 'id="zoomcon"', 'style="padding:15px;"', 'border="0"', 'border="0"', 'href="0"']
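As a rough illustration of how the pattern above might be used for actual cleaning, the sketch below removes every quoted attribute except `src` and `href` with `re.sub`. The helper name `strip_attrs` and the sample HTML are made up for this example, and the leading `\s*` is an addition (not part of the original pattern) so that the space in front of each removed attribute disappears too.

```python
import re

# Same idea as the pattern above, with both src and href excluded.
# Assumption: the leading \s* is added here so the whitespace before
# each removed attribute is deleted along with it.
ATTR_RE = re.compile(r'\s*\b(?!src|href)\w+=[\'"].*?[\'"](?=[\s>])')

def strip_attrs(html):
    """Remove quoted attributes other than src/href (hypothetical helper)."""
    return ATTR_RE.sub('', html)

sample = ('<div class="c" id="z"><img border="0" src="/a.gif">'
          '<a href="/x" target="_blank">t</a></div>')
cleaned = strip_attrs(sample)
print(cleaned)
```

Note that `(?!src|href)` rejects any attribute name that merely starts with `src` or `href` (for example `srcset`), which may or may not be what you want.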

All attributes of an HTML tag

# (?<=\<\w{1}\s).*?(?=\>)
# (?<=\<\w{2}\s).*?(?=\>)
# ...
# Matches all attributes of tags whose name is n letters long, so that they can be removed
# Tip: Python's lookbehind assertions do not support variable-length patterns, so each tag-name length needs its own pattern
import re

str1 = '''
<a class="1" id="1" style="padding:1;">
<td class="2" id="2" style="padding:2;">
<div class="3" id="3" style="padding:3;">
<span class="4" id="4" style="padding:4;">
<table class="5" id="5" style="padding:5;">
'''
print(re.findall(r'(?<=\<\w{1}\s).*?(?=\>)', string=str1))
# result: ['class="1" id="1" style="padding:1;"']
print(re.findall(r'(?<=\<\w{2}\s).*?(?=\>)', string=str1))
# result: ['class="2" id="2" style="padding:2;"']
print(re.findall(r'(?<=\<\w{3}\s).*?(?=\>)', string=str1))
# result: ['class="3" id="3" style="padding:3;"']
print(re.findall(r'(?<=\<\w{4}\s).*?(?=\>)', string=str1))
# result: ['class="4" id="4" style="padding:4;"']
print(re.findall(r'(?<=\<\w{5}\s).*?(?=\>)', string=str1))
# result: ['class="5" id="5" style="padding:5;"']
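One possible way around the fixed-width lookbehind limit is to capture the tag name in a group and write it back in the replacement, so a single pattern covers tag names of any length. This is only a sketch of that alternative, not the approach used above; `drop_all_attrs` and the sample string are hypothetical.

```python
import re

# Capture the tag name, then consume everything up to the closing '>'.
# [^>]* is used instead of .*? so the match cannot run past the tag.
TAG_ATTRS_RE = re.compile(r'<(\w+)\s[^>]*>')

def drop_all_attrs(html):
    """Rewrite '<name attrs...>' as '<name>' (hypothetical helper)."""
    return TAG_ATTRS_RE.sub(r'<\1>', html)

sample = '<a class="1"><td id="2"><table style="padding:5;">'
stripped = drop_all_attrs(sample)
print(stripped)
```

This assumes no attribute value contains a literal `>`; for HTML that messy, a real parser is probably the safer choice.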

Non-Chinese characters

# u'[^\u4e00-\u9fa5]+'
# Removes non-Chinese characters
import re

str1 = 'aa.，a中文,aa。a'
print(re.compile(u"[^\u4e00-\u9fa5]+").sub('', str1))
# result: 中文
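Deleting every non-Chinese character glues the surviving runs of text together. If the boundaries between runs matter (for example, before word segmentation), one option is to extract the Chinese runs with `findall` instead. The sketch below assumes the same `\u4e00-\u9fa5` range; the names and sample text are illustrative only.

```python
import re

# Positive version of the character class above: one match per run
# of consecutive Chinese characters.
CHINESE_RUN_RE = re.compile(r'[\u4e00-\u9fa5]+')

text = 'aa.，a中文,aa。a清洗'
segments = CHINESE_RUN_RE.findall(text)
joined = ' '.join(segments)
print(segments)
print(joined)
```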

Content inside specified placeholder delimiters

# \{.*?\}  matches the content inside {} (delimiters included)
# \<.*?\>  matches the content inside <> (delimiters included)
import re

str1 = '{通配符}你好，今天開學了{通配符},你好'
print(re.compile(r'\{.*?\}').sub('', str1))
# result: 你好，今天開學了,你好
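The `?` after `.*` is what keeps each match confined to a single pair of delimiters. The short sketch below contrasts the lazy and greedy forms and also applies the `<.*?>` variant to a tag-stripping case; the variable names and sample strings are invented for illustration.

```python
import re

LAZY_BRACES = re.compile(r'\{.*?\}')   # stops at the first closing brace
GREEDY_BRACES = re.compile(r'\{.*\}')  # runs on to the last closing brace
TAGS = re.compile(r'<.*?>')            # same lazy idea for <...>

text = '{x}hello{y}, world'
lazy_result = LAZY_BRACES.sub('', text)
greedy_result = GREEDY_BRACES.sub('', text)
no_tags = TAGS.sub('', '<p>Hello, <b>world</b>!</p>')

print(lazy_result)
print(greedy_result)
print(no_tags)
```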

Whitespace at the end of an HTML tag

# \s*(?=\>)
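A minimal usage sketch for this pattern follows. It uses `\s+` rather than `\s*` purely to avoid zero-length matches; the substitution result should be the same. The sample string is made up.

```python
import re

# Whitespace that is immediately followed by '>' (the '>' itself is
# only looked at, not consumed).
TRAILING_WS_RE = re.compile(r'\s+(?=>)')

html = '<div ><span\t>text</span >'
tidied = TRAILING_WS_RE.sub('', html)
print(tidied)
```

Keep in mind that the pattern does not know whether it is inside a tag, so body text such as `a > b` would also lose the space before `>`.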

A specified tag (including the content between its opening and closing tags)

# \<style.*?/style\>
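In practice a `<style>` block usually spans several lines, and `.` does not match newlines by default, so the pattern likely needs the `re.DOTALL` flag to be useful. The sketch below shows the difference; adding `re.IGNORECASE` is an extra assumption to cover `<STYLE>`, and the sample HTML is invented.

```python
import re

# DOTALL lets '.' cross line breaks; IGNORECASE also catches <STYLE>.
STYLE_RE = re.compile(r'<style.*?/style>', re.DOTALL | re.IGNORECASE)

html = ('<p>before</p><style type="text/css">\n'
        'p { color: red; }\n'
        '</style><p>after</p>')

without_flag = re.sub(r'<style.*?/style>', '', html)  # no flags: nothing matches
with_flag = STYLE_RE.sub('', html)
print(with_flag)
```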

Remove special symbols other than common Chinese/English characters, punctuation, and digits

# u'[^\u4e00-\u9fa5\u0041-\u005A\u0061-\u007A\u0030-\u0039\u3002\uFF1F\uFF01\uFF0C\u3001\uFF1B\uFF1A\u300C\u300D\u300E\u300F\u2018\u2019\u201C\u201D\uFF08\uFF09\u3014\u3015\u3010\u3011\u2014\u2026\u2013\uFF0E\u300A\u300B\u3008\u3009\!\@\#\$\%\^\&\*\(\)\-\=\[\]\{\}\\\|\;\'\:\"\,\.\/\<\>\?\/\*\+\_\u0020]+'
import re

# str1 holds the text to be cleaned
str1 = re.compile(
    u"[^"
    u"\u4e00-\u9fa5"   # Chinese characters
    u"\u0041-\u005A"   # A-Z
    u"\u0061-\u007A"   # a-z
    u"\u0030-\u0039"   # 0-9
    # Chinese (full-width) punctuation
    u"\u3002\uFF1F\uFF01\uFF0C\u3001\uFF1B\uFF1A\u300C\u300D\u300E\u300F\u2018\u2019\u201C\u201D\uFF08\uFF09\u3014\u3015\u3010\u3011\u2014\u2026\u2013\uFF0E\u300A\u300B\u3008\u3009"
    # ASCII punctuation; raw string so the backslashes reach the regex engine unchanged
    r"\!\@\#\$\%\^\&\*\(\)\-\=\[\]\{\}\\\|\;\'\:\"\,\.\/\<\>\?\/\*\+\_"
    u"\u0020"          # space
    u"]+"
).sub('', str1)
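To see what this character class keeps and drops, here is a small self-contained sketch that compiles the same set once and applies it to a sample string. The name `CLEAN_RE` and the sample text are made up; the ASCII punctuation part is written as a raw string, which is an assumption about how best to avoid invalid-escape warnings rather than something the original does.

```python
import re

CLEAN_RE = re.compile(
    "[^"
    "\u4e00-\u9fa5"   # Chinese characters
    "\u0041-\u005A"   # A-Z
    "\u0061-\u007A"   # a-z
    "\u0030-\u0039"   # 0-9
    # full-width / CJK punctuation
    "\u3002\uFF1F\uFF01\uFF0C\u3001\uFF1B\uFF1A\u300C\u300D\u300E\u300F"
    "\u2018\u2019\u201C\u201D\uFF08\uFF09\u3014\u3015\u3010\u3011"
    "\u2014\u2026\u2013\uFF0E\u300A\u300B\u3008\u3009"
    # ASCII punctuation (raw string: backslashes go straight to the regex)
    r"\!\@\#\$\%\^\&\*\(\)\-\=\[\]\{\}\\\|\;\'\:\"\,\.\/\<\>\?\/\*\+\_"
    "\u0020"          # space
    "]+"
)

dirty = 'Price★: 100元☆ (含稅)♪ ok~'
clean = CLEAN_RE.sub('', dirty)
print(clean)
```

Note that `~` and the backtick are not in the listed set, so they are removed along with the decorative symbols.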
