Common Regular Expressions for Web Scraping, and Using re.findall

Reposted article. 2019-07-26 17:12. Tags: Python, web scraping

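Common regular expressions for scraping

These are regular expressions that come up often when writing scrapers; they make it easier to pick the text we need out of a string.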

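Regex symbols

Single characters

. : any character except a newline
[] : [aoe], [a-w] match any one character in the set
\d : a digit, [0-9]
\D : a non-digit
\w : a digit, letter, underscore, or Chinese character (for str patterns in Python 3, any Unicode word character)
\W : anything \w does not match
\s : any whitespace character, including space, tab, and form feed; equivalent to [ \f\n\r\t\v]
\S : any non-whitespace character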
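The character classes above can be tried out with a short sketch like the following. The sample strings and variable names are made up for illustration and do not come from the original article.

```python
import re

# Hypothetical sample text, chosen only to exercise each class.
text = "id_42 costs 3 USD\tin 2019"

digits = re.findall(r"\d", text)               # one item per digit
words = re.findall(r"\w+", text)               # runs of word characters
spaces = re.findall(r"\s", text)               # the spaces and the tab
vowels = re.findall(r"[aoe]", "banana boat")   # only a, o or e
not_newline = re.findall(r".", "a\nb")         # . skips the newline
chinese = re.findall(r"\w+", "身高170厘米")     # Chinese characters count as \w

print(digits)       # ['4', '2', '3', '2', '0', '1', '9']
print(words)        # ['id_42', 'costs', '3', 'USD', 'in', '2019']
print(spaces)       # [' ', ' ', ' ', '\t', ' ']
print(vowels)       # ['a', 'a', 'a', 'o', 'a']
print(not_newline)  # ['a', 'b']
print(chinese)      # ['身高170厘米']
```

Raw strings (the r prefix) keep Python from interpreting the backslashes before the regex engine sees them.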

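Quantifiers

* : any number of times, >= 0
+ : at least once, >= 1
? : optional, 0 or 1 times
{m} : exactly m times, e.g. hello{3} matches "hellooo"
{m,} : at least m times, e.g. hello{3,}
{m,n} : between m and n times, inclusive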
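As a rough illustration of the quantifiers, the sketch below checks which of a few test words each pattern matches in full. The word list and the `matching` helper are hypothetical; note that each quantifier applies only to the item right before it, here the final `o`.

```python
import re

# Hypothetical test words: "hell" followed by zero to three o's.
words = ["hell", "hello", "helloo", "hellooo"]

def matching(pattern):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

star = matching(r"hello*")         # o zero or more times
plus = matching(r"hello+")         # o at least once
optional = matching(r"hello?")     # o zero or one time
exactly = matching(r"hello{2}")    # o exactly twice
at_least = matching(r"hello{2,}")  # o at least twice
between = matching(r"hello{1,2}")  # o once or twice

print(star)      # ['hell', 'hello', 'helloo', 'hellooo']
print(plus)      # ['hello', 'helloo', 'hellooo']
print(optional)  # ['hell', 'hello']
print(exactly)   # ['helloo']
print(at_least)  # ['helloo', 'hellooo']
print(between)   # ['hello', 'helloo']
```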

Anchors

$ : the string ends with the preceding pattern
^ : the string starts with the following pattern
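A small sketch of the two anchors follows. The input strings are illustrative only, and no flags are passed, so ^ and $ refer to the start and end of the whole string.

```python
import re

# ^ pins the match to the start of the string.
starts = re.findall(r"^http", "http://www.baidu.com")
not_at_start = re.findall(r"^http", "see http://www.baidu.com")

# $ pins the match to the end of the string.
ends = re.findall(r"com$", "https://www.cnblogs.com")

# Using both anchors requires the whole string to fit the pattern.
all_digits = re.findall(r"^\d+$", "170")
not_all_digits = re.findall(r"^\d+$", "170cm")

print(starts)          # ['http']
print(not_at_start)    # []
print(ends)            # ['com']
print(all_digits)      # ['170']
print(not_all_digits)  # []
```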

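Grouping

(ab) : treats ab as a single unit; with re.findall, the parentheses also select which part of each match is returned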
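How grouping changes what re.findall returns can be sketched as below. With no group it returns the whole match, with one group it returns only that group, and with several groups it returns tuples. The variable names are hypothetical.

```python
import re

html = "<h1>hello word</h1>"
urls = "http://www.baidu.com and https://www.cnblogs.com"

no_group = re.findall(r"<h1>.*</h1>", html)         # whole match
one_group = re.findall(r"<h1>(.*)</h1>", html)      # only the group
two_groups = re.findall(r"(https?)://(\w+)", urls)  # one tuple per match

# A group can also be repeated as a unit: (ab){2} means "abab".
repeated = re.fullmatch(r"(ab){2}", "abab") is not None

print(no_group)    # ['<h1>hello word</h1>']
print(one_group)   # ['hello word']
print(two_groups)  # [('http', 'www'), ('https', 'www')]
print(repeated)    # True
```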

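Greedy mode

.* : matches as many characters as possible

Non-greedy (lazy) mode

.*? : matches as few characters as possible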
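The difference between the two modes shows up most clearly when a string contains several candidate end points. Here is a minimal sketch, using a made-up HTML fragment:

```python
import re

# Hypothetical fragment with two sibling elements.
html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the last </b>, so both elements end up in one match.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the first </b> it can, giving one match per element.
lazy = re.findall(r"<b>.*?</b>", html)

# Lazy plus a group returns just the text inside each element.
inner = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
print(inner)   # ['one', 'two']
```

When scraping pages with repeated tags, the lazy form is usually the one you want.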

import re

# 1 Extract python
'''
key = 'javapythonc++php'

re.findall('python', key)       # ['python']
re.findall('python', key)[0]    # 'python'
'''
# 2 Extract hello word
'''
key = '<html><h1>hello word</h1></html>'
print(re.findall('<h1>.*</h1>', key))        # ['<h1>hello word</h1>']
print(re.findall('<h1>(.*)</h1>', key))      # ['hello word']
print(re.findall('<h1>(.*)</h1>', key)[0])   # hello word
'''
# 3 Extract 170
'''
key = 'This girl is 170 cm tall'
print(re.findall(r'\d+', key)[0])   # 170
'''
# 4 Extract http:// and https://
'''
key = 'http://www.baidu.com and https://www.cnblogs.com'
print(re.findall('https?://', key))   # ['http://', 'https://']
'''
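# 5 Extract hello
'''
key = 'lalala<hTml>hello</HtMl>hahaha'
# Whole element, case-insensitive via character classes: ['<hTml>hello</HtMl>']
print(re.findall(r'<[hH][tT][mM][lL]>.*</[hH][tT][mM][lL]>', key))
# Capture only the text between the tags; re.I replaces the character classes: ['hello']
print(re.findall(r'<html>(.*)</html>', key, re.I))
'''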
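# 6 Extract hit. using lazy matching (greedy matching takes as much as possible)
'''
key = 'qiang@hit.edu.com'                # with ? the match is lazy (non-greedy); without ? it is greedy
print(re.findall(r'h.*?\.', key))        # lazy:   ['hit.']
print(re.findall(r'h.*\.', key))         # greedy: ['hit.edu.']
'''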
# 7 Match every saas and sas
'''
key = 'saas and sas and saaas'
print(re.findall('sa{1,2}s', key))   # ['saas', 'sas']
'''
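# 8 Match the lines that start with i
'''
key = """fall in love with you
i love you very much
i love she
i love her
"""
# re.M makes ^ match at the start of every line
print(re.findall('^i.*', key, re.M))   # ['i love you very much', 'i love she', 'i love her']
'''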
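# 9 Match all the lines at once
'''
key = """
<div>Scary when you think about it
Your teammates are reading,
Your best friend is dieting,
Your enemies are sharpening their knives,
Old Wang next door is working out.
</div>
"""
# re.S lets . match newlines, so .* spans the whole multi-line string
# (findall also reports an empty match at the very end)
print(re.findall('.*', key, re.S))
'''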

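Worked examples

Using re.findall

1. re.findall can match across multiple lines, and the flags argument changes the result.

re.findall(pattern, string, flags)
    - re.M : multi-line mode; ^ and $ match at the start and end of every line
    - re.S : single-line (dot-all) mode; . also matches newlines, so a match that spans lines contains \n
    - re.I : ignore case

re.sub(pattern, replacement, string) is a separate function: it replaces every match of pattern in string with replacement.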
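The effect of each flag, and of re.sub, can be seen side by side in a sketch like this one. The sample text is invented for illustration; flags are combined with the | operator.

```python
import re

# Hypothetical three-line sample.
text = "Apple\napple pie\nBanana"

# Without flags, ^ matches only at the very start, and case must agree.
default = re.findall(r"^apple", text)

# re.M: ^ matches at the start of every line.
multi = re.findall(r"^apple", text, re.M)

# re.M | re.I: every line start, ignoring case.
multi_nocase = re.findall(r"^apple", text, re.M | re.I)

# Without re.S, . cannot cross the newline; with re.S it can.
dot_default = re.findall(r"Apple.*pie", text)
dot_all = re.findall(r"Apple.*pie", text, re.S)

# re.sub replaces every match of the pattern.
replaced = re.sub(r"\d+", "N", "room 12, floor 3")

print(default)       # []
print(multi)         # ['apple']
print(multi_nocase)  # ['Apple', 'apple']
print(dot_default)   # []
print(dot_all)       # ['Apple\napple pie']
print(replaced)      # room N, floor N
```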
