Essential Knowledge for Web Crawlers: Regular Expressions

This article is a repost. 2018-12-04 16:22. Tag: web crawlers

In terms of libraries, I consider urllib, requests, re, BeautifulSoup, and concurrent.futures essential for web crawling. This post summarizes how to use regular expressions with the re module.

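1. Regular expression concepts

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic over strings.

Many programming languages support string operations with regular expressions; they are not unique to Python. Python's re module provides this support.

Regular expressions are a "deep" topic; what follows only summarizes the points I have found most important in everyday use: common matching patterns, wildcard matching, greedy matching, group matching (exp), and the functions of the re library.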

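2. Common matching patterns in Python regular expressions

\w      Matches a letter, digit, or underscore
\W      Matches any character that is not a letter, digit, or underscore
\s      Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S      Matches any non-whitespace character
\d      Matches any digit
\D      Matches any non-digit
\A      Matches at the start of the string
\Z      Matches at the end of the string (in Python's re, only at the very end; Perl-style engines also match before a trailing newline)
\z      Matches at the very end of the string (Perl-style engines; older versions of Python's re do not accept it, use \Z there)
\G      Matches at the position where the previous match ended (Perl-style engines; not supported by Python's re)
\n      Matches a newline
\t      Matches a tab
^       Matches at the start of the string
$       Matches at the end of the string, or just before a trailing newline
.       Matches any character except a newline; with the re.DOTALL flag it also matches newlines
[...]   A set of characters: [amk] matches a, m, or k
[^...]  Any character not in the set: [^abc] matches any character other than a, b, or c
*       Matches 0 or more repetitions of the preceding expression (greedy)
+       Matches 1 or more repetitions of the preceding expression (greedy)
?       Matches 0 or 1 repetition of the preceding expression; placed after another quantifier (*?, +?, {n,m}?) it makes that quantifier non-greedy
{n}     Matches exactly n repetitions of the preceding expression
{n,m}   Matches n to m repetitions of the preceding expression (greedy)
a|b     Matches a or b
()      Matches the expression inside the parentheses and captures it as a group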
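To make the table more concrete, here is a minimal sketch that exercises a few of the patterns with re.findall and re.search. The sample strings are made up purely for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "user_01 paid 250 yuan on 2018-12-04"

digits = re.findall(r"\d+", text)      # runs of one or more digits
words = re.findall(r"\w+", text)       # runs of letters, digits, underscores
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()  # {n}: exactly n repetitions
vowels = re.findall(r"[aeiou]", "regex demo")         # character set
non_abc = re.findall(r"[^abc]", "aabbccd")            # negated character set

print(digits)   # ['01', '250', '2018', '12', '04']
print(words)    # ['user_01', 'paid', '250', 'yuan', 'on', '2018', '12', '04']
print(date)     # 2018-12-04
print(vowels)   # ['e', 'e', 'e', 'o']
print(non_abc)  # ['d']
```

Writing patterns as raw strings (r"...") keeps backslashes such as \d from being interpreted by Python's string literal rules before the regex engine sees them.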

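3. Using the re library

(1) The match function

Function prototype: def match(pattern, string, flags=0)

Tries to match a pattern starting at the beginning of the string. If the pattern does not match at the start, match returns None.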

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)
print(result.group())  # the matched text
print(result.span())   # (start, end) indices of the match

Output: the match object, then the matched text (here the entire string), then the span (0, 41).
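Because match only looks at the start of the string, it helps to see the failing case next to the successful one. The following is a small sketch reusing the same content string; the second pattern is an assumption chosen to show a miss.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"

# match() succeeds only if the pattern matches at position 0.
at_start = re.match(r"hello", content)
not_at_start = re.match(r"123", content)  # "123" occurs later, not at the start

print(at_start.span())  # (0, 5)
print(not_at_start)     # None

# Check for None before calling group(), otherwise an AttributeError is raised.
if not_at_start is None:
    print("no match at the start of the string")
```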

(2) Wildcard matching

The regular expression above is overly complicated; it can be simplified as follows.

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

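The output is the same, and the pattern is much shorter: it starts with hello, matches any characters zero or more times in the middle, and ends with Demo.

(3) Group matching

To extract a specific target from the string, wrap that part of the pattern in parentheses () to form a group.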

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(\d+).*Demo$', content)
print(result.group())   # the whole match
print(result.group(1))  # the text captured by the first group

Output: the entire string, followed by 123.
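A pattern can contain several groups. As a minimal sketch on the same content string (the three-group pattern is my own illustration, not from the original), groups() returns all captures at once:

```python
import re

content = "hello 123 4567 World_This is a regex Demo"

# Three capture groups: two numbers and the word that follows them.
result = re.match(r"^hello\s(\d+)\s(\d+)\s(\w+)", content)

print(result.groups())  # ('123', '4567', 'World_This')
print(result.group(2))  # 4567
print(result.group(0))  # hello 123 4567 World_This

# Tuple unpacking gives each capture a readable name.
first, second, word = result.groups()
```

Groups are numbered from 1 in the order their opening parentheses appear; group(0) is the whole match.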

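(4) Named group matching

In Python, (?P<name>exp) matches exp and captures the text into a group called name. The spellings (?<name>exp) and (?'name'exp) belong to other regex flavors such as .NET and are not accepted by Python's re.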

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(?P<num>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())  # maps each group name to its captured text

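Output: the entire string, then 123, then {'num': '123'}.

With a named group, the captured text can be retrieved by the key 'num', for example with result.group('num') or from the dictionary returned by groupdict().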
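Access by name can be sketched as follows. The second group name, code, is hypothetical and only added to show that several named groups can coexist.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"

# "num" comes from the example above; "code" is a made-up second name.
result = re.match(r"^hello\s(?P<num>\d+)\s(?P<code>\d+)", content)

print(result.group("num"))   # 123
print(result.group("code"))  # 4567
print(result.group(1))       # 123, named groups keep their positional number
print(result.groupdict())    # {'num': '123', 'code': '4567'}
```

Names tend to make extraction code easier to maintain than positional indices, since adding a group earlier in the pattern does not shift them.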

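(5) Greedy matching

Greedy matching means a quantifier keeps consuming characters for as long as it can, giving them back only when the rest of the pattern would otherwise fail.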

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*(?P<name>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

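Output: the entire string, then 7, then {'name': '7'}.

The captured group is 7. The leading .* is greedy and consumes as many characters as it can, leaving \d+ only the single digit it needs in order to match. This is greedy matching.

For non-greedy matching, append a question mark (?) to the quantifier.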

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*?(?P<name>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

Now the group captures 123, because .*? stops as soon as \d+ can match.
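The difference matters most when extracting text between repeated delimiters, which is common in crawling. Here is a minimal sketch on a made-up HTML fragment:

```python
import re

# Hypothetical HTML fragment with two tags on one line.
html = "<b>first</b> and <b>second</b>"

# Greedy: .* runs to the LAST </b>, swallowing everything in between.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the FIRST </b> after each <b>.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['first</b> and <b>second']
print(lazy)    # ['first', 'second']
```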

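(6) Passing matching flags to the functions

The third parameter of def match(pattern, string, flags=0), flags, sets the matching mode.

re.I: makes matching case-insensitive

re.L: makes matching locale-dependent (in Python 3, usable only with bytes patterns)

re.S: makes . match every character, including newlines

re.M: multi-line matching; ^ and $ match at the start and end of each line

re.U: interprets characters using the Unicode character set; affects \w, \W, \b, \B (this is already the default for str patterns in Python 3)

re.X: verbose mode; whitespace in the pattern is ignored and # comments are allowed, so the regular expression can be laid out more readably

The examples below use re.I and re.S: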

import re

content = "heLLo 123 4567 World_This is a regex Demo"
result = re.match(r'hello', content, re.I)
print(result.group())

Output: heLLo

Without re.S:

import re

content = '''heLLo 123 4567 World_This is 
a regex Demo'''
result = re.match(r'.*', content)
print(result.group())

Output: heLLo 123 4567 World_This is (only the first line, because . stops at the newline)

Now with re.S:

import re

content = '''heLLo 123 4567 World_This is 
a regex Demo'''
result = re.match(r'.*', content, re.S)
print(result.group())

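With re.S, .* also crosses the newline, so both lines are printed. Most functions in the re library accept this flags parameter.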
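The other flags work the same way. As a rough sketch, the following shows re.M, a combination of flags joined with |, and re.X. The sample text and the group names year, month, and day are assumptions made for illustration.

```python
import re

text = "id: 1\nid: 22\nid: 333"  # hypothetical multi-line input

# Without re.M, ^ and $ anchor only to the whole string, so nothing matches.
single = re.findall(r"^id: (\d+)$", text)

# With re.M, ^ and $ anchor to every line.
multi = re.findall(r"^id: (\d+)$", text, re.M)

# Flags are combined with the | operator.
combined = re.findall(r"^ID: (\d+)$", text, re.M | re.I)

# re.X lets the pattern be split across lines and commented.
date_re = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
""", re.X)
parts = date_re.search("posted 2018-12-04 16:22").groupdict()

print(single)    # []
print(multi)     # ['1', '22', '333']
print(combined)  # ['1', '22', '333']
print(parts)     # {'year': '2018', 'month': '12', 'day': '04'}
```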

(7) The search function

Function prototype: def search(pattern, string, flags=0)

Scans the whole string and returns the first successful match.

import re

content = '''hahhaha hello 123 4567 world'''
result = re.search(r'hello.*world', content)
print(result.group())

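Output: hello 123 4567 world. If search is replaced with match, re.match returns None because the string does not start with hello, so the call to result.group() raises an AttributeError.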
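The contrast between the two functions, and one way to guard against the None case, can be sketched like this. The helper matched_text is a hypothetical name introduced only for this example.

```python
import re

content = "hahhaha hello 123 4567 world"

found = re.search(r"hello.*world", content)  # scans the whole string
missed = re.match(r"hello.*world", content)  # only tries position 0

def matched_text(m):
    """Return the matched text, or None when there was no match."""
    return m.group() if m else None

print(matched_text(found))   # hello 123 4567 world
print(matched_text(missed))  # None
print(found.start())         # 8, the index where the match begins
```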

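(8) The findall function

Function prototype: def findall(pattern, string, flags=0)

Searches the string and returns all matching substrings as a list. If the pattern contains a group, the list holds the text captured by that group.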

import re

content= '''
    <url>
        <loc>http://example.webscraping.com/places/default/view/Afghanistan-1</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/Aland-Islands-2</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/Albania-3</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/Algeria-4</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/American-Samoa-5</loc>
    </url>'''
# The group needs ASCII parentheses; .*? keeps each capture inside one <loc> tag.
urls = re.findall(r'<loc>(.*?)</loc>', content)
for url in urls:
    print(url)

Output: the five URLs inside the <loc> tags, one per line.
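When a pattern has more than one group, findall returns a list of tuples, and re.finditer yields match objects instead of strings. Here is a minimal sketch on a shortened copy of the sitemap; the two-group pattern splitting the URL into a name and an id is my own illustration.

```python
import re

# Shortened copy of the sitemap above, two entries only.
content = """
<loc>http://example.webscraping.com/places/default/view/Afghanistan-1</loc>
<loc>http://example.webscraping.com/places/default/view/Aland-Islands-2</loc>
"""

# Two groups: the place name and the trailing numeric id.
pairs = re.findall(r"<loc>.*/view/(.+?)-(\d+)</loc>", content)
print(pairs)  # [('Afghanistan', '1'), ('Aland-Islands', '2')]

# finditer gives full match objects, so positions are available too.
ids = [int(m.group(2))
       for m in re.finditer(r"<loc>.*/view/(.+?)-(\d+)</loc>", content)]
print(ids)    # [1, 2]
```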

(9) The sub function

Function prototype: def sub(pattern, repl, string, count=0, flags=0)

Replaces every matching substring in the string and returns the resulting string.

import re

content = '''hahhaha hello 123 4567 world'''
# Avoid naming the variable str, which would shadow the built-in type.
result = re.sub(r'hello.*world', 'zhangsan', content)
print(result)

Output: hahhaha zhangsan
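The remaining parameters of sub, and the closely related subn, can be sketched on the same string. The replacement text "#" is an arbitrary choice for illustration.

```python
import re

content = "hahhaha hello 123 4567 world"

# Replace every run of digits.
masked = re.sub(r"\d+", "#", content)

# count limits how many replacements are made.
first_only = re.sub(r"\d+", "#", content, count=1)

# \1 and \2 in the replacement refer back to captured groups.
swapped = re.sub(r"(\d+) (\d+)", r"\2 \1", content)

# subn returns the new string together with the number of replacements.
new_text, n = re.subn(r"\d+", "#", content)

print(masked)       # hahhaha hello # # world
print(first_only)   # hahhaha hello # 4567 world
print(swapped)      # hahhaha hello 4567 123 world
print(new_text, n)  # hahhaha hello # # world 2
```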

(10) compile

Function prototype: def compile(pattern, flags=0)

Compiles a regular expression into a pattern object so that the same expression can be reused conveniently.

import re

content = '''hahhaha hello 123 4567 world'''
pattern = r'hello.*'
regex = re.compile(pattern)
# The compiled object exposes the same operations as methods.
result = regex.sub('zhangsan', content)
print(result)  # hahhaha zhangsan
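Reuse is where compile pays off, for instance when the same pattern is applied to many lines of crawled text. Here is a minimal sketch with made-up input lines:

```python
import re

# Compile once, then reuse the pattern object for every line.
number_re = re.compile(r"\d+")

# Hypothetical input lines.
lines = ["hello 123 4567 world", "no digits here", "room 42"]

results = [number_re.findall(line) for line in lines]
print(results)  # [['123', '4567'], [], ['42']]

# The compiled object offers search, match, findall, sub and so on as methods.
print(number_re.search(lines[0]).group())  # 123
print(number_re.sub("#", lines[2]))        # room #
```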