Using Regular Expressions in Python

Reposted article. 2019-08-21 17:09, Python

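Regular expressions are typically used to search for and replace text that matches a pattern (rule). Python handles regular expressions through the re module.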

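I. Regular expressions
1. Wildcards
The period . matches any single character except a newline, and it matches exactly one character.
For example, the regular expression '.ython' matches the whole string 'python', but not 'cpython' or 'ython'.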
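The wildcard rule can be checked with a minimal sketch like the following. It assumes re.fullmatch, which requires the entire string to match; the variable names are illustrative and not part of the original article.

```python
import re

# re.fullmatch succeeds only when the entire string matches the pattern.
candidates = ['python', 'cpython', 'ython']
whole_matches = {s: bool(re.fullmatch('.ython', s)) for s in candidates}
print(whole_matches)

# By default the period does not match a newline; re.DOTALL lifts that limit.
newline_default = re.fullmatch('.ython', '\nython')
newline_dotall = re.fullmatch('.ython', '\nython', re.DOTALL)
print(newline_default is None, newline_dotall is not None)
```

Note that re.search would still find 'python' inside 'cpython', because it looks for a matching substring rather than a whole-string match.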
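2. Escaping special characters
To match a special character literally, escape it with a backslash. In an ordinary string literal the backslash itself must be doubled; alternatively, prefix the string with r (a raw string) and write a single backslash.
For example, the pattern 'python\\.org', or equivalently r'python\.org', matches the string 'python.org'.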
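As a small illustration of why the escape matters, the sketch below compares the escaped and unescaped patterns. The test string 'pythonzorg' is a made-up example, and re.escape is shown only as a convenience for building the escaped form.

```python
import re

escaped_plain = 'python\\.org'   # ordinary string: the backslash is doubled
escaped_raw = r'python\.org'     # raw string: a single backslash
same_pattern = escaped_plain == escaped_raw
print(same_pattern)

# Unescaped, the period is a wildcard, so 'pythonzorg' matches as well.
unescaped_hit = bool(re.fullmatch('python.org', 'pythonzorg'))
escaped_hit = bool(re.fullmatch(escaped_raw, 'pythonzorg'))
print(unescaped_hit, escaped_hit)

# re.escape produces the escaped form of an arbitrary literal string.
auto_escaped = re.escape('python.org')
print(auto_escaped)
```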
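3. Character sets
A character set is a group of characters enclosed in square brackets, and it matches exactly one character; for example, '[pj]ython' matches both 'python' and 'jython'.
Ranges are also allowed: '[a-zA-Z0-9]' matches any uppercase letter, lowercase letter, or digit.
To negate a character set, put a ^ at its start: '[^ab]' matches any character other than a and b.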
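The three forms of character set can be tried out as follows. This is a sketch with invented sample strings; only re.fullmatch and re.findall are used.

```python
import re

# A set matches exactly one character drawn from the listed ones.
names = ['python', 'jython', 'cython']
pj_matches = [s for s in names if re.fullmatch('[pj]ython', s)]
print(pj_matches)

# Ranges: letters and digits match, punctuation does not.
alnum_chars = re.findall('[a-zA-Z0-9]', 'a-B_9!')
print(alnum_chars)

# Negated set: every character except a and b (the space counts too).
not_ab_chars = re.findall('[^ab]', 'abcab d')
print(not_ab_chars)
```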
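4. Alternatives and subpatterns
The pipe character | matches either of two alternatives; for example, 'python|perl' matches 'python' and 'perl'.
To apply | to only part of a pattern, enclose that part (a subpattern) in parentheses, as in 'p(ython|erl)'.
A single character can also be regarded as a subpattern.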
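The effect of the parentheses is easiest to see when the alternatives are embedded in a longer pattern. The sentences in this sketch are invented for illustration.

```python
import re

# Without parentheses, | splits the whole pattern:
# the alternatives are 'I like python' and 'perl'.
loose = 'I like python|perl'
# With parentheses, only the language name is an alternative.
grouped = 'I like (python|perl)'

sentence = 'I like perl'
loose_hit = bool(re.fullmatch(loose, sentence))
grouped_hit = bool(re.fullmatch(grouped, sentence))
print(loose_hit, grouped_hit)

# 'p(ython|erl)' accepts exactly the two words from the text.
words = ['python', 'perl', 'pperl']
accepted = [w for w in words if re.fullmatch('p(ython|erl)', w)]
print(accepted)
```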
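5. Beginning and end of a string
A caret ^ anchors the pattern at the beginning of the string, and a dollar sign $ anchors it at the end.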
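A short sketch of the two anchors, using re.search so that the anchors alone decide where a match may occur. The sample string is borrowed from the examples later in the article.

```python
import re

text = 'www.python.org'

starts_with_www = bool(re.search('^www', text))        # 'www' is at the start
starts_with_python = bool(re.search('^python', text))  # 'python' is not
ends_with_org = bool(re.search('org$', text))          # 'org' is at the end
fully_anchored = bool(re.search(r'^www\..+\.org$', text))

print(starts_with_www, starts_with_python, ends_with_org, fully_anchored)
```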
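6. Optional and repeated patterns
Appending one of the following operators to a subpattern makes it optional or repeated.
(pattern)? : pattern is optional (occurs 0 or 1 times)
(pattern)* : pattern is repeated 0 or more times
(pattern)+ : pattern is repeated 1 or more times
(pattern){m,n} : pattern is repeated m to n times
Repetition operators are greedy by default: they match as much text as possible.
For example, r'\*(.+)\*' applied to the string '*This* is *it*!' matches '*This* is *it*'.
Adding a question mark ? after a repetition operator makes it non-greedy, so that it matches as little text as possible.
For example, r'\*(.+?)\*' applied to the same string matches '*This*' and '*it*' separately.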
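The greedy and non-greedy behaviour described above can be reproduced with re.findall, which returns the text captured by the group. The optional-prefix and digit-count patterns in this sketch are additional made-up examples for ? and {m,n}.

```python
import re

text = '*This* is *it*!'

# Greedy: .+ runs to the last asterisk, so there is a single long match.
greedy_groups = re.findall(r'\*(.+)\*', text)
# Non-greedy: .+? stops at the nearest asterisk, giving two short matches.
lazy_groups = re.findall(r'\*(.+?)\*', text)
print(greedy_groups, lazy_groups)

# ? makes the 'www.' prefix optional.
optional_hits = [bool(re.fullmatch(r'(www\.)?python\.org', s))
                 for s in ['python.org', 'www.python.org', 'wwwpython.org']]
print(optional_hits)

# {2,4} accepts between two and four digits.
counted_hits = [bool(re.fullmatch(r'\d{2,4}', s)) for s in ['1', '123', '12345']]
print(counted_hits)
```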

II. The re module contains the functions for working with regular expressions.

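1. search(pattern, string[, flags])

(1) Searches the given string for the first substring that matches the regular expression. If one is found, it returns a MatchObject (an instance of re.Match in recent Python 3 releases), which is truthy; otherwise it returns None, which is falsy.
pattern is the regular expression, string is the string to search, and flags is a set of flags that control things such as case sensitivity.
(2) The MatchObject
A MatchObject holds information about the substrings that matched the pattern; the parts of the match that correspond to subpatterns are called groups.
A group is a subpattern enclosed in parentheses. Groups are numbered by counting opening parentheses from the left, and group 0 refers to the entire pattern.
Some important MatchObject methods:
groups() returns a tuple of the strings matched by all groups, from group 1 up to the last group; group 0 is not included.
group([group1, ...]) returns the substring matched by the given subpattern (group); the group number defaults to 0.
start([group]) returns the start index of the substring matched by the given group.
end([group]) returns the end index of the substring matched by the given group (exclusive, as with slices).
span([group]) returns the start and end indices of the substring matched by the given group.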

import re

m = re.search(r'www\.(.*)\.(.{3})', 'WWW.python.org', re.I)  # re.I: ignore case
if m:
    print(m.groups())  # groups are counted from group 1

    print('Group 0:')
    print(m.group())
    print(m.group(0))

    print('Group 1:')
    print(m.group(1))
    print(m.start(1))
    print(m.end(1))
    print(m.span(1))

    print('Group 2:')
    print(m.group(2))
    print(m.start(2))
    print(m.end(2))
    print(m.span(2))

Output:

('python', 'org')
Group 0:
WWW.python.org
WWW.python.org
Group 1:
python
4
10
(4, 10)
Group 2:
org
11
14
(11, 14)

2. match(pattern, string[, flags])

match is similar to search, except that it looks for a match only at the beginning of the given string.

import re

m1 = re.search(r'python', 'www.python.org')
if m1:
    print('search matched')
else:
    print('search did not match')

m2 = re.match(r'python', 'www.python.org')
if m2:
    print('match matched')
else:
    print('match did not match')

Output:

search matched
match did not match

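3. compile(pattern[, flags])

When search, match, and similar functions are given a regular expression as a string, they convert it to a pattern object internally.
compile converts a regular expression string into a pattern object once, so no further conversion is needed when the pattern is reused.
Pattern objects have the same search/match methods, so
pat = re.compile(pattern[, flags])
pat.search(string) (pat is the pattern object created by compile)
is equivalent to re.search(pattern, string[, flags])

import re

m1 = re.search(r'python', 'www.python.org')
if m1:
    print('search matched')
else:
    print('search did not match')

pat = re.compile(r'python')
m2 = pat.search('www.python.org')
if m2:  # check the result of the compiled pattern, not m1
    print('compile search matched')
else:
    print('compile search did not match')

Output:

search matched
compile search matched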

4. split(pattern, string[, maxsplit=0])

Splits the string wherever the pattern matches and returns a list.

import re

res = re.split('[, ]', 'ab,cd 123')  # split on spaces and commas
print(res)

Output:

['ab', 'cd', '123']
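Two details of split are worth a quick check: adjacent separators produce empty strings unless the pattern allows runs, and maxsplit limits the number of splits. The input string in this sketch is an invented variation with a comma followed by a space.

```python
import re

text = 'ab, cd 123'

# Each separator character splits on its own, leaving an empty string
# between the comma and the space.
single_sep = re.split('[, ]', text)
# Adding + treats a run of separators as one.
sep_runs = re.split('[, ]+', text)
# maxsplit=1 stops after the first split; the rest stays in one piece.
limited = re.split('[, ]+', text, maxsplit=1)

print(single_sep, sep_runs, limited)
```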

5. findall(pattern, string)

Returns a list containing every substring of the string that matches the pattern.

import re

result = re.findall(r'\d+', 'ab,cd 123 456')  # find the numbers
print(result)

Output:

['123', '456']
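When the pattern contains groups, findall returns the group contents instead of the whole match, and with several groups each item is a tuple. The attribute-like sample string in this sketch is made up for illustration.

```python
import re

text = 'width=48 height=48 id=926487'

# No groups: the whole matches are returned.
numbers = re.findall(r'\d+', text)
# One group: only the captured part of each match is returned.
keys = re.findall(r'(\w+)=\d+', text)
# Several groups: each match becomes a tuple of the captured parts.
pairs = re.findall(r'(\w+)=(\d+)', text)

print(numbers, keys, pairs)
```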

6. sub(pattern, repl, string[, count=0])

Replaces every substring of the string that matches pattern with repl.

import re

result = re.sub(r'\D', '', 'abc123def')  # remove every non-digit character
print(result)

Output:

123
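Beyond plain replacement, sub accepts a count, a replacement string with backreferences such as \1, or a function that computes the replacement. The sketch below goes slightly past what the text above covers and is meant only as an illustration of those standard re features.

```python
import re

text = 'abc123def'

# Default: every non-digit character is removed.
digits_only = re.sub(r'\D', '', text)
# count=1 replaces only the first match.
first_only = re.sub(r'\D', '', text, count=1)
# \1 in the replacement refers to the text captured by group 1.
wrapped = re.sub(r'(\d+)', r'[\1]', text)
# A function receives each match object and returns the replacement string.
doubled = re.sub(r'\d', lambda m: str(int(m.group()) * 2), text)

print(digits_only, first_only, wrapped, doubled)
```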

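III. Example: scraping information from the cnblogs home page

Goal: extract the title, article URL, author, and publication date of every article on the home page.

Looking at the HTML source, the markup for each article looks like this: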

<div class="post_item_body">
    <h3><a class="titlelnk" href="https://www.cnblogs.com/mukekeheart/p/11395063.html" target="_blank">iOS學習——iOS 宏(define)與常量(const)的正確使用</a></h3>                   
    <p class="post_item_summary">
<a href="https://www.cnblogs.com/mukekeheart/" target="_blank"><img width="48" height="48" class="pfs" src="https://pic.cnblogs.com/face/926487/20180313105754.png" alt=""/></a>    概述 在iOS開發中，經常用到宏定義，或用const修飾一些數據類型，經常有開發者不知怎么正確使用，導致項目中亂用宏與const修飾。你能區分下面的嗎？知道什么時候用嗎？ 當我們想全局共用一些數據時，可以用宏、變量、常量 宏、變量、常量之間的區別 宏：只是在預處理器里進行文本替換，沒有類型，不做任何 ...
    </p>              
    <div class="post_item_foot">                    
    <a href="https://www.cnblogs.com/mukekeheart/" class="lightblue">mukekeheart</a> 
    發布於 2019-08-22 16:23 
    <span class="article_comment"><a href="https://www.cnblogs.com/mukekeheart/p/11395063.html#commentform" title="0001-01-01 08:05" class="gray">
        評論(0)</a></span><span class="article_view"><a href="https://www.cnblogs.com/mukekeheart/p/11395063.html" class="gray">閱讀(19)</a></span></div>
</div>

After several rounds of testing and adjusting the regular expression, the final code is as follows:

# -*- coding:utf-8 -*-

from urllib.request import urlopen
import re

# re.DOTALL makes the period in the pattern match every character, including newlines.
# The four groups capture the article URL, title, author, and publication time.
p = re.compile(r'<a class="titlelnk" href="(.*?)".*?>(.*?)</a>.*?<div class="post_item_foot">.*?<a href=".+?" class="lightblue">(.*?)</a>.*?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*?<span.*?', re.DOTALL)
text = urlopen('https://www.cnblogs.com').read().decode()
print(p.findall(text))

The output is a list of (URL, title, author, publication time) tuples, one per article.
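Because the live page may have changed since the article was written, the pattern can also be checked offline. The sketch below applies the same regular expression to an abridged copy of the HTML excerpt shown earlier; the title and summary text are placeholders, and the variable names are illustrative.

```python
import re

# Same pattern as in the scraping script, written as a raw string.
p = re.compile(
    r'<a class="titlelnk" href="(.*?)".*?>(.*?)</a>.*?'
    r'<div class="post_item_foot">.*?'
    r'<a href=".+?" class="lightblue">(.*?)</a>.*?'
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}).*?<span.*?',
    re.DOTALL)

# Abridged sample markup; the title and summary are placeholders.
sample_html = '''
<div class="post_item_body">
    <h3><a class="titlelnk" href="https://www.cnblogs.com/mukekeheart/p/11395063.html" target="_blank">Sample title</a></h3>
    <p class="post_item_summary">Sample summary ...</p>
    <div class="post_item_foot">
    <a href="https://www.cnblogs.com/mukekeheart/" class="lightblue">mukekeheart</a>
    發布於 2019-08-22 16:23
    <span class="article_comment">comments(0)</span></div>
</div>
'''

articles = p.findall(sample_html)
for url, title, author, posted in articles:
    print(url, title, author, posted)
```

Regular expressions like this one depend on the exact markup, so any change to the page's class names or element order will make the pattern stop matching.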
