Python: the re module (regular expressions)

This article is a repost. 2018-12-04 17:25. Tags: web crawling / python / re / data extraction / data analysis / regular expressions / data processing / data cleaning

Using the re module:

1. Use the compile() function to compile a pattern object, for example: pattern = re.compile(r'\d+')

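2. Use the attributes and methods of the pattern object to match and search text and obtain a result; for match and search the result is a Match object (findall and split return lists instead)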

match method: matching starts at the beginning of the string (or at pos), finds at most one match, and returns None on failure ----------> match(string[, pos[, endpos]])

m = pattern.match('one12twothree34four', 3, 10)  # start at index 3, i.e. at the character '1'; returns a Match object, or None if nothing matches
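The call above can be tried as a small script. This is only a sketch that reuses the `\d+` pattern from step 1; the variable names `text` and `no_match` are chosen here for illustration.

```python
import re

pattern = re.compile(r'\d+')
text = 'one12twothree34four'

# Without pos, matching is attempted at index 0 ('o'), so \d+ fails.
no_match = pattern.match(text)
print(no_match)  # None

# With pos=3 the attempt starts at '1'; endpos=10 caps the scanned region.
m = pattern.match(text, 3, 10)
if m is not None:  # always guard against None before using the result
    print(m.group())  # 12
    print(m.span())   # (3, 5)
```

Checking for None before calling group() avoids an AttributeError when nothing matches.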

# -*- coding: utf-8 -*-

import re

pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I makes matching case-insensitive
m = pattern.match("hello world wide web python")

print(m)  # <re.Match object; span=(0, 11), match='hello world'> (Python 3.6 and earlier show _sre.SRE_Match)
print(m.group(), type(m.group()))  # hello world <class 'str'>
print(m.group(1)) # hello
print(m.group(2)) # world
print(m.span(), type(m.span()))  # (0, 11) <class 'tuple'>
print(m.groups(), type(m.groups()))  # ('hello', 'world') <class 'tuple'>

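search method: scans the string for the first position where the pattern matches, finds at most one match, and returns None on failure ----------> search(string[, pos[, endpos]]); it is used the same way as match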
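The article gives no example for search, so here is a minimal sketch contrasting it with match on the same string. The variable names are illustrative only.

```python
import re

pattern = re.compile(r'\d+')
text = 'one12twothree34four'

m_match = pattern.match(text)    # only tries index 0, where 'o' is not a digit
m_search = pattern.search(text)  # scans forward to the first run of digits

print(m_match)           # None
print(m_search.group())  # 12
print(m_search.span())   # (3, 5)

# pos works as it does for match: start scanning at index 5 instead of 0.
m_later = pattern.search(text, 5)
print(m_later.group())   # 34
print(m_later.span())    # (13, 15)
```
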
findall method: finds all non-overlapping matches and returns them as a list; returns an empty list if nothing matches ----------> findall(string[, pos[, endpos]])

# -*- coding: utf-8 -*-

import re

# Compile the regular expression into a pattern object
pattern = re.compile(r'\d+')  # find runs of digits
rel1 = pattern.findall('hello 123 world 456 ')
print(rel1)   # ['123', '456']

rel2 = pattern.findall('one12two23s34f45f56s78e89t10', 10, 20)  # restrict matching to indices 10 (inclusive) through 20 (exclusive)
print(rel2)  # ['34', '45', '56']

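# The re module provides a compile() function that takes a matching rule
# and returns a pattern instance, which we then use to match strings
pattern2 = re.compile(r'\d+\.\d*')
# pattern2.findall() finds every substring that matches the rule
result = pattern2.findall("123.141593, 'bigcat', 232312, 3.15")
# findall returns all matching substrings to result as a list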
print(result)  # ['123.141593', '3.15']

finditer method: finds all matches and returns an iterator that yields Match objects ----------> finditer(string[, pos[, endpos]])

# -*- coding: utf-8 -*-

import re

'''finditer is similar to findall'''

pattern = re.compile(r'\d+')
resl = pattern.finditer('hello-123-world-456-python-789')

print(resl)  # <callable_iterator object at 0x0000022A886FD470>
print(type(resl))  # <class 'callable_iterator'>    # an iterator object
for m in resl:  # m is a Match object; see the match example above for what it offers
    print(m.group())  # prints 123, 456 and 789 in turn
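Because finditer yields Match objects rather than plain strings, it can report where each match was found, which findall cannot. A short sketch on the same input, with `hits` as an illustrative name:

```python
import re

pattern = re.compile(r'\d+')
text = 'hello-123-world-456-python-789'

# Collect each matched substring together with its (start, end) indices.
hits = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(hits)  # [('123', (6, 9)), ('456', (16, 19)), ('789', (27, 30))]
```
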

split method: splits a string wherever the pattern matches and returns a list ----------> split(string[, maxsplit])

# -*- coding: utf-8 -*-

import re

'''split cuts the string wherever the pattern matches and returns the pieces as a list'''
p = re.compile(r'[\s\,;\t\n]+')
print(p.split('  a  ,    bwf  ;; c '))   # ['', 'a', 'bwf', 'c', '']
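The signature lists maxsplit, which the example above does not use. The sketch below shows its effect, and also one way (an assumption about what the reader wants, not something the article prescribes) to avoid the empty strings at both ends by stripping the input first. The pattern is a simplified equivalent, since `\s` already covers tabs and newlines.

```python
import re

p = re.compile(r'[\s,;]+')
text = 'a, b;c  d'

all_parts = p.split(text)
first_only = p.split(text, maxsplit=1)  # stop after the first split
print(all_parts)   # ['a', 'b', 'c', 'd']
print(first_only)  # ['a', 'b;c  d']

# Leading or trailing separators produce empty strings; strip() removes them.
raw = '  a  ,    bwf  ;; c '
clean_parts = p.split(raw.strip())
print(clean_parts)  # ['a', 'bwf', 'c']
```
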

sub method: replaces every match with repl and returns the new string ----------> sub(repl, string[, count])

# -*- coding: utf-8 -*-

import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 1236 hello 456'
print(p.sub('hello world', s))  # hello world hello world
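A few variations on the example above may help: limiting the number of replacements with count, referring to captured groups in repl, and passing a function as repl. The helper name `shout` is made up for this sketch.

```python
import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 1236 hello 456'

# count=1 replaces only the first match.
once = p.sub('hello world', s, count=1)
print(once)  # hello world hello 456

# \1 and \2 in repl refer to the captured groups; here they are swapped.
swapped = p.sub(r'\2 \1', s)
print(swapped)  # 1236 hello 456 hello

# repl may also be a function that receives each Match object.
def shout(m):
    return m.group(1).upper() + ' ' + m.group(2)

loud = p.sub(shout, s)
print(loud)  # HELLO 1236 HELLO 456
```
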

3. Use the attributes and methods of the Match object to get information

match.group() # the whole match; equivalent to match.group(0)

match.groups() # a tuple of all captured subgroups

match.start() # start index of the match

match.end() # end index of the match (exclusive)

match.span() # the tuple (start, end)
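The methods listed above can be seen together in one short run. This is a sketch with a pattern and sample text chosen here for illustration; start() and end() also accept a group number.

```python
import re

text = 'hello world wide web'
m = re.search(r'(w\w+) (w\w+)', text)

print(m.group(0))  # world wide
print(m.groups())  # ('world', 'wide')
print(m.start(), m.end(), m.span())  # 6 16 (6, 16)

# Passing a group number gives the position of that group alone.
print(m.start(2), m.end(2))  # 12 16

# Slicing the input with start() and end() reproduces group().
print(text[m.start():m.end()] == m.group())  # True
```
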

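4. Matching Chinese

Chinese characters mostly fall in the Unicode range [\u4e00-\u9fa5]. This range does not include full-width Chinese punctuation, but it is sufficient in most cases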

# -*- coding: utf-8 -*-

import re

title = '你好，python ， 你好，世界 hello world'
pa = re.compile(r'[\u4e00-\u9fa5]+')
t = pa.findall(title)
print(t)   # ['你好', '你好', '世界']
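The same range can be used to test whether a whole string is Chinese, which also shows that full-width punctuation is not covered. This is a sketch; the helper name `is_all_chinese` is hypothetical, and fullmatch requires Python 3.4 or later.

```python
import re

# Range taken from the article; it excludes full-width punctuation such as '，'.
han = re.compile(r'[\u4e00-\u9fa5]+')

def is_all_chinese(text):
    """Return True if every character of text lies in the range above."""
    return han.fullmatch(text) is not None

print(is_all_chinese('你好世界'))    # True
print(is_all_chinese('你好，世界'))  # False, because of the full-width comma
print(is_all_chinese('hello'))       # False
```
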

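5. Greedy vs. non-greedy matching: Python's quantifiers are greedy by default

Greedy matching: provided the overall match still succeeds, match as much as possible (for example *)

Non-greedy matching: provided the overall match still succeeds, match as little as possible (append ? to the quantifier, for example *?)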

# -*- coding: utf-8 -*-

import re

s = 'abbbbbbdsddbbbb'

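res = re.findall('ab*', s)  # * matches the preceding character zero or more times
print(res)  # ['abbbbbb']  the match has already succeeded after 'ab', but because matching is greedy it keeps trying to match further

res2 = re.findall('ab*?', s)
print(res2)  # ['a']  once 'a' has matched, non-greedy matching stops right there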
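The difference tends to matter most when a pattern has a closing delimiter. A common illustration, sketched here with a made-up HTML fragment, is matching tags with `.*` versus `.*?`:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string, so there is one long match.
greedy = re.findall(r'<.*>', html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the nearest '>', so each tag is matched separately.
lazy = re.findall(r'<.*?>', html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```
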

Keep going, one step at a time; stick with it and keep cheering yourself on.
