Python Web Crawlers -- Ways to Parse Web Pages: Regular Expressions

Reposted article, originally published 2017-09-30 17:19. Python crawler study notes.

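1. Regular expressions

A regular expression is a special sequence of characters that lets you check whether a string matches a given pattern.

The re module gives Python full regular-expression support.

re.match function

re.match tries to match a pattern at the start of a string. If the pattern does not match at the start, match() returns None.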

import re
print(re.match('www', 'www.runoob.com').span())  # pattern matches at the start
print(re.match('com', 'www.runoob.com'))         # pattern occurs, but not at the start

Result:

(0, 3)
None
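
Because re.match returns None on failure, calling .span() directly on its result raises AttributeError whenever nothing matches. The following is a minimal sketch of guarding against that; the helper name match_span is hypothetical and chosen only for illustration, it is not part of the re module.

```python
import re


def match_span(pattern, text):
    """Return the (start, end) span of a match at the start of text, or None."""
    m = re.match(pattern, text)
    # re.match returns None when the pattern does not match at position 0,
    # so test the result before calling methods on it.
    return m.span() if m else None


print(match_span('www', 'www.runoob.com'))  # (0, 3)
print(match_span('com', 'www.runoob.com'))  # None
```

The same check appears in the longer examples in this article as `if matchObj:`.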

import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

Result:

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

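In r'(.*) are (.*?) .*', the r prefix marks a raw string, in which backslashes are kept literally instead of being treated as escape sequences. group(0), which is the same as group(), returns the text matched by the whole regular expression; group(1) returns the part matched by the first pair of parentheses, and group(2) the part matched by the second pair.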
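
To make the raw-string and grouping behaviour concrete, here is a small sketch. The greedy variant of the pattern, (.*) in place of (.*?), is added here for comparison and does not come from the original example.

```python
import re

# A raw string keeps the backslash: r'\n' is two characters,
# while '\n' is a single newline character.
print(len(r'\n'), len('\n'))  # 2 1

line = "Cats are smarter than dogs"

lazy = re.match(r'(.*) are (.*?) .*', line)   # (.*?) takes as little as possible
greedy = re.match(r'(.*) are (.*) .*', line)  # (.*) takes as much as possible

# groups() returns every captured group as a tuple.
print(lazy.groups())    # ('Cats', 'smarter')
print(greedy.groups())  # ('Cats', 'smarter than')
```

The only difference between the two patterns is the ? after the second .*, which makes that quantifier stop at the first following space rather than the last one.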

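re.search method

re.search scans the entire string and returns the first successful match, or None if there is none.

re.match only matches at the beginning of the string: if the beginning does not fit the regular expression, the match fails and the function returns None. re.search, by contrast, looks through the whole string until it finds a match.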

import re

line = "Cats are smarter than dogs"

# re.match fails because 'dogs' is not at the start of the string
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

# re.search succeeds because it scans the whole string
matchObj = re.search(r'dogs', line, re.M | re.I)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

Result:

No match!!
search --> matchObj.group() :  dogs
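
As a further illustration of the difference, the sketch below shows where re.search finds its match, how a ^ anchor restricts it to the start of this single-line string (no re.M flag is used here), and what the re.I flag changes. The uppercase pattern 'DOGS' is chosen only for the demonstration.

```python
import re

line = "Cats are smarter than dogs"

# re.search finds 'dogs' anywhere in the string; start()/end() give its position.
m = re.search(r'dogs', line)
print(m.start(), m.end())  # 22 26

# With a ^ anchor, re.search can only succeed at the start of the string here,
# so it fails just like re.match did.
print(re.search(r'^dogs', line))  # None

# re.I makes matching case-insensitive.
print(re.search(r'DOGS', line))                # None
print(re.search(r'DOGS', line, re.I).group())  # dogs
```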

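re.findall method

re.findall finds every non-overlapping match of the pattern and returns the results as a list. When the pattern contains a single capture group, the list holds the text of that group for each match.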

import requests
import re

link = "http://www.sohu.com/"
# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
html = r.text
# Capture the text inside <strong>...</strong> that follows an href attribute
title_list = re.findall(r'href=".*?".<strong>(.*?)</strong>', html)
print(title_list)

['新聞', '財經', '體育', '房產', '娛樂', '汽車', '時尚', '科技', '美食', '星座', '郵箱', '地圖', '千帆', '暢游']

The list above is the script's output: the main section titles scraped from the Sohu home page.
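
Since the live page can change at any time, the same pattern can be tried offline. The sketch below runs it against a small hand-written HTML snippet; the markup and the example.com URLs are assumptions made up for illustration and are not Sohu's actual HTML.

```python
import re

# Hypothetical markup, loosely modelled on a navigation bar (not Sohu's real HTML).
html = (
    '<a href="http://news.example.com/"><strong>News</strong></a>'
    '<a href="http://sports.example.com/"><strong>Sports</strong></a>'
)

# One capture group: findall returns a list of strings.
titles = re.findall(r'href=".*?".<strong>(.*?)</strong>', html)
print(titles)  # ['News', 'Sports']

# Two capture groups: findall returns a list of tuples, one per match.
pairs = re.findall(r'href="(.*?)".<strong>(.*?)</strong>', html)
print(pairs)
# [('http://news.example.com/', 'News'), ('http://sports.example.com/', 'Sports')]
```

Adding a second pair of parentheses around the href value is enough to collect each link together with its title.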
