Three Ways to Scrape Web Pages with Python

Reposted from the original article, 2022-02-19 07:36

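Abstract: This article covers three ways to scrape web page data with Python: regular expressions (the re module), the BeautifulSoup module, and the lxml module. All code in this article was run under Python 3.5.

The target is the headline alerts on the home page of the [National Meteorological Center (中央氣象台)](http://www.nmc.cn/).

In the page's HTML, each headline is an <a> element inside an li.waring list item under the #alarmtip element, as the selector copied in section 2 shows.

For each headline we extract the href attribute, the title attribute, and the text inside the tag.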

1. Regular expressions
Copying the outerHTML of one headline from the browser's developer tools gives:

<a target="_blank" href="/publish/country/warning/megatemperature.html" title="中央氣象台7月13日18時繼續發布高溫橙色預警">高溫預警</a>

Code:

# coding=utf-8
import re
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
html = html.decode('utf-8')  # urlopen() returns bytes in Python 3; decode to str
# A single pattern with three groups keeps each link, title, and tag text
# together, so the three fields cannot drift out of alignment
pattern = r'<a target="_blank" href="(.+?)" title="(.+?)">(.+?)</a>'
for link, title, tag in re.findall(pattern, html):
    print(tag, url + link, title)

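In these regular expressions, '.' matches any single character except \n; '+' matches one or more repetitions of the preceding expression; and '?' matches zero or one repetition of the preceding expression. Placed directly after '+', as in '.+?', the '?' instead makes the match non-greedy, so it stops at the shortest possible match. For more, see a tutorial on regular expressions in Python.
The output is: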

高溫預警 http://www.nmc.cn/publish/country/warning/megatemperature.html 中央氣象台7月13日18時繼續發布高溫橙色預警
山洪災害氣象預警 http://www.nmc.cn/publish/mountainflood.html 水利部和中國氣象局7月13日18時聯合發布山洪災害氣象預警
強對流天氣預警 http://www.nmc.cn/publish/country/warning/strong_convection.html 中央氣象台7月13日18時繼續發布強對流天氣藍色預警
地質災害氣象風險預警 http://www.nmc.cn/publish/geohazard.html 國土資源部與中國氣象局7月13日18時聯合發布地質災害氣象風險預警
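
To see why the patterns use '.+?' rather than '.+', the following minimal sketch compares the two on a short string. The string is an assumption: it joins the sample anchor shown above with a second anchor reconstructed from the second output line, so the script needs no network access.

```python
import re

# Two anchors on one line: the first is the outerHTML sample from this
# section, the second is reconstructed from the output (an assumption).
html = ('<a target="_blank" href="/publish/country/warning/megatemperature.html" '
        'title="中央氣象台7月13日18時繼續發布高溫橙色預警">高溫預警</a>'
        '<a target="_blank" href="/publish/mountainflood.html" '
        'title="水利部和中國氣象局7月13日18時聯合發布山洪災害氣象預警">山洪災害氣象預警</a>')

# Non-greedy: each match stops at the first '" title' that follows it
lazy = re.findall(r'href="(.+?)" title', html)

# Greedy: one match runs from the first href to the last '" title'
greedy = re.findall(r'href="(.+)" title', html)

print(lazy)         # two separate links
print(len(greedy))  # a single, over-long match
```

With the non-greedy form each link is captured separately; the greedy form swallows everything between the first href and the last title attribute.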

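2. The BeautifulSoup module
Beautiful Soup is a very popular Python module. It parses web pages and provides a convenient interface for locating content.
Copying the selector of one headline from the browser's developer tools gives: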

#alarmtip > ul > li.waring > a:nth-child(1)

Because we want every headline here, not just the first one, the selector needs to be changed to:

#alarmtip > ul > li.waring > a
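
The difference between the two selectors can be checked offline. The sketch below is only an illustration: the HTML fragment is hand-written to mimic the structure implied by the selector (the real page may differ), the title values are placeholders, and it uses Python's built-in 'html.parser' so that only bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the selector path; not the real page
html = '''
<div id="alarmtip"><ul><li class="waring">
<a href="/publish/country/warning/megatemperature.html" title="t1">高溫預警</a>
<a href="/publish/mountainflood.html" title="t2">山洪災害氣象預警</a>
</li></ul></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# :nth-child(1) keeps only an <a> that is the first child element of its parent
first_only = soup.select('#alarmtip > ul > li.waring > a:nth-child(1)')

# Without the pseudo-class, every <a> directly under li.waring is returned
all_links = soup.select('#alarmtip > ul > li.waring > a')

print(len(first_only), len(all_links))
```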

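Code:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires lxml to be installed
# 'waring' is the class name exactly as it appears in the copied selector
content = soup.select('#alarmtip > ul > li.waring > a')

for n in content:
    link = n.get('href')    # relative URL of the headline
    title = n.get('title')  # full headline text
    tag = n.text            # short label shown on the page
    print(tag, url + link, title)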

The output is the same as above.

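3. The lxml module
lxml is a Python binding for the libxml2 XML parsing library. Because the underlying parser is written in C, it parses faster than Beautiful Soup, although installation is more involved.
Code:

import urllib.request
import lxml.html

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
tree = lxml.html.fromstring(html)
# cssselect() relies on the separately installed cssselect package
content = tree.cssselect('li.waring > a')

for n in content:
    link = n.get('href')    # relative URL of the headline
    title = n.get('title')  # full headline text
    tag = n.text            # short label shown on the page
    print(tag, url + link, title)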

The output is the same as above.
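
All three scripts build the absolute address by concatenating url and link, which works here because every href starts with '/'. As a hedged alternative, the sketch below uses urljoin from the standard library, which also copes with hrefs that are already absolute or that lack the leading slash. The helper name absolute() and the third sample href are assumptions for illustration.

```python
from urllib.parse import urljoin

base = 'http://www.nmc.cn'


def absolute(link):
    """Resolve a possibly relative href against the site root (hypothetical helper)."""
    return urljoin(base, link)


# Root-relative href, as found on the page
a = absolute('/publish/mountainflood.html')
# Already absolute href: returned unchanged
b = absolute('http://www.nmc.cn/publish/geohazard.html')
# Href without a leading slash (hypothetical): plain concatenation would
# produce 'http://www.nmc.cnpublish/...'
c = absolute('publish/geohazard.html')

print(a)
print(b)
print(c)
```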

4. Storing the scraped data in a list or dictionary
Using the BeautifulSoup module as an example:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
content = soup.select('#alarmtip > ul > li.waring > a')

# Store the fields in three parallel lists
link = []
title = []
tag = []
for n in content:
    link.append(url + n.get('href'))
    title.append(n.get('title'))
    tag.append(n.text)

# Store each headline as a dictionary; collect the dictionaries in a list
# so that earlier headlines are not overwritten on each iteration
data = []
for n in content:
    data.append({
        'tag': n.text,
        'link': url + n.get('href'),
        'title': n.get('title'),
    })
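
Parallel lists and dictionaries hold the same information, so one can be derived from the other. The following minimal sketch, which runs offline on two rows copied from the output in section 1, converts three parallel lists into one dictionary per headline and then indexes the records by tag. The names records and by_tag are illustrative, not part of the original code.

```python
# Sample values copied from the output shown in section 1
tags = ['高溫預警', '山洪災害氣象預警']
links = ['http://www.nmc.cn/publish/country/warning/megatemperature.html',
         'http://www.nmc.cn/publish/mountainflood.html']
titles = ['中央氣象台7月13日18時繼續發布高溫橙色預警',
          '水利部和中國氣象局7月13日18時聯合發布山洪災害氣象預警']

# One dictionary per headline; zip() pairs up the i-th item of each list
records = [
    {'tag': t, 'link': l, 'title': ti}
    for t, l, ti in zip(tags, links, titles)
]

# Index the records by tag for direct lookup (assumes tags are unique)
by_tag = {r['tag']: r for r in records}

print(len(records))
print(by_tag['山洪災害氣象預警']['link'])
```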

5. Summary
Table 2.1 summarizes the advantages and disadvantages of each scraping method.
