Web Data Scraping with Python

This article is a repost. Originally published 2017-04-19 13:43.

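Requirement: collect the detailed information for nearly 100,000 data records from a website.

Analysis: The basic information for the records is spread over nearly 10,000 pages, with 10 records per page. To get the details of a particular record, you click its entry on the basic-information page, which jumps to a detail page. The address of the detail page can be read from the href attribute on the basic-information page.

Method: I started crawling with Beautiful Soup. Because it was slow, I switched to lxml, but the speed did not improve noticeably.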
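The page arithmetic and URL handling described above can be sketched as follows. This is a minimal sketch, not the author's code: the base address `http://example.com/list`, the `page` query parameter, and all helper names are hypothetical, because the post does not give the real addresses. It only shows that about 100,000 records at 10 per page means about 10,000 listing pages, and that an href taken from a listing page can be resolved into a full detail-page address.

```python
import math
from urllib.parse import urljoin

RECORDS = 100000       # approximate record count from the requirement
PER_PAGE = 10          # records shown on each listing page
BASE_LIST_URL = "http://example.com/list"  # hypothetical listing address


def page_count(records, per_page):
    """Number of listing pages needed to show all records."""
    return math.ceil(records / per_page)


def listing_url(page):
    """Build the URL of one listing page (the query parameter name is assumed)."""
    return "{}?page={}".format(BASE_LIST_URL, page)


def detail_url(listing, href):
    """Resolve an href found on a listing page into an absolute URL."""
    return urljoin(listing, href)


print(page_count(RECORDS, PER_PAGE))
print(detail_url(listing_url(3), "detail?id=42"))
```

Using `urljoin` instead of plain string concatenation handles both relative and absolute hrefs, which may or may not matter for the site in question.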

　　Beautiful Soup

import bs4
import requests

# 'webaddress' and 'webaddress1' are placeholders for the listing-page URL and
# the detail-page URL prefix; the real addresses are not given.
with open('testpython2.txt', 'w', encoding='utf-8') as f:
    for j in range(30, 41):  # listing pages 30 to 40
        webaddress = 'webaddress'  # placeholder; presumably built from j
        beautiful = requests.get(webaddress).content
        soup = bs4.BeautifulSoup(beautiful, "lxml")
        links = soup.find_all('a')  # collect the links once per page
        for m in range(5, 85, 8):  # 10 records per page, 8 links apart
            daf1 = links[m].get_text()
            if daf1 != '哈哈':
                # The link just before this one holds the detail-page href.
                daf = links[m - 1].get('href')
                c = 'webaddress1' + str(daf)
                response = requests.get(c)  # fetch once; reuse for status and body
                if response.status_code == 500:
                    f.write('Cannot found!')
                    f.write('\n')
                else:
                    soup1 = bs4.BeautifulSoup(response.content, "lxml")
                    # Basic information: cells 2, 4, ..., 18 under project_div2.
                    cells = soup1.find(id="project_div2").find_all('td')
                    for p in range(2, 20, 2):
                        f.write(cells[p].get_text())
                        f.write(' ')
                    # The cells under "xiugai" are read four per row.
                    tds = soup1.find(id="xiugai").find_all('td')
                    for q in range(2, len(tds), 4):
                        nn = tds[q].get_text().replace(' ', '')
                        nn1 = tds[q + 1].get_text().replace(' ', '')
                        nn2 = tds[q - 1].get_text().replace(' ', '')
                        nn3 = tds[q - 2].get_text().replace(' ', '')
                        # Prefix the row when its second cell is '1' but differs from the first.
                        if nn2 != nn3 and nn2 == '1':
                            f.write('InteriorRing ')
                        f.write(nn2)
                        f.write(';')
                        f.write(nn)
                        f.write('，')
                        f.write(nn1)
                        f.write(',')
                f.write('\n')

　　lxml

import lxml.html
import requests
from lxml.cssselect import CSSSelector

# Compile the CSS selectors once instead of on every page.
sel = CSSSelector('div div table tr td a')
sel1 = CSSSelector('div[id="project_div2"] table tr td')
sel3 = CSSSelector('div[id="xiugai"] table tr td')

# "webaddress" and 'webaddress2' are placeholders for the listing-page URL and
# the detail-page URL prefix; the real addresses are not given.
with open('testpython2.txt', 'w', encoding='utf-8') as f:
    for j in range(2001, 2592):  # listing pages 2001 to 2591
        link = "webaddress"  # placeholder; presumably built from j
        # The Referer is the site's own address (the variable, not the string 'link').
        headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6', 'referer': link}
        beautiful = requests.get(link, headers=headers).content
        tree = lxml.html.fromstring(beautiful)
        results = sel(tree)
        for m in range(5, 85, 8):  # 10 records per page, 8 links apart
            match = results[m]
            if results[m - 4].text == 'XXX':
                daf = match.get('href')
                c = 'webaddress2' + str(daf)
                response = requests.get(c)  # fetch once; reuse for status and body
                if response.status_code == 500:
                    f.write('Cannot found!')
                    f.write('\n')
                else:
                    tree1 = lxml.html.fromstring(response.content)
                    # Basic information: cells 2, 4, ..., 18 under project_div2.
                    results1 = sel1(tree1)
                    for p in range(2, 20, 2):
                        mm = results1[p].text
                        if mm is None:
                            f.write('NoValue')
                        else:
                            f.write(mm)
                        f.write(' ')
                    # The cells under "xiugai" are read four per row;
                    # stop while q + 1 is still a valid index.
                    results3 = sel3(tree1)
                    for q in range(2, len(results3) - 1, 4):
                        nn1 = results3[q].text
                        nn2 = results3[q + 1].text
                        nn3 = results3[q - 1].text
                        nn4 = results3[q - 2].text
                        f.write(nn4)
                        f.write(',')
                        f.write(nn3)
                        f.write(',')
                        f.write(nn1)
                        f.write(',')
                        f.write(nn2)
                        f.write(';')
                    f.write('\n')
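Both scripts walk the cells of the "xiugai" table four at a time using index arithmetic. As an illustration only, the same grouping can be written on a plain list of cell texts. The helper names and the sample values below are made up for the example, and the output mirrors the `a,b,c,d;` layout that the lxml script writes.

```python
def rows_of_four(cells):
    """Split a flat list of cell texts into complete rows of four cells."""
    usable = len(cells) - len(cells) % 4   # ignore a trailing incomplete row
    return [cells[i:i + 4] for i in range(0, usable, 4)]


def format_rows(rows):
    """Join each row with commas and terminate it with a semicolon."""
    return "".join(",".join(row) + ";" for row in rows)


# Made-up sample: two complete rows plus one stray cell.
sample = ["1", "1", "10.5", "20.5", "1", "2", "11.0", "21.0", "stray"]
print(format_rows(rows_of_four(sample)))
```

Note that this sketch assumes every cell text is a string; a cell whose text is missing (None in lxml) would need a default value first.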

Problems: 1. How to install libraries in Python.

　　　　Solution: open cmd, cd to the relevant folder under the Python installation directory, then install with easy_install or pip.

cd C:\Python36-32\Scripts
pip install lxml
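After installing, it can help to confirm that the interpreter you are running actually sees the package, since the pip on the PATH may belong to a different Python installation. A small check using only the standard library is sketched below; `is_installed` is a hypothetical helper name, and the three module names are simply the ones imported in this post.

```python
import importlib.util


def is_installed(module_name):
    """Return True if this interpreter can find the top-level module."""
    return importlib.util.find_spec(module_name) is not None


for name in ("bs4", "lxml", "requests"):
    status = "installed" if is_installed(name) else "missing"
    print("{}: {}".format(name, status))
```

If a package shows as missing, running `python -m pip install lxml` with the same interpreter is a common way to make sure it is installed in the right place.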

　　2. Using urllib.

　　　　In Python 2.x you can simply import urllib, but in Python 3.x the URL-opening functions live in urllib.request, so use import urllib.request instead.

beautiful = urllib.request.urlopen(webaddress).read()
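If the same script has to run under both Python 2.x and 3.x, one common pattern is to try the 3.x import first and fall back to the 2.x one. The sketch below is an assumption about how that might look, not code from the original post; `read_page` is a hypothetical helper name.

```python
try:
    from urllib.request import urlopen   # Python 3.x
except ImportError:
    from urllib import urlopen           # Python 2.x


def read_page(url):
    """Fetch a URL and return the raw response body as bytes."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()   # always release the connection
```

With this helper, the line above would become `beautiful = read_page(webaddress)`.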

　　3. urllib vs. requests

　　　　With urllib, page reads were unreliable and the connection often dropped after a short time, so I switched to requests.

beautiful = requests.get(webaddress).content

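　　4. Beautiful Soup crawls too slowly. After checking the documentation I switched its parser to lxml, but the speed did not improve noticeably.

　　Before

soup=bs4.BeautifulSoup(beautiful,"html.parser")

　　After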

soup=bs4.BeautifulSoup(beautiful,"lxml")
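One possible reason the parser swap made little difference is that most of the time goes to network requests rather than parsing. The post does not measure this, so treat it as a hypothesis. A rough way to check is to time the download and the parse separately. The sketch below uses only `time.perf_counter`; `timed` is a hypothetical helper, and the commented usage lines assume the `requests` and `bs4` calls shown in this post.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the example runs without network access.
total, seconds = timed(sum, range(1000000))
print("sum took {:.4f} s".format(seconds))

# With the libraries from this post, the same helper could be used as:
#   response, t_fetch = timed(requests.get, webaddress)
#   soup, t_parse = timed(bs4.BeautifulSoup, response.content, "lxml")
```

Comparing `t_fetch` with `t_parse` over a handful of pages would indicate which step is worth optimizing.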

　　5. Following a blog post found online (http://blog.csdn.net/my_precious/article/details/52948362), I dropped Beautiful Soup entirely to test the speed and used lxml with CSSSelector instead.

import lxml.html
from lxml.cssselect import CSSSelector

tree=lxml.html.fromstring(beautiful)
sel=CSSSelector('div div table tr td a')
results=sel(tree)
match=results[m]
daf=match.get('href')
daf1=match[1].text

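　　6. After reading 50+ pages, error 10054 occurred and the connection was dropped.

　　requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

　　Solution: add request headers and set the referer to the site's own address, so that the site does not mistake the requests for an attack.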

headers={'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6','referer':link}
beautiful = requests.get(link,headers=headers).content
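A custom header may not be enough on its own if the server also limits request rates. As a further hedge, and as an assumption beyond what the post tested, a failed request can be retried after a growing pause. The sketch below takes the fetch function as a parameter so that it does not depend on any particular HTTP library; `fetch_with_retry` and its parameters are hypothetical names.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on connection errors with a growing pause.

    Waits delay, 2*delay, 4*delay, ... seconds between attempts and
    re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # ConnectionResetError is a subclass of OSError.
            if attempt == attempts - 1:
                raise
            time.sleep(delay * 2 ** attempt)
```

With the code from this post it might be called as `beautiful = fetch_with_retry(lambda u: requests.get(u, headers=headers).content, link)`.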

Takeaways: Python is case-sensitive and strict about indentation.
