Python Crawler Development, Dynamic Web Page Scraping: Scraping Blog Comment Data by Finding the Real Data URL with the Browser's Inspect Tool

Reposted article. Published 2018-04-14 15:36. Category: Python web crawlers

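Most mainstream websites now use JavaScript to render page content. Unlike the static pages scraped earlier, much of this content never appears in the HTML source. The HTML holds a piece of JavaScript instead, and the data you finally see is fetched from the server by that script and loaded into the page. The techniques for scraping static pages may therefore not work, and we need one of two techniques for dynamic pages:

1. Find the real data URL with the browser's Inspect tool;

2. Simulate a browser with selenium.

We start with the first method here.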
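To make the problem concrete, here is a minimal sketch of what a static scraper sees on such a page. The HTML below is a hypothetical stand-in (the `comments` container and the `example.com` script URL are assumptions, not the real blog's markup): the comment area is an empty element that a script fills in later, so parsing the source alone finds no comment text.

```python
from html.parser import HTMLParser

# Hypothetical page source: the comment area is an empty container that a
# script fills in after the page has loaded in a browser.
static_html = """
<html>
  <body>
    <div id="comments"></div>
    <script src="https://example.com/load-comments.js"></script>
  </body>
</html>
"""


class TextCollector(HTMLParser):
    """Collect the text nodes of a document, skipping whitespace-only ones."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


collector = TextCollector()
collector.feed(static_html)
# No comment text is present in the source, so nothing is collected.
print(collector.texts)
```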

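The example scrapes the comments on the personal blog of the author of the book 《Python 網絡爬蟲：從入門到實踐》. URL: http://www.santostang.com/2017/03/02/hello-world/

1) "Packet capture": find the real data URL

Right-click the page and choose "Inspect", open the "Network" tab, and select the "JS" filter. Refresh the page and select the JS file list?callback.... that is returned during the reload. Then select Headers in the right-hand panel, as shown in the figure:

The Request URL shown there is the real data URL.

Scroll down in the same panel to find the User-Agent.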
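Before using the Request URL, it can help to see what it is made of. The following sketch uses the standard library's urllib.parse to split the URL from this example into its host, path, and query parameters. Nothing is sent over the network, and the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# The Request URL copied from the Headers panel in this example
request_url = (
    "https://api-zero.livere.com/v1/comments/list"
    "?callback=jQuery112405600294326674093_1523687034324"
    "&limit=10&offset=2&repSeq=3871836"
    "&requestPath=%2Fv1%2Fcomments%2Flist"
    "&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154"
    "&_=1523687034329"
)

parts = urlsplit(request_url)
# parse_qs maps each parameter name to a list of its (decoded) values
params = parse_qs(parts.query)

print(parts.netloc, parts.path)
for name, values in params.items():
    print(name, "=", values[0])
```

The callback parameter is a sign that the response is JSON wrapped in a function call (JSONP), which is why the code below has to trim the response before parsing it.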

2) The code:

import requests
import json

# Browser-like User-Agent copied from the Headers panel
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
# The real data URL (the Request URL found in step 1)
link = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset=2&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329"
r = requests.get(link, headers=headers)
# Get the response body as a string (JSON wrapped in a callback)
json_string = r.text
# Keep only the JSON object: start at the first '{' and drop the trailing ');'
json_string = json_string[json_string.find('{'):-2]
json_data = json.loads(json_string)
comment_list = json_data['results']['parents']
for eachone in comment_list:
    message = eachone['content']
    print(message)

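Output (the scraped comments, printed in their original Chinese; some of the comments themselves contain code):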

現在死在了4.2節上，頁面評論是有的，但是XHR里沒有東西啊，這是什么情況？有解決的大神嗎？
為何靜態網頁抓取不了？
奇怪了，我按照書上的方法來操作，XHR也是空的啊
XHR沒有顯示任何東西啊。奇怪。
找到原因了
caps["marionette"] = True
作者可以解釋一下這句話是干什么的嗎
我用的是 pycham IDE，按照作者的寫法寫的，怎么不行
對火狐版本有要求嗎
4.3.1 打開Hello World,代碼用的作者的，火狐地址我也設置了，為啥運行沒反應
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
binary = FirefoxBinary(r'C:\Program Files\Mozilla Firefox\firefox.exe')
#把上述地址改成你電腦中Firefox程序的地址
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
我是番茄
為什么刷新沒有XHR數據，評論明明加載出來了

Code analysis:

1) For json_string.find(), the API documentation reads:

Docstring:
S.find(sub[, start[, end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].  Optional
arguments start and end are interpreted as in slice notation.

Return -1 on failure.
Type:      method_descriptor

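So json_string.find('{') returns the index of the first '{' in json_string.

2) If you add the statement print(json_string) to the code before the slicing line, its output is as follows (the output is long, so only the beginning and end are shown; the key parts were marked in red in the original):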

/**/ typeof jQuery112405600294326674093_1523687034324 === 'function' && jQuery112405600294326674093_1523687034324({"results":{"parents":[{"replySeq":33365104,"name":"骨犬","memberId":"B9E06FBF9013D49CADBB5B623E8226C8","memberIcon":"http://q.qlogo.cn/qqapp/101256433/B9E06FBF9013D49CADBB5B623E8226C8/100","memberUrl":"https://qq.com/","memberDomain":"qq","good":0,"bad":0,"police":0,"parentSeq":33365104,"directSeq":0,"shortUrl":null,"title":"Hello world! - 數據科學@唐松
Santos","site":"http://www.santostang.com/2017/03/02/hello-world/","email":null,"ipAddress":"27.210.192.241","isMobile":"0","agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.8.3.16721","septSns":null,"targetService":null,"targetUserName":null,"info1":null,"info2":null,"info3":null,"image1":null,"image2":null,"image3":null,"link1":null,"link2":null,"link3":null,"isSecret":0,"isModified":0,"confirm":0,"subCount":1,"regdate":"2018-01-01T06:27:50.000Z","deletedDate":null,"file1":null,"file2":null,"file3":null,"additionalSeq":0,"content":"現在死在了4.2節上，頁面評論是有的，但是XHR里沒有東西啊，這是什么情況？有解決的大神嗎？"
 。。。。。。。。。 tent":"我的也是提示火狐版本不匹配，你解決了嗎","quotationSeq":null,"quotationContent":null,"consumerSeq":1020,"livereSeq":28583,"repSeq":3871836,"memberGroupSeq":26828779,"memberSeq":27312353,"status":0,"repGroupSeq":0,"adminSeq":25413747,"deleteReason":null,"sticker":0,"version":null}],"quotations":[]},"resultCode":200,"resultMessage":"Okay, livere"});

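The output above shows why the line json_string = json_string[json_string.find('{'):-2] matters.

Without slicing from json_string.find('{'), the string still begins with the callback prefix (everything before the first '{') and is not valid JSON, so it cannot be parsed. Without stopping two characters before the end, the string still contains the trailing ); and is likewise not valid JSON.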
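The fixed -2 works only when the response ends in exactly );. As a slightly more defensive sketch (not the book's method), the hypothetical helper strip_jsonp below cuts from the first '{' to the last '}', so a trailing newline or missing semicolon does not break parsing. The sample string is a shortened, made-up response in the same shape as the output above.

```python
import json


def strip_jsonp(text):
    """Return the JSON object embedded in a callback-wrapped response."""
    start = text.find('{')   # first character of the JSON object
    end = text.rfind('}')    # last character of the JSON object
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return text[start:end + 1]


# Shortened, made-up response in the same shape as the real one
sample = (
    "/**/ typeof cb === 'function' && "
    'cb({"results": {"parents": [{"content": "hello"}]}, "resultCode": 200});'
)

data = json.loads(strip_jsonp(sample))
print(data["results"]["parents"][0]["content"])
```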

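3) The string keys inside the square brackets in comment_list=json_data['results']['parents'] and message=eachone['content'] can be found in the key parts of the output in 2). In the valid JSON left after slicing, the comment list is nested under "results" and then "parents", so two pairs of square brackets walk down one level at a time. We are scraping comments, and each comment's text is stored under its "content" key, so ['content'] retrieves it.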
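The same lookup can be written so that a response without the expected keys yields an empty list instead of a KeyError. This is only a sketch: extract_contents is a hypothetical helper, and the sample dictionary is made-up data that follows the structure shown in 2).

```python
def extract_contents(json_data):
    """Return the text of every comment under results -> parents."""
    # .get() with a default avoids a KeyError when a level is missing
    parents = json_data.get("results", {}).get("parents", [])
    return [comment["content"] for comment in parents if "content" in comment]


# Made-up data with the same nesting as the real response
sample_data = {
    "results": {
        "parents": [
            {"replySeq": 1, "content": "first comment"},
            {"replySeq": 2, "content": "second comment"},
        ],
        "quotations": [],
    },
    "resultCode": 200,
}

for message in extract_contents(sample_data):
    print(message)
```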

From observation, the offset parameter in the real data URL is the page number.

Scraping the comments on every page:

import requests
import json


def single_page_comment(link):
    """Fetch one page of comments and print the text of each comment."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    r = requests.get(link, headers=headers)
    # Get the response body as a string (JSON wrapped in a callback)
    json_string = r.text
    # Keep only the JSON object: start at the first '{' and drop the trailing ');'
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for eachone in comment_list:
        message = eachone['content']
        print(message)


# Pages 1 to 3: offset is the only part of the URL that changes
for page in range(1, 4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)

Output (each URL is followed by the comments on that page, in their original Chinese):

https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset=1&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329
在JS 里面也找不到https://api.gentie.163.com/products/ 哪位大神幫忙解答下。謝謝。
在JS 里面也找不到https://api.gentie.163.com/products/ 哪位大神幫忙解答下。謝謝。
在JS 里面也找不到https://api.gentie.163.com/products/ 哪位大神幫忙解答下。謝謝。
測試
為什么我用代碼打開的文章只有兩條評論，本來是有46條的，有大神知道怎么回事嗎？
菜鳥一只，求學習群
lalala1
我來試一試 :smiley:
我來試一試 :smiley:
應該點JS，然后看里面的Preview或者Response，里面響應的是Ajax的內容，然后如果去爬網站的評論的話，點開js那個請求后點Headers -->在General里面拷貝 RequestURL 就可以了 :grinning:
https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset=2&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329
現在死在了4.2節上，頁面評論是有的，但是XHR里沒有東西啊，這是什么情況？有解決的大神嗎？
為何靜態網頁抓取不了？
奇怪了，我按照書上的方法來操作，XHR也是空的啊
XHR沒有顯示任何東西啊。奇怪。
找到原因了
caps["marionette"] = True
作者可以解釋一下這句話是干什么的嗎
我用的是 pycham IDE，按照作者的寫法寫的，怎么不行
對火狐版本有要求嗎
4.3.1 打開Hello World,代碼用的作者的，火狐地址我也設置了，為啥運行沒反應
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
binary = FirefoxBinary(r'C:\Program Files\Mozilla Firefox\firefox.exe')
#把上述地址改成你電腦中Firefox程序的地址
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
我是番茄
為什么刷新沒有XHR數據，評論明明加載出來了
https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset=3&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329
為什么刷新沒有XHR數據，評論明明加載出來了
為什么刷新沒有XHR數據，評論明明加載出來了
第21條測試評論
第20條測試評論
第19條測試評論
第18條測試評論
第17條測試評論
第16條測試評論
第15條測試評論
第14條測試評論

Note: page is an int, so it must be converted to a string before concatenation, i.e. page_str=str(page).
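The str() conversion is only needed because the URL is assembled by hand. As an alternative sketch (not the book's method), the standard library's urllib.parse.urlencode can build the query string from a dictionary and converts numbers itself. build_link and BASE_URL are hypothetical names; the parameter values are the ones from the URL in this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://api-zero.livere.com/v1/comments/list"


def build_link(page):
    """Return the comment-list URL for the given page number."""
    params = {
        "callback": "jQuery112405600294326674093_1523687034324",
        "limit": 10,
        "offset": page,  # an int is fine: urlencode converts it to text
        "repSeq": 3871836,
        "requestPath": "/v1/comments/list",  # encoded as %2Fv1%2Fcomments%2Flist
        "consumerSeq": 1020,
        "livereSeq": 28583,
        "smartloginSeq": 5154,
        "_": 1523687034329,
    }
    return BASE_URL + "?" + urlencode(params)


for page in range(1, 4):
    print(build_link(page))
```

Each printed URL can be passed to single_page_comment(link) as before. requests.get also accepts a params dictionary, which would let requests build the query string itself.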

Reference: 唐松, 《Python 網絡爬蟲：從入門到實踐》
