Python Crawler: Static Scraping of Douban Comments

Reposted article. 2020-05-19 21:17. Tag: Python crawler

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Browser-like headers so the request is not rejected as a crawler
header = {
'Referer': 'https://movie.douban.com/subject/33420285/comments?status=P',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
# Result lists, defined before the loop so every page can be appended to them
reviewers = []
dates = []
shot_comments = []
votes = []
# start = 0, 20, 40, 60, 80: five pages of 20 comments each
for i in range(0,100,20):
    url=f'https://movie.douban.com/subject/33420285/comments?start={i}&limit=20&sort=new_score&status=P'
    request = requests.get(url,headers=header)
    html = request.content.decode('utf-8')

    # Parse the HTML text into a searchable document tree
    dom = BeautifulSoup(html,'lxml')

    # Append this page's results instead of overwriting the earlier pages
    reviewers = reviewers + [i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-info > a')]
    dates = dates + [i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-info > span.comment-time')]
    shot_comments = shot_comments + [i.getText() for i in dom.select('#comments > div > div.comment > p > span')]
    votes = votes+ [i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-vote > span')]

# to_excel needs an Excel writer such as openpyxl to be installed
short = pd.DataFrame({
    'date':dates,'reviewer':reviewers,'comment':shot_comments,'votes':votes
})
short.to_excel('./short.xlsx')

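Analysis:

We write the code in four steps. Step one: check whether the site has an anti-scraping mechanism. Step two: fetch the whole page. Step three: extract the content we want. Step four: save it locally. Once it is clear what has to be done, work through the steps one at a time.

Step 1: check for an anti-scraping mechanism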

import requests

url = "https://movie.douban.com/subject/33420285/comments?status=P"
request = requests.get(url)

print(request.status_code)

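requests.get(url, params=None, **kwargs) sends a GET request; the keyword arguments it accepts include headers, cookies, auth, and timeout, all of which default to None. It returns a Response object that stores the content of the server's response.

Print the response's status code: 418 means the site has refused the request as coming from a crawler, and 200 means the request succeeded.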
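The check described above can be wrapped in a small helper. This is only a sketch: the function name describe_status is hypothetical, and reading 418 as "blocked" is this article's observation about Douban rather than a general HTTP rule.

```python
def describe_status(code: int) -> str:
    """Give a rough reading of an HTTP status code for this tutorial."""
    if code == 200:
        return "ok"
    if code == 418:
        # Assumption taken from this article: Douban answers 418
        # when it takes the request for a crawler.
        return "blocked"
    return "other"


# Usage with a real response would be describe_status(request.status_code)
for code in (200, 418, 404):
    print(code, describe_status(code))
```

Anything other than 200 or 418 is reported as "other" here and would need to be looked at case by case.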

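Here the status code is 418, so we need to get past the anti-scraping check by setting request headers.

Copy the User-Agent (and Referer) from the browser's request headers into the request: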

import requests

url = "https://movie.douban.com/subject/33420285/comments?status=P"

headers = {
'Referer': 'https://movie.douban.com/subject/33420285/comments?status=P',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

request = requests.get(url,headers=headers)

print(request.status_code)

The status code is now 200, so we can move on to the next step.
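Because the same headers are needed for every request, one option is to attach them once to a requests.Session instead of passing headers= on each call. The sketch below assumes that approach (the article itself passes headers per request); make_session is a hypothetical helper name, and no network request is made.

```python
import requests

HEADERS = {
    "Referer": "https://movie.douban.com/subject/33420285/comments?status=P",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
    ),
}


def make_session(headers: dict) -> requests.Session:
    """Return a Session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


session = make_session(HEADERS)
# session.get(url) would now send both headers automatically.
print(session.headers["User-Agent"])
```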

Step 2: fetch the page content

import requests


url = "https://movie.douban.com/subject/33420285/comments?status=P"

headers = {
'Referer': 'https://movie.douban.com/subject/33420285/comments?status=P',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

request = requests.get(url,headers=headers)
html = request.content.decode('utf-8')

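html = request.content.decode('utf-8') decodes the raw bytes of the response into the page's HTML text. To confirm the encoding, right-click the page, view its source, and look for the declared charset.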
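To see why the codec matters, here is a minimal sketch that decodes the same bytes with the right and the wrong encoding. The sample bytes stand in for request.content, and decode_html is a hypothetical helper, not part of requests.

```python
def decode_html(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode response bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


# Stand-in for request.content: a tiny UTF-8 encoded page
raw = '<meta charset="utf-8"><p>豆瓣</p>'.encode("utf-8")

text = decode_html(raw)              # decoded with the declared charset
wrong = decode_html(raw, "latin-1")  # decoded with the wrong codec

print("豆瓣" in text)   # the Chinese text survives
print("豆瓣" in wrong)  # the Chinese text is garbled
```

Decoding with the wrong codec does not necessarily raise an error; it can silently produce garbled text, which is why checking the page's charset first is worthwhile.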

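You can print(html), or inspect the fetched page source in the Python console.

To switch to running in the Python console:

Select the item in the menu bar [screenshot omitted]

Then tick the option [screenshot omitted]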

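Step 3: extract the useful content

Press F12, select the element you want to scrape, then right-click it and copy its selector. The result is

 #comments > div:nth-child(1) > div.comment > h3 > span.comment-info > a

This selector states where the target content sits in the HTML structure.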

from bs4 import BeautifulSoup   # tool for locating elements and extracting their text
import requests


url = "https://movie.douban.com/subject/33420285/comments?status=P"

headers = {
'Referer': 'https://movie.douban.com/subject/33420285/comments?status=P',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

request = requests.get(url,headers=headers)


html = request.content.decode('utf-8')

dom = BeautifulSoup(html , 'lxml') # parse the HTML text into a searchable document tree

reviewers =[i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-info > a')]  # loop over every element i returned by dom.select and extract its text with getText(): reviewer id
dates = [i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-info > span.comment-time')] # comment date
shot_comments = [i.getText() for i in dom.select('#comments > div > div.comment > p > span')]   # comment text
votes = [i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-vote > span')] # vote count

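#comments > div:nth-child(1) > div.comment > h3 > span.comment-info > a
#comments > div:nth-child(2) > div.comment > h3 > span.comment-info > a

Comparing the two selectors shows that only the :nth-child() index differs from one comment to the next, so deleting the :nth-child() part makes the selector match every reviewer's id.

dom.select() returns a list of the elements that match the CSS selector. The expression [i.getText() for i in ...] is a list comprehension that calls getText() on each element of that list. getText() returns only the text inside an element and drops the surrounding HTML markup.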
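The effect of dropping :nth-child() can be tried offline on a small HTML fragment. The fragment below is made up for illustration and only imitates the structure the selectors above imply; the reviewer names are invented, and the built-in "html.parser" is used instead of lxml so that nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure behind the selectors
SAMPLE = """
<div id="comments">
  <div class="comment-item">
    <div class="comment">
      <h3>
        <span class="comment-vote"><span class="votes">12</span></span>
        <span class="comment-info"><a href="#">alice</a><span class="comment-time">2020-05-01</span></span>
      </h3>
      <p><span class="short">Great film</span></p>
    </div>
  </div>
  <div class="comment-item">
    <div class="comment">
      <h3>
        <span class="comment-vote"><span class="votes">3</span></span>
        <span class="comment-info"><a href="#">bob</a><span class="comment-time">2020-05-02</span></span>
      </h3>
      <p><span class="short">Too long</span></p>
    </div>
  </div>
</div>
"""

dom = BeautifulSoup(SAMPLE, "html.parser")

# With :nth-child(1) only the first comment matches
first_only = [i.getText() for i in dom.select(
    "#comments > div:nth-child(1) > div.comment > h3 > span.comment-info > a")]

# Without :nth-child() every comment matches
everyone = [i.getText() for i in dom.select(
    "#comments > div > div.comment > h3 > span.comment-info > a")]

print(first_only)
print(everyone)
```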

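Each piece of data we want is now available on its own.

Selecting a region of code and pressing Shift+Alt+E runs only the selected code.

So far, though, we only have the first page of data. To scrape many pages we need a loop.

The operations are the same for every page; the only thing that changes from page to page is the URL.

Page 1: https://movie.douban.com/subject/33420285/comments?start=0&limit=20&sort=new_score&status=P

Page 2: https://movie.douban.com/subject/33420285/comments?start=20&limit=20&sort=new_score&status=P

Page 3: https://movie.douban.com/subject/33420285/comments?start=40&limit=20&sort=new_score&status=P

Only the start value changes, so we can loop over the values of start.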
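As an alternative to formatting the query string by hand, the page URLs can be assembled with the standard library. This is a sketch under that assumption; page_url and BASE are hypothetical names, and the parameters are the ones visible in the three URLs above.

```python
from urllib.parse import urlencode

BASE = "https://movie.douban.com/subject/33420285/comments"


def page_url(start: int, limit: int = 20) -> str:
    """Build the comments URL for the page beginning at offset `start`."""
    query = urlencode(
        {"start": start, "limit": limit, "sort": "new_score", "status": "P"}
    )
    return f"{BASE}?{query}"


urls = [page_url(start) for start in (0, 20, 40)]
for u in urls:
    print(u)
```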

from bs4 import BeautifulSoup
import requests

headers = {
'Referer': 'https://movie.douban.com/subject/33420285/comments?status=P',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

# Define the result lists before the loop so each page can be appended to them
reviewers = []
dates = []
shot_comments = []
votes = []

for i in range(0,100,20):
    url = f'https://movie.douban.com/subject/33420285/comments?start={i}&limit=20&sort=new_score&status=P'
    request = requests.get(url,headers=headers)

    html = request.content.decode('utf-8')

    dom = BeautifulSoup(html , 'lxml')
    # In Python 3 the i inside each comprehension is local to it and does not change the loop's i
    reviewers = reviewers +[i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-info > a')]
    dates = dates+[i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-info > span.comment-time')]
    shot_comments = shot_comments+[i.getText() for i in dom.select('#comments > div > div.comment > p > span')]
    votes = votes+[i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-vote > span')]

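The start value is supplied by the loop variable i. range(0, 100, 20) runs from 0 up to but not including 100 in steps of 20, which gives 0, 20, 40, 60, 80, so the loop runs 5 times and fetches up to 100 comments.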
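The values that range produces are easy to confirm directly. A minimal sketch, where page_offsets is a hypothetical helper:

```python
def page_offsets(stop: int, step: int = 20) -> list:
    """List the start offsets that range(0, stop, step) produces."""
    return list(range(0, stop, step))


offsets = page_offsets(100)
print(offsets)       # the stop value itself is excluded
print(len(offsets))  # number of loop iterations

# To include start=100 as well, the stop value has to be pushed past 100
print(page_offsets(101))
```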

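Everything from fetching the page to extracting the data must be indented inside the for loop.

One thing to watch: each assignment needs the reviewers +, dates +, shot_comments +, and votes + part. Without it, a line such as

reviewers = [i.getText() for i in dom.select('#comments > div > div.comment > h3 > span.comment-info > a')]

would make the second iteration overwrite the reviewers collected in the first, the third overwrite the second, and so on. Writing reviewers = reviewers + [...] appends the new page to what is already there, so nothing is lost.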
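The difference between overwriting and appending can be seen without any network access. In this sketch the pages list is made-up data standing in for the reviewers scraped from three pages.

```python
# Made-up stand-in for the reviewers found on three successive pages
pages = [["alice", "bob"], ["carol"], ["dave", "erin"]]

overwritten = []
accumulated = []
for page in pages:
    overwritten = list(page)          # plain assignment: keeps only this page
    accumulated = accumulated + page  # same pattern as reviewers = reviewers + [...]

print(overwritten)
print(accumulated)
```

accumulated.extend(page) would give the same result while modifying the list in place.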

Step 4: save to a local file by adding the following at the end of the code

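import pandas as pd
# All four lists must have the same length, otherwise DataFrame raises ValueError
short = pd.DataFrame({
    'date':dates,'reviewer':reviewers,'comment':shot_comments,'votes':votes
})
# to_excel needs an Excel writer such as openpyxl to be installed
short.to_excel('./short.xlsx')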
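pd.DataFrame raises ValueError when the lists passed in a dict differ in length, which can happen if one selector matches fewer elements than another. The sketch below checks the lengths first and then writes CSV with the standard library's csv module, swapped in here for to_excel so the example runs without pandas. The names to_rows and write_csv and the sample row are hypothetical.

```python
import csv
import io


def to_rows(dates, reviewers, comments, votes):
    """Zip the four lists into rows, refusing lists of unequal length."""
    lengths = {len(col) for col in (dates, reviewers, comments, votes)}
    if len(lengths) > 1:
        raise ValueError(f"columns differ in length: {sorted(lengths)}")
    return list(zip(dates, reviewers, comments, votes))


def write_csv(rows, out) -> None:
    """Write a header line followed by the rows to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["date", "reviewer", "comment", "votes"])
    writer.writerows(rows)


# Made-up sample row; real data would come from the scraped lists
rows = to_rows(["2020-05-01"], ["alice"], ["Great film"], ["12"])
buffer = io.StringIO()  # in-memory stand-in for open("short.csv", "w", newline="")
write_csv(rows, buffer)
print(buffer.getvalue())
```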

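Run the full code. If it stops with an error such as NameError: name 'reviewers' is not defined, define the list before the loop, for example reviewers = [], and do the same for dates, shot_comments, and votes.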
