Python example: automatically scrape Douban Books short reviews and analyse their content

Reposted article (see the original). 2019-08-31 20:32. Tags: crawler / python / python example

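Approach:

1. Open the book's "more" short reviews page and copy the link.

2. The script fetches that link, reads the total number of short reviews, and computes the number of pages from it.

3. Using the page count, loop over the pages and scrape the short reviews on each one.

4. Write the short reviews to a txt file.

5. Read the txt file back, process the text, and print the most frequent words (top X). Other analyses of the same text are left to the reader.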
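
Step 2 can be sketched on its own as follows. This is a minimal illustration rather than the script's own code: the helper names `extract_count` and `page_count` are hypothetical, the sample string only stands in for whatever text the page actually returns, and the page size of 20 is taken from the comment at the top of the script.

```python
import re

PER_PAGE = 20  # Douban shows 20 comments per page (per the script's comment)

def extract_count(text):
    """Keep only the digits in the comment-count text and return them as an int."""
    return int(re.sub(r"\D", "", text))

def page_count(total, per_page=PER_PAGE):
    """Number of pages needed to hold `total` comments (ceiling division)."""
    return (total + per_page - 1) // per_page

sample = "全部共 45 條"  # hypothetical stand-in for the scraped text
total = extract_count(sample)
print(total, page_count(total))
```

Ceiling division avoids requesting an extra, empty page when the comment count is an exact multiple of 20.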

Libraries used:

requests, lxml, re, jieba, time

The complete script:

# -*- coding: utf-8 -*-
# Douban shows 20 comments per page

import requests
from lxml import etree
import re
import jieba
import time

firstlink = "https://book.douban.com/subject/30193594/comments/"

def stepc(firstlink):  # get the comment-count text
    url = firstlink
    response = requests.get(url=url)
    wb_data = response.text
    html = etree.HTML(wb_data)
    a = html.xpath('//*[@id="total-comments"]/text()')
    return a
a = stepc(firstlink)
c = re.sub(r'\D', "", a[0])  # keep only the digits of the comment-count text
d = (int(c) + 19) // 20  # page count: comment count / 20, rounded up
print("There are " + str(d) + " pages of comments, please be patient")


def stepa(firstlink, d):  # read the comment text
    content = []
    for page in range(1, d + 1):  # pages are numbered 1..d inclusive
        url = firstlink + "hot?p=" + str(page)
        response = requests.get(url=url)
        wb_data = response.text
        html = etree.HTML(wb_data)
        a = html.xpath('//*[@id="comments"]//div[2]/p/span/text()')
        content.append(a)  # one list of comments per page
    return content
a = stepa(firstlink, d)

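def stepb(a):  # write the comments to a txt file
    # open the file once in append mode; the with block closes it automatically
    with open('C:/Users/Beckham/Desktop/python/2.txt', 'a', encoding='utf-8') as w:
        for b in a:  # b: the comments of one page
            for c in b:  # c: a single comment
                w.write('\n' + c)
stepb(a)
print("Finished scraping the comments, now analysing keywords")
time.sleep(5)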

def stepd():  # analyse the comments
    # open the saved comments file
    with open("C:\\Users\\Beckham\\Desktop\\python\\2.txt", "r", encoding='utf-8') as f:
        txt = f.read()
    excludes = {}  # words to drop from the results; empty by default
    words = jieba.lcut(txt)  # segment the text into words with jieba
    counts = {}
    for word in words:
        if len(word) == 1:  # skip single-character tokens; only longer tokens count as words
            continue
        counts[word] = counts.get(word, 0) + 1  # get() returns 0 for a word not seen yet, then add 1
    for word in excludes:  # filter out (do not show) any word listed in excludes
        counts.pop(word, None)  # pop with a default: no error if the word never occurred
    items = list(counts.items())  # dict to list of (word, count) pairs
    items.sort(key=lambda x: x[1], reverse=True)  # sort by the second field (the count), descending
    for word, count in items[:15]:  # show the top 15 (fewer if there are fewer words)
        print("{0:<10}{1:>10}".format(word, count))  # <10: left-aligned, width 10; >10: right-aligned
stepd()
print("Analysis complete")
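
As an alternative to the hand-written counting loop in `stepd`, the same tally could be produced with `collections.Counter` from the standard library. The sketch below is only an illustration under assumptions: `top_words` is a hypothetical helper, and the token list is made up to stand in for the output of `jieba.lcut(txt)` so that the example runs without the comments file.

```python
from collections import Counter

def top_words(words, excludes=(), n=15):
    """Return the n most frequent multi-character words, skipping excluded ones."""
    skip = set(excludes)
    counts = Counter(w for w in words if len(w) > 1 and w not in skip)
    return counts.most_common(n)

# Hypothetical token list standing in for the output of jieba.lcut(txt)
tokens = ["故事", "的", "好看", "故事", "作者", "好看", "故事", "，"]
for word, count in top_words(tokens, excludes={"作者"}, n=2):
    print("{0:<10}{1:>10}".format(word, count))
```

`most_common(n)` returns at most `n` pairs, so it does not fail when the text contains fewer than `n` distinct words.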

Result of running the script

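Note that if this script is run too often, Douban will decide that the IP is making too many requests and will return a login page instead.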
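
One common way to lower the request rate is to pause for a random interval between page fetches. The sketch below is an assumption-laden illustration, not part of the original script: `fetch_all` is a hypothetical helper, the delay bounds are arbitrary, and whether such pauses are enough to avoid Douban's login page has not been verified here. The `fetch` argument is any function that downloads one URL; in the script it could wrap `requests.get`.

```python
import random
import time

def fetch_all(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # wait before every request except the first
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results

# Demonstration with a stand-in fetch and a no-op sleep so it runs instantly;
# in the script, fetch could be: lambda u: requests.get(url=u).text
base = "https://book.douban.com/subject/30193594/comments/"
urls = [base + "hot?p=" + str(page) for page in range(1, 4)]
pages = fetch_all(urls, fetch=lambda u: "page for " + u, sleep=lambda s: None)
print(len(pages))
```

Passing `sleep` as a parameter keeps the helper easy to try out without waiting; leaving it at the default uses `time.sleep`.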

Further explanations are given as comments in the script.
