Scraping Douban movie reviews with Python and generating a word cloud from the keywords

Reposted article. Published 2020-04-20 11:27. Tags: web scraping / python

Background:

Python version: 3.7.4

IDE: PyCharm

Operating system: Windows 64-bit

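Step 1: Get the login state

Scraping Douban reviews requires a logged-in user, so the login cookies have to be obtained first. Log in to Douban in a browser (IE shows all cookies combined in one place, which makes them easy to copy; in other browsers you have to assemble them yourself), press F12, and copy the cookie and user-agent values from the request headers. Stay logged in while scraping; do not log out.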
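The cookie string copied from the browser can be sent as a raw header, as the code later in this article does, or split into name/value pairs first. Below is a minimal sketch of the second option, assuming the string has the usual `name=value; name=value` form. The helper `parse_cookie_header` and the sample values are hypothetical and not part of the original article.

```python
def parse_cookie_header(raw_cookie):
    """Split a 'name=value; name=value' cookie string into a dict."""
    cookies = {}
    for part in raw_cookie.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only, because values may contain '='
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Placeholder values; a real string comes from the browser's request headers
sample = "bid=abc123; ck=XYZ; token=a=b"
print(parse_cookie_header(sample))
```

The resulting dict can be passed to requests through the `cookies` argument of `requests.get`. If a name appears twice in the string, this sketch keeps only the last value.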

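Step 2: Analyze the HTML

Fetch a single page of reviews for "House M.D." as a simple test. Inspecting how the review data is laid out in the HTML shows that the content we need is in the second p tag under a td under a tr.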
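To make the tr, td, second p structure concrete, here is a small sketch that uses only the standard library's `html.parser` rather than BeautifulSoup. The sample HTML is invented for illustration and only imitates the layout described above, and `SecondParagraphParser` is a hypothetical helper. It counts every p inside a td, which is a simplification of a strict child selector.

```python
from html.parser import HTMLParser


class SecondParagraphParser(HTMLParser):
    """Collect the text of the second <p> inside every <td>."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.p_count = 0        # number of <p> seen in the current <td>
        self.capturing = False  # True while inside the second <p>
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.p_count = 0
        elif tag == "p" and self.in_td:
            self.p_count += 1
            if self.p_count == 2:
                self.capturing = True
                self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.capturing = False
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.capturing:
            self.reviews[-1] += data


# Invented markup that mimics the structure described in the text
sample_html = """
<table>
  <tr><td><p>user one</p><p>first review</p></td></tr>
  <tr><td><p>user two</p><p>second review</p></td></tr>
</table>
"""

parser = SecondParagraphParser()
parser.feed(sample_html)
print(parser.reviews)  # ['first review', 'second review']
```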

Step 3: Write the code

BeautifulSoup is used to parse the HTML. A simplified Python version follows (the output is UTF-8, so specifying the character set explicitly is recommended):

#!/usr/bin/python
# -*- coding: utf-8 -*-
import io
import sys
import requests
from bs4 import BeautifulSoup
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')
url = 'https://movie.douban.com/subject/1442129/collections?start=20'
headers = {
    'cookie':'ll=118172; bid=nO_yhRGdS8c; __utma=30149280.744941980.1587025849.1587025849.1587025849.1; __utmb=30149280.7.10.1587025849; __utmz=30149280.1587025849.1.1.utmcsr=so.com|utmccn=(referral)|utmcmd=referral|utmcct=/link; __utmt=1; push_noty_num=0; push_doumail_num=0; __utmv=30149280.18122; douban-profile-remind=1; __utmc=30149280; dbcl2=181229630:peNlRIftZSU; ck=0DBS; _vwo_uuid_v2=D6F0A378B72943607FFB8D0DE9AA9E4F2|e4b22c328b795c724132d4d5a5551615; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1587025959%2C%22https%3A%2F%2Fwww.douban.com%2Fsearch%3Fsource%3Dsuggest%26q%3D%25E9%2587%258D%25E7%2594%259F%22%5D; _pk_id.100001.4cf6=55b0d18436426829.1587025959.1.1587025959.1587025959.; _pk_ses.100001.4cf6=*; __utma=223695111.917770948.1587025959.1587025959.1587025959.1; __utmb=223695111.0.10.1587025959; __utmc=223695111; __utmz=223695111.1587025959.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/search; __yadk_uid=wBD152Qkg8CojaIRAPIB7nXOYiwGgYAj',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
}

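# Fetch the page and fail fast on HTTP errors instead of parsing an error page
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# Select the second <p> inside each table cell, which holds the review text
print(soup.select("tr > td > p:nth-of-type(2)"))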

The scraped reviews look like this (a rule can be added to strip the p tags):

[<p>看之前：不就是個醫療劇能拍出什么花？？
看之后：為什么一個醫療劇可以拍出這么多花？？</p>, <p>高中時期的下飯劇</p>]
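The article does not show how the p tags are removed or how the reviews end up in the text file used in the next step. When working with BeautifulSoup results directly, each tag's `get_text()` returns the text without markup. The sketch below is a standard-library alternative that works on fragments already converted to strings; `clean_review`, `save_reviews`, the sample fragments, and the output file name are all hypothetical.

```python
import html
import os
import re
import tempfile


def clean_review(fragment):
    """Strip tags from one scraped fragment and collapse whitespace."""
    text = re.sub(r"<[^>]+>", "", fragment)  # crude tag removal
    text = html.unescape(text)               # e.g. &amp; becomes &
    return " ".join(text.split())            # newlines become single spaces


def save_reviews(fragments, path):
    """Write one cleaned review per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for fragment in fragments:
            line = clean_review(fragment)
            if line:
                f.write(line + "\n")


# Hypothetical fragments shaped like the output above
fragments = ["<p>line one\nline two</p>", "<p>short &amp; sweet</p>"]
out_path = os.path.join(tempfile.gettempdir(), "reviews_demo.txt")
save_reviews(fragments, out_path)
```

Writing one review per line keeps the file easy to inspect, and jieba can segment the whole file in a single pass afterwards.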

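Step 4: Turn the scraped reviews into a word cloud

The main modules used are jieba, wordcloud, and image, all of which can be installed with pip (the code below imports PIL, which is provided by the Pillow package). The word cloud code uses the following files:

Location of the scraped review data: F:\python\install_3_7_4\txt\haosiyisheng.txt

Location of a still from House M.D. found online: F:\python\install_3_7_4\txt\haosiyisheng.png

Location of the font used for the word cloud: C:/Windows/Fonts/msyh.ttc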

#!/usr/bin/python
# -*- coding: utf-8 -*-
import io
import sys
from PIL import Image
from wordcloud import WordCloud
import numpy as np
import jieba
import matplotlib.pyplot as plt

sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')

def get_word_cloud():
    path_txt = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.txt"
    path_img = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.png"
    # Read the scraped reviews; the with block closes the file afterwards
    with open(path_txt, 'r', encoding='UTF-8') as f:
        text = f.read()
    # The still image acts as a mask: words are drawn only on its non-white areas
    background_image = np.array(Image.open(path_img))
    # jieba segments the Chinese text; WordCloud expects space-separated words
    cut_text = " ".join(jieba.cut(text))

    wordcloud = WordCloud(
        font_path="C:/Windows/Fonts/msyh.ttc",  # a font that contains Chinese glyphs
        background_color="white",
        mask=background_image
    ).generate(cut_text)

    # Save the image first, then display it without axes
    wordcloud.to_file(r"haosiyisheng_result.png")
    fig, ax = plt.subplots()
    ax.imshow(wordcloud)
    ax.axis("off")
    plt.show()


if __name__ == '__main__':
    get_word_cloud()
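The code above passes every segmented word to `generate()`, so common function words can dominate the picture. One possible refinement, sketched here with the standard library only, is to count the tokens yourself and drop stopwords and single-character tokens first. The token list stands in for the output of `jieba.cut`, and `keyword_frequencies`, the stopword set, and the `min_length` threshold are illustrative assumptions.

```python
from collections import Counter


def keyword_frequencies(tokens, stopwords, min_length=2):
    """Count tokens, skipping stopwords and very short tokens."""
    counts = Counter()
    for token in tokens:
        token = token.strip()
        if len(token) < min_length or token in stopwords:
            continue
        counts[token] += 1
    return counts


# Hypothetical tokens, standing in for the output of jieba.cut(...)
tokens = ["醫療", "劇", "的", "醫療", "豪斯", "可以", "豪斯", "醫療"]
stopwords = {"可以", "的"}
freqs = keyword_frequencies(tokens, stopwords)
print(freqs.most_common(2))  # [('醫療', 3), ('豪斯', 2)]
```

A mapping like `freqs` can be handed to `WordCloud.generate_from_frequencies` in place of the space-joined text.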

Final word cloud result:

Step 5: Errors encountered while coding and their solutions

1. Fixing the error: ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

This error comes up often when running pip install for a module:

ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

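One thing to try is switching the pip index to a mirror in China:

http://pypi.douban.com/ Douban
http://pypi.hustunique.com/ Huazhong University of Science and Technology
http://pypi.sdutlinux.org/ Shandong University of Technology
http://pypi.mirrors.ustc.edu.cn/ University of Science and Technology of China

The simplest approach is to pass the index directly on the command line. For example, to use the Douban mirror:

pip install -i https://pypi.douban.com/simple <package-to-install>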
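Because the error is a read timeout, raising pip's timeout together with switching the index may also help. The sketch below only builds the command line and does not run it; `pip_install_command` and the 60-second default are assumptions chosen for illustration. `--index-url` is the long form of `-i`, and `--timeout` sets pip's socket timeout in seconds.

```python
import sys


def pip_install_command(package, index_url, timeout_seconds=60):
    """Build a pip install command that uses a specific index and timeout."""
    return [
        sys.executable, "-m", "pip", "install",
        "--index-url", index_url,           # long form of -i
        "--timeout", str(timeout_seconds),  # socket timeout in seconds
        package,
    ]


cmd = pip_install_command("jieba", "https://pypi.douban.com/simple")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Calling pip through `sys.executable -m pip` installs into the interpreter that runs the script, which avoids confusion when several Python versions are installed.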

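2. Installing wordcloud

Installing wordcloud did not go smoothly at first. The approach that worked is as follows:

First, open this link: https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud

Download the wordcloud wheel that matches your Python version. My Python version is 3.7 (tag cp37), so I downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl. My computer runs 64-bit Windows, but the platform tag has to match the Python interpreter rather than the operating system: a win32 wheel installs only on a 32-bit Python build, which can itself run on 64-bit Windows.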
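To check which wheel file fits, you can ask the interpreter itself for its version and bitness. This is a minimal sketch that assumes CPython on Windows; `wheel_tags` is a hypothetical helper, and it reports only the Python tag and the platform tag, not the ABI tag (the cp37m part of the file name).

```python
import struct
import sys


def wheel_tags(version_info=None, pointer_bits=None):
    """Return the (python tag, Windows platform tag) for an interpreter."""
    if version_info is None:
        version_info = sys.version_info
    if pointer_bits is None:
        # Size of a C pointer in bits: 32 or 64 for the running interpreter
        pointer_bits = struct.calcsize("P") * 8
    python_tag = "cp{}{}".format(version_info[0], version_info[1])
    platform_tag = "win_amd64" if pointer_bits == 64 else "win32"
    return python_tag, platform_tag


print(wheel_tags())            # tags for the interpreter running this script
print(wheel_tags((3, 7), 32))  # ('cp37', 'win32'), the combination used here
```

If the first call reports win_amd64, the win_amd64 wheel is the one to download instead of the win32 one.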

Next, install the wheel module, which I installed before installing the .whl file:

pip install wheel

Copy the downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl file into the Scripts directory of the Python installation and run the following from there:

$ pip install wordcloud-1.6.0-cp37-cp37m-win32.whl
Processing f:\python\install_3_7_4\scripts\wordcloud-1.6.0-cp37-cp37m-win32.whl
Requirement already satisfied: pillow in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (7.1.1)
Requirement already satisfied: numpy>=1.6.1 in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (1.18.2)
Requirement already satisfied: matplotlib in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (3.2.1)
Requirement already satisfied: kiwisolver>=1.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.4.7)
Requirement already satisfied: cycler>=0.10 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.8.1)
Requirement already satisfied: six in f:\python\install_3_7_4\lib\site-packages (from cycler>=0.10->matplotlib->wordcloud==1.6.0) (1.14.0)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.6.0

3. Listing installed modules with pip list

$ pip list
Package         Version
--------------- ----------
asgiref         3.2.7
beautifulsoup4  4.9.0
bs4             0.0.1
certifi         2020.4.5.1
chardet         3.0.4
cycler          0.10.0
Django          3.0.5
idna            2.9
image           1.5.30
jieba           0.39
kiwisolver      1.2.0
matplotlib      3.2.1
numpy           1.18.2
Pillow          7.1.1
pip             19.2.3
pyparsing       2.4.7
python-dateutil 2.8.1
pytz            2019.3
requests        2.23.0
setuptools      40.8.0
six             1.14.0
soupsieve       2.0
sqlparse        0.3.1
urllib3         1.25.8
wheel           0.34.2
wordcloud       1.
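The same check can be done from inside Python, which is handy at the top of a script. This sketch relies on `importlib.metadata`, which is in the standard library only from Python 3.8 onward, so it would not run unchanged on the 3.7.4 interpreter used in this article. `installed_version` is a hypothetical helper.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(package):
    """Return the installed version string, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they appear in the pip list output
for name in ["requests", "beautifulsoup4", "jieba", "wordcloud", "Pillow"]:
    print(name, installed_version(name))
```

A value of None for any of these names means the corresponding pip install step has not succeeded yet.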
