Background:
Python version: 3.7.4
IDE: PyCharm
Operating system: Windows (64-bit)
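Step 1: Get a logged-in session
Scraping Douban reviews requires a logged-in user, so the login cookie has to be obtained first. Open a browser and log in to Douban (IE shows all cookies combined into a single value, which makes them easy to copy; with other browsers you have to assemble the cookies yourself). Then press F12, copy the cookie and user-agent values from the request headers, and stay logged in.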
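As a rough illustration of working with the copied values, the sketch below splits a raw cookie string into name/value pairs and builds the header dict used for the request. The function names `parse_cookie_header` and `build_headers` and the sample cookie values are hypothetical placeholders, not part of the original script.

```python
def parse_cookie_header(raw_cookie):
    """Split a raw 'Cookie' header string into a name -> value dict."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty fragments and malformed pieces
        # Split on the first '=' only, because cookie values may contain '='.
        name, value = pair.split("=", 1)
        cookies[name] = value
    return cookies


def build_headers(raw_cookie, user_agent):
    """Build the header dict that is sent along with the request."""
    return {"cookie": raw_cookie.strip(), "User-Agent": user_agent}


# Placeholder values; substitute what the browser's developer tools show.
raw = "bid=abc123; dbcl2=0000:xyz; ck=AbCd"
cookies = parse_cookie_header(raw)
print(cookies["bid"])      # abc123
print("dbcl2" in cookies)  # True
```

If the same cookie name appears more than once in the raw string, the dict keeps only the last value. requests also accepts such a dict through the `cookies=` argument of `requests.get`, as an alternative to sending the raw string in the headers.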
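Step 2: Analyze the HTML
As a simple test, fetch one page of reviews for House M.D. Inspecting how the review data is laid out in the HTML shows that the text we need is the content of the second p tag under a td under a tr.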
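To make the structure concrete, here is a small sketch that applies the selector to a hand-written HTML fragment. The fragment is hypothetical and only imitates the tr > td > p layout described above; the real Douban markup contains more elements and attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure described above.
SAMPLE_HTML = """
<table>
  <tr>
    <td>
      <p>user_a 2020-04-16</p>
      <p>First review text</p>
    </td>
  </tr>
  <tr>
    <td>
      <p>user_b 2020-04-15</p>
      <p>Second review text</p>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Select the second <p> of each <td> that is a direct child of a <tr>.
paragraphs = soup.select("tr > td > p:nth-of-type(2)")
# get_text() drops the surrounding <p> tags and returns only the text.
reviews = [p.get_text(strip=True) for p in paragraphs]
print(reviews)  # ['First review text', 'Second review text']
```

`select` returns tag objects, so calling `get_text` on each one is one way to obtain plain strings instead of markup.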
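Step 3: Write the code
BeautifulSoup handles the HTML parsing. A simplified Python script follows. (The output is UTF-8 text, so it is advisable to set the output encoding explicitly.)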
#!/usr/bin/python
# -*- coding: utf-8 -*-
import io
import sys

import requests
from bs4 import BeautifulSoup

# Force UTF-8 output so that Chinese text prints correctly on a Windows console.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

url = 'https://movie.douban.com/subject/1442129/collections?start=20'
headers = {
    # Cookie and User-Agent copied from the logged-in browser session.
    'cookie': 'll=118172; bid=nO_yhRGdS8c; __utma=30149280.744941980.1587025849.1587025849.1587025849.1; __utmb=30149280.7.10.1587025849; __utmz=30149280.1587025849.1.1.utmcsr=so.com|utmccn=(referral)|utmcmd=referral|utmcct=/link; __utmt=1; push_noty_num=0; push_doumail_num=0; __utmv=30149280.18122; douban-profile-remind=1; __utmc=30149280; dbcl2=181229630:peNlRIftZSU; ck=0DBS; _vwo_uuid_v2=D6F0A378B72943607FFB8D0DE9AA9E4F2|e4b22c328b795c724132d4d5a5551615; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1587025959%2C%22https%3A%2F%2Fwww.douban.com%2Fsearch%3Fsource%3Dsuggest%26q%3D%25E9%2587%258D%25E7%2594%259F%22%5D; _pk_id.100001.4cf6=55b0d18436426829.1587025959.1.1587025959.1587025959.; _pk_ses.100001.4cf6=*; __utma=223695111.917770948.1587025959.1587025959.1587025959.1; __utmb=223695111.0.10.1587025959; __utmc=223695111; __utmz=223695111.1587025959.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/search; __yadk_uid=wBD152Qkg8CojaIRAPIB7nXOYiwGgYAj',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
}

# A timeout keeps the script from hanging if the server does not respond.
html = requests.get(url, headers=headers, timeout=10).text

# Named `soup` rather than `bs4` to avoid confusion with the bs4 package.
soup = BeautifulSoup(html, 'html.parser')

# The second <p> inside each <td> of each <tr> holds the review text.
print(soup.select("tr > td > p:nth-of-type(2)"))
The scraped reviews look like this (a rule can be added to strip the p tags):
[<p>看之前:不就是個醫療劇能拍出什么花??
看之后:為什么一個醫療劇可以拍出這么多花??</p>, <p>高中時期的下飯劇</p>]
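The script above fetches a single page and prints the result, while the next step reads the reviews from a text file. The sketch below shows one possible way to bridge the two: a helper that builds page URLs and a helper that writes reviews to a UTF-8 file. Both helper names are hypothetical, and the page size of 20 is an assumption inferred from the `start=20` offset in the URL, not something Douban documents here.

```python
import os
import tempfile

BASE_URL = "https://movie.douban.com/subject/1442129/collections"


def page_urls(page_count, page_size=20):
    """Build URLs for the first `page_count` pages.

    page_size=20 is an assumption inferred from the `start=20` offset.
    """
    return [f"{BASE_URL}?start={i * page_size}" for i in range(page_count)]


def save_reviews(reviews, path):
    """Write one review per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for review in reviews:
            # Collapse internal line breaks so each review stays on one line.
            f.write(" ".join(review.split()) + "\n")


print(page_urls(3))

# Demo output path in the temp directory; replace it with the real target file.
out_path = os.path.join(tempfile.gettempdir(), "reviews_demo.txt")
save_reviews(["高中時期的下飯劇", "line one\nline two"], out_path)
```

When looping over several pages, pausing between requests (for example with `time.sleep`) is a reasonable courtesy to the server.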
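Step 4: Turn the scraped reviews into a word cloud
The main modules used are jieba, wordcloud, and image, all of which can be installed with pip. Note that the code itself imports PIL, which is provided by the Pillow package, along with numpy and matplotlib. The word cloud code is as follows:
Location of the scraped review data: F:\python\install_3_7_4\txt\haosiyisheng.txt
Location of a House M.D. still image found online: F:\python\install_3_7_4\txt\haosiyisheng.png
Location of the font used for the word cloud: C:/Windows/Fonts/msyh.ttc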
#!/usr/bin/python
# -*- coding: utf-8 -*-
import io
import sys

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')


def get_word_cloud():
    path_txt = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.txt"
    path_img = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.png"

    # Read the scraped reviews; the with-block closes the file afterwards.
    with open(path_txt, 'r', encoding='UTF-8') as f:
        text = f.read()

    # The still image serves as a mask that shapes the word cloud.
    background_image = np.array(Image.open(path_img))

    # jieba segments the Chinese text; WordCloud expects space-separated words.
    cut_text = " ".join(jieba.cut(text))

    wordcloud = WordCloud(
        font_path="C:/Windows/Fonts/msyh.ttc",  # a font with Chinese glyphs
        background_color="white",
        mask=background_image
    ).generate(cut_text)

    # Save the image, then display it without axes.
    wordcloud.to_file("haosiyisheng_result.png")
    fig, ax = plt.subplots()
    ax.imshow(wordcloud)
    ax.axis("off")
    plt.show()


if __name__ == '__main__':
    get_word_cloud()
Figure: the final word cloud.
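One optional refinement, which is not part of the original script, is to filter out very common words before drawing, since particles such as 的 tend to dominate review text. The sketch below counts tokens with the standard library. The stop-word set and the function name `word_frequencies` are hypothetical, and the token list stands in for the output of `jieba.cut`.

```python
from collections import Counter

# Hypothetical stop words; extend the set to suit the corpus.
STOP_WORDS = {"的", "了", "是", "一個"}


def word_frequencies(tokens, stop_words=STOP_WORDS, min_length=2):
    """Count tokens, skipping stop words and tokens shorter than min_length."""
    cleaned = (token.strip() for token in tokens)
    return Counter(
        token
        for token in cleaned
        if len(token) >= min_length and token not in stop_words
    )


# Stand-in for the tokens produced by jieba.cut(text).
tokens = ["醫療劇", "的", "醫療劇", "下飯", "了", "一個", "花"]
freqs = word_frequencies(tokens)
print(freqs.most_common(1))  # [('醫療劇', 2)]
```

A mapping like this can be passed to `WordCloud.generate_from_frequencies` in place of calling `generate` on the joined text. WordCloud also has its own `stopwords` parameter, which may be enough for simpler cases.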
Step 5: Errors encountered while coding, and how to fix them
1. Fixing the error: ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
Running pip install <module> often fails with this error:
ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
Switching pip to a different package index can help. Mirrors in China:
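http://pypi.douban.com/ Douban
http://pypi.hustunique.com/ Huazhong University of Science and Technology
http://pypi.sdutlinux.org/ Shandong University of Technology
http://pypi.mirrors.ustc.edu.cn/ University of Science and Technology of China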
The simplest approach is to specify the index directly on the command line. For example, to use the Douban mirror:
pip install -i https://pypi.douban.com/simple <package to install>
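2. Installing wordcloud
Installing wordcloud did not go smoothly at first. The steps that worked are as follows.
First, open this link: https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud
Download the wordcloud build that matches your Python version. My Python version is 3.7, so I downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl. (My computer runs 64-bit Windows. The platform tag has to match the Python interpreter rather than the operating system, so the win32 build is correct for a 32-bit Python, while a 64-bit Python needs the win_amd64 build.)
Install the wheel module, which this setup used for installing .whl files (recent pip versions can generally install .whl files on their own):
pip install wheel
Copy the downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl file into the Scripts directory under the Python installation directory, then run the following from that location: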
$ pip install wordcloud-1.6.0-cp37-cp37m-win32.whl
Processing f:\python\install_3_7_4\scripts\wordcloud-1.6.0-cp37-cp37m-win32.whl
Requirement already satisfied: pillow in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (7.1.1)
Requirement already satisfied: numpy>=1.6.1 in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (1.18.2)
Requirement already satisfied: matplotlib in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (3.2.1)
Requirement already satisfied: kiwisolver>=1.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.4.7)
Requirement already satisfied: cycler>=0.10 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.8.1)
Requirement already satisfied: six in f:\python\install_3_7_4\lib\site-packages (from cycler>=0.10->matplotlib->wordcloud==1.6.0) (1.14.0)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.6.0
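Picking the right .whl file depends on the Python interpreter, not on the operating system: a 32-bit Python running on 64-bit Windows still needs the win32 build. The sketch below is one way to check which build the running interpreter would need. The function names are hypothetical, and the tag guess only covers CPython on Windows.

```python
import struct
import sys


def interpreter_bits():
    """Return 32 or 64, based on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


def expected_wheel_tags():
    """Guess the version and platform tags for a Windows CPython wheel."""
    version_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    platform_tag = "win_amd64" if interpreter_bits() == 64 else "win32"
    return version_tag, platform_tag


print(interpreter_bits())
print(expected_wheel_tags())
```

On a 32-bit Python 3.7 this would suggest the cp37 and win32 tags, which is consistent with the file name installed above.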
3. Listing installed modules with pip list
$ pip list
Package         Version
--------------- ----------
asgiref         3.2.7
beautifulsoup4  4.9.0
bs4             0.0.1
certifi         2020.4.5.1
chardet         3.0.4
cycler          0.10.0
Django          3.0.5
idna            2.9
image           1.5.30
jieba           0.39
kiwisolver      1.2.0
matplotlib      3.2.1
numpy           1.18.2
Pillow          7.1.1
pip             19.2.3
pyparsing       2.4.7
python-dateutil 2.8.1
pytz            2019.3
requests        2.23.0
setuptools      40.8.0
six             1.14.0
soupsieve       2.0
sqlparse        0.3.1
urllib3         1.25.8
wheel           0.34.2
wordcloud       1.