Scraping Douban movie short reviews with Python and generating a word cloud with wordcloud


I recently reached word clouds in data visualization, and since I had just been learning web scraping, I have been crawling all kinds of sites.

[Experiment] Scrape the reviews of the Douban movie Spirited Away (千與千尋) and generate a word cloud

1. Use a scraper to obtain the review text

2. Process the text and generate the word cloud

Step 1: Prepare the data

You must be logged in to Douban to read the short-review text: https://movie.douban.com/subject/1291561/comments

First obtain the cookies, using the Firefox browser (well suited to scraping work).

Copy the cookie string into a cookies.txt file for later use.
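The cookie string copied from the browser looks like name1=value1; name2=value2; and the scraper below parses exactly this format. A standalone sketch of that parsing (the cookie names and values here are made up for illustration):

```python
# Parse a browser-style cookie string into the dict that requests expects.
# The cookie names and values below are made up for illustration.
raw = 'bid=abc123; dbcl2="12345:token"; ck=xyz'

cookies = {}
for item in raw.split(';'):
    name, value = item.strip().split('=', 1)  # split only on the first '='
    cookies[name] = value

print(cookies)  # {'bid': 'abc123', 'dbcl2': '"12345:token"', 'ck': 'xyz'}
```

Splitting on the first '=' only matters because cookie values may themselves contain '=' characters.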

Step 2: Write the scraper code

# coding: utf-8
import requests
import time
import random
from bs4 import BeautifulSoup

abss = 'https://movie.douban.com/subject/1291561/comments'
firstPag_url = 'https://movie.douban.com/subject/1291561/comments?start=20&limit=20&sort=new_score&status=P&percent_type='
url = 'https://movie.douban.com/subject/1291561/comments?start=0&limit=20&sort=new_score&status=P'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0',
    'Connection': 'keep-alive'
}

def get_data(html):
    # Extract the comments, the "next page" link and the dates from one page
    soup = BeautifulSoup(html, 'lxml')
    comment_list = soup.select('.comment > p')
    pager = soup.select('#paginator > a')
    next_page = pager[2].get('href') if len(pager) > 2 else None  # no next link on the last page
    date_nodes = soup.select('.comment-time')  # was '..comment-time', an invalid selector
    return comment_list, next_page, date_nodes

def get_cookies(path):
    # Load the cookie string from the file and convert it into a dict
    cookies = {}
    with open(path, 'r') as f_cookies:
        for line in f_cookies.read().split(';'):
            name, value = line.strip().split('=', 1)  # split only on the first '='
            cookies[name] = value
    return cookies

def save_page(comment_list, date_nodes):
    # Append one page of dates and comments to comments.txt
    with open("comments.txt", 'a', encoding='utf-8') as f:
        for comment, date in zip(comment_list, date_nodes):
            text = comment.get_text().strip().replace("\n", "")
            f.writelines(date.get_text().strip() + '\n' + text + '\n')

if __name__ == '__main__':
    cookies = get_cookies('cookies.txt')  # the cookies file saved earlier
    html = requests.get(firstPag_url, cookies=cookies, headers=header).content
    comment_list, next_page, date_nodes = get_data(html)  # start from the first page
    save_page(comment_list, date_nodes)
    while next_page:  # keep following the "next page" link
        print(abss + next_page)
        html = requests.get(abss + next_page, cookies=cookies, headers=header).content
        comment_list, next_page, date_nodes = get_data(html)
        save_page(comment_list, date_nodes)
        time.sleep(1 + random.randint(1, 100) / 20)  # randomized delay between requests

  

Each page holds 20 short reviews, so we walk through the pages one by one.
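Since the start query parameter advances in steps of 20, the page URLs could also be generated directly instead of following the "next page" link; a sketch whose query parameters mirror the URLs used above:

```python
# Build the comment-page URLs directly from the page index.
# The query parameters mirror the URLs used in the scraper above.
base = 'https://movie.douban.com/subject/1291561/comments'

def page_url(page_index, page_size=20):
    start = page_index * page_size  # each page holds 20 reviews
    return '{}?start={}&limit={}&sort=new_score&status=P'.format(base, start, page_size)

print(page_url(0))  # first page: start=0
print(page_url(2))  # third page: start=40
```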

Step 3: Process the scraped data. Step 2 saved the data to the comments.txt file.

# -*- coding: utf-8 -*-
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
from imageio import imread  # scipy.misc.imread was removed from recent SciPy versions

words = []
with open("comments.txt", 'r', encoding='utf-8') as f_comment:
    for line in f_comment.readlines():
        if len(line.strip()) == 10:
            continue  # skip the date lines ('YYYY-MM-DD')
        A = jieba.cut(line)  # segment the Chinese text into words
        words.append(" ".join(A))

# remove stop words (mostly punctuation and digits)
stopwords = [',','。','【','】', '”','“',',','《','》','!','、','?','.','…','1','2','3','4','5','[',']','(',')',' ']
new_words = []
for sent in words:
    word_in = sent.split(' ')
    new_word_in = []
    for word in word_in:
        if word in stopwords:
            continue
        new_word_in.append(word)
    new_words.append(" ".join(new_word_in))

# flatten the sentences into a single word list and drop empty tokens
final_words = []
for sent in new_words:
    final_words += sent.split(' ')
final_words_flt = []
for word in final_words:
    if word == '':  # split(' ') yields '' for runs of spaces, never ' '
        continue
    final_words_flt.append(word)

text = " ".join(final_words_flt)

After this processing we have the high-frequency words joined by spaces.
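The three filtering passes above can also be collapsed into a single one: keep every token that is neither a stop word nor empty. A minimal sketch on a pre-segmented sample (the tokens are invented so it runs without jieba):

```python
# One-pass stop-word filtering over already-segmented tokens.
stopwords = {',', '。', '!', '?', '…', ' ', ''}

# sample tokens as jieba.cut might return them (invented for illustration)
tokens = ['畫面', ',', '很', '美', '', '!']
kept = [t for t in tokens if t not in stopwords]
text = " ".join(kept)
print(text)  # 畫面 很 美
```

Using a set for the stop words also makes each membership test O(1) instead of scanning a list.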

Step 4: Generate the word cloud

First install the wordcloud library:

pip install wordcloud

  

 

Append the following code after the text variable from the processing step above to generate the word cloud:

font = r'C:\Windows\Fonts\FZSTK.TTF'  # a font containing Chinese glyphs
bk = imread("bg.png")  # background image that shapes the cloud
wc = WordCloud(collocations=False, mask=bk, font_path=font, width=1400, height=1400, margin=2).generate(text.lower())
image_colors = ImageColorGenerator(bk)  # take the colors from the background image
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis("off")
plt.figure()
plt.imshow(bk, cmap=plt.cm.gray)  # show the background image itself for comparison
plt.axis("off")
plt.show()
wc.to_file('word_cloud1.png')  # save the word cloud as an image file

  

The wordcloud object name is written in lowercase. Generating a word cloud file takes roughly three steps:

  1. Configure the word cloud object's parameters
  2. Load the text
  3. Output the word cloud file (if not specified otherwise, the default image size is 400x200)

The two key methods:

  1. wordcloud.generate(text) loads the text into the wordcloud object, e.g. wordcloud.generate("python && wordcloud")
  2. wordcloud.to_file(filename) saves the word cloud as an image in .png or .jpg format, e.g. wordcloud.to_file("picture.png")

Internally, wordcloud performs the word-frequency statistics in four steps:

    1. Split: separate the text into words on whitespace
    2. Count: count how often each word occurs and filter
    3. Font: assign each word a font size matching its count
    4. Layout: arrange the words on the canvas
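The split-and-count part of those steps can be imitated in plain Python with collections.Counter (the font sizing and layout are what wordcloud itself adds); a rough sketch:

```python
from collections import Counter

text = "python wordcloud python douban wordcloud python"

tokens = text.split()     # 1. split on whitespace
counts = Counter(tokens)  # 2. count occurrences ...
frequent = {w: n for w, n in counts.items() if n >= 2}  # ... and filter rare words
# 3./4. wordcloud then maps counts to font sizes and lays the words out
print(frequent)  # {'python': 3, 'wordcloud': 2}
```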

Finally we can see the processed high-frequency words from the short reviews.

We pick an arbitrary image and read its colors for the background.

The final word cloud comes out like this:

 

