Maoyan Movies: Scraping and Analyzing Nezha Review Data


Nezha has been a huge hit recently, so let's analyze its review data. Before any analysis we need data, so this post starts with how to scrape it. Without further ado, let's go.

First, open the film's page at url = "https://maoyan.com/films/1211270" and scroll to the comments section to see what viewers are saying:

Press F12 and check whether the comments appear in the page source — they do.

The problem is that the page only exposes a handful of comments, nowhere near enough for our analysis, so we need another way in.

In the F12 DevTools, switch on device emulation (the mobile view), refresh the page, and scroll down: now hundreds of thousands of comments are available. Click through while watching the Network tab and you will see requests to an API like url = "http://m.maoyan.com/review/v2/comments.json?movieId=1211270&userId=-1&offset=15&limit=15&ts=1568600356382&type=3". With this endpoint we are in business.
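As a sketch of how such a request could be assembled — the parameter names are simply what the DevTools capture shows, not a documented Maoyan API — the paginated URL can be built like this:

```python
import time
from urllib.parse import urlencode

def build_comments_url(movie_id, offset, limit=15):
    """Build a paginated comments.json URL matching the one seen in DevTools."""
    base = "http://m.maoyan.com/review/v2/comments.json"
    params = {
        "movieId": movie_id,            # the film id from the page URL
        "userId": -1,
        "offset": offset,               # pagination cursor, stepped by `limit`
        "limit": limit,
        "ts": int(time.time() * 1000),  # millisecond timestamp
        "type": 3,
    }
    return base + "?" + urlencode(params)
```

Stepping `offset` by 15 per request walks through the pages the mobile site loads as you scroll.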

However, as the scrape progresses, this endpoint still will not return the complete data set. After some searching on Baidu, Google, and Bing, the trick is to fetch comments by time window, which also keeps Maoyan from blocking us. So we use url = "http://m.maoyan.com/mmdb/comments/movie/1211270.json?_v_=yes&offset=0&startTime="

The response looks like this:

Now let's build the scraper:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: albert  time: 2019/9/3
import requests, json, time, csv
from fake_useragent import UserAgent  # random User-Agent strings
from datetime import datetime, timedelta

def get_content(url):
    '''Fetch the raw JSON text from the API.'''
    ua = UserAgent().random
    try:
        return requests.get(url, headers={'User-Agent': ua}, timeout=3).text
    except requests.RequestException:
        return None

def process_data(html):
    '''Pull the fields we care about out of each comment.'''
    data_set_list = []
    # parse the JSON payload
    data_list = json.loads(html)['cmts']
    for data in data_list:
        data_set = [data['id'], data['nickName'], data['userLevel'],
                    data['cityName'], data['content'], data['score'],
                    data['startTime']]
        data_set_list.append(data_set)
    return data_set_list

if __name__ == '__main__':
    # start from the current time and walk backwards
    start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    end_time = '2019-07-26 08:00:00'
    # string comparison is safe here: this timestamp format sorts lexicographically
    while start_time > end_time:
        # build the URL for the current time window
        url = ('http://m.maoyan.com/mmdb/comments/movie/1211270.json'
               '?_v_=yes&offset=0&startTime=' + start_time.replace(' ', '%20'))
        print('........')
        html = get_content(url)
        if html is None:
            # one retry after a short pause
            time.sleep(0.5)
            html = get_content(url)
        else:
            time.sleep(1)
        if not html:
            continue
        comments = process_data(html)
        if comments:
            # step the window back to one second before the oldest comment
            start_time = comments[-1][-1]
            start_time = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S') + timedelta(seconds=-1)
            start_time = datetime.strftime(start_time, '%Y-%m-%d %H:%M:%S')
            print(comments)
            # append this batch to a CSV file
            with open("comments_1.csv", "a", encoding='utf-8', newline='') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerows(comments)
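The core of the loop is the time-window stepping: each batch's oldest startTime, minus one second, becomes the next request's startTime. Isolated as a small helper (a sketch, not part of the original script):

```python
from datetime import datetime, timedelta

def step_window_back(last_start_time):
    """Move the startTime cursor to one second before the oldest comment seen."""
    t = datetime.strptime(last_start_time, '%Y-%m-%d %H:%M:%S') - timedelta(seconds=1)
    return t.strftime('%Y-%m-%d %H:%M:%S')

# step_window_back('2019-08-01 12:00:00') → '2019-08-01 11:59:59'
```

Subtracting a full second can skip other comments posted in that same second, which is an accepted trade-off in this approach: it guarantees the cursor always moves and the loop terminates.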

 

----------------------------------- Data Analysis -----------------------------------

With close to twenty thousand rows in hand, we can move on to the analysis stage:

Tools: Jupyter; library: pyecharts v1.0. pyecharts 1.x is not backward compatible with 0.x, so we have to use the new chained-call API:

Let's start with Nezha's star-rating distribution, using a pandas groupby to count the 1- to 5-star ratings:

from pyecharts import options as opts
from pyecharts.charts import Bar, Pie, Page, WordCloud
from pyecharts.globals import ThemeType, SymbolType
import pandas as pd

df = pd.read_csv('comments_1.csv',
                 names=["id", "nickName", "userLevel", "cityName",
                        "content", "score", "startTime"])
attr = ["一星", "二星", "三星", "四星", "五星"]
score = df.groupby("score").size()  # count comments per raw score
# raw scores come in half-star steps, so merge them into whole stars
value = [
    score.iloc[0] + score.iloc[1] + score.iloc[2],
    score.iloc[3] + score.iloc[4],
    score.iloc[5] + score.iloc[6],
    score.iloc[7] + score.iloc[8],
    score.iloc[9] + score.iloc[10],
]
# Pie chart
# temporary workaround: the values above cannot be passed in directly
attr = ["一星", "二星", "三星", "四星", "五星"]
value = [286, 43, 175, 764, 10101]

pie = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
    .add('', [list(z) for z in zip(attr, value)])
    .set_global_opts(title_opts=opts.TitleOpts(title='哪吒等級分析'))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{c}"))
)
pie.render_notebook()
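The script falls back to hard-coded counts because the grouped values could not be passed to pyecharts directly. A plausible cause (an assumption, not verified against the original environment) is that groupby().size() yields numpy integers, which some JSON serializers reject; casting to plain Python ints sidesteps that:

```python
import pandas as pd

# toy scores standing in for the real comments_1.csv data
df = pd.DataFrame({"score": [0.5, 1.0, 1.0, 4.5, 5.0, 5.0]})
counts = df.groupby("score").size()  # values are numpy int64
# cast to plain Python ints so they serialize cleanly to JSON
value = [int(v) for v in counts]
```

With that cast, the computed `value` list can be fed straight into `zip(attr, value)` instead of being retyped by hand.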

 

The rendered chart:

Next, a word-cloud analysis of the comment text:

import jieba
import matplotlib.pyplot as plt  # for rendering the image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

df = pd.read_csv("comments_1.csv",
                 names=["id", "nickName", "userLevel", "cityName",
                        "content", "score", "startTime"])

comments = df["content"].tolist()

# tokenize with jieba (accurate mode, cut_all=False)
comment_after_split = jieba.cut(str(comments), cut_all=False)
words = " ".join(comment_after_split)  # join tokens with spaces

stopwords = STOPWORDS.copy()
stopwords.update({"電影", "最后", "就是", "不過", "這個", "一個", "感覺",
                  "這部", "雖然", "不是", "真的", "覺得", "還是", "但是"})

bg_image = plt.imread('bg.jpg')
# build the word cloud
wc = WordCloud(
    width=1024,
    height=768,
    background_color="white",
    max_words=200,
    mask=bg_image,            # use the image as the cloud's shape mask
    stopwords=stopwords,
    max_font_size=200,
    random_state=50,
    font_path='C:/Windows/Fonts/simkai.ttf'  # a system font that supports Chinese
    ).generate(words)

# color generator based on the background image's colors
image_colors = ImageColorGenerator(bg_image)
# draw the cloud, recolored to match the background image
plt.imshow(wc.recolor(color_func=image_colors))
# hide the axes
plt.axis("off")
plt.show()
# save the cloud image
wc.to_file("評價.png")
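Beyond the shared stopword set, it can help to drop stopwords and one-character tokens before handing the text to WordCloud. A small helper sketch (the function name is hypothetical, not part of the original script):

```python
def filter_tokens(tokens, stopwords):
    """Keep tokens that are not stopwords and are longer than one character."""
    return [t for t in tokens if t not in stopwords and len(t) > 1]
```

Filtering before `" ".join(...)` keeps filler words out of the frequency counts entirely, rather than relying on WordCloud's own stopword pass.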

 

The result:


 

