First, we find the page url = "https://maoyan.com/films/1211270" and open the comment section to see what viewers are saying, as shown below.
Open the developer tools with F12 and check whether the comment data actually appears in the page source; it turns out it does.
The problem is that only a handful of comments are exposed here, which is nowhere near enough for our analysis, so we have to find another way in.
The developer tools (F12) also offer a mobile device emulation mode. Switch to it, refresh the page, and scroll down until you see the entry listing several hundred thousand comments. Click into it and the Network panel reveals the api url = "http://m.maoyan.com/review/v2/comments.json?movieId=1211270&userId=-1&offset=15&limit=15&ts=1568600356382&type=3". Once we have this endpoint we can get to work; a quick sanity check of it is sketched right below.
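Before building anything bigger, it is worth confirming that this endpoint answers a plain GET and returns JSON we can parse. A minimal sketch, assuming only a browser-like User-Agent is needed and that ts appears to be a millisecond timestamp; nothing here is the final crawler:

# Quick sanity check against the mobile comment API (illustrative sketch only).
import time
import requests

movie_id = 1211270
ts = int(time.time() * 1000)   # ts looks like a millisecond timestamp in the captured request
url = ("http://m.maoyan.com/review/v2/comments.json"
       f"?movieId={movie_id}&userId=-1&offset=15&limit=15&ts={ts}&type=3")

resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=3)
print(resp.status_code)        # expect 200 if the endpoint is reachable
print(resp.json().keys())      # inspect the top-level fields the API returns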
However, as the crawl goes on, this endpoint still refuses to hand over the complete data set. After some searching on Baidu, Google, and Bing, the answer is to fetch comments by time window instead, which also keeps Maoyan from blocking us. So we switch to the url = "http://m.maoyan.com/mmdb/comments/movie/1211270.json?_v_=yes&offset=0&startTime="; a single-page sketch of how it is called follows.
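The startTime value is a 'YYYY-MM-DD HH:MM:SS' timestamp with the space URL-encoded as %20. A small sketch of fetching one page; the 'cmts' field and its contents match what the crawler below parses, while the timestamp chosen here is just an example:

# Fetch a single page of comments posted before a given startTime (sketch).
import requests

base = "http://m.maoyan.com/mmdb/comments/movie/1211270.json?_v_=yes&offset=0&startTime="
start_time = "2019-09-03 12:00:00"            # any timestamp; the API returns comments before it
url = base + start_time.replace(" ", "%20")   # encode the space so the URL stays valid

page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=3).json()
print(len(page["cmts"]))                      # each page carries a 'cmts' list of up to 15 comments
print(page["cmts"][0]["startTime"])           # these timestamps let us page backwards through history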
The response looks like this:
Now let's put the crawler together:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: albert  time: 2019/9/3
import requests, json, time, csv
from fake_useragent import UserAgent          # random User-Agent strings
from datetime import datetime, timedelta


def get_content(url):
    '''Fetch the raw JSON text returned by the comment API.'''
    ua = UserAgent().random
    try:
        data = requests.get(url, headers={'User-Agent': ua}, timeout=3).text
        return data
    except requests.RequestException:
        return None


def Process_data(html):
    '''Extract the fields we care about from one page of comments.'''
    data_set_list = []
    # parse the JSON payload; each comment lives in the 'cmts' list
    data_list = json.loads(html)['cmts']
    for data in data_list:
        data_set = [data['id'], data['nickName'], data['userLevel'], data['cityName'],
                    data['content'], data['score'], data['startTime']]
        data_set_list.append(data_set)
    return data_set_list


if __name__ == '__main__':
    # start from the current time and page backwards through history
    start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    end_time = '2019-07-26 08:00:00'          # the film's release date; stop once we reach it

    while start_time > end_time:
        # build the URL for the current time window (the space must be encoded as %20)
        url = ('http://m.maoyan.com/mmdb/comments/movie/1211270.json?_v_=yes&offset=0&startTime='
               + start_time.replace(' ', '%20'))
        print('........')
        html = get_content(url)
        if html is None:                      # request failed, retry once after a short pause
            time.sleep(0.5)
            html = get_content(url)
        else:
            time.sleep(1)                     # be polite and avoid getting blocked
        if not html:
            continue
        comments = Process_data(html)
        if comments:
            # the last comment on the page is the oldest; step one second past it
            start_time = comments[-1][-1]
            start_time = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S') + timedelta(seconds=-1)
            start_time = datetime.strftime(start_time, '%Y-%m-%d %H:%M:%S')
            print(comments)
            # append this page to the CSV
            with open('comments_1.csv', 'a', encoding='utf-8', newline='') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerows(comments)
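Because each iteration steps back one second past the oldest comment on the page, the same comment can occasionally be written twice when pages overlap. A small post-processing sketch that deduplicates the CSV by comment id before the analysis; the column order follows what Process_data writes, and the file name matches the crawler above:

# Deduplicate the crawled CSV by comment id (sketch; run once before the analysis below).
import pandas as pd

cols = ["id", "nickName", "userLevel", "cityName", "content", "score", "startTime"]
df = pd.read_csv("comments_1.csv", names=cols)

before = len(df)
df = df.drop_duplicates(subset="id")   # comment ids are unique, so duplicates come from overlapping pages
df.to_csv("comments_1.csv", index=False, header=False, encoding="utf-8")
print(f"{before} rows -> {len(df)} rows after deduplication")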
----------------------------------- Data Analysis -----------------------------------
With close to twenty thousand records in hand, we move on to the analysis stage:
Tools: Jupyter; library: pyecharts v1.0 ===> pyecharts v1 is not backward compatible with the old 0.x API, so we have to use the new chained-call style, as in the short demo below:
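As a tiny illustration of that chained style, here is a bar chart built with pyecharts v1; the data reuses the star counts from this post, and the chart itself is only a demo, not part of the analysis:

# Minimal pyecharts v1 chained-call demo (data taken from the star counts below).
from pyecharts import options as opts
from pyecharts.charts import Bar

bar = (
    Bar()
    .add_xaxis(["一星", "二星", "三星", "四星", "五星"])
    .add_yaxis("評論數", [286, 43, 175, 764, 10101])
    .set_global_opts(title_opts=opts.TitleOpts(title="chained-call demo"))
)
bar.render_notebook()   # render inline in Jupyter, same as the pie chart below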
Let's start with the star-rating breakdown for 哪吒 (Ne Zha), using pandas groupby to count the comments behind each of the 1-5 star levels:
from pyecharts import options as opts
from pyecharts.charts import Pie
from pyecharts.globals import ThemeType
import pandas as pd

# the CSV has seven columns, in the order written by the crawler
df = pd.read_csv('comments_1.csv',
                 names=["id", "nickName", "userLevel", "cityName", "content", "score", "startTime"])

attr = ["一星", "二星", "三星", "四星", "五星"]
score = df.groupby("score").size()   # count comments per score (0, 0.5, 1, ..., 5)
# map the eleven half-point scores onto five stars
# (this indexing assumes every half-point score occurs at least once)
value = [
    score.iloc[0] + score.iloc[1] + score.iloc[2],
    score.iloc[3] + score.iloc[4],
    score.iloc[5] + score.iloc[6],
    score.iloc[7] + score.iloc[8],
    score.iloc[9] + score.iloc[10],
]

# pie chart; the original post fell back to hard-coded counts at this point
attr = ["一星", "二星", "三星", "四星", "五星"]
value = [286, 43, 175, 764, 10101]

pie = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
    .add('', [list(z) for z in zip(attr, value)])
    .set_global_opts(title_opts=opts.TitleOpts(title='哪吒等級分析'))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{c}"))
)
pie.render_notebook()
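The hard-coded value list is the post's workaround; once the CSV is read with the full seven-column header, the same numbers can be derived from the score column without relying on positional iloc indexing. A sketch using pd.cut, assuming Maoyan scores run from 0 to 5 in half-point steps:

# Derive the star counts from the score column instead of hard-coding them (sketch).
import pandas as pd

cols = ["id", "nickName", "userLevel", "cityName", "content", "score", "startTime"]
df = pd.read_csv("comments_1.csv", names=cols)

# bin half-point scores into five stars: (0,1] -> 一星, (1,2] -> 二星, ..., (4,5] -> 五星
stars = pd.cut(df["score"], bins=[-0.1, 1, 2, 3, 4, 5],
               labels=["一星", "二星", "三星", "四星", "五星"])
value = stars.value_counts().reindex(["一星", "二星", "三星", "四星", "五星"]).tolist()
print(value)   # can be zipped with attr and fed straight into the Pie chart above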
The resulting chart:
Next, the word-cloud analysis:
import jieba
import matplotlib.pyplot as plt                 # for rendering the cloud
import pandas as pd
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

df = pd.read_csv("comments_1.csv",
                 names=["id", "nickName", "userLevel", "cityName", "content", "score", "startTime"])
comments = df["content"].tolist()

# segment the comment text with jieba (accurate mode, cut_all=False)
comment_after_split = jieba.cut(str(comments), cut_all=False)
words = " ".join(comment_after_split)           # join the tokens with spaces for WordCloud

# extend the default stop-word list with common filler words from the comments
stopwords = STOPWORDS.copy()
stopwords.update({"電影", "最后", "就是", "不過", "這個", "一個", "感覺", "這部",
                  "雖然", "不是", "真的", "覺得", "還是", "但是"})

bg_image = plt.imread('bg.jpg')                 # mask image that shapes the cloud

wc = WordCloud(
    width=1024,
    height=768,
    background_color="white",
    max_words=200,
    mask=bg_image,                              # use the picture as the cloud's outline
    stopwords=stopwords,
    max_font_size=200,
    random_state=50,
    font_path='C:/Windows/Fonts/simkai.ttf'     # a Chinese font shipped with Windows, for CJK rendering
).generate(words)

# recolor the cloud using the colors of the mask image
image_colors = ImageColorGenerator(bg_image)
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis("off")                                 # hide the axes around the image
plt.show()
wc.to_file("評價.png")                          # save the rendered cloud
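Raw frequency clouds can still be dominated by filler words even after stop-word filtering. As an optional refinement, not part of the original post, jieba's TF-IDF keyword extraction can rank terms first; this sketch reuses the comments list built above:

# Optional: rank comment keywords by TF-IDF with jieba.analyse (sketch).
import jieba.analyse

text = str(comments)                                    # same joined comment text as above
keywords = jieba.analyse.extract_tags(text, topK=50, withWeight=True)
for word, weight in keywords[:10]:
    print(f"{word}\t{weight:.4f}")                      # top keywords and their TF-IDF weights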
The result:
I'm still a beginner. Sharing is what makes this worthwhile, so if it helped you, remember to like, comment, and follow!