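I. Background
In the era of big data, living standards have risen considerably, and viewers have their own opinions about the videos they watch, which they express by commenting on them. This project scrapes the comments on a Bilibili (B站) video and presents them as a word cloud.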
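II. Development Environment, Hardware Support, and Functional Description
Development environment:
Python 3.7.4 + PyCharm 2020.1.3
Python is the runtime that executes the code; PyCharm is the editor (IDE) used to write it.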
III. Training Objective
Scrape the comment data of a specified Bilibili video and generate a word-cloud visualization with stylecloud.
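IV. Training Content
1. Use web-scraping techniques to fetch the comment data from a Bilibili video page:
Code screenshots and result screenshots: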
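A. A large User-Agent list, to reduce the chance of being blocked by anti-scraping measures
B. Imports
a) requests-html: the request module, used to send HTTP requests
b) jsonpath: the parsing module, used to parse the JSON comment data
c) wordcloud: the visualization module, used to draw the word cloud
d) numpy: a numerical and data-analysis module, used here to turn the mask image into an array
e) os: creates the folder in which the results are saved
f) xlutils, xlrd, xlwt: used to save the comments to an Excel file
g) time: used for adding delays and for timestamp conversion
h) random: used to pick a random User-Agent and to generate random delay times
i) re: used for regular-expression parsing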
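C. Initialization: set the video URL, the comment-API URL template, the starting page number, and the containers for the collected comments
D. Send the request and get the response data (the response object), from which the video's aid is extracted
E. Request the video address entered by the user
F. Parse the comments, using is_running to control paging to the next page
G. Generate the word-cloud image
H. Parse the comment list with jsonpath and convert each Unix timestamp (seconds since the epoch, UTC) into a formatted time
I. After the comment content is obtained, save the data to an Excel sheet
J. Generate the map-shaped word cloud, using a map image as the background mask
K. Screenshot of the program output: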
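Steps A, D, and H can be tried offline with the standard library alone. The following is a minimal sketch, not the report's actual implementation: the helper names build_headers, extract_aids, and format_ctime, the two-entry USER_AGENTS list, and the sample_html fragment are all hypothetical, and the regular expression matches digits only instead of using the lazy pattern from the report's code.

```python
import random
import re
import time

# A short stand-in for the report's USER_AGENT_LIST (two entries copied from it).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0",
]


def build_headers():
    """Step A: return request headers with a randomly chosen User-Agent."""
    return {"user-agent": random.choice(USER_AGENTS)}


def extract_aids(html):
    """Step D: return the distinct numeric aid values in the page source, in order."""
    seen = []
    for aid in re.findall(r'"aid":(\d+)', html):
        if aid not in seen:
            seen.append(aid)
    return seen


def format_ctime(ctime):
    """Step H: convert a Unix timestamp in seconds to a local-time string."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(ctime)))


# Hypothetical page fragment, used only to exercise the helpers without a network call.
sample_html = '{"aid":123456,"bvid":"BV1xx"} ... {"aid":123456,"cid":1}'

print(build_headers())
print(extract_aids(sample_html))
print(format_ctime(1623082600))
```

Keeping the first-seen order when removing duplicates makes the result deterministic, whereas converting through set() leaves the order of the aid values unspecified.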
Code of blblMapObjSpider.py:
import os
import random
import re
import time

import matplotlib.pyplot as plt
import numpy as np
import stylecloud
import xlrd
import xlwt
from jsonpath import jsonpath
from PIL import Image
from requests_html import HTMLSession
from wordcloud import WordCloud
from xlutils.copy import copy

# Pool of User-Agent strings; one is chosen at random for every request
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]

session = HTMLSession()


class BZSpider(object):

    def __init__(self):
        # Comment texts used for the map-shaped word cloud
        self.yun_list = []
        # Comment API template; the two placeholders are the page number (next) and the video's aid (oid)
        self.start_url = 'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid={}&mode=3&plat=1&_=1623082600632'
        # Loop condition
        self.is_running = True
        # Page counter
        self.start_page = 1
        # Container for all comment texts
        self.big_list = []
        # self.pinglun_url = input('請輸入視頻的鏈接')  # prompt text: "enter the video link"
        self.pinglun_url = 'https://www.bilibili.com/video/BV1PN411X7QW?from=search&seid=13620076173109636987'

    def parse_pl_url_response(self):
        """
        Request the video address and extract the aid values from the page
        :return:
        """
        headers = {
            'user-agent': random.choice(USER_AGENT_LIST)
        }
        response = session.get(self.pinglun_url, headers=headers).content.decode()
        aid_set = re.findall(r'"aid":(.*?),', response)
        aid_list = list(set(aid_set))
        for aid in aid_list:
            self.parse_start_url(aid)
        # Generate the stylecloud image once all comments have been collected
        self.parse_c_y_img()

    def parse_start_url(self, aid):
        """
        Fetch and parse the comments of one video, page by page
        :return:
        """
        # Reset the loop state so that every aid starts from page 1
        self.is_running = True
        self.start_page = 1
        while self.is_running:
            headers = {
                'user-agent': random.choice(USER_AGENT_LIST)
            }
            response = session.get(self.start_url.format(self.start_page, aid), headers=headers).json()
            # Extract the comment list with jsonpath (jsonpath returns False when nothing matches)
            result = jsonpath(response, '$..replies')
            data_replies = result[0] if result else None
            # Loop exit: JSON null is decoded to None, so stop on a missing or empty list
            if not data_replies:
                self.is_running = False
                break
            # Parse the comment list
            self.parse_data_replies(data_replies)
            # Loop exit: stop after 10 pages
            if self.start_page == 10:
                self.is_running = False
            # Page counter +1
            self.start_page += 1

    def parse_c_y_img(self):
        """
        Generate the stylecloud word cloud
        :return:
        """
        print('--------------詞雲圖生成中logging--------------')  # "generating word cloud"
        data = ''.join(self.big_list)
        # Writes stylecloud.png (the library's default output file) to the working directory
        stylecloud.gen_stylecloud(data, font_path="C:/Windows/Fonts/simfang.ttf")
        img = Image.open("stylecloud.png")
        img.show()
        print('\n' + '----------------------詞雲圖已生成---------------------' + '\n')  # "word cloud generated"

    def parse_data_replies(self, data_replies):
        """
        Parse the comment list
        :param data_replies: list of comment dicts from the API response
        :return:
        """
        for dict_data in data_replies:
            message = jsonpath(dict_data, '$..message')
            c_time = jsonpath(dict_data, '$..ctime')
            for text, temp in zip(message, c_time):
                # Convert the Unix timestamp to a local-time string
                timeArray = time.localtime(int(temp))
                otherStyleTime = time.strftime("%Y--%m--%d %H:%M:%S", timeArray)
                self.big_list.append(text)
                data = {
                    '評論數據': [otherStyleTime, text]
                }
                self.save_excel(data)
                self.yun_list.append(text)
                print('評論數據保存一條完成----logging!!!')  # "one comment saved"

    def save_excel(self, data):
        # data maps a sheet name to one row of values, e.g.
        # data = {
        #     '基本詳情': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
        # }
        os_path_1 = os.getcwd() + '/數據/'
        if not os.path.exists(os_path_1):
            os.mkdir(os_path_1)
        # os_path = os_path_1 + self.os_path_name + '.xls'
        os_path = os_path_1 + '評論數據.xls'
        if not os.path.exists(os_path):
            # Create a new workbook (that is, a new Excel file)
            workbook = xlwt.Workbook(encoding='utf-8')
            # Create a new sheet
            worksheet1 = workbook.add_sheet("評論數據", cell_overwrite_ok=True)
            borders = xlwt.Borders()  # Create Borders
            # Thin solid borders on all four sides
            borders.left = xlwt.Borders.THIN
            borders.right = xlwt.Borders.THIN
            borders.top = xlwt.Borders.THIN
            borders.bottom = xlwt.Borders.THIN
            borders.left_colour = 0x40
            borders.right_colour = 0x40
            borders.top_colour = 0x40
            borders.bottom_colour = 0x40
            style = xlwt.XFStyle()  # Create Style
            style.borders = borders  # Add Borders to Style
            # Centered alignment
            al = xlwt.Alignment()
            al.horz = 0x02  # horizontally centered
            al.vert = 0x01  # vertically centered
            style.alignment = al
            # Merge row 0 to row 0, column 0 to column 13
            # worksheet1.write_merge(0, 0, 0, 13, '基本詳情', style)
            excel_data_1 = ('評論時間', '評論內容')  # header row: comment time, comment text
            for i in range(0, len(excel_data_1)):
                worksheet1.col(i).width = 2560 * 3
                # row, column, content, style
                worksheet1.write(0, i, excel_data_1[i], style)
            workbook.save(os_path)
        # Check that the workbook file exists
        if os.path.exists(os_path):
            # Open the workbook
            workbook = xlrd.open_workbook(os_path)
            # Get the names of all sheets in the workbook
            sheets = workbook.sheet_names()
            for i in range(len(sheets)):
                for name in data.keys():
                    worksheet = workbook.sheet_by_name(sheets[i])
                    # Compare each sheet name with the key in data
                    if worksheet.name == name:
                        # Number of rows already in the sheet
                        rows_old = worksheet.nrows
                        # Copy the xlrd object into a writable xlwt object
                        new_workbook = copy(workbook)
                        # Get the i-th sheet of the copied workbook
                        new_worksheet = new_workbook.get_sheet(i)
                        for num in range(0, len(data[name])):
                            new_worksheet.write(rows_old, num, data[name][num])
                        new_workbook.save(os_path)

    def show_img(self):
        """
        Generate the map-shaped word cloud
        """
        data = ''.join(self.yun_list)
        bg = np.array(Image.open("qq.jpg"))
        mask = bg
        wc = WordCloud(width=500,  # image width (ignored when a mask is given)
                       height=500,  # image height (ignored when a mask is given)
                       mask=mask,  # mask image that shapes the word cloud
                       background_color='white',  # background color (the library default is black)
                       font_path=r'C:/Windows/Fonts/simfang.ttf',  # font (Chinese text needs a Chinese font installed locally)
                       max_font_size=400,  # largest font size
                       random_state=50,  # seed, so that the layout and colors are reproducible
                       )
        wc.generate(data)
        # Use matplotlib to display the word cloud
        plt.imshow(wc)
        plt.axis("off")
        # Save the figure as a local image
        plt.savefig('B站視頻-詞雲圖.png')
        plt.show()


if __name__ == '__main__':
    b = BZSpider()
    b.parse_pl_url_response()
    b.show_img()
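The paging logic in parse_start_url depends on recognizing the last page. JSON null is decoded to Python None, so the stop condition should test for a missing or empty list rather than compare against the string 'null'. The sketch below isolates that loop so it can be run without network access. It is only an illustration: collect_messages, fake_fetch, and FAKE_PAGES are hypothetical names, and the response layout (data.replies[].content.message) is an assumption about the API, not something confirmed by the report.

```python
def collect_messages(fetch_page, max_pages=10):
    """Collect comment texts page by page until a page has no replies.

    fetch_page(page) is a caller-supplied function that returns the decoded
    JSON response (a dict) for the given page number.
    """
    messages = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        replies = (payload.get("data") or {}).get("replies")
        if not replies:  # None (JSON null) or an empty list: no more comments
            break
        for reply in replies:
            messages.append(reply["content"]["message"])
    return messages


# Fake two-page API used only for illustration; any other page has null replies.
FAKE_PAGES = {
    1: {"data": {"replies": [{"content": {"message": "first"}, "ctime": 1623082600}]}},
    2: {"data": {"replies": [{"content": {"message": "second"}, "ctime": 1623082700}]}},
}


def fake_fetch(page):
    return FAKE_PAGES.get(page, {"data": {"replies": None}})


print(collect_messages(fake_fetch))  # ['first', 'second']
```

Passing the fetch function in as a parameter keeps the stop condition testable on its own; in the real spider it would wrap the session.get(...).json() call.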
Scraped content:
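V. Training Summary
With the help of my classmates and teacher, this training was completed successfully and I gained a great deal: I learned how to use the requests-html request library and the jsonpath data-parsing library.
During the training I found that my understanding of wordcloud was not deep enough. I also came to understand what object-oriented programming means and learned more about anti-scraping mechanisms.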