I. Background of the Topic Selection
Why choose this topic? What are the expected goals of the data analysis? (10 points) Describe in terms of society, economy, technology, data sources, etc. (within 200 words)
Reason for the topic: a web crawler is a program that automatically fetches information from the Internet, pulling out the content that is valuable to us. This topic was chosen because, as informatization advances, the big-data era needs ever more information to be collected and processed. For the same reason, crawler-related jobs keep increasing, so learning this course well also lays a solid foundation for future employment. Among today's many video sites, Bilibili skews young, which makes it especially worth crawling: it gives a window into what young people currently like to watch.
Expected goals: master crawling web-page information; clean, de-duplicate, and process the stored information; store it in a persistent, updatable form; then apply simple visualization to the data; and finally, assuming a client's requirements, serve the corresponding data quickly and conveniently.
II. Thematic Web Crawler Design Plan (10 points)
1. Name of the thematic web crawler
A program that crawls information on Bilibili's original videos and its anime (bangumi), and processes the results for feedback.
2. Content crawled and data-feature analysis
Content: Bilibili's trending original-video ranking (video title, rank, view count, danmaku count, uploader name, video URL, uploader space URL); Bilibili's trending anime ranking (rank, title, view count, danmaku count, latest episode number, playback URL).
Data-feature analysis: bar charts for the top ten entries (video title vs. view count, video rank vs. danmaku count, anime title vs. view count, anime rank vs. danmaku count).
3. Design overview (implementation approach and technical difficulties)
Implementation approach:
1. Crawl the target content from Bilibili and analyse the data
2. Clean the data and compute statistics
3. Store the data in a MySQL database
Technical difficulties: finding the tags and attributes that hold each piece of information on the page; writing the custom (def) functions; cleaning and de-duplicating the data saved to the CSV files; converting particular fields to integers (e.g. rank, view count, danmaku count); repairing the URLs (e.g. prepending "https://", removing a redundant "//"); learning and calling the scikit-learn machine-learning library; learning and calling the Selenium library. Two of these points are sketched in code below.
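A minimal sketch of the count normalisation and URL repair mentioned above. The helper names are illustrative only and do not appear in the final program:

def parse_count(text: str) -> int:
    """Convert Bilibili count strings such as '15.5萬' or '1.2億' to integers."""
    text = text.strip()
    if text in ('', '--'):  # rankings sometimes show '--' for missing data
        return 0
    units = {'萬': 10_000, '億': 100_000_000}
    if text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)

def fix_url(href: str) -> str:
    """Prepend the scheme to protocol-relative links such as '//www.bilibili.com/...'."""
    if href.startswith('//'):
        return 'https:' + href
    return href

print(parse_count('15.5萬'))                    # 155000
print(fix_url('//space.bilibili.com/123456'))  # https://space.bilibili.com/123456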
III. Structural Analysis of the Target Pages (10 points)
This project crawls two ranking pages of the same site (Bilibili's original-video ranking and its anime ranking), whose URLs are "https://www.bilibili.com/v/popular/rank/all" and "https://www.bilibili.com/v/popular/rank/bangumi" respectively.
Scheme : https
Host   : www.bilibili.com
Path   : /v/popular/rank/all
         /v/popular/rank/bangumi
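This decomposition can be verified with the standard library; a quick sanity check, not part of the crawler itself:

from urllib.parse import urlsplit

parts = urlsplit('https://www.bilibili.com/v/popular/rank/all')
print(parts.scheme)  # https
print(parts.netloc)  # www.bilibili.com
print(parts.path)    # /v/popular/rank/all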
The overall page skeleton is:
<html>
<head>...</head>
<body class="header-v2">...</body>
</html>
In both the original-video ranking and the anime ranking, the <head> tag contains five kinds of tags: <meta>, <title>, <script>, <link>, and <style>. These tags define the document header, and <head> is the container for all header elements. (see figure)
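The set of tags inside <head> can be confirmed programmatically; a small check whose output depends on the page as served at crawl time:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.bilibili.com/v/popular/rank/all').text
soup = BeautifulSoup(html, 'html.parser')
# collect the distinct tag names that appear directly inside <head>
print({child.name for child in soup.head.children if child.name})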

1. HTML page parsing
This project mainly parses the <body> part of the page. <body> contains four kinds of tags: <svg>, <div>, <script>, and <style>. After locating them, the data to crawl sits in the <li ... class="rank-item"> tags inside <div id="app">.
The following code fetches everything inside the <li ... class="rank-item"> tags:
import requests
from bs4 import BeautifulSoup

url = 'https://www.bilibili.com/v/popular/rank/all'
bdata = requests.get(url).text
soup = BeautifulSoup(bdata, 'html.parser')
items = soup.findAll('li', {'class': 'rank-item'})  # extract the ranking list items
print(items)
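One caveat: Bilibili may serve a reduced page to clients that send no browser-like User-Agent (the MySQL section below already sends one). If items comes back empty, retrying with such a header is worth a try; the header value here is only an example:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}
bdata = requests.get(url, headers=headers).text
soup = BeautifulSoup(bdata, 'html.parser')
items = soup.findAll('li', {'class': 'rank-item'})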

2. How nodes (tags) are searched and traversed (with a node-tree diagram where necessary)
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bilibili.com/v/popular/rank/all')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')

# traversal methods:
print(soup.contents)       # the child nodes of the whole tag tree
print(soup.body.contents)  # the child nodes under the body tag
print(soup.head)           # the head tag

# search methods:
print(soup.title)          # look up a tag, here the title tag
print(soup.li['class'])    # look up an attribute by tag name, here the class of the first li
print(soup.find_all('li')) # look up elements by tag name, here every li tag

Node-tree diagram:
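The diagram itself is not reproduced here. An indented text rendering of the same tree can be produced with a short recursive helper (a sketch, reusing the soup object from the snippet above and capping the depth to keep the output readable):

def print_tree(node, depth=0, max_depth=3):
    """Recursively print the element tags as an indented outline."""
    if depth > max_depth:
        return
    print('  ' * depth + node.name)
    for child in node.children:
        if child.name:  # skip strings and comments, descend only into tags
            print_tree(child, depth + 1, max_depth)

print_tree(soup.html)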

IV. Web Crawler Program Design (60 points)
The crawler program must include each of the parts below, with source code and fairly detailed comments, and a screenshot of the output after each part.
1. Data crawling and collection
① Obtaining the bvid URL

② Obtaining the aid
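The screenshots for ① and ② are not reproduced here. For reference, the numeric aid that the comment API in part 3 ③ needs can also be looked up from a bvid through Bilibili's public view endpoint; the endpoint and response shape below are as commonly observed at the time of writing and should be treated as an assumption:

import requests

def bvid_to_aid(bvid: str) -> int:
    """Look up the numeric aid for a given BV id via the web-interface view API."""
    resp = requests.get('https://api.bilibili.com/x/web-interface/view',
                        params={'bvid': bvid},
                        headers={'User-Agent': 'Mozilla/5.0'})
    return resp.json()['data']['aid']

print(bvid_to_aid('BV1op4y1X7N2'))  # prints the corresponding av number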

③ Crawling the ranking pages
# imports used by this part (the analysis, plotting, and Selenium imports used by
# later sections appear in the complete program in section 7)
import requests
from bs4 import BeautifulSoup
import csv
import datetime

url = 'https://www.bilibili.com/v/popular/rank/all'
# issue the network request
response = requests.get(url)
html_text = response.text
soup = BeautifulSoup(html_text, 'html.parser')

# Video object for the original-video ranking
class Video:
    def __init__(self, rank, title, visit, barrage, up_id, url, space):
        self.rank = rank
        self.title = title
        self.visit = visit        # view count
        self.barrage = barrage    # danmaku count
        self.up_id = up_id        # uploader name
        self.url = url            # video URL
        self.space = space        # uploader space URL

    def to_csv(self):
        return [self.rank, self.title, self.visit, self.barrage, self.up_id, self.url, self.space]

    @staticmethod
    def csv_title():
        return ['排名', '標題', '播放量', '彈幕量', 'Up_ID', 'URL', '作者空間']

# extract the ranking list items
items = soup.findAll('li', {'class': 'rank-item'})

# collect the parsed Video objects
videos = []
for itm in items:
    title = itm.find('a', {'class': 'title'}).text                # video title
    rank = itm.find('i', {'class': 'num'}).text                   # rank
    visit = itm.find_all('span')[3].text                          # view count
    barrage = itm.find_all('span')[4].text                        # danmaku count
    up_id = itm.find('span', {'class': 'data-box up-name'}).text  # uploader name
    url = itm.find_all('a')[1].get('href')                        # video URL
    space = itm.find_all('a')[2].get('href')                      # uploader space URL
    videos.append(Video(rank, title, visit, barrage, up_id, url, space))

# timestamp suffix for the file name
now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
file_name1 = f'嗶哩嗶哩視頻top100_{now_str}.csv'

# write the data out to the CSV file
with open(file_name1, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())

# --- anime (bangumi) ranking ---
url = 'https://www.bilibili.com/v/popular/rank/bangumi'
response = requests.get(url)
html_text = response.text
soup = BeautifulSoup(html_text, 'html.parser')

# Video object for the anime ranking (redefined with this ranking's fields)
class Video:
    def __init__(self, rank, title, visit, barrage, new_word, url):
        self.rank = rank
        self.title = title
        self.visit = visit
        self.barrage = barrage
        self.new_word = new_word  # latest episode
        self.url = url

    def to_csv(self):
        return [self.rank, self.title, self.visit, self.barrage, self.new_word, self.url]

    @staticmethod
    def csv_title():
        return ['排名', '標題', '播放量', '彈幕量', '更新話數至', 'URL']

items = soup.findAll('li', {'class': 'rank-item'})
videos = []
for itm in items:
    rank = itm.find('i', {'class': 'num'}).text                  # rank
    title = itm.find('a', {'class': 'title'}).text               # title
    url = itm.find_all('a')[0].get('href')                       # playback URL
    visit = itm.find_all('span')[2].text                         # view count
    barrage = itm.find_all('span')[3].text                       # danmaku count
    new_word = itm.find('span', {'class': 'data-box'}).text      # latest episode
    videos.append(Video(rank, title, visit, barrage, new_word, url))

now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
file_name2 = f'嗶哩嗶哩番劇top50_{now_str}.csv'
with open(file_name2, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())


④ Cleaning the data
import pandas as pd

file_name1 = '嗶哩嗶哩視頻top100_20211215_154744.csv'
file_name2 = '嗶哩嗶哩番劇top50_20211215_154745.csv'

# load the CSVs for cleaning and processing
paiming1 = pd.DataFrame(pd.read_csv(file_name1, encoding="utf_8_sig"))
paiming2 = pd.DataFrame(pd.read_csv(file_name2, encoding="utf_8_sig"))
print(paiming1.head())
print(paiming2.head())

# look for duplicate rows
print(paiming1.duplicated())
print(paiming2.duplicated())

# look for empty and missing values
print(paiming1['標題'].isnull().value_counts())
print(paiming2['標題'].isnull().value_counts())
print(paiming1['URL'].isnull().value_counts())
print(paiming2['URL'].isnull().value_counts())
print(paiming1['播放量'].isnull().value_counts())
print(paiming2['播放量'].isnull().value_counts())
print(paiming1['彈幕量'].isnull().value_counts())
print(paiming2['彈幕量'].isnull().value_counts())
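The code above only reports duplicates and missing values. If any did appear, they could be dropped in place with the usual pandas calls; a sketch not present in the original program:

# drop exact duplicate rows and rows missing the key fields, then renumber the index
paiming1 = paiming1.drop_duplicates().dropna(subset=['標題', 'URL']).reset_index(drop=True)
paiming2 = paiming2.drop_duplicates().dropna(subset=['標題', 'URL']).reset_index(drop=True)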
2. Storing the data in a MySQL database
① Crawling the site
import re
import time
import requests
from bs4 import BeautifulSoup

# crawl Bilibili's daily ranking
def BilibiliNews():
    newsList = []
    # browser-like headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}
    res = requests.get('https://www.bilibili.com/ranking/all/0/0/3', headers=headers)  # request the page
    soup = BeautifulSoup(res.text, 'html.parser')  # parse the page
    result = soup.find_all(class_='rank-item')     # locate the ranking entries
    startTime = time.strftime("%Y-%m-%d", time.localtime())  # record the crawl time
    for i in result:
        try:
            num = int(i.find(class_='num').text)   # current rank
            con = i.find(class_='content')
            title = con.find(class_='title').text  # title
            detail = con.find(class_='detail').find_all(class_='data-box')
            play = detail[0].text  # view count
            view = detail[1].text  # danmaku count
            # values such as 15.5萬 are converted to plain integers for easier storage
            if play[-1] == '萬':
                play = int(float(play[:-1]) * 10000)
            if view[-1] == '萬':
                view = int(float(view[:-1]) * 10000)
            # guard against fields that are displayed as '--'
            if view == '--':
                view = 0
            if play == '--':
                play = 0
            author = detail[2].text  # uploader
            url = con.find(class_='title')['href']  # video link
            BV = re.findall(r'https://www.bilibili.com/video/(.*)', url)[0]  # extract the BV id with a regex
            pts = int(con.find(class_='pts').find('div').text)  # composite score
            newsList.append([num, title, author, play, view, BV, pts, startTime])  # add the row to the list
        except Exception:
            continue
    return newsList  # return the collected rows

② Creating the table
mysql> CREATE TABLE BILIBILI(
    ->     NUM INT,
    ->     TITLE CHAR(80),
    ->     UP CHAR(20),
    ->     `VIEW` INT,          -- backquoted to avoid clashing with the SQL keyword
    ->     `COMMENT` INT,       -- likewise
    ->     BV_NUMBER CHAR(20),  -- BV ids are alphanumeric, so a character column
    ->     SCORE INT,
    ->     EXECUTION_TIME DATETIME);

③ Inserting the data into MySQL
import time
import pymysql

def GetMessageInMySQL():
    # connect to the database
    db = pymysql.connect(host="cdb-cdjhisi3hih.cd.tencentcdb.com", port=10056, user="root",
                         password="xxxxxx", database="weixinNews", charset='utf8')
    cursor = db.cursor()  # create a cursor
    news = BilibiliNews()  # fetch the ranking rows collected above
    # insert statement matching the eight columns of the BILIBILI table
    sql = ("INSERT INTO BILIBILI(NUM,TITLE,UP,`VIEW`,`COMMENT`,BV_NUMBER,SCORE,EXECUTION_TIME) "
           "VALUES (%s,%s,%s,%s,%s,%s,%s,%s)")
    timebegin = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())  # record the start time to help trace failures
    try:
        # executemany performs a batch insert
        cursor.executemany(sql, news)
        # commit the transaction
        db.commit()
        print(timebegin + " insert succeeded!")
    except Exception:
        # roll back on any error
        db.rollback()
        print(timebegin + " insert failed!")
    # close the cursor
    cursor.close()
    # close the database connection
    db.close()
④ Scheduled crawling with schedule
import time
import schedule

# record when the program started
time1 = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
print("Crawling started; program running normally: " + time1)

# run the job every 20 minutes
schedule.every(20).minutes.do(startFunction)

# keep checking; whenever a job is due, run it
while True:
    schedule.run_pending()
    time.sleep(1)
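startFunction itself is not shown in the original; presumably it just runs the crawl-and-store pipeline defined above. A minimal sketch under that assumption:

def startFunction():
    # one scheduled run: BilibiliNews() is called inside GetMessageInMySQL(),
    # which then writes the rows into the BILIBILI table
    GetMessageInMySQL()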


3. Developing the server side with Flask
① Keyword extraction with jieba and an ECharts word cloud
from collections import Counter
from pyecharts import WordCloud  # pyecharts 0.5.x API (1.x moved this to pyecharts.charts)
import jieba.analyse

# split a counter into two parallel lists (words and weights)
def counter2list(counter):
    keyList, valueList = [], []
    for c in counter:
        keyList.append(c[0])
        valueList.append(c[1])
    return keyList, valueList

# extract keywords with jieba and accumulate their weights
def extractTag(content, tagsList):
    if content:
        tags = jieba.analyse.extract_tags(content, topK=100, withWeight=True)
        for tex, weight in tags:
            tagsList[tex] += int(weight * 10000)

# render the word cloud to an HTML file
def drawWordCloud(content, count):
    outputFile = './測試詞雲.html'
    cloud = WordCloud('詞雲圖', width=1000, height=600, title_pos='center')
    cloud.add(' ', content, count,
              shape='circle',
              background_color='white',
              max_words=200)
    cloud.render(outputFile)

if __name__ == '__main__':
    c = Counter()  # container for the word weights
    filePath = './新建文本文檔.txt'  # document to analyse
    with open(filePath, encoding='utf-8') as file_object:
        contents = file_object.read()
    extractTag(contents, c)
    contentList, countList = counter2list(c.most_common(200))
    drawWordCloud(contentList, countList)

② Reading request parameters in Flask
username = request.form.get("username")
password = request.form.get("password", type=str, default=None)
cpuCount = request.form.get("cpuCount", type=int, default=None)
memorySize = request.form.get("memorySize", type=int, default=None)
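These lines only work inside a request context; a minimal route showing where they would sit (the route name and response are illustrative, not taken from the original server):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/submit', methods=['POST'])
def submit():
    # read the form fields posted by the client
    username = request.form.get("username")
    cpuCount = request.form.get("cpuCount", type=int, default=None)
    return jsonify(username=username, cpuCount=cpuCount)

if __name__ == '__main__':
    app.run()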

③ Crawling comments by BV number
# _*_ coding: utf-8 _*_
from urllib.request import urlopen, Request
from http.client import HTTPResponse
from bs4 import BeautifulSoup
import gzip
import json

def get_all_comments_by_bv(bv: str, time_order=False) -> tuple:
    """
    Given a Bilibili BV number, return the video's comments (including replies under comments).
    :param bv: the video's BV number
    :param time_order: return comments in time order if True, otherwise by popularity
    :return: a three-member tuple: the list of all comments (replies folded in, dict type),
             the video's AV number (string), and the actual number of comments counted
             (including replies)
    """
    video_url = 'https://www.bilibili.com/video/' + bv
    headers = {
        'Host': 'www.bilibili.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Cookie': '',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0',
        'TE': 'Trailers',
    }
    rep = Request(url=video_url, headers=headers)  # build the page request
    html_response = urlopen(rep)  # type: HTTPResponse
    html_content = gzip.decompress(html_response.read()).decode(encoding='utf-8')
    bs = BeautifulSoup(markup=html_content, features='html.parser')
    comment_meta = bs.find(name='meta', attrs={'itemprop': 'commentCount'})
    av_meta = bs.find(name='meta', attrs={'property': 'og:url'})
    comment_count = int(comment_meta.attrs['content'])         # total comment count
    av_number = av_meta.attrs['content'].split('av')[-1][:-1]  # AV number
    print(f'Video {bv} has AV number {av_number}; its metadata reports {comment_count} comments (including replies).')

    page_num = 1
    replies_count = 0
    res = []
    while True:
        # time order: type=1&sort=0; popularity order: type=1&sort=2
        comment_url = (f'https://api.bilibili.com/x/v2/reply?pn={page_num}&type=1&oid={av_number}'
                       f'&sort={0 if time_order else 2}')
        comment_response = urlopen(comment_url)  # type: HTTPResponse
        comments = json.loads(comment_response.read().decode('utf-8'))  # type: dict
        comments = comments.get('data').get('replies')  # type: list
        if comments is None:
            break
        replies_count += len(comments)
        for c in comments:  # type: dict
            if c.get('replies'):
                rp_id = c.get('rpid')
                rp_num = 10
                rp_page = 1
                while True:  # fetch the replies under this comment
                    reply_url = (f'https://api.bilibili.com/x/v2/reply/reply?'
                                 f'type=1&pn={rp_page}&oid={av_number}&ps={rp_num}&root={rp_id}')
                    reply_response = urlopen(reply_url)  # type: HTTPResponse
                    reply_reply = json.loads(reply_response.read().decode('utf-8'))  # type: dict
                    reply_reply = reply_reply.get('data').get('replies')  # type: list
                    if reply_reply is None:
                        break
                    replies_count += len(reply_reply)
                    for r in reply_reply:  # type: dict
                        res.append(r)
                    if len(reply_reply) < rp_num:
                        break
                    rp_page += 1
                c.pop('replies')
            res.append(c)
        if replies_count >= comment_count:
            break
        page_num += 1

    print(f'Actually fetched {replies_count} comments for video {bv}.')
    return res, av_number, replies_count

if __name__ == '__main__':
    cts, av, cnt = get_all_comments_by_bv('BV1op4y1X7N2')
    for i in cts:
        print(i.get('content').get('message'))

4. Data analysis and visualization (e.g. bar charts, histograms, scatter plots, box plots, distribution plots)
# data analysis and visualization
filename1 = file_name1
filename2 = file_name2

# per-bar colours reused by all the charts below
colors = ["red", "yellow", "green", "blue", "black", "gold", "pink", "purple", "violet", "chocolate"]
matplotlib.rcParams['font.sans-serif'] = ['KaiTi']  # a CJK font so the Chinese labels render

with open(filename1, encoding="utf_8_sig") as f1:
    # create a reader over the CSV file written earlier
    reader1 = csv.reader(f1)
    # a single next() call consumes the header row and stores it
    header_row1 = next(reader1)
    print(header_row1)
    # print the index of each header column
    for index, column_header in enumerate(header_row1):
        print(index, column_header)
    # empty lists, one per column
    title1 = []
    rank1 = []
    highs1 = []
    url1 = []
    visit1 = []
    space1 = []
    up_id1 = []
    for row in reader1:
        rank1.append(row[0])
        title1.append(row[1])
        visit1.append(row[2].strip())
        highs1.append(row[3].strip())
        up_id1.append(row[4].strip())
        url1.append(row[5].strip().strip('/'))
        space1.append(row[6].strip().strip('/'))
    # convert counts such as '15.5萬' to integers by rewriting the list's string form
    # (this trick assumes the scraped values carry exactly one decimal digit)
    visit1 = eval(str(visit1).replace('萬', '000').replace('.', ''))
    visit_list_new1 = list(map(int, visit1))
    highs1 = eval(str(highs1).replace('萬', '000').replace('.', ''))
    highs_list_new1 = list(map(int, highs1))
    print(highs_list_new1)

    # bar chart: rank vs. danmaku count for the top ten
    x = np.array(rank1[0:10])            # x-axis data
    y = np.array(highs_list_new1[0:10])  # y-axis data
    plt.bar(x, y, color=colors, width=0.5)
    plt.show()

    # bar chart: title vs. view count for the top ten
    x = np.array(title1[0:10])
    y = np.array(visit_list_new1[0:10])
    plt.bar(x, y, color=colors, width=0.5)
    plt.show()

    # a larger, labelled version of the view-count chart
    fig = plt.figure(figsize=(15, 8))  # canvas size
    plt.title("各視頻播放量")           # main title
    plt.xlabel("視頻名稱")              # x-axis title
    plt.ylabel("播放量")                # y-axis title
    plt.grid(True)                      # show grid lines
    x = np.array(title1[0:10])
    y = np.array(visit_list_new1[0:10])
    plt.bar(x, y, color=colors, width=0.6)
    plt.savefig(r"C:\Users\24390\Desktop\bilibili-up-v.png")  # save the figure

with open(filename2, encoding="utf_8_sig") as f2:
    reader2 = csv.reader(f2)
    header_row2 = next(reader2)
    print(header_row2)
    for index, column_header in enumerate(header_row2):
        print(index, column_header)
    rank2 = []
    title2 = []
    highs2 = []
    url2 = []
    visit2 = []
    new_word2 = []
    for row in reader2:
        rank2.append(row[0])
        title2.append(row[1])
        visit2.append(row[2].strip())
        highs2.append(row[3].strip())
        new_word2.append(row[4])
        url2.append(row[5].strip().strip('/'))
    print(highs2)
    # the anime ranking also contains counts in 億, converted the same way
    visit2 = eval(str(visit2).replace('萬', '000').replace('億', '0000000').replace('.', ''))
    visit_list_new2 = list(map(int, visit2))
    highs2 = eval(str(highs2).replace('萬', '000').replace('.', ''))
    highs_list_new2 = list(map(int, highs2))
    print(highs_list_new2)

    # bar chart: rank vs. danmaku count for the top ten
    x = np.array(rank2[0:10])
    y = np.array(highs_list_new2[0:10])
    plt.bar(x, y, color=colors, width=0.5)
    plt.show()

    # bar chart: title vs. view count for the top ten
    x = np.array(title2[0:10])
    y = np.array(visit_list_new2[0:10])
    plt.bar(x, y, color=colors, width=0.5)
    plt.show()

    # a larger, labelled version of the view-count chart
    fig = plt.figure(figsize=(15, 8))
    plt.title("番劇播放量")
    plt.xlabel("番劇名稱")
    plt.ylabel("播放量")
    plt.grid(True)
    x = np.array(title2[0:10])
    y = np.array(visit_list_new2[0:10])
    plt.bar(x, y, color=colors, width=0.6)
    plt.savefig(r"C:\Users\24390\Desktop\bilibili-draw-v.png")  # save the figure


5. Based on the relationships in the data, compute the correlation coefficient between two variables, draw a scatter plot, and fit a regression equation (simple or multiple).
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame, Series
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

file_name2 = '嗶哩嗶哩番劇top50_20211215_154745.csv'
filename2 = file_name2
with open(filename2, encoding="utf_8_sig") as f2:
    reader2 = csv.reader(f2)
    header_row2 = next(reader2)
    print(header_row2)
    for index, column_header in enumerate(header_row2):
        print(index, column_header)
    rank2 = []
    title2 = []
    highs2 = []
    url2 = []
    visit2 = []
    new_word2 = []
    for row in reader2:
        rank2.append(row[0])
        title2.append(row[1])
        visit2.append(row[2].strip())
        highs2.append(row[3].strip())
        new_word2.append(row[4])
        url2.append(row[5].strip().strip('/'))
    print(highs2)
    # same unit conversion as in the visualization section
    visit2 = eval(str(visit2).replace('萬', '000').replace('億', '0000000').replace('.', ''))
    visit_list_new2 = list(map(int, visit2))
    highs2 = eval(str(highs2).replace('萬', '000').replace('.', ''))
    highs_list_new2 = list(map(int, highs2))

# save the paired columns for later reuse
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(zip(highs_list_new2, visit_list_new2))

# build the dataset: danmaku count and view count for the top ten
examDict = {'彈幕量': highs_list_new2[0:10],
            '播放量': visit_list_new2[0:10]}
# convert to a DataFrame
examDf = DataFrame(examDict)

# scatter plot of views against danmaku count
plt.scatter(examDf.播放量, examDf.彈幕量, color='b', label="top10 data")
# axis labels
plt.xlabel("Views")
plt.ylabel("Danmaku")
# show the figure
plt.show()

# correlation matrix between the two variables
rDf = examDf.corr()
print(rDf)

exam_X = examDf.彈幕量
exam_Y = examDf.播放量

# split the original data into a training set and a test set;
# exam_X holds the feature, exam_Y the label, train_size the training fraction
X_train, X_test, Y_train, Y_test = train_test_split(exam_X, exam_Y, train_size=.8)

print("original features:", exam_X.shape,
      ", training features:", X_train.shape,
      ", test features:", X_test.shape)
print("original labels:", exam_Y.shape,
      ", training labels:", Y_train.shape,
      ", test labels:", Y_test.shape)

# scatter plot of the split
plt.scatter(X_train, Y_train, color="blue", label="train data")
plt.scatter(X_test, Y_test, color="red", label="test data")
plt.legend(loc=2)  # legend in the upper left
plt.xlabel("Danmaku")
plt.ylabel("Views")
plt.savefig("tests.jpg")
plt.show()

model = LinearRegression()
# the model needs a 2-D array to fit; with a single feature the arrays are
# reshaped with reshape(-1, 1), which infers the row count automatically
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
model.fit(X_train, Y_train)

a = model.intercept_  # intercept
b = model.coef_       # regression coefficient
print("best-fit line: intercept", a, ", coefficient:", b)

# predictions on the training data
y_train_pred = model.predict(X_train)
# draw the best-fit line through the training predictions
plt.plot(X_train, y_train_pred, color='black', linewidth=3, label="best line")
# test-data scatter on top
plt.scatter(X_test, Y_test, color='red', label="test data")
plt.legend(loc=2)
plt.xlabel("Danmaku")
plt.ylabel("Views")
plt.savefig("lines.jpg")
plt.show()

# R² of the fitted model on the test data
score = model.score(X_test, Y_test)
print(score)
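model.score returns the coefficient of determination R². For a simple linear regression this is the square of the Pearson correlation, so the printed score can be sanity-checked against the examDf.corr() matrix above (only approximately, since the score is computed on the 20% test split rather than all ten points):

r = examDf.corr().loc['彈幕量', '播放量']  # Pearson correlation between the two columns
print(r ** 2)  # should be close to model.score when the split is representative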


6. Data persistence
file_name1 = f'嗶哩嗶哩視頻top100_{now_str}.csv'
with open(file_name1, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())

file_name2 = f'嗶哩嗶哩番劇top50_{now_str}.csv'
with open(file_name2, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())

plt.savefig(r"C:\Users\24390\Desktop\bilibili-up-v.png")    # save the figure
plt.savefig(r"C:\Users\24390\Desktop\bilibili-draw-v.png")  # save the figure
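The goals in Part I also call for persistent, updatable storage. Besides the CSV files and the pymysql insert shown earlier, each ranking snapshot could be appended to MySQL with pandas and SQLAlchemy; a sketch with placeholder credentials and a hypothetical table name:

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string: user root, password xxxxxx, local server, database bilibili
engine = create_engine('mysql+pymysql://root:xxxxxx@localhost:3306/bilibili?charset=utf8mb4')
df = pd.read_csv(file_name1, encoding='utf_8_sig')
# append this ranking snapshot to a table, creating the table if it does not exist
df.to_sql('video_rank', engine, if_exists='append', index=False)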

7. Complete program code, combining all the parts above
Apart from its closing interactive part, the combined listing repeats, verbatim, the code already shown in parts 1 through 6 above: the imports, the two ranking crawls with their Video classes and CSV export, the pandas duplicate and missing-value checks, the bar charts, and the scatter plot with the fitted regression line. Only the final interactive part is new. It asks the user what to open, looks the name up in the lists built earlier (title1/url1, title2/url2, up_id1/space1), and opens the page in Chrome through Selenium:

print(title1[1], title2[1])
print('Would you like to watch an uploader video, watch a bangumi, or look up an uploader space page?\n'
      'Enter 1 for an uploader video, 2 for a bangumi, or 3 for an uploader space page.')
z = int(input())
if z == 2:
    print(title2)
    print('Enter the title of the bangumi you want to watch:')
    name = input()
    i = 0
    for i in range(50):
        if title2[i] == name:
            print(i)
            break
    print(url2[i])
    to_url2 = url2[i]
    d = webdriver.Chrome()       # open Chrome and keep the driver in d
    d.get('https://' + to_url2)  # open the page in the current window via get()
    sleep(2)
elif z == 1:
    print(title1)
    print('Enter the title of the uploader video you want to watch:')
    name = input()
    i = 0
    for i in range(100):
        if title1[i] == name:
            print(i)
            break
    print(url1[i])
    to_url1 = url1[i]
    d = webdriver.Chrome()       # open Chrome and keep the driver in d
    d.get('https://' + to_url1)  # open the page in the current window via get()
    sleep(2)
elif z == 3:
    print(up_id1)
    print('Enter the name of the uploader whose space you want to visit:')
    name = input()
    i = 0
    for i in range(100):
        if up_id1[i] == name:
            print(i)
            break
    print(space1[i])
    to_space11 = space1[i]
    d = webdriver.Chrome()          # open Chrome and keep the driver in d
    d.get('https://' + to_space11)  # open the page in the current window via get()
    sleep(2)
else:
    print('Input not recognized')
V. Summary
1. What conclusions can be drawn from the analysis and visualization of the subject data? Were the expected goals achieved?
Conclusions: what this project impressed on me most is that when a problem comes up, the cause of a bug can be researched online and then properly fixed. A design like this can also be combined with machine learning and other fields: the scatter plots and bar charts drawn here are not confined to the crawler theme itself but also involve machine-learning applications, which showed me how broad the knowledge behind a project topic can be.
Goals: the starting point is mastering the basic crawler steps of issuing requests and storing the results. Collecting information, extracting it, and turning it into visualizations is the next focus of my study. Persisting the data also cuts down how many times the collected data has to be cleaned and re-processed. This project made it clear that I must keep strengthening my understanding of Python so that I can quickly locate my weak points, work through them, and keep my progress with Python moving forward.
