Scraping Zhihu Hot-Search Titles, with Data Analysis and Visualization

This article is a repost. 2020-04-23 20:07

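I. Design of the Topic-Focused Web Crawler

1. Crawler name: Scraping Zhihu hot-search data, with data analysis and visualization

2. Content scraped: the title, rank, and heat value of each Zhihu hot-search entry

Data characteristics: the entries are effectively random in content and consist mainly of text (titles) and numbers (rank, heat)

3. Approach: first inspect the source of the target page and locate the required data in it; then scrape the data and persist it to an Excel workbook for later use; next clean and process the data; finally analyze and visualize it

Technical difficulties: regular expressions and the regression equation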

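II. Structural Analysis of the Target Page

1. Page structure and features: the heat values sit inside <td> tags, and the titles sit inside <a href=...> ... </a> tags

2. Page parsing: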
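
The structure described above can be illustrated with a small sketch. The HTML fragment below is hypothetical (the attribute values, sample titles, and the "万热度" unit text are assumptions, and the real page markup may differ); it only shows how one regular expression can pick up the link text and another the plain <td> cells. The names SAMPLE_HTML, TITLE_RE, HEAT_RE, and parse_rows are introduced here for illustration.

```python
import re

# Hypothetical fragment imitating the structure described above;
# the real page may use different attributes and nesting.
SAMPLE_HTML = """
<tr>
  <td class="al"><a href="/l?e=1" target="_blank" rel="nofollow">First sample title</a></td>
  <td>1234 万热度</td>
</tr>
<tr>
  <td class="al"><a href="/l?e=2" target="_blank" rel="nofollow">Second sample title</a></td>
  <td>567 万热度</td>
</tr>
"""

# Link text of anchors that open in a new tab
TITLE_RE = re.compile(r'<a href=[^>]*? target="_blank"[^>]*>(.*?)</a>')
# Contents of <td> cells that carry no attributes
HEAT_RE = re.compile(r'<td>(.*?)</td>')


def parse_rows(html):
    """Pair each title with its heat cell and number the pairs from 1."""
    titles = TITLE_RE.findall(html)
    heats = HEAT_RE.findall(html)
    return [(rank, t, h) for rank, (t, h) in enumerate(zip(titles, heats), start=1)]


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Pairing with zip stops at the shorter list, so a page where the two patterns match different numbers of elements yields fewer rows rather than an IndexError.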

III. Web Crawler Program Design

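1. Data scraping and collection

import re

import matplotlib.pyplot as plt
import pandas as pd
import requests
from sklearn.linear_model import LinearRegression
# Note: to_excel/read_excel on .xlsx files need the openpyxl package installed

# Aggregator page that lists the Zhihu hot-search entries
url = 'https://tophub.today/n/mproPpoq6O'
header = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, headers=header, timeout=10)
r.raise_for_status()
r.encoding = r.apparent_encoding
html = r.text

# Titles are the link texts; the slice offsets depend on the page layout
title = re.findall(r'<a href=.*? target="_blank" .*?>(.*?)</a>', html)[3:20]
# Heat values are the contents of the plain <td> cells
redu = re.findall(r'<td>(.*?)</td>', html)[0:17]
print(title)
print(redu)

print('{:^55}'.format('知乎熱度榜單'))
print('{:^5}\t{:^40}\t{:^10}'.format('排名','標題','熱度(單位:萬)'))
num = 8  # number of top entries to keep
lst = []
for i in range(num):
    heat = redu[i][:-3].strip()  # drop the 3-character unit suffix
    print('{:^5}\t{:^40}\t{:^10}'.format(i+1, title[i], heat))
    lst.append([i+1, title[i], heat])
df = pd.DataFrame(lst, columns=['排名','標題','熱度(單位:萬)'])
# Store heat as a number so that later statistics and plots work on it
df['熱度(單位:萬)'] = pd.to_numeric(df['熱度(單位:萬)'], errors='coerce')
df.to_excel('知乎熱度榜.xlsx', index=False)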

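2. Data cleaning and processing

df = pd.read_excel('知乎熱度榜.xlsx')
# Make sure heat is numeric even if it was saved as text
heat_text = df['熱度(單位:萬)'].astype(str).str.strip()
df['熱度(單位:萬)'] = pd.to_numeric(heat_text, errors='coerce')
print(df.head())  # preview the first rows

# Check for duplicate rows
print(df.duplicated())

# Count missing values in the title and heat columns
print(df['標題'].isnull().value_counts())
print(df['熱度(單位:萬)'].isnull().value_counts())

# Summary statistics of the numeric columns
print(df.describe())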
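
The cleaning checks above (duplicates, missing values, numeric heat) can also be sketched without pandas. This is a minimal standard-library illustration, not the method used in this article: the helper names parse_heat and clean_rows, the sample rows, and the "万热度" unit text are all assumptions made for the example.

```python
import re


def parse_heat(text):
    """Return the leading number of a heat cell as a float, or None if absent."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)', text)
    return float(match.group(1)) if match else None


def clean_rows(rows):
    """Drop rows with an empty title, a non-numeric heat, or a repeated title."""
    seen = set()
    cleaned = []
    for rank, title, heat_text in rows:
        title = title.strip()
        heat = parse_heat(heat_text)
        if not title or heat is None or title in seen:
            continue
        seen.add(title)
        cleaned.append((rank, title, heat))
    return cleaned


# Made-up rows covering each case the checks are meant to catch
raw = [
    (1, 'Sample A', '1234 万热度'),
    (2, 'Sample B', '567.5 万热度'),
    (3, 'Sample A', '400 万热度'),   # duplicate title
    (4, '', '300 万热度'),           # missing title
    (5, 'Sample C', 'n/a'),          # heat is not numeric
]
cleaned = clean_rows(raw)
for row in cleaned:
    print(row)
```

Matching the leading digits instead of slicing off a fixed number of characters keeps the parser working if the unit text changes length.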

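3. Data analysis and visualization

# Line chart of heat against rank
def zhexian():
    plt.rcParams['font.sans-serif'] = ['SimHei']  # font able to render the Chinese labels
    x = df['排名']
    y = pd.to_numeric(df['熱度(單位:萬)'].astype(str).str.strip(), errors='coerce')
    plt.xlabel('排名')
    plt.ylabel('熱度(單位:萬)')
    plt.plot(x, y)
    plt.scatter(x, y)
    plt.title('排名與熱度的折線圖')
    plt.show()
zhexian()

# Bar chart of the same data; plot numeric heat values, not the raw scraped strings
plt.rcParams['font.sans-serif'] = ['SimHei']
heat = pd.to_numeric(df['熱度(單位:萬)'].astype(str).str.strip(), errors='coerce')
plt.bar(df['排名'], heat)
plt.xlabel('排名')
plt.ylabel('熱度(單位:萬)')
plt.title('排名與熱度的柱狀圖')
plt.show()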

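4. Regression equation

df = pd.read_excel('知乎熱度榜.xlsx')
# Predictor: rank only; the target column must not be among the features
X = df[['排名']]
# Target: heat as a number
y = pd.to_numeric(df['熱度(單位:萬)'].astype(str).str.strip(), errors='coerce')
predict_model = LinearRegression()
predict_model.fit(X, y)
print('回歸系數:', predict_model.coef_)
print('截距:', predict_model.intercept_)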
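
For a single predictor, the fitted line heat = slope * rank + intercept has a closed form, which may make the regression step easier to follow. The sketch below computes it by hand in plain Python; the function name fit_line and the data values are made up for illustration (the values lie exactly on a line so the result is easy to check) and are not scraped results.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on heat = -20 * rank + 120
ranks = [1, 2, 3, 4, 5]
heats = [100.0, 80.0, 60.0, 40.0, 20.0]
slope, intercept = fit_line(ranks, heats)
print(f'heat = {slope:.2f} * rank + {intercept:.2f}')
```

A negative slope means heat falls as the rank number grows. On real hot-list data the drop is usually steep at the top and flat further down, so a straight line is only a rough summary.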

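5. Data persistence

# Save the scraped rows to an Excel workbook; index=False keeps the row index out of the file
df = pd.DataFrame(lst, columns=['排名','標題','熱度(單位:萬)'])
df.to_excel('知乎熱度榜.xlsx', index=False)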
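
If an Excel workbook is not required, the same rows could be persisted as CSV using only the standard library. This swaps pandas and openpyxl for the csv module, so it is an alternative and not what this article does; the helper names, column names, and rows below are invented for the example, and an in-memory buffer stands in for a file.

```python
import csv
import io


def rows_to_csv(rows, header):
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_rows(text):
    """Parse CSV text back into a header and typed rows (int, str, float)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, [[int(r[0]), r[1], float(r[2])] for r in reader]


# Made-up rows; the first title contains a comma to show that quoting is handled
rows = [[1, 'Sample title, with comma', 1234.0], [2, 'Another title', 567.5]]
text = rows_to_csv(rows, ['rank', 'title', 'heat'])
header, restored = csv_to_rows(text)
print(text)
```

CSV stores everything as text, so the reader has to convert rank and heat back to numbers explicitly.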

6. Complete code

import re

import matplotlib.pyplot as plt
import pandas as pd
import requests
from sklearn.linear_model import LinearRegression
# Note: to_excel/read_excel on .xlsx files need the openpyxl package installed

# Aggregator page that lists the Zhihu hot-search entries
url = 'https://tophub.today/n/mproPpoq6O'
header = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, headers=header, timeout=10)
r.raise_for_status()
r.encoding = r.apparent_encoding
html = r.text

# Titles are the link texts; the slice offsets depend on the page layout
title = re.findall(r'<a href=.*? target="_blank" .*?>(.*?)</a>', html)[3:20]
# Heat values are the contents of the plain <td> cells
redu = re.findall(r'<td>(.*?)</td>', html)[0:17]
print(title)
print(redu)

print('{:^55}'.format('知乎熱度榜單'))
print('{:^5}\t{:^40}\t{:^10}'.format('排名','標題','熱度(單位:萬)'))
num = 8  # number of top entries to keep
lst = []
for i in range(num):
    heat = redu[i][:-3].strip()  # drop the 3-character unit suffix
    print('{:^5}\t{:^40}\t{:^10}'.format(i+1, title[i], heat))
    lst.append([i+1, title[i], heat])

# Persist the rows, storing heat as a number
df = pd.DataFrame(lst, columns=['排名','標題','熱度(單位:萬)'])
df['熱度(單位:萬)'] = pd.to_numeric(df['熱度(單位:萬)'], errors='coerce')
df.to_excel('知乎熱度榜.xlsx', index=False)

# Reload and inspect the data
df = pd.read_excel('知乎熱度榜.xlsx')
print(df.head())

# Check for duplicate rows
print(df.duplicated())

# Count missing values in the title and heat columns
print(df['標題'].isnull().value_counts())
print(df['熱度(單位:萬)'].isnull().value_counts())

# Summary statistics of the numeric columns
print(df.describe())

# Line chart of heat against rank
def zhexian():
    plt.rcParams['font.sans-serif'] = ['SimHei']  # font able to render the Chinese labels
    x = df['排名']
    y = df['熱度(單位:萬)']
    plt.xlabel('排名')
    plt.ylabel('熱度(單位:萬)')
    plt.plot(x, y)
    plt.scatter(x, y)
    plt.title('排名與熱度的折線圖')
    plt.show()
zhexian()

# Bar chart of the same data; plot numeric heat values, not the raw scraped strings
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.bar(df['排名'], df['熱度(單位:萬)'])
plt.xlabel('排名')
plt.ylabel('熱度(單位:萬)')
plt.title('排名與熱度的柱狀圖')
plt.show()

# Linear regression of heat on rank; the target must not be among the features
X = df[['排名']]
predict_model = LinearRegression()
predict_model.fit(X, df['熱度(單位:萬)'])
print('回歸系數:', predict_model.coef_)
print('截距:', predict_model.intercept_)

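IV. Conclusions

1. Scraping today's Zhihu hot-list titles shows that the first- and second-ranked titles attract noticeably more attention, while the heat values of the lower-ranked titles are fairly stable and differ little from one another.

2. This programming assignment took me quite a long time, and I ran into many problems, but by searching Baidu and other sources I solved them step by step. This has made me more interested in Python, and finishing the task gave me a real sense of achievement. I am still not proficient with regular expressions, and I also had trouble with the regression equation; in my further studies I will work harder on this computer science course.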
