Scraping Trending-List Data from Across the Web with Python

This article is a repost. 2020-09-28 00:49

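I. Design of the Topic-Focused Web Crawler

1. Crawler name: scraping trending-list data from across the web

2. Content to crawl and characteristics of the data:

    1) Trending lists;

    2) Each record has a date, a title, a link URL, and so on.

3. Overview of the design:

    1) Analyze the HTML page to work out the structure of its markup;

    2) Implementation:

        a. Define the dictionaries used by the code;

        b. Fetch the page with requests;

        c. Parse the page with the BeautifulSoup library;

        d. Save the data to an Excel file (tophub.xlsx) with the pandas library;

        e. Define the main function main();

        f. Split the work into separate functions to decouple the code.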

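II. Structural Analysis of the Target Page

1. Structure and characteristics of the target page (URL: https://tophub.today/):

2. HTML page analysis

3. Node (tag) lookup and traversal: use the find_all() and find() methods to locate elements by their key class names and extract the data.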
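
As a minimal sketch of this lookup pattern, the snippet below runs find_all() and find() on a small hand-written HTML fragment. The fragment is an assumption modeled on the class names used later in this article (cc-cd, cc-cd-lb, cc-cd-cb-l nano-content, t); the live page's markup may differ, and the board names and links are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the class names used in this article;
# the real page at tophub.today may be structured differently.
SAMPLE_HTML = """
<div class="cc-cd">
  <div class="cc-cd-lb">Board A</div>
  <div class="cc-cd-cb-l nano-content">
    <a href="/n/1"><span class="t">First headline</span></a>
    <a href="/n/2"><span class="t">Second headline</span></a>
  </div>
</div>
<div class="cc-cd">
  <div class="cc-cd-lb">Board B</div>
  <div class="cc-cd-cb-l nano-content">
    <a href="/n/3"><span class="t">Third headline</span></a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

rows = []
# find_all() returns every matching element: one <div class="cc-cd"> per board
for board in soup.find_all('div', class_='cc-cd'):
    # find() returns only the first match inside this board
    source = board.find('div', class_='cc-cd-lb').text.strip()
    links = board.find('div', class_='cc-cd-cb-l nano-content').find_all('a')
    for link in links:
        title = link.find('span', class_='t').text.strip()
        rows.append((source, title, link['href']))

for row in rows:
    print(row)
```

Passing a string that contains a space to class_ matches the class attribute as an exact string, so 'cc-cd-cb-l nano-content' only matches when the two classes appear in that order.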

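III. Crawler Program Design

1. Data crawling and collection

Fetch the page with requests, setting a UA (User-Agent) header on the request to retrieve the page data;

Partial code: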

import requests

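def getHtml(url):
    # A browser-like User-Agent makes the request look like a normal visit
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/538.55 (KHTML, like Gecko) Chrome/81.0.3345.132 Safari/538.55'}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # stop early on an HTTP error status
    return resp.text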

[Screenshot of partial run output]
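
To see why the header matters, the sketch below compares the User-Agent that requests sends by default with one attached through headers=. It only prepares the request and does not contact the site; the shortened browser string is a placeholder chosen for this sketch, not the one used above.

```python
import requests

# Default identifier that requests sends when no User-Agent is supplied,
# of the form "python-requests/<version>"
default_ua = requests.utils.default_user_agent()

# Placeholder browser-style string (an assumption for this sketch)
browser_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Build the request without sending it, so the headers can be inspected offline
request = requests.Request('GET', 'https://tophub.today',
                           headers={'user-agent': browser_ua})
prepared = request.prepare()

print('default :', default_ua)
# Header lookup on a prepared request is case-insensitive
print('override:', prepared.headers['User-Agent'])
```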

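2. Data cleaning and processing

Parse the page with the BeautifulSoup library: find_all() collects the elements that hold the data, then find() picks out the key fields by their class attribute;

Partial code: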

import re
import time

import pandas
from bs4 import BeautifulSoup

def get_data(html):
    soup = BeautifulSoup(html, 'html.parser')
    nodes = soup.find_all('div', class_='cc-cd')
    return nodes

def get_node_data(df, nodes):
    now = int(time.time())
    for node in nodes:
        source = node.find('div', class_='cc-cd-lb').text.strip()
        messages = node.find('div', class_='cc-cd-cb-l nano-content').find_all('a')
        for message in messages:
            content = message.find('span', class_='t').text.strip()
            if source == '微信':
                # WeChat titles carry a 「account」 prefix; keep only the text after it
                match = re.search('「.+?」(.+)', content)
                if match:
                    content = match.group(1)

            if df.empty or df[df.content == content].empty:
                data = {
                    'content': [content],
                    'url': [message['href']],
                    'source': [source],
                    'start_time': [now],
                    'end_time': [now]
                }

                item = pandas.DataFrame(data)
                df = pandas.concat([df, item], ignore_index=True)

            else:
                index = df[df.content == content].index[0]
                df.at[index, 'end_time'] = now

    return df

[Screenshot of partial run output]
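
The WeChat branch in get_node_data() is easier to follow in isolation. The following is a minimal sketch, assuming WeChat titles arrive as 「account name」 followed by the headline; the sample strings are invented and strip_account_prefix is a hypothetical helper name. It returns the original text when no prefix is found, whereas indexing the result of re.findall() with [0] raises IndexError in that case.

```python
import re

# Non-greedy match for the first 「...」 block, then capture everything after it
PREFIX_PATTERN = re.compile(r'「.+?」(.+)')

def strip_account_prefix(title):
    """Return the text after the first 「...」 block, or the title unchanged."""
    match = PREFIX_PATTERN.search(title)
    return match.group(1) if match else title

with_prefix = strip_account_prefix('「Some Account」Headline text')
without_prefix = strip_account_prefix('Headline text')

print(with_prefix)
print(without_prefix)
```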

3. Data persistence

Save the data to an Excel file (tophub.xlsx) with the pandas library;

Partial code:

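import pandas

# `data` is the node list returned by get_data(html)
try:
    res = pandas.read_excel('tophub.xlsx')
except FileNotFoundError:
    res = pandas.DataFrame()  # first run: no saved data yet
res = get_node_data(res, data)
res.to_excel('tophub.xlsx', index=False)  # keep the row index out of the file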

[Screenshot of partial run output]
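
Because each run reloads the saved table before calling get_node_data(), start_time stays fixed at the first sighting of a title while end_time moves forward on every later sighting. The sketch below shows that bookkeeping with a plain dict keyed by title instead of a DataFrame; the substitution and the names record_sightings and seen are illustrative assumptions, not the article's code.

```python
def record_sightings(seen, titles, now):
    """Update first-seen / last-seen times for the titles observed at `now`."""
    for title in titles:
        if title not in seen:
            # First sighting: both timestamps start at the current time
            seen[title] = {'start_time': now, 'end_time': now}
        else:
            # Seen before: only the last-seen time moves forward
            seen[title]['end_time'] = now
    return seen

seen = {}
record_sightings(seen, ['story A', 'story B'], now=1000)  # first crawl
record_sightings(seen, ['story B', 'story C'], now=1600)  # later crawl

for title, times in seen.items():
    print(title, times)
```

The difference end_time - start_time then approximates how long a title stayed on a board, to the resolution of the interval between crawls.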

4. Putting the parts above together, the complete code:

import requests
from bs4 import BeautifulSoup
import time
import pandas
import re

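def getHtml(url):
    # A browser-like User-Agent makes the request look like a normal visit
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/538.55 (KHTML, like Gecko) Chrome/81.0.3345.132 Safari/538.55'}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # stop early on an HTTP error status
    return resp.text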


def get_data(html):
    soup = BeautifulSoup(html, 'html.parser')
    nodes = soup.find_all('div', class_='cc-cd')
    return nodes


def get_node_data(df, nodes):
    now = int(time.time())
    for node in nodes:
        source = node.find('div', class_='cc-cd-lb').text.strip()
        messages = node.find('div', class_='cc-cd-cb-l nano-content').find_all('a')
        for message in messages:
            content = message.find('span', class_='t').text.strip()
            if source == '微信':
                # WeChat titles carry a 「account」 prefix; keep only the text after it
                match = re.search('「.+?」(.+)', content)
                if match:
                    content = match.group(1)

            if df.empty or df[df.content == content].empty:
                data = {
                    'content': [content],
                    'url': [message['href']],
                    'source': [source],
                    'start_time': [now],
                    'end_time': [now]
                }

                item = pandas.DataFrame(data)
                df = pandas.concat([df, item], ignore_index=True)

            else:
                index = df[df.content == content].index[0]
                df.at[index, 'end_time'] = now

    return df


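def main():
    url = 'https://tophub.today'
    nodes = get_data(getHtml(url))
    try:
        res = pandas.read_excel('tophub.xlsx')
    except FileNotFoundError:
        res = pandas.DataFrame()  # first run: no saved data yet
    res = get_node_data(res, nodes)
    res.to_excel('tophub.xlsx', index=False)  # keep the row index out of the file


if __name__ == '__main__':
    main()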
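
The start_time and end_time columns hold Unix timestamps in seconds, as produced by int(time.time()). Here is a minimal sketch of turning them into readable values with the standard library; the two sample timestamps and the choice of UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone

def to_utc(ts):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

start_time = 1601251200       # sample value: 2020-09-28 00:00:00 UTC
end_time = start_time + 5400  # sample value: 90 minutes later

first_seen = to_utc(start_time)
last_seen = to_utc(end_time)
time_on_list = last_seen - first_seen  # a datetime.timedelta

print(first_seen.isoformat())
print(last_seen.isoformat())
print(time_on_list)
```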

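IV. Conclusion

For this make-up exam programming assignment, I chose to scrape aggregated trending-list data from across the web rather than the list of each individual site; I also use this site regularly to follow trending news around the country. Scraping it is relatively simple and familiar ground: first, it is a static page; second, the nodes are easy to find by their class attributes; and the crawler needs no special disguise beyond setting a User-Agent so that the request looks like a normal visitor.

Summary:

    1. Encoding matters. At first the Chinese text in the parsed data came out garbled, mainly because of conversion between GBK and UTF-8;

    2. Get into the habit of decoupling code into parts and checking each one. At first I wrote everything straight through in one lump, which made it very hard to locate errors. After splitting it into functions, with each part and each feature separated out, the code is easier to read and problems occur less often;

    3. My fundamentals are still lacking, and I need to keep working at them.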
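
The garbled text described in point 1 can be reproduced offline. The snippet below is a minimal sketch, assuming the page bytes were UTF-8 and were decoded with the wrong codec; the sample string is made up.

```python
original = '热点榜单'

# Bytes as a server would send them for a UTF-8 page
utf8_bytes = original.encode('utf-8')

# Decoding with the wrong codec yields garbled text; errors='replace'
# avoids an exception if some byte sequence is invalid in GBK
garbled = utf8_bytes.decode('gbk', errors='replace')

# Decoding with the codec the bytes were written in recovers the text
recovered = utf8_bytes.decode('utf-8')

print(garbled)
print(recovered)
```

With requests, the codec used for resp.text can be set explicitly by assigning to resp.encoding before resp.text is read.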

Finally, this make-up exam further improved my ability to apply Python, and I benefited a great deal from it.
