A Weibo Hot Search Crawler in a Few Lines of Code

Reposted article; see the original linked at the end. 2021-04-06 22:31 · Data Analysis / Python

1. Fetching the Data

First, we need the URL of the Weibo hot search page: https://s.weibo.com/top/summary

import requests

def get_html_data(self):
    # Fetch the hot search page and return its HTML as text
    res = requests.get(self.url, headers=self.headers, timeout=10).text
    return res

With the requests package we can download the page's HTML. The next step is to parse that HTML.
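The method above relies on self.url and self.headers, which are not shown in this article. A minimal sketch of a class that could hold them follows; the class name WeiboHotSearch, the User-Agent value, and the timeout are assumptions made for illustration, and the site may also require a Cookie header.

```python
import requests


class WeiboHotSearch:
    """Hypothetical container class; the name and defaults are assumptions."""

    def __init__(self):
        self.url = "https://s.weibo.com/top/summary"
        # A browser-like User-Agent (assumed value); a Cookie header may also be needed.
        self.headers = {"User-Agent": "Mozilla/5.0"}
        # Filled in later by the parsing step.
        self.hot_list = []

    def get_html_data(self):
        # timeout stops the call from hanging forever;
        # raise_for_status turns HTTP error codes into exceptions.
        resp = requests.get(self.url, headers=self.headers, timeout=10)
        resp.raise_for_status()
        return resp.text


# Creating the object does not touch the network; only get_html_data() does.
spider = WeiboHotSearch()
print(spider.url)
```

Calling spider.get_html_data() would then return the page HTML as a string, provided the request is accepted.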

2. Processing the Data

To make the HTML easier to analyze, I copied it into an editor and examined the text there.

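A quick look shows that the data we want sits in the table rows under the element with id pl_top_realtimehot (the original article illustrates this structure with a figure).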

A simple implementation:

from datetime import datetime
from bs4 import BeautifulSoup

def deal_html_data(self, res):
    res = BeautifulSoup(res, "lxml")
    # Iterate over the hot search rows:
    # #pl_top_realtimehot selects by id, then > table > tbody > tr descends level by level
    for item in res.select("#pl_top_realtimehot > table > tbody > tr"):
        # Extract the rank by class name .td-01
        _rank = item.select_one('.td-01').text
        if not _rank:
            continue
        # Extract the keyword from the link inside .td-02
        keyword = item.select_one(".td-02 > a").text

        # Extract the heat value
        heat = item.select_one(".td-02 > span").text

        # Extract the tag shown next to the entry
        icon = item.select_one(".td-03").text

        self.hot_list.append({
            "rank": _rank,
            "keyword": keyword,
            "heat": heat,
            "icon": icon,
            "time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        })

Here we use BeautifulSoup's select and select_one methods to parse the HTML.
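The method above assumes that every select_one call finds a tag. If a row lacks one of the cells, select_one returns None and .text raises AttributeError. The following is a sketch of a more defensive standalone variant, not the author's code: the helper text_of, the function parse_hot_list, and the SAMPLE markup are all made up for illustration, and it uses the built-in "html.parser" so it runs without lxml.

```python
from datetime import datetime

from bs4 import BeautifulSoup


def text_of(node):
    """Return the stripped text of a tag, or "" when select_one found nothing."""
    return node.get_text(strip=True) if node is not None else ""


def parse_hot_list(html):
    # "html.parser" ships with Python; "lxml" also works if it is installed.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("#pl_top_realtimehot > table > tbody > tr"):
        rank = text_of(item.select_one(".td-01"))
        if not rank:
            continue  # skip rows without a rank
        rows.append({
            "rank": rank,
            "keyword": text_of(item.select_one(".td-02 > a")),
            "heat": text_of(item.select_one(".td-02 > span")),
            "icon": text_of(item.select_one(".td-03")),
            "time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        })
    return rows


# Made-up markup that mimics the selectors used above.
SAMPLE = """
<div id="pl_top_realtimehot"><table><tbody>
<tr><td class="td-01"></td><td class="td-02"><a>Pinned</a></td><td class="td-03"></td></tr>
<tr><td class="td-01">1</td><td class="td-02"><a>Topic A</a><span>12345</span></td><td class="td-03">Hot</td></tr>
<tr><td class="td-01">2</td><td class="td-02"><a>Topic B</a></td><td class="td-03"></td></tr>
</tbody></table></div>
"""

for row in parse_hot_list(SAMPLE):
    print(row["rank"], row["keyword"], row["heat"], row["icon"])
```

Returning a list instead of appending to self.hot_list keeps the function easy to test on saved HTML; a class method could simply extend self.hot_list with the result.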

A brief note on select and select_one:

# Find by tag name
soup.select_one('a')
# Find by class name
soup.select_one('.td-02')
# Find by id
soup.select_one('#pl_top_realtimehot')
# Combined lookup: by id, then down through the tag hierarchy
soup.select("#pl_top_realtimehot > table > tbody > tr")
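The difference between the two methods is easiest to see on a small document. The sketch below uses made-up markup (the menu id and item class are not from Weibo) to show that select returns a list of all matches, while select_one returns only the first match, or None when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = (
    '<ul id="menu">'
    '<li class="item"><a>one</a></li>'
    '<li class="item"><a>two</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

all_items = soup.select("#menu > .item")     # list containing every match
first_link = soup.select_one("#menu a")      # first match only
missing = soup.select_one(".no-such-class")  # None when nothing matches

print(len(all_items))
print(first_link.text)
print(missing)
```

Because select_one can return None, it is worth checking the result before reading .text from it.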

3. Storing the Data
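This article does not show how the data is stored. One possible approach, offered purely as an assumption, is to write the collected records to a CSV file with Python's standard csv module. The function name write_hot_list, the FIELDS list, and the sample record are hypothetical; the column names follow the dictionary keys used in the parsing step.

```python
import csv
import io

# Column order matches the keys of the records built in the parsing step.
FIELDS = ["rank", "keyword", "heat", "icon", "time"]


def write_hot_list(hot_list, fp):
    """Write the hot search records to an open text file as CSV."""
    writer = csv.DictWriter(fp, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(hot_list)


# Demo with an in-memory buffer and a made-up record. For a real file, use
# open("weibo_hot.csv", "w", newline="", encoding="utf-8-sig") instead.
sample = [{"rank": "1", "keyword": "Topic A", "heat": "12345",
           "icon": "Hot", "time": "2021-04-06 22:31:00"}]
buffer = io.StringIO()
write_hot_list(sample, buffer)
print(buffer.getvalue())
```

The utf-8-sig encoding suggested in the comment is a common choice when the CSV will be opened in Excel, since it helps Chinese text display correctly.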

For more information, see the original article:

https://mp.weixin.qq.com/s?__biz=Mzg3OTExODI3OA==&mid=2247484291&idx=1&sn=992419916130cf4b414b77b20c38db82&chksm=cf08112af87f983ce954aefd22a0179bbd7110540750b2db566cfc36e66a9000c953d675eecf&token=2077647426&lang=zh_CN#rd
