Scraping the KuGou Music TOP500 Chart with Python

Reposted article · 2019-12-08 23:09 · Tags: web scraping / Python

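I recently entered a data mining competition, so I have been learning Python while competing /(ㄒoㄒ)/~~. Compared with the algorithms, which have been pure torture, Python is very friendly (●'◡'●), and I started applying it after learning just the basics. Enough preamble; let's get started.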

Environment Setup

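We use three libraries for extraction: bs4, requests, and lxml. bs4 is the import name of Beautiful Soup 4 (in Chinese it is also nicknamed "beautiful soup"), and installing it is simple.
Open a command prompt (Win + R, then type cmd) and run pip install bs4. The library is published on PyPI as beautifulsoup4, and the bs4 package simply pulls it in. pip is Python's general-purpose package manager; it can find, download, install, and uninstall Python packages.
The other two libraries are installed the same way (pip install requests, pip install lxml). On success, pip prints Successfully installed xxx.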
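
To confirm that the three libraries can be imported before going further, a small check like the one below may help. It is only a sketch: the helper name check_packages is made up for this illustration, and it relies on the standard-library function importlib.util.find_spec.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # bs4, requests and lxml are the import names of the three libraries used here.
    for name, ok in check_packages(["bs4", "requests", "lxml"]).items():
        print(name, "OK" if ok else "missing, try: pip install " + name)
```

Any name reported as missing can be installed with pip as described above.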

Building the Request Headers

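You need the Chrome browser. Open it and press Ctrl+Shift+I to bring up the developer tools, then visit https://www.kugou.com/yy/rank/home/1-8888.html?from=rank and find the User-Agent value among the request headers (in the Network panel, select the page request and look under Request Headers).
What is a request header?
Request headers are one of the signals a website uses to tell whether a visitor is a person or a program. If we send a browser's User-Agent, the site is more likely to treat the request as an ordinary browser visit and less likely to block it. This is only the simplest way of getting past anti-scraping measures, but it is one we almost always include. The header looks like this:
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
It is used in the code below.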
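
As a rough illustration of what setting this header does, the sketch below builds a request without sending it and inspects the User-Agent that would go out. HEADERS is a name chosen for this example. requests.Request(...).prepare() only prepares the request locally, so no network access is involved.

```python
import requests

# Name chosen for this example; the value is the User-Agent string quoted above.
HEADERS = {
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/78.0.3904.108 Safari/537.36"
    )
}

url = "https://www.kugou.com/yy/rank/home/1-8888.html?from=rank"

# Build the request locally without sending it.
prepared = requests.Request("GET", url, headers=HEADERS).prepare()

# Header names are case-insensitive, so "User-Agent" finds our "user-agent" entry.
print(prepared.headers["User-Agent"])
```

If no User-Agent is supplied, requests identifies itself with a python-requests/<version> string, which is easy for a site to recognize as a script.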

Requesting the Page

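import requests

def get_html(url):
    # Send a browser-like User-Agent so the request looks like a normal visit.
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text  # HTML source of the page
    else:
        return None  # any other status code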

response = requests.get(url, headers=headers)

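This calls the get method of the requests library to fetch the page. The first argument is the URL and the second is the request headers. The result is assigned to the variable response, which holds, among other things, the status code, the page source as text, and the raw bytes.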

response.status_code == 200

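The status_code attribute of response gives the HTTP status code. 200 means the request succeeded, in which case the function returns the page; otherwise it returns None. Status codes fall into classes: 2xx means success, 3xx redirection, 4xx a client-side error, and 5xx a server-side error.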
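
The status classes can be expressed as a small lookup on the first digit of the code. This is a minimal sketch for illustration; the function name status_class is hypothetical and not part of requests.

```python
def status_class(code):
    """Map an HTTP status code to its broad class, based on the first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```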

return response.text

This returns the text attribute of the response, which is the page's HTML source.
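
A slightly more defensive variant of get_html is sketched below. It is not the original author's code: the 10-second timeout, the use of raise_for_status(), and the requests.RequestException handler are choices made for this illustration. Unlike the version above, it treats any non-error status as success rather than 200 only.

```python
import requests

# Same User-Agent as in the article; HEADERS is a name chosen for this sketch.
HEADERS = {
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/78.0.3904.108 Safari/537.36"
    )
}


def get_html(url, timeout=10):
    """Return the page's HTML text, or None if the request fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx responses.
        response.raise_for_status()
    except requests.RequestException:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        return None
    return response.text
```

Returning None on failure keeps the same contract as the original function, so the caller still needs to check the result before parsing it.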

Parsing the Page

Once we have the HTML source, we parse it into a structured form so that the data is easy to extract.

html = BeautifulSoup(html, "lxml")  # "lxml" is the parser

Extracting the Data

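In the browser, right-click the data you want to extract and choose Inspect, then copy the element's CSS selector (Copy > Copy selector in the developer tools):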

#rankWrap > div.pc_temp_songlist > ul > li:nth-child(1) > span.pc_temp_num

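Change li:nth-child(1) to li. nth-child(1) matches only the first li under the ul, whereas we want the rank of every song on the page.
The other fields work the same way, which gives the following code.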

    ranks = html.select('#rankWrap > div.pc_temp_songlist > ul > li > span.pc_temp_num')
    name = html.select('#rankWrap > div.pc_temp_songlist > ul > li > a')
    time = html.select('#rankWrap > div.pc_temp_songlist > ul > li > span.pc_temp_tips_r > span')
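
The difference between li:nth-child(1) and plain li can be seen on a tiny made-up page. The HTML below is not KuGou's real markup; it only imitates the structure implied by the selectors. The sketch also uses Python's built-in "html.parser" instead of "lxml" so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only imitates the structure implied by the selectors above.
SAMPLE = """
<div id="rankWrap">
  <div class="pc_temp_songlist">
    <ul>
      <li><span class="pc_temp_num"> 1 </span><a>Artist A - Song A</a></li>
      <li><span class="pc_temp_num"> 2 </span><a>Artist B - Song B</a></li>
    </ul>
  </div>
</div>
"""

# "html.parser" ships with Python; the article itself uses "lxml".
soup = BeautifulSoup(SAMPLE, "html.parser")

prefix = "#rankWrap > div.pc_temp_songlist > ul > "
first_only = soup.select(prefix + "li:nth-child(1) > span.pc_temp_num")
all_items = soup.select(prefix + "li > span.pc_temp_num")

print(len(first_only))  # only the span inside the first <li>
print([tag.get_text().strip() for tag in all_items])  # one entry per <li>
```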

Combining the Data

    for r,n,t in zip(ranks,name,time):
        r = r.get_text().replace('\n','').replace('\t','').replace('\r','')
        n = n.get_text()
        t = t.get_text().replace('\n','').replace('\t','').replace('\r','')

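zip pairs up each rank with the matching song/artist and play time. You can think of its result as a list of tuples, [(rank, song-artist, play time), ...], although in Python 3 it is actually a lazy iterator.
On each pass through the loop, r, n, and t take the three elements of one tuple in turn.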
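
A quick way to see what zip produces is to run it on small hand-written lists. The values below are invented for illustration, and the last list is deliberately one item short to show that zip stops at the shortest input.

```python
ranks = ["1", "2", "3"]
names = ["Artist A - Song A", "Artist B - Song B", "Artist C - Song C"]
times = ["3:45", "4:02"]  # deliberately one item short

# list() forces the lazy zip iterator so the tuples can be printed.
rows = list(zip(ranks, names, times))
print(rows)

for r, n, t in rows:
    print(r, n, t)
```

Because zip silently truncates, a selector that misses an element on the page would drop rows without raising an error, so it can be worth comparing the lengths of the three lists first.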

get_text()

What select returns is the tag that contains the data, not the data itself, so we call get_text() to get the actual text.

.replace('\n','').replace('\t','').replace('\r','')

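This strips the unwanted newline, tab, and carriage-return characters from the text. Finally, the values are packed into a dictionary and printed.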
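
The three chained replace() calls can also be written as a single translation table. The sketch below is one possible alternative, not the article's code; the names clean and make_record and the dictionary keys are chosen for this example.

```python
# Translation table that deletes newline, tab and carriage-return characters.
_DROP = str.maketrans("", "", "\n\t\r")


def clean(text):
    """Remove \\n, \\t and \\r, like the chained replace() calls above."""
    return text.translate(_DROP)


def make_record(rank, name, play_time):
    """Pack one song's cleaned fields into a dictionary."""
    return {"rank": clean(rank), "song-artist": name, "play time": clean(play_time)}


print(make_record("\n\t\t1\r\n", "Artist A - Song A", "\n\t3:45\n"))
```

Note that, like the original replace() chain, this leaves ordinary spaces untouched; str.strip() would be needed to trim those as well.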

End!

That covers the overall approach. Much of it draws on a Python123 column [1]. Many thanks to the original author!

[1] 木下瞳's column, https://python123.io/python/muxiatong/5dd14d1b71efdc10be55ee22

Full Code

# KuGou TOP500
import time
import requests
from bs4 import BeautifulSoup

def get_html(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    response = requests.get(url,headers=headers)
    if response.status_code == 200:
        return response.text
    else:
        return

def get_infos(html):
    # Parse the HTML source with the lxml parser.
    html = BeautifulSoup(html,"lxml")
    # Each selector matches one field for every song on the page.
    ranks = html.select('#rankWrap > div.pc_temp_songlist > ul > li > span.pc_temp_num')
    name = html.select('#rankWrap > div.pc_temp_songlist > ul > li > a')
    time = html.select('#rankWrap > div.pc_temp_songlist > ul > li > span.pc_temp_tips_r > span')
    for r,n,t in zip(ranks,name,time):
        # get_text() returns the tag's text; strip newlines, tabs and carriage returns.
        r = r.get_text().replace('\n','').replace('\t','').replace('\r','')
        n = n.get_text()
        t = t.get_text().replace('\n','').replace('\t','').replace('\r','')
        data ={
            'rank': r,
            'song-artist': n,
            'play time': t
        }
        print(data)

def main():
    # Build the URLs for pages 1 to 23 of the chart.
    urls = ['https://www.kugou.com/yy/rank/home/{}-8888.html?from=rank'.format(i) for i in range(1, 24)]
    for url in urls:
        html = get_html(url)
        if html is not None:  # get_html returns None when the request fails
            get_infos(html)
        time.sleep(1)  # pause between requests

if __name__ == '__main__':
    main()
