Python Web Scraping Beginner Tutorial 17: Scraping the Ku* Music Website

Reposted article · 2021-02-05 16:13 · Category: Python web scraping

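Foreword💨

The text and images in this article come from the internet and are for learning and discussion only, with no commercial purpose. If there is any problem, please contact us promptly so we can deal with it.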

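Previous posts💨

Python Web Scraping Beginner Tutorial 01: Scraping Douban's Top Movies

Python Web Scraping Beginner Tutorial 02: Scraping Novels

Python Web Scraping Beginner Tutorial 03: Scraping Second-Hand Housing Data

Python Web Scraping Beginner Tutorial 04: Scraping Job Listings

Python Web Scraping Beginner Tutorial 05: Scraping Bilibili Video Danmaku (Bullet Comments)

Python Web Scraping Beginner Tutorial 06: Making a Word Cloud from Scraped Data

Python Web Scraping Beginner Tutorial 07: Scraping Tencent Video Danmaku

Python Web Scraping Beginner Tutorial 08: Scraping CSDN Articles and Saving Them as PDF

Python Web Scraping Beginner Tutorial 09: Multithreaded Scraping of Meme Images

Python Web Scraping Beginner Tutorial 10: Scraping Bi'an Wallpapers

Python Web Scraping Beginner Tutorial 11: Scraping Skin Images from the New Honor of Kings Site

Python Web Scraping Beginner Tutorial 12: Scraping League of Legends Skin Images

Python Web Scraping Beginner Tutorial 13: Scraping High-Quality Desktop Wallpapers

Python Web Scraping Beginner Tutorial 14: Scraping Audiobook Audio

Python Web Scraping Beginner Tutorial 15: Scraping Music Website Data

Python Web Scraping Beginner Tutorial 16: Scraping an Audio Asset Website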


Development environment💨

Python 3.6
PyCharm

I. 💥Define the requirement

Scrape the music on every chart (ranking list).

II. 💥Analyze the page data

1. Find the music URL first

Click play, and a music playback URL appears in the browser's developer tools.

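2. Trace where the music URL comes from

https://webfs.yun.kugou.com/202102051451/598a943870c34115e8c290507183a2c9/G188/M06/18/09/_A0DAF34pOiABslMADSv-ykkq2s784.mp3

A music URL like this shows no obvious pattern, so search for it in the developer tools to find which response it comes from.
Two URLs turn up, and both are usable, because one of them is a backup address.

Next, look at the request parameters of that data packet. A single request does not reveal which parameters vary, so compare it with the request for a second track.

The comparison shows that mainly two parameters change: hash and album_id. The last parameter is a timestamp, and it can also be treated as a constant.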
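
As a rough illustration of how the varying parameters fit together, the sketch below builds the request URL for one track. The endpoint and the parameter names (`r`, `hash`, `album_id`, `_`) are the ones used in the request code later in this article. The helper name `build_play_query` and the sample hash and album id are made up for the example, and the other fixed parameters the real request carries (`dfid`, `mid`, `platid`) are left out for brevity.

```python
import time
from urllib.parse import urlencode

API_URL = 'https://wwwapi.kugou.com/yy/index.php'


def build_play_query(music_hash, album_id, timestamp_ms=None):
    """Return the full request URL for one track (hypothetical helper)."""
    if timestamp_ms is None:
        # Millisecond timestamp, the same 13-digit shape as the value seen in the browser
        timestamp_ms = int(time.time() * 1000)
    params = {
        'r': 'play/getdata',
        'hash': music_hash,
        'album_id': album_id,
        '_': timestamp_ms,
    }
    return API_URL + '?' + urlencode(params)


# Sample values are placeholders, not real track identifiers
print(build_play_query('ABC123', '456', timestamp_ms=1612508120385))
```

Only `hash` and `album_id` need to change from track to track; passing a fixed `timestamp_ms` corresponds to treating the timestamp as a constant.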

3. Find the source of the hash and album_id request parameters

Both parameters are already present in the HTML source of the chart (list) page.

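The track names in that source are escaped and need decoding. We only need the hash and album_id parameters here, so there is no need to take the track name from this page, but the decoding is worth a quick note.

`How to decode text such as \u591c\u591c\u591c\u6f2b\u957f`

string.encode('utf-8').decode('unicode_escape')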
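
To make the one-liner concrete, here is a small sketch. The variable names are illustrative. The `json.loads` variant is not from the original article; it is shown as an alternative because the `unicode_escape` round trip garbles any characters in the string that are already non-ASCII.

```python
import json

# Escaped text as it appears in the page source (the backslashes are literal characters)
raw = r'\u591c\u591c\u591c\u6f2b\u957f'

# The article's approach: reinterpret the \uXXXX escapes
decoded = raw.encode('utf-8').decode('unicode_escape')
# decoded is now the five-character title 夜夜夜漫长
print(decoded == '夜夜夜漫长')

# Alternative (not from the article): wrap the text in quotes and parse it as a JSON string
decoded_json = json.loads('"' + raw + '"')
print(decoded_json == decoded)
```

Both comparisons print True for this input.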

Now that we know hash and album_id are in the page source, we only need the URL of each chart category to scrape the music on all charts.

Requesting the page directly returns the URLs of all categories.

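III. 💥Code implementation

Get the URL and title of every category

import requests
import parsel

def get_response(html_url):
    # Assumed helper (not shown in the article): GET with a browser-like user-agent
    headers = {'user-agent': 'Mozilla/5.0'}
    return requests.get(url=html_url, headers=headers)

def get_type_url(html_url):
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    lis = selector.css('.pc_temp_side ul li')
    for li in lis:
        # Category title
        type_title = li.css('a::attr(title)').get()
        # Category URL
        type_url = li.css('a::attr(href)').get()
        print(f'Scraping {type_title}', type_url)
        # Assumed next step: hand the category page to get_music_info
        get_music_info(type_url)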

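Get the request parameters hash and album_id

import re

def get_music_info(type_url):
    response = get_response(type_url)
    # Grab the global.features array from the page source and decode its \uXXXX escapes
    result = re.findall(r'global\.features = \[(.*?)\]', response.text)[0].encode('utf-8').decode('unicode_escape')
    hash_num = re.findall(r'"Hash":"(.*?)"', result)
    album_id = re.findall(r'"album_id":(\d+),"', result)
    # Pair each hash with the album_id at the same position
    for music_hash, music_id in zip(hash_num, album_id):
        # Assumed next step: request the play data for this track
        get_music_url(music_hash, music_id)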
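
The extraction can be tried offline on a hand-written snippet. The sample string below is invented to mimic the shape the regular expressions expect (a `global.features = [...]` assignment containing "Hash" and "album_id" fields); the real page source will contain more fields and may differ. The function name `extract_pairs` is hypothetical, and the sketch skips the escape decoding step to stay short.

```python
import re

# Invented sample shaped like the page source the article describes
sample_html = (
    'global.features = [{"Hash":"AAA111","album_id":1001,"FileName":"\\u591c"},'
    '{"Hash":"BBB222","album_id":1002,"FileName":"\\u6f2b"}];'
)


def extract_pairs(html_text):
    """Return (hash, album_id) tuples found in the global.features array."""
    match = re.search(r'global\.features = \[(.*?)\]', html_text)
    if match is None:
        # The page layout changed or the request was blocked
        return []
    body = match.group(1)
    hashes = re.findall(r'"Hash":"(.*?)"', body)
    album_ids = re.findall(r'"album_id":(\d+),"', body)
    return list(zip(hashes, album_ids))


print(extract_pairs(sample_html))
```

Using `re.search` with a `None` check avoids the IndexError that `re.findall(...)[0]` raises when the pattern is not found.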

Get the music URL and track name

import requests

def get_music_url(music_hash, album_id):
    page_url = 'https://wwwapi.kugou.com/yy/index.php'
    # dfid and mid are copied from the browser request; '_' is a millisecond timestamp
    params = {
        'r': 'play/getdata',
        'hash': music_hash,
        'dfid': '3ve7aQ2XyGmN0yE3uv3WcaHs',
        'mid': 'ac3836df72c523f46a85d8a5fd90fe59',
        'platid': '4',
        'album_id': album_id,
        '_': '1612508120385',
    }
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=page_url, params=params, headers=headers)
    json_data = response.json()
    music_name = json_data['data']['audio_name']
    music_url = json_data['data']['play_url']
    # Assumed next step: download the track
    save(music_name, music_url)
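
Indexing the response directly raises an exception if the expected fields are missing, for example when a request is rejected. A defensive variant might look like the sketch below. The keys `audio_name` and `play_url` come from the article's code; `play_backup_url` is only an assumption about the name of the backup address mentioned earlier, and the function name `pick_track` and the sample dictionary are invented for illustration.

```python
def pick_track(json_data, backup_key='play_backup_url'):
    """Return (name, url) from a play/getdata response, or None if no URL is present."""
    data = json_data.get('data')
    if not isinstance(data, dict):
        # Error responses may carry something other than a dict here
        return None
    name = data.get('audio_name')
    # Fall back to the backup address when the primary one is empty;
    # backup_key is an assumed field name, adjust it to the real response
    url = data.get('play_url') or data.get(backup_key)
    if not name or not url:
        return None
    return name, url


# Invented sample response
sample = {
    'data': {
        'audio_name': 'demo track',
        'play_url': '',
        'play_backup_url': 'https://example.com/a.mp3',
    }
}
print(pick_track(sample))
```

Returning None lets the caller skip tracks without a playable URL instead of stopping the whole run.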

Save the data locally

import os

def save(music_name, music_url):
    # Folder for the downloads, created on first use
    path = 'music'
    os.makedirs(path, exist_ok=True)
    music_content = get_response(music_url).content
    with open(os.path.join(path, music_name + '.mp3'), mode='wb') as f:
        f.write(music_content)
        print('Saving:', music_name)
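
Track names can contain characters that are not allowed in file names (for example / or ? on Windows), which would make open() fail. One possible way to guard against that is sketched below. The helper names `safe_filename` and `build_path` are hypothetical and not part of the original code.

```python
import os
import re


def safe_filename(name, replacement='_'):
    """Replace characters that Windows rejects in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    return cleaned or 'untitled'


def build_path(folder, music_name):
    """Create the folder if needed and return the target .mp3 path."""
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, safe_filename(music_name) + '.mp3')


print(safe_filename('AC/DC - Who Made Who?'))
```

The result of `build_path('music', music_name)` could be passed to open() in place of the concatenated path.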
