Python Web Scraping in Practice, Part 4: Downloading Light Music from 好聽音樂網 (htqyy.com)

Reposted article. 2019-12-18 20:20

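This is the stage demo for the requests library covered in Chapter 3: scraping the songs listed on the charts of 好聽音樂網 (htqyy.com).

The site has no JS obfuscation and its music resource links follow a regular pattern, which makes it a good first target for scraping beginners.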

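1. First, look at how the home page and the chart URLs relate. We walk through a download by hand, using Chrome's F12 developer tools to inspect the responses, and find the following URL pattern:

The URL of the hot chart is:

http://www.htqyy.com/top/hot

The URL of the new-songs chart is:

http://www.htqyy.com/top/new

From this we can tell that hot, new, recommend, latest, and gedan are the second-level path segments of the individual charts.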

2. Next, analyze the pagination URL pattern inside the hot chart

The resulting pattern: page i of the {index} chart is at http://www.htqyy.com/top/musicList/{index}?pageIndex=(i-1)&pageSize=20
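
The pagination rule above can be sketched as a small helper. The names `CHARTS` and `chart_page_url` are illustrative and do not appear in the original code; only the URL pattern itself comes from the analysis above.

```python
# Chart names observed in the site's URLs (from the analysis above)
CHARTS = ("hot", "new", "recommend", "latest", "gedan")


def chart_page_url(index, page, page_size=20):
    """Return the list URL for 1-based page `page` of chart `index`.

    Hypothetical helper: the site counts pages from 0, so page i
    maps to pageIndex = i - 1.
    """
    if index not in CHARTS:
        raise ValueError("unknown chart: {}".format(index))
    if page < 1:
        raise ValueError("page numbers start at 1")
    return "http://www.htqyy.com/top/musicList/{}?pageIndex={}&pageSize={}".format(
        index, page - 1, page_size
    )


print(chart_page_url("hot", 1))
```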

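3. Then find the download link pattern

On the listening page, click the play button and refresh with F12 open; the Media filter of the Network tab then shows the audio response. Its status code is 206 (Partial Content), which means the server answered a range request with part of the audio file, as is usual for streamed media. This confirms the URL pattern http://f2.htqyy.com/play7/{sid}/mp3/12 and shows that the listening and download URLs are tied together. Here sid, the song's identifier, acts as the primary key, so the sid values collected from the chart pages can be substituted into the download address to build it; /mp3/12 is a fixed string.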
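
To make the sid extraction concrete, here is a minimal sketch. The HTML fragment is a made-up stand-in, since the real page markup is not reproduced in this article; it only assumes that each song entry carries `title="..."` followed by `sid="..."`, which is the layout the regular expressions in the full code rely on. `extract_songs` is a hypothetical name.

```python
import re

# Hypothetical fragment: only the title="..." sid="..." layout is assumed
SAMPLE_HTML = '''
<li><a href="/play/33" title="Song A" sid="33">Song A</a></li>
<li><a href="/play/62" title="Song B" sid="62">Song B</a></li>
'''

# One pattern captures title and sid together, so the pairs stay aligned
PAIR_PATTERN = re.compile(r'title="(.*?)" sid="(.*?)"')


def extract_songs(html):
    """Return a list of (sid, title) tuples found in the chart page HTML."""
    return [(sid, title) for title, sid in PAIR_PATTERN.findall(html)]


for sid, title in extract_songs(SAMPLE_HTML):
    print(sid, title)
```

Capturing both fields with one pattern is a design choice of this sketch: two separate `findall` calls can drift out of step if one attribute appears without the other.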

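Later, while running the crawler, I found that some resources return 401. Opening the page for the affected sid and checking with F12 showed that the download address is not unique: there are two addresses, one per file type (mp3 and m4a):

http://f2.htqyy.com/play7/{sid}/mp3/12 and http://f2.htqyy.com/play7/{sid}/m4a/12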
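
The two address forms can be generated from a sid as follows. This is a small sketch; `DOWNLOAD_BASE` and `download_urls` are illustrative names, and the trailing `/12` is treated as a fixed string, as described above.

```python
DOWNLOAD_BASE = "http://f2.htqyy.com/play7"


def download_urls(sid):
    """Return the candidate download URLs for a song: mp3 first, then m4a."""
    return [
        "{}/{}/{}/12".format(DOWNLOAD_BASE, sid, fmt)
        for fmt in ("mp3", "m4a")
    ]


print(download_urls(33))
```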

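When actually running the crawler, songs often fail to download with a normal 200: the request either returns 401 or raises an exception. By my own analysis there are two causes:

1. The mp3 URL is not valid for that song, and the download has to switch to the m4a URL

2. The interval between requests is too short; it should be set to 1-2 s so that latency and server-side limits do not prevent normal crawling.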
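
Both remedies can be combined in one routine, sketched below under a few assumptions. `download_with_fallback` and `requests_fetch` are hypothetical names; the HTTP call is passed in as a `fetch` callable so the fallback logic can be exercised without touching the network, and the 30-second timeout is an arbitrary choice.

```python
import time


def download_with_fallback(sid, fetch, delay=1.0, sleep=time.sleep):
    """Try the mp3 URL first and fall back to m4a if it does not return 200.

    `fetch` is any callable taking a URL and returning (status_code, body).
    Returns (format, body) on success, or None if both formats fail.
    """
    for fmt in ("mp3", "m4a"):
        url = "http://f2.htqyy.com/play7/{}/{}/12".format(sid, fmt)
        try:
            status, body = fetch(url)
        except OSError as exc:
            # requests' RequestException derives from OSError, so timeouts
            # and connection errors raised by requests end up here
            print("request failed:", exc)
            status, body = None, b""
        # Pause after every request so consecutive requests are spaced out
        sleep(delay)
        if status == 200:
            return fmt, body
    return None


def requests_fetch(url):
    """Adapter for the requests library (imported only when actually used)."""
    import requests
    response = requests.get(url, timeout=30)
    return response.status_code, response.content
```

Calling `download_with_fallback(33, requests_fetch, delay=1.5)` would run it against the real site; the returned format tells the caller which file extension to use when saving.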

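This demo mainly tests Python and general program-design fundamentals; it uses relatively little of the requests library. It also shows that exception handling, URL construction, use of the F12 tools, and attention to performance and code quality matter a great deal in crawler design, and that a well-designed crawler framework matters for real workloads: continuous crawling, avoiding bans, distributed crawling, and higher performance are all genuine business requirements. Knocking out a few crawler demos looks easy, but doing a simple thing well is not simple.

The full code is as follows: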

# SweetLightSpider
import re   # Python's regular expression library
import requests     # Python's requests library
import time
import random       # random choice
from requests import exceptions  # requests' built-in exceptions


class SweetLightMusicSpider:

    def __init__(self, page):
        self.page = page

    # Pick a chart at random, fetch its list pages, and collect each song's
    # sid (used later to build the download URL) and title
    def __getSong(self):
        songID = []
        songName = []
        keyword = ["hot", "new", "recommend", "latest", "gedan"]
        rankform = random.choice(keyword)
        print(rankform)
        for i in range(0, self.page):
            url = "http://www.htqyy.com/top/musicList/"+str(rankform)+"?pageIndex="+str(i)+"&pageSize=20"
            # Fetch the HTML text that carries the song ids and names
            html = requests.get(url, timeout=10)
            strr = html.text

            # Filter the fields out with regular expressions
            pat1 = r'title="(.*?)" sid'
            pat2 = r'sid="(.*?)"'

            id_list = re.findall(pat2, strr)
            title_list = re.findall(pat1, strr)

            # Accumulate the songID / songName lists
            songID.extend(id_list)
            songName.extend(title_list)
        return songID, songName

    def __songdownload(self):
        song_ids, song_names = self.__getSong()
        # zip keeps each sid paired with its title even if the lists differ in length
        for x, (sid, song_name) in enumerate(zip(song_ids, song_names)):
            song_url = "http://f2.htqyy.com/play7/"+str(sid)+"/"+"mp3"+"/12"
            ext = "mp3"
            try:
                response = requests.get(song_url, timeout=30)
                print(response.status_code)

                if response.status_code == 401:
                    # The mp3 address is not valid for this song:
                    # wait 2 s, then try the m4a address instead
                    time.sleep(2)
                    print("Switching to the m4a resource")
                    song_url = "http://f2.htqyy.com/play7/"+str(sid)+"/"+"m4a"+"/12"
                    ext = "m4a"
                    response = requests.get(song_url, timeout=30)
                    print(response.status_code)
            except exceptions.RequestException as e:
                # Network-level failure (timeout, connection error): skip this song
                print(e)
                time.sleep(1)
                continue

            if response.status_code == 200:
                print("Downloading song {0}, title: 《{1}》".format(x+1, song_name))
                with open("E:\\music\\{0}.{1}".format(song_name, ext), "wb") as f:
                    f.write(response.content)
                print("Song {0}: 《{1}》 downloaded".format(x+1, song_name))

            # Pause 1 s between songs so requests are not sent too quickly
            time.sleep(1)

    def music_Spider(self):
        SweetLightMusicSpider.__songdownload(self)


if __name__ == '__main__':
    i = SweetLightMusicSpider(10)
    i.music_Spider()

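With a 1 s interval between requests and a 2 s wait before switching to the m4a URL, test runs essentially eliminated two runtime problems:

1. songs that could not be fetched
2. broken downloads of only a dozen or so KB that cannot be opened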

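In the end, some crawling speed was traded for better download quality.

Finally, checking the downloads on drive E: there are no download errors, every file is between 1 and 3 MB, and no track fails to open.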
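
The manual size check can also be scripted. The sketch below lists files that are suspiciously small; `find_suspect_files` is a hypothetical helper, and the 100 KB threshold is an assumption chosen to sit between the broken files of a dozen or so KB and the normal 1-3 MB range.

```python
from pathlib import Path


def find_suspect_files(folder, min_bytes=100 * 1024):
    """Return the files in `folder` smaller than `min_bytes`, sorted by path.

    The default threshold is an assumption, not a value from the site.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.stat().st_size < min_bytes
    )
```

For the download folder used in this article, `find_suspect_files("E:\\music")` would return the files worth downloading again; an empty list corresponds to the result described above.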

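Open a random track in the NetEase Cloud Music player to verify:

No problems, bingo. Since this is a light music site, it is the instrumental version of Faded... i am faded >_<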
