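I recently put together a demo that scrapes Qianqian Music, so you can download songs without having to install the matching client first. I've only just started with web scraping, so the code may not be great; please go easy on me! Without further ado, let's get into it.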
1. Fetch the home page (get the URL of each chart)
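One thing worth mentioning here is logging in to Qianqian Music. Probably because I was signed in to my Baidu account elsewhere in the browser, clicking "log out" just logged me straight back in. I had planned to log in from code and collect the cookies and other login details, but I couldn't be bothered to clear the cache,
so I simply copied all the request headers over from the packet-capture tool and tweaked them slightly.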
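Copying headers by hand into a dict is error-prone, so here is a minimal sketch of turning the raw "Name: value" text from a capture tool into a dict. The helper name `parse_raw_headers` and the sample text are made up for illustration; they are not part of the original script.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn 'Name: value' lines copied from a capture tool into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a header line
        headers[name.strip()] = value.strip()
    return headers


# Hypothetical sample, shaped like the headers used in this post.
raw = """
Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
Referer: http://music.taihe.com/
"""
headers = parse_raw_headers(raw)
print(headers["Referer"])  # http://music.taihe.com/
```

The resulting dict can be passed as `headers=` to `requests`, the same way the hand-written dict is used in this step.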
# Imports used by the whole script
import json
import os
import time

import requests
from bs4 import BeautifulSoup


# Fetch the home page
def gethomepage():
    # Create a session
    s = requests.Session()
    home_url = 'http://music.taihe.com/'
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Cookie': 'log_sid=1561218778562E9DB28E6A3CDA8ED552F27E3703A9AB4; BAIDUID=E9DB28E6A3CDA8ED552F27E3703A9AB4:FG=1; BDUSS=3AtOE5xTDJnOTBGb2h6UXVYVnZxTEl-Z2VKc0w2V0kyUVV6MmticWxmaHdlVEZkSUFBQUFBJCQAAAAAAAAAAAEAAADQRIc5uqO~3cqvwMMzNjUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHDsCV1w7Aldc; flash_tip_pop=true; tracesrc=-1%7C%7C-1; u_lo=0; u_id=; u_t=; u_login=1; userid=965166288; app_vip=show; Hm_lvt_d0ad46e4afeacf34cd12de4c9b553aa6=1561206432,1561209820; __qianqian_pop_tt=8; Hm_lpvt_d0ad46e4afeacf34cd12de4c9b553aa6=1561218967',
        # 'Host':'music.taihe.com',
        'Referer': 'http://music.taihe.com/',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }
    r = s.get(home_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # Get the URLs of the New Songs chart, the Hot chart and the Internet Songs chart
    list_m = soup.find_all('h4', class_='more-rank')
    for h in list_m:
        bd_url = h.find('a')['href']
        title = h.find('a')['title']
        # The last path segment of the link doubles as the chart's English name
        entitle = h.find('a')['href'].split('/')[-1]
        bd_url = 'http://music.taihe.com' + bd_url
        gotolist(bd_url, headers, s, title, entitle)
2. Get the ID of every song in each chart
# Collect the song IDs of a chart and join them into a comma-separated string
def gotolist(bd_url, headers, s, title, entitle):
    r = s.get(bd_url, headers=headers)
    r.encoding = 'utf8'
    soup = BeautifulSoup(r.text, 'lxml')
    m_list = soup.select('.song-item')
    m_num_list = ''
    for m_num in m_list:
        # Each .song-item holds a link whose last path segment is the song ID
        text = m_num.find('span', class_='song-title').find('a')['href']
        m_num_list += text.split('/')[-1] + ','
    getjson(m_num_list.strip(','), title, entitle)
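The ID handling above can be tried on its own, without hitting the site. This is only a sketch: the helper name `join_song_ids` and the sample hrefs are hypothetical, and it assumes each link ends in the song ID, as the code above does.

```python
def join_song_ids(hrefs):
    """Take the last path segment of each href and join them with commas."""
    ids = [href.rstrip("/").split("/")[-1] for href in hrefs if href]
    return ",".join(ids)


# Hypothetical hrefs; the real ones come from the .song-item links.
sample = ["/song/100001", "/song/100002", "/song/100003"]
print(join_song_ids(sample))  # 100001,100002,100003
```

Using `",".join(...)` avoids the trailing comma that the string-concatenation version has to strip afterwards.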
3. Get the basic information of each song from its ID
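# num is the comma-separated ID string built in gotolist.
# `headers` is assumed to be a request-header dict like the one in gethomepage().
def getjson(num, title, entitle):
    json_url = 'http://play.taihe.com/data/music/songlink'
    formdata = {
        'songIds': num,
        'hq': '0',
        'type': 'm4a,mp3',
        'rate': '',
        'pt': '0',
        'flag': '-1',
        's2p': '-1',
        'prerate': '-1',
        'bwt': '-1',
        'dur': '-1',
        'bat': '-1',
        'bp': '-1',
        'pos': '-1',
        'auto': '-1',
    }
    r = requests.post(json_url, headers=headers, data=formdata)
    # Keep the returned song information in a list
    songlist = json.loads(r.text)['data']['songList']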
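Indexing straight into the JSON raises `KeyError` as soon as one song lacks a link. A more forgiving way to read the response might look like the sketch below. The field names (`data`, `songList`, `linkinfo`, `128`, `songLink`, `songName`, `artistName`) are the ones this post's code uses; the function name `extract_songs` and the sample payload are invented for illustration and are not a real server response.

```python
import json


def extract_songs(payload_text, rate="128"):
    """Return (song name, artist name, link) tuples, skipping songs without a link."""
    data = json.loads(payload_text)
    songs = []
    for song in (data.get("data") or {}).get("songList") or []:
        # linkinfo may be missing or empty for songs that cannot be played.
        link = ((song.get("linkinfo") or {}).get(rate) or {}).get("songLink")
        if not link:
            continue
        songs.append((song.get("songName"), song.get("artistName"), link))
    return songs


# Hypothetical payload with the same shape the script expects.
sample = json.dumps({
    "data": {
        "songList": [
            {"songName": "Song A", "artistName": "Artist A",
             "linkinfo": {"128": {"songLink": "http://example.com/a.mp3"}}},
            {"songName": "Song B", "artistName": "Artist B", "linkinfo": None},
        ]
    }
})
print(extract_songs(sample))
```

Only the first sample song comes back, because the second has no `linkinfo`.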
4. Loop over the songs and download them
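In this step, the timeout = 500 argument in the line r = requests.get(music_url, timeout = 500)
has to be there; the number is in seconds and can be adjusted to your situation. When I left it out, the server closed the connection partway through the download and the script raised an error.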
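Since a dropped connection can still happen with a timeout set, one option is to retry the download a few times. The sketch below is not from the original script: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the fetch function is passed in so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times, pausing between failed attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        # Built-in connection/timeout errors are OSError subclasses, and
        # requests' RequestException derives from IOError (an alias of OSError).
        except OSError as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(delay)
    raise last_error


# Fake fetch function that fails twice and then succeeds.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionResetError("connection closed by server")
    return b"fake mp3 bytes"


content = fetch_with_retry(flaky_fetch, "http://example.com/a.mp3", delay=0)
print(calls["count"], content)
```

With `requests`, the fetch argument could be something like `lambda u: requests.get(u, timeout=(5, 60)).content`, where the tuple is (connect timeout, read timeout) in seconds. Note that `timeout` limits how long `requests` waits for the server to respond, not the total download time.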
    # (continuation of getjson)
    # Loop over the list and find each song's download/playback URL
    for song in songlist:
        music_url = song['linkinfo']['128']['songLink']
        print(music_url)
        # Create the parent directory
        dirname = 'paihangbang'
        if not os.path.exists(dirname):
            os.mkdir(dirname)
        # Create the directory for this chart
        dirname = dirname + '/' + entitle + '/'
        if not os.path.exists(dirname):
            os.mkdir(dirname)
        try:
            # Name the file "<song name>-<artist name>.mp3"
            filename = dirname + str(song['songName']) + '-' + str(song['artistName']) + '.mp3'
            r = requests.get(music_url, timeout=500)
            with open(filename, 'wb') as fp:
                fp.write(r.content)
        except FileNotFoundError as e:
            print(filename + ' not found!')
        # Pause briefly between downloads
        time.sleep(1)
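My guess is that the `FileNotFoundError` caught above mostly comes from song or artist names containing characters such as `/`, which `open()` treats as part of the path. A small sketch of cleaning the name first, assuming that is indeed the cause; `safe_filename` is a hypothetical helper, not part of the original code.

```python
import re


def safe_filename(song_name, artist_name, ext=".mp3"):
    """Build '<song>-<artist><ext>' with path-unsafe characters replaced by '_'."""
    raw = f"{song_name}-{artist_name}"
    # These characters are invalid in Windows file names; '/' also breaks paths elsewhere.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", raw).strip()
    return (cleaned or "untitled") + ext


print(safe_filename("A/B", "X:Y"))  # A_B-X_Y.mp3
```

The result can be appended to `dirname` in place of the hand-built file name.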
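That is all of the code. After a successful download, the directory looks like this: a paihangbang folder with one subfolder per chart, each holding the downloaded .mp3 files.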