Introduction: this is my first scraper, so please point out anything I got wrong. The content builds on another author's scraping write-up; my additions on top are fetching the song playback URL and storing everything in a MySQL database.
Software used: Navicat for MySQL (a cracked copy) and Postman
Data scraped: lyrics, song, artist, playback URL, cover image, song duration, song file size, and so on
Modules the scraper uses:

import time
import pymysql
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import json
Approach, and the problems to solve:
1·The hash and mid "encryption" problem
2·A neat way around the playback-URL and song-detail request paths
3·Encoding of the fetched data
4·Reading localStorage and cookie values from the song page
5·Connecting to MySQL and storing the data
To walk through how these were solved, take one song URL as the example:
URL:https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery19108013258872165683_1631704461109&hash=BC4E172CF13BB79303203A48246D84E1&dfid=2C8TCD3wtvFq3A2P4h4Slbtf&appid=1014&mid=ee9a0573ca7b9cda6b916c684b10b6da&platid=4&album_id=38915273&_=1631704461111
Where this URL came from we'll leave aside for the moment; first, let's analyze this GET request.
·Parameters involved:
1·hash:BC4E172CF13BB79303203A48246D84E1
2·dfid:2C8TCD3wtvFq3A2P4h4Slbtf
3·appid:1014
4·mid:ee9a0573ca7b9cda6b916c684b10b6da
5·platid:4
6·_:1631704461111
7·callback:jQuery19108013258872165683_1631704461109
Then open this request in Postman; it returns the result below.
Decoded as JSON:

jQuery19108013258872165683_1631704461109({
"status": 1,
"err_code": 0,
"data": {
"hash": "BC4E172CF13BB79303203A48246D84E1",
"timelength": 168000,
"filesize": 2701932,
"audio_name": "傅夢彤、安蘇羽 - 潮汐 (Natural)",
"have_album": 1,
"album_name": "潮汐 (Natural)",
"album_id": "38915273",
"img": "http://imge.kugou.com/stdmusic/20201204/20201204164503970613.jpg",
"have_mv": 1,
"video_id": "4709291",
"author_name": "傅夢彤、安蘇羽",
"song_name": "潮汐 (Natural)",
"lyrics": "[id:$00000000]\r\n[ar:傅夢彤、安蘇羽]\r\n[ti:潮汐 (Natural)]\r\n[by:]\r\n[hash:bc4e172cf13bb79303203a48246d84e1]\r\n[al:]\r\n[sign:]\r\n[qq:]\r\n[total:168829]\r\n[offset:0]\r\n[00:00.08]傅夢彤、安蘇羽 - 潮汐 (Natural)\r\n[00:00.87]作詞:安蘇羽、舒心\r\n[00:01.12]混音:謝驍\r\n[00:21.12]當海面迎來洶涌的潮汐\r\n[00:23.55]我奔跑尋找昔日的足跡\r\n[00:26.18]夕陽下倒影迷人的美麗\r\n[00:28.71]可我卻丟失故事和你\r\n[00:31.34]你說過向往大海的神秘\r\n[00:33.92]也憧憬我們遺失的過去\r\n[00:36.50]分享給大海秘密\r\n[00:39.74]藍色的海底\r\n[00:42.27]遠山的風景\r\n[00:45.16]我們的距離遙不可及\r\n[00:50.02]退守的愛情\r\n[00:52.70]還剩下回憶\r\n[00:55.03]瘋狂地尋覓你的身影\r\n[01:00.69]殘月憂郁\r\n[01:03.07]星夜靜謐\r\n[01:05.60]潮落嘆息\r\n[01:11.04]聆聽山語\r\n[01:13.42]回盪不清\r\n[01:15.91]若即若離\r\n[01:23.05]當海面迎來洶涌的潮汐\r\n[01:25.53]我奔跑尋找昔日的足跡\r\n[01:28.11]夕陽下倒影迷人的美麗\r\n[01:30.65]可我卻丟失故事和你\r\n[01:33.28]你說過向往大海的神秘\r\n[01:35.86]也憧憬我們遺失的過去\r\n[01:38.40]分享給大海秘密\r\n[01:41.59]藍色的海底\r\n[01:44.27]遠山的風景\r\n[01:46.90]我們的距離遙不可及\r\n[01:52.06]退守的愛情\r\n[01:54.59]還剩下回憶\r\n[01:57.12]瘋狂地尋覓你的身影\r\n[02:02.59]殘月憂郁\r\n[02:04.92]星夜靜謐\r\n[02:07.55]潮落嘆息\r\n[02:12.97]聆聽山語\r\n[02:15.45]回盪不清\r\n[02:17.87]若即若離\r\n[02:23.24]殘月憂郁\r\n[02:25.57]星夜靜謐\r\n[02:28.15]潮落嘆息\r\n[02:33.65]聆聽山語\r\n[02:36.07]回盪不清\r\n[02:38.50]若即若離\r\n",
"author_id": "968893",
"privilege": 8,
"privilege2": "1000",
"play_url": "https://webfs.ali.kugou.com/202109151914/04b52e1a2976cec776bee89f7690b6ae/G226/M05/18/14/gocBAF87mjyAfkS4ACk6bFP5o1Q758.mp3",
"authors": [
{
"author_id": "968893",
"author_name": "傅夢彤",
"is_publish": "1",
"sizable_avatar": "http://singerimg.kugou.com/uploadpic/softhead/{size}/20210610/20210610031307109445.jpg",
"avatar": "http://singerimg.kugou.com/uploadpic/softhead/400/20210610/20210610031307109445.jpg"
},
{
"author_id": "87264",
"author_name": "安蘇羽",
"is_publish": "1",
"sizable_avatar": "http://singerimg.kugou.com/uploadpic/softhead/{size}/20190321/20190321201004866434.jpg",
"avatar": "http://singerimg.kugou.com/uploadpic/softhead/400/20190321/20190321201004866434.jpg"
}
],
"is_free_part": 0,
"bitrate": 128,
"recommend_album_id": "38915273",
"audio_id": "80133277",
"has_privilege": true,
"play_backup_url": "https://webfs.cloud.kugou.com/202109151914/b75e58fc8b2b687b5a4a65df5d319aa1/G226/M05/18/14/gocBAF87mjyAfkS4ACk6bFP5o1Q758.mp3"
}
});
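A note on the shape of that response: it is JSONP, i.e. the JSON payload wrapped in a call to the function named by the callback parameter. Before parsing it in code you can either drop callback (in my tests the endpoint then returns plain JSON, which is what the script later in this post relies on when it calls .json()), or strip the wrapper yourself. A minimal sketch of the stripping, assuming the raw response text is in resp_text:

import json

def strip_jsonp(resp_text):
    # Keep only what sits between the callback's opening '(' and the final ')'
    start = resp_text.index('(') + 1
    end = resp_text.rindex(')')
    return json.loads(resp_text[start:end])

# data = strip_jsonp(resp.text)
# print(data['data']['play_url'])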
Using Postman to delete parameters from the URL above one at a time, we end up with: https://wwwapi.kugou.com/yy/index.php?r=play/getdata&hash=BC4E172CF13BB79303203A48246D84E1&appid=1014&mid=ee9a0573ca7b9cda6b916c684b10b6da&album_id=38915273
This request returns exactly the same data as before. Interesting! So by elimination, the following parameters alone are enough to fetch the data we want.
The 4 required parameters (verified by the sketch after this list):
1·r=play/getdata
2·hash=BC4E172CF13BB79303203A48246D84E1
3·mid=ee9a0573ca7b9cda6b916c684b10b6da
4·album_id=38915273
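To double-check, here is a minimal sketch that rebuilds the request from just these four parameters (values copied from the example above):

import requests

params = {
    'r': 'play/getdata',                          # fixed value
    'hash': 'BC4E172CF13BB79303203A48246D84E1',   # identifies the track
    'mid': 'ee9a0573ca7b9cda6b916c684b10b6da',    # the kg_mid value
    'album_id': '38915273',
}
resp = requests.get('https://wwwapi.kugou.com/yy/index.php', params=params)
print(resp.json()['data']['song_name'], resp.json()['data']['play_url'])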
Next, let's work out roughly what each parameter does, so we know where to scrape it from. hash and mid look like hash ("encrypted") values. We fetch one track at a time, and each page (we're scraping the play-detail pages behind the Kugou chart) holds exactly one track, so hash and mid should each map to a single track, while r is a fixed value.
That leaves just one parameter: album_id.
Next up: where to find these parameters.
Target URL: https://www.kugou.com/yy/rank/home/
Target: every track on the Kugou Music Surging Chart
Open any track; I opened the first one: https://www.kugou.com/song/1lnk4m0b.html#hash=C2BF05871AC21B3928F63AB8EE3890EF&album_id=48707908
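Notice, by the way, that this page URL already carries hash and album_id in its #fragment, so those two can in principle be parsed straight out of the link (the script below reads them from localStorage instead, which conveniently also yields kg_mid). A small sketch:

from urllib.parse import urlparse, parse_qs

song_url = ('https://www.kugou.com/song/1lnk4m0b.html'
            '#hash=C2BF05871AC21B3928F63AB8EE3890EF&album_id=48707908')
fragment = urlparse(song_url).fragment        # 'hash=...&album_id=...'
params = parse_qs(fragment)
print(params['hash'][0], params['album_id'][0])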
Press F12, refresh, and under the Media tab you'll see one audio request: https://webfs.ali.kugou.com/202109161024/9fe5a6a564d487a4701d1454223ea008/KGTX/CLTX001/c2bf05871ac21b3928f63ab8ee3890ef.mp3
Screenshot below:
See that? A sneaky little cookie is grinning at you: kg_mid. Let's hunt down the other parameters.
Since the cookies held data, I had a look at Local Storage as well, and sure enough it handed me exactly what I needed (screenshot below):
And with that, every parameter is in hand.
All that's left is to open the page from Python in a browser, read the localStorage data, fire off the request, and save the results to MySQL. Let's try the first 20 tracks!
tips: reading localStorage needs chromedriver.exe to drive Chrome; grab a build online whose version matches your installed Chrome. I won't go into the details here.
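The script below passes executable_path directly, which works on Selenium 3.x. If you are on Selenium 4+, where executable_path was removed, the equivalent is roughly this (a sketch; the path is illustrative, point it at your own chromedriver.exe):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service(r'C:\path\to\chromedriver.exe')  # hypothetical path
browser = webdriver.Chrome(service=service)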
Here's the full source code; the MySQL connection logic is all in there if that's what you're after.

import time
import pymysql
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0'}
# Fixed values; callback and _ turned out to be optional
# dfid = '2C8TCD3wtvFq3A2P4h4Slbtf'
appid = '1014'
# mid = 'ee9a0573ca7b9cda6b916c684b10b6da'
platid = '4'
r = 'play/getdata'


def top(conn, cursor, url):
    # Scrape one chart page and hand every song link to the detail fetcher
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    nums = soup.select('.pc_temp_num')
    titles = soup.select('.pc_temp_songname')
    hrefs = soup.select('.pc_temp_songname')  # the same anchors carry the links
    durations = soup.select('.pc_temp_time')
    for num, title, duration, href in zip(nums, titles, durations, hrefs):
        data = {
            'NO': num.get_text().strip(),
            'titles': title.get_text(),
            'time': duration.get_text().strip(),
            'href': href.get('href')}
        print(data)
        GetCurrentPageLocalStorage(conn, cursor, href.get('href'))


# Read jStorage and kg_mid from the song page's localStorage
# (the cookies hold dfid as well, but it is not needed for the request)
def GetCurrentPageLocalStorage(conn, cursor, href):
    browser = webdriver.Chrome(
        executable_path=r'C:\Users\Administrator\AppData\Local\Programs\Python\Python39\chromedriver.exe')
    try:
        browser.get(href)
        kg_mid = browser.execute_script("return localStorage.getItem('kg_mid')")
        jStorage = browser.execute_script("return localStorage.getItem('jStorage')")
        cookies = browser.get_cookies()  # dfid lives in here if you ever need it
    finally:
        browser.quit()  # don't leave one Chrome instance running per song
    # k_play_list is itself a JSON string stored inside jStorage, hence the double decode
    playList = json.loads(json.loads(jStorage)['k_play_list'])
    mid = kg_mid
    hash = playList[0]['hash']
    album_id = playList[0]['album_id']
    html = requests.get(
        url='https://wwwapi.kugou.com/yy/index.php?r=play/getdata&hash=' + str(hash)
            + '&mid=' + str(mid) + '&platid=' + str(platid) + '&album_id=' + str(album_id),
        headers=headers)
    Deposit(conn, cursor, html.json())


# Create the table
def createTable(con, cs):
    # lyrics is TEXT rather than varchar(2000): full LRC lyrics easily exceed 2000 chars
    cs.execute("create table if not exists kugouMusic (audio_name varchar(1000), img varchar(1000), song_name varchar(100) primary key,\
        timelength varchar(100), filesize varchar(100), language varchar(100), \
        video_id varchar(100), \
        author_name varchar(100), album_id varchar(100), play_backup_url varchar(500), lyrics text, play_url varchar(1000))")
    # Commit the transaction
    con.commit()


# Store one Kugou track: audio_name img song_name timelength filesize language
# video_id author_name album_id play_backup_url lyrics play_url
def Deposit(con, cs, data):
    description = data['data']
    lyrics = description['lyrics']
    img = description['img']
    song_name = description['song_name']
    timelength = description['timelength']
    filesize = description['filesize']
    language = description['privilege']  # the response has no language field; privilege is stored here
    video_id = description['video_id']
    album_id = description['album_id']
    play_backup_url = description['play_backup_url']
    play_url = description['play_url']
    author_name = description['author_name']
    audio_name = description['audio_name']
    try:
        cs.execute(
            'insert into kugouMusic values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
            (audio_name, img, song_name, timelength, filesize, language, video_id,
             author_name, album_id, play_backup_url, lyrics, play_url))
    except pymysql.IntegrityError:
        pass  # duplicate song_name (the primary key): skip the row
    finally:
        con.commit()


if __name__ == '__main__':
    urls = {'http://www.kugou.com/yy/rank/home/{}-8888.html'.format(str(i)) for i in range(1, 2)}
    # Connect to the MySQL database
    conn = pymysql.connect(host='127.0.0.1', user='root', password='123', db='music', charset='utf8')
    cursor = conn.cursor()
    createTable(conn, cursor)
    for url in urls:
        time.sleep(5)  # be polite between chart pages
        top(conn, cursor, url)
    # Close the database connection
    cursor.close()
    conn.close()
    print('All page URLs scraped!')
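Once the run finishes, a quick way to sanity-check what landed in MySQL (a sketch, reusing the same connection settings as above):

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', password='123',
                       db='music', charset='utf8')
cursor = conn.cursor()
cursor.execute('select song_name, author_name, play_url from kugouMusic limit 5')
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()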