>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
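This one is simple, so only the basic workflow is recorded here.
Page analysis
Open a video page and capture the network traffic.
Filter the captured requests for addresses containing 'list'.
The danmaku (bullet comments) are served from addresses like these: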
https://api.bilibili.com/x/v1/dm/list.so?oid=193943133
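https://api.bilibili.com/x/v1/dm/list.so?oid=194941323 (the danmaku address of a different video, included to show which parameter changes)
Only the oid parameter changes, so each video corresponds to one oid.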
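Since only the oid differs between the two addresses, building a danmaku address reduces to filling in one query parameter. A minimal sketch, assuming the endpoint stays exactly as captured above; the helper name `build_danmaku_url` is made up for illustration:

```python
from urllib.parse import urlencode

# Endpoint as observed in the captured requests.
DANMAKU_ENDPOINT = "https://api.bilibili.com/x/v1/dm/list.so"


def build_danmaku_url(oid):
    """Return the danmaku address for a single oid (hypothetical helper)."""
    return "{}?{}".format(DANMAKU_ENDPOINT, urlencode({"oid": oid}))


# The two oids seen during the capture.
for oid in (193943133, 194941323):
    print(build_danmaku_url(oid))
```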
Getting the oid
https://api.bilibili.com/x/player/pagelist?bvid=BV1T5411x7y3&jsonp=jsonp
This address returns a JSON string.
Removing &jsonp=jsonp leaves the returned data unchanged.
The returned data looks like this:
{
code: 0,
message: "0",
ttl: 1,
data: [
{
cid: 193943133,
page: 1,
from: "vupload",
part: "咖喱愛情",
duration: 221,
vid: "",
weblink: "",
dimension: {
width: 3840,
height: 2160,
rotate: 0
}
}
]
}
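The cid can also be read with a JSON parser instead of a regular expression. The sketch below works on a hard-coded copy of the response shown above, so it runs offline. It assumes the raw response is standard JSON with quoted keys (the listing above looks like a browser-formatted view), and the helper name `extract_cids` is made up for illustration:

```python
import json

# Copy of the response shown above, written as standard JSON (quoted keys assumed).
SAMPLE_PAGELIST = """
{"code": 0, "message": "0", "ttl": 1,
 "data": [{"cid": 193943133, "page": 1, "from": "vupload", "part": "咖喱愛情",
           "duration": 221, "vid": "", "weblink": "",
           "dimension": {"width": 3840, "height": 2160, "rotate": 0}}]}
"""


def extract_cids(pagelist_text):
    """Return the cid of every entry in the "data" array (hypothetical helper)."""
    payload = json.loads(pagelist_text)
    return [entry["cid"] for entry in payload["data"]]


print(extract_cids(SAMPLE_PAGELIST))
```

Returning a list rather than a single value keeps the sketch usable if "data" ever holds more than one entry; the sample here has exactly one.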
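The cid in this response is identical to the oid in the danmaku address.
In https://api.bilibili.com/x/player/pagelist?bvid=BV1T5411x7y3, the value after bvid is the video's ID,
which matches the last path segment of the video page URL https://www.bilibili.com/video/BV1T5411x7y3.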
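Because the bvid is simply the path segment after /video/, it can be pulled from the page URL with the standard URL parser, which also discards any query string or trailing slash. A minimal sketch; the helper name `extract_bvid` is made up, and the query string in the second call is a hypothetical example, not something captured from the site:

```python
from urllib.parse import urlparse


def extract_bvid(video_url):
    """Return the path segment that follows "video" (hypothetical helper)."""
    path = urlparse(video_url).path  # e.g. "/video/BV1T5411x7y3"
    segments = [segment for segment in path.split("/") if segment]
    if len(segments) < 2 or segments[0] != "video":
        raise ValueError("not a video page URL: {}".format(video_url))
    return segments[1]


print(extract_bvid("https://www.bilibili.com/video/BV1T5411x7y3"))
# A trailing slash or query string does not end up in the result.
print(extract_bvid("https://www.bilibili.com/video/BV1T5411x7y3/?from=search"))
```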
Summary: use bvid=BV1T5411x7y3 to build https://api.bilibili.com/x/player/pagelist?bvid=BV1T5411x7y3,
read the oid (that is, the cid) from the response, then use it to build the danmaku request address: https://api.bilibili.com/x/v1/dm/list.so?oid=193943133
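Takeaway: when you cannot find the data entry point yourself, look online for ideas first. The only hard part of this crawler is locating the danmaku endpoint; once that is found, the rest is straightforward.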
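The last step, turning the danmaku response into a list of comments, can be tried offline. The script reads the text of every <d> element, so the sketch below does the same with the standard library's XML parser. The sample document, including its <i> root element and its comment text, is invented for illustration and is not a captured response; the helper name `parse_danmaku` is also made up:

```python
import xml.etree.ElementTree as ET

# Invented sample; only the use of <d> elements is taken from the script.
SAMPLE_XML = """<i>
  <d>first comment</d>
  <d>second comment</d>
  <d></d>
</i>"""


def parse_danmaku(xml_text):
    """Return the text of every non-empty <d> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter("d") if d.text]


print(parse_danmaku(SAMPLE_XML))
```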
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
import re
from pprint import pprint

import requests
from lxml import etree

bvid_url = 'https://www.bilibili.com/video/BV1T5411x7y3'
# Extract the video ID (bvid) from the page URL.
bvid = re.findall(r"video/(\S+)", bvid_url, re.S)[0]
oid_url = "https://api.bilibili.com/x/player/pagelist?bvid={}".format(bvid)
headers = {
    "User-Agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    "referer": "https://www.bilibili.com/video/BV1T5411x7y3"
}
# Request the page list and take the first cid found in the JSON text.
response = requests.get(oid_url, headers=headers).content.decode()
oid = re.findall(r'"cid":(.*?),', response, re.S)[0]
danmu_url = "https://api.bilibili.com/x/v1/dm/list.so?oid={}".format(oid)
# Fetch the danmaku document and collect the text of every <d> element.
response = requests.get(danmu_url, headers=headers).content
html = etree.HTML(response)
d_list = html.xpath("//d/text()")
pprint(d_list)