[Python] Downloading videos from China University MOOC (icourse163.org)
Script goal:
Given a course ID and a cookie, download all of a course's video files so they are easy to watch when reviewing.
Analysis of the site's anti-scraping mechanisms:
Goal of the traffic analysis: find the path that leads to the m3u8 file.
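1. From the very first step of the traffic analysis it is clear that the developers built in anti-scraping measures from the start. Opening the developer tools traps the page in an endless loop on a debugger statement. For how this is written and why it works, see the article 【如何防止頁面被調試_小敏哥的專欄-CSDN博客_網頁禁止調試】. Deactivating breakpoints is enough to continue debugging and to inspect the requests in the Network panel.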
2. Search the captured requests for the keyword m3u8 to find the link that downloads the m3u8 file:
https://mooc2vod.stu.126.net/nos/hls/2019/03/13/1214418097_fe38e0e942144bxxxxxxxxxxxx8ef_sd.m3u8?ak=7909bff134372bffca53cdc2c17adc27a4c38c6336120510aea1ae1790819de820f66b1081b2dbb1d6300ca9e91c8b349a14ab1e5b4e06c0887fe54fe47de9823059f726dc7bb86b92adbc3d5b34b1320647b25cf54eb8ac6ed1f0d7db7826b19bb0a5ea14ff29775bd482caa79ccf8b
Dropping the query string gives:
https://mooc2vod.stu.126.net/nos/hls/2019/03/13/1214418097_fe38e0e942144bxxxxxxxxxxxx8ef_sd.m3u8
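Fetching this with a GET request and inspecting the result shows that the ts entries inside are relative names that have to be joined onto the playlist's base URL, and that the files are not encrypted. The segments can therefore be downloaded simply by sending requests with the requests library. The next step is to find out what the request for this link is built from.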
3. Traffic analysis shows that the m3u8 download link appears in the response of another request.
The link is: https://vod.study.163.com/eds/api/v1/vod/video?videoId=1215086738&signature=75584967794f373450387775426e39512b565a6d4a4643384a366b3743766e556474475a7454466252672b6c5252306a744e585a61365031766269547650504454724750544c683252715a475876585962727a536f7431736a596d4f7843593577306d714654534864385873446e5470397a5675537a7233486836336d7a4355576e506745582b47762f2b37574b6972745365744d673d3d&clientType=1
The link is built from two variables, videoId and signature, which must correspond to each other or the request fails. It is also served over HTTP/2, which effectively rules out the requests library, since requests does not support HTTP/2. The next step is to find each videoId and its matching signature.
4. Traffic analysis shows that the signature value is contained in the response of the URL below. It has to be requested over HTTP/2, with a POST body of “bizId=1268148xxx&bizType=1&contentType=1”:
https://www.icourse163.org/web/j/resourceRpcBean.getResourceToken.rpc?csrfKey=7d56f95de17346eeb0e0fd1d2d6b5751
The values of bizId and csrfKey now need to be found.
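5. Traffic analysis shows that the bizId, the name of each lesson, and each lesson's videoId (contentId) are all in the response of the URL below. It has to be requested over HTTP/2, with a POST body of “termId: 1465388xxx”: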
https://www.icourse163.org/web/j/courseBean.getLastLearnedMocTermDto.rpc?csrfKey=7d56f95de17346eeb0e0fd1d2d6b575
The submitted value is the same as the value after tid in the course URL, so the only variable left to fill in is csrfKey.
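6. The csrfKey is stored in the NTESSTUDYSI= entry of the cookie, so the cookie alone is enough to obtain it. At this point the cookie and the course ID (tid) are all that is needed to get the m3u8 download link for every lesson.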
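As a small illustration of step 6, here is a minimal sketch of pulling the csrfKey out of a raw cookie string. The function name extract_csrf_key is my own and not part of the script. It splits on ";" instead of using a regex so that it still works when NTESSTUDYSI happens to be the last entry in the cookie.

```python
def extract_csrf_key(cookie_header: str, name: str = "NTESSTUDYSI") -> str:
    """Return the value of the `name` entry in a raw Cookie header string."""
    for part in cookie_header.split(";"):
        # Each part looks like "key=value"; partition keeps any "=" inside the value.
        key, sep, value = part.strip().partition("=")
        if sep and key == name:
            return value
    raise KeyError(f"{name} not found in cookie")


if __name__ == "__main__":
    # Made-up cookie string, for illustration only.
    print(extract_csrf_key("foo=1; NTESSTUDYSI=abc123; bar=2"))
```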
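What we know so far:
1. The site's anti-scraping mechanisms
2. The video ts files are not encrypted
3. The site uses HTTP/2, so httpx is used instead of requests
4. Once the m3u8 download link is obtained, it is handed straight to the m3u8 downloader from the previous post, adapted and glued on; good enough as long as it works
5. Although the debugging session showed the message “非法的跨域請求” (illegal cross-origin request), the server does not check the Referer value, only the cookie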
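Because the segments are unencrypted and the playlist lists them by relative name, turning a playlist into a list of download URLs is just string handling. The sketch below shows one way to do it with the standard library. The helper names strip_query and segment_urls and the example.com URL are made up for illustration, and it assumes a simple media playlist with one segment per non-comment line.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop everything after the '?' (for example the ak=... token)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def segment_urls(playlist_url: str, playlist_text: str) -> list[str]:
    """Resolve every segment line of an m3u8 playlist against the playlist URL."""
    base = strip_query(playlist_url)
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and #EXT... tags
        urls.append(urljoin(base, line))
    return urls


if __name__ == "__main__":
    sample = "#EXTM3U\n#EXTINF:10.0,\nvideo_sd0.ts\n#EXTINF:10.0,\nvideo_sd1.ts\n#EXT-X-ENDLIST\n"
    for u in segment_urls("https://example.com/hls/video_sd.m3u8?ak=token", sample):
        print(u)
```

urljoin also copes with playlists that list absolute URLs, which plain string concatenation would not.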
Script approach:
1. Get your own cookie and extract the NTESSTUDYSI= value from it to obtain csrfKey
2. Send an HTTP/2 POST to “https://www.icourse163.org/web/j/courseBean.getLastLearnedMocTermDto.rpc?csrfKey=csrfKey” with the POST body termId: tid. The response contains each lesson's name, each lesson's videoId (contentId), and bizId (id)
3. Send an HTTP/2 POST (this endpoint has a cross-origin restriction) to “https://www.icourse163.org/web/j/resourceRpcBean.getResourceToken.rpc?csrfKey=csrfKey” with the POST body bizId=bizId&bizType=1&contentType=1 to obtain signature
4. Send an HTTP/2 GET to https://vod.study.163.com/eds/api/v1/vod/video?videoId=videoId&signature=signature&clientType=1 to obtain videoUrl, which is an m3u8 download address. Remove everything after the question mark and it can be downloaded by requesting it directly
5. Glue on the script written last time
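To make the three requests in steps 2 to 4 concrete, the sketch below only builds the URLs and form bodies and sends nothing. The function names (term_request, token_request, video_url) are mine. The endpoints and field names are the ones observed in the analysis above and may change whenever the site does.

```python
from urllib.parse import urlencode

WEB_BASE = "https://www.icourse163.org/web/j"
VOD_URL = "https://vod.study.163.com/eds/api/v1/vod/video"


def term_request(csrf_key, term_id):
    """Step 2: POST that returns lesson names, contentId (videoId) and id (bizId)."""
    url = f"{WEB_BASE}/courseBean.getLastLearnedMocTermDto.rpc?" + urlencode({"csrfKey": csrf_key})
    return url, {"termId": str(term_id)}


def token_request(csrf_key, biz_id):
    """Step 3: POST that returns the signature for one bizId."""
    url = f"{WEB_BASE}/resourceRpcBean.getResourceToken.rpc?" + urlencode({"csrfKey": csrf_key})
    return url, {"bizId": str(biz_id), "bizType": "1", "contentType": "1"}


def video_url(video_id, signature):
    """Step 4: GET URL whose response carries videoUrl (the m3u8 address)."""
    query = urlencode({"videoId": video_id, "signature": signature, "clientType": 1})
    return f"{VOD_URL}?{query}"


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(term_request("csrf", 123))
    print(token_request("csrf", 456))
    print(video_url(789, "sig"))
```

Each (url, body) pair can then be passed to an httpx client created with http2=True, for example client.post(url, data=body, headers={"cookie": cookie}). Note that httpx only speaks HTTP/2 when it is installed with the http2 extra (pip install "httpx[http2]").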
Final working code:
import asyncio
import os
import re
import time

import aiofiles
import aiohttp
import httpx
import requests

# csrfKey is the value of the NTESSTUDYSI cookie entry (also matches when it is the last entry)
CSRF_RE = re.compile(r"NTESSTUDYSI=(?P<csrfkey>[^;]*)")
# The patterns below are applied to str() of the parsed chapter list, hence the single quotes
BIZ_ID_RE = re.compile(r"'units': \[\{'id':(?P<id>.*?),")
VIDEO_ID_RE = re.compile(r"'contentType': 1, 'contentId':(?P<video_id>.*?),")
# The patterns below are applied to raw JSON response text
SIGNATURE_RE = re.compile(r'"signature":"(.*?)",')
M3U8_RE = re.compile(r'"videoUrl":"(?P<m3u8>.*?)\?ak')
NAME_RE = re.compile(r'"name":"(.*?)",')


def init():
    """Create the working directory and load the cookie, prompting once if it is missing."""
    os.makedirs("./temp_data", exist_ok=True)
    if not os.path.exists("./cookie.txt"):
        cookie = input("No cookie found, please enter your cookie> ")
        with open("./cookie.txt", "w") as f:
            f.write(cookie)
    with open("./cookie.txt", "r") as f:
        # Strip the trailing newline so the value is a valid header
        return f.read().strip()


def get_video_id(csrfkey):
    """Fetch the lesson list for a term and return (biz_ids, video_ids)."""
    tid = input("Enter the course tid> ").strip()
    data = {"termId": tid}
    with httpx.Client(http2=True) as client:
        resp = client.post(
            f"https://www.icourse163.org/web/j/courseBean.getLastLearnedMocTermDto.rpc?csrfKey={csrfkey}",
            headers=header,
            data=data,
        )
    shu_ju = str(resp.json()["result"]["mocTermDto"]["chapters"])
    biz_ids = [s.strip() for s in BIZ_ID_RE.findall(shu_ju)]
    # Units without a video have contentId None; drop them
    video_ids = [s.strip() for s in VIDEO_ID_RE.findall(shu_ju) if s.strip() != "None"]
    print("Got bizId and videoId lists")
    return biz_ids, video_ids


def get_signature(biz_ids, csrfkey):
    """Request one signature per bizId."""
    signatures = []
    with httpx.Client(http2=True) as client:
        for biz_id in biz_ids:
            data = {"bizId": biz_id, "bizType": "1", "contentType": "1"}
            resp = client.post(
                f"https://www.icourse163.org/web/j/resourceRpcBean.getResourceToken.rpc?csrfKey={csrfkey}",
                headers=header,
                data=data,
            )
            signatures += SIGNATURE_RE.findall(resp.text)
    print("Got signatures")
    return signatures


def get_m3u8_url(video_ids, signatures):
    """Resolve each (videoId, signature) pair to an m3u8 URL and an output name."""
    m3u8_urls = []
    merge_names = []
    with httpx.Client(http2=True) as client:
        for video_id, signature in zip(video_ids, signatures):
            resp = client.get(
                f"https://vod.study.163.com/eds/api/v1/vod/video?videoId={video_id}&signature={signature}&clientType=1",
                headers=header,
            )
            time.sleep(2)  # throttle the requests
            merge_names.append(", ".join(NAME_RE.findall(resp.text)))
            m3u8_urls.append(M3U8_RE.findall(resp.text)[0])
    print("Output file names:", merge_names)
    return m3u8_urls, merge_names


async def download_ts(file_name, download_url, session):
    """Download a single ts segment into temp_data/."""
    async with session.get(download_url, headers=header) as resp:
        async with aiofiles.open(f"temp_data/{file_name}", mode="wb") as f:
            await f.write(await resp.content.read())


async def starter(name, m3u8_url):
    """Download every segment listed in the saved playlist concurrently."""
    # Segment names are relative to the playlist's directory, e.g.
    # https://mooc2vod.stu.126.net/nos/hls/2019/03/13/1214418097_fe38e0e942144b60bd5f16c4426b08ef_sd0.ts
    base_url = str(m3u8_url).rsplit("/", 1)[0]
    tasks = []
    async with aiohttp.ClientSession() as session:
        with open(f"temp_data/{name}.txt", "r") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                download_url = f"{base_url}/{line}"
                print("Download link:", download_url)
                tasks.append(download_ts(line, download_url, session))
        # asyncio.wait() rejects bare coroutines on Python 3.11+, so gather them instead;
        # failed downloads are reported later by verification()
        await asyncio.gather(*tasks, return_exceptions=True)
    print("Segment download finished")


def m3u8_files_download(url, name):
    """Save the m3u8 playlist to temp_data/<name>.txt."""
    resp = requests.get(url)
    with open(f"temp_data/{name}.txt", mode="wb") as f:
        f.write(resp.content)


def verification(name):
    """Report segments listed in the playlist that are missing on disk."""
    missing = []
    with open(f"temp_data/{name}.txt", "r") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if not os.path.exists(f"temp_data/{line}"):
                missing.append(line)
    if missing:
        print("The following files are missing, please check manually:", missing)


def merge_ts(file_name, merge_name):
    """Concatenate the segments in playlist order into <merge_name>.mp4."""
    # The segments are appended byte for byte, so the result is an MPEG-TS
    # stream saved with an .mp4 extension
    with open(f"./{merge_name}.mp4", "ab") as out:
        with open(f"temp_data/{file_name}.txt", "r") as playlist:
            for line in playlist:
                ts_name = line.strip().split("/")[-1]
                if not ts_name or line.startswith("#"):
                    continue
                try:
                    with open(f"temp_data/{ts_name}", "rb") as seg:
                        out.write(seg.read())
                except OSError:
                    continue  # missing segments were already reported


def m3u8_main(m3u8_urls, merge_names):
    print("Got the m3u8 links, starting the download tasks")
    for url, merge_name in zip(m3u8_urls, merge_names):
        name = url.rsplit("/", 1)[-1]
        m3u8_files_download(url, name)  # download the m3u8 file
        asyncio.run(starter(name, url))
        print("Verifying file integrity")
        verification(name)
        merge_ts(name, merge_name)


if __name__ == "__main__":
    cookie = init()
    header = {"cookie": cookie}  # shared by all request functions above
    csrfkey = CSRF_RE.findall(cookie)[0]
    print("Got csrfKey")
    biz_ids, video_ids = get_video_id(csrfkey)
    signatures = get_signature(biz_ids, csrfkey)
    m3u8_urls, merge_names = get_m3u8_url(video_ids, signatures)
    m3u8_main(m3u8_urls, merge_names)
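One fragile spot in the script is that it turns the chapter list into a string and scans it with regexes, so the bizId list and the videoId list can drift out of step if a unit has no video. A possible alternative, sketched below, walks the parsed JSON and keeps each unit's fields together. It relies only on the keys seen in the analysis (id, contentType, contentId). The "name" key on a unit, the "lessons" key in the sample data, and both function names are my assumptions, not something the captured traffic confirms.

```python
def iter_video_units(node):
    """Yield every dict under `node` that looks like a video unit (contentType == 1)."""
    if isinstance(node, dict):
        if node.get("contentType") == 1 and node.get("contentId") is not None:
            yield node
        for value in node.values():
            yield from iter_video_units(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_video_units(item)


def collect_videos(chapters):
    """Return (name, biz_id, video_id) tuples in document order."""
    return [(u.get("name"), u["id"], u["contentId"]) for u in iter_video_units(chapters)]


if __name__ == "__main__":
    # Hypothetical payload shape, for illustration only.
    sample = [{"name": "ch1", "lessons": [{"units": [
        {"id": 11, "contentType": 1, "contentId": 101, "name": "intro"},
        {"id": 12, "contentType": 3, "contentId": None, "name": "slides"},
    ]}]}]
    print(collect_videos(sample))
```

In the script this would be called on resp.json()["result"]["mocTermDto"]["chapters"], and each tuple's biz_id and video_id would then stay paired through the signature and video requests.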