[python] Downloading videos from China University MOOC (中國大學MOOC)

Reposted; see the original article. 2022-01-18 22:32


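Goal of the script:

    Given a course id and a cookie, download every video file of the course so they are handy to watch when reviewing.

Analysis of the site's anti-scraping measures:

    Purpose of inspecting the traffic: find the path that serves the m3u8 file.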

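    1. From the very first step of inspecting the traffic it is clear that the developers built in anti-scraping measures from the start: as soon as the developer tools are opened, the page gets stuck in an endless loop on a debugger statement. For how this is written and why it works, see the article 【如何防止頁面被調試_小敏哥的專欄-CSDN博客_網頁禁止調試】. Deactivating breakpoints is enough to continue debugging and to view the requests in the Network panel.

    2. Search for the keyword m3u8 to get the link that downloads the m3u8 file: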

https://mooc2vod.stu.126.net/nos/hls/2019/03/13/1214418097_fe38e0e942144bxxxxxxxxxxxx8ef_sd.m3u8?ak=7909bff134372bffca53cdc2c17adc27a4c38c6336120510aea1ae1790819de820f66b1081b2dbb1d6300ca9e91c8b349a14ab1e5b4e06c0887fe54fe47de9823059f726dc7bb86b92adbc3d5b34b1320647b25cf54eb8ac6ed1f0d7db7826b19bb0a5ea14ff29775bd482caa79ccf8b

    Simplified, this becomes

https://mooc2vod.stu.126.net/nos/hls/2019/03/13/1214418097_fe38e0e942144bxxxxxxxxxxxx8ef_sd.m3u8

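    Download it with a GET request and look inside: the ts links in it are relative and have to be joined onto the base URL, and the files are not encrypted, so sending requests with the requests library is enough to download them. The next step is to find out how to construct the request that yields this link.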

    3. Inspecting the traffic shows that the m3u8 download link appears in the response of another link.

The link is: https://vod.study.163.com/eds/api/v1/vod/video?videoId=1215086738&signature=75584967794f373450387775426e39512b565a6d4a4643384a366b3743766e556474475a7454466252672b6c5252306a744e585a61365031766269547650504454724750544c683252715a475876585962727a536f7431736a596d4f7843593577306d714654534864385873446e5470397a5675537a7233486836336d7a4355576e506745582b47762f2b37574b6972745365744d673d3d&clientType=1

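    The link is built from two variables, videoId and signature. The two must correspond to each other, otherwise the request fails. The traffic is also carried over HTTP/2, which effectively rules out the requests library. Next we need to find videoId and its matching signature.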

    4. Inspecting the traffic shows that the signature value is contained in the response of the URL below. It has to be accessed over HTTP/2, and the request must be a POST with the body "bizId=1268148xxx&bizType=1&contentType=1":

https://www.icourse163.org/web/j/resourceRpcBean.getResourceToken.rpc?csrfKey=7d56f95de17346eeb0e0fd1d2d6b5751

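    We now need to find the values of bizId and csrfKey.

    5. Inspecting the traffic shows that bizId, the name of each lesson and the videoId (contentId) of each lesson are stored in the response of the URL below. It has to be accessed over HTTP/2, and the request must be a POST with the body "termId: 1465388xxx":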

https://www.icourse163.org/web/j/courseBean.getLastLearnedMocTermDto.rpc?csrfKey=7d56f95de17346eeb0e0fd1d2d6b575

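    The submitted value is the same as the value after tid in the course link, so the only variable left to fill in is csrfKey.

    6. csrfKey is stored in the NTESSTUDYSI= entry of the cookie, so the cookie alone yields csrfKey. At this point the cookie and the course id are enough to get the m3u8 download link of every lesson.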
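
Step 6 can be tried in isolation. The following is a minimal sketch, assuming the cookie is the raw Cookie header string copied from the browser. The helper name extract_csrf_key and the sample cookie are hypothetical; only the NTESSTUDYSI entry name comes from the analysis above. Splitting on ";" rather than using a regex also covers the case where NTESSTUDYSI is the last entry and has no trailing semicolon.

```python
def extract_csrf_key(cookie_header):
    """Return the NTESSTUDYSI value from a raw Cookie header string, or None."""
    for part in cookie_header.split(";"):
        # Each part looks like "name=value"; partition keeps any "=" inside the value
        name, sep, value = part.strip().partition("=")
        if sep and name == "NTESSTUDYSI":
            return value
    return None


# Made-up cookie string, for illustration only
sample = "foo=abc; NTESSTUDYSI=7d56f95de17346eeb0e0fd1d2d6b5751; bar=1"
print(extract_csrf_key(sample))
```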

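What we know so far:

    1. The site's anti-scraping measures

    2. The video ts files are not encrypted

    3. The site uses the HTTP/2 protocol, so httpx is used in place of requests

    4. Once the m3u8 download link is in hand, pass it straight to the m3u8 downloader written in the previous post; adapt it a little and splice the two together, which is good enough as long as it works

    5. Although debugging shows the message "非法的跨域請求" (illegal cross-origin request), the server does not check the referer value, only the cookie value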
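
Items 2 and 4 rest on the m3u8 playlist listing plain ts segments relative to the playlist's own URL. The following is a minimal sketch of that joining step, assuming an unencrypted playlist whose non-comment lines are segment names. The function name segment_urls, the host example.com and the sample playlist are hypothetical and not taken from the site.

```python
from urllib.parse import urljoin


def segment_urls(m3u8_url, playlist_text):
    """Resolve every segment line of an m3u8 playlist against the playlist URL."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with "#" are tags or comments, not segments
        if not line or line.startswith("#"):
            continue
        urls.append(urljoin(m3u8_url, line))
    return urls


# Made-up playlist, for illustration only
playlist = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
video_sd0.ts
#EXTINF:8.5,
video_sd1.ts
#EXT-X-ENDLIST
"""
for url in segment_urls("https://example.com/hls/2019/video_sd.m3u8", playlist):
    print(url)
```

urljoin also handles playlists that list absolute segment URLs, which plain string concatenation would break.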

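Approach of the script:

    1. Get your own cookie and extract NTESSTUDYSI= from it to obtain csrfKey

    2. Over HTTP/2, send a POST to "https://www.icourse163.org/web/j/courseBean.getLastLearnedMocTermDto.rpc?csrfKey=csrfKey" with the POST body termId: tid. The response contains the name of each lesson, the videoId (contentId) of each lesson and bizId (id)

    3. Over HTTP/2, send a POST to "https://www.icourse163.org/web/j/resourceRpcBean.getResourceToken.rpc?csrfKey=csrfKey" (this endpoint is subject to the cross-origin restriction) with the POST body bizId=bizId&bizType=1&contentType=1. The response contains signature

    4. Over HTTP/2, send a GET to https://vod.study.163.com/eds/api/v1/vod/video?videoId=videoId&signature=signature&clientType=1 to obtain videoUrl, which is an m3u8 download address. Delete everything from the question mark onward and the file can be downloaded by accessing the URL directly

    5. Splice in the script written last time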
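
The request chain in steps 2 to 4 can be sketched as pure functions that only build URLs and form data, without sending anything. The endpoints and field names are the ones listed above; the function names (term_request, token_request, video_info_url, strip_query) are hypothetical, and the values in the usage lines are placeholders.

```python
from urllib.parse import urlencode, urlsplit

BASE = "https://www.icourse163.org/web/j"


def term_request(csrf_key, tid):
    """Step 2: URL and form data for the course outline request."""
    url = f"{BASE}/courseBean.getLastLearnedMocTermDto.rpc?csrfKey={csrf_key}"
    return url, {"termId": str(tid)}


def token_request(csrf_key, biz_id):
    """Step 3: URL and form data for the signature request of one lesson."""
    url = f"{BASE}/resourceRpcBean.getResourceToken.rpc?csrfKey={csrf_key}"
    return url, {"bizId": str(biz_id), "bizType": "1", "contentType": "1"}


def video_info_url(video_id, signature):
    """Step 4: GET URL whose response contains videoUrl."""
    query = urlencode({"videoId": video_id, "signature": signature, "clientType": 1})
    return f"https://vod.study.163.com/eds/api/v1/vod/video?{query}"


def strip_query(video_url):
    """Drop everything from the question mark onward, as step 4 describes."""
    return urlsplit(video_url)._replace(query="", fragment="").geturl()


# Placeholder values, for illustration only
print(term_request("CSRF", "1465388000"))
print(token_request("CSRF", "1268148000"))
print(video_info_url("1215086738", "abcdef"))
print(strip_query("https://mooc2vod.stu.126.net/nos/hls/x_sd.m3u8?ak=123"))
```

Keeping the URL construction separate from the HTTP client makes it easy to check each request against what the browser's Network panel shows before sending it with httpx.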

Final working code:

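import os
import re
import httpx
import time
import requests
import aiohttp
import asyncio
import aiofiles

# csrfKey: value of NTESSTUDYSI in the cookie (also matches when it is the last entry)
obj = re.compile(r"NTESSTUDYSI=(?P<csrfkey>[^;\s]*)")
# id of the first entry in each 'units' list, used as bizId
obj2 = re.compile(r".*?'units': \[{'id':(?P<id>.*?),")
# contentId of each unit with contentType 1, used as videoId
obj3 = re.compile(r".*?'contentType': 1, 'contentId':(?P<content_id>.*?),")
obj4 = re.compile(r'.*?"signature":"(.*?)",')
# m3u8 address with the ?ak=... query string cut off
obj5 = re.compile(r'.*?"videoUrl":"(?P<m3u8>.*?)\?ak')
obj6 = re.compile(r'.*?"name":"(.*?)",')
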
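def init():
    # Working directory for the playlists and ts segments
    if not os.path.exists("./temp_data"):
        os.mkdir("./temp_data")
    # Ask for the cookie on the first run and cache it in cookie.txt
    if not os.path.exists("./cookie.txt"):
        cookie = str(input("No cookie found, please enter your cookie>"))
        with open("./cookie.txt", "w") as f:
            f.write(f"{cookie}")
    with open("cookie.txt", "r") as f:
        # Strip a trailing newline, which is not valid inside a header value
        cookie = f.read().strip()
    return cookie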

def get_video_id(csrfkey):
    tid = str(input("Enter the course tid>"))
    data = {"termId": f"{tid}"}
    # `header` is the module-level dict built in the __main__ block
    with httpx.Client(http2=True) as client:
        resp = client.post(f"https://www.icourse163.org/web/j/courseBean.getLastLearnedMocTermDto.rpc?csrfKey={csrfkey}",headers=header, data=data)
        result = resp.json()  # not named `json`, to avoid confusion with the json module
    shu_ju = str(result["result"]["mocTermDto"]["chapters"])
    # The regexes run over the str() of the chapter list, so the matches need cleaning up
    bizld = str(obj2.findall(shu_ju)).replace(" ","").replace("[","").replace("]","").replace("'","").split(",")
    video_id = str(obj3.findall(shu_ju)).replace("' None', ","").replace(" ","").replace("'","").replace("[","").replace("]","").split(",")
    print("Got bizId and video_id")
    return bizld,video_id

def get_signature(bizlds,csrfkey):
    signatures=[]
    with httpx.Client(http2=True) as client:
        # One token request per lesson; the response carries the signature
        for bizld in bizlds:
            data = {"bizId":f"{bizld}","bizType":"1","contentType":"1"}
            resp=client.post(f"https://www.icourse163.org/web/j/resourceRpcBean.getResourceToken.rpc?csrfKey={csrfkey}",headers=header,data=data)
            signatures += obj4.findall(resp.text)
    print("Got signatures")
    return signatures

def get_m3u8_url(video_ids,signatures):
    m3u8_urls = []
    merge_name = []
    with httpx.Client(http2=True) as client:
        for video_id,signature in zip(video_ids,signatures):
            resp = client.get(f"https://vod.study.163.com/eds/api/v1/vod/video?videoId={video_id}&signature={signature}&clientType=1",headers=header)
            time.sleep(2)  # pause between requests
            # Name used for the merged mp4 file
            merge_name.append(str(obj6.findall(resp.text)).replace("[", "").replace("]", "").replace("'", ""))
            m3u8_url = obj5.findall(resp.text)[0]
            m3u8_urls.append(m3u8_url)
    print("Names of the merged files:",merge_name)
    return m3u8_urls,merge_name

async def download_ts(file_name,download_url,session):
    # Download a single ts segment into temp_data/
    async with session.get(download_url,headers=header) as resp:
        async with aiofiles.open(f"temp_data/{file_name}",mode="wb") as f:
            await f.write(await resp.content.read())

async def starter(name,m3u8_url):
    tasks=[]
    async with aiohttp.ClientSession() as session:
        # Segment URLs look like https://mooc2vod.stu.126.net/nos/hls/2019/03/13/1214418097_fe38e0e942144b60bd5f16c4426b08ef_sd0.ts
        url = str(m3u8_url).rsplit("/",1)[0]
        with open(f"temp_data/{name}.txt", "r") as f:
            for line in f:
                if line.startswith("#"):
                    continue
                else:
                    line = line.strip()
                    file_name = line  # name of the ts file to download
                    download_url = url + "/" + line
                    print("Download link:",download_url)
                    task = download_ts(file_name, download_url, session)
                    tasks.append(task)
        # gather accepts bare coroutines; asyncio.wait rejects them since Python 3.11
        await asyncio.gather(*tasks)
        print("Files downloaded")

def m3u8_files_download(url,name):   # download the m3u8 file
    resp = requests.get(url)
    with open(f"temp_data/{name}.txt",mode="wb") as f:
        f.write(resp.content)
    resp.close()

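def verification(name):
    # Collect every segment listed in the playlist that is missing on disk
    files=[]
    with open(f"temp_data/{name}.txt","r") as f:
        for line in f:
            if line.startswith("#"):
                continue
            else:
                line=line.strip()
                if os.path.exists(f"temp_data/{line}"):
                    continue
                else:
                    files.append(line)
    if files:
        print("The following files are missing, please check manually:",files)
    else:
        print("All segments are present")
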
def merge_ts(file_name,merge_name):
    new_name = str(merge_name)
    # Append the segments to the mp4 in playlist order
    with open(f"./{new_name}.mp4", "ab+") as f:
        with open(f"temp_data/{file_name}.txt","r") as f2:
            for line in f2:
                if line.startswith("#"):
                    continue
                else:
                    line = line.strip().split("/")[-1].strip()
                    ts_name = line
                    try:
                        with open(f"temp_data/{ts_name}","rb") as f3:
                            f.write(f3.read())
                    except FileNotFoundError:
                        # Missing segments were already reported by verification()
                        continue

def m3u8_main(m3u8_urls,merge_names):
    print("Got the m3u8 links, starting the download tasks")
    for url,merge_name in zip(m3u8_urls,merge_names):
        name = url.rsplit("/")[-1]
        m3u8_files_download(url, name)  # download the m3u8 file
        asyncio.run(starter(name,url))
        print("Verifying file integrity")
        verification(name)
        merge_ts(name,merge_name)


if __name__=="__main__":
    cookie = init()
    # Module-level request header shared by all functions above
    header = {"cookie": f"{cookie}"}
    csrfkey = obj.findall(cookie)[0]
    print("Got csrfkey")
    bizid,video_id= get_video_id(csrfkey)
    signatures = get_signature(bizid,csrfkey)
    m3u8_urls,merge_name = get_m3u8_url(video_id,signatures)
    m3u8_main(m3u8_urls,merge_name)
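
The script pulls bizId and videoId out of the str() of the chapter list with two separate regexes and pairs them by position, which can drift if a units list holds more than one entry. As an alternative, the following is a minimal sketch that walks the decoded JSON instead. It assumes each unit is a dict with the keys id, contentType and contentId, as the regexes in the script suggest; the name key, the nesting in the sample data and the function name collect_video_units are assumptions and are not confirmed by the site's responses.

```python
def collect_video_units(node):
    """Recursively gather (name, bizId, videoId) for every unit with contentType 1."""
    found = []
    if isinstance(node, dict):
        # A video unit: contentType 1 with a usable contentId
        if node.get("contentType") == 1 and node.get("contentId") is not None:
            found.append((node.get("name"), node.get("id"), node["contentId"]))
        for value in node.values():
            found.extend(collect_video_units(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_video_units(item))
    return found


# Made-up data shaped like the assumed response, for illustration only
chapters = [
    {"name": "Chapter 1", "lessons": [
        {"name": "Lesson 1", "units": [
            {"id": 101, "name": "Video A", "contentType": 1, "contentId": 9001},
            {"id": 102, "name": "Slides", "contentType": 3, "contentId": 9002},
        ]},
    ]},
]
for name, biz_id, video_id in collect_video_units(chapters):
    print(name, biz_id, video_id)
```

Because each tuple comes from a single dict, the bizId and videoId of a lesson stay together, and units without a contentId are skipped instead of being filtered out of a string afterwards.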
