//20200115
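Recently I have been watching We Bare Bears (咱們裸熊). I finished seasons 1 and 3, but season 2 was nowhere to be found: only Tencent has it, and that requires VIP... and it is the Mandarin dub at that... So I set my sights on a video site where it can be watched online. (I like to keep good animation, so I wanted to download it. I found downloads for seasons 1 and 3, and even season 4, just not season 2...)
I also happen to be learning Python at the moment (as groundwork for big data), so I decided to scrape the videos. Here is the process: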
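First, press F12 to inspect the page and check whether the video link is embedded directly (just in case the webmaster really was that careless~). It was not.
Next I opened the Network tab to capture the requests and found links ending in .m3u8. I opened them: there are two layers, and the second layer is a large block of formatted data.
The remaining requests were all .ts files. Comparing the .ts links with the content of the second m3u8 response showed that they match exactly; the real links just have to be built by string concatenation. With the approach confirmed, I got started. (This is only a basic scraper with no threading yet. A thread pool would be faster, since there are many .ts files. It uses no proxies either, because the data volume is still small and I throttle the requests manually.)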
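The matching step above can be sketched roughly as follows. This is a minimal sketch, not the code I actually ran: it assumes the second m3u8 is an ordinary playlist in which every line that does not start with '#' names a segment relative to the playlist's own URL. The function name ts_urls_from_playlist, the sample playlist and the example.com URL are all made up for illustration.

```python
from urllib.parse import urljoin

def ts_urls_from_playlist(playlist_text, playlist_url):
    """Return absolute segment URLs listed in an m3u8 playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; blank lines carry nothing.
        if not line or line.startswith("#"):
            continue
        # Resolve the segment name against the playlist's own URL.
        urls.append(urljoin(playlist_url, line))
    return urls

# Hypothetical playlist and URL, for illustration only.
sample = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg000.ts
#EXTINF:8.5,
seg001.ts
#EXT-X-ENDLIST
"""
segment_urls = ts_urls_from_playlist(
    sample, "https://example.com/video/1000k/hls/index.m3u8")
print(segment_urls)
```

Using urljoin instead of hand-written concatenation also copes with playlists that list absolute URLs, which some sites do.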
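To lay out the approach:
1. Get the link to each episode from the source of the playback page and store them in a list for later (these links are visible in the page).
2. Fetch the source of the page behind each link. It contains a ckplayer div block, which holds the link leading to the first-layer m3u8.
3. Use BeautifulSoup to extract that link (it returns JSON; parse it with the json package to get the first-layer link).
4. Request that link to obtain the second m3u8 link (string concatenation is needed here).
5. Request the second link to get the .ts segment addresses (these also need concatenation; store the finished links in a list for later).
6. Download each .ts file with a binary file stream and save it in the matching folder.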
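The JSON step and the concatenation that follows it can be sketched like this. It is a minimal sketch under assumptions: the response is taken to be a JSON object wrapped in some other text (which is why the source code slices between the braces), the object is assumed to have a "url" key, and the "/1000k/hls" and "/index.m3u8" parts are the ones used on this particular site. The sample string and the example.com address are invented.

```python
import json

def extract_json_object(text):
    """Parse the JSON object embedded in a larger piece of text."""
    start = text.find("{")
    end = text.rfind("}")  # last brace, so nested objects are kept whole
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])

# Hypothetical response body, for illustration only.
sample = 'var player_data = {"url": "https://example.com/video/index.m3u8", "from": "demo"};'
data = extract_json_object(sample)

first_url = data["url"]
base = first_url[:first_url.rfind("/")]           # drop the last path component
second_url = base + "/1000k/hls" + "/index.m3u8"  # site-specific suffixes
print(second_url)
```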
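After that it is just a matter of waiting for the downloads to finish. Because the files are so fragmented, it takes a long time... A thread pool could improve this (I will get to it once I have finished the big data basics, no rush).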
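For reference, a thread-pool version might look roughly like the sketch below. I have not run this against the site. The names download_all and fake_fetch are made up, and fake_fetch only stands in for a real download function (for example one that wraps requests.get and returns the response content) so that the sketch runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a pool of worker threads.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

def fake_fetch(url):
    # Stand-in for a real download; returns bytes derived from the URL.
    return ("data from " + url).encode()

# Hypothetical segment URLs, for illustration only.
urls = ["https://example.com/seg%03d.ts" % n for n in range(5)]
chunks = download_all(urls, fake_fetch)
print(len(chunks))
```

Because pool.map keeps the input order, each result can be written out as 000.ts, 001.ts and so on, just like in the sequential version. The number of workers should stay small to avoid hammering the server.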
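Then, in each .ts folder, use the Windows command copy /b *.ts video.mp4 to merge the .ts files into a single mp4 file. This could be embedded in the Python code, but I have no batch-file background, so I just did it by hand; it is not much trouble. (Done!)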
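The merge can also be done in plain Python without any batch scripting. The sketch below is one possible way, assuming the segments are named 000.ts, 001.ts, ... as the downloader writes them, so that sorting by name gives playback order. The function name merge_ts is made up, and the demo uses a temporary folder with tiny fake segments.

```python
import shutil
import tempfile
from pathlib import Path

def merge_ts(folder, output_name="video.mp4"):
    """Concatenate every .ts file in folder, in name order, into one file."""
    folder = Path(folder)
    parts = sorted(folder.glob("*.ts"))  # zero-padded names sort correctly
    out_path = folder / output_name
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # append raw bytes
    return out_path

# Demo with three tiny fake segments in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for n, payload in enumerate([b"aa", b"bb", b"cc"]):
        (Path(tmp) / ("%03d.ts" % n)).write_bytes(payload)
    merged_bytes = merge_ts(tmp).read_bytes()
print(merged_bytes)
```

Like copy /b, this only joins the bytes together: the result is still a transport stream that happens to carry an .mp4 name.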
Source code:
#20200115
import json
import os
import time

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36'}
location = 'https://www.****.cc'
mid = "/1000k/hls"
last = "/index.m3u8"
# Relative paths of the 22 episode pages: player-1-1.html ... player-1-22.html
url_pool = ['/dianshiju/20740/player-1-%d.html' % n for n in range(1, 23)]
len1 = len(url_pool)


def get_json_url(soup):
    # The player div contains a <script> whose src points at the JSON data.
    url = soup.find("div", id="iFrame_play").script.get('src')
    return location + url


def get_first_url(json_url):
    # The response wraps a JSON object in other text; cut out the {...} part.
    r2 = requests.get(json_url, headers=headers, timeout=10).text
    dic = json.loads(r2[r2.find('{'):r2.find('}') + 1])
    return dic['url']


def get_real_m3u8_url(url):
    # Drop the last path component and append the quality directory.
    index_of_last = url.rfind('/')
    the_forward = url[:index_of_last]
    return the_forward + mid


def get_the_ts_pack(url):
    # Split the playlist on '#', so each chunk is one tag plus what follows it.
    r3 = requests.get(url, headers=headers, timeout=10).text
    list_of_ts = r3.split('#')
    return list_of_ts


def get_each_ts_url(the_ts_pack, the_real_m38u_url):
    # Replace each chunk in place with the full URL of its .ts segment,
    # which sits on the line after the tag.
    for i in range(len(the_ts_pack)):
        suffix = the_ts_pack[i].split('\n')[1]
        the_ts_pack[i] = the_real_m38u_url + "/" + suffix


def mission(url, n, group):
    # Download one segment and save it as ./<group>/<nnn>.ts
    print('*****')
    response = requests.get(url, headers=headers, timeout=10)
    print('-----')
    os.makedirs(str(group), exist_ok=True)  # make sure the folder exists
    with open("./" + str(group) + "/%03d.ts" % n, "wb") as f:
        f.write(response.content)
    print("%03d.ts OK..." % n)


def download(the_ts_pack, group):
    count = 0  # consecutive failures for the current segment
    i = 0
    while i < len(the_ts_pack):
        try:
            mission(the_ts_pack[i], i, group)
        except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout):
            # Wait, then retry the same segment.
            count += 1
            print("Wait #" + str(count))
            time.sleep(5)
        else:
            count = 0
            i += 1
            time.sleep(0.5)


# for i in range(0, len1):    # all episodes
for i in range(12, 22):       # only the episodes still missing
    completed_link = location + url_pool[i]
    r1 = requests.get(completed_link, headers=headers, timeout=10)
    soup = BeautifulSoup(r1.text, "lxml")
    json_url = get_json_url(soup)
    time.sleep(0.1)
    the_first_mu38_url = get_first_url(json_url)
    time.sleep(0.1)
    the_real_m38u_url = get_real_m3u8_url(the_first_mu38_url)
    # Skip the header chunks at the start and the end marker at the end.
    the_ts_pack = get_the_ts_pack(the_real_m38u_url + last)[5:-1]
    get_each_ts_url(the_ts_pack, the_real_m38u_url)
    print(the_ts_pack)
    download(the_ts_pack, i)
    print("ts group " + str(i) + " finished downloading")
    time.sleep(10)
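Note: the videos are copyrighted, so I am not publishing the site address. What matters is the approach; every site is different and has to be analyzed from scratch.
Will be removed on request if this infringes!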
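About Python's exception mechanism:
1. A try-except block: if an exception occurs, the rest of the try body is skipped and the code in except runs. Inside a loop, that means the current item is simply skipped unless you retry it yourself (clearly not what is needed here, because every segment has to be downloaded, so option 2 is used).
2. A try-except-else block: if an exception occurs, the code in except runs; if none occurs, the code in else runs after the try body has finished.
Also: except can name a single exception or several (for several, use a tuple of the form "(.., .., ..)"; if you do not know which exception to expect, you can catch Exception).
While the code runs, the server sometimes fails to return anything, so exceptions are needed to handle that. Otherwise, if you had to step in by hand every time, it could hardly be called automation~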
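To make the try-except-else pattern concrete, here is a small self-contained sketch of a retry loop. It is not the download function from the source code: the names fetch_with_retry and flaky_fetch are made up, the built-in ConnectionError and TimeoutError stand in for the requests exceptions used above, and unlike my script it gives up after a fixed number of attempts instead of retrying forever.

```python
import time

def fetch_with_retry(fetch, url, max_attempts=5, wait_seconds=0.0):
    """Call fetch(url), retrying when the connection fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            data = fetch(url)
        except (ConnectionError, TimeoutError) as exc:
            # Several exception types are caught with one tuple.
            print("attempt %d failed: %s" % (attempt, exc))
            time.sleep(wait_seconds)
        else:
            # Runs only when the try body raised nothing.
            return data
    raise RuntimeError("gave up on %s after %d attempts" % (url, max_attempts))

calls = {"n": 0}

def flaky_fetch(url):
    # Simulated server: fails twice, then answers.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server did not answer")
    return b"segment bytes"

result = fetch_with_retry(flaky_fetch, "https://example.com/seg000.ts")
print(result, calls["n"])
```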
I hope this helps.
That's all.
