How can I download a large file (around 10 GB) through a link???
It needs to be fast.
When writing crawlers you often need to download large files, videos for example. A plain sequential download-and-save is slow, so I decided to write a module that downloads large files with multiple synchronized threads.
Libraries used
The module only needs simple libraries: requests (every crawler author knows it), threading (multi-threading, obviously required), and os (for file operations).
Main difficulties
One is synchronizing the downloading threads; the other is resuming a download from where it left off after it was interrupted.
For these two problems, downloaders such as Xunlei (Thunder) long ago showed us a working solution: keep an extra file alongside the download path that tracks the download progress.
Implementation
The module contains two classes: one creates and updates the management file, the other is the thread task that issues the download requests and writes to the file.
Management file class
The file uses a simple "="-separated config format with four entries: the written byte ranges, the unwritten byte ranges, the currently-writing byte ranges, and the file's download URL. The hard problems mentioned above are all solved with this management file.
Config file
writing_range=[(42991616, 46137344)]
unwritten_range=[(46137344, 10633872234)]
written_range=[(0, 42991616)]
url_in_file=https:xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The main idea: when the download starts, create this file, fetch the size of the file to be downloaded, and record it as one big unwritten range. Worker threads then repeatedly take a small slice out of the unwritten ranges, move it into the writing ranges, request that slice, write it into the file, and finally move it into the written ranges. Because the downloaded bytes may end up as scattered, disjoint segments, each entry is a list of ranges, as shown in the config file above.
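The three-list bookkeeping described above can be sketched with plain Python lists. The helper names (take_chunk, mark_written) are illustrative, not the module's actual API:

```python
CHUNK = 1024 * 1024  # download granularity: 1 MiB per request

unwritten = [(0, 10_000_000)]  # bytes not yet requested
writing = []                   # bytes currently being downloaded
written = []                   # bytes already on disk

def take_chunk():
    """Move up to CHUNK bytes from unwritten to writing; return the range."""
    if not unwritten:
        return None
    start, end = unwritten[0]
    chunk = (start, min(start + CHUNK, end))
    # shrink or drop the first unwritten interval
    if chunk[1] == end:
        unwritten.pop(0)
    else:
        unwritten[0] = (chunk[1], end)
    writing.append(chunk)
    return chunk

def mark_written(chunk):
    """Move a finished range from writing to written."""
    writing.remove(chunk)
    written.append(chunk)

c = take_chunk()   # a worker claims the first 1 MiB
mark_written(c)    # ... downloads it, then records it as done
```

The real class additionally merges adjacent ranges and persists all three lists to the .tmp file after every change, so an interrupted run can be resumed.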
Splitting and merging the ranges (the main job of this class) actually takes a few small tricks. If you want to practice first, try https://leetcode.com/problems/insert-interval/
While writing it I had a strong feeling of déjà vu, and indeed I had solved a similar problem on LeetCode.
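For a feel of the interval manipulation involved, here is a compact, generic version of the "insert interval" operation in the spirit of the LeetCode problem linked above (a sketch, not the module's exact implementation):

```python
def insert_interval(intervals, new):
    """Insert `new` into a sorted list of disjoint intervals, merging overlaps."""
    result = []
    start, end = new
    for lo, hi in intervals:
        if hi < start:
            result.append((lo, hi))          # entirely left of the new interval
        elif lo > end:
            result.append((start, end))      # emit the merged interval
            start, end = lo, hi              # remaining intervals pass through
        else:
            start, end = min(start, lo), max(end, hi)  # overlap: widen
    result.append((start, end))
    return result

insert_interval([(1, 3), (6, 9)], (2, 5))    # -> [(1, 5), (6, 9)]
```

The module needs the inverse operation too (cutting a claimed range out of the unwritten list), which is the same idea with the overlap case split into left and right remainders.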
Multi-threaded download class
Nothing too complicated here: it mainly synchronizes the reads of the pending ranges, obtains the file size, and downloads a given Range. (Over HTTP(S), the file is downloaded with many requests; adding a Range header to each request specifies which part of the file to fetch, in the format Range: bytes=1024-2048.) To get the file size, first send a request like Range: bytes=0-0; if the server supports range requests (it usually does), the response carries a Content-Range header whose last part is the total size of the file.
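The size probe described above can be sketched as follows. The header parsing is standard; get_file_size assumes the server honors Range requests, as the text notes most do, and the function names are illustrative:

```python
def parse_total_size(content_range):
    """Extract the total size from a Content-Range header value,
    e.g. 'bytes 0-0/10633872234' -> 10633872234 (the part after the slash)."""
    return int(content_range.split('/')[-1])

def get_file_size(url, headers=None):
    """Ask the server for the first byte only and read the total size
    from the Content-Range header of the response."""
    import requests  # imported here so the parsing helper stays dependency-free
    h = dict(headers or {})       # copy: never mutate a shared headers dict
    h['Range'] = 'bytes=0-0'
    resp = requests.get(url, headers=h)
    return parse_total_size(resp.headers['Content-Range'])
```

Note that requests treats response header names case-insensitively, so 'Content-Range' and 'content-range' both work as lookup keys.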
For thread synchronization an ordinary lock (the threading.Lock class) is enough; both writing the file content and reading the progress config need it. Writing to a specific position in the file uses the file object's seek method, which moves the file position just like the file position pointer in C.
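The positioned write (seek plus a lock) can be sketched like this; the file path and helper name are illustrative:

```python
import os
import tempfile
import threading

file_lock = threading.Lock()

def write_at(path, offset, data):
    """Write `data` at byte `offset`; the lock serializes concurrent writers."""
    with file_lock:
        with open(path, 'rb+') as f:
            f.seek(offset)     # move the file position, like fseek() in C
            f.write(data)

# usage: pre-allocate the file, then write chunks in any order
path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
with open(path, 'wb') as f:
    f.truncate(10)             # reserve 10 bytes up front
write_at(path, 5, b'world')    # a later chunk can arrive first
write_at(path, 0, b'hello')    # file now contains b'helloworld'
```

Pre-allocating with truncate matters: 'rb+' requires the file to exist, and it lets out-of-order chunks land at their final offsets immediately.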
Oh, and on the outside there is also a supervising thread that creates these download threads.
Summary
With this, multi-threaded downloading of large files and resuming interrupted downloads are both implemented. If you are interested, try writing it yourself; it is quite fun. For reference: https://github.com/HBertram/python-grab
# -*- coding: utf-8 -*-
"""
Created on Mon Jul  1 16:47:38 2019

@author: Administrator
"""
# Multi-threaded, multi-task download of a single URL
import ast
import os
import threading
import time

import requests

DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/71.0.3559.6 Safari/537.36'
}


def download(url, filename, headers=None):
    t = DownloadWorkerThread(url, filename, headers=headers or DEFAULT_HEADERS)
    t.start()
    return t


# Manages one download: spawns the worker threads and tracks progress.
class DownloadWorkerThread(threading.Thread):
    thread_count = 5
    file_lock = threading.Lock()      # serializes writes to the target file
    fileinfo_lock = threading.Lock()  # serializes access to the progress file

    def __init__(self, url, filename, headers=None, thread_count=3):
        threading.Thread.__init__(self)
        self.filename = filename
        self.url = url
        self.fileinfo_name = filename + ".tmp"
        self.headers = headers or {}
        self.thread_count = thread_count

    def run(self):
        self.range_manager = self.read_range_file()
        print("Begin Downloading\nurl = " + self.url +
              "\nfilename = " + self.filename)
        if self.url.strip() == "":
            return
        tlst = []
        for _ in range(self.thread_count):
            t = threading.Thread(target=self.RangeWorker, args=(self,))
            print("Start Thread: " + t.name)
            t.daemon = True
            t.start()
            tlst.append(t)
        for t in tlst:
            t.join()

    def write_content(self, content, content_range):
        # write the chunk at its offset, then record it as written
        with self.file_lock:
            with open(self.filename, 'rb+') as f:
                f.seek(content_range[0])
                f.write(content)
        with self.fileinfo_lock:
            self.range_manager.set_written_range(content_range)

    def read_next_range(self):
        with self.fileinfo_lock:
            time.sleep(0.1)  # crude throttle so one thread cannot hog the lock
            return self.range_manager.get_unwritten_range()

    def read_range_file(self):
        with self.fileinfo_lock:
            if os.path.exists(self.fileinfo_name):
                # resume: restore the progress from the management file
                print("read filename " + self.fileinfo_name)
                manager = DownloadWorkerThread.FileInfoManager(
                    self.fileinfo_name, url=self.url)
                self.content_length = manager.get_total_length()
                if self.url.strip() == "":
                    self.url = manager.url_in_file
            else:
                # fresh download: probe the size and pre-allocate the file
                self.content_length = self.get_content_length()
                print("create filename_info length: " + str(self.content_length))
                with open(self.filename, "wb+") as f:
                    # truncate actually extends the file; seek() alone would not
                    f.truncate(self.content_length)
                manager = DownloadWorkerThread.FileInfoManager(
                    self.fileinfo_name, url=self.url,
                    filesize=self.content_length)
        return manager

    def get_content_length(self):
        headers = dict(self.headers)  # copy: never mutate the shared dict
        headers['Range'] = "bytes=0-1"
        length = 0
        # retry until the reported size looks plausible for a large file (> 3 MB)
        while length < 1024 * 1024 * 3:
            time.sleep(3)
            length = int(requests.get(self.url, headers=headers)
                         .headers['Content-Range'].split('/')[1])
        print("Get length " + str(length))
        return length

    def RangeWorker(self, downloadWorker):
        while True:
            content_range = downloadWorker.read_next_range()
            if content_range == 0:
                # nothing left: remove the progress file; only the first
                # finishing thread will still find it on disk
                try:
                    os.remove(self.fileinfo_name)
                except FileNotFoundError:
                    pass
                print(self.filename + " finished")
                break
            headers = dict(downloadWorker.headers)  # per-request copy
            headers['Range'] = ("bytes=" + str(content_range[0]) + "-" +
                                str(content_range[1] - 1))
            iTryTimes = 0  # must live outside the retry loop, or it never grows
            while True:
                r = requests.get(downloadWorker.url, headers=headers)
                if r.ok:
                    downloadWorker.write_content(r.content, content_range)
                    print("We are working on " + self.filename +
                          " and now processing: " +
                          str(round(1.0 * content_range[1] /
                                    self.content_length * 100, 2)) +
                          "% in size " +
                          str(round(self.content_length / 1024.0 / 1024.0, 2)) +
                          "MB.")
                    break
                iTryTimes += 1
                if iTryTimes > 1:
                    print("Downloading " + downloadWorker.url +
                          " error. Now Exit Thread.")
                    return

    # Persists the download progress as "="-separated lines in a .tmp file.
    class FileInfoManager():

        def __init__(self, filename, url="", filesize=0):
            self.filename = filename
            self.url_in_file = ""
            # instance attributes, not class-level lists shared by all instances
            self.writing_range = []
            self.written_range = []
            self.unwritten_range = []
            if not os.path.exists(filename):
                with open(filename, "w") as f:
                    f.write("unwritten_range=[(0, " + str(filesize) + ")]\n")
                    f.write("writing_range=[]\n")
                    f.write("written_range=[]\n")
                    f.write("url_in_file=" + url)
                self.unwritten_range.append((0, filesize))
                self.url_in_file = url
            else:
                with open(filename, "r") as f:
                    for l in f.readlines():
                        typ, body = l.split("=", 1)
                        body = body.strip()
                        if typ == "writing_range":
                            # ranges that were mid-download when the run was
                            # interrupted must be downloaded again
                            typ = "unwritten_range"
                        elif typ == "url_in_file":
                            self.url_in_file = url if url.strip() else body
                            continue
                        getattr(self, typ).extend(ast.literal_eval(body))

        def get_total_length(self):
            if len(self.unwritten_range) > 0:
                return self.unwritten_range[-1][1]
            elif len(self.writing_range) > 0:
                return self.writing_range[-1][1]
            elif len(self.written_range) > 0:
                return self.written_range[-1][1]
            return 0

        def _save_to_file(self):
            with open(self.filename, "w") as f:
                f.write("writing_range=" + str(self.writing_range) + "\n")
                f.write("unwritten_range=" + str(self.unwritten_range) + "\n")
                f.write("written_range=" + str(self.written_range) + "\n")
                f.write("url_in_file=" + self.url_in_file)

        def _splice(self, intervals, newInterval):
            """Remove newInterval from the interval list."""
            if len(intervals) == 0:
                return []
            intervals = self._concat(intervals, (0, 0))  # sort/normalize
            response = []
            for interval in intervals:
                if interval[0] == interval[1]:
                    continue
                if interval[0] > newInterval[1] or interval[1] < newInterval[0]:
                    response.append(interval)  # no overlap: keep as-is
                else:
                    max_range = (min(interval[0], newInterval[0]),
                                 max(interval[1], newInterval[1]))
                    if max_range != newInterval:
                        # keep the pieces left and right of newInterval
                        left = (min(max_range[0], newInterval[0]),
                                max(max_range[0], newInterval[0]))
                        right = (min(max_range[1], newInterval[1]),
                                 max(max_range[1], newInterval[1]))
                        if left[0] != left[1]:
                            response.append(left)
                        if right[0] != right[1]:
                            response.append(right)
            return response

        def _concat(self, intervals, newInterval):
            """Insert newInterval into the list, merging overlaps."""
            if len(intervals) == 0:
                return [newInterval]
            response = [newInterval]
            for interval in intervals:
                i = response.pop()
                if interval[0] == interval[1]:
                    response.append(i)  # skip empty intervals, keep the tail
                    continue
                if i[0] > interval[1]:
                    response.append(interval)
                    response.append(i)
                elif i[1] < interval[0]:
                    response.append(i)
                    response.append(interval)
                else:
                    response.append((min(i[0], interval[0]),
                                     max(i[1], interval[1])))
            return response

        def get_unwritten_range(self, size=1024 * 1024):
            """Take up to `size` bytes from unwritten and mark them writing."""
            if len(self.unwritten_range) == 0:
                return 0
            r = self.unwritten_range[0]
            r = (r[0], min(r[0] + size, r[1]))
            self.unwritten_range = self._splice(self.unwritten_range, r)
            self.writing_range = self._concat(self.writing_range, r)
            self._save_to_file()
            return r

        def set_written_range(self, content_range):
            self.writing_range = self._splice(self.writing_range, content_range)
            self.written_range = self._concat(self.written_range, content_range)
            self._save_to_file()


# Example:
# t = DownloadWorkerThread(r'http://a3.kuaihou.com/ruanjian/ucdnb.zip',
#                          'd:\\ucdnb.zip', headers=DEFAULT_HEADERS)
# t.start()

if __name__ == '__main__':
    url = input("The URL Waiting for downloading:")
    filename = input("The Filepath to save:")
    t = download(url, filename)
    while t.is_alive():
        time.sleep(60)
    print("bye")
Link: https://www.jianshu.com/p/e0f42bd3a3ea
Source: Jianshu
The copyright of Jianshu articles belongs to their authors; for any form of reproduction, please contact the author for authorization and credit the source.