How do I download a very large file in Python?


How can I download a large file, around 10 GB, from a link?

It needs to be fast.

 

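When writing crawlers I often need to download large files, such as videos. A plain single-request download that saves straight to disk is slow, so I wanted to write a module that downloads a large file with several threads working in sync.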

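Libraries used

The libraries used in the module are all simple: requests (anyone who writes crawlers knows it), threading (for multithreading, a must), and os (needed for file operations).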


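Main difficulties

The first is synchronizing the download threads. The second is resuming from where the download left off after it has been interrupted.

For both problems, I think download managers such as Xunlei showed us a solution long ago: keep a file next to the download that tracks its progress.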

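Implementation

The module has two classes. One creates and updates the management file; the other is the thread task, which sends the download requests, writes to the file, and so on.

Management-file class

The file uses a very simple configuration format with one key=value pair per line. It holds four entries: the ranges of bytes already written, the ranges of bytes not yet written, the ranges of bytes currently being written, and the URL of the download. The difficulties mentioned earlier are both handled through this management file.

Configuration file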

writing_range=[(42991616, 46137344)]

unwritten_range=[(46137344, 10633872234)]

written_range=[(0, 42991616)]

url_in_file=https:xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
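A minimal sketch of reading this format back, assuming every line is a `key=value` pair and the range lists are valid Python literals, as in the sample above. The function name `parse_progress_text` and the example.com URL are made up for illustration; the module itself parses the lines by hand.

```python
import ast


def parse_progress_text(text):
    """Parse 'key=value' lines into a dict (hypothetical helper)."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, because a URL may contain '=' too.
        key, value = line.split("=", 1)
        if key.endswith("_range"):
            # The lists look like Python literals, so literal_eval can
            # read them without executing arbitrary code.
            info[key] = ast.literal_eval(value)
        else:
            info[key] = value
    return info


# Sample input; the URL is a placeholder, not the one from the article.
sample = """writing_range=[(42991616, 46137344)]
unwritten_range=[(46137344, 10633872234)]
written_range=[(0, 42991616)]
url_in_file=https://example.com/file?id=1"""

info = parse_progress_text(sample)
print(info["written_range"])
print(info["url_in_file"])
```

Splitting on the first `=` only matters in practice, since download links often carry query strings.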

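The main idea: when a download starts, create this file, get the size of the file to be downloaded, and record it under unwritten_range. The worker threads then repeatedly, and in sync, take a small piece out of unwritten_range, move it to writing_range, request those bytes, and write them to the file. Once a piece has been written, it is moved to written_range. Because the downloaded bytes may form several separate segments, each entry is stored as a list of (start, end) ranges, as in the example above.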
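The life cycle of one piece can be sketched as follows. This is a simplified illustration, not the module's code: `take_chunk` and `finish_chunk` are hypothetical names, there is no locking, and adjacent written ranges are not merged here.

```python
def take_chunk(unwritten, writing, size=1024 * 1024):
    """Move up to `size` bytes from the front of `unwritten` into `writing`."""
    if not unwritten:
        return None  # nothing left to hand out
    start, end = unwritten[0]
    chunk = (start, min(start + size, end))
    if chunk[1] == end:
        unwritten.pop(0)                # the first range is used up
    else:
        unwritten[0] = (chunk[1], end)  # shrink the first range
    writing.append(chunk)
    return chunk


def finish_chunk(writing, written, chunk):
    """Record a downloaded chunk: remove it from `writing`, add it to `written`."""
    writing.remove(chunk)
    written.append(chunk)
    written.sort()


# A 2,500,000-byte file handed out in 1 MiB pieces.
unwritten, writing, written = [(0, 2_500_000)], [], []
while True:
    chunk = take_chunk(unwritten, writing)
    if chunk is None:
        break
    # A real worker would download and write the bytes here.
    finish_chunk(writing, written, chunk)

print(written)
```

If the program stops while a chunk sits in `writing`, that chunk was never completed, which is why a resumed download treats the writing ranges as unwritten again.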

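Splitting and merging ranges (the main job of this class) does take a few small tricks. If you want to practice first, try https://leetcode.com/problems/insert-interval/

It felt familiar while I was writing it, and then I realized I had solved a similar problem on LeetCode.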
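For readers who want a reference point for the merge step, here is a minimal sketch in the style of the insert-interval problem. It assumes the list is sorted and non-overlapping, and it treats ranges as half-open, so ranges that merely touch are merged too. The name `insert_interval` is illustrative and this is not the module's own `_concat`.

```python
def insert_interval(intervals, new):
    """Insert `new` into sorted, non-overlapping `intervals`, merging overlaps.

    Ranges are (start, end) pairs. Ranges that touch, such as (0, 10)
    and (10, 20), are merged into one.
    """
    result = []
    start, end = new
    placed = False
    for s, e in intervals:
        if e < start:
            # Entirely before the new range: keep as is.
            result.append((s, e))
        elif s > end:
            # Entirely after: emit the (possibly grown) new range first.
            if not placed:
                result.append((start, end))
                placed = True
            result.append((s, e))
        else:
            # Overlapping or touching: absorb into the new range.
            start, end = min(s, start), max(e, end)
    if not placed:
        result.append((start, end))
    return result


print(insert_interval([(0, 10), (20, 30)], (10, 20)))
print(insert_interval([(0, 10), (30, 40)], (15, 20)))
```

Merging touching ranges keeps the written list short: a finished download collapses to a single (0, size) entry.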

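Multithreaded download class

This class has nothing very complicated in it. It mainly synchronizes reads of the ranges still to be downloaded, gets the file size, and downloads one range of the file. (Over HTTP(S), a file can be fetched with many separate requests. The Range header on each request states which part of the file is wanted, in the form  Range: bytes=1024-2048 , where both ends are inclusive.) To get the file size, first send a request with Range: bytes=0-0. If the server supports range requests, which is usually the case, the response carries a Content-Range header of the form  bytes 0-0/<total> , and the number after the slash is the total size of the file.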
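The two header formats can be sketched with a pair of small helpers. The names `range_header` and `total_size_from_content_range` are hypothetical; they assume chunks are kept as half-open (start, end) pairs, as in the management file, so the inclusive HTTP end is `end - 1`.

```python
def range_header(start, end):
    """Build a Range header for the half-open byte range [start, end)."""
    # HTTP byte ranges are inclusive on both ends, hence end - 1.
    return {"Range": "bytes={}-{}".format(start, end - 1)}


def total_size_from_content_range(value):
    """Extract the total size from a value such as 'bytes 0-0/10633872234'.

    Returns None when the server reports an unknown size ('*').
    """
    total = value.rsplit("/", 1)[1].strip()
    if total == "*":
        return None
    return int(total)


print(range_header(0, 1))
print(total_size_from_content_range("bytes 0-0/10633872234"))

# With requests, a size probe might look roughly like this (url is a placeholder):
#   r = requests.get(url, headers=range_header(0, 1), stream=True)
#   size = total_size_from_content_range(r.headers["Content-Range"])
```

A server that honours the Range header answers with status 206 (Partial Content); a plain 200 means the header was ignored and the body is the whole file.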

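For thread synchronization an ordinary lock (the threading.Lock class) is enough; both writing the file contents and reading the configuration need one. Writing at a specific position in the file uses the file object's seek function, which works like moving the file position pointer in C. Look it up if it is unfamiliar.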
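A small self-contained sketch of the lock plus seek pattern, using a temporary file instead of a real download. The helper name `write_at` is an assumption for illustration.

```python
import os
import tempfile
import threading

file_lock = threading.Lock()


def write_at(path, offset, data):
    """Write `data` at byte `offset` of an existing file, one thread at a time."""
    with file_lock:
        # 'rb+' opens for reading and writing without truncating the file.
        with open(path, "rb+") as f:
            f.seek(offset)
            f.write(data)


# Create a 10-byte placeholder file, then fill it out of order from two threads.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.truncate(10)

threads = [
    threading.Thread(target=write_at, args=(path, 5, b"world")),
    threading.Thread(target=write_at, args=(path, 0, b"hello")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

with open(path, "rb") as f:
    result = f.read()
os.remove(path)
print(result)  # b'helloworld'
```

Because every chunk lands at its own offset, the order in which the threads finish does not affect the final file.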

Oh, and outside of these there is one more thread: a supervising thread that creates the download threads.
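The arrangement can be sketched like this: an outer thread starts the workers and waits for them, while the workers pull chunks from a shared list under a lock. The class name `Supervisor` and its methods are hypothetical, and the download step is left out.

```python
import threading


class Supervisor(threading.Thread):
    """Outer thread that starts the workers and waits for them (illustrative)."""

    def __init__(self, chunks, worker_count=3):
        super().__init__()
        self.chunks = list(chunks)
        self.done = []
        self.lock = threading.Lock()
        self.worker_count = worker_count

    def next_chunk(self):
        # Hand out chunks one at a time so no two workers get the same one.
        with self.lock:
            return self.chunks.pop(0) if self.chunks else None

    def work(self):
        while True:
            chunk = self.next_chunk()
            if chunk is None:
                return
            # A real worker would download and write the chunk here.
            with self.lock:
                self.done.append(chunk)

    def run(self):
        workers = [threading.Thread(target=self.work, daemon=True)
                   for _ in range(self.worker_count)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()


supervisor = Supervisor([(0, 10), (10, 20), (20, 30), (30, 40)])
supervisor.start()
supervisor.join()
print(sorted(supervisor.done))
```

Keeping the supervisor in its own thread lets the caller start a download and carry on, checking `is_alive()` when it wants to know whether the download has finished.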


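Summary

With that, multithreaded download of large files and resuming after an interruption are both in place. If you are interested, try writing it yourself; it is quite fun. For reference, see https://github.com/HBertram/python-grab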

# -*- coding: utf-8 -*-
"""
Created on Mon Jul  1 16:47:38 2019

@author: Administrator
"""

# Multi-threaded download of a single URL

import requests
import threading
import os
import time

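def download(url, filename, headers=None):
    # Use a browser-like User-Agent unless the caller supplies headers.
    # (A dict as default argument would be shared between calls.)
    if headers is None:
        headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3559.6 Safari/537.36'}
    t = DownloadWorkerThread(url, filename, headers=headers)
    t.start()
    return t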


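# Handles a single download: one URL, one output file, several worker threads
class DownloadWorkerThread(threading.Thread):

    def __init__(self, url, filename, headers = None, thread_count = 3):
        threading.Thread.__init__(self)
        self.filename = filename
        self.url = url
        self.fileinfo_name = filename + ".tmp"
        # Copy the headers so the caller's dict is never modified.
        self.headers = dict(headers or {})
        self.thread_count = thread_count
        # One pair of locks per download: file_lock guards the data file,
        # fileinfo_lock guards the progress (.tmp) file and its ranges.
        self.file_lock = threading.Lock()
        self.fileinfo_lock = threading.Lock()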

    def run(self):
        self.range_manager = self.read_range_file()
        print(u"Begin Downloading \nurl= " + self.url + "\nfilename = " + self.filename)
        if self.url.strip() == "":
            return
        tlst = []
        for i in range(self.thread_count):
            t = threading.Thread(target = self.RangeWorker, args=(self,))
            # getName() and setDaemon() are deprecated; use the attributes.
            print(u"Start Thread :" + t.name)
            t.daemon = True
            t.start()
            tlst.append(t)

        # Wait until every worker has finished or given up.
        for t in tlst:
            t.join()
        

    def write_content(self, content, content_range):
        # Write the chunk at its own offset in the data file.
        # "with" releases the lock even if the write raises.
        with self.file_lock:
            with open(self.filename, 'rb+') as f:
                f.seek(content_range[0])
                f.write(content)

        # Then record the chunk as written in the progress file.
        with self.fileinfo_lock:
            self.range_manager.set_written_range(content_range)

    def read_next_range(self):
        # Returns the next (start, end) chunk to download, or 0 when none is left.
        with self.fileinfo_lock:
            time.sleep(0.1)  # short pause, so at most about ten chunks are handed out per second
            r = self.range_manager.get_unwritten_range()
        return r
        
    def read_range_file(self):
        with self.fileinfo_lock:
            if os.path.exists(self.fileinfo_name):
                # A progress file exists: resume the earlier download.
                print("read filename " + self.fileinfo_name)
                manager = DownloadWorkerThread.FileInfoManager(self.fileinfo_name, url = self.url)
                self.content_length = manager.get_total_length()
                if self.url.strip() == "":
                    self.url = manager.url_in_file
            else:
                # Fresh download: ask the server for the size, then reserve the space.
                self.content_length = self.get_content_length()

                print("create filename_info length:" + str(self.content_length))
                with open(self.filename, "wb+") as f:
                    # seek() alone does not grow a file; truncate() sets its size.
                    f.truncate(self.content_length)
                manager = DownloadWorkerThread.FileInfoManager(self.fileinfo_name, url = self.url, filesize = self.content_length)
        return manager

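    def get_content_length(self):
        # Work on a copy so the shared headers dict is left untouched.
        headers = dict(self.headers)
        headers['Range'] = "bytes=0-0"
        # stream=True avoids pulling the whole body if the server ignores Range.
        with requests.get(self.url, headers=headers, stream=True, timeout=30) as r:
            content_range = r.headers.get('Content-Range')
        if content_range is None:
            raise RuntimeError("No Content-Range header; the server may not support range requests.")
        # Content-Range has the form "bytes 0-0/<total size>".
        # (The earlier loop retried until the size reached 3 MB, so it never
        # returned for smaller files.)
        length = int(content_range.split('/')[1])
        print("Get length " + str(length))
        return length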
            

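    def RangeWorker(self, downloadWorker):
        while True:
            content_range = downloadWorker.read_next_range()
            if content_range == 0:
                # Nothing left to hand out. Delete the progress file only when
                # no chunk is still in flight, so a failed chunk can be resumed.
                with downloadWorker.fileinfo_lock:
                    all_written = not downloadWorker.range_manager.writing_range
                    if all_written and os.path.exists(self.fileinfo_name):
                        os.remove(self.fileinfo_name)
                        print(self.filename + " finished")
                break
            # Copy the headers: the dict is shared by every worker thread, and
            # setting Range on it directly lets threads overwrite each other.
            headers = dict(downloadWorker.headers)
            headers['Range'] = "bytes=" + str(content_range[0]) + "-" + str(content_range[1]-1)
            # The counter must live outside the retry loop, or it resets on every attempt.
            iTryTimes = 0
            while True:
                try:
                    r = requests.get(downloadWorker.url, headers=headers)
                    # 206 Partial Content means the server honoured the Range header;
                    # a plain 200 carries the whole file and must not be written here.
                    ok = r.status_code == 206
                except requests.RequestException:
                    ok = False
                if ok:
                    downloadWorker.write_content(r.content, content_range)
                    print("We are working on " + self.filename + " and now processing : " +
                          str(round(1.0*content_range[1]/self.content_length*100,2)) + "% in size " +
                          str(round(self.content_length/1024.0/1024.0,2)) + "MB.")
                    break
                else:
                    iTryTimes += 1
                    if iTryTimes > 1:
                        print("Downloading " + downloadWorker.url + " error. Now Exit Thread.")
                        return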


    class FileInfoManager():

        def __init__(self, filename, url = "", filesize = 0):
            self.filename = filename
            # Per-instance state; class-level lists would be shared by every download.
            self.url_in_file = url
            self.writing_range = []
            self.written_range = []
            self.unwritten_range = []
            if not os.path.exists(filename):
                # New download: the whole file is still unwritten.
                self.unwritten_range.append((0,filesize))
                self._save_to_file()
            else:
                with open(filename, "r") as f:
                    for l in f.readlines():
                        l = l.strip()
                        if "=" not in l:
                            continue
                        # Split on the first "=" only: URLs may contain "=" as well.
                        typ, value = l.split("=", 1)
                        if typ == "url_in_file":
                            if url.strip() == "":
                                self.url_in_file = value
                            continue
                        if typ == "writing_range":
                            # Chunks in flight when the download stopped never
                            # finished, so they go back to the unwritten list.
                            typ = "unwritten_range"
                        if typ not in ("unwritten_range", "written_range"):
                            continue
                        # value looks like "[(0, 42991616), (46137344, 50000000)]"
                        for tup in value.strip("[]").split(")"):
                            tup = tup.strip(" ,(")
                            if tup == "":
                                continue
                            start, end = tup.split(",")
                            getattr(self, typ).append((int(start), int(end)))
                self.unwritten_range.sort()


        def get_total_length(self):
            # The file size is the largest end offset recorded in any list.
            ends = [r[1] for r in self.unwritten_range + self.writing_range + self.written_range]
            return max(ends) if ends else 0

        def _save_to_file(self):
            # Text mode already translates "\n" to the platform line ending,
            # so writing "\r\n" would produce blank lines on Windows.
            with open(self.filename, "w") as f:
                f.write("writing_range=" + str(self.writing_range) + "\n")
                f.write("unwritten_range=" + str(self.unwritten_range) + "\n")
                f.write("written_range=" + str(self.written_range) + "\n")
                f.write("url_in_file=" + self.url_in_file)
        
        def _splice(self, intervals, newInterval):
            # Remove newInterval from the intervals and return what is left.
            if len(intervals) == 0:
                return []
            intervals = self._concat(intervals, (0,0))
            response = []
            for interval in intervals:
                if interval[0] == interval[1]:
                    continue
                if interval[0] > newInterval[1]:
                    response.append(interval)
                elif interval[1] < newInterval[0]:
                    response.append(interval)
                else:
                    max_range = (min(interval[0], newInterval[0]), max(interval[1], newInterval[1]))
                    if max_range != newInterval:
                        left = (min(max_range[0], newInterval[0]), max(max_range[0], newInterval[0]))
                        right = (min(max_range[1], newInterval[1]), max(max_range[1], newInterval[1]))
                        if left[0] != left[1]:
                            response.append(left)
                        if right[0] != right[1]:
                            response.append(right)
            return response
        
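        def _concat(self, intervals, newInterval):
            # Merge newInterval into a sorted list of intervals, joining any
            # that overlap or touch. Empty intervals are dropped.
            if len(intervals) == 0:
                return [newInterval]
            response = [newInterval]
            for interval in intervals:
                # Skip empty intervals before popping, so the pending
                # interval on top of the stack is not lost.
                if interval[0] == interval[1]:
                    continue
                i = response.pop()
                if i[0] > interval[1]:
                    response.append(interval)
                    response.append(i)
                elif i[1] < interval[0]:
                    response.append(i)
                    response.append(interval)
                else:
                    response.append((min(i[0], interval[0]), max(i[1], interval[1])))
            return response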
                        
        
        def get_unwritten_range(self, size = 1024*1024):
            if len(self.unwritten_range) == 0:
                return 0
            r = self.unwritten_range[0]
            r = (r[0], min(r[0]+size, r[1]))
            self.unwritten_range = self._splice(self.unwritten_range, r)
            self.writing_range = self._concat(self.writing_range, r)
            self._save_to_file()
            return r
            
        def set_written_range(self, content_range):
            self.writing_range = self._splice(self.writing_range, content_range)
            self.written_range = self._concat(self.written_range, content_range)
            self._save_to_file()


#t = DownloadWorkerThread(r'http://a3.kuaihou.com/ruanjian/ucdnb.zip',\
#                         'd:\\ucdnb.zip', \
#                         headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3559.6 Safari/537.36'})
#t.start()

if __name__ == '__main__':
    url = input(u"The URL Waiting for downloading:")
    filename = input(u"The Filepath to save:")
    t = download(url, filename)
    while t.is_alive():
        time.sleep(60)
    print("bye")
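Author: 編程小道士
Link: https://www.jianshu.com/p/e0f42bd3a3ea
Source: Jianshu (簡書)
Copyright belongs to the author. For any form of reproduction, please contact the author for permission and credit the source.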

