A Simple Example of Using the map Function for Parallel Tasks in Python

Reposted from the original article, 2018-07-23 17:20

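It is well known that Python has a poor reputation for parallel processing. Setting aside the standard arguments about threads and the GIL (which are mostly valid), I think the real cause is not a technical shortcoming but the way we are taught to use it. Most tutorials on Python threading and multiprocessing are excellent, but they are long and heavy. They open with plenty of useful background, yet they rarely get to the parts that would actually improve your day-to-day work.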

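The classic example

The top DDG results for "Python threading tutorial" show that nearly every article gives the same example: a class plus a queue.

In practice, they all boil down to something like the following producer/consumer code for threading or multiprocessing: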

 
    # Example.py (Python 2)
    '''
    Standard Producer/Consumer Threading Pattern
    '''

    import time
    import threading
    import Queue

    class Consumer(threading.Thread):
        def __init__(self, queue):
            threading.Thread.__init__(self)
            self._queue = queue

        def run(self):
            while True:
                # queue.get() blocks the current thread until
                # an item is retrieved.
                msg = self._queue.get()
                # Checks if the current message is
                # the "Poison Pill"
                if isinstance(msg, str) and msg == 'quit':
                    # if so, exits the loop
                    break
                # "Processes" (or in our case, prints) the queue item
                print "I'm a thread, and I received %s!!" % msg
            # Always be friendly!
            print 'Bye byes!'


    def Producer():
        # Queue is used to share items between
        # the threads.
        queue = Queue.Queue()

        # Create an instance of the worker
        worker = Consumer(queue)
        # start calls the internal run() method to
        # kick off the thread
        worker.start()

        # variable to keep track of when we started
        start_time = time.time()
        # While under 5 seconds..
        while time.time() - start_time < 5:
            # "Produce" a piece of work and stick it in
            # the queue for the Consumer to process
            queue.put('something at %s' % time.time())
            # Sleep a bit just to avoid an absurd number of messages
            time.sleep(1)

        # This is the "poison pill" method of killing a thread.
        queue.put('quit')
        # wait for the thread to close down
        worker.join()


    if __name__ == '__main__':
        Producer()

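Hmm... it feels a bit like Java.

I am not saying that using producer/consumer for threading or multiprocessing is wrong. It is certainly correct, and in many cases it is the best approach. I just don't think it is the best choice for everyday coding.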

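The problems with it (in my opinion)

First, you need to create a boilerplate class. Then you create a queue to pass objects through, and you have to supervise both ends of that queue to get the work done. (If you want to exchange or store results, that usually means bringing in a second queue.)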
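To make that overhead concrete, here is a minimal Python 3 sketch of the two-queue arrangement just described (the article's own listings are Python 2, where the module is named `Queue`). The names `worker`, `run_jobs`, and the `STOP` sentinel are illustrative assumptions, and squaring a number merely stands in for real work.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel playing the role of the "poison pill"


def worker(tasks, results):
    """Take numbers off `tasks` until STOP arrives; put their squares on `results`."""
    while True:
        item = tasks.get()
        if item is STOP:
            break
        results.put(item * item)


def run_jobs(items, n_workers=2):
    items = list(items)
    tasks = queue.Queue()    # producer -> workers
    results = queue.Queue()  # workers -> producer
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:        # one poison pill per worker
        tasks.put(STOP)
    for t in threads:
        t.join()
    # Completion order is not guaranteed, so sort before returning.
    return sorted(results.get() for _ in range(len(items)))


if __name__ == "__main__":
    print(run_jobs([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

Even in this stripped-down form, most of the lines manage queues and threads rather than doing the work itself, which is the complaint being made here.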

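The more workers, the more problems.

Next, you will probably create a pool of worker classes to squeeze more speed out of Python. Below is the better approach given in an IBM tutorial. It is also the pattern programmers commonly use to fetch web pages with multiple threads.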

 
    # Example2.py (Python 2)
    '''
    A more realistic thread pool example
    '''

    import time
    import threading
    import Queue
    import urllib2

    class Consumer(threading.Thread):
        def __init__(self, queue):
            threading.Thread.__init__(self)
            self._queue = queue

        def run(self):
            while True:
                content = self._queue.get()
                if isinstance(content, str) and content == 'quit':
                    break
                response = urllib2.urlopen(content)
            print 'Bye byes!'


    def Producer():
        urls = [
            'http://www.python.org', 'http://www.yahoo.com',
            'http://www.scala.org', 'http://www.google.com'
            # etc..
        ]
        queue = Queue.Queue()
        worker_threads = build_worker_pool(queue, 4)
        start_time = time.time()

        # Add the urls to process
        for url in urls:
            queue.put(url)
        # Add the poison pill (one per worker)
        for worker in worker_threads:
            queue.put('quit')
        for worker in worker_threads:
            worker.join()

        print 'Done! Time taken: {}'.format(time.time() - start_time)


    def build_worker_pool(queue, size):
        workers = []
        for _ in range(size):
            worker = Consumer(queue)
            worker.start()
            workers.append(worker)
        return workers


    if __name__ == '__main__':
        Producer()

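It does run, but look how complicated this code is! It has an initialization method, a list of threads to keep track of, and, a nightmare for anyone as deadlock-prone as I am, a pile of join statements. And that is only the beginning of the tedium!

What have we accomplished so far? Basically nothing. The code above is almost entirely plumbing. It is a very basic approach and it is error-prone (damn, I just realized I forgot to call task_done() on the queue object, but I'm too lazy to fix it), and it gives little return for the effort. Fortunately, there is a better way.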
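For readers who have not met it, the `task_done()` call mentioned above pairs with `Queue.join()`: each worker reports when an item is finished, and `join()` blocks until every queued item has been reported. A minimal Python 3 sketch follows; `process_all` and the doubling "work" are illustrative assumptions rather than code from the article.

```python
import queue
import threading


def process_all(items, n_workers=3):
    """Double every item on worker threads and wait using Queue.join()."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            try:
                with lock:
                    results.append(item * 2)
            finally:
                tasks.task_done()  # tell the queue this item is finished

    for _ in range(n_workers):
        # Daemon threads stay blocked on get() and end with the program.
        threading.Thread(target=worker, daemon=True).start()
    for item in items:
        tasks.put(item)
    tasks.join()  # returns once every put() has a matching task_done()
    return sorted(results)


if __name__ == "__main__":
    print(process_all([3, 1, 2]))  # [2, 4, 6]
```

Forgetting `task_done()` in a design like this would leave `tasks.join()` waiting forever, which is exactly the kind of slip this style of code invites.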

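Introducing: map

map is a great little function, and it is the key to getting parallel Python code running quickly. For those unfamiliar with it, map comes from functional languages such as Lisp. It is a function that maps another function over a sequence, in order. For example: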

    urls = ['http://www.yahoo.com', 'http://www.reddit.com']
    results = map(urllib2.urlopen, urls)

This applies urlopen to each URL and returns all of the results, in order, stored in a list. It is equivalent to:

 
    results = []
    for url in urls:
        results.append(urllib2.urlopen(url))

map handles the iteration over the sequence for us. We call it with a function, and it hands back a simple list holding the results in order.
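One caveat for Python 3 readers: the built-in `map` there returns a lazy iterator rather than a list, so it has to be wrapped in `list()` to get the behavior described here. A small sketch using `len` as a stand-in function (the word list is made up for illustration):

```python
words = ["map", "is", "handy"]

# Built-in map applies the function to each item, in order.
# In Python 3 it is lazy, so list() forces the results.
lengths = list(map(len, words))

# The equivalent explicit loop.
lengths_loop = []
for word in words:
    lengths_loop.append(len(word))

print(lengths)                  # [3, 2, 5]
print(lengths == lengths_loop)  # True
```

The pool-based `map` methods discussed next return a real list in both Python 2 and Python 3.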

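Why is that such a big deal? Because with the right library, map makes running things in parallel almost effortless!

Two libraries provide a parallel version of map: one is multiprocessing, and the other is its little-known but equally powerful submodule, multiprocessing.dummy.

An aside: what's that? You have never heard of the dummy multiprocessing library? I only learned of it recently myself. It gets a single sentence in the multiprocessing documentation, and that sentence does little more than tell you it exists. In my view, it is badly undersold!

dummy is a clone of the multiprocessing module. The only difference is that multiprocessing uses processes, while dummy uses threads (with all the usual Python limitations that implies). Because the two expose the same API, what works with one works with the other, which makes it easy to switch back and forth between them. That is especially handy for exploratory programming, when you are not yet sure whether some framework call is I/O bound or CPU bound.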
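The swap can be sketched in a few lines of Python 3. This is only an illustration under assumptions: `POOL_TYPES`, `run_with`, and `square` are hypothetical names, and squaring numbers stands in for real work.

```python
from multiprocessing import Pool                      # process-based workers
from multiprocessing.dummy import Pool as ThreadPool  # thread-based workers, same API

# Hypothetical lookup table so the pool type becomes a one-word setting.
POOL_TYPES = {"threads": ThreadPool, "processes": Pool}


def square(n):
    return n * n


def run_with(kind, numbers, workers=4):
    """Run the same map call on the chosen kind of pool."""
    with POOL_TYPES[kind](workers) as pool:
        return pool.map(square, numbers)


if __name__ == "__main__":
    print(run_with("threads", range(5)))  # [0, 1, 4, 9, 16]
    # run_with("processes", range(5)) returns the same list, computed in
    # separate processes instead of threads.
```

With processes, the mapped function has to be defined at module level so it can be pickled, and the call should stay under the `if __name__ == "__main__":` guard. Threads have neither restriction.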

Getting started

To parallelize with map, first import the modules that provide it:

    from multiprocessing import Pool
    from multiprocessing.dummy import Pool as ThreadPool

Then instantiate a pool:

    pool = ThreadPool()

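This single statement does everything that the build_worker_pool function did in Example2.py. In other words, it creates a set of available workers, starts them up so they are ready for work, and stores them in the pool object for easy access.

The Pool object takes a few parameters, but the most important is the first one: processes. It sets the number of workers in the pool. If you leave it out, it defaults to the number of cores in your machine.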
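A quick way to see the `processes` parameter at work is to ask each task which thread ran it. The Python 3 sketch below is illustrative only: `which_thread` and `worker_names` are made-up helpers, and the short sleep is an assumption that imitates I/O so the tasks spread across workers.

```python
import os
import threading
import time
from multiprocessing.dummy import Pool as ThreadPool


def which_thread(_):
    time.sleep(0.05)  # simulate a little I/O so the work spreads across workers
    return threading.current_thread().name


def worker_names(pool_size, n_tasks):
    """Return the name of the worker thread that handled each task."""
    with ThreadPool(processes=pool_size) as pool:
        return pool.map(which_thread, range(n_tasks))


if __name__ == "__main__":
    print("CPU count reported by the OS:", os.cpu_count())
    names = worker_names(3, 12)
    print(len(set(names)), "distinct worker threads handled", len(names), "tasks")
```

However many tasks are queued, the number of distinct worker names never exceeds the pool size that was passed in.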

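If you use the multiprocessing pool for CPU-bound tasks, more cores generally mean more speed (with plenty of caveats). With threads or network-bound work, however, results vary much more, so you should set the pool size explicitly and experiment with it.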

    pool = ThreadPool(4)  # Sets the pool size to 4

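If you run too many threads, much of the time is wasted switching between them, so be patient and experiment until you find the number that suits the task.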
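One hedged way to run that experiment in Python 3 is to time the same batch of work at several pool sizes. In this sketch `fake_download` is a stand-in that just sleeps (an assumed 0.05 s per "request"), and `time_pool` is a hypothetical helper; real network calls will behave less predictably.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_download(_):
    time.sleep(0.05)  # stand-in for a network request (assumed 0.05 s each)
    return True


def time_pool(size, n_tasks=16):
    """Return the seconds taken to run n_tasks fake downloads on `size` threads."""
    start = time.perf_counter()
    with ThreadPool(size) as pool:
        pool.map(fake_download, range(n_tasks))
    return time.perf_counter() - start


if __name__ == "__main__":
    for size in (1, 4, 8, 16):
        print("{:>2} threads: {:.2f} s".format(size, time_pool(size)))
```

Because the simulated task only sleeps, the timings here improve steadily as the pool grows. With real work, the point where extra threads stop helping has to be measured.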

Now that we have created the pool object, a simple parallel program is within reach, so let's rewrite the URL opener from Example2.py!

 
    import urllib2
    from multiprocessing.dummy import Pool as ThreadPool

    urls = [
        'http://www.python.org',
        'http://www.python.org/about/',
        'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
        'http://www.python.org/doc/',
        'http://www.python.org/download/',
        'http://www.python.org/getit/',
        'http://www.python.org/community/',
        'https://wiki.python.org/moin/',
        'http://planet.python.org/',
        'https://wiki.python.org/moin/LocalUserGroups',
        'http://www.python.org/psf/',
        'http://docs.python.org/devguide/',
        'http://www.python.org/community/awards/'
        # etc..
    ]

    # Make the Pool of workers
    pool = ThreadPool(4)
    # Open the urls in their own threads
    # and return the results
    results = pool.map(urllib2.urlopen, urls)
    # close the pool and wait for the work to finish
    pool.close()
    pool.join()

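There it is! This time the code that does all the work is only 4 lines, and 3 of those are simple boilerplate. The map call does what took 40 lines in the earlier example! To show the difference between the approaches more concretely, I also timed each of them.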

 
    # results = []
    # for url in urls:
    #     result = urllib2.urlopen(url)
    #     results.append(result)

    # # ------- VERSUS ------- #

    # # ------- 4 Pool ------- #
    # pool = ThreadPool(4)
    # results = pool.map(urllib2.urlopen, urls)

    # # ------- 8 Pool ------- #
    # pool = ThreadPool(8)
    # results = pool.map(urllib2.urlopen, urls)

    # # ------- 13 Pool ------- #
    # pool = ThreadPool(13)
    # results = pool.map(urllib2.urlopen, urls)

Results:

 
    # Single thread: 14.4 Seconds
    #        4 Pool:  3.1 Seconds
    #        8 Pool:  1.4 Seconds
    #       13 Pool:  1.3 Seconds

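Pretty impressive! It also shows why it pays to tune the pool size carefully. In this run, going beyond about 9 workers brought almost no further gain: 8 workers took 1.4 seconds and 13 workers took 1.3 seconds.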
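The listings above are Python 2. In Python 3, `urllib2.urlopen` lives at `urllib.request.urlopen`, and the pool can be used as a context manager. The sketch below shows one possible port; `fetch`, `fetch_all`, and the `fetch_one` parameter are hypothetical names, and the 10-second timeout is an assumption, not something taken from the article.

```python
from multiprocessing.dummy import Pool as ThreadPool
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, workers=4):
    """Fetch every URL on a thread pool; results come back in input order."""
    with ThreadPool(workers) as pool:
        return pool.map(fetch_one, urls)


if __name__ == "__main__":
    # A stand-in fetcher so the demo runs without network access.
    pages = fetch_all(["a", "b", "c"], fetch_one=lambda url: url.upper())
    print(pages)  # ['A', 'B', 'C']
```

Calling `fetch_all(urls)` with a real list of URLs uses the default `fetch`. Passing the fetch function in as a parameter is a design choice that keeps the pool logic easy to try out offline.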

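Example 2:

Generating thousands of thumbnails

Now let's do something CPU bound! At work I often have to process huge folders of images, and one of the tasks is creating thumbnails. This kind of job is well suited to parallel processing.

The basic single-threaded version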

 
    import os
    import PIL

    from multiprocessing import Pool
    from PIL import Image

    SIZE = (75, 75)
    SAVE_DIRECTORY = 'thumbs'

    def get_image_paths(folder):
        return (os.path.join(folder, f)
                for f in os.listdir(folder)
                if 'jpeg' in f)

    def create_thumbnail(filename):
        im = Image.open(filename)
        # Image.ANTIALIAS was removed in Pillow 10; use Image.LANCZOS there
        im.thumbnail(SIZE, Image.ANTIALIAS)
        base, fname = os.path.split(filename)
        save_path = os.path.join(base, SAVE_DIRECTORY, fname)
        im.save(save_path)

    if __name__ == '__main__':
        folder = os.path.abspath(
            '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
        os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

        images = get_image_paths(folder)

        # Pass each path (the loop variable), not the Image module
        for image in images:
            create_thumbnail(image)

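It is a bit involved for an example, but in essence it takes a folder, grabs all of the images in it, and then creates the thumbnails and saves them in their own directory.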
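Two details of the path handling are worth a note. The filter `'jpeg' in f` matches that substring anywhere in a file name, and `os.mkdir` raises an error if the `thumbs` directory already exists. The Python 3 sketch below shows one possible tightening; `IMAGE_EXTENSIONS` and `ensure_thumb_dir` are hypothetical additions, not part of the original script.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg")  # assumed set of extensions to accept


def get_image_paths(folder):
    """Yield full paths of files in `folder` whose extension is in IMAGE_EXTENSIONS."""
    for name in sorted(os.listdir(folder)):
        ext = os.path.splitext(name)[1].lower()
        if ext in IMAGE_EXTENSIONS:
            yield os.path.join(folder, name)


def ensure_thumb_dir(folder, name="thumbs"):
    """Create the output directory if needed and return its path."""
    path = os.path.join(folder, name)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        for name in ("a.jpeg", "b.JPG", "notes.txt", "jpeg_list.csv"):
            open(os.path.join(folder, name), "w").close()
        ensure_thumb_dir(folder)
        ensure_thumb_dir(folder)  # second call is harmless
        print([os.path.basename(p) for p in get_image_paths(folder)])
        # ['a.jpeg', 'b.JPG']
```

Matching on the extension skips files such as `jpeg_list.csv`, which the substring test would have passed on to the image library.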

On my machine this took 27.9 seconds to process roughly 6000 images.

If we replace the for loop with a parallel map call:

 
    import os
    import PIL

    from multiprocessing import Pool
    from PIL import Image

    SIZE = (75, 75)
    SAVE_DIRECTORY = 'thumbs'

    def get_image_paths(folder):
        return (os.path.join(folder, f)
                for f in os.listdir(folder)
                if 'jpeg' in f)

    def create_thumbnail(filename):
        im = Image.open(filename)
        # Image.ANTIALIAS was removed in Pillow 10; use Image.LANCZOS there
        im.thumbnail(SIZE, Image.ANTIALIAS)
        base, fname = os.path.split(filename)
        save_path = os.path.join(base, SAVE_DIRECTORY, fname)
        im.save(save_path)

    if __name__ == '__main__':
        folder = os.path.abspath(
            '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
        os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

        images = get_image_paths(folder)

        # One worker process per core by default
        pool = Pool()
        pool.map(create_thumbnail, images)
        pool.close()
        pool.join()

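5.6 seconds!

That is a huge speedup for changing only a few lines of code. It can be made faster still by splitting the CPU and I/O work into dedicated processes and threads, although that often leads to deadlocks. All in all, given how practical map is and the absence of manual thread management, I find this approach clean, reliable, and easy to debug.

And that is the end of the article. Parallelism in one line.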
