1. The multiprocessing.pool.Pool class
class multiprocessing.pool.Pool([processes[, initializer[, initargs[, maxtasksperchild[, context]]]]])
Purpose: A process pool object which controls a pool of worker processes to which jobs can be submitted. It supports asynchronous results with timeouts and callbacks and has a parallel map implementation.
Parameters:
processes is the number of worker processes to use. If processes is None then the number returned by os.cpu_count() is used.
If initializer is not None then each worker process will call initializer(*initargs) when it starts.
maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool.
context can be used to specify the context used for starting the worker processes. Usually a pool is created using the function multiprocessing.Pool() or the Pool() method of a context object. In both cases context is set appropriately.
Note that the methods of the pool object should only be called by the process which created the pool.
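As a sketch of how the parameters above fit together, the snippet below starts a pool whose workers each run an initializer once at startup and are recycled after a fixed number of tasks. The function names (init_worker, tag) and the prefix value are illustrative assumptions, not part of the documented API.

```python
from multiprocessing import Pool

# Hypothetical initializer: called once in each worker process when it starts.
def init_worker(prefix):
    global worker_prefix  # illustrative per-worker global set by the initializer
    worker_prefix = prefix

def tag(x):
    # Uses the global that init_worker set up in this worker process.
    return worker_prefix + str(x)

if __name__ == '__main__':
    # processes=None would default to os.cpu_count();
    # maxtasksperchild=50 replaces each worker after it has completed 50 tasks.
    with Pool(processes=2, initializer=init_worker,
              initargs=('task-',), maxtasksperchild=50) as pool:
        print(pool.map(tag, range(3)))  # prints "['task-0', 'task-1', 'task-2']"
```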
A Chinese translation of the Pool() documentation is available at: http://www.cnblogs.com/congbo/archive/2012/08/23/2652490.html
About multiprocessing:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
Note in particular that this creates multiple processes, not threads, so for CPU-bound work, setting the first argument of Pool() higher than the number of CPU cores may actually reduce efficiency. This is worth measuring on your own machine.
For more information about multiprocessing, please check the Python documentation.
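A minimal sketch of sizing the pool to the core count, as suggested above. The toy task (square) is a stand-in for real CPU-bound work; the fallback of 2 workers is an assumption for environments where os.cpu_count() returns None.

```python
import os
from multiprocessing import Pool

def square(n):  # a toy CPU-bound task
    return n * n

if __name__ == '__main__':
    # Size the pool to the number of CPU cores, as recommended for CPU-bound work.
    n_workers = os.cpu_count() or 2  # fallback if the count is unavailable
    with Pool(processes=n_workers) as pool:
        print(pool.map(square, range(8)))  # prints "[0, 1, 4, 9, 16, 25, 36, 49]"
```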
2. Example and walkthrough
This section focuses on the map function, which handles iterating over a sequence, passing arguments, and collecting results in a single call.
First, import the library (note that multiprocessing.dummy provides a thread pool with the same API as multiprocessing.Pool):
from multiprocessing.dummy import Pool
pool = Pool(4)
results = pool.map(crawl_function, url_list)
This article uses a simple example to show how to use map, and how this approach compares with the ordinary sequential method.
import time
import requests
from multiprocessing.dummy import Pool

def getsource(url):
    html = requests.get(url)
    return html.text

urls = []
for i in range(1, 21):
    newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
    urls.append(newpage)

timex = time.time()  # Test 1: sequential requests
for i in urls:
    getsource(i)
print(time.time() - timex)
# Output:
# 10.2820000648
time1 = time.time()  # Test 2: thread pool
pool = Pool(4)
results = pool.map(getsource, urls)
pool.close()
pool.join()
print(time.time() - time1)
# Output:
# 3.23600006104
Comparing the two methods, Test 2 is clearly much faster than Test 1.
A brief explanation of the program:
In Test 1:
for i in urls:
    getsource(i)  # iterate over the URLs in the list, calling getsource once per URL
In Test 2:
pool = Pool(4)  # create a pool of 4 workers (multiprocessing.dummy uses threads); choose the count based on your machine's CPU count
results = pool.map(getsource, urls)  # map applies the given function to every item in the list and collects the results
pool.close()  # close the pool so no new tasks can be submitted
pool.join()  # wait until all 4 workers have finished
print(time.time() - time1)  # print the elapsed time
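The close/join pattern explained above can be reproduced without network access by simulating the fetch. In this sketch, fake_fetch and its 0.1-second sleep are illustrative assumptions standing in for the article's real requests.get call; with 8 URLs and 4 workers, the pool finishes in roughly two batches instead of eight sequential calls.

```python
import time
from multiprocessing.dummy import Pool  # thread pool with the same API as Pool

def fake_fetch(url):
    time.sleep(0.1)  # simulate network latency instead of a real requests.get
    return url

urls = ['http://example.com/page%d' % i for i in range(8)]

t0 = time.time()
pool = Pool(4)
results = pool.map(fake_fetch, urls)  # results come back in input order
pool.close()  # no more tasks may be submitted
pool.join()   # wait for all workers to finish
elapsed = time.time() - t0  # roughly 0.2 s here vs. roughly 0.8 s sequentially
print(len(results))  # prints "8"
```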
Other Pool methods:
from multiprocessing import Pool
import time

def f(x):  # define a simple function f
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)             # start 4 worker processes
    result = pool.apply_async(f, (10,))  # evaluate "f(10)" asynchronously
    print(result.get(timeout=1))         # wait at most 1 second, then fetch the result
    print(pool.map(f, range(10)))        # prints "[0, 1, 4,..., 81]"
    it = pool.imap(f, range(10))         # lazily apply f to each item
    print(next(it))                      # prints "0"
    print(next(it))                      # prints "1"
    print(it.next(timeout=1))            # prints "4" unless your computer is *very* slow
    result = pool.apply_async(time.sleep, (10,))
    print(result.get(timeout=1))         # raises TimeoutError
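The class description above also mentions callback support. A sketch of that, under the assumption that collecting results into a plain list is acceptable (the names on_done and collected are illustrative): apply_async can take a callback that the pool invokes in the parent process as soon as the result is ready.

```python
from multiprocessing import Pool

def f(x):
    return x * x

collected = []

def on_done(value):  # callback; runs in the parent process when f finishes
    collected.append(value)

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        r = pool.apply_async(f, (7,), callback=on_done)
        r.wait()  # block until f(7) has completed (and the callback has run)
    print(collected)  # prints "[49]"
```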
Example source: http://blog.csdn.net/winterto1990/article/details/47976105