Multitasking Crawlers

Reposted article. 2020-03-31 00:47. Tag: crawlers

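I. Introduction to multitasking

1. Why use a multitasking crawler?

When a large number of URLs must be requested, crawling them in a single thread or a single process is too slow: while each request waits for its response the CPU sits idle, and that CPU time is wasted.
Separating crawling from file writing keeps the crawl from stalling on file IO, which increases crawl speed and makes fuller use of the CPU.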
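To make the cost of waiting concrete, here is a minimal sketch that compares a sequential crawl with a threaded one. It makes no real requests: `fake_fetch` is a hypothetical stand-in that sleeps for 0.2 seconds to imitate network latency, and the `example.com` URLs are placeholders.

```python
import time
from threading import Thread


def fake_fetch(url, results):
    """Stand-in for an HTTP request: sleeping blocks the thread like network IO does."""
    time.sleep(0.2)
    results.append(url)


def crawl_sequential(urls):
    results = []
    for url in urls:
        fake_fetch(url, results)
    return results


def crawl_threaded(urls):
    results = []
    threads = [Thread(target=fake_fetch, args=(url, results)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every simulated request to finish
    return results


urls = [f'https://example.com/page/{i}' for i in range(5)]  # placeholder URLs

start = time.perf_counter()
sequential = crawl_sequential(urls)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
threaded = crawl_threaded(urls)
threaded_time = time.perf_counter() - start

print(f'sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s')
```

The sequential version waits for each simulated request in turn, while the threaded version waits for all of them at once, so its total time should be close to a single request's latency.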

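2. Types of multitasking

Process: the smallest unit of resource allocation in the operating system. A running program contains at least one process, and processes do not share data with each other. (Can use multiple cores.)
Thread: the smallest unit of CPU scheduling. A process contains at least one thread, and the threads of a process share data, so data safety must be considered whenever several threads operate on the same object. (The most common choice for crawlers.)
Coroutine: coroutines live inside a thread. When code running in a thread reaches an IO operation, execution switches to other code in the same thread, which hides as much IO waiting as possible.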
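The coroutine behaviour described above can be sketched with the standard `asyncio` module. This is an illustration only: `fake_fetch` is a hypothetical function, and `asyncio.sleep` stands in for a real network call.

```python
import asyncio


async def fake_fetch(name, delay, log):
    log.append(f'{name} start')
    await asyncio.sleep(delay)  # simulated IO: control returns to the event loop here
    log.append(f'{name} done')
    return name


async def main():
    log = []
    # Both coroutines run in the same thread; gather returns results in argument order
    results = await asyncio.gather(
        fake_fetch('a', 0.2, log),
        fake_fetch('b', 0.1, log),
    )
    return results, log


results, log = asyncio.run(main())
print(results)  # ['a', 'b']
print(log)      # ['a start', 'b start', 'b done', 'a done']
```

While `a` is waiting, the single thread switches to `b`, which is why `b` finishes first even though it started second.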

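3. How to make a program run faster

1. Increase CPU utilization

If a program has only one thread, the CPU works on that thread alone. When the thread reaches an IO operation, the CPU has nothing to do until the IO completes, and that idle time is wasted CPU capacity.

If the program is multithreaded, the CPU switches between the threads: when one thread blocks, the CPU does not sit idle but runs another thread instead.

2. Add more CPUs

A single CPU core can work on only one task at a time. With more cores, several tasks can be processed at the same moment, which also speeds the program up. Multiprocessing is one way to do this.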

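II. The threading module in Python (starting multiple threads)

Under the CPython interpreter, Python threads are not truly parallel: the threads of one process cannot execute Python code on several cores at the same time. The interpreter switches between them so that they take turns, which only looks simultaneous. The cause is the GIL (global interpreter lock). Threads within a process share data, so without the GIL several threads could modify the same object at the same moment and corrupt its reference count, which would break garbage collection. The GIL prevents this by letting only one thread execute Python bytecode at a time. A thread releases the GIL while it waits on blocking IO, which is why threads still speed up IO-bound work such as crawling.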
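The effect of the GIL on CPU-bound code can be observed with a small, informal timing experiment. This is a sketch, not a benchmark: `count_down` and the loop size `N` are arbitrary choices made for illustration, and the exact timings depend on the machine and the interpreter build.

```python
import time
from threading import Thread


def count_down(n):
    # Pure-Python CPU-bound loop: there is no IO, so the thread never waits
    while n > 0:
        n -= 1


N = 2_000_000  # arbitrary amount of work

start = time.perf_counter()
count_down(N)
count_down(N)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
threads = [Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_time = time.perf_counter() - start

print(f'sequential: {sequential_time:.2f}s, two threads: {threaded_time:.2f}s')
```

On a standard CPython build with the GIL, the two timings are usually similar, because the two threads take turns rather than running on two cores at once.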

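2.1 Ways to start a thread

from threading import Thread


# 1. Using a function
def task(name):
    # Work to run in the thread
    print(f'hello, {name}')


# 2. Using a class
class MyThread(Thread):
    def __init__(self, name_arg):
        super().__init__()  # initialise Thread before setting attributes
        self.name_arg = name_arg

    def run(self):
        # Put the code that should run in the thread here
        print(f'hello, {self.name_arg}')


if __name__ == '__main__':
    # target: the function the thread runs; args: its arguments, as a tuple
    t = Thread(target=task, args=('function',))  # create the thread
    t.start()                                    # start the thread

    my = MyThread('class')
    my.start()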

2.2 Commonly used thread functions and methods

import random
import threading
import time
from threading import Thread, current_thread


class MyThread(Thread):
    def __init__(self, index):
        super().__init__()
        self.index = index

    def run(self):
        time.sleep(random.random())
        msg = "I'm " + self.name + " @ " + str(self.index)  # self.name: this thread's name
        print(msg)
        print(current_thread().ident)       # ID of the current thread
        print(current_thread().is_alive())  # whether the current thread is alive


if __name__ == '__main__':
    t_list = []
    for i in range(5):
        t = MyThread(i)
        t.start()
        t_list.append(t)
    # active_count(): number of live threads, including the main thread
    print(threading.active_count())
    # enumerate(): list of live threads, including the main thread
    print(threading.enumerate())
    for t in t_list:
        # join() blocks the main thread until thread t has finished
        t.join()

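2.3 Daemon threads

A thread marked as a daemon does not keep the program alive. Once the main thread and all non-daemon threads have finished, the program exits and the daemon thread ends with it.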

from threading import Thread, current_thread
import time


def bar():
    while True:
        time.sleep(1)
        print(current_thread().name)


def foo():
    print(f'{current_thread().name} started...')
    time.sleep(2)
    print(f'{current_thread().name} finished...')


if __name__ == '__main__':
    t1 = Thread(target=bar)
    t1.daemon = True  # mark t1 as a daemon thread
    t1.start()
    t2 = Thread(target=foo)
    t2.start()

# Output (one possible run; on Python 3.10+ the names also include the target, e.g. "Thread-1 (bar)")
Thread-2 started...
Thread-1
Thread-1
Thread-2 finished...

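2.4 Locks

In a multithreaded crawler, several threads sometimes read and write the same file, which makes the data unsafe. The Tencent careers example below writes its results to an Excel file. Because several threads write to the same file, the write is wrapped in a lock.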

import requests
from jsonpath import jsonpath
from excle_wirte import ExcelUtils  # appears to be the author's own helper module (not shown)
from threading import Thread, Lock  # threads share memory, so threading.Lock is sufficient
import os

def get_content(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'referer': 'https://careers.tencent.com/search.html'
    }
    print(url)
    res = requests.get(url, headers=headers).json()
    jp = jsonpath(res, '$.*.Posts.*')
    return jp


def write_excel(filename, item_list, sheetname):
    if not os.path.exists(filename):
        ExcelUtils.write_to_excel(filename, item_list, sheetname)
    else:
        ExcelUtils.append_to_excel(filename, item_list)


def main(i, lock):
    base_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1585401795646&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=20&language=zh-cn&area=cn'
    content = get_content(base_url.format(i))
    with lock:  # hold the lock so that only one thread writes at a time
        write_excel('tencent.xls', content, 'hr')


if __name__ == '__main__':
    lock = Lock()  # create the lock
    for i in range(1, 11):
        t = Thread(target=main, args=(i, lock))
        t.start()
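The example above needs network access and the `ExcelUtils` helper, so it cannot be run as-is. The locking idea itself can be tried in isolation with a minimal sketch in which several threads update one shared counter. The names `counter` and `add_many` are illustrative and do not come from the original code.

```python
from threading import Thread, Lock

counter = 0
lock = Lock()


def add_many(times):
    global counter
    for _ in range(times):
        with lock:  # only one thread at a time may run the read-modify-write below
            counter += 1


threads = [Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

The `with lock:` block plays the same role as the one around `write_excel`: it makes the update of the shared resource atomic with respect to the other threads.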

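2.5 The producer-consumer model

The producer-consumer problem is a classic problem in threading: producers and consumers share one storage space over the same period of time. Producers add products to the storage space and consumers take products out of it. When the storage space is empty, consumers block; when it is full, producers block.

Example: a producer-consumer version of the Tencent careers crawler. It is written with functions here; it could also be written with classes, which would be more convenient.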

import requests
from jsonpath import jsonpath
from excle_wirte import ExcelUtils
from threading import Thread, Lock  # threads share memory, so threading.Lock is sufficient
import os
from queue import Queue

flag = False


def ger_url_list(num, url_queue):
    base_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1585401795646&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=20&language=zh-cn&area=cn'
    for i in range(1, num + 1):
        url_queue.put(base_url.format(i))


def producer(url_queue, content_queue):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'referer': 'https://careers.tencent.com/search.html'
    }
    while True:
        try:
            url = url_queue.get_nowait()
            res = requests.get(url, headers=headers).json()
            jp = jsonpath(res, '$.*.Posts.*')
            content_queue.put(jp)
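        except Exception as e:
            # get_nowait() raises queue.Empty once the URL queue is drained;
            # note that a failed request also ends this producer
            break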


def consumer(content_queue, lock, filename, sheetname):
    while True:
        if content_queue.empty() and flag:
            break
        try:
            item_list = content_queue.get_nowait()
            with lock:
                if not os.path.exists(filename):
                    ExcelUtils.write_to_excel(filename, item_list, sheetname)
                else:
                    ExcelUtils.append_to_excel(filename, item_list)
        except Exception as e:
            # the queue is temporarily empty: loop and try again
            pass


if __name__ == '__main__':
    p_t_list = []
    url_queue = Queue()  # queue of URLs to fetch
    content_queue = Queue()  # queue of fetched page content
    ger_url_list(10, url_queue)  # fill the URL queue
    lock = Lock()  # create the lock object
    for i in range(4):  # start four threads that fetch page content
        p_t = Thread(target=producer, args=(url_queue, content_queue))
        p_t.start()
        p_t_list.append(p_t)
    for i in range(4):  # four threads that parse the content and write it to the file
        t = Thread(target=consumer, args=(content_queue, lock, 'tencent.xls', 'hr'))
        t.start()
    for i in p_t_list:
        i.join()
    flag = True  # signals the consumers that the producers have finished
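The consumers above poll the queue with `get_nowait()` and a global flag, which keeps them spinning while the queue is empty. One common alternative, sketched here as a variation rather than as the author's method, is to use blocking `get()` calls and a sentinel value to signal the end of work. The sketch is self-contained: the network request is replaced by a string, the URLs are placeholders, and `SENTINEL` and `run` are names invented for the illustration.

```python
from queue import Queue
from threading import Thread

SENTINEL = None  # hypothetical end-of-work marker


def producer(url_queue, content_queue):
    while True:
        url = url_queue.get()
        if url is SENTINEL:
            break
        content_queue.put(f'content of {url}')  # stand-in for requests.get(url)


def consumer(content_queue, results):
    while True:
        item = content_queue.get()  # blocks until an item is available, no busy-waiting
        if item is SENTINEL:
            break
        results.append(item)


def run(num_urls=10, num_producers=4, num_consumers=2):
    # maxsize=5: producers block on put() while the content queue is full
    url_queue, content_queue, results = Queue(), Queue(maxsize=5), []
    for i in range(1, num_urls + 1):
        url_queue.put(f'https://example.com/page/{i}')  # placeholder URLs
    for _ in range(num_producers):
        url_queue.put(SENTINEL)  # one stop marker per producer

    producers = [Thread(target=producer, args=(url_queue, content_queue))
                 for _ in range(num_producers)]
    consumers = [Thread(target=consumer, args=(content_queue, results))
                 for _ in range(num_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in range(num_consumers):
        content_queue.put(SENTINEL)  # all producers are done: tell the consumers to stop
    for t in consumers:
        t.join()
    return results


results = run()
print(len(results))  # 10
```

The bounded `content_queue` reproduces both halves of the model described earlier: consumers block while it is empty and producers block while it is full.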

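2.6 Multiprocessing

Multiprocessing is generally used for compute-intensive tasks and is less common in crawlers: the useful number of processes is bounded by the number of CPU cores, and the operating system must allocate resources to every process it starts, which makes processes more expensive than threads. This section only briefly covers the library and how it is used in Python. Note that processes cannot exchange data directly. They rely on IPC (inter-process communication), which offers several mechanisms for exchanging data; the common ones are queues, pipes, and sockets. Third-party tools such as RabbitMQ and Redis can also be used.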

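from multiprocessing import Process


# 1. Using a function
def task(name):
    # Work to run in the child process
    print(f'hello, {name}')


# 2. Using a class
class MyProcess(Process):
    def __init__(self, name_arg):
        super().__init__()  # initialise Process before setting attributes
        self.name_arg = name_arg

    def run(self):
        # Put the code that should run in the child process here
        print(f'hello, {self.name_arg}')


if __name__ == '__main__':
    # target: the function the process runs; args: its arguments, as a tuple
    p = Process(target=task, args=('function',))  # create the process
    p.start()                                     # start the process

    my = MyProcess('class')
    my.start()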

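The multiprocessing library provides many objects related to multiprocessing:

from multiprocessing import Queue, Pipe, Pool  # among others
# Queue: a queue
# Pipe: a pipe
# Pool: a pool (a separate module, concurrent.futures, unifies the process-pool and thread-pool interfaces and is more convenient to use)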
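As an illustration of queue-based IPC, the following sketch has a child process send results back to its parent through a `multiprocessing.Queue`. The function names `square_worker` and `collect` are hypothetical, chosen only for this example.

```python
from multiprocessing import Process, Queue


def square_worker(numbers, out_queue):
    # Runs in the child process; results travel back to the parent through the queue
    for n in numbers:
        out_queue.put((n, n * n))


def collect(out_queue, count):
    # Read exactly `count` (number, square) pairs from the queue
    return dict(out_queue.get() for _ in range(count))


if __name__ == '__main__':
    numbers = [1, 2, 3, 4]
    out_queue = Queue()
    p = Process(target=square_worker, args=(numbers, out_queue))
    p.start()
    squares = collect(out_queue, len(numbers))  # drain the queue before joining
    p.join()
    print(squares)  # {1: 1, 2: 4, 3: 9, 4: 16}
```

The parent never reads the child's variables directly. Everything it learns about the child's work arrives through the queue, which is the point of the IPC requirement described above.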

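III. Pools

3.1 What is a pool?

Pools come in two kinds, thread pools and process pools. A pool holds a fixed number of threads or processes. Each task takes a thread or process from the pool to run on, and when it finishes, the next task takes that thread or process, until all tasks are done.

3.2 Why use a pool?

A pool gives good control over the number of threads or processes that are started, which improves efficiency while keeping resource usage bounded.
A callback function can be registered, which makes it easy to process return values.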

3.3 Basic pool usage: a process pool is used here, and a thread pool works the same way.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def fun(i):
    return i ** 2


def pr(con):
    p = con.result()
    print(p)


if __name__ == '__main__':
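    p_pool = ProcessPoolExecutor(max_workers=4)  # create a pool of four processes
    for i in range(10):  # 10 tasks
        p = p_pool.submit(fun, i)  # submit a task
        p.add_done_callback(pr)  # register a callback that runs when the task completes
    p_pool.shutdown()  # wait for pending tasks, then shut the pool down
# Output (the order can vary with scheduling)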
0
1
4
9
16
25
36
49
64
81

3.4 The pool's map method, suited to functions with simple arguments

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def fun(i):
    return i ** 2
   
if __name__ == '__main__':
    p_pool = ProcessPoolExecutor(max_workers=4)
    p = p_pool.map(fun, range(10))
    print(list(p))  # map returns a generator; convert it to a list or iterate over it to get the values

# Output
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
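Since threads are the more common choice for crawlers, here is a hedged sketch of the same pool idea with `ThreadPoolExecutor`. It uses `as_completed` to handle each result as soon as it is ready. `fake_fetch` is a hypothetical stand-in for a page request and makes no network calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(page):
    time.sleep(0.1)  # simulated network delay
    return page, f'content of page {page}'


pages = range(1, 9)
results = {}
with ThreadPoolExecutor(max_workers=4) as pool:  # the with-block shuts the pool down on exit
    futures = [pool.submit(fake_fetch, page) for page in pages]
    for future in as_completed(futures):  # yields each future as soon as it finishes
        page, content = future.result()
        results[page] = content

print(sorted(results))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Compared with `add_done_callback`, `as_completed` keeps the result handling in the main thread's own loop, which can be easier to follow when the results are written to a single file.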
