Advanced usage of the requests module

Reposted from the original article, 2019-09-18 08:31. Tag: python crawler

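Fixing HTTPConnectionPool errors

- HTTPConnectionPool:
    - Causes:
        - 1. Too many requests in a short time got the IP banned
        - 2. The connections in the HTTP connection pool were exhausted
    - Fixes:
        - 1. Use a proxy
        - 2. Add 'Connection': 'close' to the headers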
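
The two fixes above can be combined in one request. The following is a minimal sketch, assuming the requests library is installed; the URL and proxy address are placeholders and do not come from the original post. The request is only prepared, not sent, so the snippet runs without network access.

```python
import requests

# Placeholder values for illustration only (assumptions, not from the original post)
URL = 'http://example.com/'
PROXIES = {'http': 'http://127.0.0.1:8888'}

headers = {
    'User-Agent': 'Mozilla/5.0',
    # Ask for the connection to be closed after the response,
    # so sockets are not kept open in the connection pool
    'Connection': 'close',
}

session = requests.Session()
# Build the request without sending it, to inspect the headers that would be sent
prepared = session.prepare_request(requests.Request('GET', URL, headers=headers))
print(prepared.headers['Connection'])

def fetch(url):
    """Send the request through the proxy with a timeout (needs network access)."""
    return session.get(url, headers=headers, proxies=PROXIES, timeout=5)
```

The per-request header overrides the session's default keep-alive value, which is why the prepared request reports close.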

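IP proxies

- Proxy: a proxy server accepts a request and forwards it on your behalf.
- Anonymity levels
    - Elite (high anonymity): the target server knows nothing, neither that a proxy is used nor your real IP
    - Anonymous: the target server knows you are using a proxy but does not know your real IP
    - Transparent: the target server knows you are using a proxy and also knows your real IP
- Types:
    - http
    - https
- Free proxy lists: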
    - 全網代理IP 	www.goubanjia.com 
    - 快代理     	https://www.kuaidaili.com/
    - 西祠代理   	https://www.xicidaili.com/nn/
    - 代理精靈   	http://http.zhiliandaili.cn/
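
One common way to tell the three anonymity levels listed above apart is to look at the headers the target server receives. The sketch below assumes the conventional Via and X-Forwarded-For headers; real proxies vary, and the function name and sample addresses are hypothetical.

```python
def classify_proxy(headers_seen, real_ip):
    """Guess a proxy's anonymity level from the headers the target server received."""
    # Normalise header names, since HTTP header names are case-insensitive
    seen = {k.lower(): v for k, v in headers_seen.items()}
    forwarded = seen.get('x-forwarded-for', '')
    if real_ip in [part.strip() for part in forwarded.split(',')]:
        return 'transparent'   # proxy revealed, real IP leaked
    if 'via' in seen or forwarded:
        return 'anonymous'     # proxy revealed, real IP hidden
    return 'elite'             # no sign of a proxy at all

print(classify_proxy({'Via': '1.1 proxy', 'X-Forwarded-For': '203.0.113.7'}, '203.0.113.7'))
print(classify_proxy({'Via': '1.1 proxy'}, '203.0.113.7'))
print(classify_proxy({'User-Agent': 'Mozilla/5.0'}, '203.0.113.7'))
```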

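Basic proxy usage

- Proxy server
  - Forwards requests
  - Pass the proxy's ip:port to the get and post methods through the proxies argument, e.g. proxies = {'http':'ip:port'}
  - Proxy pool (a list)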
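
requests picks the proxy by the scheme of the URL being requested, so an entry under 'http' is not used for https:// URLs. The sketch below uses a hypothetical helper, make_proxies, to build a mapping that covers both schemes; the address is a placeholder.

```python
def make_proxies(ip, port):
    """Build a requests-style proxies mapping that covers both URL schemes."""
    address = 'http://%s:%d' % (ip, port)
    # Keys are the scheme of the *target* URL; the value is the proxy used for it
    return {'http': address, 'https': address}

proxies = make_proxies('127.0.0.1', 8888)
print(proxies)
# Usage (needs network access and a working proxy):
#   requests.get('https://www.baidu.com/s?wd=ip', proxies=proxies, timeout=5)
```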

Using a proxy in crawler code

import requests

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

url = 'https://www.baidu.com/s?wd=ip'
page_text = requests.get(url,headers=headers,proxies={'https':'36.111.140.6:8080'}).text
with open('ip.html','w',encoding='utf-8') as fp:
    fp.write(page_text)

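Setting a proxy in the browser

Proxy pool

What a proxy pool is for

A proxy pool avoids the IP bans caused by crawling the same site too often in a short time. It works as follows: scrape free IPs from the major proxy sites,
deduplicate them and store them in Redis as a sorted set, check periodically whether each IP still works, adjust its priority according to your own scoring rules, and delete IPs whose score has dropped to zero
(invalid). The pool then exposes a proxy interface for crawlers to use.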
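
The scoring mechanism described above can be sketched without Redis. In this minimal sketch a plain dict stands in for the Redis sorted set, and the scoring rules (start at 10, jump to 100 on success, lose 1 on failure) are assumptions chosen for illustration, not the author's rules.

```python
import random

class ProxyPool:
    """In-memory stand-in for the Redis sorted set: proxy -> score."""
    INITIAL_SCORE = 10
    MAX_SCORE = 100

    def __init__(self):
        self.scores = {}

    def add(self, proxy):
        # setdefault deduplicates: an existing proxy keeps its current score
        self.scores.setdefault(proxy, self.INITIAL_SCORE)

    def report(self, proxy, ok):
        """Record the result of a validity check."""
        if proxy not in self.scores:
            return
        if ok:
            self.scores[proxy] = self.MAX_SCORE
        else:
            self.scores[proxy] -= 1
            if self.scores[proxy] <= 0:
                del self.scores[proxy]   # score reached zero: drop the proxy

    def get(self):
        """Return a random proxy among those with the highest score."""
        if not self.scores:
            raise LookupError('proxy pool is empty')
        best = max(self.scores.values())
        return random.choice([p for p, s in self.scores.items() if s == best])

pool = ProxyPool()
for p in ['1.1.1.1:80', '2.2.2.2:80', '1.1.1.1:80']:
    pool.add(p)
pool.report('2.2.2.2:80', ok=True)
print(len(pool.scores), pool.get())
```

A crawler would call get() before each request and report() after it, so that dead proxies drain out of the pool over time.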

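A simple proxy pool

# Proxy pool: a list
import random
import requests

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

# Each dict is a proxy IP found online
proxy_list = [
    {'https':'121.231.94.44:8888'},
    {'https':'131.231.94.44:8888'},
    {'https':'141.231.94.44:8888'}
]
# Target URL
url = 'https://www.baidu.com/s?wd=ip'

# proxies=random.choice(proxy_list) picks a random proxy from the pool
page_text = requests.get(url,headers=headers,proxies=random.choice(proxy_list)).text

with open('ip.html','w',encoding='utf-8') as fp:
    fp.write(page_text)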

Building a proxy pool

import random
import requests
from lxml import etree

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Connection':"close"
}

# Fetch a few proxy IPs from 代理精靈 (used below to crawl the proxy listing site)
ip_url = 'http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=4&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=2'
page_text = requests.get(ip_url,headers=headers).text
tree = etree.HTML(page_text)
# Each text node is expected to hold one ip:port entry; drop blank nodes
ip_list = [ip.strip() for ip in tree.xpath('//body//text()') if ip.strip()]

# Crawl 西祠代理
url = 'https://www.xicidaili.com/nn/%d'
proxy_list_http = []
proxy_list_https = []
for page in range(1,20):
    new_url = url % page
    ip_port = random.choice(ip_list)
    page_text = requests.get(new_url,headers=headers,proxies={'https':ip_port}).text
    tree = etree.HTML(page_text)
    # tbody must not appear in the XPath expression
    tr_list = tree.xpath('//*[@id="ip_list"]//tr')[1:]
    for tr in tr_list:
        ip = tr.xpath('./td[2]/text()')[0]
        port = tr.xpath('./td[3]/text()')[0]
        t_type = tr.xpath('./td[6]/text()')[0]
        ips = ip+':'+port
        # requests expects lowercase scheme keys: 'http' / 'https'
        dic = {t_type.lower(): ips}
        if t_type == 'HTTP':
            proxy_list_http.append(dic)
        else:
            proxy_list_https.append(dic)
print(len(proxy_list_http),len(proxy_list_https))


# Validation (working proxies could be persisted here)
for proxy in proxy_list_http:
    try:
        # An 'http' entry only applies to http:// URLs, so test against one
        response = requests.get('http://www.sogou.com',headers=headers,proxies=proxy,timeout=5)
    except requests.RequestException:
        continue
    if response.status_code == 200:  # status_code is an int, not a string
        print('Found a working proxy:', proxy)

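Handling cookies

Manual handling: put the cookie into the headers.

Automatic handling: use a session object. Create a session object and send requests with it just as you would with requests.
The difference is that any cookies produced while sending requests through the session are stored in the session object automatically.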
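
For the manual route, a cookie string copied from the browser can also be turned into a dict rather than pasted into the headers. The sketch below uses only the standard library; the helper name and the sample cookie values are made up for illustration.

```python
from http.cookies import SimpleCookie

def cookie_header_to_dict(cookie_header):
    """Parse a raw 'name=value; name2=value2' Cookie header into a dict."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}

cookies = cookie_header_to_dict('token=abc123; device_id=42')
print(cookies)
# Usage (needs network access): requests.get(url, headers=headers, cookies=cookies)
```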

Crawling the news feed on the 雪球 home page https://xueqiu.com/

The problem encountered while crawling

import requests

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20349203&count=15&category=-1'
page_text = requests.get(url=url,headers=headers).json()
print(page_text)

# Output
{'error_description': '遇到錯誤，請刷新頁面或者重新登錄帳號后再試', 'error_uri': '/v4/statuses/public_timeline_by_category.json', 'error_data': None, 'error_code': '400016'} 

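# Analysis shows that a normal browser request carries cookie data

Workaround: add the cookie manually (not recommended: some sites change their cookies, so a hard-coded value goes stale)

# Crawl the news data from 雪球 https://xueqiu.com/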
import requests

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Cookie':'aliyungf_tc=AQAAAAl2aA+kKgkAtxdwe3JmsY226Y+n; acw_tc=2760822915681668126047128e605abf3a5518432dc7f074b2c9cb26d0aa94; xq_a_token=75661393f1556aa7f900df4dc91059df49b83145; xq_r_token=29fe5e93ec0b24974bdd382ffb61d026d8350d7d; u=121568166816578; device_id=24700f9f1986800ab4fcc880530dd0ed',
}

url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20349203&count=15&category=-1'
page_text = requests.get(url=url,headers=headers).json()
print(page_text)

# Output
{'list': [{'id': 20349202, 'category': 0, 'data': '{"id":132614531,"title":"狼來了！今天，中囯電信行業打響第一槍！
流量費用要降價了！","description":"狼，終究來了！ 剛剛傳來大消息，中國工信部正式宣布：英國電信（BT）
獲得了在中國全國性經營通信的牌照。 隨后，英國電信也在第一時間證實這一消息！他們興高采烈地表示：
取得牌照，意味着英國電信在中國邁出重要的一步！ 是的，你沒有看錯：英國電信！這是英國最大的電信公司，
也是一家有着超過...","target":"/3583653389/132614531","reply_count":75,"retweet_count":7,"topic_title":"狼來了！
今天，中囯電信行業打響第一槍！流量費用要降價了！","topic_desc":"狼，終究來了！ 剛剛傳來大消息， 中國工信部正式宣布：英...}..... (rest omitted)

Getting cookies automatically (recommended; still works when the cookies change)

import requests

# Create a session object
session = requests.Session()
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
# Cookies set by this response are stored in the session and sent automatically with later requests
session.get('https://xueqiu.com',headers=headers)
url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20349203&count=15&category=-1'
page_text = session.get(url=url,headers=headers).json()
print(page_text)

# Output
{'list': [{'id': 20349202, 'category': 0, 'data': '{"id":132614531,"title":"狼來了！今天，中囯電信行業打響第一槍！
流量費用要降價了！","description":"狼，終究來了！ 剛剛傳來大消息，中國工信部正式宣布：英國電信（BT）
獲得了在中國全國性經營通信的牌照。 隨后，英國電信也在第一時間證實這一消息！他們興高采烈地表示：
取得牌照，意味着英國電信在中國邁出重要的一步！ 是的，你沒有看錯：英國電信！這是英國最大的電信公司，
也是一家有着超過...","target":"/3583653389/132614531","reply_count":75,"retweet_count":7,"topic_title":"狼來了！
今天，中囯電信行業打響第一槍！流量費用要降價了！","topic_desc":"狼，終究來了！ 剛剛傳來大消息， 中國工信部正式宣布：英...}...... (rest omitted)

Ways to add cookies

requests.Session() usually takes care of cookies on its own, but adding a cookie partway through a session ran into a few problems.

Method 1:
session = requests.Session()
session.cookies['cookie'] = 'cookie-value'
Effect: adds a cookie without clearing the existing ones
Drawback: cannot set path, domain or other attributes

Method 2:
session = requests.Session()
session.cookies.set('cookie-name', 'cookie-value', path='/', domain='.abc.com')
Effect: sets path, domain and other attributes.
Drawback: the author observed the existing cookies being cleared. Note that set() itself only replaces a cookie with the same name, domain and path, so this may have come from the surrounding code.

Method 3:
session = requests.Session()
requests.utils.add_dict_to_cookiejar(session.cookies, cookie_dict)
Effect: adds cookies without clearing the existing ones
Drawback: cannot set path, domain or other attributes

Method 4: (has an unresolved problem)
session = requests.Session()
c = requests.cookies.RequestsCookieJar()
c.set('cookie-name', 'cookie-value', path='/', domain='.abc.com')
session.cookies.update(c)
Effect: adds cookies and can also set path, domain and other attributes.

Method 5:
session = requests.Session()
session.get(url='http://www.xxx.com')
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'www.airchina.com.cn',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
}
cookie_str = 'cookie1=xxxx1;cookie=xxxx2'
headers['Cookie'] = cookie_str
session.get(url='http://www.xxx2.com',headers=headers)
Effect: according to the author, both the session's cookies and the Cookie header take effect, but only for requests sent with these headers
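
Methods 2 and 4 can be checked offline. The sketch below assumes a reasonably recent requests release; the cookie names, values and domain are placeholders. It adds cookies both ways and then lists what the session holds.

```python
import requests

session = requests.Session()
# A cookie that is already in the session (placeholder name and value)
session.cookies.set('existing', '1', domain='.abc.com', path='/')

# Method 2: set() with domain and path adds a cookie next to the existing one
session.cookies.set('cookie-name', 'cookie-value', domain='.abc.com', path='/')

# Method 4: build a separate jar and merge it into the session's jar
extra = requests.cookies.RequestsCookieJar()
extra.set('another', 'value', domain='.abc.com', path='/')
session.cookies.update(extra)

names = sorted(cookie.name for cookie in session.cookies)
print(names)
```

If the existing cookie is still listed after both calls, cookies that disappear in a real crawl are more likely being replaced by the server's Set-Cookie headers than cleared by set().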

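Recognizing captchas in a page

Recognize the captcha on this site: https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx

Solution

Recommended captcha recognition platforms
- 超級鷹: http://www.chaojiying.com/about.html (the one used here)
	- Register (with the "user center" identity)
	- Log in
	- Create a software entry: 899370
	- Download the sample code
- 雲打碼: http://www.yundama.com/

Implementation

Recognizing a web page captcha

# 超級鷹 client code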
import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose recognition result is being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()
    
# Crawler code
# Recognize the captcha on 古詩文網
from lxml import etree

def tranformImgData(imgPath,t_type):  # calls 超級鷹
    chaojiying = Chaojiying_Client('bobo3280948', 'bobo3284148', '899370')  # 超級鷹 username, password, software ID
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im, t_type)['pic_str']

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}

url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = requests.get(img_src,headers=headers).content
with open('./code.jpg','wb') as fp:
    fp.write(img_data)
yzm = tranformImgData('./code.jpg',1004)  # path of the saved captcha image, and the 超級鷹 code for this captcha type
print(yzm)
# Output: the captcha was recognized successfully
d145

Simulated login

import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose recognition result is being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()
    
    
from lxml import etree


# Recognize the captcha on 古詩文網
def tranformImgData(imgPath,t_type):  # calls 超級鷹
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im, t_type)['pic_str']

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}

# Simulated login
s = requests.Session()
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = s.get(url,headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = s.get(img_src,headers=headers).content
with open('./code.jpg','wb') as fp:
    fp.write(img_data)
    
# Extract the request parameters that change on every page load
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
    
code_text = tranformImgData('./code.jpg',1004)
login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    'from':'http://so.gushiwen.org/user/collect.aspx',
    'email': 'www.zhangbowudi@qq.com',
    'pwd': 'bobo328410948',
    'code': code_text,
    'denglu': '登錄',
}
page_text = s.post(url=login_url,headers=headers,data=data).text
with open('login.html','w',encoding='utf-8') as fp:
    fp.write(page_text)
    
# Request parameters that change dynamically are usually hidden in the front-end page source
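
When a form carries several such hidden parameters, they can be collected in one pass instead of one XPath query each. The sketch below uses only the standard library; the class name and the sample markup are made up for illustration, in the style of an ASP.NET login form.

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and (attrs.get('type') or '').lower() == 'hidden':
            name = attrs.get('name') or attrs.get('id')
            if name:
                self.fields[name] = attrs.get('value') or ''

# Made-up sample markup (an assumption, not the real login page)
sample_html = '''
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="C93BE1AE" />
  <input type="text" name="email" />
</form>
'''
parser = HiddenInputParser()
parser.feed(sample_html)
print(parser.fields)
```

The resulting dict can be merged into the POST data before adding the visible fields such as the account, password and captcha text.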

Using the multiprocessing.dummy Pool thread pool

Simulated requests

# Without a thread pool (simulated requests)
import time
from time import sleep
start = time.time()
urls = [
    'www.1.com',
    'www.2.com',
    'www.3.com',
]
def get_request(url):
    print('Requesting: %s'%url)
    sleep(2)
    print('Finished: %s'%url)
    
for url in urls:
    get_request(url)
print('Total time:',time.time()-start)

# Output
Requesting: www.1.com
Finished: www.1.com
Requesting: www.2.com
Finished: www.2.com
Requesting: www.3.com
Finished: www.3.com
Total time: 6.000494718551636
    
# With a thread pool (simulated requests)
import time
from time import sleep
from multiprocessing.dummy import Pool

start = time.time()
urls = [
    'www.1.com',
    'www.2.com',
    'www.3.com',
]

def get_request(url):
    print('Requesting: %s' % url)
    sleep(2)
    print('Finished: %s' % url)
    
pool = Pool(3)
pool.map(get_request, urls)
print('Total time:', time.time() - start)

# Output
Requesting: www.1.com
Requesting: www.2.com
Requesting: www.3.com
Finished: www.2.com
Finished: www.3.com
Finished: www.1.com
Total time: 2.037109613418579
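
In a real crawler the worker function usually returns data rather than printing it. The sketch below is a small variation on the example above, with a shorter sleep standing in for the request and a made-up return value. It shows that pool.map hands the results back in the same order as the input list, even though the threads finish in arbitrary order.

```python
import time
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

def get_request(url):
    """Stand-in for a blocking request: sleep briefly, then return a fake body."""
    time.sleep(0.2)
    return 'body of %s' % url

urls = ['www.1.com', 'www.2.com', 'www.3.com']

start = time.time()
# The with-block shuts the worker threads down once the work is finished
with Pool(3) as pool:
    results = pool.map(get_request, urls)   # results come back in input order
elapsed = time.time() - start

print(results)
print('elapsed: %.2fs' % elapsed)
```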

Testing against a simple Flask server

# server
from flask import Flask
from time import sleep
app = Flask(__name__)
@app.route('/index')
def index():
    sleep(2)
    return 'hello'

if __name__ == '__main__':
    app.run()
    
# Crawler code
import time
import requests
from multiprocessing.dummy import Pool
start = time.time()
urls = [
    'http://localhost:5000/index',
    'http://localhost:5000/index',
    'http://localhost:5000/index',
]
def get_request(url):
    page_text = requests.get(url).text
    print(page_text)

pool = Pool(3)
pool.map(get_request, urls)
print('Total time:', time.time() - start)

# Output
hello
hello
hello
Total time: 3.0322463512420654

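Single thread + multi-task asynchronous coroutines

- Coroutine
  - If a function (a "special function") is defined with the async keyword, calling it returns a coroutine object, and the statements in its body are not executed immediately
- Task object
  - A task object is a further wrapper around a coroutine object. Task object == higher-level coroutine object == special function
  - A task object must be registered with an event loop object
  - Binding a callback to a task object: used for data parsing in a crawler
- Event loop
  - Think of it as a container that holds task objects.
  - Once the event loop object is started, it runs the task objects stored in it asynchronously.
- aiohttp: a module that supports asynchronous network requests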
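
The concepts in the list above fit into a few lines. The sketch below uses asyncio.run, available since Python 3.7, to create and start the event loop; the function names and the returned string are made up for illustration.

```python
import asyncio

async def fetch(name):
    """A 'special function': calling it only creates a coroutine object."""
    await asyncio.sleep(0.1)
    return 'data from %s' % name

collected = []

def on_done(task):
    # Callback bound to the task; task.result() is the coroutine's return value
    collected.append(task.result())

async def main():
    coro = fetch('page-1')              # nothing has run yet
    task = asyncio.ensure_future(coro)  # wrap the coroutine in a task object
    task.add_done_callback(on_done)
    await task                          # let the event loop run the task
    await asyncio.sleep(0)              # yield once so pending callbacks have run
    return task.result()

result = asyncio.run(main())
print(result, collected)
```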

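A quick look at asyncio coroutine functions

import asyncio
def callback(task):  # callback function for the task object
    print('i am callback and ',task.result())  # task.result() holds the special function's return value

async def test():  # special function
    print('i am test()')
    return 'bobo'

c = test()  # c is a coroutine object
# Wrap it in a task object
task = asyncio.ensure_future(c)
# Bind the callback function
task.add_done_callback(callback)
# Create an event loop object
loop = asyncio.get_event_loop()
# Register the task object with the event loop and run it
loop.run_until_complete(task)

# Output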
i am test()
i am callback and  bobo

Coroutines + multiple tasks (simulated requests)

import time
import asyncio

start = time.time()
# Code from modules that do not support async must not appear inside the special function
async def get_request(url):
    await asyncio.sleep(2)
    print('Request succeeded:', url)


urls = [
    'www.1.com',
    'www.2.com'
]
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
# Note: suspension has to be handled manually, here by wrapping the task list in asyncio.wait
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)

# Output
Request succeeded: www.1.com
Request succeeded: www.2.com
2.002183198928833
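
On Python 3.7 and later the same pattern is often written with asyncio.run and asyncio.gather, which also hands back each coroutine's return value. The sketch below is such a variation, not the author's code; the shorter sleep and the returned strings are made up.

```python
import asyncio
import time

async def get_request(url):
    await asyncio.sleep(0.2)      # non-blocking stand-in for network latency
    return 'ok: %s' % url

async def main(urls):
    # gather runs all coroutines concurrently and returns results in input order
    return await asyncio.gather(*(get_request(u) for u in urls))

start = time.time()
results = asyncio.run(main(['www.1.com', 'www.2.com']))
elapsed = time.time() - start
print(results)
print('elapsed: %.2fs' % elapsed)
```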

With the requests module, the calls turn out not to be asynchronous

# server
from flask import Flask
from time import sleep
app = Flask(__name__)
@app.route('/index')
def index():
    sleep(2)
    return 'hello'
@app.route('/index1')
def index1():
    sleep(2)
    return 'hello1'
if __name__ == '__main__':
    app.run()
 
# Crawler code
import requests
import time
import asyncio
s = time.time()
urls = [
    'http://127.0.0.1:5000/index',
    'http://127.0.0.1:5000/index1'  # must match a route defined on the server
]
async def get_request(url):
    page_text = requests.get(url).text  # blocking call: no other task can run in the meantime
    return page_text

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time()-s)    

# Output: not asynchronous
4.021323204040527

# requests does not support async, so aiohttp is needed
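
If a blocking library has to stay, one common workaround (not used in the original post) is to run each blocking call on a worker thread through loop.run_in_executor, so the event loop itself is never blocked. In the sketch below time.sleep stands in for requests.get, and the function names are made up.

```python
import asyncio
import time

def blocking_fetch(url):
    """Stand-in for requests.get(url).text: blocks the calling thread."""
    time.sleep(0.2)
    return 'body of %s' % url

async def get_request(url):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the event loop stays free
    return await loop.run_in_executor(None, blocking_fetch, url)

async def main(urls):
    return await asyncio.gather(*(get_request(u) for u in urls))

start = time.time()
results = asyncio.run(main(['/index', '/index1']))
elapsed = time.time() - start
print(results)
print('elapsed: %.2fs' % elapsed)
```

This is effectively a thread pool driven from asyncio, so it carries the cost of threads; a library with native async support avoids that.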

With the aiohttp module, the requests run asynchronously

# server
from flask import Flask
from time import sleep
app = Flask(__name__)
@app.route('/index')
def index():
    sleep(2)
    return 'index'
@app.route('/home')
def index1():
    sleep(2)
    return 'home'
if __name__ == '__main__':
    app.run()
    
# Crawler code
import aiohttp
import time
import asyncio

s = time.time()
urls = [
    'http://127.0.0.1:5000/index',
    'http://127.0.0.1:5000/home'
]


async def get_request(url):
    # Every with must be prefixed by async
    async with aiohttp.ClientSession() as session:
        # session.get() accepts url, headers, params and proxy
        async with session.get(url=url) as response:
            # Put await in front of blocking operations; text() is a method, so the parentheses are required
            page_text = await response.text()
            print(page_text)
    return page_text


tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - s)

# Output
index
home
2.016155242919922

Example 2

######################## test.html ########################
<!DOCTYPE html>
<html lang="zh-CN">
<head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <!-- The 3 meta tags above *must* come first; any other content *must* come after them! -->
  <title>Bootstrap 101 Template</title>
  <!-- Bootstrap -->
  <link href="bootstrap-3.3.7-dist/css/bootstrap.min.css" rel="stylesheet">
</head>
<body>
<h1>Hello, world!</h1>
<ul>
  <li>i am hero!!!</li>
  <li>i am superMan!!!</li>
  <li>i am Spider!!!</li>
</ul>
</body>
</html>

######################## server ########################
import time
from flask import Flask,render_template

app = Flask(__name__)

@app.route('/bobo')
def index_bobo():
    time.sleep(2)
    return render_template('test.html')

@app.route('/jay')
def index_jay():
    time.sleep(2)
    return render_template('test.html')

@app.route('/tom')
def index_tom():
    time.sleep(2)
    return render_template('test.html')

if __name__ == '__main__':
    app.run(threaded=True)
    
######################## crawler code ########################
import time
import aiohttp
import asyncio
from lxml import etree

start = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom'
]

# Special function: sends the request and captures the response data
# Detail: put async before every with, and await before every blocking operation
async def get_request(url):
    async with aiohttp.ClientSession() as s:
        # s.get(url, headers=..., proxy="http://ip:port", params=...)
        async with s.get(url) as response:
            page_text = await response.text()  # read() returns bytes instead
            return page_text

# Callback function
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    parse_data = tree.xpath('//li/text()')
    print(parse_data)

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)

# Output: the requests ran asynchronously
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
['i am hero!!!', 'i am superMan!!!', 'i am Spider!!!']
2.094982147216797

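Summary

- Single thread + multi-task asynchronous coroutines
  - Coroutine
    - If a function definition is marked with async, calling that function returns a coroutine object.
  - Task object:
    - A further wrapper around a coroutine object
  - Binding a callback
    - task.add_done_callback(func):func(task):task.result()
  - Event loop object
    - The event loop object holds task objects. Once started, it runs every task object it holds asynchronously. (Suspending the task objects is done manually.)
  - async, await
  - Note: code from modules that do not support async must not appear inside a special function, otherwise the whole asynchronous effect is lost!!!
- aiohttp: a module that supports asynchronous requests

Author: 郭楷豐

Source: https://www.cnblogs.com/guokaifeng/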
