Python Web Scraping: The requests Library in Detail, Cookie Handling, and a Practical Example

This article is a repost. 2019-12-06 14:49. Tags: Python, web scraping, learning

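requests is a third-party HTTP library built on top of urllib3. Its API is more convenient than that of the standard library's urllib, which makes it well suited to writing scrapers.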

Install: pip install requests

A simple example of fetching the Baidu home page:
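The example itself is missing from this copy of the article. A minimal sketch of what such a fetch might look like follows; the helper name fetch_homepage and the 10-second timeout are assumptions, not taken from the original.

```python
import requests


def fetch_homepage(url="https://www.baidu.com", timeout=10):
    """Fetch a page and return the Response (illustrative helper, not from the article)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError on a 4xx/5xx status instead of continuing silently.
    response.raise_for_status()
    return response
```

Calling response = fetch_homepage() and then printing response.status_code, response.url, and response.text[:200] gives a quick look at the status, the final URL, and the start of the page.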

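The difference between response.text and response.content:

response.text is the body decoded to a str, using the encoding that requests infers from the response headers. When that inference is wrong, the text comes out garbled.
response.content is the raw, undecoded body as bytes. It works for text, images, and audio alike. For text, decode it explicitly, for example with response.content.decode('utf-8').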
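The garbling can be reproduced offline. The sketch below uses a hand-made byte string as a stand-in for response.content (an assumption for the demo) and requests.utils.get_encoding_from_headers to show which encoding requests would infer from a Content-Type header.

```python
import requests

# Bytes as a server might send them; this stands in for response.content.
raw = "百度一下".encode("utf-8")

# For a text/* body with no charset in Content-Type, requests falls back to ISO-8859-1.
guessed = requests.utils.get_encoding_from_headers({"content-type": "text/html"})
# When the server declares a charset, that charset is used.
declared = requests.utils.get_encoding_from_headers(
    {"content-type": "text/html; charset=utf-8"}
)

garbled = raw.decode(guessed)   # roughly what response.text looks like after a wrong guess
correct = raw.decode("utf-8")   # explicit decoding of the raw bytes

print(guessed, declared)
print(garbled)
print(correct)
```

On a real response, assigning response.encoding = 'utf-8' before reading response.text has the same effect as decoding response.content by hand.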

Request methods supported by the requests library:

import requests

requests.get("http://xxxx.com/")
requests.post("http://xxxx.com/post", data = {'key':'value'})
requests.put("http://xxxx.com/put", data = {'key':'value'})
requests.delete("http://xxxx.com/delete")
requests.head("http://xxxx.com/get")
requests.options("http://xxxx.com/get")

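Sending a GET request with parameters:

Pass a dict as the params argument of get. requests URL-encodes the parameters and appends them to the URL as a query string.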

import requests

params = {
    "wd": "python", "pn": 10,
}

response = requests.get('https://www.baidu.com/s', params=params)
print(response.url)
print(response.text)
# Baidu applies anti-scraping checks, so a real request also needs headers
# such as a browser User-Agent.
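To see the URL that requests builds without sending anything over the network, the request can be prepared by hand. This is only a sketch for inspection; requests.get performs the same URL construction internally.

```python
import requests

params = {"wd": "python", "pn": 10}

# Build the request without sending it; prepare() encodes params into the query string.
prepared = requests.Request("GET", "https://www.baidu.com/s", params=params).prepare()
print(prepared.url)  # https://www.baidu.com/s?wd=python&pn=10
```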

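Sending a POST request with data:

Pass the form fields as the data argument of post. raise_for_status() raises requests.exceptions.HTTPError when the response status is 4xx or 5xx, and does nothing otherwise.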

import requests


post_data = {'username': 'value1', 'password': 'value2'}

response = requests.post("http://xxx.com/login/", data=post_data)
response.raise_for_status()
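The behavior of raise_for_status() and the encoding of the form data can both be checked offline. In the sketch below, a hand-built Response with status 404 stands in for a real server reply (an assumption for the demo), and the POST request is prepared but never sent.

```python
import requests

# A hand-built Response stands in for a real server reply.
response = requests.Response()
response.status_code = 404

try:
    response.raise_for_status()
    outcome = "ok"
except requests.exceptions.HTTPError as exc:
    outcome = "failed: {}".format(exc)
print(outcome)

# Form fields passed via data= are URL-encoded into the request body.
prepared = requests.Request(
    "POST",
    "http://xxx.com/login/",
    data={"username": "value1", "password": "value2"},
).prepare()
print(prepared.body)                     # username=value1&password=value2
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```

In a scraper, wrapping raise_for_status() in try/except like this lets a failed login be logged or retried rather than parsed as if it had succeeded.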

An example of posting a file:

import requests

url = 'http://httpbin.org/post'
# Open the file in binary mode; the with block closes it once the upload finishes.
with open('report.xls', 'rb') as fh:
    r = requests.post(url, files={'file': fh})

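Setting and inspecting headers:

Many sites have anti-scraping measures. A request that does not carry browser-like headers, in particular a User-Agent, is likely to be rejected.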

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/"
                  "537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

response1 = requests.get("https://www.baidu.com", headers=headers)
response2 = requests.post("https://www.xxxx.com", data={"key": "value"},
                          headers=headers)

print(response1.request.headers)          # headers that were sent with the request
print(response1.headers)                  # headers returned by the server
print(response1.headers['Content-Type'])
print(response2.text)

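Setting a proxy:

Some sites limit the number of requests a single IP may make per unit of time. Routing requests through a proxy is one way to work around this restriction.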

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)
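When several proxies are available, requests can be spread across them. The following is a minimal round-robin sketch; the proxy addresses, the PROXY_POOL name, and the next_proxies helper are hypothetical and not part of the original article.

```python
import itertools

# Hypothetical proxy endpoints; replace them with proxies you actually control.
PROXY_POOL = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]
_rotation = itertools.cycle(PROXY_POOL)


def next_proxies():
    """Return a proxies dict that uses the next proxy in round-robin order."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}


# Each result would be passed as requests.get(url, proxies=next_proxies()).
for _ in range(4):
    print(next_proxies())
```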

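Getting and adding cookies:

Sometimes the page we want is only accessible after logging in. In that case we need cookies to simulate the login and keep the session alive.

When a user sends a first request, the server typically generates a small piece of information and includes it in the response. When this information is stored on the client (in the browser or on disk), it is called a cookie; when it is stored on the server, it is called a session. On later requests to other pages of the same site, the client sends the cookie along automatically, so the server knows that the user has already logged in or visited before.

Print response.cookies to inspect the cookies and see whether the server set any on the first request.

Ways to add cookies when sending a request:

Set the cookies parameter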

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/"
                  "537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
cookies = {"cookie_name": "cookie_value"}
response = requests.get("https://www.baidu.com", headers=headers, cookies=cookies)

Alternatively, instantiate a RequestsCookieJar, set the values on it, and pass it as the cookies argument of get or post.
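The article gives no code for the cookie jar approach. A minimal sketch follows, with placeholder cookie names, values, and domains; the request is prepared rather than sent, so the resulting Cookie header can be inspected offline.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
# The cookie names, values, domain, and path below are placeholders.
jar.set("session_id", "abc123", domain="httpbin.org", path="/")
jar.set("theme", "dark", domain="httpbin.org", path="/")
# A cookie scoped to a different domain, to show that it is not sent.
jar.set("other", "x", domain="example.com", path="/")

# Preparing the request shows the Cookie header that
# requests.get(url, cookies=jar) would send to this URL.
prepared = requests.Request("GET", "https://httpbin.org/cookies", cookies=jar).prepare()
print(prepared.headers["Cookie"])
```

Unlike a plain dict, the jar lets each cookie be scoped to a domain and path, so only the matching cookies go out with a given request.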

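Maintaining a session:

A session differs from a cookie in that session data is generally stored on the server. On the client side, a requests Session object persists certain parameters across requests and keeps cookies across all requests made from the same Session instance.

To keep a session going, the best approach is to create a Session object first and open URLs through it, rather than calling requests.get directly.

Every time we open a URL with this Session object, the request carries the cookies set by earlier responses, which keeps the session going.

Example:

Scrape the first 20 Baidu search results. (The output is still imperfect: the result links go through many redirects, so the pages found are not always the main result entries.)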
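How a Session carries state from one request to the next can be sketched without touching the network. The cookie below is set by hand as a stand-in for one a server would set at login, and the User-Agent string is a placeholder.

```python
import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it.
session.headers.update({"User-Agent": "Mozilla/5.0 (example)"})
# Stand-in for a cookie the server would set at login (placeholder value).
session.cookies.set("sessionid", "abc123")

# prepare_request() merges the session's headers and cookies into the request,
# as session.get() does before sending it.
request = requests.Request("GET", "https://www.baidu.com/s", params={"wd": "python"})
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])  # Mozilla/5.0 (example)
print(prepared.headers["Cookie"])      # sessionid=abc123
```

In real use, session.post(login_url, data=...) followed by session.get(protected_url) relies on the same merging: cookies from the login response are stored in session.cookies and sent again automatically.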

# coding: utf-8
'''
Scrape the titles and links of the first 20 Baidu search results
'''
import re

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}


def main(keyword):
    file_name = "{}.txt".format(keyword)
    seen = set()  # final URLs already written, used to skip duplicates
    with open(file_name, 'w', encoding='utf-8') as f:
        for pn in range(0, 20, 10):  # pn is the result offset: 0, then 10
            params = {'wd': keyword, 'pn': pn}
            response = requests.get("https://www.baidu.com/s", params=params,
                                    headers=headers, timeout=10)
            soup = bs(response.content, 'html.parser')
            # Baidu wraps each result in a redirect link (http or https)
            links = soup.find_all(name='a', attrs={"href": re.compile(r'baidu\.com/link\?url=')})
            for link in links:
                href = link.get('href')
                print(href)
                try:
                    page = requests.get(href, headers=headers, timeout=10)
                except requests.RequestException:
                    continue  # skip results that cannot be fetched
                if page.url in seen:
                    continue
                seen.add(page.url)
                soup1 = bs(page.content, 'html.parser')
                # Some pages have no <title>; fall back to an empty string
                title = soup1.title.string if soup1.title and soup1.title.string else ''
                f.write(title.strip() + '\n')
                f.write(page.url + '\n')


if __name__ == '__main__':
    keyword = 'Django'
    main(keyword)
    print("Download complete")
