A Detailed Guide to the requests Module for Python Web Scraping


The requests module

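requests lets you simulate the requests a browser sends. Compared with urllib, which was used earlier, its API is far more convenient (in essence it is a wrapper around urllib3).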

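Note: after requests downloads a page, it does not execute any JavaScript in it. For JS-driven content you have to analyze the target site yourself and issue the additional requests by hand.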

Official documentation: http://cn.python-requests.org/zh_CN/latest/

Installation: pip3 install requests

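Request methods in the requests module

The source is organized as follows:

# The convenience methods (requests.get, requests.post, and so on) are all built on top of this function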

requests.request(method, url, **kwargs)

The most commonly used methods are post and get. In essence, post and get are thin wrappers around request:

>>> r = requests.get('https://api.github.com/events')
# equivalent to requests.request(method='get', url='https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
# equivalent to requests.request(method='post', url='http://httpbin.org/post', data={'key': 'value'})
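To make the wrapper relationship concrete, here is a minimal, library-free sketch of the pattern. The function bodies are simplified stand-ins written for illustration, not the actual requests source: the stub `request` only records its arguments instead of sending anything.

```python
# Simplified stand-in for requests.request: it records its arguments
# instead of performing any network I/O (illustration only).
def request(method, url, **kwargs):
    return {"method": method.upper(), "url": url, **kwargs}


def get(url, params=None, **kwargs):
    # get() only fixes the method and forwards everything else
    return request("get", url, params=params, **kwargs)


def post(url, data=None, json=None, **kwargs):
    # post() does the same for the POST method
    return request("post", url, data=data, json=json, **kwargs)


via_helper = post("http://httpbin.org/post", data={"key": "value"})
via_request = request("post", "http://httpbin.org/post",
                      data={"key": "value"}, json=None)
print(via_helper == via_request)  # True
```

Both calls end up in the same function with the same arguments, which is why everything said about request() below applies to get() and post() as well.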

The requests.request method in detail

Source code of request()

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
            the server's TLS certificate, or a string, in which case it must be a path
            to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

The parameters listed in the source above are covered one by one below.

method and url

These specify the request method and the request URL.

requests.request(method='get', url='http://127.0.0.1:8000/test/')
requests.request(method='post', url='http://127.0.0.1:8000/test/')

params

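requests offers three ways to carry parameters with a request: data, json and params.

params is used with GET requests; data and json are used with POST requests.

params accepts:

- a dictionary
- a string
- bytes (the content must stay within the ASCII range)

Dictionaries and strings are encoded automatically and appended to the URL, as shown here: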

import requests
wd='egon老师'  # simplified-Chinese keyword, matching the percent-encoded output shown at the end
pn=1

response=requests.get('https://www.baidu.com/s',
                      params={
                          'wd':wd,
                          'pn':pn
                      },
                      headers={
                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                      })
print(response.url)
# Output: https://www.baidu.com/s?wd=egon%E8%80%81%E5%B8%88&pn=1
# The URL has been percent-encoded automatically

The code above is equivalent to the following; under the hood, params are encoded with urlencode.

import requests
from urllib.parse import urlencode
wd='egon老师'
encode_res=urlencode({'k':wd},encoding='utf-8')
keyword=encode_res.split('=')[1]
print(keyword)
# then splice the encoded keyword into the URL
url='https://www.baidu.com/s?wd=%s&pn=1' %keyword

response=requests.get(url,
                      headers={
                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                      })
print(response.url)
# Output: https://www.baidu.com/s?wd=egon%E8%80%81%E5%B8%88&pn=1
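The encoding step can also be checked without touching the network. The sketch below builds a request object and prepares it, which is enough to see the final URL; nothing is sent, and the keyword is the same sample value used above.

```python
import requests

# Build the request without sending it, then inspect the final URL.
prepared = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': 'egon老师', 'pn': 1},
).prepare()

print(prepared.url)
# https://www.baidu.com/s?wd=egon%E8%80%81%E5%B8%88&pn=1
```

Preparing a request this way is handy when debugging a scraper, because it shows exactly what would go on the wire before any traffic reaches the target site.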

One more point: when params is given as bytes, it must not contain any non-ASCII characters. The following call is therefore wrong:

import requests

# re = requests.request(method='get',
#                  url='http://127.0.0.1:8000/test/',
#                  params=bytes("k1=v1&k2=水電費&k3=v3&k3=vv3", encoding='utf8'))
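A small standard-library sketch of why the bytes form is fragile, and one possible workaround. It assumes the restriction comes from the bytes being treated as ASCII text when the URL is assembled; the key/value pairs are the same sample data as in the commented-out call.

```python
from urllib.parse import urlencode

pairs = [('k1', 'v1'), ('k2', '水電費'), ('k3', 'v3'), ('k3', 'vv3')]

# Raw UTF-8 bytes contain non-ASCII values, so an ASCII decode fails
raw = 'k1=v1&k2=水電費&k3=v3&k3=vv3'.encode('utf8')
try:
    raw.decode('ascii')
    ascii_safe = True
except UnicodeDecodeError:
    ascii_safe = False
print(ascii_safe)  # False

# Percent-encoding first yields a query string that is pure ASCII
query_bytes = urlencode(pairs).encode('ascii')
print(query_bytes)
```

In practice it is usually simpler to pass a dictionary or a list of tuples as params and let the library do the percent-encoding.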

data

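requests offers three ways to carry parameters with a request: data, json and params. params is used with GET requests; data and json are used with POST requests.

data accepts a dictionary, a string, bytes, or a file object. The difference between data and json lies in the request body: with data the body is form-encoded, for example name=alex&age=18, whereas with json the body is a JSON string, for example '{"k1": "v1", "k2": "水電費"}'.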

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data={'k1': 'v1', 'k2': '水電費'})

# A string body is sent as-is; requests does not set a Content-Type for it
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data="k1=v1&k2=v2&k3=v3&k3=v4"
                 )

# Same string body, with the form Content-Type declared explicitly
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data="k1=v1&k2=v2&k3=v3&k3=v4",
                 headers={'Content-Type': 'application/x-www-form-urlencoded'}
                 )

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data=open('data_file.py', mode='r', encoding='utf-8'),  # file content: k1=v1&k2=v2&k3=v3&k3=v4
                 headers={'Content-Type': 'application/x-www-form-urlencoded'}
                 )

json

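The value passed as json is serialized into a string with json.dumps(...)

and sent in the request body, with the header Content-Type: application/json

Telltale sign: payload (the body is a raw JSON payload rather than form fields)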

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 json={'k1': 'v1', 'k2': '水電費'})
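The difference between data and json is easy to see by preparing both requests and inspecting them offline. This is a minimal sketch; the URL is the same local placeholder used throughout, and nothing is actually sent.

```python
import json
import requests

url = 'http://127.0.0.1:8000/test/'  # placeholder URL; nothing is sent

form_req = requests.Request('POST', url, data={'k1': 'v1', 'k2': 'v2'}).prepare()
json_req = requests.Request('POST', url, json={'k1': 'v1', 'k2': 'v2'}).prepare()

print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # k1=v1&k2=v2
print(json_req.headers['Content-Type'])  # application/json
print(json.loads(json_req.body))         # {'k1': 'v1', 'k2': 'v2'}
```

The JSON body is parsed back with json.loads here because its exact type (str or bytes) has varied between requests versions.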

headers

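Sends custom request headers to the server. Headers passed explicitly take precedence over the ones requests would set itself: in the example here, the explicit Content-Type replaces the application/json that json= would otherwise set.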

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 json={'k1': 'v1', 'k2': '水電費'},
                 headers={'Content-Type': 'application/x-www-form-urlencoded'}
                 )

cookies

# Send cookies to the server
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data={'k1': 'v1', 'k2': 'v2'},
                 cookies={'cook1': 'value1'},
                 )
# A CookieJar can be used as well (the dictionary form is a wrapper on top of it)
from http.cookiejar import CookieJar
from http.cookiejar import Cookie

obj = CookieJar()
obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                      discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                      port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
               )
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data={'k1': 'v1', 'k2': 'v2'},
                 cookies=obj)
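What the cookies argument does on the wire can be seen by preparing the request and reading back the resulting header. A minimal offline sketch, reusing the sample cookie from above:

```python
import requests

prepared = requests.Request(
    'POST',
    'http://127.0.0.1:8000/test/',  # placeholder URL; nothing is sent
    data={'k1': 'v1'},
    cookies={'cook1': 'value1'},
).prepare()

print(prepared.headers['Cookie'])  # cook1=value1
```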

files

Send a file:
file_dict = {
    'f1': open('readme', 'rb')
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)

Send a file with a custom filename:
file_dict = {
    'f1': ('test.txt', open('readme', 'rb'))
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)

Send string content as a file with a custom filename:
file_dict = {
    'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)

Send string content with a custom filename, a content type and extra per-file headers:
file_dict = {
    'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)
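To see what a files upload turns into, the request can be prepared and inspected without sending it. The sketch below uses an in-memory string instead of a real file; the filename and content are illustrative values.

```python
import requests

prepared = requests.Request(
    'POST',
    'http://127.0.0.1:8000/test/',  # placeholder URL; nothing is sent
    files={'f1': ('test.txt', 'hello world', 'text/plain')},
).prepare()

content_type = prepared.headers['Content-Type']
print(content_type.startswith('multipart/form-data; boundary='))  # True
print(b'filename="test.txt"' in prepared.body)                    # True
print(b'hello world' in prepared.body)                            # True
```

The body is multipart-encoded, and the custom filename travels in the part's Content-Disposition header rather than in the URL or the form fields.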

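auth

Handles the authentication built into the browser (HTTP authentication).

Authentication setup: on some sites, a dialog pops up at login asking for a username and password (it looks much like a JavaScript alert). At that point the HTML cannot be fetched. Under the hood, though, the credentials are simply assembled into a request header and sent: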

r.headers['Authorization'] = _basic_auth_str(self.username, self.password)

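Most sites do not use this default encoding scheme and implement their own instead. In that case we have to write a function similar to _basic_auth_str that follows the site's scheme,
then add the resulting string to the request headers: r.headers['Authorization'] =func('.....')

HTTPBasicAuth simply sends the server a request carrying an Authorization:................. header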

HTTPBasicAuth
import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
print(ret.text)
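The two ideas above, the Basic header being just an encoded string and a site-specific scheme needing its own function, can be sketched as follows. `basic_auth_value` and `TokenAuth` are hypothetical names introduced here, and the "Token" scheme is an invented example rather than any particular site's method; the requests are only prepared, never sent.

```python
import base64
import requests
from requests.auth import AuthBase


def basic_auth_value(username, password):
    # Same idea as HTTP Basic auth: base64 of "username:password"
    token = base64.b64encode(f'{username}:{password}'.encode('latin1'))
    return 'Basic ' + token.decode('ascii')


class TokenAuth(AuthBase):
    """Hypothetical custom scheme: sends 'Token <value>' in Authorization."""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        # r is the prepared request; add the header and hand it back
        r.headers['Authorization'] = 'Token ' + self.token
        return r


url = 'http://127.0.0.1:8000/test/'  # placeholder URL; nothing is sent
builtin = requests.Request('GET', url, auth=('user', 'pass')).prepare()
custom = requests.Request('GET', url, auth=TokenAuth('abc123')).prepare()

print(builtin.headers['Authorization'] == basic_auth_value('user', 'pass'))  # True
print(custom.headers['Authorization'])  # Token abc123
```

Subclassing AuthBase keeps the header logic in one place, so the same object can be passed as auth= to any request or attached to a session.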

Other ways to use auth

# ret = requests.get('http://192.168.1.1',
# auth=HTTPBasicAuth('admin', 'admin'))
# ret.encoding = 'gbk'
# print(ret.text)

# ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
# print(ret)

timeout

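Two forms of timeout: float or tuple
timeout=0.1  # a single value applies to both the connect timeout and the read timeout
timeout=(0.1, 0.2)  # 0.1 is the connect timeout, 0.2 is the timeout for receiving data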

import requests
response=requests.get('https://www.baidu.com',
                      timeout=0.0001)  # so short that this will almost certainly raise a timeout exception
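Timeouts surface as exceptions, so scraping code normally wraps the call. The sketch below is self-contained: instead of a real site it uses a local socket (an assumption made purely so the example is reproducible) that accepts connections but never answers, which makes the read timeout fire.

```python
import socket
import requests

# A local socket that accepts TCP connections (via the listen backlog)
# but never sends a response, so the read timeout is what fires.
server = socket.socket()
server.bind(('127.0.0.1', 0))
server.listen(1)
port = server.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local test

try:
    session.get(f'http://127.0.0.1:{port}/', timeout=(1, 0.2))
    outcome = 'ok'
except requests.exceptions.ConnectTimeout:
    outcome = 'connect timeout'
except requests.exceptions.ReadTimeout:
    outcome = 'read timeout'
finally:
    session.close()
    server.close()

print(outcome)  # read timeout
```

Both exception classes derive from requests.exceptions.Timeout, so a single except clause on that base class is enough when the distinction does not matter.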

redirects

ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
print(ret.text)

proxies

Proxy settings

# Choose the proxy address according to the protocol of the outgoing request
proxies = {
    "http": "61.172.249.96:80",
    "https": "http://61.185.219.126:3128",
}

# Choose the proxy address according to the destination host of the request

proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
print(ret.headers)
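How the two key styles interact can be illustrated with `select_proxy` from requests.utils. It is a helper function rather than a headline API, so treat this as a sketch of the lookup order, which may differ between versions; the proxy addresses are the sample values from above, with an explicit scheme added to the first one.

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://61.172.249.96:80',               # per-protocol entry
    'http://10.20.1.128': 'http://10.10.1.10:5323',  # per-host entry
}

# A per-host key is more specific and wins over the per-protocol key
print(select_proxy('http://10.20.1.128/page', proxies))  # http://10.10.1.10:5323
print(select_proxy('http://example.com/', proxies))      # http://61.172.249.96:80
print(select_proxy('https://example.com/', proxies))     # None
```

The last call returns None because the dictionary has no entry for https, so such a request would not be proxied by these settings.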

from requests.auth import HTTPProxyAuth

proxyDict = {
    'http': '77.75.105.165',
    'https': '77.75.105.165'
}
auth = HTTPProxyAuth('username', 'mypassword')

r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
print(r.text)

# SOCKS proxies are supported as well; install with: pip install requests[socks]
import requests
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
response=requests.get('https://www.12306.cn',
                      proxies=proxies)

print(response.status_code)

stream

ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
print(ret.content)
ret.close()

# from contextlib import closing
# with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
#     # Process the response here.
#     for i in r.iter_content():
#         print(i)
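With stream=True the body is only fetched as it is consumed, which is mainly useful for large downloads. The sketch below is a self-contained illustration: the tiny local HTTP server and its 10000-byte payload are assumptions that stand in for a real download target.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
import requests

PAYLOAD = b'x' * 10000  # stand-in for a large download


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

received = bytearray()
chunk_count = 0
with requests.Session() as session:
    session.trust_env = False  # skip proxy environment variables for the loopback call
    # stream=True defers the body; the with block releases the connection
    with session.get(url, stream=True, timeout=5) as r:
        for chunk in r.iter_content(chunk_size=1024):
            received.extend(chunk)
            chunk_count += 1

server.shutdown()
server.server_close()
print(len(received), chunk_count > 1)  # 10000 True
```

In a real scraper each chunk would typically be written to a file opened in binary mode instead of being accumulated in memory.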

session

import requests

session = requests.Session()

### 1. First visit any page to obtain a cookie

i1 = session.get(url="http://dig.chouti.com/help/service")

### 2. Log in, carrying the cookie from the previous step; the backend authorizes the gpsd value in that cookie
i2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8615131255089",
        'password': "xxxxxx",
        'oneMonth': ""
    }
)

i3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=8589623",
)
print(i3.text)
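The key property of a Session, cookies set by one response being sent automatically with later requests, can be demonstrated against a throwaway local server. Everything about the server here is an assumption for illustration: the `/login` and `/profile` paths are hypothetical, and the cookie name merely echoes the gpsd cookie mentioned above.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
import requests


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        if self.path == '/login':
            # hypothetical login endpoint that hands out a cookie
            self.send_header('Set-Cookie', 'gpsd=abc123; Path=/')
        # echo back whatever Cookie header the client sent
        body = (self.headers.get('Cookie') or 'no cookie').encode()
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

with requests.Session() as session:
    session.trust_env = False  # direct loopback connection, no proxies
    session.get(base + '/login', timeout=5)
    with_session = session.get(base + '/profile', timeout=5).text

with requests.Session() as fresh:
    fresh.trust_env = False
    without_session = fresh.get(base + '/profile', timeout=5).text

server.shutdown()
server.server_close()
print(with_session)     # gpsd=abc123
print(without_session)  # no cookie
```

The second session never visited the login path, so it has no cookie to send; this is the same reason the voting request above only works when issued through the session that logged in.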

Encoding issues

import requests
response=requests.get('http://www.autohome.com/news')
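# response.encoding='gbk'  # the autohome page is gb2312-encoded, while requests falls back to ISO-8859-1 when the response headers declare no charset; without setting gbk the Chinese text comes out garbled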
print(response.text)
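The garbling itself is plain byte decoding and can be reproduced with the standard library alone. In this sketch the sample string is an assumption chosen for illustration (simplified Chinese, since gb2312 does not cover traditional characters).

```python
raw = '汽车之家'.encode('gb2312')  # bytes as a gb2312 server would send them

wrong = raw.decode('iso-8859-1')  # what an ISO-8859-1 fallback produces
right = raw.decode('gbk')         # gbk is a superset of gb2312

print(right)           # 汽车之家
print(wrong == right)  # False
```

Setting response.encoding before reading response.text, as in the commented-out line above, makes requests perform the second kind of decoding instead of the first.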

  
