requests.get() Parameters

Reposted article. 2019-08-13 20:54. Tag: web crawlers

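Query parameters: params

1. Parameter type

A dictionary; its key-value pairs are used as the query parameters.

2. Usage

1. res = requests.get(url,params=params,headers=headers)
2. Characteristics:
   * url is the base URL and does not contain the query parameters
   * the method automatically encodes the params dictionary and then appends it to url

3. Example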

import requests

baseurl = 'http://tieba.baidu.com/f?'
params = {
  'kw' : '趙麗穎吧',
  'pn' : '50'
}
headers = {'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)'}
# params is encoded automatically, appended to the url, and then the request is sent
res = requests.get(baseurl,params=params,headers=headers)
res.encoding = 'utf-8'
print(res.text)
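
To see what the automatic encoding amounts to, the sketch below builds a comparable URL by hand with the standard library's urllib.parse.urlencode. It only illustrates the idea and is not how requests is implemented internally; build_url is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base, params):
    """Percent-encode params and append them to base (hypothetical helper)."""
    # Drop a trailing '?' so the result never contains '??'
    return base.rstrip('?') + '?' + urlencode(params)

url = build_url('http://tieba.baidu.com/f?', {'kw': '趙麗穎吧', 'pn': '50'})
print(url)  # non-ASCII characters appear as %XX escapes of their UTF-8 bytes

# Decoding the query string gives back the original values
print(parse_qs(urlsplit(url).query))
```

With requests itself, res.url shows the final URL that was actually requested.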

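Web client authentication parameter: auth

1. Purpose and type

1. For sites that require web client username/password authentication
2. auth = ('username','password')

2. Example: listing note archive names with a username and password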
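
When auth is a (username, password) tuple, requests uses HTTP Basic authentication. As a rough sketch of what that means on the wire, the header can be reproduced with the standard library. basic_auth_header is a hypothetical helper, and the credentials are placeholders.

```python
import base64

def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic auth."""
    raw = '{}:{}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(raw).decode('ascii')

header = basic_auth_header('username', 'password')
print(header)
```

Base64 is an encoding, not encryption, so the credentials are only protected in transit when the site is served over https.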

import requests
from lxml import etree
import os

class NoteSpider(object):
    def __init__(self):
        self.url = 'http://code.com.cn/Code/aid1904/redis/'
        self.headers = {'User-Agent':'Mozilla/5.0'}
        self.auth = ('code','code_2013')

    # Fetch the page
    def get_html(self):
        html = requests.get(url=self.url,auth=self.auth,headers=self.headers).text
        return html

    # Parse the page and print the names of the note archives
    def parse_page(self):
        html = self.get_html()
        xpath_bds = '//a/@href'
        parse_html = etree.HTML(html)
        # r_list : ['../','day01','day02','redis_day01.zip']
        r_list = parse_html.xpath(xpath_bds)
        for r in r_list:
            if r.endswith('zip') or r.endswith('rar'):
                print(r)

if __name__ == '__main__':
    spider = NoteSpider()
    spider.parse_page()

Exercise: how would you download the actual note files?

import requests
from lxml import etree
import os

class NoteSpider(object):
    def __init__(self):
        self.url = 'http://code.com.cn/Code/redis/'
        self.headers = {'User-Agent':'Mozilla/5.0'}
        self.auth = ('code','code_2013')

    # Fetch the page
    def get_html(self):
        html = requests.get(url=self.url,auth=self.auth,headers=self.headers).text
        return html

    # Parse the page and download every note archive
    def parse_page(self):
        html = self.get_html()
        xpath_bds = '//a/@href'
        parse_html = etree.HTML(html)
        # r_list : ['../','day01','day02','redis_day01.zip']
        r_list = parse_html.xpath(xpath_bds)
        for r in r_list:
            if r.endswith('zip') or r.endswith('rar'):
                file_url = self.url + r
                self.save_files(file_url,r)

    def save_files(self,file_url,r):
        html_content = requests.get(file_url,headers=self.headers,auth=self.auth).content
        # Make sure the save directory exists (a very common pattern).
        # os.makedirs() creates intermediate directories recursively;
        # os.mkdir() only creates the last path component.
        directory = '/home/redis/'
        filename = directory + r
        if not os.path.exists(directory):
            os.makedirs(directory)

        with open(filename,'wb') as f:
            f.write(html_content)
            print(r,'downloaded successfully')

if __name__ == '__main__':
    spider = NoteSpider()
    spider.parse_page()
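
Plain concatenation such as self.url + r works for the simple links in this listing. A slightly more defensive sketch, assuming the same kind of directory page, resolves each link with urljoin and builds the local path with os.path.join. pick_archives is a hypothetical helper, and the sample list mirrors the r_list comment in the code above.

```python
import os
from urllib.parse import urljoin

def pick_archives(base_url, hrefs, directory):
    """Return (file_url, local_path) pairs for the .zip/.rar links."""
    targets = []
    for href in hrefs:
        if href.endswith(('.zip', '.rar')):
            file_url = urljoin(base_url, href)
            # basename() keeps only the file name, even if href contains a path
            local_path = os.path.join(directory, os.path.basename(href))
            targets.append((file_url, local_path))
    return targets

r_list = ['../', 'day01', 'day02', 'redis_day01.zip']
for file_url, local_path in pick_archives('http://code.com.cn/Code/redis/', r_list, '/home/redis'):
    print(file_url, '->', local_path)
```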

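SSL certificate verification parameter: verify

1. Applicable sites and scenarios

1. Applicable sites: https sites whose certificate was not issued by a certificate authority (CA)
2. Applicable scenario: consider this parameter when an SSLError exception is raised

2. Parameter values

1. verify=True (default): verify the certificate
2. verify=False (commonly used): skip certificate verification
# Example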
response = requests.get(
    url=url,
    params=params,
    headers=headers,
    verify=False
)

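Proxy parameter: proxies

1. Definition

1. Definition: an IP address that connects to the network in place of your own IP address.
2. Purpose: hides your real IP address so that it is less likely to be blocked.

2. Ordinary proxies

Sites that list proxy IPs

Xici Proxy (西刺代理), Kuaidaili (快代理), Quanwang Proxy (全網代理), Daili Jingling (代理精靈), ...

Parameter format

1. Syntax
       proxies = {
           'protocol':'protocol://IP:port'
       }
2. Example
    proxies = {
        'http':'http://IP:port',
        'https':'https://IP:port'
    }

Example code

(1) Use a free ordinary proxy IP to access the test site: http://httpbin.org/get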
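
The dictionary above can be produced by a small helper. A minimal sketch, where make_proxies is a hypothetical name and the address is a placeholder rather than a working proxy:

```python
def make_proxies(ip_port):
    """Build a proxies dict in the 'protocol': 'protocol://IP:port' form."""
    return {
        'http': 'http://{}'.format(ip_port),
        'https': 'https://{}'.format(ip_port),
    }

proxies = make_proxies('1.1.1.1:8080')
print(proxies)
```

The result can be passed directly as requests.get(url, proxies=proxies).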

import requests

url = 'http://httpbin.org/get'
headers = {
    'User-Agent':'Mozilla/5.0'
}
# Define the proxies; look up a free proxy IP on one of the proxy IP sites
proxies = {
    'http':'http://112.85.164.220:9999',
    'https':'https://112.85.164.220:9999'
}
html = requests.get(url,proxies=proxies,headers=headers,timeout=5).text
print(html)

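Exercise: build your own proxy IP pool and keep it updated for scraping site data

1. Scrape free proxy IPs from a proxy IP site
2. Test the scraped IPs and save the working ones to a file

(2) Write an interface for fetching paid open proxies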

# Fetch open proxies from the provider's API
import requests

# Return True if a request through this proxy succeeds
def test_ip(ip):
    url = 'http://www.baidu.com/'
    proxies = {
        'http':'http://{}'.format(ip),
        'https':'https://{}'.format(ip),
    }
    try:
        res = requests.get(url=url,proxies=proxies,timeout=8)
        return res.status_code == 200
    except Exception:
        return False

# Extract the proxy IPs and save the working ones
def get_ip_list():
  api_url = 'http://dev.kdlapi.com/api/getproxy/?orderid=946562662041898&num=100&protocol=1&method=2&an_an=1&an_ha=1&sep=2'
  html = requests.get(api_url).content.decode('utf-8','ignore')
  ip_port_list = html.split('\n')

  # Open the file once and append every proxy that passes the test
  with open('proxy_ip.txt','a') as f:
    for ip in ip_port_list:
        ip = ip.strip()  # drop '\r' and skip blank lines
        if ip and test_ip(ip):
            f.write(ip + '\n')

if __name__ == '__main__':
    get_ip_list()


(3) Write a crawler that uses a random paid open proxy IP

import random
import requests

class BaiduSpider(object):
    def __init__(self):
        self.url = 'http://www.baidu.com/'
        self.headers = {'User-Agent' : 'Mozilla/5.0'}
        self.blag = 1

    def get_proxies(self):
        with open('proxy_ip.txt','r') as f:
            # f.readlines() returns e.g. ['1.1.1.1:111\n','2.2.2.2:22\n']
            result = f.readlines()
        # strip() removes the trailing '\n' after ip:port
        proxy_ip = random.choice(result).strip()
        proxy_ip = {
            'http':'http://{}'.format(proxy_ip),
            'https': 'https://{}'.format(proxy_ip)
        }
        return proxy_ip

    def get_html(self):
        proxies = self.get_proxies()
        if self.blag <= 3:
            try:
                html = requests.get(url=self.url,proxies=proxies,headers=self.headers,timeout=5).text
                print(html)
            except Exception as e:
                print('Retry')
                self.blag += 1
                self.get_html()

if __name__ == '__main__':
    spider = BaiduSpider()
    spider.get_html()
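
The recursive retry in get_html can also be written as a loop. The sketch below is one possible shape, not the original author's code: fetch_with_retries and fake_fetch are hypothetical names, and fetch stands for any callable that performs the request (for example a wrapper around requests.get) and raises an exception on failure.

```python
import random

def fetch_with_retries(fetch, proxy_pool, max_tries=3):
    """Call fetch(proxy) with a random proxy, trying at most max_tries times."""
    for attempt in range(1, max_tries + 1):
        proxy = random.choice(proxy_pool)
        try:
            return fetch(proxy)
        except Exception as exc:
            print('Attempt {} failed via {}: {}'.format(attempt, proxy, exc))
    return None  # every attempt failed

# Stand-in for a real request: fails twice, then succeeds
calls = []
def fake_fetch(proxy):
    calls.append(proxy)
    if len(calls) < 3:
        raise IOError('proxy timed out')
    return '<html>ok</html>'

print(fetch_with_retries(fake_fetch, ['1.1.1.1:111', '2.2.2.2:22']))
```

Picking a new random proxy inside the loop means a dead proxy does not consume all of the attempts.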


3. Private proxies

Format

1. Syntax
proxies = {
    'protocol':'protocol://username:password@IP:port'
}

2. Example
proxies = {
    'http':'http://username:password@IP:port',
    'https':'https://username:password@IP:port'
}

Example code
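
If the username or password contains characters such as @ or :, they need to be percent-encoded before being placed in the proxy URL, otherwise the URL is ambiguous. A minimal sketch with placeholder credentials; make_private_proxies is a hypothetical helper:

```python
from urllib.parse import quote

def make_private_proxies(user, password, ip_port):
    """Build a proxies dict with credentials embedded in the proxy URL."""
    # safe='' also escapes '/', ':' and '@' inside the credentials
    auth = '{}:{}'.format(quote(user, safe=''), quote(password, safe=''))
    return {
        'http': 'http://{}@{}'.format(auth, ip_port),
        'https': 'https://{}@{}'.format(auth, ip_port),
    }

proxies = make_private_proxies('user', 'p@ss:word', '1.1.1.1:8080')
print(proxies['http'])
```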

import requests
url = 'http://httpbin.org/get'
proxies = {
    'http': 'http://309435365:szayclhp@106.75.71.140:16816',
    'https':'https://309435365:szayclhp@106.75.71.140:16816',
}
headers = {
    'User-Agent' : 'Mozilla/5.0',
}

html = requests.get(url,proxies=proxies,headers=headers,timeout=5).text
print(html)

Relationship between urllib and urllib2

# Python 2
urllib : URL encoding
urllib2: sending requests
# Python 3 - merges Python 2's urllib and urllib2
urllib.parse: encoding
urllib.request: sending requests
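
A small Python 3 sketch of how the two modules divide the work: urllib.parse encodes the query string and urllib.request builds the request object. Nothing is sent here; urllib.request.urlopen(req) would perform the actual request. The keyword and page values are placeholders.

```python
from urllib import parse, request

# urllib.parse: encode the query parameters
query = parse.urlencode({'kw': 'python', 'pn': '50'})
url = 'http://tieba.baidu.com/f?' + query

# urllib.request: build the request object (no network traffic yet)
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print(req.full_url)
print(req.get_method())
```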

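Capturing packets with the browser console

How to open it and common options

1. Open the browser, press F12 to open the console (developer tools), and find the Network tab
2. Common console panels
   1. Network: captures network packets
        1. All: captures all network packets
        2. XHR: captures asynchronously loaded network packets
        3. JS: captures all JS files
   2. Sources: pretty-prints JavaScript code and lets you set breakpoints to debug it, which helps when analysing some of the parameters a crawler needs
   3. Console: interactive mode for testing JavaScript code
3. After capturing a specific network packet
   1. Click the packet's address on the left to open its details, then look at the right-hand side
   2. Right-hand side:
       1. Headers: the full request information
            General, Response Headers, Request Headers, Query String, Form Data
       2. Preview: a preview of the response content
       3. Response: the response content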
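
Once a request URL has been copied from the Network panel, its Query String part can be turned back into a params dictionary for requests.get(). A minimal sketch using the standard library; split_request_url is a hypothetical helper and the URL is a made-up example:

```python
from urllib.parse import urlsplit, parse_qsl

def split_request_url(full_url):
    """Split a captured URL into (base_url, params dict)."""
    parts = urlsplit(full_url)
    base_url = '{}://{}{}'.format(parts.scheme, parts.netloc, parts.path)
    # parse_qsl also decodes %XX escapes back into text
    params = dict(parse_qsl(parts.query))
    return base_url, params

base_url, params = split_request_url('http://tieba.baidu.com/f?kw=python&pn=50')
print(base_url)
print(params)
```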
