Python web scraping: using the requests library


I. Web scraping basics

How web scraping works:
        What is a crawler?
            A crawler is a program that fetches (scrapes) data from websites.

        What is the Internet?
            A collection of computers connected to one another by network devices.

        Why was the Internet built?
            To transmit and share data.

        What happens when you browse the web:
            - An ordinary user:
                open a browser --> send a request to the target site --> receive the response data --> render it on the page.

            - A crawler program:
                simulate a browser --> send a request to the target site --> receive the response data --> extract the useful data --> save it locally or to a database.

        What kind of request does the browser send?
            An HTTP request, consisting of:
                - the request URL
                - the request method:
                    GET, POST

                - the request headers:
                    Cookie
                    User-Agent
                    Host

        The full scraping workflow:
            1. Send the request (request libraries)
                - the requests module
                - the selenium module

            2. Receive the response data (returned by the server)

            3. Parse and extract the data (parsing libraries)
                - re (regular expressions)
                - bs4 (BeautifulSoup4)
                - XPath

            4. Store the data (storage backends)
                - MongoDB

            Steps 1, 3 and 4 are written by hand; a minimal end-to-end sketch follows this list.

            - Crawler frameworks
                Scrapy (object-oriented)

        Using Chrome's developer tools:
            open developer mode ----> Network ---> check "Preserve log" and "Disable cache"
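To make the workflow concrete, here is a minimal sketch of steps 1, 3 and 4 done by hand. The target URL and the regular expression are illustrative assumptions, not part of the original post:

import re
import requests

# Step 1: send the request (example.com is a placeholder target).
response = requests.get('https://example.com/')

# Step 3: parse and extract data -- here, the page title via a simple regex.
titles = re.findall(r'<title>(.*?)</title>', response.text, re.S)

# Step 4: store the data (a local file stands in for a database).
with open('titles.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))

print(titles)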

II. Installing the requests library

1. From a command prompt (a DOS window on Windows), run "pip3 install requests".

2. Or install it from within PyCharm (the original post showed a screenshot of PyCharm's package installer here).
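Either way, a one-line check confirms the install worked:

# Importing requests and printing its version verifies the installation.
import requests
print(requests.__version__)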

III. The request mechanism of requests, based on the HTTP protocol

 1. The HTTP protocol (using a request to Baidu as the example):
  (1) Request URL:
      https://www.baidu.com/

  (2) Request method:
    GET

  (3) Request headers:
    Cookie: may need attention.
    User-Agent: identifies the client as a browser.
    Note: look these up under Request Headers in the browser's developer tools, e.g.:
    Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36
    Host: www.baidu.com
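
By default requests sends a User-Agent of the form python-requests/<version>, which is one reason sites treat it differently from a browser. A quick way to see what requests actually sends is an echo service; httpbin.org is used here as a convenient assumption:

import requests

# httpbin.org/get echoes the request back, so the default headers are visible.
r = requests.get('http://httpbin.org/get')
print(r.json()['headers'])  # includes 'User-Agent': 'python-requests/...'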
  

 2. Using the browser's developer tools (the original post showed a Chrome DevTools screenshot here).

 3. The basic ways to call requests

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
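
Each call returns a Response object. A brief sketch of inspecting one (the GitHub events endpoint returns JSON, so .json() applies):

import requests

r = requests.get('https://api.github.com/events')
print(r.status_code)  # e.g. 200
events = r.json()     # decode the JSON body
print(type(events))   # <class 'list'>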

  4. Scraping the Baidu homepage

import requests

response = requests.get(url='https://www.baidu.com/')
response.encoding = 'utf-8'
print(response)  # <Response [200]>
# The response status code
print(response.status_code)  # 200
# The response body as text
# print(response.text)
print(type(response.text))  # <class 'str'>
# Write the scraped content to a local .html file
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
 

IV. GET requests in detail

 1. Using request headers (visiting Zhihu Explore as the example)

 (1) Scraping directly, without headers, fails:

# Visit "Zhihu Explore" without any request headers
import requests
response = requests.get(url='https://www.zhihu.com/explore')
print(response.status_code)  # 400
print(response.text)  # an error page is returned

 (2) After adding a request header, the scrape works:

# Visit Zhihu with request-header parameters attached:
import requests

# The request-header dictionary
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
# Add the user-agent to the GET request via the headers parameter
response = requests.get(url='https://www.zhihu.com/explore', headers=headers)
print(response.status_code)  # 200
# print(response.text)
with open('zhihu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

 2. The params request parameter

 (1) Some URLs are very long and full of opaque percent-encoded characters. Rather than building such a URL by hand, you can supply the query string through the params argument:

import requests
from urllib.parse import urlencode
# Example: a Baidu search for "蔡徐坤"
# url = 'https://www.baidu.com/s?wd=%E8%94%A1%E5%BE%90%E5%9D%A4'
'''
Method 1: build the query string yourself with urlencode
url = 'https://www.baidu.com/s?' + urlencode({"wd": "蔡徐坤"})
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
response = requests.get(url, headers=headers)
'''
# Method 2: let requests build the query string from params
url = 'https://www.baidu.com/s?'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
# Add the params argument to the get() call
response = requests.get(url, headers=headers, params={"wd": "蔡徐坤"})
print(response.url)  # https://www.baidu.com/s?wd=%E8%94%A1%E5%BE%90%E5%9D%A4
# print(response.text)
with open('xukun.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

 3. Using the cookies parameter

  (1) Carrying login cookies to get past GitHub's login check

Carry login cookies to get past GitHub's login check.

Request URL:
    https://github.com/settings/emails

Request method:
    GET

Request headers:
    User-Agent

    Cookie: has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60

  Method 1: splice the cookies into the request headers

import requests

# The request URL
url = 'https://github.com/settings/emails'

# The request headers, with the Cookie string spliced in directly
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
    # Splice the cookies into the request headers
    'Cookie': 'has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60'
}
github_res = requests.get(url, headers=headers)

   Method 2: pass the cookies as a separate argument to get()

import requests

url = 'https://github.com/settings/emails'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
# The cookies argument expects a dict mapping individual cookie names to values,
# so split the raw Cookie header string into name/value pairs first.
raw_cookie = 'has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60'
cookies = {k: v for k, v in (pair.split('=', 1) for pair in raw_cookie.split('; '))}

github_res = requests.get(url, headers=headers, cookies=cookies)

# Check whether account-specific content (a phone number) appears in the page
print('15622792660' in github_res.text)

 V. POST requests in detail

 1. GET and POST compared
  (1) GET requests (GET is HTTP's default request method):
       * carry no request body
       * the amount of data is limited by URL length, which browsers and servers cap (on the order of a few KB)
       * the request data is exposed in the browser's address bar

   (2) Common operations that issue GET requests:
         1. Entering a URL directly in the browser's address bar always issues a GET request.
         2. Clicking a hyperlink on a page always issues a GET request.
         3. Submitting a form uses GET by default, though the form can be set to POST.


   (3) POST requests:
      (1) the data does not appear in the address bar
      (2) there is no fixed upper limit on the amount of data
      (3) there is a request body
      (4) non-ASCII characters (such as Chinese) in the request body are URL-encoded

Note: requests.post() is used exactly like requests.get(); the one difference is that requests.post() takes a data parameter holding the request-body data.
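
A minimal sketch of a POST carrying a form body; httpbin.org is assumed here as a convenient echo service:

import requests

# httpbin.org/post echoes the request, so the submitted form body is visible.
response = requests.post('http://httpbin.org/post', data={'key': 'value'})
print(response.status_code)     # 200
print(response.json()['form'])  # {'key': 'value'}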

 2. Logging in to GitHub automatically with a POST request

  To reverse-engineer a login, enter a wrong username or password on purpose and capture the traffic. If the credentials are correct the browser redirects immediately, and the login packet becomes much harder to find.

'''
Log in to GitHub automatically with a POST request.
    GitHub's anti-scraping measures:
        1. The login POST to /session must carry the cookies returned by the /login page.
        2. The /settings/emails page must carry the cookies returned by the /session request.
'''

import requests
import re

# Step 1: GET the login page to obtain the authenticity_token
login_url = 'https://github.com/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Referer': 'https://github.com/'
}
login_res = requests.get(login_url, headers=headers)
# print(login_res.text)
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login_res.text, re.S)[0]
# print(authenticity_token)
login_cookies = login_res.cookies.get_dict()


# Step 2: POST to /session with the token carried in the request body
session_url = 'https://github.com/session'

session_headers = {
    'Referer': 'https://github.com/login',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}

form_data = {
    "commit": "Sign in",
    "utf8": "",
    "authenticity_token": authenticity_token,
    "login": "username",           # replace with your GitHub username
    "password": "githubpassword",  # replace with your GitHub password
    'webauthn-support': "supported"
}

# Step 3: test whether the login succeeded
session_res = requests.post(
    session_url,
    data=form_data,
    cookies=login_cookies,
    headers=session_headers,
    # allow_redirects=False
)

session_cookies = session_res.cookies.get_dict()

url3 = 'https://github.com/settings/emails'
email_res = requests.get(url3, cookies=session_cookies)

# '賬號' ("account") only appears on the page when logged in
print('賬號' in email_res.text)

Automatic GitHub login (handling the cookies manually).
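
The script above handles cookies by hand at every step. requests also provides requests.Session, which stores the cookies from each response and sends them automatically on subsequent requests; here is a hedged sketch of the same login flow rewritten that way (same placeholder credentials and form fields as above):

import requests
import re

# A Session object carries cookies across requests automatically.
session = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

login_res = session.get('https://github.com/login', headers=headers)
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login_res.text, re.S)[0]

form_data = {
    "commit": "Sign in",
    "utf8": "",
    "authenticity_token": authenticity_token,
    "login": "username",           # placeholder credentials
    "password": "githubpassword",
    'webauthn-support': "supported",
}

# No explicit cookies= argument needed: the session carries them for us.
session.post('https://github.com/session', data=form_data, headers=headers)
email_res = session.get('https://github.com/settings/emails', headers=headers)
print(email_res.status_code)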

 VI. The response object

1. Response attributes

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}
response = requests.get('https://www.github.com', headers=headers)

# response attributes
print(response.status_code)         # the response status code
print(response.url)                 # the final URL of the response
print(response.text)                # the body decoded as text
print(response.content)             # the body as raw bytes
print(response.headers)             # the response headers
print(response.history)             # previous responses in the redirect chain
print(response.cookies)             # the cookies, as a RequestsCookieJar
print(response.cookies.get_dict())  # the cookies converted to a dict
print(response.cookies.items())     # the cookies as a list of (name, value) pairs
print(response.encoding)            # the character encoding used for .text
print(response.elapsed)             # time elapsed between request and response
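
One attribute deserves a note: .text is decoded using .encoding, which requests derives from the Content-Type header and which can be wrong for some pages. A common hedged fix is to fall back on the encoding requests detects from the body itself:

import requests

response = requests.get('https://www.baidu.com')
# apparent_encoding is detected from the body; using it avoids garbled text
# when the server's Content-Type header does not declare a charset.
response.encoding = response.apparent_encoding
print(response.encoding)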

 VII. Advanced usage of requests

1. Timeout settings

# Timeout settings
# Two forms: a float or a tuple
# timeout=0.1         # a single float applies to both the connect and the read timeout
# timeout=(0.1, 0.2)  # 0.1 is the connect timeout, 0.2 the read timeout

import requests

# A timeout this small is guaranteed to fail: requests raises
# requests.exceptions.ConnectTimeout before the connection completes.
response = requests.get('https://www.baidu.com',
                        timeout=0.0001)
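
In practice the timeout is caught rather than left to crash the script; a small sketch:

import requests

try:
    response = requests.get('https://www.baidu.com', timeout=0.0001)
except requests.exceptions.Timeout as e:
    # ConnectTimeout and ReadTimeout are both subclasses of Timeout
    print('request timed out:', e)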

2. Using proxies

# Official docs: http://docs.python-requests.org/en/master/user/advanced/#proxies
# Proxy setup: the request is sent to the proxy first, and the proxy forwards it
# on your behalf (getting an IP banned is a common occurrence).
import requests

proxies = {
    # For a proxy with credentials, the username and password go before the @ sign:
    # 'http': 'http://tank:123@localhost:9527',
    'http': 'http://localhost:9527',
    'https': 'https://localhost:9527',
}
response = requests.get('https://www.12306.cn', proxies=proxies)
print(response.status_code)

# SOCKS proxies are also supported; install the extra with: pip install requests[socks]
import requests

proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port',
}
response = requests.get('https://www.12306.cn', proxies=proxies)
print(response.status_code)
