一、Crawler Basics
How crawlers work:

What is a crawler?
    A crawler is a program that scrapes (crawls) data.

What is the internet?
    A collection of network devices that link individual computers together.

Why was the internet built?
    To transfer and share data.

The full process of browsing the web:
- Ordinary user
    open a browser --> send a request to the target site --> receive the response data --> render it on the page.

- Crawler program
    simulate a browser --> send a request to the target site --> receive the response data --> extract the useful data --> save it locally or to a database.

What kind of request does the browser send?
    An HTTP request:
    - Request URL
    - Request method:
        GET, POST
    - Request headers:
        cookies
        user-agent
        host

The full crawling workflow:
    1. Send the request (request libraries)
        - requests module
        - selenium module
    2. Get the response data (returned by the server)
    3. Parse and extract the data (parsing libraries)
        - re (regular expressions)
        - bs4 (BeautifulSoup4)
        - XPath
    4. Save the data (storage libraries)
        - MongoDB
    Steps 1, 3 and 4 have to be written by hand.

- Crawler framework
    Scrapy (object-oriented)

Using the Chrome developer tools:
    open developer mode ----> Network ---> check "Preserve log" and "Disable cache"
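A minimal end-to-end sketch of steps 1 to 4, using requests to fetch a page, re to pull out its title, and a local file instead of MongoDB to keep the example small (the URL and file name are arbitrary choices):

import re
import requests

# 1. Send the request (request library: requests)
response = requests.get('https://www.baidu.com/')
response.encoding = 'utf-8'

# 2. Get the response data returned by the server
html = response.text

# 3. Parse and extract data (parsing library: re) -- here just the page title
titles = re.findall('<title>(.*?)</title>', html)
title = titles[0] if titles else ''

# 4. Save the data (a local file stands in for MongoDB in this sketch)
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(title)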
二、Installing the requests Library
1、Install it from the command line (DOS prompt) by typing "pip3 install requests"
2、Install it from within PyCharm
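To confirm the installation succeeded, import the library and print its version:

import requests
print(requests.__version__)   # e.g. 2.22.0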
三、The requests Request Mechanism Based on the HTTP Protocol
1、The HTTP protocol (using a request to Baidu as an example):
(1) Request URL:
https://www.baidu.com/
(2) Request method:
GET
(3) Request headers:
Cookie: may need attention.
User-Agent: used to prove that you are a browser
Note: look these up under Request Headers in the browser's developer tools
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36
Host: www.baidu.com
2、Using the browser's developer tools
3、Several ways to call requests
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
4、Crawling the Baidu homepage
import requests

response = requests.get(url='https://www.baidu.com/')
response.encoding = 'utf-8'
print(response)                # <Response [200]>
# Print the response status code
print(response.status_code)    # 200
# Print the response text
# print(response.text)
print(type(response.text))     # <class 'str'>
# Write the crawled content into baidu.html
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
四、GET Requests Explained
1、Using the headers request parameter (visiting "Zhihu Explore" as an example)
(1) Crawling it directly results in an error:
# Visit "Zhihu Explore" without any request headers
import requests

response = requests.get(url='https://www.zhihu.com/explore')
print(response.status_code)   # 400
print(response.text)          # returns an error page
(2) After adding the request headers, the page can be crawled normally:
# Visit Zhihu with request headers attached:
import requests

# Request-header dictionary
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
# Pass the user-agent to the GET request through the headers parameter
response = requests.get(url='https://www.zhihu.com/explore', headers=headers)
print(response.status_code)   # 200
# print(response.text)
with open('zhihu.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
2、The params request parameter
(1) When visiting some sites, the URL can get very long and contain a long, unreadable string of encoded characters; in that case the query parameters can be passed through params instead.
import requests
from urllib.parse import urlencode

# Example: searching Baidu for "蔡徐坤"
# url = 'https://www.baidu.com/s?wd=%E8%94%A1%E5%BE%90%E5%9D%A4'

'''
Method 1: build the query string yourself with urlencode
url = 'https://www.baidu.com/s?' + urlencode({"wd": "蔡徐坤"})
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
response = requests.get(url, headers=headers)
'''

# Method 2: let requests build the query string
url = 'https://www.baidu.com/s?'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}
# Add the query parameters to the GET request through the params parameter
response = requests.get(url, headers=headers, params={"wd": "蔡徐坤"})
print(response.url)   # https://www.baidu.com/s?wd=%E8%94%A1%E5%BE%90%E5%9D%A4
# print(response.text)
with open('xukun.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
3、Using the cookies parameter
(1) Carrying login cookies to get past GitHub's login check
Carrying cookies:
carry login cookies to get past GitHub's login check.

Request URL:
    https://github.com/settings/emails

Request method:
    GET

Request headers:
    User-Agent

    Cookie: has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60
Method 1: splice the cookies into the request headers
import requests

# Request URL
url = 'https://github.com/settings/emails'

# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
    # Splice the cookies into the request headers
    # 'Cookie': 'has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60'
}
github_res = requests.get(url, headers=headers)
Method 2: pass the cookies as a parameter of get()
import requests

url = 'https://github.com/settings/emails'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
cookies = {
    'Cookie': 'has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; _gat=1; tz=Asia%2FShanghai; _octo=GH1.1.1728573677.1560496856; _device_id=1cb66c9a9599576a3b46df2455810999; user_session=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; __Host-user_session_same_site=1V8n9QfKpbgB-DhS4A7l3Tb3jryARZZ02NDdut3J2hy-8scm; logged_in=yes; dotcom_user=TankJam; _gh_sess=ZS83eUYyVkpCWUZab21lN29aRHJTUzgvWjRjc2NCL1ZaMHRsdGdJeVFQM20zRDdPblJ1cnZPRFJjclZKNkcrNXVKbTRmZ3pzZzRxRFExcUozQWV4ZG9kOUQzZzMwMzA2RGx5V2dSaTMwaEZ2ZDlHQ0NzTTBtdGtlT2tVajg0c0hYRk5IOU5FelYxanY4T1UvVS9uV0YzWmF0a083MVVYVGlOSy9Edkt0aXhQTmpYRnVqdFAwSFZHVHZQL0ZyQyt0ZjROajZBclY4WmlGQnNBNTJpeEttb3RjVG1mM0JESFhJRXF5M2IwSlpHb1Mzekc5M0d3OFVIdGpJaHg3azk2aStEcUhPaGpEd2RyMDN3K2pETmZQQ1FtNGNzYnVNckR4aWtibkxBRC8vaGM9LS1zTXlDSmFnQkFkWjFjanJxNlhCdnRRPT0%3D--04f6f3172b5d01244670fc8980c2591d83864f60'
}

github_res = requests.get(url, headers=headers, cookies=cookies)

# Check whether an account-specific string appears in the page, i.e. whether the login cookies worked
print('15622792660' in github_res.text)
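Strictly speaking, the cookies parameter expects each cookie as its own name/value pair rather than one big "Cookie" string. A small sketch (with the cookie string truncated to the first few pairs from the example above) that converts the raw header string into such a dict:

# Raw value copied from the browser's Cookie header (truncated here)
cookie_str = 'has_recent_activity=1; _ga=GA1.2.1416117396.1560496852; logged_in=yes; dotcom_user=TankJam'

# Split the string into individual name/value pairs for the cookies parameter
cookies = {}
for item in cookie_str.split('; '):
    name, _, value = item.partition('=')
    cookies[name] = value

print(cookies)   # {'has_recent_activity': '1', '_ga': 'GA1.2.1416117396.1560496852', 'logged_in': 'yes', 'dotcom_user': 'TankJam'}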
五、POST Requests Explained
1、Introduction to GET and POST
(1) GET requests (GET is HTTP's default request method):
* No request body
* The data must stay within 1 KB!
* GET request data is exposed in the browser's address bar
(2) Common operations that use GET requests:
1. Entering a URL directly in the browser's address bar always produces a GET request
2. Clicking a hyperlink on a page is also always a GET request
3. When submitting a form, the form uses GET by default, but it can be set to POST
(3) POST requests:
1. The data does not appear in the address bar
2. There is no upper limit on the size of the data
3. There is a request body
4. If the request body contains Chinese characters, they are URL-encoded!
!!! requests.post() is used in exactly the same way as requests.get(); the one difference is that requests.post() takes a data parameter that holds the request-body data!
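A minimal sketch of that difference, posting a small form to the httpbin.org test service used earlier (the field names are arbitrary):

import requests

# The request body goes in data; everything else works like requests.get()
response = requests.post('http://httpbin.org/post', data={'name': 'tank', 'pwd': '123'})
print(response.status_code)      # 200
print(response.json()['form'])   # httpbin echoes the form data back: {'name': 'tank', 'pwd': '123'}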
2、Automatic GitHub login with a POST request
To analyse a login, deliberately enter a wrong username or password in the login form and then capture the traffic to study the communication flow. If you enter the correct credentials, the browser redirects straight away and you will never find the request packet you need, no matter how hard you look.
'''
Automatic GitHub login with a POST request.
GitHub anti-crawling measures:
    1. The login POST to /session must carry the cookies returned by the /login page
    2. The /settings/emails page must carry the cookies returned after the /session request
'''

import requests
import re

# Step 1: request the login page to obtain the authenticity_token
login_url = 'https://github.com/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Referer': 'https://github.com/'
}
login_res = requests.get(login_url, headers=headers)
# print(login_res.text)
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login_res.text, re.S)[0]
# print(authenticity_token)
login_cookies = login_res.cookies.get_dict()


# Step 2: send a POST request to /session with the token in the request body
session_url = 'https://github.com/session'

session_headers = {
    'Referer': 'https://github.com/login',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}

form_data = {
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": authenticity_token,
    "login": "username",
    "password": "githubpassword",
    'webauthn-support': "supported"
}

# Step 3: send the login request and test whether we are logged in
session_res = requests.post(
    session_url,
    data=form_data,
    cookies=login_cookies,
    headers=session_headers,
    # allow_redirects=False
)

session_cookies = session_res.cookies.get_dict()

url3 = 'https://github.com/settings/emails'
email_res = requests.get(url3, cookies=session_cookies)

# '賬號' (account) only appears on the emails page when the login succeeded
print('賬號' in email_res.text)

Automatic GitHub login (handling the cookies by hand)
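For comparison, a sketch of the same flow with requests.Session(), which keeps the cookies between requests automatically instead of passing cookies= by hand. It assumes the headers, session_headers and form_data dictionaries from the block above are already defined:

import re
import requests

session = requests.Session()   # a Session object carries cookies across requests

# Step 1: fetch the login page; its cookies are stored on the session
login_res = session.get('https://github.com/login', headers=headers)
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', login_res.text, re.S)[0]
form_data['authenticity_token'] = authenticity_token

# Step 2: POST the login form; the cookies from step 1 are sent automatically
session.post('https://github.com/session', data=form_data, headers=session_headers)

# Step 3: the session now holds the logged-in cookies
email_res = session.get('https://github.com/settings/emails')
print('賬號' in email_res.text)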
六、The response Object
1、response attributes
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}
response = requests.get('https://www.github.com', headers=headers)

# response attributes
print(response.status_code)          # response status code
print(response.url)                  # final URL of the response
print(response.text)                 # response body as text
print(response.content)              # response body as raw bytes
print(response.headers)              # response headers
print(response.history)              # earlier responses in the redirect chain
print(response.cookies)              # cookies
print(response.cookies.get_dict())   # cookies converted to a dict
print(response.cookies.items())      # cookies as a list of (name, value) pairs
print(response.encoding)             # character encoding
print(response.elapsed)              # time taken by the request
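A small sketch of the difference between text and content: content is raw bytes, which is what you want when saving a binary file to disk (the favicon URL below is only an assumed example):

import requests

# Assumed example URL for a small binary file
res = requests.get('https://github.com/favicon.ico')
# response.content is bytes, so the file is opened in binary mode ('wb')
with open('favicon.ico', 'wb') as f:
    f.write(res.content)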
七、Advanced requests Usage
1、Timeout settings
# Timeout settings
# Two forms of timeout: a float or a tuple
# timeout=0.1        # a single float applies to both connecting and receiving data
# timeout=(0.1, 0.2) # 0.1 is the connect timeout, 0.2 is the timeout for receiving data
import requests

# The timeout below is deliberately tiny, so this request raises a timeout exception
response = requests.get('https://www.baidu.com', timeout=0.0001)
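Such a tiny timeout raises an exception rather than returning a response; a short sketch of catching it:

import requests

try:
    response = requests.get('https://www.baidu.com', timeout=0.0001)
except requests.exceptions.Timeout as e:
    # ConnectTimeout and ReadTimeout are both subclasses of Timeout
    print('request timed out:', e)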
2、Using a proxy
# Official docs: http://docs.python-requests.org/en/master/user/advanced/#proxies
# Proxy setup: the request is first sent to the proxy, which then forwards it for you
# (getting an IP banned is a common occurrence, so proxies matter)
import requests

proxies = {
    # Proxy with a username and password; the credentials come before the @ sign
    # (only one entry per scheme is allowed, so the authenticated variant is commented out)
    # 'http': 'http://tank:123@localhost:9527',
    'http': 'http://localhost:9527',
    'https': 'https://localhost:9527',
}
response = requests.get('https://www.12306.cn', proxies=proxies)
print(response.status_code)

# SOCKS proxies are also supported; install the extra with: pip install requests[socks]
import requests

proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
response = requests.get('https://www.12306.cn', proxies=proxies)
print(response.status_code)