Python 3 Web Crawler: Scraping Dynamic Web Page Content with requests

Reposted article. Posted 2020-10-21 23:39.

Python version: Python 3.x
Environment: OS X
IDE: PyCharm

I. Tool Preparation

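Packet capture tool: on OS X, I use Charles 4.0.

Download link and installation guide: http://www.sdifen.com/charles4.html
After installation, Charles needs its certificate installed. The first half of the article "Capturing HTTPS requests with Charles on Mac: installing the Root Certificate" covers the certificate installation steps; the second half covers capturing traffic from a phone, which we do not need here.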

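Proxy tool: the Chrome extension SwitchyOmega
- Charles works by acting as a proxy itself: a client such as the Chrome browser sends and receives its data through that proxy, which is how Charles captures the traffic. We could configure the proxy in Chrome's own settings, but the SwitchyOmega extension lets us switch proxies freely and conveniently, for a more efficient, speedier life. (laughs)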

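Using SwitchyOmega

After installing the extension, click Options (see the screenshot in the original post).
Click New Profile.
Enter a name of your choice, select Proxy Profile, and click Create.
Set the protocol to HTTP, the server to 127.0.0.1, and the port to 8888, then click Apply Changes; a message confirms that the options were saved.
Back on the target page (https://unsplash.com/), you can select the charles profile we just created to set a global proxy, so that requests to every site go through it.
Alternatively, you can enable the proxy for this site only.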
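
The same proxy address can also be handed to a script, so that requests made from Python show up in Charles as well. This is an assumption layered on top of the article, which only proxies the browser: `charles_proxies` is a hypothetical helper, and the host and port are the values entered in the profile above. It builds the mapping that the `proxies` argument of `requests.get()` accepts.

```python
def charles_proxies(host="127.0.0.1", port=8888):
    """Return a proxies mapping that routes HTTP and HTTPS through one HTTP proxy."""
    proxy_url = "http://{}:{}".format(host, port)
    return {"http": proxy_url, "https": proxy_url}


proxies = charles_proxies()
print(proxies)
# Usage (needs the requests package and a running Charles instance):
#   requests.get("https://unsplash.com/", proxies=proxies, verify=False)
```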

II. Background Knowledge

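requests.get(): see the official requests documentation.
Here I pass two extra arguments to requests.get(): verify=False and stream=True.
verify=False turns off verification of the site's SSL certificate. This gets past the SSLError, but it also removes the protection that verification provides.
stream=True makes requests read only the response headers up front and keep the connection open. The body is then pulled on demand, for example chunk by chunk through iter_content(), until it is fully consumed or the response is closed. For image downloads this means the file can be written piece by piece without holding the whole image in memory. In the author's tests, downloading the images without this parameter did not succeed.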
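
To make the chunk-by-chunk idea concrete without touching the network, here is a minimal sketch. `FakeResponse` and `save_stream` are hypothetical names introduced for illustration; the fake object only imitates the `iter_content()` method of a streamed requests response, so the same function should work on a real response obtained with `stream=True`.

```python
import os
import tempfile


class FakeResponse:
    """Hypothetical stand-in exposing the one method the sketch needs."""

    def __init__(self, payload):
        self._payload = payload

    def iter_content(self, chunk_size=1024):
        # Yield the payload in slices of at most chunk_size bytes.
        for start in range(0, len(self._payload), chunk_size):
            yield self._payload[start:start + chunk_size]


def save_stream(response, path, chunk_size=1024):
    """Write a streamed body to path chunk by chunk; return bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


target = os.path.join(tempfile.mkdtemp(), "demo.jpg")
size = save_stream(FakeResponse(b"x" * 2500), target, chunk_size=1024)
print(size)  # 2500
```

Only one chunk is held in memory at a time, which is the practical benefit of streaming for large files.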
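contextlib.closing(): the closing function in the contextlib library turns an object into a context manager so that it can be used with with; on leaving the block it calls the object's close() method.
Anyone who has used with open() knows the benefit of with: it closes the resource for us automatically, which simplifies the code. In fact, any object that correctly implements the context management protocol can be used in a with statement. There is also an article on the with statement and context managers.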
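
A small illustration of what closing() does, using a made-up `Connection` class (hypothetical, not part of any library) that has a close() method but no context manager support of its own:

```python
from contextlib import closing


class Connection:
    """Hypothetical resource that has close() but no __enter__/__exit__."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


conn = Connection()
with closing(conn) as c:
    inside = c.closed  # still open inside the block
print(inside, conn.closed)  # False True
```

close() is called when the block exits, including when the block raises an exception.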

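III. Approach

The dynamic page we scrape this time is a wallpaper site with beautiful wallpapers. Our goal is to use a crawler to download the original full-size images from the site to the local disk.

Site URL: https://unsplash.com/

We know the site is dynamic, so we need to capture the requests the page makes and analyze them to see how it obtains its data.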

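Use Charles to capture the requests made when the site is opened.
Find the useful JSON among them.
Compare the data in the JSON with the download link URL: the only part of the download link that changes is the image ID.
Pull the image IDs out of the JSON, fill each one into the ID slot of the download link, and run the download.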
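
The last step can be sketched as a plain string substitution. The URL template and the sample ID are the ones used in the article's own code; `build_download_urls` is a hypothetical helper name.

```python
DOWNLOAD_URL = "https://unsplash.com/photos/{}/download?force=true"


def build_download_urls(img_ids):
    """Fill each image ID into the download link template."""
    return [DOWNLOAD_URL.format(img_id) for img_id in img_ids]


urls = build_download_urls(["vgpHniLr9Uw"])
print(urls[0])
# https://unsplash.com/photos/vgpHniLr9Uw/download?force=true
```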

IV. Detailed Steps

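Use Charles to capture the requests made when the site is opened (see the screenshot in the original post).

The capture shows an authorization header carrying a Client-ID value. Note it down and add it to our own request headers. (Because these are study notes, this parameter was already known to be the one the anti-crawler check requires. To find such a parameter yourself, you could probably add the headers one at a time and test.)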
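
A minimal sketch of carrying that value into our own request headers. `build_headers` is a hypothetical helper, the User-Agent string is the one from the article's code, and the authorization value shown is only a placeholder: the real one has to be copied from the capture.

```python
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/61.0.3163.79 Safari/537.36"
)


def build_headers(authorization):
    """Return request headers carrying the captured authorization value."""
    if not authorization:
        raise ValueError("authorization value from the capture is required")
    return {"User-Agent": USER_AGENT, "authorization": authorization}


# Placeholder only; paste the value seen in Charles instead.
headers = build_headers("Client-ID <captured-value>")
print(sorted(headers))  # ['User-Agent', 'authorization']
```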
Find the useful JSON among them.
Here we find that the response stores the image IDs.
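
Extracting the IDs from such a response might look like the sketch below. The payload is a hypothetical sample trimmed to the two keys the article's code reads (`next_page` and `photos`); the second ID is made up, and `extract_ids` is an illustrative name.

```python
import json

# Hypothetical payload trimmed to the two keys the article's code reads.
SAMPLE = (
    '{"next_page": "<next-page-url>", '
    '"photos": [{"id": "vgpHniLr9Uw"}, {"id": "abc123"}]}'
)


def extract_ids(text):
    """Parse the feed JSON and return the list of photo IDs."""
    data = json.loads(text)
    return [photo["id"] for photo in data.get("photos", [])]


ids = extract_ids(SAMPLE)
print(ids)  # ['vgpHniLr9Uw', 'abc123']
```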
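Click to download an image on the page and capture the download link: the only part of the link that changes is the image ID.

That settles the concrete steps for fetching the images.
Pull the image IDs out of the JSON, fill each one into the ID slot of the download link, and run the download.
With the analysis done, the code steps are:
- Scrape the image IDs from the captured response and save them to a list
- Download the images one by one by ID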

V. Code

Code steps
1. Scrape the image IDs and save them to a list

    # -*- coding:utf-8 -*-
    import json

    import requests


    def get_ids():
        id_url = 'http://unsplash.com/napi/feeds/home'
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/61.0.3163.79 Safari/537.36',
            'authorization': '*************'  # obtain this value from the packet capture
        }
        id_lists = []
        # The SSLError is worked around by passing verify=False
        try:
            response = requests.get(id_url, headers=header,
                                    verify=False, timeout=30)
            response.raise_for_status()
            response.encoding = 'utf-8'
            dic = json.loads(response.text)
            print("next_page:{}".format(dic['next_page']))
            for each in dic['photos']:
                id_lists.append(each['id'])
            print("Finished reading image IDs")
            return id_lists
        except (requests.RequestException, ValueError, KeyError):
            # Network failure, invalid JSON, or an unexpected payload
            print("Failed to read image IDs")
            return False


    if __name__ == '__main__':
        id_lists = get_ids()
        if id_lists is not False:
            for img_id in id_lists:
                print(img_id)

As the screenshot in the original post showed, the image IDs are printed successfully.

2. Download the images by ID

    import os
    from contextlib import closing
    from datetime import datetime

    import requests


    def download(img_id):
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/61.0.3163.79 Safari/537.36',
            'authorization': '***********'  # obtain this value from the packet capture
        }
        file_path = 'images'
        download_url = 'https://unsplash.com/photos/{}/download?force=true'
        download_url = download_url.format(img_id)
        os.makedirs(file_path, exist_ok=True)
        file = '{}/{}.jpg'.format(file_path, img_id)
        # Check before sending the request so existing files cost no traffic
        if os.path.exists(file):
            print("Image {}.jpg already exists, skipping".format(img_id))
            return
        chunk_size = 1024
        try:
            start_time = datetime.now()
            # Two ways to download: urllib.request.urlretrieve, or the
            # streaming approach from the requests docs used here
            with closing(requests.get(download_url, headers=header, verify=False,
                                      stream=True, timeout=30)) as response:
                response.raise_for_status()
                with open(file, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=chunk_size):
                        f.write(chunk)
            sec = (datetime.now() - start_time).total_seconds()
            print("Downloaded image {} in {:.1f}s".format(img_id, sec))
        except (requests.RequestException, OSError):
            # Remove a partial file so a later run does not skip it
            if os.path.exists(file):
                os.remove(file)
            print("Failed to download image {}".format(img_id))


    if __name__ == '__main__':
        img_id = 'vgpHniLr9Uw'
        download(img_id)

Result:

The screenshots in the original post showed the images folder before the download and after the download.

3. Merge the code and download in batch

    # -*- coding:utf-8 -*-
    import json
    import os
    import time
    from contextlib import closing
    from datetime import datetime

    import requests


    class UnsplashSpider:
        def __init__(self):
            self.id_url = 'http://unsplash.com/napi/feeds/home'
            self.header = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/61.0.3163.79 Safari/537.36',
                'authorization': '***********'  # fill in your own captured value
            }
            self.id_lists = []
            self.download_url = 'https://unsplash.com/photos/{}/download?force=true'
            print("init")

        def get_ids(self):
            # The SSLError is worked around by passing verify=False
            try:
                response = requests.get(self.id_url, headers=self.header,
                                        verify=False, timeout=30)
                response.raise_for_status()
                response.encoding = 'utf-8'
                dic = json.loads(response.text)
                print("next_page:{}".format(dic['next_page']))
                for each in dic['photos']:
                    self.id_lists.append(each['id'])
                print("Finished reading image IDs")
                return self.id_lists
            except (requests.RequestException, ValueError, KeyError):
                # Network failure, invalid JSON, or an unexpected payload
                print("Failed to read image IDs")
                return False

        def download(self, img_id):
            file_path = 'images'
            download_url = self.download_url.format(img_id)
            os.makedirs(file_path, exist_ok=True)
            file = '{}/{}.jpg'.format(file_path, img_id)
            # Check before sending the request so existing files cost no traffic
            if os.path.exists(file):
                print("Image {}.jpg already exists, skipping".format(img_id))
                return
            chunk_size = 1024
            try:
                start_time = datetime.now()
                # Streaming download, the approach from the requests docs
                with closing(requests.get(download_url, headers=self.header,
                                          verify=False, stream=True,
                                          timeout=30)) as response:
                    response.raise_for_status()
                    with open(file, 'wb') as f:
                        for chunk in response.iter_content(chunk_size=chunk_size):
                            f.write(chunk)
                sec = (datetime.now() - start_time).total_seconds()
                print("Downloaded image {} in {:.1f}s".format(img_id, sec))
            except (requests.RequestException, OSError):
                # Remove a partial file so a later run does not skip it
                if os.path.exists(file):
                    os.remove(file)
                print("Failed to download image {}".format(img_id))


    if __name__ == '__main__':
        us = UnsplashSpider()
        id_lists = us.get_ids()
        if id_lists is not False:
            for img_id in id_lists:
                us.download(img_id)
                # A reasonable delay, out of respect for the site
                time.sleep(1)

VI. Conclusion

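Because this article is a set of study notes, some details have been left out along the way.

Studying this together with other material, I found that the key to scraping a dynamic site is analyzing the captured traffic. Once you can pick the key data out of the capture, the remaining steps of writing the crawler ...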
