Crawling an Entire Website [Advanced Crawler Notes]

Reposted article, 2022-03-06 15:01, Python crawlers

From scraping one page to scraping all the data

Let's start with the general workflow of a static-page crawler.

How the data is loaded
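Clicking through to the second page, we find that a ?start=25 field has been appended to the URL.
This part is called the query string. A query string carries search parameters or data to the server for processing, in the form ?key1=value1&key2=value2.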
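To make the key=value structure concrete, here is a minimal sketch that takes a query string apart and builds one, using only Python's standard urllib.parse module. The URL is the second-page address observed above; nothing is fetched.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Split a URL into its components and pull out the query string.
parts = urlsplit("https://book.douban.com/top250?start=25")
print(parts.path)    # /top250
print(parts.query)   # start=25

# parse_qs maps each key to a list of values, since a key may repeat.
params = parse_qs(parts.query)
print(params)        # {'start': ['25']}

# urlencode goes the other way: dict -> "key1=value1&key2=value2".
query = urlencode({"key1": "value1", "key2": "value2"})
print(query)         # key1=value1&key2=value2
```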
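Let's page through Douban Books a few more times and watch how the URL changes:
The pattern is easy to spot: page 2 has start=25, page 3 has start=50, page 10 has start=225, and each page lists 25 books.
So start is computed as start = 25 * (page number - 1), where 25 is the number of items shown per page.
We can write a short piece of code that generates every page URL we need: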

url = 'https://book.douban.com/top250?start={}'
# num starts from 0, so there is no need to subtract 1
urls = [url.format(num * 25) for num in range(10)]
print(urls)
# Output:
# [
#   'https://book.douban.com/top250?start=0',
#   'https://book.douban.com/top250?start=25',
#   'https://book.douban.com/top250?start=50',
#   'https://book.douban.com/top250?start=75',
#   'https://book.douban.com/top250?start=100',
#   'https://book.douban.com/top250?start=125',
#   'https://book.douban.com/top250?start=150',
#   'https://book.douban.com/top250?start=175',
#   'https://book.douban.com/top250?start=200',
#   'https://book.douban.com/top250?start=225'
# ]

Generating all the required page URLs
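The formula is stated in terms of 1-based page numbers, while the list comprehension counts from 0. As a quick sanity check, the sketch below writes the formula as a function and compares it with the values observed in the browser. The names start_for_page and PAGE_SIZE are made up for this illustration.

```python
PAGE_SIZE = 25  # books shown per page, as observed above

def start_for_page(page):
    """Return the `start` offset for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return PAGE_SIZE * (page - 1)

# The values observed while paging through the site.
observed = {2: 25, 3: 50, 10: 225}
for page, start in observed.items():
    assert start_for_page(page) == start

# Same URLs as the list comprehension, but indexed by page number.
url = "https://book.douban.com/top250?start={}"
urls = [url.format(start_for_page(page)) for page in range(1, 11)]
print(urls[0])    # https://book.douban.com/top250?start=0
print(urls[-1])   # https://book.douban.com/top250?start=225
```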

Once we have the links to every page, we can scrape the data for the whole site:

import requests
import time
from bs4 import BeautifulSoup

# Wrap the code that fetches Douban Books data in a function
def get_douban_books(url):
    # Browser-like user-agent header
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Each book's title link sits inside a <div class="pl2">
    items = soup.find_all('div', class_='pl2')
    for i in items:
        tag = i.find('a')
        name = tag['title']
        link = tag['href']
        print(name, link)

url = 'https://book.douban.com/top250?start={}'
urls = [url.format(num * 25) for num in range(10)]
for item in urls:
    get_douban_books(item)
    # Pause 1 second so we are not blocked for requesting too fast
    time.sleep(1)

Crawling the entire site

Anti-crawler measures: limiting frequent, abnormal browsing

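Whether the visitor is a browser or a crawler, every request to a website carries some information used for identification. This information is stored in the request headers.
The server uses the information in the request headers to work out who is visiting. There are many header fields; for now we only need to know user-agent. It contains the operating system, browser type, version and similar details, so by changing it we can successfully pose as a browser.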
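To see what "posing as a browser" changes, the sketch below compares the default user-agent of a Python HTTP client with a browser string. It uses the standard library's urllib.request instead of requests, an assumption made so it runs without third-party packages or network access. requests behaves analogously and announces itself as python-requests/<version> unless told otherwise.

```python
import urllib.request

# What the standard-library client announces when we set nothing ourselves.
opener = urllib.request.build_opener()
default_ua = dict(opener.addheaders)["User-agent"]
print(default_ua)   # e.g. Python-urllib/3.x, an obvious non-browser

# The same browser string used in the scraping code in this article.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/76.0.3809.132 Safari/537.36"
)

# Build (but do not send) a request that carries the browser string instead.
req = urllib.request.Request(
    "https://book.douban.com/top250",
    headers={"User-Agent": BROWSER_UA},
)
print(req.get_header("User-agent"))
```

A header set on the request takes precedence over the client's default, which is why a single headers argument is enough to change how the crawler identifies itself.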

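Official requests documentation: http://cn.python-requests.org/zh_CN/latest/

Checking identity is the simplest kind of anti-crawler measure, and a single line of code that disguises the crawler as a browser gets around it easily. So most sites also apply IP-based limits to prevent overly frequent access.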

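IP stands for Internet Protocol. An IP address is a numeric label assigned to each device that uses the Internet Protocol to go online. You can think of an IP address as a house number: once I know your house number, I can find your home.
When scraping large amounts of data, hitting the target site without restraint can overload it, and small personal sites with no anti-crawler measures may go down as a result. Large sites generally limit how often you can make requests, because a normal person does not visit a site dozens or even hundreds of times within 1 s.
time.sleep() is commonly used to lower the request rate.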
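A fixed time.sleep(1) adds a full second on top of however long each request took. One possible refinement, sketched here, is to enforce a minimum gap between requests. Throttle is a hypothetical helper written for this illustration; it is not part of requests or of the code in this article. The clock and sleep parameters exist only so the behaviour can be checked without waiting.

```python
import time

class Throttle:
    """Hypothetical helper: keep calls to wait() at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(min_interval=0.2)
started = time.monotonic()
for page in range(3):
    throttle.wait()
    # get_douban_books(url) would be called here
elapsed = time.monotonic() - started
print(f"3 simulated requests took {elapsed:.2f}s")
```

In the crawl loop, calling throttle.wait() before each get_douban_books(item) would take the place of time.sleep(1).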
Proxies are another way around IP limits: the site is accessed through a different IP address.
Official documentation: https://cn.python-requests.org/zh_CN/latest/user/advanced.html#proxies

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

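Scraping large amounts of data requires many IP addresses to switch between. So we build an IP proxy pool (a list) and pass a randomly chosen entry to the proxies parameter on each request.

Here is how to implement it: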

import requests
import random
from bs4 import BeautifulSoup

def get_douban_books(url, proxies):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
    # Fetch the page through the given proxy
    res = requests.get(url, proxies=proxies, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    items = soup.find_all('div', class_='pl2')
    for i in items:
        tag = i.find('a')
        name = tag['title']
        link = tag['href']
        print(name, link)

url = 'https://book.douban.com/top250?start={}'
urls = [url.format(num * 25) for num in range(10)]
# IP proxy pool (made-up addresses; they do not actually work)
proxies_list = [
    {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    },
    {
        "http": "http://10.10.1.11:3128",
        "https": "http://10.10.1.11:1080",
    },
    {
        "http": "http://10.10.1.12:3128",
        "https": "http://10.10.1.12:1080",
    }
]
for i in urls:
    # Pick a random entry from the IP proxy pool
    proxies = random.choice(proxies_list)
    get_douban_books(i, proxies)

Implementing the proxy pool
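With random.choice, a request that lands on a dead proxy simply fails. A possible extension, sketched here under stated assumptions, is to try another proxy from the pool when one fails. fetch_with_rotation and fake_fetch are hypothetical names for this illustration, and the fetch function is passed in so the idea can be shown without real proxies or network access. Catching OSError is enough for requests as well, since its RequestException derives from it.

```python
import random

def fetch_with_rotation(url, proxies_list, fetch, max_attempts=3, rng=random):
    """Try up to max_attempts distinct proxies; return the first success."""
    attempts = min(max_attempts, len(proxies_list))
    candidates = rng.sample(proxies_list, k=attempts)
    last_error = None
    for proxies in candidates:
        try:
            return fetch(url, proxies)
        except OSError as exc:  # also covers requests.RequestException
            last_error = exc
    raise RuntimeError(f"all {attempts} proxies failed for {url}") from last_error

# Made-up pool, same shape as the one in the article.
proxies_list = [
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
    {"http": "http://10.10.1.12:3128", "https": "http://10.10.1.12:1080"},
]

def fake_fetch(url, proxies):
    # Stand-in for requests.get: pretend the first proxy is down.
    if proxies["http"].startswith("http://10.10.1.10:"):
        raise ConnectionError("proxy unreachable")
    return f"fetched {url} via {proxies['http']}"

result = fetch_with_rotation(
    "https://book.douban.com/top250?start=0", proxies_list, fake_fetch
)
print(result)
```

In a real crawler, fetch would wrap requests.get(url, proxies=proxies, headers=headers), ideally with a timeout so a stalled proxy counts as a failure.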

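The gentlemen's agreement for crawlers: robots.txt

robots.txt is a text file placed in a website's root directory. It tells crawlers which content on the site should not be crawled and which may be.

To view it, just append /robots.txt to the site's domain name.

For example, the robots.txt for Douban Books is at https://book.douban.com/robots.txt. Its contents are as follows: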

User-agent: *
Disallow: /subject_search
Disallow: /search
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/

User-agent: Wandoujia Spider
Disallow: /

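User-agent: * means the rules apply to all crawlers (* is a wildcard); the lines that follow are the rules that crawlers matching that user-agent must obey. For example, Disallow: /search forbids crawling any path that begins with /search, and the other lines work the same way.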
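A crawler can check these rules programmatically before requesting a page. The sketch below feeds the rules quoted above to Python's standard urllib.robotparser as text, so it runs without network access. The crawler name my-crawler is a made-up placeholder.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, fed in as text so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /subject_search
Disallow: /search
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/

User-agent: Wandoujia Spider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://book.douban.com"
# "my-crawler" is a hypothetical name; it falls under the "*" group.
print(parser.can_fetch("my-crawler", base + "/top250?start=25"))   # True
print(parser.can_fetch("my-crawler", base + "/search?q=python"))   # False
# The second group bans this one crawler from the whole site.
print(parser.can_fetch("Wandoujia Spider", base + "/top250"))      # False
```

Against the live site, parser.set_url(base + "/robots.txt") followed by parser.read() would download the current rules in place of the hard-coded text.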
