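While learning to scrape images from Pexels recently, I found that the code from the book raises ConnectionError. After looking into it, the likely cause is that requests were being sent too frequently and the site blocked them. Calling time.sleep() would slow the crawl down, but since there is a lot of data to fetch and the run time would become too long, I chose to catch the exception and skip the failed request instead. The corrected code is given here.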
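The two options mentioned above (pausing between requests and skipping failed ones) can also be combined. The following is only a minimal sketch, not the book's code: `fetch_all`, `delay`, `errors`, and the injected `fetch` callable are hypothetical names introduced for illustration. With requests, one might pass `requests.get` as `fetch`; its exceptions derive from `OSError`, which is why that is the default error type here.

```python
import time


def fetch_all(urls, fetch, delay=1.0, errors=(OSError,), sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and skipping failures.

    fetch  -- hypothetical callable that takes a URL and returns a response
    delay  -- seconds to wait between consecutive requests (assumed value)
    errors -- exception types that are skipped instead of aborting the crawl
    """
    results, failed = {}, []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # throttle: wait before every request except the first
        try:
            results[url] = fetch(url)
        except errors:
            failed.append(url)  # skip this URL and keep going
    return results, failed


# Offline demonstration with a fake fetcher that fails on one URL.
def fake_fetch(url):
    if url.endswith('2'):
        raise ConnectionError('simulated block')
    return 'body of ' + url


demo_results, demo_failed = fetch_all(['page=1', 'page=2', 'page=3'], fake_fetch, delay=0)
```

Returning the failed URLs separately makes it possible to retry them later with a longer delay, rather than losing them silently.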
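Development environment: (Windows) Eclipse + PyDev
Target site: link
1. The page keeps loading new content as you scroll down, which shows that it uses asynchronous loading (AJAX).
2. Inspect the page with F12 -> Network -> Headers to find the request URL.
3. Shorten the URL by removing parts of the string step by step; with "search/book/?page=2" the normal page content is still returned.
Code: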
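Step 3 above can be sketched with the standard library. This is an illustration under assumptions: `trim_query` is a hypothetical helper, and the long URL below is made up (the actual request URL seen in the Network tab is not reproduced here). The idea is simply to drop every query parameter except `page`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def trim_query(url, keep=('page',)):
    """Return the URL with only the query parameters named in `keep`."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in keep]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ''))


# Made-up long request URL, used only to show the trimming.
long_url = 'https://www.pexels.com/search/book/?format=js&seed=abc&page=2'
short_url = trim_query(long_url)
```

Whether the shortened URL still returns a normal page has to be checked against the site itself, as described in step 3.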
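# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
urls = ['https://www.pexels.com/search/book/?page={}'.format(i) for i in range(1, 20)]
photo_urls = []  # image URLs collected from the search pages (renamed from `list`, which shadows the built-in)
# Forward slashes work on Windows and avoid backslash-escape problems;
# the trailing slash lets the file name be appended directly.
path = 'D:/Pyproject/pexels/picture/'

for url in urls:
    try:
        wb_data = requests.get(url, headers=headers)
        soul = BeautifulSoup(wb_data.text, 'lxml')
        imgs = soul.select('article > a > img')
        for img in imgs:
            photo = img.get('src')
            photo_urls.append(photo)
        print('Page loaded')
    except requests.exceptions.ConnectionError:
        # requests raises its own ConnectionError, which the built-in
        # ConnectionError does not catch
        print('pass disappoint')

for item in photo_urls:
    try:
        data = requests.get(item, headers=headers)
        # File name: last 10 characters of the URL with the query string removed
        fp = open(path + item.split('?')[0][-10:], 'wb')
        fp.write(data.content)
        fp.close()  # the parentheses are required, otherwise the file is never closed
        print('Download succeeded')
    except requests.exceptions.ConnectionError:
        print('pass')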
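The code above names each file after the last 10 characters of the image URL. As a possible alternative, the sketch below takes the whole last path segment instead, which is less likely to give two images the same name. `filename_from_url` is a hypothetical helper and the image URL is made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


# Made-up image URL, used only to compare the two naming schemes.
example = 'https://images.pexels.com/photos/12345/pexels-photo-12345.jpeg?auto=compress'
full_name = filename_from_url(example)   # whole last path segment
last_ten = example.split('?')[0][-10:]   # scheme used in the code above
```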
You can add time.time() to measure how long the program takes to run:
import time

start_time = time.time()
# program code
end_time = time.time()
print(end_time - start_time)  # elapsed seconds (end minus start, so the value is positive)
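If several parts of the script need timing, the same start/end pattern can be wrapped in a context manager. This is a minimal sketch: `timed` and its `report` parameter are hypothetical names, and `time.perf_counter()` is swapped in for `time.time()` because it is intended for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, report=print):
    """Measure the wall-clock time of the enclosed block and report it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        report('{}: {:.3f} s'.format(label, elapsed))


# Demonstration: collect the message in a list instead of printing it.
messages = []
with timed('demo', report=messages.append):
    total = sum(range(1000))
```

In the scraper, the page loop and the download loop could each be placed inside their own `with timed(...)` block.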
An improved version of the code that writes the image content:
# 'wb' is required: image data is binary, and the default mode is read-only text
with open(path + item.split('?')[0][-10:], 'wb') as fp:
    fp.write(data.content)
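One thing the write step depends on is that the target folder already exists. The sketch below shows one way to guard against that, assuming a hypothetical helper `save_bytes`; it creates the folder if needed and builds the file path with `os.path.join`. A temporary directory stands in for the real download folder.

```python
import os
import tempfile


def save_bytes(folder, name, content):
    """Write binary content to folder/name, creating the folder if missing."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, name)
    with open(target, 'wb') as fp:  # closed automatically, even on error
        fp.write(content)
    return target


# Demonstration in a temporary directory with a few placeholder bytes.
demo_dir = os.path.join(tempfile.mkdtemp(), 'picture')
saved = save_bytes(demo_dir, 'demo.jpeg', b'\xff\xd8\xff')
```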