Task goal: download every wallpaper, in every resolution, from every page of the official Honor of Kings (王者榮耀) wallpaper site into a designated top-level folder. Each wallpaper gets a subfolder named after it, and that subfolder holds all resolutions of that wallpaper.
The official Honor of Kings wallpaper page is at https://pvp.qq.com/web201605/wallpaper.shtml
Inspecting the page structure shows that the page source contains no wallpaper data, so the wallpapers are loaded dynamically. Open the Chrome developer tools, switch to the Network tab, refresh the page, and locate the request that returns the wallpaper data: https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=0&iOrder=0&iSortNumClose=1&jsoncallback=jQuery171041143228271859056_1605079993513&iAMSActivityId=51991&everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&=1605079993712
Testing shows that page=0 in the URL returns the first page of data; changing the value of page returns the other pages.
Because the response body starts with jQuery followed by a string of digits (a JSONP callback), the jsoncallback=jQuery171041143228271859056_1605079993513 parameter can be removed from the URL, after which the interface returns plain JSON.
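As a quick sanity check, the cleaned-up URL can be requested directly and parsed with response.json(). The following is a minimal sketch; api_url and page_index are illustrative names, and the trailing cache-buster parameter from the original URL is omitted here:

import requests

api_url = ('https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/'
           'workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON'
           '&iListNum=20&totalpage=0&page={}&iOrder=0&iSortNumClose=1'
           '&iAMSActivityId=51991&everyRead=true&iTypeId=2&iFlowId=267733'
           '&iActId=2735&iModuleId=2735')
page_index = 0  # change this value to fetch other pages
response = requests.get(api_url.format(page_index))
data = response.json()       # plain JSON once jsoncallback is removed
print(len(data['List']))     # 20 wallpaper entries per page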
Looking at the JSON shows that the wallpaper entries live under List, 20 per page. In each entry, the sProdImgNo_1 to sProdImgNo_8 fields hold the URLs of eight different wallpaper resolutions, but these URLs are URL-encoded; the unquote method of urllib.parse decodes them. The decoded URL ends in /200 (a small preview); changing that suffix to /0 yields the wallpaper at its proper resolution.
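For a single entry taken from data['List'] (called item below, an assumed variable name), the decoding step looks roughly like this:

from urllib import parse

raw = item['sProdImgNo_1']        # percent-encoded URL ending in %2F200
image_url = parse.unquote(raw)    # decoded URL, ends in /200 (small preview)
if image_url.endswith('/200'):
    image_url = image_url[:-3] + '0'   # /200 -> /0: full-resolution wallpaper
print(image_url)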
Fetching the images with plain requests calls is single-threaded: one image must finish downloading before the next starts, which is far too slow. Instead, a producer-consumer pattern can be used for multithreaded crawling. Reference code:
# Crawl all wallpapers from the official Honor of Kings wallpaper page
from time import time
import requests
from urllib import parse
import os
import threading
from queue import Queue, Empty
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like'
                  ' Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
class Producer(threading.Thread):
    """Fetches page JSON from the queue, extracts wallpaper URLs and queues them."""

    def __init__(self, page_queue, image_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.image_queue = image_queue

    def run(self):
        while not self.page_queue.empty():
            url = self.page_queue.get()
            response = requests.get(url, headers=headers)
            datas = response.json()['List']
            for data in datas:
                image_urls = extract_images(data)
                # sProdName is URL-encoded; decode it and replace ':' (not allowed
                # in Windows folder names), then strip surrounding whitespace
                file_name = parse.unquote(data['sProdName']).replace(':', '_').strip()
                image_path = os.path.join('wangzhebizhi_images', file_name)
                if not os.path.exists(image_path):
                    os.mkdir(image_path)
                for image_url in image_urls:
                    self.image_queue.put({'image_url': image_url, 'image_path': image_path})
class Consumer(threading.Thread):
    """Takes download tasks off the image queue and saves each wallpaper to disk."""

    def __init__(self, image_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.image_queue = image_queue

    def run(self):
        while True:
            try:
                image_info = self.image_queue.get(timeout=5)
            except Empty:
                # the queue has stayed empty for 5 seconds: assume everything is done
                print('All downloads finished')
                print('Elapsed time: ' + str(time() - start_time))
                break
            image_url = image_info['image_url']
            image_path = image_info['image_path']
            # the decoded URL contains sProdImgNo_<n>.jpg; use <n> as the file name
            pattern = re.compile(r'(.*?)sProdImgNo_(.)\.jpg')
            num = pattern.search(image_url).group(2)
            response = requests.get(image_url, headers=headers)
            file_path = os.path.join(image_path, num + '.jpg')
            with open(file_path, 'wb') as f:
                f.write(response.content)
            print(file_path + ' downloaded')
# Extract and decode the eight wallpaper URLs from one JSON entry
def extract_images(data):
    image_urls = []
    for i in range(1, 9):
        # decode first, then rewrite only the trailing /200 so that any other
        # occurrence of '200' inside the URL is left untouched
        image_url = parse.unquote(data['sProdImgNo_{}'.format(i)])
        if image_url.endswith('/200'):
            image_url = image_url[:-3] + '0'
        image_urls.append(image_url)
    return image_urls
def main():
    page_queue = Queue(50)
    image_queue = Queue(1000)
    pages = 2  # number of pages to crawl
    for i in range(pages):
        url = 'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/' \
              'workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20' \
              '&totalpage=0&page={}&iOrder=0&iSortNumClose=1&iAMSActivityId=51991&' \
              'everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1601534798968'.format(i)
        page_queue.put(url)
    for i in range(5):
        th = Producer(page_queue, image_queue, name='Producer-{}'.format(i + 1))
        th.start()
    for i in range(20):
        th = Consumer(image_queue, name='Consumer-{}'.format(i + 1))
        th.start()


if __name__ == '__main__':
    start_time = time()
    main()
Before running the script, first create a folder named wangzhebizhi_images in the same directory as the .py file.
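If preferred, a single call such as os.makedirs('wangzhebizhi_images', exist_ok=True) at the start of main() would create that folder automatically when it is missing.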
Run the script and you will see that downloading is considerably faster than single-threaded crawling with requests.