This article uses the Douyu mobile APP as its scraping example: the goal is to grab the full-size streamer images under the keyword "顏值" (beauty). For the prerequisite setup, click through to the earlier post on phone and Fiddler configuration.
Once Fiddler and the phone are configured, open the Douyu APP and capture the traffic with Fiddler. The responses here are all JSON, so we can extract the data directly with the json module.
First, find the API link. The link as captured here has problems when used directly; the real, working link has the form https://capi.douyucdn.cn*******&offset=0 . It isn't appropriate to publish someone else's resource link in full, so please capture the traffic and find it yourself. The link's response looks like this:

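Before wiring anything into Scrapy, it helps to confirm how the fields come out of the JSON. Below is a minimal sketch; the sample payload is invented, but the field names nickname, room_id, anchor_city and vertical_src are the ones the spider below relies on:

import json

# invented sample shaped like the capi.douyucdn.cn response
body = '''{"data": [{"nickname": "someone", "room_id": "123456",
            "anchor_city": "Shanghai", "vertical_src": "https://example.com/a.jpg"}]}'''

# "data" holds one dict per streamer; pull the fields we care about
for node in json.loads(body)["data"]:
    print(node["nickname"], node["room_id"], node["anchor_city"], node["vertical_src"])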
Based on this link, use Scrapy to crawl the data:
Create the project: scrapy startproject douyu
Generate the spider: scrapy genspider yanzhi douyucdn.cn
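After these two commands the project should have roughly the standard Scrapy layout; the files edited in the rest of this post all live here:

douyu/
    scrapy.cfg
    douyu/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            yanzhi.py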
Define the items to extract (items.py):
import scrapy


class DouyuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # streamer's nickname
    nick_name = scrapy.Field()
    # room number
    room_id = scrapy.Field()
    # city the streamer is in
    anchor_city = scrapy.Field()
    # image URL
    image_link = scrapy.Field()
    # path the image file is saved to
    image_path = scrapy.Field()

    source = scrapy.Field()
    utc_time = scrapy.Field()
Write the spider file (yanzhi.py):
import scrapy
import json
from douyu.items import DouyuItem


class YanzhiSpider(scrapy.Spider):
    name = 'yanzhi'
    allowed_domains = ['douyucdn.cn']
    offset = 0
    base_url = 'https://capi.douyucdn.cn**************&offset='
    start_urls = [base_url + str(offset)]

    def parse(self, response):
        node_list = json.loads(response.body.decode())["data"]

        # stop paginating once the API returns an empty list
        if not node_list:
            return

        for node in node_list:
            item = DouyuItem()
            item["nick_name"] = node["nickname"]
            item["room_id"] = node["room_id"]
            item["anchor_city"] = node["anchor_city"]
            item["image_link"] = node["vertical_src"]
            yield item
        # request the next page (the API pages in steps of 20)
        self.offset += 20
        yield scrapy.Request(url=self.base_url + str(self.offset), callback=self.parse)
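To sanity-check parse() without hitting the network, one option is to feed it a canned response. This is only a sketch: the payload is invented, and it assumes the douyu project is importable from the current directory:

import json
from scrapy.http import TextResponse
from douyu.spiders.yanzhi import YanzhiSpider

# canned payload shaped like the real API response
sample = {"data": [{"nickname": "test", "room_id": "123",
                    "anchor_city": "Beijing", "vertical_src": "https://example.com/a.jpg"}]}
response = TextResponse(url="https://capi.douyucdn.cn/fake",
                        body=json.dumps(sample).encode(), encoding="utf-8")

spider = YanzhiSpider()
results = list(spider.parse(response))
# one populated item, followed by the request for the next page
print(results[0]["nick_name"])   # -> test
print(results[-1].url)           # -> base_url + '20'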
Write the pipeline file (pipelines.py). Note that Scrapy's built-in ImagesPipeline needs the Pillow library to be installed:
from scrapy.pipelines.images import ImagesPipeline
from douyu.settings import IMAGES_STORE
from datetime import datetime
import scrapy
import os


class ImageSource(object):
    def process_item(self, item, spider):
        # stamp every item with its spider name and crawl time
        item["source"] = spider.name
        item["utc_time"] = str(datetime.utcnow())
        return item


class DouyuImagesPipeline(ImagesPipeline):

    # send a request for each image link
    def get_media_requests(self, item, info):
        # take the image URL out of the item
        image_link = item["image_link"]
        # the response is saved automatically under IMAGES_STORE
        yield scrapy.Request(url=image_link)

    def item_completed(self, results, item, info):
        # each result describes one image; keep only the paths of
        # images that downloaded successfully
        image_path = [x["path"] for ok, x in results if ok]
        if not image_path:
            # the download failed, so there is nothing to rename
            return item

        # path Scrapy saved the image under
        old_name = IMAGES_STORE + "/" + image_path[0]
        # rename the file after the streamer's nickname
        new_name = IMAGES_STORE + "/" + item["nick_name"] + ".jpg"
        item["image_path"] = new_name
        try:
            os.rename(old_name, new_name)
        except Exception as e:
            print("[INFO]: image already renamed\n", e)
        return item
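For reference, the results argument that Scrapy passes to item_completed is a list of (success, file_info) tuples, which is why the comprehension above filters on ok. Roughly, with invented values:

results = [
    (True, {"url": "https://example.com/a.jpg",  # the image URL that was requested
            "path": "full/0a1b2c3d.jpg",         # path relative to IMAGES_STORE
            "checksum": "d41d8cd98f00b204"}),    # MD5 checksum of the file
]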
Write the downloader middleware (middlewares.py):
import random
from douyu.settings import USER_AGENTS as UA


class UserAgentMiddleware(object):

    """
    Assign a random User-Agent to every request
    """

    def process_request(self, request, spider):
        user_agent = random.choice(UA)
        # note: the HTTP header name is "User-Agent", with a hyphen
        request.headers['User-Agent'] = user_agent
Configure the pipelines and the downloader middleware (settings.py):
IMAGES_STORE = '/home/dan/data/images'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3192.0 Safari/537.36"
]

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douyu.middlewares.UserAgentMiddleware': 543,
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douyu.pipelines.ImageSource': 100,
    'douyu.pipelines.DouyuImagesPipeline': 200,
}
Some of these settings were already configured in the earlier post; they are repeated here so everything is in one place.
To make running the spider and managing the data easier, we create a .py file that runs the crawl and then removes the leftover full directory (ImagesPipeline saves downloads under IMAGES_STORE/full; once the pipeline has renamed every image, it is empty).
import os

print("Starting the spider")
os.system("scrapy crawl yanzhi")
print("Removing the leftover directory")
# every image has been renamed out of "full", so the directory
# is empty and os.rmdir can delete it
os.rmdir("/home/dan/data/images/full")
Run the code above; part of the results and data is shown below:

The data saved in the folder:

The code above is straightforward, so there is no line-by-line explanation; if you are missing the basics, brush up on them first.
Feeling tempted? The code is all here, so get cracking!
