First Look at the Scrapy Framework (10) ------ Crawling Images from a Mobile App


  This article takes the Douyu mobile app as an example: we want to crawl the large streamer images under the "顏值" (beauty) category. For the preliminary phone and Fiddler setup, see the earlier post: 前期手機配置與fiddler配置

Once Fiddler and the phone are configured, open the Douyu app and capture its traffic with Fiddler. The responses here are all JSON, so we can extract the data directly with the json module.
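For reference, the captured response looks roughly like the sketch below. Only the field names ("data", "nickname", "room_id", "anchor_city", "vertical_src") are taken from what the spider code further down extracts; the values are made up.

import json

# Hypothetical sample of the captured JSON (made-up values; only the
# field names match what the spider extracts).
sample = '''{
    "data": [
        {
            "nickname": "some_streamer",
            "room_id": "123456",
            "anchor_city": "Guangzhou",
            "vertical_src": "https://rpic.douyucdn.cn/xxx/some.jpg"
        }
    ]
}'''
print(json.loads(sample)["data"][0]["vertical_src"])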

First, find the request URL. The link the author captured is not suitable for publishing as-is; the actual working link has the form: https://capi.douyucdn.cn*******&offset=0. Since it is not appropriate to publish someone else's resource link here, please capture it yourself with Fiddler. (A screenshot of the captured link was shown here.)

Based on this link, use Scrapy to crawl the data:

Create the project: scrapy startproject douyu

Create the spider: scrapy genspider yanzhi douyucdn.cn
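These two commands produce the standard Scrapy project layout:

douyu/
├── scrapy.cfg
└── douyu/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── yanzhi.py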

Define the crawl targets (items.py):

import scrapy


class DouyuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # streamer nickname
    nick_name = scrapy.Field()
    # room number
    room_id = scrapy.Field()
    # city where the streamer is located
    anchor_city = scrapy.Field()
    # image URL
    image_link = scrapy.Field()
    # path where the image file is stored
    image_path = scrapy.Field()

    # spider name and crawl timestamp, filled in by the pipeline
    source = scrapy.Field()
    utc_time = scrapy.Field()
Write the spider file (spiders/yanzhi.py):

import scrapy
import json
from douyu.items import DouyuItem


class YanzhiSpider(scrapy.Spider):
    name = 'yanzhi'
    allowed_domains = ['douyucdn.cn']
    offset = 0
    base_url = 'https://capi.douyucdn.cn**************&offset='
    start_urls = [base_url + str(offset)]

    def parse(self, response):
        # the response is JSON; the streamer records live under "data"
        node_list = json.loads(response.body.decode())["data"]

        # an empty "data" list means we have paged past the last record
        if not node_list:
            return

        for node in node_list:
            item = DouyuItem()
            item["nick_name"] = node["nickname"]
            item["room_id"] = node["room_id"]
            item["anchor_city"] = node["anchor_city"]
            item["image_link"] = node["vertical_src"]
            yield item

        # each page holds 20 records; request the next page
        self.offset += 20
        yield scrapy.Request(url=self.base_url + str(self.offset), callback=self.parse)
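As a quick sanity check of the stop condition (not part of the project), parse() can be fed a fake response; this sketch assumes only the standard scrapy.http.TextResponse API:

from scrapy.http import TextResponse

# An empty "data" list should end the crawl without yielding anything.
fake = TextResponse(
    url="https://capi.douyucdn.cn/fake?offset=0",
    body=b'{"data": []}',
    encoding="utf-8",
)
assert list(YanzhiSpider().parse(fake)) == []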

Write the pipeline file (pipelines.py):

from scrapy.pipelines.images import ImagesPipeline
from douyu.settings import IMAGES_STORE
from datetime import datetime
import scrapy
import os


class ImageSource(object):
    def process_item(self, item, spider):
        # stamp each item with the spider name and the crawl time
        item["source"] = spider.name
        item["utc_time"] = str(datetime.utcnow())
        print("**" * 20)
        return item


class DouyuImagesPipeline(ImagesPipeline):

    # issue the image download request
    def get_media_requests(self, item, info):
        # take the image URL from the item
        image_link = item["image_link"]
        print(image_link)
        # the response is saved automatically under IMAGES_STORE
        yield scrapy.Request(url=image_link)

    def item_completed(self, results, item, info):
        # each result describes one image; extract the paths of the
        # successfully downloaded ones
        image_path = [x["path"] for ok, x in results if ok]
        print(results)

        # guard against a failed download, which leaves the list empty
        if not image_path:
            return item

        # the path the image was saved under (inside the "full" folder)
        old_name = IMAGES_STORE + "/" + image_path[0]
        # rename it to <streamer nickname>.jpg at the top of IMAGES_STORE
        new_name = IMAGES_STORE + "/" + item["nick_name"] + ".jpg"
        item["image_path"] = new_name
        try:
            # move the file to its new path and name
            os.rename(old_name, new_name)
        except Exception as e:
            print("[INFO]: image already renamed\n", e)
        return item
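As a side note, the renaming dance exists because ImagesPipeline stores files under full/ with a SHA1 hash name by default. On newer Scrapy versions (2.4+, where file_path() receives the item), overriding file_path() avoids the rename entirely; a minimal sketch:

from scrapy.pipelines.images import ImagesPipeline
import scrapy


class NamedImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        yield scrapy.Request(url=item["image_link"])

    def file_path(self, request, response=None, info=None, *, item=None):
        # save directly as <nickname>.jpg at the top of IMAGES_STORE,
        # skipping the default full/<sha1>.jpg layout
        return item["nick_name"] + ".jpg"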

Write the downloader middleware (middlewares.py):

import random
from douyu.settings import USER_AGENTS as UA


class UserAgentMiddleware(object):

    """
    Assign a random User-Agent to every request.
    """

    def process_request(self, request, spider):
        user_agent = random.choice(UA)
        # note the header name uses a hyphen, not an underscore
        request.headers['User-Agent'] = user_agent
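To verify the middleware is actually rotating agents, a throwaway spider against httpbin (a public echo service; assumes network access) prints back the User-Agent the server received:

import scrapy


class UACheckSpider(scrapy.Spider):
    name = 'ua_check'
    start_urls = ['https://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin echoes the User-Agent header it received as JSON
        self.logger.info(response.text)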

Configure the pipelines and downloader middleware (settings.py):

IMAGES_STORE = '/home/dan/data/images'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3192.0 Safari/537.36",
]


# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douyu.middlewares.UserAgentMiddleware': 543,
}


# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douyu.pipelines.ImageSource': 100,
    'douyu.pipelines.DouyuImagesPipeline': 200,
}

Some of these settings were already configured in earlier parts; they are just collected here in one place.

To make running the spider and managing the data easier, we create a small py file that launches the crawl and deletes the redundant full folder:

import os

print("Starting the spider")
os.system("scrapy crawl yanzhi")
print("Removing the redundant folder")
# os.rmdir only removes an empty directory; by this point the pipeline
# has already moved every image out of "full", so it should be empty
os.rmdir("/home/dan/data/images/full")
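Note that os.rmdir raises OSError if any rename failed and left files behind in full; shutil.rmtree removes the directory regardless:

import shutil

# unlike os.rmdir, this also works when the directory is not empty
shutil.rmtree("/home/dan/data/images/full", ignore_errors=True)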

 

Run the code above. (Screenshots of part of the crawl output and of the images saved in the folder were shown here.)

The code above is simple, so it is not explained line by line; if you are missing the basics, go review them first.

Tempted? The code is all here, so get moving, folks!!!

 

