Scrapy Crawler Learning Series, Part 5: Scraping and Downloading Images

Reposted article. 2017-08-31 16:15. Tags: crawler / python / scrapy

Articles in this series:

Scrapy Crawler Learning Series, Part 1: Preparing the Scrapy environment: http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_007_scrapy01.html

Scrapy Crawler Learning Series, Part 2: A simple Scrapy spider example: http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_007_scrapy02.html

Scrapy Crawler Learning Series, Part 3: Deploying Scrapy to scrapyhub: http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_004_scrapyhub.html

Scrapy Crawler Learning Series, Part 4: Getting started with portia: http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_010_scrapy04.html

Scrapy Crawler Learning Series, Part 5: Scraping and downloading images: http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_011_scrapy05.html

Scrapy Crawler Learning Series, Part 6: Studying the official documentation: https://github.com/zhaojiedi1992/My_Study_Scrapy

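Note: I have created a new QQ group. Everyone is welcome to join so we can learn and improve together. Group number: 646187336.

This article scrapes the images from a car-logo site (http://car.bitauto.com/qichepinpai) and names each output image after the value of its alt attribute.

Download the final source code for this article (GitHub): https://github.com/zhaojiedi1992/caricon

1. Create the project and the spider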

C:\Users\Administrator>e:


E:\>cd scrapytest

E:\scrapytest>scrapy startproject caricon
New Scrapy project 'caricon', using template directory 'C:\\Program Files\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
    E:\scrapytest\caricon

You can start your first spider with:
    cd caricon
    scrapy genspider example example.com

E:\scrapytest>cd caricon

E:\scrapytest\caricon>scrapy genspider car car.bitauto.com/qichepinpai
Created spider 'car' using template 'basic' in module:
  caricon.spiders.car

2. Modify the item

Add the fields. After modification, items.py reads as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CariconItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    alt = scrapy.Field()

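image_urls: the image URLs for the item (we have to supply these URLs ourselves).
images: information about the downloaded images (we do not fill in this field; the pipeline populates it).

Note: the alt field above is one I added myself. image_urls and images are the default fields used for image requests. They are required, and I recommend sticking with the defaults. If you like to tinker, see: https://docs.scrapy.org/en/latest/topics/media-pipeline.html#usage-example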
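
To make the roles of the two default fields concrete, here is a small sketch of what an item might look like before and after the images pipeline has run. Plain dicts stand in for CariconItem, and the URL, path, and checksum values are made up for illustration. The keys url, path, and checksum are the ones the Scrapy documentation describes for each entry in images.

```python
# What the spider fills in (all values are hypothetical).
item_before = {
    "image_urls": ["http://example.com/logos/audi.png"],
    "alt": "Audi",
}

# What the item might look like once the pipeline has downloaded the image.
item_after = dict(item_before)
item_after["images"] = [
    {
        "url": "http://example.com/logos/audi.png",      # copied from image_urls
        "path": "full/Audi.png",                          # relative to IMAGES_STORE
        "checksum": "0123456789abcdef0123456789abcdef",  # placeholder value
    }
]


def saved_paths(item):
    """Return the storage paths of all images downloaded for an item."""
    return [image["path"] for image in item.get("images", [])]


print(saved_paths(item_before))
print(saved_paths(item_after))
```

The helper saved_paths is not part of Scrapy; it only shows that images stays empty until the pipeline has done its work.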

3. Modify the spider

Here we first use the Firefinder plugin for Firefox to locate the images we need to extract (screenshot omitted).

4. Modify the pipeline

After modification, pipelines.py reads as follows:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os

from scrapy.http import Request
# scrapy.contrib.pipeline.images is the old, deprecated import path.
from scrapy.pipelines.images import ImagesPipeline


class CariconPipeline(object):
    def process_item(self, item, spider):
        return item


class MyImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        #url_file_name= request.url.split('/')[-1]
        #image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
        # Name the file after the alt text carried in request.meta and
        # keep the extension taken from the image URL.
        alt_name = request.meta["alt"]
        return 'full/%s%s' % (alt_name, os.path.splitext(request.url)[-1])

    def get_media_requests(self, item, info):
        # Pass the alt text along with the request so file_path can use it.
        yield Request(item["image_urls"][0], meta={'alt': item["alt"]})

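Code overview: the official ImagesPipeline normally names each exported file after the SHA1 hash of its URL, which makes the files hard to tell apart. Here we use the meta parameter of Request to pass the image's alt attribute along, so that file_path can return a file name based on alt. (Note that if two images share the same alt, the later one overwrites the earlier one.)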
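
The difference between the two naming schemes, and one possible way around the overwrite problem, can be sketched with the standard library alone. This is a variation for illustration, not the pipeline used in this article: the function names are hypothetical, the short hash suffix is one of several ways to keep names unique, and the extension is taken from the URL path so that a query string does not end up in the file name.

```python
import hashlib
import os
import re
from urllib.parse import urlparse


def default_style_path(url):
    """Mimic the default naming: SHA1 of the URL, saved as .jpg."""
    return "full/%s.jpg" % hashlib.sha1(url.encode("utf-8")).hexdigest()


def alt_style_path(url, alt):
    """Name the file after alt, plus a short URL hash to avoid clashes."""
    # Use only the path part of the URL so "?v=1" is not part of the extension.
    ext = os.path.splitext(urlparse(url).path)[-1] or ".jpg"
    # Replace characters that Windows does not allow in file names.
    safe_alt = re.sub(r'[\\/:*?"<>|]', "_", alt).strip() or "unnamed"
    short_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return "full/%s_%s%s" % (safe_alt, short_hash, ext)


print(default_style_path("http://example.com/logos/1.png"))
print(alt_style_path("http://example.com/logos/1.png?v=1", "Audi"))
print(alt_style_path("http://example.com/logos/2.png", "Audi"))
```

Two images with the same alt but different URLs now get different names, at the cost of a slightly less tidy file name.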

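Note: the Firefinder plugin mentioned in step 3 depends on Firebug. You can look for a similar tool for your own browser.

5. Modify the settings.py file

Change the following fragment so that it reads: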

ITEM_PIPELINES = {
    'caricon.pipelines.MyImagesPipeline': 300,
}

IMAGES_STORE = r'e:\test\pic'

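Of course, we could also use the official image pipeline here (scrapy.pipelines.images.ImagesPipeline), in which case the files keep their SHA1-based names.

6. Run the spider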
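
The two choices can be laid out side by side as a settings.py fragment. This is only a sketch: the forward-slash form of the store path and the IMAGES_EXPIRES value are assumptions added for illustration, not settings taken from the original project.

```python
# Option A: the custom pipeline from this article (files named after alt).
ITEM_PIPELINES = {
    "caricon.pipelines.MyImagesPipeline": 300,
}

# Option B: the stock pipeline (files named after the SHA1 of the URL).
# ITEM_PIPELINES = {
#     "scrapy.pipelines.images.ImagesPipeline": 300,
# }

# A Python raw string cannot end with a single backslash, so either drop
# the trailing separator or use forward slashes, which Windows also accepts.
IMAGES_STORE = "e:/test/pic"

# Optional (hypothetical value): number of days before an image that was
# already downloaded is fetched again.
IMAGES_EXPIRES = 30
```

Only one ITEM_PIPELINES definition should be active at a time; the pipeline writes the images into a full subdirectory of IMAGES_STORE.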

E:\scrapytest\caricon>scrapy crawl car

7. View the results
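
One simple way to check the outcome is to list what ended up in the full subdirectory of the image store. The helper below is a hypothetical convenience script, not part of the original project; it assumes the IMAGES_STORE value used in this article.

```python
import os


def list_downloaded(images_store):
    """Return the sorted file names found under <images_store>/full."""
    full_dir = os.path.join(images_store, "full")
    if not os.path.isdir(full_dir):
        # Nothing has been downloaded yet, or the store path is wrong.
        return []
    return sorted(
        name for name in os.listdir(full_dir)
        if os.path.isfile(os.path.join(full_dir, name))
    )


if __name__ == "__main__":
    # Same location as the IMAGES_STORE setting in settings.py.
    for name in list_downloaded(r"e:\test\pic"):
        print(name)
```

With the custom pipeline, each printed name should be an alt value followed by the extension from the image URL.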
