Image Pipeline

Scrapy ships with dedicated pipelines for downloading files and images. Images and files are fetched the same way ordinary pages are, so their downloads go through the same asynchronous, concurrent machinery and are very efficient.

How the Image Pipeline works

  1. The item pipeline takes the URL(s) to be downloaded from the item and puts them back on the scheduling queue as new Requests, where they wait for the scheduler to dispatch the downloads
  2. Once the images have been downloaded, another group (images) is filled in with information about each download: the storage path, the source URL (taken from the image_urls group), and the image checksum. The order of the images list matches the order of the source image_urls group. If an image fails to download, the error is recorded and that image does not appear in the images group
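With Scrapy's built-in ImagesPipeline, the two groups above correspond to two item fields. A minimal sketch of the default convention (image_urls and images are Scrapy's default field names; a plain dict is used here so the sketch runs without Scrapy installed — the example project below overrides get_media_requests and uses its own url field instead):

```python
# Default field convention of Scrapy's built-in ImagesPipeline,
# modelled with a plain dict for illustration.
item = {
    'image_urls': ['http://p2.so.qhimgs1.com/t01b866193d9b2101de.jpg'],  # filled by the spider
    'images': [],  # the pipeline appends {url, path, checksum} dicts here after download
}
print(item['image_urls'][0])
```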


Example

  1. First, configure the image storage path in settings

    IMAGES_STORE = './images'
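Beyond IMAGES_STORE, the pipeline also has to be enabled in ITEM_PIPELINES, and the spider below reads a MAX_PAGE_SIZE setting. A hedged sketch of the relevant settings.py entries (the module paths, the MongoPipeline name, and the value 50 are assumptions, not taken from the original project):

```python
# settings.py sketch -- module paths and values below are assumptions
IMAGES_STORE = './images'   # where the image pipeline saves files
MAX_PAGE_SIZE = 50          # read by the spider via self.settings.get()

ITEM_PIPELINES = {
    # a lower number runs earlier, so the image pipeline drops
    # failed items before a (hypothetical) database pipeline sees them
    'images360.pipelines.ImagesPipeline': 300,
    'images360.pipelines.MongoPipeline': 301,
}
```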
    
  2. Define the fields you need in the item

    class Images360Item(scrapy.Item):
        # collection / table name used by a separate database pipeline
        collection = table = "images"
        id = scrapy.Field()
        title = scrapy.Field()
        url = scrapy.Field()
        thumb = scrapy.Field()
    
  3. Define the spider and its parse method

    import json
    import scrapy
    from urllib.parse import urlencode
    from scrapy import Request
    from images360.images360.items import Images360Item
    
    class ImagesSpider(scrapy.Spider):
        name = 'images'
        allowed_domains = ['images.so.com']
        start_urls = ['http://images.so.com/']
    
        def start_requests(self):
            data = {'ch': 'photography',
                    'listtype': 'hot', }
            base_url = 'http://images.so.com/zj?'
            for page in range(1, self.settings.get('MAX_PAGE_SIZE') + 1):
                sn = page * 30
                data['sn'] = sn
                params = urlencode(data)
                url = base_url + params
                yield Request(url, self.parse)
    
        def parse(self, response):
            html = json.loads(response.text)
            datas = html.get('list', '')
            if datas:
                for data in datas:
                    images_item = Images360Item()
                    images_item['id'] = data.get('imageid', '')
                    images_item['title'] = data.get('group_title', '')
                    images_item['url'] = data.get('qhimg_url', '')
                    images_item['thumb'] = data.get('qhimg_thumb_url', '')
                    yield images_item
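The pagination logic in start_requests can be checked with the standard library alone. A runnable sketch (MAX_PAGE_SIZE is fixed to 2 here for brevity):

```python
from urllib.parse import urlencode

# Rebuild the URLs that start_requests() generates.
base_url = 'http://images.so.com/zj?'
data = {'ch': 'photography', 'listtype': 'hot'}
urls = []
for page in range(1, 3):       # stands in for MAX_PAGE_SIZE = 2
    data['sn'] = page * 30     # offset grows by 30 results per page
    urls.append(base_url + urlencode(data))

print(urls[0])
# -> http://images.so.com/zj?ch=photography&listtype=hot&sn=30
```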
    
  4. Define the item pipeline

    from scrapy import Request
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline
    
    # Subclassing under the same name works because the base name is
    # resolved before the new class is bound, but a distinct name
    # (e.g. Images360Pipeline) would be clearer.
    class ImagesPipeline(ImagesPipeline):
    
        # Take the image URL out of the item and feed it back to the
        # scheduler as a new Request to be downloaded
        def get_media_requests(self, item, info):
            yield Request(item['url'])
    
        # request is the download currently in flight; this method
        # returns the file name the image is stored under
        def file_path(self, request, response=None, info=None):
            url = request.url
            file_name = url.split('/')[-1]
            return file_name
    
        # Called once every image request for a single item has finished
        def item_completed(self, results, item, info):
            # results holds the download results for this item, e.g.:
            # [(True, {'url': 'http://p2.so.qhimgs1.com/t01b866193d9b2101de.jpg', 'path': 't01b866193d9b2101de.jpg',
            #          'checksum': 'e074b5cbacd22ac38480d84506fedf02'})]
            image_paths = [x['path'] for ok, x in results if ok]
            if image_paths:
                return item
            else:
                raise DropItem('image download failed')
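The filtering done in item_completed can be demonstrated with plain Python. The tuples below mirror the (success, info) shape shown in the comment above; the failing entry is made up for illustration:

```python
# Simulated download results as the pipeline would receive them.
results = [
    (True,  {'url': 'http://p2.so.qhimgs1.com/t01b866193d9b2101de.jpg',
             'path': 't01b866193d9b2101de.jpg',
             'checksum': 'e074b5cbacd22ac38480d84506fedf02'}),
    (False, ValueError('download failed')),   # hypothetical failed download
]

# Same list comprehension as item_completed(): keep successes only.
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)   # -> ['t01b866193d9b2101de.jpg']
```

Because the failed tuple is filtered out, an item whose every download failed yields an empty list, and the pipeline raises DropItem for it.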
    

    Note: the image pipeline should have a higher priority (a smaller ITEM_PIPELINES number) than the pipeline that writes items to the database, so failed items are dropped before being stored.

