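Scrapy ships with a built-in pipeline, ImagesPipeline, dedicated to saving images to local disk.
However, the path it saves to (full/<sha1 of the URL>.jpg under IMAGES_STORE) cannot be changed through settings, so we need to write our own pipeline to store the images.
First, let's analyze our requirements:
1. Change the image path so that it varies with the data in the scraped item;
2. Replace the image URL that gets saved to the database with our local file path.
To begin, we subclass the original pipeline: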
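import hashlib
import time

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes

# LiveItem, used further down, is this project's own Item class
# and must be imported from the project's items module.


class DownloadImagesPipeline(ImagesPipeline):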
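A custom pipeline only runs once it is registered in the project settings. The fragment below is a minimal sketch of that registration; the module path myproject.pipelines, the priority 300, and the directory /data/images are assumptions for illustration, not values taken from the original project.

```python
# settings.py (sketch; the names below are assumptions, adjust to your project)

# Register the custom pipeline. 'myproject.pipelines' is a hypothetical module path.
ITEM_PIPELINES = {
    'myproject.pipelines.DownloadImagesPipeline': 300,
}

# Root directory under which the pipeline stores images.
# Whatever file_path() returns is joined onto this directory.
IMAGES_STORE = '/data/images'
```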
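Next, we can look at the source code to see which parts need to change.
First is the file_path method, which returns the path the image is stored under: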
    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
        return 'full/%s.jpg' % (image_guid)
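To make the default naming concrete, the last two lines of the method above can be reproduced with the standard library alone. This is only a sketch: default_image_path is a hypothetical helper name, and str.encode stands in for Scrapy's to_bytes.

```python
import hashlib


def default_image_path(url):
    """Mimic the default naming: 'full/' + SHA-1 hex digest of the URL + '.jpg'."""
    # Hypothetical helper; url.encode('utf-8') stands in for scrapy's to_bytes(url).
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid


path = default_image_path('https://rpic.douyucdn.cn/asrpic/180918/5070841_1710.jpg/dy1')
print(path)  # always under 'full/', whatever the item contains
```

The path depends only on the URL, which is why the folder cannot follow any field of the item without overriding the method.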
Next is the item_completed method, which returns the item:
    def item_completed(self, results, item, info):
        if isinstance(item, dict) or self.images_result_field in item.fields:
            item[self.images_result_field] = [x for ok, x in results if ok]
        return item
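Finally there is the method that builds the download requests, get_media_requests. This is where we need to pass in data from the item so it can be used to name the folder: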
    def get_media_requests(self, item, info):
        return [Request(x) for x in item.get(self.images_urls_field, [])]
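Now let's override these three methods.
Start with get_media_requests, which passes along the value used for the folder name. A check is added to avoid errors, and return has been changed to yield (return would work as well). The main job of this method is to validate fetch_date and hand it to file_path through the request's meta: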
    def get_media_requests(self, item, info):
        if isinstance(item, LiveItem) and item.get('image') and item.get('fetch_date'):
            yield Request(item['image'].replace('\\', '/'),
                          meta={'fetch_date': item.get('fetch_date')})
Next is file_path. We only need to copy the source over and change the storage path:
    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        # Folder name: fetch_date converted to a Unix timestamp
        folder = int(time.mktime(time.strptime(request.meta['fetch_date'], "%Y-%m-%d %H:%M:%S")))
        return '%s/%s.jpg' % (folder, image_guid)
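The only real change in the method above is the final path expression, so it can be tried out in isolation without Scrapy. The following is a sketch under a few assumptions: dated_image_path and FETCH_DATE_FORMAT are hypothetical names, the example.com URL is a placeholder, and str.encode stands in for to_bytes.

```python
import hashlib
import time

FETCH_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def dated_image_path(url, fetch_date):
    """Build '<unix timestamp of fetch_date>/<sha1 of url>.jpg'."""
    # strptime parses the string; mktime interprets the result in the
    # machine's local timezone and returns seconds since the epoch.
    folder = int(time.mktime(time.strptime(fetch_date, FETCH_DATE_FORMAT)))
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return '%s/%s.jpg' % (folder, image_guid)


path = dated_image_path('https://example.com/a.jpg', '2018-09-18 17:11:58')
folder, filename = path.split('/')
print(folder, filename)
```

Two things follow from this expression: the folder name depends on the local timezone of the machine running the spider, and a fetch_date in any other format makes strptime raise ValueError, so the check in get_media_requests only guards against a missing value, not a malformed one.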
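Once the images have been downloaded, a list of (success, info) tuples (that is, results) is passed to the item_completed method. It carries some information about each image, and we can print it to take a look: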
[(True, {'url': 'https://rpic.douyucdn.cn/asrpic/180918/5070841_1710.jpg/dy1', 'path': '1537261918/7ccaf3dbc7aef44c597cbd1ec4f01ca2fe1995c5.jpg', 'checksum': '92eeb26633a9631ba457f4f524b2d8c2'})]
So here we can simply replace the URL stored in the item with the value of path:
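    def item_completed(self, results, item, info):
        # Collect the local path of every successful download.
        # The loop variable is named result so it does not shadow the info argument.
        image_paths = [result.get('path', None) for success, result in results if success and result]
        if not image_paths:
            return item
        if isinstance(item, LiveItem):
            item['image'] = u''.join(image_paths)
        return item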
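The filtering step can be checked against the printed results shown earlier, again without Scrapy. This is a sketch: local_image_path is a hypothetical helper, and the plain Exception is a placeholder for the Failure object that Scrapy passes for a failed download.

```python
def local_image_path(results):
    """Join the 'path' of every successful download; '' if none succeeded."""
    paths = [data['path'] for ok, data in results if ok and data]
    return ''.join(paths)


results = [
    (True, {'url': 'https://rpic.douyucdn.cn/asrpic/180918/5070841_1710.jpg/dy1',
            'path': '1537261918/7ccaf3dbc7aef44c597cbd1ec4f01ca2fe1995c5.jpg',
            'checksum': '92eeb26633a9631ba457f4f524b2d8c2'}),
    (False, Exception('download failed')),  # placeholder for a failed download
]

print(local_image_path(results))
print(repr(local_image_path([(False, Exception('download failed'))])))
```

Because get_media_requests yields a single request per item, at most one path is joined here. With several images per item the paths would be concatenated without a separator, so a list or a delimiter would be the safer choice in that case.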