When using Scrapy pipelines, we often export items as CSV, JSON, JSON Lines, and other formats. Writing a separate exporter class every time is tedious.
Here I've put together a single pipelines file that supports multiple formats.
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import logging

from scrapy import signals
from scrapy.exporters import (
    JsonItemExporter,
    JsonLinesItemExporter,
    XmlItemExporter,
    CsvItemExporter,
    PickleItemExporter,
    MarshalItemExporter,
    PprintItemExporter,
)

logger = logging.getLogger(__name__)


class BaseExportPipeLine(object):
    def __init__(self, **kwargs):
        self.files = {}
        self.exporter = kwargs.pop("exporter", None)
        self.dst = kwargs.pop("dst", None)
        self.option = kwargs  # remaining kwargs are passed to the exporter

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open(self.dst, 'wb')
        self.files[spider] = file
        self.exporter = self.exporter(file, **self.option)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


# Options supported by the pipelines below:
# 'fields_to_export': ["url", "edit_url", "title"]  export only these fields; all pipelines support this
# 'export_empty_fields': False  whether to export empty fields; all pipelines support this
# 'encoding': 'utf-8'  default encoding; all pipelines support this
# 'indent': 1  indentation level, mainly for JsonExportPipeline
# "item_element": "item"  name of each XML item node; XmlExportPipeline only, produces <item></item>
# "root_element": "items"  name of the XML root element; XmlExportPipeline only, produces <items>...(many items)...</items>
# "include_headers_line": True  whether to include the header row; CsvExportPipeline only
# "join_multivalued": ","  separator used to join multi-valued fields in CSV; CsvExportPipeline only
# 'protocol': 2  pickle protocol version; PickleExportPipeline only
# "dst": "items.json"  destination file path


class JsonExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": JsonItemExporter, "dst": "items.json",
                  "encoding": "utf-8", "indent": 4}
        super(JsonExportPipeline, self).__init__(**option)


class JsonLinesExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": JsonLinesItemExporter, "dst": "items.jl",
                  "encoding": "utf-8"}
        super(JsonLinesExportPipeline, self).__init__(**option)


class XmlExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": XmlItemExporter, "dst": "items.xml",
                  "item_element": "item", "root_element": "items",
                  "encoding": "utf-8"}
        super(XmlExportPipeline, self).__init__(**option)


class CsvExportPipeline(BaseExportPipeLine):
    def __init__(self):
        # Note: setting the separator this way did not work in my tests
        option = {"exporter": CsvItemExporter, "dst": "items.csv",
                  "encoding": "utf-8", "include_headers_line": True,
                  "join_multivalued": ","}
        super(CsvExportPipeline, self).__init__(**option)


class PickleExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": PickleItemExporter, "dst": "items.pickle",
                  'protocol': 2}
        super(PickleExportPipeline, self).__init__(**option)


class MarshalExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": MarshalItemExporter, "dst": "items.marshal"}
        super(MarshalExportPipeline, self).__init__(**option)


class PprintExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": PprintItemExporter, "dst": "items.pprint.jl"}
        super(PprintExportPipeline, self).__init__(**option)
```
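Every pipeline above relies on the same exporter lifecycle: open a file on `spider_opened`, call `start_exporting`, call `export_item` once per item, and call `finish_exporting` on close. To see the call sequence in isolation, here is a minimal stand-in that runs with only the standard library; `MiniJsonLinesExporter` is a hypothetical simplification written for illustration, not Scrapy's actual `JsonLinesItemExporter`:

```python
import io
import json


class MiniJsonLinesExporter:
    # Hypothetical stand-in for scrapy.exporters.JsonLinesItemExporter:
    # writes one JSON object per line to a binary file-like object.
    def __init__(self, file, encoding="utf-8", fields_to_export=None):
        self.file = file
        self.encoding = encoding
        self.fields_to_export = fields_to_export

    def start_exporting(self):
        pass  # JSON Lines needs no header or footer

    def export_item(self, item):
        if self.fields_to_export is not None:
            # Same idea as the 'fields_to_export' option: keep only listed fields
            item = {k: v for k, v in item.items() if k in self.fields_to_export}
        line = json.dumps(item, ensure_ascii=False).encode(self.encoding)
        self.file.write(line + b"\n")

    def finish_exporting(self):
        pass


# The exact call sequence BaseExportPipeLine performs:
buf = io.BytesIO()
exporter = MiniJsonLinesExporter(buf, fields_to_export=["url", "title"])
exporter.start_exporting()
exporter.export_item({"url": "http://example.com", "title": "demo", "body": "dropped"})
exporter.finish_exporting()
print(buf.getvalue().decode("utf-8"))
```

The "body" field is dropped because it is not in `fields_to_export`, which is exactly how the option behaves when passed through `self.option` to the real exporters.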
Once the classes above are defined, we can enable the ones we want in settings.py.
```python
ITEM_PIPELINES = {
    'ScrapyCnblogs.pipelines.PprintExportPipeline': 300,
    # 'ScrapyCnblogs.pipelines.JsonLinesExportPipeline': 302,
    # 'ScrapyCnblogs.pipelines.JsonExportPipeline': 303,
    # 'ScrapyCnblogs.pipelines.XmlExportPipeline': 304,
}
```
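Since `ITEM_PIPELINES` maps pipeline class paths to priorities, several exporters can also be enabled at once; each item passes through all of them in ascending priority order, so one crawl can produce several output files. A sketch of that configuration (the priority numbers here are illustrative; any values in the 0–1000 range work):

```python
# settings.py — enable several export pipelines simultaneously;
# items flow through each in ascending priority order (300, then 302, then 304)
ITEM_PIPELINES = {
    'ScrapyCnblogs.pipelines.JsonExportPipeline': 300,  # writes items.json
    'ScrapyCnblogs.pipelines.CsvExportPipeline': 302,   # writes items.csv
    'ScrapyCnblogs.pipelines.XmlExportPipeline': 304,   # writes items.xml
}
```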
Pretty powerful, isn't it? If you're interested, the exporter source code is on GitHub: https://github.com/scrapy/scrapy/blob/master/scrapy/exporters.py
The exporter tests live here: https://github.com/scrapy/scrapy/blob/master/tests/test_exporters.py — the source is well worth a read.
For a complete working example, see my GitHub project: https://github.com/zhaojiedi1992/ScrapyCnblogs
