When using Scrapy pipelines, we often export items as CSV, JSON, JSON Lines, and other formats. Writing a separate exporter class every time is tedious.
Here I've put together a single pipelines file that supports multiple formats.
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import logging

from scrapy import signals
from scrapy.exporters import (
    JsonItemExporter,
    JsonLinesItemExporter,
    XmlItemExporter,
    CsvItemExporter,
    PickleItemExporter,
    MarshalItemExporter,
    PprintItemExporter,
)

logger = logging.getLogger(__name__)


class BaseExportPipeLine(object):
    def __init__(self, **kwargs):
        self.files = {}
        self.exporter = kwargs.pop("exporter", None)
        self.dst = kwargs.pop("dst", None)
        self.option = kwargs  # remaining kwargs are passed to the exporter

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open(self.dst, 'wb')
        self.files[spider] = file
        self.exporter = self.exporter(file, **self.option)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


# Options supported by the pipelines below:
# 'fields_to_export': ["url", "edit_url", "title"]  export only these fields; all pipelines support this
# 'export_empty_fields': False  whether to export empty fields; all pipelines support this
# 'encoding': 'utf-8'  default encoding; all pipelines support this
# 'indent': 1  indentation level, mainly for JsonExportPipeline
# "item_element": "item"  name of each XML item node; XmlExportPipeline only, produces <item></item>
# "root_element": "items"  name of the XML root element; XmlExportPipeline only, produces <items>...(many items)...</items>
# "include_headers_line": True  whether to include the header row; CsvExportPipeline only
# "join_multivalued": ","  separator used to join multi-valued fields in CSV; CsvExportPipeline only
# 'protocol': 2  pickle protocol version; PickleExportPipeline only
# "dst": "items.json"  destination file path


class JsonExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": JsonItemExporter, "dst": "items.json",
                  "encoding": "utf-8", "indent": 4}
        super(JsonExportPipeline, self).__init__(**option)


class JsonLinesExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": JsonLinesItemExporter, "dst": "items.jl",
                  "encoding": "utf-8"}
        super(JsonLinesExportPipeline, self).__init__(**option)


class XmlExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": XmlItemExporter, "dst": "items.xml",
                  "item_element": "item", "root_element": "items",
                  "encoding": "utf-8"}
        super(XmlExportPipeline, self).__init__(**option)


class CsvExportPipeline(BaseExportPipeLine):
    def __init__(self):
        # Note: setting the separator this way did not work in my tests
        option = {"exporter": CsvItemExporter, "dst": "items.csv",
                  "encoding": "utf-8", "include_headers_line": True,
                  "join_multivalued": ","}
        super(CsvExportPipeline, self).__init__(**option)


class PickleExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": PickleItemExporter, "dst": "items.pickle",
                  'protocol': 2}
        super(PickleExportPipeline, self).__init__(**option)


class MarshalExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": MarshalItemExporter, "dst": "items.marshal"}
        super(MarshalExportPipeline, self).__init__(**option)


class PprintExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option = {"exporter": PprintItemExporter, "dst": "items.pprint.jl"}
        super(PprintExportPipeline, self).__init__(**option)
```
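Every pipeline above relies on the same exporter lifecycle: open a file on `spider_opened`, call `start_exporting`, call `export_item` once per item, and call `finish_exporting` on close. To see the call sequence in isolation, here is a minimal stand-in that runs with only the standard library; `MiniJsonLinesExporter` is a hypothetical simplification written for illustration, not Scrapy's actual `JsonLinesItemExporter`:

```python
import io
import json


class MiniJsonLinesExporter:
    # Hypothetical stand-in for scrapy.exporters.JsonLinesItemExporter:
    # writes one JSON object per line to a binary file-like object.
    def __init__(self, file, encoding="utf-8", fields_to_export=None):
        self.file = file
        self.encoding = encoding
        self.fields_to_export = fields_to_export

    def start_exporting(self):
        pass  # JSON Lines needs no header or footer

    def export_item(self, item):
        if self.fields_to_export is not None:
            # Same idea as the 'fields_to_export' option: keep only listed fields
            item = {k: v for k, v in item.items() if k in self.fields_to_export}
        line = json.dumps(item, ensure_ascii=False).encode(self.encoding)
        self.file.write(line + b"\n")

    def finish_exporting(self):
        pass


# The exact call sequence BaseExportPipeLine performs:
buf = io.BytesIO()
exporter = MiniJsonLinesExporter(buf, fields_to_export=["url", "title"])
exporter.start_exporting()
exporter.export_item({"url": "http://example.com", "title": "demo", "body": "dropped"})
exporter.finish_exporting()
print(buf.getvalue().decode("utf-8"))
```

The "body" field is dropped because it is not in `fields_to_export`, which is exactly how the option behaves when passed through `self.option` to the real exporters.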
Once the classes above are defined, we can enable the ones we want in settings.py.
```python
ITEM_PIPELINES = {
    'ScrapyCnblogs.pipelines.PprintExportPipeline': 300,
    # 'ScrapyCnblogs.pipelines.JsonLinesExportPipeline': 302,
    # 'ScrapyCnblogs.pipelines.JsonExportPipeline': 303,
    # 'ScrapyCnblogs.pipelines.XmlExportPipeline': 304,
}
```
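Since `ITEM_PIPELINES` maps pipeline class paths to priorities, several exporters can also be enabled at once; each item passes through all of them in ascending priority order, so one crawl can produce several output files. A sketch of that configuration (the priority numbers here are illustrative; any values in the 0–1000 range work):

```python
# settings.py — enable several export pipelines simultaneously;
# items flow through each in ascending priority order (300, then 302, then 304)
ITEM_PIPELINES = {
    'ScrapyCnblogs.pipelines.JsonExportPipeline': 300,  # writes items.json
    'ScrapyCnblogs.pipelines.CsvExportPipeline': 302,   # writes items.csv
    'ScrapyCnblogs.pipelines.XmlExportPipeline': 304,   # writes items.xml
}
```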
Pretty powerful, isn't it? If you're interested, the exporter source code is on GitHub: https://github.com/scrapy/scrapy/blob/master/scrapy/exporters.py
The exporter tests live here: https://github.com/scrapy/scrapy/blob/master/tests/test_exporters.py — the source is well worth a read.
For a complete working example, see my GitHub project: https://github.com/zhaojiedi1992/ScrapyCnblogs
