How to Write Crawled Data into Elasticsearch


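The previous chapters covered Elasticsearch (ES) concepts. This one shows how to write the crawled data into ES. The first thing to know is that the high-level Python interface to ES is called elasticsearch-dsl.

Link: https://github.com/elastic/elasticsearch-dsl-py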

 

What is elasticsearch-dsl:

Elasticsearch DSL is a high-level library whose purpose is to help you write and run queries against Elasticsearch.
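
To make "writing queries" a little more concrete, here is a minimal sketch of the kind of request body the library assembles for you. It uses plain dictionaries only; the helper name build_match_query, the field name and the search text are illustrative assumptions, not part of the library. With elasticsearch-dsl installed, Search().query("match", title="python").to_dict() should give the same structure.

```python
import json


def build_match_query(field, text):
    """Return the raw Elasticsearch request body for a simple match query."""
    return {"query": {"match": {field: text}}}


# Illustrative values: search the "title" field for the word "python".
body = build_match_query("title", "python")
print(json.dumps(body, indent=2))
```

Writing such nested dictionaries by hand gets error-prone as queries grow, which is the problem the DSL library is meant to ease.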

Installation:

pip install elasticsearch-dsl
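
The installed version matters, because the DocType class used in this post belongs to older releases of the library (newer releases call it Document). A small sketch for checking what is installed, using only the standard library; the helper name installed_version is an assumption made for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("elasticsearch-dsl")
print(version if version else "elasticsearch-dsl is not installed")
```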

 

First, create a package named models in the project directory, then create a file named es_types.py inside it. This file defines the ES document types.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from elasticsearch_dsl import Date, DocType, Integer, Keyword, Text
from elasticsearch_dsl.connections import connections

# Register the default connection used by the document class below.
connections.create_connection(hosts=["localhost"])


class ArticleType(DocType):
    # Jobbole (伯樂在線) article type.
    # DocType is the class name in older elasticsearch-dsl releases; newer ones call it Document.
    # The ik_max_word analyzer requires the IK analysis plugin on the ES server.
    title = Text(analyzer="ik_max_word")
    create_date = Date()
    url = Keyword()
    url_object_id = Keyword()
    front_image_url = Keyword()
    front_image_path = Keyword()
    praise_nums = Integer()
    comment_nums = Integer()
    fav_nums = Integer()
    tags = Text(analyzer="ik_max_word")
    content = Text(analyzer="ik_max_word")

    class Meta:
        index = "jobbile"
        doc_type = "article"


if __name__ == "__main__":
    # Create the index and its mapping in ES.
    ArticleType.init()
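
Calling init() creates the index together with a mapping derived from the class fields. As a rough illustration, the field declarations above correspond to a mapping along the following lines, written here as a plain dictionary. This is a hand-written approximation for reading purposes, not output captured from the library, and the variable names are assumptions.

```python
import json

# Hand-written approximation of the mapping for the article type.
TEXT_IK = {"type": "text", "analyzer": "ik_max_word"}  # analyzed with the IK plugin
KEYWORD = {"type": "keyword"}                          # stored as-is, exact match only
INTEGER = {"type": "integer"}

article_mapping = {
    "properties": {
        "title": TEXT_IK,
        "create_date": {"type": "date"},
        "url": KEYWORD,
        "url_object_id": KEYWORD,
        "front_image_url": KEYWORD,
        "front_image_path": KEYWORD,
        "praise_nums": INTEGER,
        "comment_nums": INTEGER,
        "fav_nums": INTEGER,
        "tags": TEXT_IK,
        "content": TEXT_IK,
    }
}

print(json.dumps(article_mapping, indent=2))
```

The split between Text and Keyword is the main design choice: title, tags and content are tokenized for full-text search, while URLs and paths are kept whole.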

Next, add the following method to the article item class in items.py:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from w3lib.html import remove_tags

from models.es_types import ArticleType


# Method of the article item class: copies the item's fields into an ES document and saves it.
def save_to_es(self):
    article = ArticleType()
    article.title = self['title']
    article.create_date = self['create_date']
    article.content = remove_tags(self['content'])  # strip HTML tags before indexing
    article.front_image_url = self['front_image']
    if "front_image_path" in self:
        article.front_image_path = self['front_image_path']
    article.praise_nums = self['praise_nums']
    article.fav_nums = self['fav_nums']
    article.comment_nums = self['comment_nums']
    article.url = self['url']
    article.tags = self['tags']
    article.meta.id = self['url_object_id']  # use url_object_id as the ES document id

    article.save()
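
The method above needs a running ES server, so here is a standard-library sketch of the same field-copying logic that can be run on its own. It is only an approximation: strip_tags is a stand-in for w3lib's remove_tags built on html.parser, and item_to_document, the reduced field set and the sample values are all assumptions made for this example.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores tags (stand-in for w3lib's remove_tags)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


def item_to_document(item):
    """Return (doc_id, body) for a dict-like item, mirroring save_to_es."""
    body = {
        "title": item["title"],
        "content": strip_tags(item["content"]),
        "url": item["url"],
    }
    # Optional field: copied only when it is present on the item.
    if "front_image_path" in item:
        body["front_image_path"] = item["front_image_path"]
    return item["url_object_id"], body


# Made-up sample item.
sample = {
    "title": "Hello",
    "content": "<p>Hello <b>ES</b></p>",
    "url": "http://example.com/1",
    "url_object_id": "abc123",
}
doc_id, body = item_to_document(sample)
print(doc_id, body)
```

Using url_object_id as the document id means that crawling the same URL again overwrites the existing document instead of creating a duplicate.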

Then write the following in pipelines.py:

#!/usr/bin/env python
# -*- coding:utf-8 -*-


class ElasticsearchPipeline(object):
    # Writes each item to ES.
    def process_item(self, item, spider):
        # The item converts itself into an ES document and saves it.
        item.save_to_es()
        return item
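
To see how the pipeline and the item cooperate without starting Scrapy or ES, the flow can be sketched with a fake item. FakeItem is a hypothetical stand-in that only records that save_to_es was called; the pipeline class is repeated so the snippet runs on its own.

```python
class FakeItem(dict):
    """Hypothetical stand-in for the Scrapy item; records the call instead of contacting ES."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.saved = False

    def save_to_es(self):
        self.saved = True


class ElasticsearchPipeline(object):
    def process_item(self, item, spider):
        item.save_to_es()
        return item  # returning the item hands it on to the next pipeline


item = FakeItem(title="Hello")
result = ElasticsearchPipeline().process_item(item, spider=None)
print(result is item, item.saved)
```

Keeping the ES logic on the item leaves the pipeline generic: any item class that provides save_to_es can pass through it.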

Finally, add the following to settings.py:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ElasticsearchPipeline': 300,
}
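
The number in ITEM_PIPELINES sets the order: Scrapy passes items through pipelines from the lowest value to the highest, and values are customarily chosen in the 0-1000 range. A small sketch of that ordering rule; the second pipeline path is hypothetical and is there only to have something to sort.

```python
ITEM_PIPELINES = {
    "ArticleSpider.pipelines.ElasticsearchPipeline": 300,
    # Hypothetical second pipeline, added only to show the ordering.
    "ArticleSpider.pipelines.ExamplePipeline": 100,
}

# Lower numbers run first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

A pipeline that prepares data the ES document depends on (such as one that sets front_image_path) would therefore need a smaller number than ElasticsearchPipeline.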

 

