Solr 6.6: Importing pdf/doc/txt/json/csv/xml Files

Reposted article, originally published 2017-11-28 10:26. Part of the Solr search engine series.

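　　　　This article describes how to import files with the Dataimport tool in the Solr admin UI, covering pdf, doc, txt, json, csv, and xml files, and looks at how the indexing results differ. The key is three configuration files: managed-schema, solrconfig.xml, and data-config.xml (which has to be created).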

　　1. Create a core

　　　　Start Solr and create mycore:

　　　　solr start

　　　　solr create -c mycore

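　　2. Modify the configuration

　　　2.1 Create the data-config.xml file

　　　　Locate the mycore folder that was just created, solr-6.6.0\server\solr\mycore, and create a data-config.xml file in its conf folder. For reference, see the contents of solr-6.6.0\example\example-DIH\solr\tika\conf\tika-data-config.xml: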

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="file" processor="FileListEntityProcessor" dataSource="null"
            baseDir="${solr.install.dir}/example/exampledocs" fileName=".*pdf"
            rootEntity="false">

      <field column="file" name="id"/>

      <entity name="pdf" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">

        <field column="Author" name="author" meta="true"/>
        <!-- in the original PDF, the Author meta-field name is upper-cased,
          but in Solr schema it is lower-cased
         -->

        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>

        <field column="text" name="text"/>

      </entity>
    </entity>
  </document>
</dataConfig>

　　　　Modify it as follows:

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="file" processor="FileListEntityProcessor" dataSource="null"
            baseDir="D:/work/Solr/Import" fileName=".*\.(doc|docx|pdf|txt|csv|json|xml|ppt|pptx|xls|xlsx)$"
            rootEntity="false">

      <field column="file" name="id"/>
      <field column="fileSize" name="fileSize"/>
      <field column="fileLastModified" name="fileLastModified"/>
      <field column="fileAbsolutePath" name="fileAbsolutePath"/>
      <entity name="pdf" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">

        <field column="Author" name="author" meta="true"/>
        <!-- in the original PDF, the Author meta-field name is upper-cased,
          but in Solr schema it is lower-cased
         -->

        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>

      </entity>
    </entity>
  </document>
</dataConfig>

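　　　　fileName: (required) a regular expression used to match file names.

　　　　baseDir: (required) the directory containing the files.

　　　　recursive: whether to fetch files recursively; the default is false.

　　　　rootEntity: must be false here (unless you only want to index the file names). By default, the entities directly under the document element are root entities; if an entity is marked as not being a root entity, the entities directly beneath it are treated as root entities instead. For each row returned for a root entity (a database row in the JDBC case), Solr creates one document.

　　　　dataSource: if you are using Solr 1.3, this must be set to "null", because this processor does not use any DataSource. It does not need to be specified in Solr 1.4; it only means that no DataSource instance is created. In most cases there is only one DataSource (a JdbcDataSource), and a DataSource is not required when using FileListEntityProcessor.

　　　　processor: required only when the data source is not an RDBMS.

　　　　onError: the default is "abort"; "skip" skips the current document; "continue" carries on as if the error had not happened.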
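　　　　The optional attributes above can be combined with the configuration shown earlier. The following is a minimal sketch, not a tested configuration: it assumes the same baseDir as in this article, restricts the pattern to pdf/doc/docx for brevity, and uses a hypothetical inner entity name "doc". recursive="true" descends into subfolders, and onError="skip" on the Tika entity skips a file that cannot be parsed instead of aborting the whole import.

```xml
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <!-- Outer entity: lists files; not a root entity, so it creates no documents itself -->
    <entity name="file" processor="FileListEntityProcessor" dataSource="null"
            baseDir="D:/work/Solr/Import" fileName=".*\.(pdf|doc|docx)$"
            recursive="true" rootEntity="false">

      <field column="file" name="id"/>

      <!-- Inner entity ("doc" is an illustrative name): one Solr document per file;
           unparseable files are skipped rather than aborting the import -->
      <entity name="doc" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text" onError="skip">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```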

　　2.2 Modify the solrconfig.xml file

　　　　Add the following:

 <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
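　　　　The DataImportHandler and TikaEntityProcessor classes are shipped as contrib jars, so solrconfig.xml usually also needs <lib> directives that load them. The following is a sketch that assumes the standard Solr 6.6 distribution layout; the paths are assumptions and should be adjusted to the actual installation. Depending on the config set the core was created from, the Tika line may already be present.

```xml
<!-- Assumed paths, relative to the Solr install directory -->
<!-- DataImportHandler jars, including the extras jar that provides TikaEntityProcessor -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar"/>
<!-- Apache Tika and its dependencies, used to parse pdf/doc/etc. -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar"/>
```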

　　2.3 Modify managed-schema

　　　　Configure Chinese word segmentation (mmseg4j); for details see http://www.cnblogs.com/shaosks/p/7843218.html. Then add the following:

<!-- mmseg4j fieldType-->
  <fieldType name="text_mmseg4j_complex" class="solr.TextField" positionIncrementGap="100" >
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" />
    </analyzer>
  </fieldType>
  <fieldType name="text_mmseg4j_maxword" class="solr.TextField" positionIncrementGap="100" >
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" />
    </analyzer>
  </fieldType>
  <fieldType name="text_mmseg4j_simple" class="solr.TextField" positionIncrementGap="100" >
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" />
    </analyzer>
  </fieldType>

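　　　　Add the following fields. The id field exists by default and does not need to be created. Note that the title and text fields use the text_mmseg4j_complex type defined above.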

  <field name="title" type="text_mmseg4j_complex" indexed="true" stored="true"/>
  <field name="text" type="text_mmseg4j_complex" indexed="true" stored="true" omitNorms="true"/>
  <field name="author" type="string" indexed="true" stored="true"/>
  <field name="fileSize" type="long" indexed="true" stored="true"/>
  <field name="fileLastModified" type="date" indexed="true" stored="true"/>
  <field name="fileAbsolutePath" type="string" indexed="true" stored="true"/>