Solr + jieba (Jieba Chinese word segmentation)

This article is a repost. Published 2018-02-02 12:40. Tags: jieba, solr, Jieba word segmentation, search

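Why choose jieba

Segmentation is efficient
The corpus was built with jieba (Python), so using the same segmenter at search time keeps tokenization consistent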

Java version of jieba

Download

git clone https://github.com/huaban/jieba-analysis

Build

cd jieba-analysis
mvn install

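Note

If your Maven version is relatively new, you need to edit pom.xml and add an extra block before plugins (the block to add is not reproduced here).

Solr tokenizer versions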

https://github.com/sing1ee/analyzer-solr (solr 5)
https://github.com/sing1ee/jieba-solr.git (solr 4)

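Supporting Solr 6, 7, or later

If, like mine, your Solr is a newer version, the code needs a few changes, but they are minor. (Just fix whatever errors the compiler reports.)

Diff of build.gradle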

diff --git a/build.gradle b/build.gradle
index 2a87525..06c5cc3 100644
--- a/build.gradle
+++ b/build.gradle
@@ -1,4 +1,4 @@
-group = 'analyzer.solr5'
+group = 'analyzer.solr7'
version = '1.0'
apply plugin: 'java'
apply plugin: "eclipse"
@@ -14,15 +14,14 @@ repositories {
dependencies {
testCompile group: 'junit', name: 'junit', version: '4.11'

- compile("org.apache.lucene:lucene-core:5.0.0")
- compile("org.apache.lucene:lucene-queryparser:5.0.0")
- compile("org.apache.lucene:lucene-analyzers-common:5.0.0")
- compile('com.huaban:jieba-analysis:1.0.0')
-// compile("org.fnlp:fnlp-core:2.0-SNAPSHOT")
+ compile("org.apache.lucene:lucene-core:7.1.0")
+ compile("org.apache.lucene:lucene-queryparser:7.1.0")
+ compile("org.apache.lucene:lucene-analyzers-common:7.1.0")
+ compile files('libs/jieba-analysis-1.0.3.jar')
compile("edu.stanford.nlp:stanford-corenlp:3.5.1")
}

task "create-dirs" << {
sourceSets*.java.srcDirs*.each { it.mkdirs() }
sourceSets*.resources.srcDirs*.each { it.mkdirs() }
-}
\ No newline at end of file
+}

Build

./gradlew build

Integrating into Solr

Copy the jar files into this directory under the Solr installation: server/solr-webapp/webapp/WEB-INF/lib
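
The copy step can be scripted. The following is a minimal sketch, not part of the original article: the function name copy_jars and the example paths in the usage note are hypothetical, and it assumes the Solr installation root contains the server/solr-webapp/webapp/WEB-INF/lib directory named above.

```python
import shutil
from pathlib import Path

# Relative location of Solr's webapp library directory, as given in the text.
WEBAPP_LIB = Path("server/solr-webapp/webapp/WEB-INF/lib")


def copy_jars(jar_paths, solr_root):
    """Copy each jar into <solr_root>/server/solr-webapp/webapp/WEB-INF/lib.

    copy_jars is a hypothetical helper; it returns the destination paths.
    """
    target_dir = Path(solr_root) / WEBAPP_LIB
    if not target_dir.is_dir():
        # Fail early instead of creating a directory in the wrong place.
        raise FileNotFoundError(f"missing Solr lib directory: {target_dir}")
    copied = []
    for jar in map(Path, jar_paths):
        if jar.suffix != ".jar":
            raise ValueError(f"not a jar file: {jar}")
        copied.append(Path(shutil.copy2(jar, target_dir)))
    return copied
```

A call might look like copy_jars(["libs/jieba-analysis-1.0.3.jar", "build/libs/<analyzer jar>"], "/opt/solr"), where both the analyzer jar name and /opt/solr are placeholders for your own build output and install path. Solr loads these jars at startup, so it needs a restart after copying.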

Schema changes

    <fieldType name="text_jieba" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="analyzer.solr7.jieba.JiebaTokenizerFactory"  segMode="SEARCH"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ch.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="analyzer.solr7.jieba.JiebaTokenizerFactory"  segMode="SEARCH"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ch.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>
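
To check that the text_jieba field type is active after a restart, one option is Solr's field analysis handler (/analysis/field). The sketch below is an illustration, not part of the original article: the function names, the base URL, and the core name are hypothetical, and index_tokens assumes the JSON response lists the index-time stages as a flat array alternating stage name and token list.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_analysis_url(base_url, core, field_type, text):
    """Build a field-analysis URL for one field type and one piece of text."""
    params = urlencode({
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
        "wt": "json",
    })
    return f"{base_url.rstrip('/')}/{core}/analysis/field?{params}"


def index_tokens(response, field_type):
    """Return the token texts produced by the last index-time stage.

    Assumed layout (verify against your Solr version):
    [stage_name, [tokens...], stage_name, [tokens...], ...]
    """
    stages = response["analysis"]["field_types"][field_type]["index"]
    last_stage = stages[-1]  # token list emitted by the final filter
    return [token["text"] for token in last_stage]


def analyze(base_url, core, field_type, text):
    """Query a running Solr instance and return the final index-time tokens."""
    url = build_analysis_url(base_url, core, field_type, text)
    with urlopen(url) as resp:
        return index_tokens(json.load(resp), field_type)
```

With a running instance, analyze("http://localhost:8983/solr", "mycore", "text_jieba", "結巴分詞") would return the tokens that end up in the index; the URL and the core name mycore are placeholders. If the tokenizer class cannot be loaded, the core typically fails to start, which points back to the jar copy step.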
