集成到ElasticSearch

git clone git@github.com:hongfuli/elasticsearch-analysis-jieba.git
cd elasticsearch-analysis-jieba
mvn package

把release/elasticsearch-analysis-jieba-{version}.zip文件解壓到 elasticsearch 的 plugins 目錄下，重啟elasticsearch即可。

直接使用Tokenizer分詞

可直接使用 com.github.hongfuli.jieba.Tokenizer 對文本字符進行分詞，方法參數完全和 jieba python 一致。

imort com.github.hongfuli.jieba.Tokenizer

Tokenizer t = new Tokenizer();
t.cut("這是一個伸手不見五指的黑夜。我叫孫悟空，我愛北京，我愛Python和C++。", false, true);

集成到Lucene

import com.github.hongfuli.jieba.lucene.JiebaAnalyzer;

Analyzer analyzer = new JiebaAnalyzer();
try(TokenStream ts = analyzer.tokenStream("field", "這是一個伸手不見五指的黑夜。我叫孫悟空，我愛北京，我愛Python和C++。")) {
      StringBuilder b = new StringBuilder();
      CharTermAttribute termAtt = ts.getAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posIncAtt = ts.getAttribute(PositionIncrementAttribute.class);
      PositionLengthAttribute posLengthAtt = ts.getAttribute(PositionLengthAttribute.class);
      OffsetAttribute offsetAtt = ts.getAttribute(OffsetAttribute.class);
      assertNotNull(offsetAtt);
      ts.reset();
      int pos = -1;
      while (ts.incrementToken()) {
        pos += posIncAtt.getPositionIncrement();
        b.append(termAtt);
        b.append(" at pos=");
        b.append(pos);
        if (posLengthAtt != null) {
          b.append(" to pos=");
          b.append(pos + posLengthAtt.getPositionLength());
        }
        b.append(" offsets=");
        b.append(offsetAtt.startOffset());
        b.append('-');
        b.append(offsetAtt.endOffset());
        b.append('\n');
      }
      ts.end();
      return b.toString();
    }

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 elasticsearch配置jieba分詞 elasticsearch添加jieba分詞器分詞————jieba分詞（Python） ElasticSearch安裝分詞插件IK jieba GitHUb 結巴分詞 jieba分詞 jieba 分詞庫（python）運用jieba庫分詞結巴（jieba）分詞 python jieba分詞詞性 python 分詞庫jieba