Installing and Custom-Configuring a Chinese Analyzer for ES

Reposted article, dated 2020-07-12 17:31.

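So far we have created indices and queried data with the default analyzer, and the tokenization has not been ideal: it splits a text field into individual Chinese characters, and at search time the query string is split the same way. A smarter analyzer is needed, which is where the IK analyzer comes in.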

Downloading, installing, and testing the IK analyzer

First, download the plugin from https://github.com/medcl/elasticsearch-analysis-ik/releases . Pick the IK release that matches your ES version. I am using ES 6.8.10, so I downloaded the ik-6.8.10.zip file.

Unzip it and copy the files into <ES install directory>/plugins/ik.

That is all for the installation; there is no need to configure anything in Elasticsearch's elasticsearch.yml file.
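The unzip-and-copy step above can also be scripted. The following is a minimal sketch, not part of the original article: the function name `install_ik` and the paths in the usage note are hypothetical, and it assumes the plugin files sit at the top level of the archive.

```python
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home):
    """Extract the IK release archive into <es_home>/plugins/ik.

    Assumption: the archive keeps its files (jars, descriptor, config/)
    at the top level, so they can be extracted directly into the target.
    """
    target = Path(es_home) / "plugins" / "ik"
    # Create plugins/ik if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

A call might look like `install_ik("ik-6.8.10.zip", "/path/to/elasticsearch")`, with both arguments replaced by your own locations. Elasticsearch still has to be restarted afterwards for the plugin to load.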

Restart Elasticsearch.

Testing the result

Result without the IK analyzer

### Built-in (standard) analyzer
GET /_analyze
{
  "analyzer": "standard",
  "text": "中華人民共和國"
}

Result:

{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "華",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "人",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "民",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "共",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "和",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "國",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}

Result with ik_smart:

# ik_smart: performs the coarsest-grained split
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國"
}

Result:

{
  "tokens" : [
    {
      "token" : "中華人民共和國",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}

Result with ik_max_word, which splits the text at the finest granularity:

## ik_max_word: performs the finest-grained split
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國"
}

Result:

{
  "tokens" : [
    {
      "token" : "中華人民共和國",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中華人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中華",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "華人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民共和國",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共和國",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "共和",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "國",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 8
    }
  ]
}
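To compare analyzers outside Kibana, the same `_analyze` request can be sent from a script. This is a rough sketch rather than anything from the original article: it assumes a node reachable at http://localhost:9200, and the helper names are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed node address; adjust as needed


def build_analyze_request(text, analyzer, base_url=ES_URL):
    """Build (but do not send) a POST /_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Extract the token strings from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(text, analyzer, base_url=ES_URL):
    """Send the request and return the token strings (needs a running node)."""
    req = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(req) as resp:
        return token_texts(json.loads(resp.read().decode("utf-8")))
```

Calling `analyze("中華人民共和國", "ik_smart")` and `analyze("中華人民共和國", "ik_max_word")` side by side makes the difference in granularity easy to see as two plain lists.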

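Notes on the two results above:

If the IK analyzer is not installed and you write "analyzer": "ik_max_word", the request fails with an error, because that analyzer does not exist.
If the IK analyzer is installed but you do not specify it (that is, you leave out "analyzer": "ik_max_word"), the result is the same as without IK: the text is still split into individual Chinese characters.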
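Because IK is only used where it is named explicitly, a text field has to declare it in the index mapping. The sketch below builds such an index-creation body. It is an illustration under assumptions, not the article's own setup: the field name `title` and the function name are hypothetical, the layout with a mapping type (`_doc`) follows the ES 6.x format, and pairing `ik_max_word` for indexing with `ik_smart` for searching is one common choice rather than a requirement.

```python
import json


def build_ik_index_body(field, type_name="_doc"):
    """Return an index-creation body that applies IK to one text field.

    Assumption: ES 6.x mapping layout, where properties are nested under
    a mapping type name. ES 7.x and later drop that level.
    """
    return {
        "mappings": {
            type_name: {
                "properties": {
                    field: {
                        "type": "text",
                        # Analyzer used when documents are indexed.
                        "analyzer": "ik_max_word",
                        # Analyzer used on the query string at search time.
                        "search_analyzer": "ik_smart",
                    }
                }
            }
        }
    }


# Print the JSON so it can be pasted after a PUT /<index name> request.
print(json.dumps(build_ik_index_body("title"), indent=2))
```

The printed JSON can be used as the body of an index-creation request in the Kibana console.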

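Custom extension words

Trending terms and custom words are not in IK's dictionary, so we have to add them ourselves.
For example: 王者榮耀 (Honor of Kings).
The result below clearly does not meet our needs, so a custom dictionary entry is required.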

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "王者榮耀"
}

Result:

{
  "tokens" : [
    {
      "token" : "王者",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "榮耀",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}

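Go to <ES install directory>/plugins/ik/config and create a file named ext.dic there, with the following content:

王者榮耀

Then edit the IKAnalyzer.cfg.xml file in the same directory: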

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionary here -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- Configure your own extension stopword dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
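The two manual steps above (creating ext.dic and pointing `ext_dict` at it) can be double-checked with a small script. This is a sketch under assumptions: the function names are hypothetical, and it writes the dictionary as UTF-8 with one word per line, which is the format the IK plugin expects for extension dictionaries as far as I know.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def write_ext_dict(config_dir, words, name="ext.dic"):
    """Write one word per line, UTF-8 encoded, into the IK config directory."""
    path = Path(config_dir) / name
    path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return path


def ext_dict_entry(cfg_xml_path):
    """Return the value of the ext_dict entry in IKAnalyzer.cfg.xml.

    Returns an empty string when the entry is missing or empty.
    The file is only read, never rewritten, so its comments and
    DOCTYPE line are left untouched.
    """
    root = ET.parse(str(cfg_xml_path)).getroot()
    for entry in root.findall("entry"):
        if entry.get("key") == "ext_dict":
            return (entry.text or "").strip()
    return ""
```

For instance, `write_ext_dict(config_dir, ["王者榮耀"])` creates the dictionary, and `ext_dict_entry(config_dir / "IKAnalyzer.cfg.xml")` should then return `ext.dic` if the configuration was edited as shown, where `config_dir` is a `Path` to your own plugins/ik/config directory.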

Restart ES and run the same request again. Result:

{
  "tokens" : [
    {
      "token" : "王者榮耀",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}
