Using the IK Chinese Analyzer in Elasticsearch

Reposted article · 2019-11-07 11:06 · ElasticSearch

Using the IK analyzer

First, we send a GET request from Postman to see how es tokenizes a piece of text.

GET http://localhost:9200/_analyze
{
	"text":"農業銀行"
}

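The result is shown below. The default es analyzer does not recognize Chinese words such as 農業 (agriculture) and 銀行 (bank); it simply splits the text into one token per character, which clearly does not meet our needs.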

{
    "tokens": [
        {
            "token": "農",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "業",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "銀",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "行",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        }
    ]
}
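The same request can be sent from a script instead of Postman. The following is a minimal sketch using only the Python standard library, assuming es listens on http://localhost:9200 as in the request above. The helper names build_analyze_body and analyze are made up for this illustration, and the sketch uses POST (which _analyze also accepts) because sending a body with GET is awkward in urllib.

```python
import json
import urllib.request

# Assumption: a local es node, same address as in the Postman request.
ES_URL = "http://localhost:9200"


def build_analyze_body(text, analyzer=None):
    """Build the JSON body for _analyze; the analyzer field is optional."""
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    return body


def analyze(text, analyzer=None, base_url=ES_URL):
    """POST the body to _analyze and return the token strings in order."""
    data = json.dumps(build_analyze_body(text, analyzer)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

With a node running, analyze("農業銀行") should return the same four single-character tokens as the response above.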

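Next, go to https://github.com/medcl/elasticsearch-analysis-ik/releases and download the IK analyzer release that matches your es version. Put the extracted folder into the plugins directory under the es root directory, then restart es to start using it.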

This time we add a new parameter to the request: "analyzer":"ik_max_word"

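ik_max_word: splits the text at the finest granularity. For example, 「中華人民共和國國歌」 is split into 「中華人民共和國、中華人民、中華、華人、人民共和國、人民、人、民、共和國、共和、和、國國、國歌」, exhausting every possible combination.
ik_smart: splits the text at the coarsest granularity. For example, 「中華人民共和國國歌」 is split into 「中華人民共和國、國歌」.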
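The difference between the two granularities can be illustrated with a toy dictionary segmenter. This is only a sketch of the idea, not IK's actual algorithm. DICTIONARY is a hypothetical hand-picked word list, fine_grained emits every dictionary word found in the text, and coarse_grained does a greedy longest match. It will not reproduce IK's exact token lists.

```python
# Hypothetical toy dictionary for illustration; NOT IK's real word list.
DICTIONARY = {
    "中華人民共和國", "中華人民", "中華", "華人", "人民共和國",
    "人民", "共和國", "共和", "國歌", "農業銀行", "農業", "銀行",
}


def fine_grained(text, dictionary=DICTIONARY):
    """Emit every dictionary word in text, by start offset, longest first."""
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def coarse_grained(text, dictionary=DICTIONARY):
    """Greedy forward longest match; unknown characters become single tokens."""
    tokens = []
    start = 0
    while start < len(text):
        for end in range(len(text), start, -1):
            # A one-character slice always matches, so the loop always advances.
            if text[start:end] in dictionary or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens
```

With this toy dictionary, coarse_grained("中華人民共和國國歌") gives ['中華人民共和國', '國歌'], while fine_grained on the same text also returns the overlapping shorter words such as 中華人民, 華人 and 共和.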

GET http://localhost:9200/_analyze
{
	"analyzer":"ik_max_word",
	"text":"農業銀行"
}

The result is as follows:

{
    "tokens": [
        {
            "token": "農業銀行",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "農業",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "銀行",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}
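The start_offset and end_offset fields point back into the original text, which is why the ik_max_word tokens can overlap. The sketch below checks this on the response above. It assumes the offsets can be used directly as Python string indices, which holds for these characters since they are all in the Basic Multilingual Plane. token_spans is a helper name made up for this example.

```python
# The ik_max_word response shown above, copied as a Python dict.
RESPONSE = {
    "tokens": [
        {"token": "農業銀行", "start_offset": 0, "end_offset": 4,
         "type": "CN_WORD", "position": 0},
        {"token": "農業", "start_offset": 0, "end_offset": 2,
         "type": "CN_WORD", "position": 1},
        {"token": "銀行", "start_offset": 2, "end_offset": 4,
         "type": "CN_WORD", "position": 2},
    ]
}


def token_spans(text, response):
    """Pair each token with the slice of text that its offsets point at."""
    return [
        (t["token"], text[t["start_offset"]:t["end_offset"]])
        for t in response["tokens"]
    ]


for token, span in token_spans("農業銀行", RESPONSE):
    # Each token should equal the slice selected by its offsets.
    print(token, span, token == span)
```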

Search engines such as Baidu pick up new words every day, and the es dictionary can be extended with new words as well.

First, we analyze the term 弗雷爾卓德:

GET http://localhost:9200/_analyze
{
	"analyzer":"ik_max_word",
	"text":"弗雷爾卓德"
}

We only get one token per character. What we need is for the analyzer to recognize 弗雷爾卓德 as a word too.

{
    "tokens": [
        {
            "token": "弗",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "雷",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "爾",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "卓",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "德",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 4
        }
    ]
}

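To do that, go to the ik folder under the plugins directory in the es root directory, enter its config directory, create a file named custom.dic, and write 弗雷爾卓德 into it (one word per line, saved as UTF-8). Then open IKAnalyzer.cfg.xml, register the new custom.dic there, and restart es.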

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionaries here -->
	<entry key="ext_dict">custom.dic</entry>
	<!-- Configure your own extension stopword dictionaries here -->
	<entry key="ext_stopwords"></entry>
	<!-- Configure remote extension dictionaries here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Configure remote extension stopword dictionaries here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
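A mistyped file name in ext_dict is easy to miss, so it can help to check the configuration before restarting es. The following is a rough sketch under two assumptions: the config file is named IKAnalyzer.cfg.xml, and the dictionary paths are relative to the config directory. missing_ext_dicts is a hypothetical helper, not part of IK.

```python
import os
import xml.etree.ElementTree as ET


def missing_ext_dicts(config_dir, cfg_name="IKAnalyzer.cfg.xml"):
    """Return ext_dict entries whose files do not exist under config_dir."""
    root = ET.parse(os.path.join(config_dir, cfg_name)).getroot()
    missing = []
    for entry in root.findall("entry"):
        if entry.get("key") != "ext_dict":
            continue
        # Several dictionaries can be listed in one entry, separated by ";".
        for name in (entry.text or "").split(";"):
            name = name.strip()
            if name and not os.path.isfile(os.path.join(config_dir, name)):
                missing.append(name)
    return missing
```

Calling missing_ext_dicts on the ik config directory returns an empty list when every configured dictionary file is present, and otherwise lists the names that could not be found.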

Running the query again shows that the es analyzer now recognizes 弗雷爾卓德 as a word:

{
    "tokens": [
        {
            "token": "弗雷爾卓德",
            "start_offset": 0,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "弗雷爾",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "卓",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "德",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 3
        }
    ]
}
