Elasticsearch: the IK analyzer and how to integrate it

Reposted article · 2021-01-26 11:28 · ElasticSearch

1. Analyzing the query problem

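With a query_string query, searching for "搜索服務器" ("search server") and for "鋼索" ("steel cable") both return data;
with a term query, searching for "搜索" ("search") returns nothing.
The cause is Elasticsearch's standard analyzer, which the fields were mapped with when the index was created. It indexes Chinese text one character at a time. A query_string query is analyzed the same way, so "鋼索" matches any document containing 索; a term query is not analyzed, so "搜索" is looked up as a single term that was never indexed.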

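Out of the box, Elasticsearch does not segment Chinese text into words. English works because words are already separated by spaces.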

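For example, "How are you!" is split at word boundaries:

how, are, you

whereas "我是一個好男人！" ("I am a good man!") is split into single characters:

我, 是, 一, 個, 好, 男, 人

The standard analyzer also lowercases tokens and discards punctuation, so "!" and "！" are not emitted as tokens.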
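The behaviour described above can be imitated in a few lines of Python. This is only a rough sketch for illustration: the function name standard_like_tokens and the regular expression are assumptions made here, and the real standard analyzer follows the Unicode text segmentation rules rather than this pattern.

```python
import re

# One token per run of ASCII letters/digits, or per single CJK ideograph.
# Punctuation and whitespace match neither alternative and are dropped.
_TOKEN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def standard_like_tokens(text):
    """Rough imitation of how the standard analyzer treats English vs. Chinese."""
    return [match.group(0).lower() for match in _TOKEN.finditer(text)]


print(standard_like_tokens("How are you!"))
print(standard_like_tokens("我是一個好男人！"))
```

English comes out as whole words, while every Chinese character becomes its own token, which is why word-level searches on Chinese text behave unexpectedly.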

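The index was created with the following mapping, in which both text fields use the standard analyzer:

{
    "mappings": {
        "article": {
            "properties": {
                "id": {
                    "type": "long",
                    "store": true,
                    "index": false
                },
                "title": {
                    "type": "text",
                    "store": true,
                    "index": true,
                    "analyzer": "standard"    // built-in standard analyzer; does not segment Chinese into words
                },
                "content": {
                    "type": "text",
                    "store": true,
                    "index": true,
                    "analyzer": "standard"    // standard analyzer
                }
            }
        }
    }
}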

As an example, let's analyze "我是程序員" ("I am a programmer").

Testing the standard analyzer:

GET http://localhost:9200/_analyze

{ "analyzer": "standard", "text": "我是程序員" }

Result:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "程",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "序",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "員",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}
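The same test can be run from a script instead of an HTTP client. The following is a minimal sketch using only the Python standard library; the helper names (build_analyze_request, extract_tokens, analyze) are made up for this example, and it assumes a node reachable at http://localhost:9200. POST is used because the _analyze endpoint accepts it as well as GET.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build the HTTP request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_body):
    """Pull just the token strings out of an _analyze response."""
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]


def analyze(base_url, analyzer, text):
    """Send the request and return the tokens (needs a running node)."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as response:
        return extract_tokens(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(analyze("http://localhost:9200", "standard", "我是程序員"))
```

Keeping the request building and the response parsing in separate functions makes it easy to swap in another analyzer name later and compare the token lists.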

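What we actually want is: 我, 是, 程序, 程序員 ("I", "am", "program", "programmer").

That requires an analyzer with good Chinese support. Several exist, including the word analyzer, Paoding (庖丁解牛), Pangu (盤古分詞) and Ansj, but the one we use most often is the IK analyzer introduced below.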

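2. Introducing the IK analyzer

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through three major versions. It started as a Chinese segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar analysis. The newer IKAnalyzer 3.0 grew into a general-purpose segmentation component for Java that is independent of Lucene, while still shipping a default optimized implementation for Lucene.

Features of IK analyzer 3.0:

1) It uses its own "forward iterative finest-granularity segmentation algorithm" and processes about 600,000 characters per second.
2) It uses a multi-sub-processor analysis mode that handles English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation) and Chinese vocabulary (person and place names).
3) Mixed Chinese and English text is not handled very well; dealing with it is cumbersome and needs a second query. On the other hand, dictionary storage is optimized, supports user-defined entries, and uses less memory.
4) It supports user-defined dictionary extensions.
5) It provides IKQueryParser, a query analyzer optimized for Lucene full-text search. It uses an ambiguity analysis algorithm to optimize the combinations of query keywords, which can greatly improve Lucene's hit rate.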

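3. Installing the IK analyzer

1) Download it from: https://github.com/medcl/elasticsearch-analysis-ik/releases
The course materials also include the IK analyzer archive.

2) Unzip the archive, copy the extracted elasticsearch folder into elasticsearch-5.6.8\plugins, and rename the folder to analysis-ik.

3) Restart Elasticsearch. The IK analyzer is loaded on startup.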
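Step 2 could be scripted roughly as follows. This is a minimal sketch, assuming the downloaded archive contains a top-level elasticsearch folder as described above; install_ik and its parameters are names invented for this example and are not part of Elasticsearch or the IK plugin. Restarting the node (step 3) is still done by hand.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home, inner_folder="elasticsearch"):
    """Unpack the IK archive and place it at <es_home>/plugins/analysis-ik."""
    target = Path(es_home) / "plugins" / "analysis-ik"
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        extracted = Path(tmp) / inner_folder
        if not extracted.is_dir():
            raise FileNotFoundError(f"no '{inner_folder}' folder in {zip_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        # Copy and rename in one step: elasticsearch/ -> plugins/analysis-ik/
        shutil.copytree(extracted, target)
    return target
```

A call might look like install_ik("elasticsearch-analysis-ik-5.6.8.zip", "elasticsearch-5.6.8"), where the archive file name is only an example. The plugin version should match the Elasticsearch version.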

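4. Testing the IK analyzer

IK provides two analyzers, ik_smart and ik_max_word.
ik_smart produces the coarsest segmentation (fewest tokens); ik_max_word produces the finest-grained one.
Let's try each in turn.

1) Coarsest segmentation: send the following request. It carries a JSON body, so use a client such as Postman rather than the browser address bar.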

GET http://localhost:9200/_analyze

{ "analyzer": "ik_smart", "text": "我是程序員" }

Output:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "程序員",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

2) Finest-grained segmentation: enter the following address in the browser address bar

http://127.0.0.1:9200/_analyze?analyzer=ik_max_word&pretty=true&text=我是程序員

Output:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "程序員",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "程序",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "員",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 4
    }
  ]
}
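The difference between the two modes can be illustrated with a toy dictionary segmenter. This is not IK's actual algorithm, only a simplified sketch: the two-word dictionary and the function names segment_smart and segment_max_word are assumptions made for the example, and the real ik_max_word output above contains an extra 員 token that this toy version does not produce.

```python
def segment_smart(text, dictionary):
    """Coarsest split: at each position take the longest dictionary word,
    falling back to a single character."""
    tokens = []
    start = 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens


def segment_max_word(text, dictionary):
    """Finest split: emit every dictionary word found at each position,
    plus any character that no word covers."""
    tokens = []
    covered_until = 0  # index just past the longest word seen so far
    for start in range(len(text)):
        # All dictionary words beginning here, longest first.
        matches = [text[start:end] for end in range(len(text), start, -1)
                   if text[start:end] in dictionary]
        if matches:
            tokens.extend(matches)
            covered_until = max(covered_until, start + len(matches[0]))
        elif start >= covered_until:
            tokens.append(text[start])
            covered_until = start + 1
    return tokens


words = {"程序", "程序員"}  # hypothetical mini-dictionary
print(segment_smart("我是程序員", words))     # ['我', '是', '程序員']
print(segment_max_word("我是程序員", words))  # ['我', '是', '程序員', '程序']
```

The finest-grained mode keeps both 程序員 and the shorter 程序 it contains, so a search for either word can find the document. The coarsest mode keeps only the longest word.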

6. Updating the index mapping

6.1 Rebuilding the index

Delete the existing blog1 index

DELETE		localhost:9200/blog1

Create the blog1 index again, this time with the ik_max_word analyzer

PUT		localhost:9200/blog1

{
    "mappings": {
        "article": {
            "properties": {
                "id": {
                    "type": "long",
                    "store": true,
                    "index": false
                },
                "title": {
                    "type": "text",
                    "store": true,
                    "index": true,
                    "analyzer": "ik_max_word"
                },
                "content": {
                    "type": "text",
                    "store": true,
                    "index": true,
                    "analyzer": "ik_max_word"
                }
            }
        }
    }
}

Create a document

POST	localhost:9200/blog1/article/1

{
	"id":1,
	"title":"ElasticSearch是一個基於Lucene的搜索服務器",
	"content":"它提供了一個分布式多用戶能力的全文搜索引擎，基於RESTful web接口。Elasticsearch是用Java開發的，並作為Apache許可條款下的開放源碼發布，是當前流行的企業級搜索引擎。設計用於雲計算中，能夠達到實時搜索，穩定，可靠，快速，安裝使用方便。"
}
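The delete, create and index steps above can be collected into one script. The sketch below uses only the Python standard library; the helper names (text_field, article_mapping, rebuild_steps, run) are invented for this example, and it assumes the same local node and the same article mapping type as above. Mapping types like article belong to the Elasticsearch 5.x style used in this article; later major versions removed them.

```python
import json
import urllib.error
import urllib.request


def text_field(analyzer):
    """A stored, indexed text field using the given analyzer."""
    return {"type": "text", "store": True, "index": True, "analyzer": analyzer}


def article_mapping(analyzer):
    """The mapping shown above, with the analyzer as a parameter."""
    return {"mappings": {"article": {"properties": {
        "id": {"type": "long", "store": True, "index": False},
        "title": text_field(analyzer),
        "content": text_field(analyzer),
    }}}}


def rebuild_steps(index, analyzer, docs):
    """List the requests as (method, path, body) tuples, in order."""
    steps = [
        ("DELETE", f"/{index}", None),
        ("PUT", f"/{index}", article_mapping(analyzer)),
    ]
    steps += [("POST", f"/{index}/article/{doc['id']}", doc) for doc in docs]
    return steps


def run(steps, base_url="http://localhost:9200"):
    """Send each step to a running node and print the HTTP status."""
    for method, path, body in steps:
        data = None if body is None else json.dumps(body).encode("utf-8")
        request = urllib.request.Request(
            base_url + path, data=data, method=method,
            headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(request) as response:
                print(method, path, response.status)
        except urllib.error.HTTPError as err:
            # For example 404 on DELETE when the index does not exist yet.
            print(method, path, "failed:", err.code)
```

Calling run(rebuild_steps("blog1", "ik_max_word", [doc])) with the document above as doc would replay section 6.1. Passing "standard" instead recreates the original index for comparison.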

6.2 Testing the query_string query again

Request URL:

POST	localhost:9200/blog1/article/_search

Request body:

{
    "query": {
        "query_string": {
            "default_field": "title",
            "query": "搜索服務器"
        }
    }
}

[Postman screenshot of the response]
Change the search string in the request body to "鋼索" ("steel cable") and query again:

{
    "query": {
        "query_string": {
            "default_field": "title",
            "query": "鋼索"
        }
    }
}

[Postman screenshot of the response]
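For readers without Postman, the two query_string requests could also be sent from Python. This is a small sketch under the same assumptions as the requests above (local node, blog1 index, article type); the function names query_string_body, hit_ids and search are made up for this example.

```python
import json
import urllib.request


def query_string_body(field, text):
    """Request body for a query_string search on one field."""
    return {"query": {"query_string": {"default_field": field, "query": text}}}


def hit_ids(response_body):
    """Return the _id of every hit in a _search response."""
    return [hit["_id"] for hit in json.loads(response_body)["hits"]["hits"]]


def search(base_url, index, doc_type, body):
    """POST the body to _search and return the matching ids (needs a running node)."""
    url = f"{base_url.rstrip('/')}/{index}/{doc_type}/_search"
    request = urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request) as response:
        return hit_ids(response.read().decode("utf-8"))


if __name__ == "__main__":
    for text in ("搜索服務器", "鋼索"):
        ids = search("http://localhost:9200", "blog1", "article",
                     query_string_body("title", text))
        print(text, ids)
```

Printing only the ids makes it easy to see at a glance which of the two search strings still matches after the analyzer change.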

6.3 Testing the term query again

Request URL:

POST	localhost:9200/blog1/article/_search

Request body:

{
    "query": {
        "term": {
            "title": "搜索"
        }
    }
}

[Postman screenshot of the response]
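Why the three queries behave differently under the two analyzers can be reproduced with a toy in-memory inverted index. This is a simplified model, not Elasticsearch itself: per_char stands in for the standard analyzer on Chinese text, by_dictionary stands in for a dictionary analyzer such as IK, and the three-word WORDS set is a hypothetical dictionary chosen for the example. Real results depend on IK's actual dictionary.

```python
from collections import defaultdict

TITLE = "ElasticSearch是一個基於Lucene的搜索服務器"
# Hypothetical mini-dictionary standing in for IK's word list.
WORDS = {"搜索", "服務器", "鋼索"}


def per_char(text):
    """Stand-in for the standard analyzer on Chinese: one token per character."""
    return list(text)


def by_dictionary(text):
    """Stand-in for a dictionary analyzer: longest dictionary word, else one character."""
    tokens, i = [], 0
    while i < len(text):
        length = next((n for n in range(len(text) - i, 1, -1)
                       if text[i:i + n] in WORDS), 1)
        tokens.append(text[i:i + length])
        i += length
    return tokens


def build_index(docs, analyzer):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index


def analyzed_query(index, analyzer, text):
    """Like query_string: analyze the input, match documents containing any token."""
    return set().union(*(index.get(token, set()) for token in analyzer(text)))


def term_query(index, term):
    """Like term: look the input up as-is, without analysis."""
    return set(index.get(term, set()))


std = build_index({1: TITLE}, per_char)
ik = build_index({1: TITLE}, by_dictionary)
print(analyzed_query(std, per_char, "鋼索"), term_query(std, "搜索"))      # {1} set()
print(analyzed_query(ik, by_dictionary, "鋼索"), term_query(ik, "搜索"))  # set() {1}
```

With per-character tokens, "鋼索" matches through the shared character 索 while the term "搜索" finds nothing. With word tokens the situation is reversed, which is the behaviour the index rebuild in section 6 is meant to achieve.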
