Tokenization (Analysis) in Elasticsearch

Reposted article · 2020-04-13 22:04 · ELK

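What is tokenization

Tokenization is the process of converting a piece of text into a series of words (terms). It is also called text analysis; Elasticsearch calls it Analysis.
Example: 我是中國人 --> 我/是/中國人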
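
The request that produced the result that follows is not shown in this article. Here is a minimal Python sketch of what such a call might look like. It assumes the standard analyzer (suggested by the <ALPHANUM> token type), the input text "hello world" (inferred from the tokens and offsets), and a node reachable at http://localhost:9200. The helper names are illustrative only.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local single-node cluster


def build_analyze_body(text, analyzer="standard"):
    """Build the JSON body for the _analyze API."""
    return {"analyzer": analyzer, "text": text}


def post_json(path, body):
    """POST a JSON body to Elasticsearch and return the decoded response."""
    request = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_analyze_body("hello world")
print(json.dumps(body))
# To run against a live node: post_json("/_analyze", body)
```

Only the body is built and printed here; post_json is left uncalled so the sketch runs without a cluster.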

Result for the text "hello world":

{
    "tokens": [
        {
            "token": "hello",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "world",
            "start_offset": 6,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

The result shows not only the tokens themselves but also the position of each token in the text.

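Chinese tokenization
The difficulty with Chinese is that it has no explicit word boundaries, whereas in English a space serves as the separator. If the text is segmented incorrectly, the result is ambiguous.
For example:
我/愛/炒肉絲
我/愛/炒/肉絲
Commonly used Chinese tokenizers include IK, jieba, and THULAC. IK is the recommended choice.

IK Analyzer is an open-source, lightweight Chinese tokenization toolkit written in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through 3 major versions. It started as a Chinese tokenization component built around the open-source Lucene project, combining dictionary-based segmentation with grammar analysis. From version 3.0 on, IK Analyzer became a general-purpose tokenization component for Java, independent of the Lucene project, while still providing a default optimized implementation for Lucene.

It uses its own "forward iterative finest-granularity segmentation algorithm" and is stated to process 800,000 characters per second. It uses a multi-sub-processor analysis mode that supports tokenizing English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese vocabulary (personal names, place names). Its optimized dictionary storage keeps memory usage small.

IK analyzer plugin for Elasticsearch: https://github.com/medcl/elasticsearch-analysis-ik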

[root@dalianpai ~]# docker run -p 9200:9200 -d -v /root/ik:/usr/share/elasticsearch/plugins/ik --name elasticsearch 3fd2f723b598
8378a1865408d30a279f9e057115cf4e68cfc4360fa2fe3866072ea9b820a27f
[root@dalianpai ~]# docker ps -l
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                              NAMES
8378a1865408        3fd2f723b598        "/docker-entrypoin..."   3 seconds ago       Up 2 seconds        0.0.0.0:9200->9200/tcp, 9300/tcp   elasticsearch
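
With the plugin directory mounted and the container running, the IK analyzer can be tried through the _analyze API. The original request is not shown in this article. The sketch below assumes the analyzer name ik_max_word, which the IK plugin registers for finest-grained segmentation and which fits the overlapping tokens in the result that follows.

```python
import json

# Assumption: the IK plugin is installed and registers the "ik_max_word"
# (finest-grained) and "ik_smart" (coarsest-grained) analyzers.
analyze_request = {"analyzer": "ik_max_word", "text": "我是中國人"}

# ensure_ascii=False keeps the Chinese text readable in the printed body.
print("POST /_analyze")
print(json.dumps(analyze_request, ensure_ascii=False, indent=2))
```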

Result for the text "我是中國人":

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "中國人",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "中國",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "國人",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

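As shown, the Chinese text has now been tokenized.

Full-text search
The two most important aspects of full-text search are:

Relevance: the ability to evaluate how relevant each result is to the query and to rank the results accordingly. The calculation can be based on TF/IDF, geographic proximity, fuzzy similarity, or some other algorithm.
Analysis: the process of converting a block of text into distinct, normalized tokens, in order to build the inverted index and to query it.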
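
The response that follows acknowledges the creation of an index named topcheer, but the request is not included in this article. Here is a hedged sketch of a possible request body. It assumes a mapping type person (the _type seen in later responses) and a hobby field analyzed with ik_max_word (the article later states that hobby is a text field with the IK analyzer). The remaining field types are guesses.

```python
import json

INDEX_NAME = "topcheer"  # taken from the response that follows

create_index_body = {
    "mappings": {
        # "person" is the mapping type seen in later responses (_type);
        # Elasticsearch 7+ removed mapping types, so drop this level there.
        "person": {
            "properties": {
                "name": {"type": "text"},
                "age": {"type": "integer"},
                "mail": {"type": "keyword"},
                # Assumption: the IK analyzer used for hobby is ik_max_word.
                "hobby": {"type": "text", "analyzer": "ik_max_word"},
            }
        }
    }
}

print("PUT /" + INDEX_NAME)
print(json.dumps(create_index_body, indent=2))
```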

{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "topcheer"
}

Bulk-inserting data
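
The bulk request itself is missing from this article. The sketch below builds an NDJSON body for the _bulk API from the three documents that appear in the later search results. The fourth document of the original batch is not shown anywhere in the article, so it is left out rather than guessed. The endpoint path is an assumption based on the index and type names in the responses.

```python
import json

# Only the three documents visible in the later search results are listed;
# the fourth document of the original batch is not shown in this article.
people = [
    {"name": "李四", "age": 21, "mail": "222@qq.com", "hobby": "羽毛球、乒乓球、足球、籃球"},
    {"name": "王五", "age": 22, "mail": "333@qq.com", "hobby": "羽毛球、籃球、游泳、聽音樂"},
    {"name": "趙六", "age": 23, "mail": "444@qq.com", "hobby": "跑步、游泳、籃球"},
]


def build_bulk_body(docs):
    """Return an NDJSON _bulk body: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # empty action: Elasticsearch assigns the _id
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


bulk_body = build_bulk_body(people)
print("POST /topcheer/person/_bulk")
print(bulk_body, end="")
```

When sending this body, the Content-Type header should be application/x-ndjson.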

Result:

{
    "took": 213,
    "errors": false,
    "items": [
        {
            "index": {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzovfLg3Eko2bZmO_B",
                "_version": 1,
                "result": "created",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "created": true,
                "status": 201
            }
        },
        {
            "index": {
                "_index": "itcast",
                "_type": "person",
                "_id": "AXFzovfLg3Eko2bZmO_C",
                "_version": 1,
                "result": "created",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "created": true,
                "status": 201
            }
        },
        {
            "index": {
                "_index": "itcast",
                "_type": "person",
                "_id": "AXFzovfLg3Eko2bZmO_D",
                "_version": 1,
                "result": "created",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "created": true,
                "status": 201
            }
        },
        {
            "index": {
                "_index": "itcast",
                "_type": "person",
                "_id": "AXFzovfLg3Eko2bZmO_E",
                "_version": 1,
                "result": "created",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "created": true,
                "status": 201
            }
        }
    ]
}

Single-term search
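
The search request is not shown in this article. A minimal sketch, assuming a match query for 音樂 on the hobby field (as the process description further down indicates), plus a highlight section to explain the <em> tags in the result. The endpoint path is an assumption.

```python
import json

search_body = {
    "query": {"match": {"hobby": "音樂"}},
    # Ask for highlighted fragments of the hobby field (the <em> tags in the result).
    "highlight": {"fields": {"hobby": {}}},
}

print("POST /topcheer/person/_search")
print(json.dumps(search_body, ensure_ascii=False, indent=2))
```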

Result:

{
    "took": 38,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1.3123269,
        "hits": [
            {
                "_index": "itcast",
                "_type": "person",
                "_id": "AXFzovfLg3Eko2bZmO_D",
                "_score": 1.3123269,
                "_source": {
                    "name": "王五",
                    "age": 22,
                    "mail": "333@qq.com",
                    "hobby": "羽毛球、籃球、游泳、聽音樂"
                },
                "highlight": {
                    "hobby": [
                        "羽毛球、籃球、游泳、聽<em>音</em><em>樂</em>"
                    ]
                }
            }
        ]
    }
}

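How the query is processed:
1. Check the field type.
The hobby field is of type text (with the IK analyzer specified), which means the query string itself should also be analyzed.
2. Analyze the query string.
The query string "音樂" is passed to the IK analyzer, which outputs the single term 音樂. Because there is only one term, the match query runs a single low-level term query.
3. Find matching documents.
The term query looks up "音樂" in the inverted index and retrieves the set of documents that contain the term.
4. Score each document.
The term query calculates a relevance _score for each document. The score combines term frequency (how often "音樂" appears in the hobby field of that document), inverse document frequency (how often "音樂" appears in the hobby field across all documents; the more common the term, the lower its weight), and field length (the shorter the field, the higher the relevance).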
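
The four steps can be illustrated with a toy scorer. This is a sketch only: it uses the textbook BM25 formula with the usual defaults k1=1.2 and b=0.75, and it works on hand-tokenized fields, which are hypothetical and not real IK output. It does not reproduce the exact scores Elasticsearch returns.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized hobby fields (a real index would use IK output).
docs = {
    "李四": ["羽毛球", "乒乓球", "足球", "籃球"],
    "王五": ["羽毛球", "籃球", "游泳", "聽", "音樂"],
    "趙六": ["跑步", "游泳", "籃球"],
}


def bm25_term_scores(term, docs, k1=1.2, b=0.75):
    """Score every document containing `term` with a textbook BM25 formula."""
    avg_len = sum(len(tokens) for tokens in docs.values()) / len(docs)
    # Step 3: find the documents that contain the term, with their term frequency.
    matching = {doc_id: Counter(tokens)[term] for doc_id, tokens in docs.items() if term in tokens}
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1 + (len(docs) - len(matching) + 0.5) / (len(matching) + 0.5))
    scores = {}
    for doc_id, tf in matching.items():
        # Field-length normalization: shorter fields score higher for the same tf.
        norm = k1 * (1 - b + b * len(docs[doc_id]) / avg_len)
        scores[doc_id] = idf * tf * (k1 + 1) / (tf + norm)
    return scores


scores = bm25_term_scores("籃球", docs)
for doc_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(doc_id, round(score, 4))
```

In this toy data every document contains 籃球 exactly once, so the ranking is decided by field length alone: the shortest field ranks first.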

Multi-term search
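
The request for this search is not included in the article. A sketch, assuming a match query whose text contains the two words 音樂 and 籃球 (the words discussed after the result). The helper name and endpoint path are illustrative.

```python
import json


def match_query(field, text):
    """Build a search body with a match query and highlighting on the same field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


multi_term_body = match_query("hobby", "音樂 籃球")
print("POST /topcheer/person/_search")
print(json.dumps(multi_term_body, ensure_ascii=False, indent=2))
```

By default a match query combines the analyzed terms with OR, so a document matching either term is returned.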

Result:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": 1.2632889,
        "hits": [
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_K",
                "_score": 1.2632889,
                "_source": {
                    "name": "王五",
                    "age": 22,
                    "mail": "333@qq.com",
                    "hobby": "羽毛球、籃球、游泳、聽音樂"
                },
                "highlight": {
                    "hobby": [
                        "羽毛球、<em>籃球</em>、游泳、聽<em>音樂</em>"
                    ]
                }
            },
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_L",
                "_score": 0.42327404,
                "_source": {
                    "name": "趙六",
                    "age": 23,
                    "mail": "444@qq.com",
                    "hobby": "跑步、游泳、籃球"
                },
                "highlight": {
                    "hobby": [
                        "跑步、游泳、<em>籃球</em>"
                    ]
                }
            },
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_J",
                "_score": 0.2887157,
                "_source": {
                    "name": "李四",
                    "age": 21,
                    "mail": "222@qq.com",
                    "hobby": "羽毛球、乒乓球、足球、籃球"
                },
                "highlight": {
                    "hobby": [
                        "羽毛球、乒乓球、足球、<em>籃球</em>"
                    ]
                }
            }
        ]
    }
}

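As shown, every document containing "音樂" or "籃球" was returned.
However, this is not what we wanted: we were looking for users whose hobbies include both "音樂" and "籃球", and the result clearly reflects an OR relationship between the terms.
In Elasticsearch the logical relationship between the terms can be specified, as follows: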
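
The original request is not shown. A sketch of how the relationship might be specified, using the operator parameter of the match query. The endpoint path and the highlight section are assumptions.

```python
import json

and_search_body = {
    "query": {
        "match": {
            "hobby": {
                "query": "音樂 籃球",
                "operator": "and",  # every analyzed term must match (default is "or")
            }
        }
    },
    "highlight": {"fields": {"hobby": {}}},
}

print("POST /topcheer/person/_search")
print(json.dumps(and_search_body, ensure_ascii=False, indent=2))
```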

Result:

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1.2632889,
        "hits": [
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_K",
                "_score": 1.2632889,
                "_source": {
                    "name": "王五",
                    "age": 22,
                    "mail": "333@qq.com",
                    "hobby": "羽毛球、籃球、游泳、聽音樂"
                },
                "highlight": {
                    "hobby": [
                        "羽毛球、<em>籃球</em>、游泳、聽<em>音樂</em>"
                    ]
                }
            }
        ]
    }
}

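The result now matches expectations.

So far we have tested "OR" and "AND" searches, which are the two extremes. In practice neither extreme is usually what is wanted. More often something in between is needed: a document only has to match to a certain degree to be returned. Elasticsearch supports this kind of query through minimum_should_match, which sets the required degree of matching, for example 70%.
Example: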
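
The example request is missing from this article. The sketch below assumes the query text 音樂 籃球 (inferred from the highlights in the result that follows) and the 70% value mentioned above. The helper required_matches is hypothetical and illustrates the documented behavior that a positive percentage is rounded down to a whole number of terms.

```python
import json


def required_matches(term_count, percent):
    """Number of terms that must match for a positive percentage (rounded down)."""
    return term_count * percent // 100


msm_search_body = {
    "query": {
        "match": {
            "hobby": {
                "query": "音樂 籃球",
                "minimum_should_match": "70%",
            }
        }
    },
    "highlight": {"fields": {"hobby": {}}},
}

print("POST /topcheer/person/_search")
print(json.dumps(msm_search_body, ensure_ascii=False, indent=2))
print(required_matches(2, 70))  # 2 terms at 70% -> 1.4, rounded down to 1
```

With only two terms, 70% rounds down to one required term, so this query behaves like the OR search. That would explain why the result that follows contains the same three hits.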

Result:

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": 1.2632889,
        "hits": [
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_K",
                "_score": 1.2632889,
                "_source": {
                    "name": "王五",
                    "age": 22,
                    "mail": "333@qq.com",
                    "hobby": "羽毛球、籃球、游泳、聽音樂"
                },
                "highlight": {
                    "hobby": [
                        "羽毛球、<em>籃球</em>、游泳、聽<em>音樂</em>"
                    ]
                }
            },
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_L",
                "_score": 0.42327404,
                "_source": {
                    "name": "趙六",
                    "age": 23,
                    "mail": "444@qq.com",
                    "hobby": "跑步、游泳、籃球"
                },
                "highlight": {
                    "hobby": [
                        "跑步、游泳、<em>籃球</em>"
                    ]
                }
            },
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_J",
                "_score": 0.2887157,
                "_source": {
                    "name": "李四",
                    "age": 21,
                    "mail": "222@qq.com",
                    "hobby": "羽毛球、乒乓球、足球、籃球"
                },
                "highlight": {
                    "hobby": [
                        "羽毛球、乒乓球、足球、<em>籃球</em>"
                    ]
                }
            }
        ]
    }
}

Combined search

A search can also use the bool compound query covered earlier for filters. Example:
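
The example request is not included in this article. A sketch of a bool query that fits the explanation given next: 籃球 in must, 音樂 in must_not, and 游泳 in should. The endpoint path and the highlight section are assumptions.

```python
import json

bool_search_body = {
    "query": {
        "bool": {
            "must": {"match": {"hobby": "籃球"}},      # required
            "must_not": {"match": {"hobby": "音樂"}},  # excluded
            "should": [{"match": {"hobby": "游泳"}}],  # optional, raises the score
        }
    },
    "highlight": {"fields": {"hobby": {}}},
}

print("POST /topcheer/person/_search")
print(json.dumps(bool_search_body, ensure_ascii=False, indent=2))
```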

The search above means:

A result must contain 籃球 and must not contain 音樂; if it also contains 游泳, its relevance is higher.

Result:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 1.2458471,
        "hits": [
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_L",
                "_score": 1.2458471,
                "_source": {
                    "name": "趙六",
                    "age": 23,
                    "mail": "444@qq.com",
                    "hobby": "跑步、游泳、籃球"
                },
                "highlight": {
                    "hobby": [
                        "跑步、<em>游泳</em>、<em>籃球</em>"
                    ]
                }
            },
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_J",
                "_score": 0.2887157,
                "_source": {
                    "name": "李四",
                    "age": 21,
                    "mail": "222@qq.com",
                    "hobby": "羽毛球、乒乓球、足球、籃球"
                },
                "highlight": {
                    "hobby": [
                        "羽毛球、乒乓球、足球、<em>籃球</em>"
                    ]
                }
            }
        ]
    }
}

How the score is calculated

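A bool query computes a relevance _score for each document by adding together the _score of every matching must and should clause. Under the classic TF/IDF similarity the sum was then scaled by the fraction of must and should clauses that matched; BM25, the default similarity since Elasticsearch 5.0, applies no such scaling.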

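must_not clauses do not affect the score; their only job is to exclude documents that are not relevant. By default, none of the should clauses has to match, but if the query has no must clause, at least one should clause must match. This can also be controlled with the minimum_should_match parameter, whose value can be either a number or a percentage.

Example: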
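
The example request is missing here. Judging by the highlights in the result that follows (籃球, 游泳, 音樂) and by the fact that only documents matching at least two of them are returned, the query was probably close to the sketch below. The three should clauses and the value 2 are inferred, not taken from the original.

```python
import json

should_search_body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"hobby": "游泳"}},
                {"match": {"hobby": "籃球"}},
                {"match": {"hobby": "音樂"}},
            ],
            # Inferred value: at least two of the three should clauses must match.
            "minimum_should_match": 2,
        }
    },
    "highlight": {"fields": {"hobby": {}}},
}

print("POST /topcheer/person/_search")
print(json.dumps(should_search_body, ensure_ascii=False, indent=2))
```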

Result:

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 1.8243669,
        "hits": [
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_K",
                "_score": 1.8243669,
                "_source": {
                    "name": "王五",
                    "age": 22,
                    "mail": "333@qq.com",
                    "hobby": "羽毛球、籃球、游泳、聽音樂"
                },
                "highlight": {
                    "hobby": [
                        "羽毛球、<em>籃球</em>、<em>游泳</em>、聽<em>音樂</em>"
                    ]
                }
            },
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_L",
                "_score": 1.2458471,
                "_source": {
                    "name": "趙六",
                    "age": 23,
                    "mail": "444@qq.com",
                    "hobby": "跑步、游泳、籃球"
                },
                "highlight": {
                    "hobby": [
                        "跑步、<em>游泳</em>、<em>籃球</em>"
                    ]
                }
            }
        ]
    }
}

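Boosting

Sometimes we need to give certain terms more weight so that they influence a document's score. For example:

The search keywords are "游泳籃球"; if a result also contains "音樂" that clause gets a weight of 10, and if it contains "跑步" that clause gets a weight of 2.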
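
The request is not shown in this article. A sketch that fits the description, using the boost parameter of the match query. The "and" operator on the main clause is an assumption, inferred from the result, which contains only documents that include both 游泳 and 籃球.

```python
import json

boost_search_body = {
    "query": {
        "bool": {
            # Assumption: both keywords are required (operator "and").
            "must": {"match": {"hobby": {"query": "游泳籃球", "operator": "and"}}},
            "should": [
                {"match": {"hobby": {"query": "音樂", "boost": 10}}},
                {"match": {"hobby": {"query": "跑步", "boost": 2}}},
            ],
        }
    },
    "highlight": {"fields": {"hobby": {}}},
}

print("POST /topcheer/person/_search")
print(json.dumps(boost_search_body, ensure_ascii=False, indent=2))
```

A boost multiplies the score contribution of its clause, so a document matching the 音樂 clause gains far more than one matching the 跑步 clause.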

Result:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 10.595525,
        "hits": [
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_K",
                "_score": 10.595525,
                "_source": {
                    "name": "王五",
                    "age": 22,
                    "mail": "333@qq.com",
                    "hobby": "羽毛球、籃球、游泳、聽音樂"
                },
                "highlight": {
                    "hobby": [
                        "羽毛球、<em>籃球</em>、<em>游泳</em>、聽<em>音樂</em>"
                    ]
                }
            },
            {
                "_index": "topcheer",
                "_type": "person",
                "_id": "AXFzrtEJg3Eko2bZmO_L",
                "_score": 4.1034093,
                "_source": {
                    "name": "趙六",
                    "age": 23,
                    "mail": "444@qq.com",
                    "hobby": "跑步、游泳、籃球"
                },
                "highlight": {
                    "hobby": [
                        "<em>跑步</em>、<em>游泳</em>、<em>籃球</em>"
                    ]
                }
            }
        ]
    }
}
