1. Concepts
The suggest/completion API falls into four main categories:
- Term Suggester (spelling correction: when a word is misspelled, suggests the correctly spelled term)
- Phrase Suggester (phrase-level suggestions: builds on the term suggester to correct an entire phrase)
- Completion Suggester (prefix completion: given the first part of a word, completes the whole word)
- Context Suggester (context-aware completion)
The overall effect is similar to the suggestion dropdown of a search engine such as Baidu.
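All four suggesters are driven through the same suggest section of the search API: you name a suggestion, give it the input (text for the term and phrase suggesters, prefix for the completion suggester) and a block naming the suggester type with its options. A minimal sketch of that shape, using placeholder index and field names rather than anything from the examples below:
POST /some_index/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "user input goes here",
      "term": {
        "field": "some_field"
      }
    }
  }
}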
2. Term Suggester (spelling correction)
2.1. API
1. Create the index
PUT /book4
{
  "mappings": {
    "english": {
      "properties": {
        "passage": {
          "type": "text"
        }
      }
    }
  }
}
2. Insert data
curl -H "Content-Type: application/json" -XPOST 'http://localhost:9200/_bulk' -d'
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "Lucene is cool"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "Elasticsearch rocks"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "elk rocks"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "elasticsearch is rock solid"}
'
3. Check which tokens the indexed texts are analyzed into
POST /_analyze
{
  "text": [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid"
  ]
}
Result:

{ "tokens": [ { "token": "lucene", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "is", "start_offset": 7, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 }, { "token": "cool", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 2 }, { "token": "elasticsearch", "start_offset": 15, "end_offset": 28, "type": "<ALPHANUM>", "position": 103 }, { "token": "builds", "start_offset": 29, "end_offset": 35, "type": "<ALPHANUM>", "position": 104 }, { "token": "on", "start_offset": 36, "end_offset": 38, "type": "<ALPHANUM>", "position": 105 }, { "token": "top", "start_offset": 39, "end_offset": 42, "type": "<ALPHANUM>", "position": 106 }, { "token": "of", "start_offset": 43, "end_offset": 45, "type": "<ALPHANUM>", "position": 107 }, { "token": "lucene", "start_offset": 46, "end_offset": 52, "type": "<ALPHANUM>", "position": 108 }, { "token": "elasticsearch", "start_offset": 53, "end_offset": 66, "type": "<ALPHANUM>", "position": 209 }, { "token": "rocks", "start_offset": 67, "end_offset": 72, "type": "<ALPHANUM>", "position": 210 }, { "token": "elastic", "start_offset": 73, "end_offset": 80, "type": "<ALPHANUM>", "position": 311 }, { "token": "is", "start_offset": 81, "end_offset": 83, "type": "<ALPHANUM>", "position": 312 }, { "token": "the", "start_offset": 84, "end_offset": 87, "type": "<ALPHANUM>", "position": 313 }, { "token": "company", "start_offset": 88, "end_offset": 95, "type": "<ALPHANUM>", "position": 314 }, { "token": "behind", "start_offset": 96, "end_offset": 102, "type": "<ALPHANUM>", "position": 315 }, { "token": "elk", "start_offset": 103, "end_offset": 106, "type": "<ALPHANUM>", "position": 316 }, { "token": "stack", "start_offset": 107, "end_offset": 112, "type": "<ALPHANUM>", "position": 317 }, { "token": "elk", "start_offset": 113, "end_offset": 116, "type": "<ALPHANUM>", "position": 418 }, { "token": "rocks", "start_offset": 117, "end_offset": 122, "type": "<ALPHANUM>", "position": 419 }, { "token": "elasticsearch", "start_offset": 123, "end_offset": 136, "type": "<ALPHANUM>", "position": 520 }, { "token": "is", "start_offset": 137, "end_offset": 139, "type": "<ALPHANUM>", "position": 521 }, { "token": "rock", "start_offset": 140, "end_offset": 144, "type": "<ALPHANUM>", "position": 522 }, { "token": "solid", "start_offset": 145, "end_offset": 150, "type": "<ALPHANUM>", "position": 523 } ] }
4. Term suggest API (suggestions against a single field)
Try a search with the misspelled word Elasticsearaach:
POST /book4/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "Elasticsearaach",
      "term": {
        "field": "passage",
        "suggest_mode": "popular"
      }
    }
  }
}
Response:
{
  "took": 26,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": { "total": 0, "max_score": 0, "hits": [] },
  "suggest": {
    "my-suggestion": [
      {
        "text": "elasticsearaach",
        "offset": 0,
        "length": 15,
        "options": [
          { "text": "elasticsearch", "score": 0.84615386, "freq": 3 }
        ]
      }
    ]
  }
}
5. Get suggestions for several fields in one request:
POST _search
{
  "suggest": {
    "my-suggest-1": {
      "text": "tring out Elasticsearch",
      "term": { "field": "message" }
    },
    "my-suggest-2": {
      "text": "kmichy",
      "term": { "field": "user" }
    }
  }
}
The term suggester suggests terms based on edit distance. The provided suggest text is analyzed before terms are suggested, and suggestions are returned per analyzed token of the suggest text. The term suggester does not take the query part of the request into account. Common options:
- text: The suggest text. Required; can be set globally or per suggestion.
- field: The field to fetch candidate suggestions from. Required; can be set globally or per suggestion.
- analyzer: The analyzer used to analyze the suggest text. Defaults to the search analyzer of the suggest field.
- size: The maximum number of corrections returned per suggest text token.
- sort: Defines how suggestions are sorted per suggest text term. Two possible values: score (sort by similarity score, then document frequency, then the term itself) and frequency (sort by document frequency first).
- suggest_mode: Controls which suggestions are included, and for which suggest text terms suggestions are generated at all. Three possible values: missing (only suggest for terms not in the index), popular (only suggest terms that occur in more documents than the original term), always (always suggest).
- lowercase_terms: Lowercases the suggest text terms after analysis.
- max_edits: The maximum edit distance a candidate suggestion may have in order to be considered. Only values between 1 and 2 are allowed; any other value results in a bad request error. Defaults to 2.
- prefix_length: The minimum number of prefix characters that must match for a candidate to qualify as a suggestion. Defaults to 1. Increasing this number improves spell-check performance; misspellings usually do not occur at the beginning of a term. (The old name "prefix_len" is deprecated.)
- min_word_length: The minimum length a suggest text term must have to be included. Defaults to 4. (The old name "min_word_len" is deprecated.)
- shard_size: The maximum number of suggestions retrieved from each individual shard. During the reduce phase only the top suggestions are kept, based on the size option.
- max_inspections: A factor that is multiplied with shard_size in order to inspect more candidate spelling corrections at the shard level. Defaults to 5.
- min_doc_freq: The minimum number of documents a suggestion should appear in. Can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high-frequency terms. Defaults to 0f and is not enabled by default. A value higher than 1 cannot be fractional. Shard-level document frequencies are used for this option.
- max_term_freq: The maximum number of documents a suggest text token may exist in to still be included. Can be a relative percentage (e.g. 0.4) or an absolute number; a value higher than 1 cannot be fractional. Defaults to 0.01f. This can be used to exclude high-frequency terms from spell checking; such terms are usually spelled correctly anyway, and excluding them also improves spell-check performance. Shard-level document frequencies are used for this option.
- string_distance: The string distance implementation used to compare how similar suggested terms are. Five possible values: internal, damerau_levenshtein, levenshtein, jaro_winkler, ngram.
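As a sketch only (the option values below are arbitrary illustrations, not tuned recommendations), several of these options can be combined in a single request against the book4 index created above:
POST /book4/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "elasticsearaach rok",
      "term": {
        "field": "passage",
        "suggest_mode": "always",
        "sort": "frequency",
        "max_edits": 2,
        "prefix_length": 1,
        "min_word_length": 3,
        "string_distance": "internal"
      }
    }
  }
}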
3. Phrase Suggester: phrase correction
The phrase suggester builds on top of the term suggester and also takes the relationships between terms into account, such as whether they appear together in the indexed text, how close to each other they are, and their term frequencies.
Example 1:
POST book4/_search
{
  "suggest": {
    "myss": {
      "text": "Elasticsearch rock",
      "phrase": {
        "field": "passage"
      }
    }
  }
}
{ "took": 11, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 0, "max_score": 0, "hits": [] }, "suggest": { "myss": [ { "text": "Elasticsearch rock", "offset": 0, "length": 18, "options": [ { "text": "elasticsearch rocks", "score": 0.3467123 } ] } ] } }
4. Completion Suggester: auto-completion
This suggester is designed specifically for the auto-completion scenario. Every character the user types triggers an immediate query to the backend for matching candidates, so when the user types quickly the latency requirements on the backend are strict. Its implementation therefore uses a different data structure from the two suggesters above: instead of relying on the inverted index, the analyzed data is encoded into an FST that is stored together with the index. For an index in the open state, Elasticsearch loads the entire FST into memory, making prefix lookups extremely fast. An FST can only serve prefix lookups, however, and that is the main limitation of the Completion Suggester.
1. Create the index
PUT /book5
{
  "mappings": {
    "music": {
      "properties": {
        "suggest": {
          "type": "completion"
        },
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}
Insert data:
POST /book5/music
{
  "suggest": "test my book"
}
In the suggest field, input specifies the input terms and weight specifies the ranking weight (optional):
PUT music/music/5nupmmUBYLvVFwGWH3cu?refresh
{
  "suggest": {
    "input": [ "test", "book" ],
    "weight": 34
  }
}
Specifying a different weight per input:
PUT music/_doc/6Hu2mmUBYLvVFwGWxXef?refresh
{
  "suggest": [
    { "input": "test", "weight": 10 },
    { "input": "good", "weight": 3 }
  ]
}
Example 1: query suggestions by prefix
POST book5/_search?pretty
{
  "suggest": {
    "song-suggest": {
      "prefix": "te",
      "completion": {
        "field": "suggest"
      }
    }
  }
}
{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 0, "max_score": 0, "hits": [] }, "suggest": { "song-suggest": [ { "text": "te", "offset": 0, "length": 2, "options": [ { "text": "test my book1", "_index": "book5", "_type": "music", "_id": "6Xu6mmUBYLvVFwGWpXeL", "_score": 1, "_source": { "suggest": "test my book1" } }, { "text": "test my book1", "_index": "book5", "_type": "music", "_id": "6nu8mmUBYLvVFwGWSndC", "_score": 1, "_source": { "suggest": "test my book1" } }, { "text": "test my book1 english", "_index": "book5", "_type": "music", "_id": "63u8mmUBYLvVFwGWZHdC", "_score": 1, "_source": { "suggest": "test my book1 english" } } ] } ] } }
Example 2: deduplicate the suggestion results
POST book5/_search?pretty
{
  "suggest": {
    "song-suggest": {
      "prefix": "te",
      "completion": {
        "field": "suggest",
        "skip_duplicates": true
      }
    }
  }
}
Example 3: store several phrases in the suggest field and query them
POST /book5/music/63u8mmUBYLvVFwGWZHdC?refresh
{
  "suggest": {
    "input": [ "book1 english", "test english" ],
    "weight": 20
  }
}
Query:
POST book5/_search?pretty
{
  "suggest": {
    "song-suggest": {
      "prefix": "test",
      "completion": {
        "field": "suggest",
        "skip_duplicates": true
      }
    }
  }
}
Result:
{ "took": 7, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 0, "max_score": 0, "hits": [] }, "suggest": { "song-suggest": [ { "text": "test", "offset": 0, "length": 4, "options": [ { "text": "test english", "_index": "book5", "_type": "music", "_id": "63u8mmUBYLvVFwGWZHdC", "_score": 20, "_source": { "suggest": { "input": [ "book1 english", "test english" ], "weight": 20 } } }, { "text": "test my book1", "_index": "book5", "_type": "music", "_id": "6Xu6mmUBYLvVFwGWpXeL", "_score": 1, "_source": { "suggest": "test my book1" } } ] } ] } }
5. Summary and recommendations
Using the Completion Suggester well is therefore not easy. In real development you have to combine analyzers and mapping parameters flexibly according to the characteristics of your data and the needs of the business, and iterate repeatedly before the completion behaves the way you want.
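As a sketch of what that mapping-level tuning can look like (the index name book5_tuned is hypothetical, and the values shown are simply the knobs you might adjust), the completion field itself accepts an analyzer and a few related parameters:
PUT /book5_tuned
{
  "mappings": {
    "music": {
      "properties": {
        "suggest": {
          "type": "completion",
          "analyzer": "simple",
          "search_analyzer": "simple",
          "preserve_separators": true,
          "preserve_position_increments": true,
          "max_input_length": 50
        }
      }
    }
  }
}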
Coming back to the completion/correction feature of the search box mentioned at the beginning: how could it be implemented with ES? One implementation I can think of:
- While the user is still typing, use the Completion Suggester for keyword prefix matching. At first there will be many candidates; as the user types more characters, the candidates narrow down. If the input is reasonably precise, the Completion Suggester results may already be good enough and the user can see sensible options.
- If the Completion Suggester reaches zero matches, the user may well have made a typo; at that point try the Phrase Suggester.
- If the Phrase Suggester does not return any option either, fall back to the Term Suggester (see the sketch after this list).
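One way to implement the last two fallback steps in a single round trip (a sketch, not the only possible design) is to request the phrase and term suggestions together once the completion query has come back empty, and let the client prefer the phrase options over the term options:
POST book4/_search
{
  "size": 0,
  "suggest": {
    "text": "elasticsearch rok",
    "phrase-fallback": {
      "phrase": { "field": "passage" }
    },
    "term-fallback": {
      "term": { "field": "passage" }
    }
  }
}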
In terms of precision, Completion > Phrase > Term, while recall goes the other way around. Performance-wise the Completion Suggester is the fastest; if it satisfies the business requirement, doing prefix matching with the Completion Suggester alone is ideal. The Phrase and Term suggesters search the inverted index, so their performance is considerably lower. Try to limit the amount of data in the indices the suggesters run against; in the best case, after some warm-up the index can be mapped into memory in its entirety.