Reposted from: https://elasticsearch.cn/article/132
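Elasticsearch 5.x has a very interesting feature called field collapsing (Field Collapsing, #22337), and I would like to share it here.
Field collapsing is a long-standing request. See issue #256: it was first filed in July 2010 and is one of the most discussed threads (240+ comments). It took six years before the feature was finally supported. Impressive, isn't it? Haha.
It looks like the feature will ship in 5.3. To try it early: Elasticsearch-5.3.0-SNAPSHOT. Documentation: search-request-collapse.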
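So what is field collapsing? Think of it as merging and deduplicating results by a specific field. Suppose we have a recipe search and I want to collapse on the recipe's cuisine field, so that every cuisine contributes one result, i.e. the results are deduplicated by cuisine. When I search for the keyword “魚” (fish), I want the results to cover all kinds of cuisines: Hunan (湘菜), Cantonese (粵菜), Chinese, Western, and not be all Hunan dishes. That is the idea: collapsing on a specific field makes the search results more diverse.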
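At this point some readers will think of using a terms aggregation plus a top_hits aggregation. Combining the two aggregations does achieve the above, but it has limitations: it cannot paginate (#4915); the results are not exact (top terms + top hits; Elasticsearch's aggregation implementation trades accuracy for speed); and with large data volumes aggregations are slow, which hurts the search experience.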
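To make the accuracy point concrete, here is a toy Python model (not Elasticsearch code) of how merging per-shard top terms can miss the true top term. The shard contents and function names are made up for illustration; `shard_size` borrows its name from the terms aggregation parameter.

```python
from collections import Counter


def approx_top_terms(shards, size, shard_size):
    """Merge only the top `shard_size` terms reported by each shard."""
    merged = Counter()
    for counts in shards:
        # Each shard reports just its own most frequent terms.
        merged.update(dict(Counter(counts).most_common(shard_size)))
    return merged.most_common(size)


def exact_top_terms(shards, size):
    """Merge the full per-shard counts (what an exact answer would need)."""
    total = Counter()
    for counts in shards:
        total.update(counts)
    return total.most_common(size)


# Hypothetical per-shard term counts: "B" is second on both shards.
SHARDS = [{"A": 11, "B": 9}, {"C": 10, "B": 9}]

approx = approx_top_terms(SHARDS, size=1, shard_size=1)
exact = exact_top_terms(SHARDS, size=1)
```

In this made-up case the merged answer is "A" with 11 documents, although "B" has 18 across both shards. Raising `shard_size` narrows the gap at the cost of more work per shard.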
So how is the new field collapsing implemented? The key points are:
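- Collapsing and fetching inner_hits run in two phases (the combined-aggregation approach has only one phase), so the top hits are always exact.
- Collapsing happens only at the top hits level. There is no need to compute the actual doc values for every collapse key over the full result set each time; operating on the small set of top hits is enough, which saves a lot of memory compared with a terms aggregation.
- Because collapsing is applied only to the top hits, it is much faster than the combined-aggregation approach.
- Collapsing the top docs does not need global ordinals to convert strings, which also saves a lot of memory compared with aggregations.
- Pagination becomes possible, with the same limitation as regular search: the first from+size hits are fetched and then merged.
- search_after and scroll are not implemented yet, but they are feasible.
- Collapsing only affects the search hits, not aggregations. The total in the search response is the number of all matching documents; the number of deduplicated results is unknown (it cannot be computed).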
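The points above can be sketched with a small in-memory model in Python. This is only an illustration of the idea, not how Elasticsearch implements it: the function `collapsed_search`, its parameters and the sample documents are all hypothetical, and the toy collapses the whole ranked list before slicing, whereas the real feature works on the top docs only.

```python
def collapsed_search(docs, keyword, field, sort_by, from_=0, size=3, inner_size=0):
    """Toy model of a collapsed search over a list of dicts."""
    # Ordinary search: match, then rank. sorted() is stable, so ties keep input order.
    matches = sorted(
        (d for d in docs if keyword in d["name"]),
        key=lambda d: d[sort_by],
        reverse=True,
    )

    # Phase 1: walk the ranked hits and keep the first one per collapse key.
    top_per_key = {}
    for doc in matches:
        top_per_key.setdefault(doc[field], doc)
    page = list(top_per_key.values())[from_:from_ + size]

    # Phase 2: only for the groups on the requested page, fetch extra hits per key.
    hits = []
    for top in page:
        hit = {"key": top[field], "top": top["name"]}
        if inner_size:
            group = [d["name"] for d in matches if d[field] == top[field]]
            hit["inner_hits"] = group[:inner_size]
        hits.append(hit)

    # total counts every matching document, not the number of groups.
    return {"total": len(matches), "hits": hits}


# Hypothetical sample data.
DOCS = [
    {"name": "fish head", "type": "hunan", "rating": 1},
    {"name": "spicy fish soup", "type": "hunan", "rating": 5},
    {"name": "cantonese fish soup", "type": "cantonese", "rating": 5},
    {"name": "fish-flavoured pork", "type": "sichuan", "rating": 2},
    {"name": "beef stew", "type": "western", "rating": 4},
]

first_page = collapsed_search(DOCS, "fish", "type", "rating", size=2, inner_size=2)
second_page = collapsed_search(DOCS, "fish", "type", "rating", from_=2, size=2)
```

Note how `total` stays at the number of matching documents on both pages, while the number of groups only becomes visible by paging through them.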
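Let's walk through a concrete example to see how it works. It is very simple to use.
- First, prepare the index and the data. We use recipes as the example: name is the recipe name, type is the cuisine, and rating is the users' cumulative average rating.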
DELETE recipes
PUT recipes
POST recipes/type/_mapping
{ "properties": { "name": { "type": "text" }, "rating": { "type": "float" }, "type": { "type": "keyword" } } }
POST recipes/type/
{ "name": "清蒸魚頭", "rating": 1, "type": "湘菜" }
POST recipes/type/
{ "name": "剁椒魚頭", "rating": 2, "type": "湘菜" }
POST recipes/type/
{ "name": "紅燒鯽魚", "rating": 3, "type": "湘菜" }
POST recipes/type/
{ "name": "鯽魚湯(辣)", "rating": 3, "type": "湘菜" }
POST recipes/type/
{ "name": "鯽魚湯(微辣)", "rating": 4, "type": "湘菜" }
POST recipes/type/
{ "name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜" }
POST recipes/type/
{ "name": "廣式鯽魚湯", "rating": 5, "type": "粵菜" }
POST recipes/type/
{ "name": "魚香肉絲", "rating": 2, "type": "川菜" }
POST recipes/type/
{ "name": "奶油鮑魚湯", "rating": 2, "type": "西菜" }
- Now let's see what an ordinary query returns: search for dishes whose name contains “魚” (fish) and return 3 hits.
POST recipes/type/_search
{
  "query": { "match": { "name": "魚" } },
  "size": 3
}
They are all Hunan cuisine (湘菜). Good grief, I have been avoiding spicy food lately, so this first page of results is useless to me:
{
  "took": 2, "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9, "max_score": 0.26742277,
    "hits": [
      { "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": 0.26742277, "_source": { "name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜" } },
      { "_index": "recipes", "_type": "type", "_id": "AVoESHXO_OA-dG63Txsa", "_score": 0.19100356, "_source": { "name": "紅燒鯽魚", "rating": 3, "type": "湘菜" } },
      { "_index": "recipes", "_type": "type", "_id": "AVoESHWy_OA-dG63TxsZ", "_score": 0.19100356, "_source": { "name": "剁椒魚頭", "rating": 2, "type": "湘菜" } }
    ]
  }
}
Let's try again, this time sorting by rating to see which dishes everyone likes and whether there is something I would enjoy. Run the query:
POST recipes/type/_search
{
  "query": { "match": { "name": "魚" } },
  "sort": [ { "rating": { "order": "desc" } } ],
  "size": 3
}
The results are slightly better, but 2 of the 3 are still Hunan dishes, which is not ideal:
{
  "took": 1, "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9, "max_score": null,
    "hits": [
      { "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": null, "_source": { "name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜" }, "sort": [ 5 ] },
      { "_index": "recipes", "_type": "type", "_id": "AVoESHYW_OA-dG63Txse", "_score": null, "_source": { "name": "廣式鯽魚湯", "rating": 5, "type": "粵菜" }, "sort": [ 5 ] },
      { "_index": "recipes", "_type": "type", "_id": "AVoESHX7_OA-dG63Txsc", "_score": null, "_source": { "name": "鯽魚湯(微辣)", "rating": 4, "type": "湘菜" }, "sort": [ 4 ] }
    ]
  }
}
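Now I know what I want: I want to see other cuisines. Doesn't this place also have Western, Cantonese and other cuisines? Give me one dish per cuisine. So we switch to a terms aggregation to get one bucket per unique term, combined with a top_hits aggregation that returns the top hit sorted by rating. It sounds a bit complicated, but the query below makes it clear: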
GET recipes/type/_search
{
  "query": { "match": { "name": "魚" } },
  "sort": [ { "rating": { "order": "desc" } } ],
  "aggs": {
    "type": {
      "terms": { "field": "type", "size": 10 },
      "aggs": {
        "rated": {
          "top_hits": { "sort": [ { "rating": { "order": "desc" } } ], "size": 1 }
        }
      }
    }
  },
  "size": 0,
  "from": 0
}
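Look at the result below. The JSON structure is a bit complicated, but it is finally what we wanted: Hunan (湘菜), Cantonese (粵菜), Sichuan (川菜) and Western (西菜) all show up, one of each, with no repeats: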
{
  "took": 4, "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 9, "max_score": 0, "hits": [] },
  "aggregations": {
    "type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "湘菜", "doc_count": 6,
          "rated": { "hits": { "total": 6, "max_score": null, "hits": [
            { "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": null, "_source": { "name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜" }, "sort": [ 5 ] }
          ] } }
        },
        {
          "key": "川菜", "doc_count": 1,
          "rated": { "hits": { "total": 1, "max_score": null, "hits": [
            { "_index": "recipes", "_type": "type", "_id": "AVoESHYr_OA-dG63Txsf", "_score": null, "_source": { "name": "魚香肉絲", "rating": 2, "type": "川菜" }, "sort": [ 2 ] }
          ] } }
        },
        {
          "key": "粵菜", "doc_count": 1,
          "rated": { "hits": { "total": 1, "max_score": null, "hits": [
            { "_index": "recipes", "_type": "type", "_id": "AVoESHYW_OA-dG63Txse", "_score": null, "_source": { "name": "廣式鯽魚湯", "rating": 5, "type": "粵菜" }, "sort": [ 5 ] }
          ] } }
        },
        {
          "key": "西菜", "doc_count": 1,
          "rated": { "hits": { "total": 1, "max_score": null, "hits": [
            { "_index": "recipes", "_type": "type", "_id": "AVoESHY3_OA-dG63Txsg", "_score": null, "_source": { "name": "奶油鮑魚湯", "rating": 2, "type": "西菜" }, "sort": [ 2 ] }
          ] } }
        }
      ]
    }
  }
}
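As mentioned earlier, the approach above works but has limitations. So how does the new field collapsing do it? The query is below: just add a collapse parameter that names the field to deduplicate on, which here is of course the cuisine field “type”: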
GET recipes/type/_search
{
  "query": { "match": { "name": "魚" } },
  "collapse": { "field": "type" },
  "size": 3,
  "from": 0
}
The result looks great, and the hits have the familiar shape (they look just like ordinary query results):
{
  "took": 1, "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9, "max_score": null,
    "hits": [
      { "_index": "recipes", "_type": "type", "_id": "AVoDNlRJ_OA-dG63TxpW", "_score": 0.018980097, "_source": { "name": "鯽魚湯(微辣)", "rating": 4, "type": "湘菜" }, "fields": { "type": [ "湘菜" ] } },
      { "_index": "recipes", "_type": "type", "_id": "AVoDNlRk_OA-dG63TxpZ", "_score": 0.013813315, "_source": { "name": "魚香肉絲", "rating": 2, "type": "川菜" }, "fields": { "type": [ "川菜" ] } },
      { "_index": "recipes", "_type": "type", "_id": "AVoDNlRb_OA-dG63TxpY", "_score": 0.0125863515, "_source": { "name": "廣式鯽魚湯", "rating": 5, "type": "粵菜" }, "fields": { "type": [ "粵菜" ] } }
    ]
  }
}
Let me try paging. The query above returned 3 hits, so I change from to 3 and leave the rest of the query unchanged. The response is:
{
  "took": 1, "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9, "max_score": null,
    "hits": [
      { "_index": "recipes", "_type": "type", "_id": "AVoDNlRw_OA-dG63Txpa", "_score": 0.012546891, "_source": { "name": "奶油鮑魚湯", "rating": 2, "type": "西菜" }, "fields": { "type": [ "西菜" ] } }
    ]
  }
}
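Only one hit comes back this time, because there are only 4 results after deduplication, so everything works as expected. But each cuisine shows just one dish, and I am not happy with that: give me a few more per cuisine so that I can choose. Add the inner_hits parameter to control how many hits are returned per group, here 2, sorted by rating as well. The new query is: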
GET recipes/type/_search
{
  "query": { "match": { "name": "魚" } },
  "collapse": {
    "field": "type",
    "inner_hits": {
      "name": "top_rated",
      "size": 2,
      "sort": [ { "rating": "desc" } ]
    }
  },
  "sort": [ { "rating": { "order": "desc" } } ],
  "size": 2,
  "from": 0
}
The result is as follows. Perfect:
{
  "took": 1, "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9, "max_score": null,
    "hits": [
      {
        "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": null,
        "_source": { "name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜" },
        "fields": { "type": [ "湘菜" ] },
        "sort": [ 5 ],
        "inner_hits": { "top_rated": { "hits": { "total": 6, "max_score": null, "hits": [
          { "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": null, "_source": { "name": "鯽魚湯(變態辣)", "rating": 5, "type": "湘菜" }, "sort": [ 5 ] },
          { "_index": "recipes", "_type": "type", "_id": "AVoESHX7_OA-dG63Txsc", "_score": null, "_source": { "name": "鯽魚湯(微辣)", "rating": 4, "type": "湘菜" }, "sort": [ 4 ] }
        ] } } }
      },
      {
        "_index": "recipes", "_type": "type", "_id": "AVoESHYW_OA-dG63Txse", "_score": null,
        "_source": { "name": "廣式鯽魚湯", "rating": 5, "type": "粵菜" },
        "fields": { "type": [ "粵菜" ] },
        "sort": [ 5 ],
        "inner_hits": { "top_rated": { "hits": { "total": 1, "max_score": null, "hits": [
          { "_index": "recipes", "_type": "type", "_id": "AVoESHYW_OA-dG63Txse", "_score": null, "_source": { "name": "廣式鯽魚湯", "rating": 5, "type": "粵菜" }, "sort": [ 5 ] }
        ] } } }
      }
    ]
  }
}
That's it for this introduction to field collapsing.