Introduction to Aggregations
Note: this article uses Elasticsearch and Kibana 7.10.1. If your version differs, consult the official documentation to avoid being misled.
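The aggregations framework helps provide aggregated data based on a search query. It is built on simple building blocks called aggregations, which can be composed to build complex summaries of the data.
Elasticsearch organizes aggregations into three categories:
- Metric aggregations: compute metrics, such as max, min, sum and average, from field values.
- Bucket aggregations: group documents into buckets (also called bins) based on field values, ranges or other criteria, similar to GROUP BY in a relational database.
- Pipeline aggregations: take their input from other aggregations instead of from documents or fields.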
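Aggregations summarize our data as metrics, statistics or other analytics. They help answer questions such as:
- What is the average load time of my website?
- Who are my most valuable customers, based on transaction volume?
- What would be considered a large file on my network?
- How many products are in each product category?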
Preparing the data
Creating the index
DELETE twitter
PUT twitter
{
"settings": {
"number_of_shards": 2,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"birthday": {
"type": "date"
},
"address": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"age": {
"type": "long"
},
"city": {
"type": "keyword"
},
"country": {
"type": "keyword"
},
"location": {
"type": "geo_point"
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"province": {
"type": "keyword"
},
"uid": {
"type": "long"
},
"user": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Importing the data
Use the Bulk API to load the data into Elasticsearch:
POST _bulk
{"index":{"_index":"twitter","_id":1}}
{"user":"張三","message":"今兒天氣不錯啊,出去轉轉去","uid":2,"age":20,"city":"北京","province":"北京","country":"中國","address":"中國北京市海淀區","location":{"lat":"39.970718","lon":"116.325747"}, "birthday": "1999-04-01"}
{"index":{"_index":"twitter","_id":2}}
{"user":"老劉","message":"出發,下一站雲南!","uid":3,"age":22,"city":"北京","province":"北京","country":"中國","address":"中國北京市東城區台基廠三條3號","location":{"lat":"39.904313","lon":"116.412754"}, "birthday": "1997-04-01"}
{"index":{"_index":"twitter","_id":3}}
{"user":"李四","message":"happy birthday!","uid":4,"age":25,"city":"北京","province":"北京","country":"中國","address":"中國北京市東城區","location":{"lat":"39.893801","lon":"116.408986"}, "birthday": "1994-04-01"}
{"index":{"_index":"twitter","_id":4}}
{"user":"老賈","message":"123,gogogo","uid":5,"age":30,"city":"北京","province":"北京","country":"中國","address":"中國北京市朝陽區建國門","location":{"lat":"39.718256","lon":"116.367910"}, "birthday": "1989-04-01"}
{"index":{"_index":"twitter","_id":5}}
{"user":"老王","message":"Happy BirthDay My Friend!","uid":6,"age":26,"city":"北京","province":"北京","country":"中國","address":"中國北京市朝陽區國貿","location":{"lat":"39.918256","lon":"116.467910"}, "birthday": "1993-04-01"}
{"index":{"_index":"twitter","_id":6}}
{"user":"老吳","message":"好友來了都今天我生日,好友來了,什么 birthday happy 就成!","uid":7,"age":28,"city":"上海","province":"上海","country":"中國","address":"中國上海市閔行區","location":{"lat":"31.175927","lon":"121.383328"}, "birthday": "1991-04-01"}
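Note: not every field can be used in an aggregation. In general, fields of `keyword` or numeric type are aggregatable, while analyzed `text` fields are not by default.
We can use the `_field_caps` API to check whether the fields of an index can be aggregated: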
GET twitter/_field_caps?fields=message,age,province,city.keyword
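The response shows that all four fields are searchable, but only `age` and `city.keyword` are aggregatable.
(This output appears to come from an index with dynamically mapped fields, where `province` is `text` and `city` has a `keyword` sub-field. Under the explicit mapping created above, `province` is itself a `keyword` field and `city.keyword` does not exist.)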
{
"indices" : [
"twitter"
],
"fields" : {
"province" : {
"text" : {
"type" : "text",
"searchable" : true,
"aggregatable" : false
}
},
"message" : {
"text" : {
"type" : "text",
"searchable" : true,
"aggregatable" : false
}
},
"city.keyword" : {
"keyword" : {
"type" : "keyword",
"searchable" : true,
"aggregatable" : true
}
},
"age" : {
"long" : {
"type" : "long",
"searchable" : true,
"aggregatable" : true
}
}
}
}
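searchable
Whether this field is indexed for search on all indices.
aggregatable
Whether this field can be aggregated on all indices.
indices
The list of indices where this field has the same type family, or null if all indices have the same type family for the field.
non_searchable_indices
The list of indices where this field is not searchable, or null if all indices have the same definition for the field.
non_aggregatable_indices
The list of indices where this field is not aggregatable, or null if all indices have the same definition for the field.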
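As a small illustration of reading such a response programmatically, the following Python sketch filters the aggregatable fields out of a trimmed copy of the output above. The helper name `aggregatable_fields` is made up for this example and is not part of any Elasticsearch client.

```python
# Trimmed copy of the _field_caps response shown above.
field_caps = {
    "fields": {
        "province": {"text": {"type": "text", "searchable": True, "aggregatable": False}},
        "message": {"text": {"type": "text", "searchable": True, "aggregatable": False}},
        "city.keyword": {"keyword": {"type": "keyword", "searchable": True, "aggregatable": True}},
        "age": {"long": {"type": "long", "searchable": True, "aggregatable": True}},
    }
}


def aggregatable_fields(response):
    """Return the sorted names of fields that are aggregatable under every reported type."""
    names = []
    for name, by_type in response["fields"].items():
        # A field may be reported under several types when indices disagree.
        if all(caps["aggregatable"] for caps in by_type.values()):
            names.append(name)
    return sorted(names)


print(aggregatable_fields(field_caps))
```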
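Aggregation operations
Syntax
"aggregations" : {
"<aggregation_name>" : { <!-- name of the aggregation -->
"<aggregation_type>" : { <!-- type of the aggregation -->
<aggregation_body> <!-- aggregation body: which fields to aggregate on -->
}
[,"meta" : { [<meta_data_body>] } ]? <!-- metadata -->
[,"aggregations" : { [<sub_aggregation>]+ } ]? <!-- sub-aggregations defined inside this aggregation -->
}
[,"<aggregation_name_2>" : { ... } ]*<!-- name of a further aggregation -->
}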
The `aggregations` key above can be abbreviated as `aggs`.
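To make the template concrete, here is a minimal Python sketch that builds one request body following that structure and prints it as JSON. The aggregation names `by_city` and `avg_age` are arbitrary labels invented for this example; only the overall nesting mirrors the syntax above.

```python
import json

# Hypothetical names: "by_city" and "avg_age" are labels chosen for this sketch.
body = {
    "size": 0,
    "aggs": {                               # short form of "aggregations"
        "by_city": {                        # <aggregation_name>
            "terms": {"field": "city"},     # <aggregation_type> with its body
            "aggs": {                       # sub-aggregations, computed per bucket
                "avg_age": {"avg": {"field": "age"}}
            },
        }
    },
}

# The printed JSON can be pasted under a `GET twitter/_search` line in the Kibana console.
print(json.dumps(body, indent=2, ensure_ascii=False))
```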
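Metric aggregations
Avg, Sum, Max and Min aggregations
Avg Aggregation: a single-value metrics aggregation that computes the average of the numeric values extracted from the aggregated documents.
Sum Aggregation: a single-value metrics aggregation that sums up the numeric values extracted from the aggregated documents.
Max Aggregation: a single-value metrics aggregation that keeps track of and returns the maximum of the numeric values extracted from the aggregated documents.
Min Aggregation: a single-value metrics aggregation that keeps track of and returns the minimum of the numeric values extracted from the aggregated documents.
These values can be extracted from specific numeric fields in the documents or generated by a provided script.
Query the average, sum, maximum and minimum of age across the documents in the twitter index: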
GET twitter/_search?size=0
{
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
},
"age_sum":{
"sum": {
"field": "age"
}
},
"age_max":{
"max": {
"field": "age"
}
},
"age_min":{
"min": {
"field": "age"
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_sum" : {
"value" : 151.0
},
"age_min" : {
"value" : 20.0
},
"age_avg" : {
"value" : 25.166666666666668
},
"age_max" : {
"value" : 30.0
}
}
}
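As a sanity check on these numbers, the same four metrics can be recomputed by hand. The Python sketch below uses only the `age` values from the bulk request; it illustrates what the aggregations compute, not how Elasticsearch implements them.

```python
# The `age` values of the six documents indexed by the bulk request.
ages = [20, 22, 25, 30, 26, 28]

metrics = {
    "age_avg": sum(ages) / len(ages),   # avg aggregation
    "age_sum": float(sum(ages)),        # sum aggregation
    "age_max": float(max(ages)),        # max aggregation
    "age_min": float(min(ages)),        # min aggregation
}

for name, value in metrics.items():
    print(name, value)
```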
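Stats aggregation
A multi-value metrics aggregation that computes statistics over the numeric values extracted from the aggregated documents.
The returned statistics are count, min, max, avg and sum.
Compute the age statistics for the documents whose city is 北京: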
GET twitter/_search?size=0
{
"query": {
"match": {
"city": "北京"
}
},
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}
Response. Because of the `match` query, only the five 北京 documents are aggregated:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_stats" : {
"count" : 5,
"min" : 20.0,
"max" : 30.0,
"avg" : 24.6,
"sum" : 123.0
}
}
}
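The interaction between the query and the aggregation can be sketched in a few lines of Python: filter first, then compute the statistics over what remains. The exact-equality test stands in for the `match` query on the `keyword` field `city`, which is an approximation made for this sketch.

```python
# city and age of the six sample documents, in _id order.
docs = [
    {"city": "北京", "age": 20},
    {"city": "北京", "age": 22},
    {"city": "北京", "age": 25},
    {"city": "北京", "age": 30},
    {"city": "北京", "age": 26},
    {"city": "上海", "age": 28},
]


def stats(values):
    """Compute the five numbers reported by a stats aggregation."""
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": sum(values) / len(values),
        "sum": float(sum(values)),
    }


# The query narrows the document set before the aggregation runs.
beijing_ages = [d["age"] for d in docs if d["city"] == "北京"]
age_stats = stats(beijing_ages)
print(age_stats)
```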
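Bucket aggregations
Range aggregation (multi-bucket)
A multi-bucket value source based aggregation that lets us define a set of ranges, each representing a bucket. During aggregation, the values extracted from each document are checked against each bucket range, and the document is placed in every bucket whose range matches.
Note: for each range, this aggregation includes the from value and excludes the to value.
Split the ages into ranges and count the users in each age group: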
GET twitter/_search
{
"size": 0,
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{
"from": 20,
"to": 22
},
{
"from": 22,
"to": 25
},
{
"from": 25,
"to": 30
}
]
}
}
}
}
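Above we use a range aggregation to define the age groups, and the query returns one bucket per group. Since we only care about the aggregation and not the matching documents, `"size": 0` suppresses the hits. Because to is exclusive, the document with age 30 falls outside every range, so the buckets hold only five of the six documents. The output is: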
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_range" : {
"buckets" : [
{
"key" : "20.0-22.0",
"from" : 20.0,
"to" : 22.0,
"doc_count" : 1
},
{
"key" : "22.0-25.0",
"from" : 22.0,
"to" : 25.0,
"doc_count" : 1
},
{
"key" : "25.0-30.0",
"from" : 25.0,
"to" : 30.0,
"doc_count" : 3
}
]
}
}
}
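The half-open `[from, to)` rule is easy to get wrong, so here is a minimal Python sketch of the bucketing, assuming only the six `age` values from the sample data. The function name `range_buckets` is invented for this illustration.

```python
ages = [20, 22, 25, 30, 26, 28]           # sample `age` values
ranges = [(20, 22), (22, 25), (25, 30)]   # (from, to) pairs from the request


def range_buckets(values, ranges):
    """Count values per range, including `from` and excluding `to`."""
    buckets = []
    for lo, hi in ranges:
        count = sum(1 for v in values if lo <= v < hi)
        buckets.append({"key": f"{float(lo)}-{float(hi)}", "doc_count": count})
    return buckets


buckets = range_buckets(ages, ranges)
for bucket in buckets:
    print(bucket)
```

Age 30 satisfies none of the three conditions, which is why the counts add up to five. Adding a range with only a `from` of 30 to the request would be one way to catch it.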
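Sub-aggregation
A sub-aggregation is an aggregation nested inside another aggregation.
Inside a range aggregation we can define sub-aggregations, here computing the average, maximum and minimum age of each bucket: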
GET twitter/_search
{
"size": 0,
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{
"from": 20,
"to": 22
},
{
"from": 22,
"to": 25
},
{
"from": 25,
"to": 30
}
]
},
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
},
"age_min":{
"min": {
"field": "age"
}
},
"age_max":{
"max": {
"field": "age"
}
}
}
}
}
}
The query returns:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_range" : {
"buckets" : [
{
"key" : "20.0-22.0",
"from" : 20.0,
"to" : 22.0,
"doc_count" : 1,
"age_min" : {
"value" : 20.0
},
"age_avg" : {
"value" : 20.0
},
"age_max" : {
"value" : 20.0
}
},
{
"key" : "22.0-25.0",
"from" : 22.0,
"to" : 25.0,
"doc_count" : 1,
"age_min" : {
"value" : 22.0
},
"age_avg" : {
"value" : 22.0
},
"age_max" : {
"value" : 22.0
}
},
{
"key" : "25.0-30.0",
"from" : 25.0,
"to" : 30.0,
"doc_count" : 3,
"age_min" : {
"value" : 25.0
},
"age_avg" : {
"value" : 26.333333333333332
},
"age_max" : {
"value" : 28.0
}
}
]
}
}
}
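Conceptually, a sub-aggregation runs once per parent bucket, over only the documents in that bucket. The following Python sketch mirrors that idea for the request above; it is an illustration under that simplified model, and `range_with_metrics` is a made-up name.

```python
ages = [20, 22, 25, 30, 26, 28]           # sample `age` values
ranges = [(20, 22), (22, 25), (25, 30)]   # (from, to) pairs, `to` exclusive


def range_with_metrics(values, ranges):
    """Bucket values by range, then compute min/avg/max inside each bucket."""
    buckets = []
    for lo, hi in ranges:
        members = [v for v in values if lo <= v < hi]
        bucket = {"key": f"{float(lo)}-{float(hi)}", "doc_count": len(members)}
        if members:  # this sketch simply omits the metrics for an empty bucket
            bucket["age_min"] = float(min(members))
            bucket["age_avg"] = sum(members) / len(members)
            bucket["age_max"] = float(max(members))
        buckets.append(bucket)
    return buckets


buckets = range_with_metrics(ages, ranges)
for bucket in buckets:
    print(bucket)
```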
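Filters aggregation (multi-bucket)
The Filters aggregation defines a multi-bucket aggregation in which each bucket is associated with a filter. Each bucket collects all documents that match its associated filter.
Above we used Range to split the data into buckets, but that approach only suits numeric fields. With a Filters aggregation we can build buckets on non-numeric fields as well.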
GET twitter/_search
{
"size": 0,
"aggs": {
"city_filters": {
"filters": {
"filters": {
"beijing": {
"match":{
"city":"北京"
}
},
"shanghai":{
"match":{
"city":"上海"
}
}
}
}
}
}
}
The result shows five documents for 北京 and one for 上海, and each bucket is keyed by the name we gave its filter:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"city_filters" : {
"buckets" : {
"beijing" : {
"doc_count" : 5
},
"shanghai" : {
"doc_count" : 1
}
}
}
}
}
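The idea of named filters can be sketched in Python as a dictionary of predicates, each producing its own bucket. Plain equality stands in for the `match` query on the `keyword` field here, which is a simplification assumed for this example.

```python
# `city` of the six sample documents, in _id order.
docs = [{"city": c} for c in ["北京", "北京", "北京", "北京", "北京", "上海"]]

# One named predicate per bucket, mirroring the "filters" object in the request.
filters = {
    "beijing": lambda d: d["city"] == "北京",
    "shanghai": lambda d: d["city"] == "上海",
}

# Each document is tested against every filter independently.
buckets = {
    name: {"doc_count": sum(1 for d in docs if matches(d))}
    for name, matches in filters.items()
}
print(buckets)
```

Because every filter is evaluated independently, a document may land in several buckets or in none, depending on the filters chosen.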
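Filter aggregation (single-bucket)
Defines a single bucket of all the documents in the current document set context that match a specified filter. It is typically used to narrow the current aggregation context down to a specific set of documents.
Select the documents whose city is 北京 and compute their average, maximum and minimum age: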
GET twitter/_search
{
"size":0,
"aggs": {
"agg_filter": {
"filter": {
"match":{
"city":"北京"
}
},
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
},
"avg_max":{
"max": {
"field": "age"
}
},
"avg_min":{
"min": {
"field": "age"
}
}
}
}
}
}
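The result is shown below. Unlike a top-level `query`, the filter only narrows the aggregation context, so `hits.total` still reports all six documents: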
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"agg_filter" : {
"doc_count" : 5,
"avg_min" : {
"value" : 20.0
},
"avg_max" : {
"value" : 30.0
},
"age_avg" : {
"value" : 24.6
}
}
}
}
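Date Range aggregation (multi-bucket)
A range aggregation dedicated to date values. The main difference from the normal range aggregation is that the from and to values can be expressed as Date Math expressions, and that a date format can be specified for the from and to fields in the response.
Note: for each range, this aggregation includes the from value and excludes the to value.
Bucket the documents by birthday range: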
GET twitter/_search
{
"size": 0,
"aggs": {
"birthday_range": {
"date_range": {
"field": "birthday",
"format": "yyyy-MM-dd",
"ranges": [
{
"from": "1989-04-01",
"to": "1997-04-01"
},
{
"from": "1994-04-01",
"to": "1999-04-01"
}
]
}
}
}
}
Result (the two ranges overlap, so the document with birthday 1994-04-01 is counted in both buckets):
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"birthday_range" : {
"buckets" : [
{
"key" : "1989-04-01-1997-04-01",
"from" : 6.07392E11,
"from_as_string" : "1989-04-01",
"to" : 8.598528E11,
"to_as_string" : "1997-04-01",
"doc_count" : 4
},
{
"key" : "1994-04-01-1999-04-01",
"from" : 7.651584E11,
"from_as_string" : "1994-04-01",
"to" : 9.229248E11,
"to_as_string" : "1999-04-01",
"doc_count" : 2
}
]
}
}
}
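The same half-open rule applies to dates. A minimal Python sketch with `datetime.date`, assuming only the six birthdays from the sample data, reproduces the two counts and shows the overlap; `date_range_buckets` is a name invented for this example.

```python
from datetime import date

# `birthday` of the six sample documents, in _id order.
birthdays = [
    date(1999, 4, 1), date(1997, 4, 1), date(1994, 4, 1),
    date(1989, 4, 1), date(1993, 4, 1), date(1991, 4, 1),
]
ranges = [
    (date(1989, 4, 1), date(1997, 4, 1)),
    (date(1994, 4, 1), date(1999, 4, 1)),
]


def date_range_buckets(values, ranges):
    """Count dates per range, including `from` and excluding `to`."""
    return [
        {
            "key": f"{lo.isoformat()}-{hi.isoformat()}",
            "doc_count": sum(1 for v in values if lo <= v < hi),
        }
        for lo, hi in ranges
    ]


buckets = date_range_buckets(birthdays, ranges)
for bucket in buckets:
    print(bucket)
```

The birthday 1999-04-01 equals the exclusive `to` of the second range, so it is not counted there, while 1994-04-01 lies inside both ranges.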
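Terms aggregation (multi-bucket)
A multi-bucket value source based aggregation in which buckets are built dynamically, one per unique value.
A terms aggregation shows how often each value occurs. Below we search for the documents whose message matches happy birthday and group the hits by city: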
GET twitter/_search
{
"query": {
"match": {
"message": "happy birthday"
}
},
"size": 0,
"aggs": {
"city_terms": {
"terms": {
"field": "city.keyword",
"size": 10,
"order": {
"_count": "asc"
}
}
}
}
}
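`"size": 10` returns at most ten buckets, and `order` sorts the buckets by doc_count, here in ascending order, so the cities with the fewest hits come first. Note that under the explicit mapping created earlier, `city` is already a `keyword` field, so the field to aggregate on there is `city`; `city.keyword` exists only when `city` was dynamically mapped as `text`. The aggregation result is: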
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"city_terms" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "上海",
"doc_count" : 1
},
{
"key" : "北京",
"doc_count" : 2
}
]
}
}
}
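The combination of a `match` query and a terms aggregation can be approximated in Python as follows. The regex tokenizer is only a rough stand-in for the standard analyzer, and treating the query as "any term matches" reflects the default OR operator of `match`; both are simplifications assumed for this sketch.

```python
import re
from collections import Counter

# `city` and `message` of the six sample documents, in _id order.
docs = [
    {"city": "北京", "message": "今兒天氣不錯啊,出去轉轉去"},
    {"city": "北京", "message": "出發,下一站雲南!"},
    {"city": "北京", "message": "happy birthday!"},
    {"city": "北京", "message": "123,gogogo"},
    {"city": "北京", "message": "Happy BirthDay My Friend!"},
    {"city": "上海", "message": "好友來了都今天我生日,好友來了,什么 birthday happy 就成!"},
]


def latin_tokens(text):
    """Lowercase the text and extract ASCII words (rough analyzer stand-in)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query_terms = {"happy", "birthday"}
hits = [d for d in docs if latin_tokens(d["message"]) & query_terms]

# One bucket per distinct city among the hits, sorted by count ascending.
counts = Counter(d["city"] for d in hits)
buckets = sorted(counts.items(), key=lambda item: item[1])
print(len(hits), buckets)
```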
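Histogram aggregation
A multi-bucket value source based aggregation that can be applied to numeric values or numeric range values extracted from the documents. It dynamically builds fixed-size buckets (the size is also called the interval) over the values.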
GET twitter/_search
{
"size": 0,
"aggs": {
"age_histogram": {
"histogram": {
"field": "age",
"interval": 2
}
}
}
}
- interval: the bucket width, here 2
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_histogram" : {
"buckets" : [
{
"key" : 20.0,
"doc_count" : 1
},
{
"key" : 22.0,
"doc_count" : 1
},
{
"key" : 24.0,
"doc_count" : 1
},
{
"key" : 26.0,
"doc_count" : 1
},
{
"key" : 28.0,
"doc_count" : 1
},
{
"key" : 30.0,
"doc_count" : 1
}
]
}
}
}
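Each value is assigned to the bucket whose key is the value rounded down to a multiple of the interval. The Python sketch below applies that rule to the sample ages, assuming an offset of 0; unlike Elasticsearch, it does not fill in empty buckets between keys, which makes no difference for this data.

```python
import math
from collections import Counter

ages = [20, 22, 25, 30, 26, 28]   # sample `age` values
interval = 2


def histogram(values, interval):
    """Bucket each value under key = floor(value / interval) * interval."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return [{"key": float(k), "doc_count": counts[k]} for k in sorted(counts)]


buckets = histogram(ages, interval)
for bucket in buckets:
    print(bucket)
```

Age 25 is rounded down to the bucket with key 24.0, which is why the response contains a 24.0 bucket even though no document has age 24.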