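The aggregation framework returns aggregated data computed over the documents matched by a search query. Aggregation is a core feature of any database, and Elasticsearch (ES), as both a search engine and a data store, offers powerful aggregation and analysis capabilities. An aggregation groups the documents matched by a query into buckets and computes values over them, much like GROUP BY combined with aggregate functions in SQL. Aggregations can be nested to build complex operations (a bucketing aggregation can contain sub-aggregations).
The value being aggregated can be taken from a field or computed by a script. Aggregations are defined under the aggregations node of the search request body:
"aggregations" : {                                   // may be abbreviated to "aggs"
  "<aggregation_name>" : {                           // name of the aggregation
    "<aggregation_type>" : {                         // type of the aggregation
      <aggregation_body>                             // body: which fields to aggregate and how
    }
    [,"meta" : { [<meta_data_body>] } ]?             // optional metadata
    [,"aggregations" : { [<sub_aggregation>]+ } ]?   // optional sub-aggregations defined inside this aggregation
  }
  [,"<aggregation_name_2>" : { ... } ]*              // further named aggregations at the same level
}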
1. Data preparation
(1) Create the employee index, employee
PUT employee
{
  "mappings": {
    "properties": {
      "id": { "type": "integer" },
      "name": { "type": "keyword" },
      "job": { "type": "keyword" },
      "age": { "type": "integer" },
      "gender": { "type": "keyword" }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 3,     # number of primary shards
      "number_of_replicas": 2    # number of replicas
    }
  }
}
(2) Insert data
POST employee/_bulk
{"index": {"_id": 1}}
{"id": 1, "name": "Bob", "job": "java", "age": 21, "sal": 8000, "gender": "male"}
{"index": {"_id": 2}}
{"id": 2, "name": "Rod", "job": "html", "age": 31, "sal": 18000, "gender": "female"}
{"index": {"_id": 3}}
{"id": 3, "name": "Gaving", "job": "java", "age": 24, "sal": 12000, "gender": "male"}
{"index": {"_id": 4}}
{"id": 4, "name": "King", "job": "dba", "age": 26, "sal": 15000, "gender": "female"}
{"index": {"_id": 5}}
{"id": 5, "name": "Jonhson", "job": "dba", "age": 29, "sal": 16000, "gender": "male"}
{"index": {"_id": 6}}
{"id": 6, "name": "Douge", "job": "java", "age": 41, "sal": 20000, "gender": "female"}
{"index": {"_id": 7}}
{"id": 7, "name": "cutting", "job": "dba", "age": 27, "sal": 7000, "gender": "male"}
{"index": {"_id": 8}}
{"id": 8, "name": "Bona", "job": "html", "age": 22, "sal": 14000, "gender": "female"}
{"index": {"_id": 9}}
{"id": 9, "name": "Shyon", "job": "dba", "age": 20, "sal": 19000, "gender": "female"}
{"index": {"_id": 10}}
{"id": 10, "name": "James", "job": "html", "age": 18, "sal": 22000, "gender": "male"}
{"index": {"_id": 11}}
{"id": 11, "name": "Golsling", "job": "java", "age": 32, "sal": 23000, "gender": "female"}
{"index": {"_id": 12}}
{"id": 12, "name": "Lily", "job": "java", "age": 24, "sal": 2000, "gender": "male"}
{"index": {"_id": 13}}
{"id": 13, "name": "Jack", "job": "html", "age": 23, "sal": 3000, "gender": "female"}
{"index": {"_id": 14}}
{"id": 14, "name": "Rose", "job": "java", "age": 36, "sal": 6000, "gender": "female"}
{"index": {"_id": 15}}
{"id": 15, "name": "Will", "job": "dba", "age": 38, "sal": 4500, "gender": "male"}
{"index": {"_id": 16}}
{"id": 16, "name": "smith", "job": "java", "age": 32, "sal": 23000, "gender": "male"}
Note: each action line and each document must be on its own line, and the bulk request body must end with a newline character.
About the data: the documents are employee records. name is the employee's name, job the type of work, age the age, sal the salary, and gender the gender. The sal field is not declared in the mapping, so Elasticsearch maps it dynamically as a numeric field.
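Metrics aggregations
A metrics aggregation computes a metric over a set of documents, for example the maximum, minimum, sum, or average of a field. Its output is the computed metric values, which in effect attach summary statistics to the matched documents.
The metric is computed from a specific field or from a value generated by a script. Numeric metrics aggregations that output a single value are called single-value numeric metrics; those that output several values (for example stats) are called multi-value numeric metrics. This classification applies only to numeric metrics aggregations.
max, min, sum, avg
Max Aggregation: returns the maximum. The input is a numeric value taken from each aggregated document, either a numeric field or the result of a script, and the aggregation returns the largest of these values.
Min Aggregation: returns the minimum, computed from the same kind of input.
Sum Aggregation: returns the sum, computed from the same kind of input.
Avg Aggregation: returns the average, computed from the same kind of input.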
POST employee/_search
{
  "size": 0,
  "aggs": {
    "max_sal": { "max": { "field": "sal" } }
  }
}
Response:
{
  "took": 40,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "max_sal": { "value": 23000 }
  }
}
POST employee/_search
{
  "size": 0,
  "aggs": {
    "min_sal": { "min": { "field": "sal" } }
  }
}
Response:
{
  "took": 40,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "min_sal": { "value": 2000 }
  }
}
POST employee/_search
{
  "size": 0,
  "aggs": {
    "sum_sal": { "sum": { "field": "sal" } }
  }
}
Response:
{
  "took": 17,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "sum_sal": { "value": 212500 }
  }
}
POST employee/_search
{
  "size": 0,
  "aggs": {
    "avg_sal": { "avg": { "field": "sal" } }
  }
}
Response:
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "avg_sal": { "value": 13281.25 }
  }
}
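As a quick sanity check on the four responses above, the same single-value metrics can be reproduced locally. This is a minimal Python sketch, not part of Elasticsearch: the sal list is copied by hand from the bulk request, and the dictionary keys simply reuse the aggregation names from the requests.

```python
# "sal" values of the 16 sample employees, in _id order (copied from the bulk request)
sal = [8000, 18000, 12000, 15000, 16000, 20000, 7000, 14000,
       19000, 22000, 23000, 2000, 3000, 6000, 4500, 23000]

# Each entry mirrors one single-value metrics aggregation
metrics = {
    "max_sal": max(sal),
    "min_sal": min(sal),
    "sum_sal": sum(sal),
    "avg_sal": sum(sal) / len(sal),
}
print(metrics)
```

The printed values match the max, min, sum, and avg responses (23000, 2000, 212500, 13281.25).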
Value count
Value count aggregation. It counts the number of values (taken from a specific field or produced by a script) in the aggregated documents. It is usually combined with other single-value aggregations: for example, when computing the average of a field you may also want to know how many values that average was computed from.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "age_count": { "value_count": { "field": "age" } }
  }
}
Response:
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "age_count": { "value": 16 }
  }
}
Distinct count
Cardinality aggregation. It is a single-value metrics aggregation that counts the distinct values of a field or script result across the documents (a de-duplicated count), comparable to COUNT(DISTINCT ...) in SQL. The count is approximate for fields with many distinct values.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "age_count": { "cardinality": { "field": "age" } },
    "job_count": { "cardinality": { "field": "job" } }
  }
}
Response:
{
  "took": 32,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "job_count": { "value": 3 },
    "age_count": { "value": 14 }
  }
}
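For a data set this small, the distinct counts can be checked with a plain set. The sketch below is only an illustration in Python with hand-copied sample values; Elasticsearch itself computes cardinality with the approximate HyperLogLog++ algorithm, so on fields with many distinct values its result may differ slightly from an exact count.

```python
# "age" and "job" values of the 16 sample employees, in _id order
ages = [21, 31, 24, 26, 29, 41, 27, 22, 20, 18, 32, 24, 23, 36, 38, 32]
jobs = ["java", "html", "java", "dba", "dba", "java", "dba", "html",
        "dba", "html", "java", "java", "html", "java", "dba", "java"]

# Exact distinct counts: a set keeps each value once
age_count = len(set(ages))
job_count = len(set(jobs))
print(age_count, job_count)
```

Ages 24 and 32 each occur twice, which is why 16 documents yield 14 distinct ages.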
Stats
Stats aggregation. It is a multi-value aggregation that computes five statistics (min, max, sum, count, and avg) over a numeric value taken from each document, either a numeric field or a script result.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "age_stats": { "stats": { "field": "age" } }
  }
}
Response:
{
  "took": 8,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "age_stats": {
      "count": 16,
      "min": 18,
      "max": 41,
      "avg": 27.75,
      "sum": 444
    }
  }
}
Extended stats
Extended stats aggregation. It is a multi-value aggregation that adds four results to those of stats: the sum of squares, the variance, the standard deviation, and the interval of the mean plus or minus two standard deviations.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "age_stats": { "extended_stats": { "field": "age" } }
  }
}
Response:
{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "age_stats": {
      "count": 16,
      "min": 18,
      "max": 41,
      "avg": 27.75,
      "sum": 444,
      "sum_of_squares": 13006,
      "variance": 42.8125,
      "variance_population": 42.8125,
      "variance_sampling": 45.666666666666664,
      "std_deviation": 6.5431261641512,
      "std_deviation_population": 6.5431261641512,
      "std_deviation_sampling": 6.757711644237764,
      "std_deviation_bounds": {
        "upper": 40.8362523283024,
        "lower": 14.6637476716976,
        "upper_population": 40.8362523283024,
        "lower_population": 14.6637476716976,
        "upper_sampling": 41.26542328847553,
        "lower_sampling": 14.234576711524472
      }
    }
  }
}
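To see how the extra fields relate to each other, the following Python sketch recomputes them from the sample ages. It assumes the usual textbook definitions (population variance as the mean of squares minus the squared mean, sampling variance with n - 1 in the denominator, bounds at two standard deviations); the variable names are chosen here to resemble the response fields and are not an Elasticsearch API.

```python
import math
import statistics

# "age" values of the 16 sample employees, in _id order
ages = [21, 31, 24, 26, 29, 41, 27, 22, 20, 18, 32, 24, 23, 36, 38, 32]

n = len(ages)
avg = sum(ages) / n
sum_of_squares = sum(a * a for a in ages)

# Population variance: E[x^2] - (E[x])^2
variance_population = sum_of_squares / n - avg ** 2
# Sampling variance divides by n - 1 instead of n
variance_sampling = statistics.variance(ages)

std_deviation_population = math.sqrt(variance_population)
# Bounds: mean plus/minus two (population) standard deviations
upper = avg + 2 * std_deviation_population
lower = avg - 2 * std_deviation_population

print(sum_of_squares, variance_population, variance_sampling)
print(std_deviation_population, lower, upper)
```

Up to floating-point rounding, the results agree with the sum_of_squares, variance_population, variance_sampling, std_deviation, and std_deviation_bounds values in the response.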
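Percentiles
Percentiles aggregation. It is a multi-value aggregation. The values of the given field (or script) are ordered from smallest to largest, the share of documents at or below each value is accumulated (as a percentage of all matching documents), and the value found at each requested percentage is returned. By default it returns the values at the percentiles [ 1, 5, 25, 50, 75, 95, 99 ].
POST employee/_search
{
  "size": 0,
  "aggs": {
    "age_percents": { "percentiles": { "field": "age" } }
  }
}
Response:
{
  "took": 16,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "age_percents": {
      "values": {
        "1.0": 18,
        "5.0": 18.6,
        "25.0": 22.5,
        "50.0": 26.5,   // 50% of the documents have age <= 26.5; put the other way, documents with age <= 26.5 make up 50% of all hits
        "75.0": 32,
        "95.0": 40.099999999999994,
        "99.0": 41
      }
    }
  }
}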
Requesting specific percentiles
POST employee/_search
{
  "size": 0,
  "aggs": {
    "age_percents": {
      "percentiles": {
        "field": "age",
        "percents": [95, 99, 99.9]
      }
    }
  }
}
Response:
{
  "took": 18,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "age_percents": {
      "values": {
        "95.0": 40.099999999999994,
        "99.0": 41,
        "99.9": 41
      }
    }
  }
}
Percentile ranks
Percentile ranks aggregation. It returns the percentage of values that fall below given thresholds. For example, it can report the share of documents with age below 25 and the share with age below 30.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "age_perc_rank": {
      "percentile_ranks": {
        "field": "age",
        "values": [25, 30]
      }
    }
  }
}
Response:
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "age_perc_rank": {
      "values": {   // 43.75% of the documents have age below 25; 62.5% have age below 30
        "25.0": 43.75,
        "30.0": 62.5
      }
    }
  }
}
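The two percentages can be verified by counting directly. This is a minimal Python sketch over the hand-copied sample ages; the helper percent_below is a hypothetical name introduced here, and it uses an exact count, whereas Elasticsearch estimates percentile ranks with an approximate algorithm. Because neither 25 nor 30 occurs in the data, "below" and "at or below" give the same result.

```python
# "age" values of the 16 sample employees, in _id order
ages = [21, 31, 24, 26, 29, 41, 27, 22, 20, 18, 32, 24, 23, 36, 38, 32]

def percent_below(values, threshold):
    """Percentage of values strictly below the threshold (exact count)."""
    return 100.0 * sum(1 for v in values if v < threshold) / len(values)

ranks = {t: percent_below(ages, t) for t in (25, 30)}
print(ranks)
```

Seven of the 16 ages are below 25 and ten are below 30, which gives 43.75 and 62.5.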
Top Hits
Top hits aggregation. It returns the top n documents of each group, comparable to taking the first n rows of each group after a GROUP BY in SQL. It keeps track of the most relevant documents being aggregated and is normally used as a sub-aggregation, so that the top matching documents are collected per bucket. It is one of the more commonly used aggregations.
POST employee/_search
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "group_by_job": {
      "terms": {
        "field": "job",
        "size": 2      // number of buckets to return
      },
      "aggs": {
        "top_tag_hits": {
          "top_hits": {
            "size": 5  // maximum number of documents to return per bucket
          }
        }
      }
    }
  }
}
Response (excerpt showing one bucket and one hit):
{
  "took": 6,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "group_by_job": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 9,
      "buckets": [
        {
          "key": "java",
          "doc_count": 7,
          "top_tag_hits": {
            "hits": {
              "total": { "value": 7, "relation": "eq" },
              "max_score": 1,
              "hits": [
                {
                  "_index": "employee",
                  "_type": "_doc",
                  "_id": "3",
                  "_score": 1,
                  "_source": { "id": 3, "name": "Gaving", "job": "java", "age": 24, "sal": 12000, "gender": "male" }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
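Geo Bounds Aggregation
Geo bounds aggregation. For a geo-point field, it computes the bounding box that contains all of the field's coordinates, expressed as a top-left and a bottom-right point.
POST region/_search
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "viewport": {
      "geo_bounds": {
        "field": "location",
        "wrap_longitude": true   // whether the bounding box is allowed to overlap the international date line
      }
    }
  }
}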
Geo Centroid Aggregation
Geo centroid aggregation. For a geo-point field, it computes the weighted centroid of all coordinates.
POST region/_search
{
  "query": {
    "match": { "crime": "burglary" }
  },
  "aggs": {
    "centroid": {
      "geo_centroid": { "field": "location" }
    }
  }
}
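Bucket aggregations
A bucket aggregation groups documents, much like GROUP BY in SQL: documents that share some property are placed in the same bucket. The output is usually a set of buckets, each containing a number of documents (one bucket is one group).
Each bucket aggregation has a key (a field or script) and a criterion that decides which bucket a document belongs to. When the aggregation runs, every document is checked against each bucket criterion, and if it matches one, it falls into that bucket.
Bucket aggregations do not compute metrics. They only group documents according to the criteria given in the request (for example {"from":0, "to":100}). They also return the number of documents in each bucket.
They can contain sub-aggregations, which are applied to every bucket produced by the parent aggregation. Metrics aggregations cannot contain sub-aggregations, but they can be used as sub-aggregations.
Depending on the aggregation, the output may be a single bucket, a fixed set of buckets (multi-bucket), or a number of buckets determined dynamically from the data (for example the terms aggregation).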
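Terms Aggregation
Terms aggregation. Buckets are built from a field: every unique term in the field becomes one bucket, and the number of documents in each bucket is counted. By default the buckets are returned in descending order of document count. It is a multi-bucket aggregation. When not all buckets are returned (controlled by size), the document counts may be inaccurate.
POST employee/_search
{
  "size": 0,                          // return no hits; typical for statistics and aggregations, where the document list is not needed
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "job",               // field to aggregate on
        "size": 10,                   // number of buckets to return (guards against too many); the default is 10
        "order": { "_count": "asc" }, // sort by document count; use { "_key" : "asc" } to sort by bucket key
        "min_doc_count": 1,           // only return buckets with at least this many documents
        "include": ".*dba.*",         // include filter: regular expression matched against the term
        "exclude": "html.*",          // exclude filter: regular expression matched against the term
        "missing": "N/A"              // bucket key used for documents that have no value in the field
      }
    }
  }
}
Response:
{
  "took": 12,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "age_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "dba", "doc_count": 5 }
      ]
    }
  }
}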
Specifying how many buckets each shard returns
POST employee/_search
{
  "size": 0,
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "job",
        "size": 10,
        "shard_size": 20,                  // number of buckets each shard returns; default: size for a single-shard index, size * 1.5 + 10 for a multi-shard index
        "show_term_doc_count_error": true  // show the count error for each bucket
      }
    }
  }
}
Response:
{
  "took": 15,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "age_terms": {
      "doc_count_error_upper_bound": 0,   // upper bound on the document count error
      "sum_other_doc_count": 0,           // number of documents in buckets that were not returned
      "buckets": [                        // by default the top 10 buckets, ordered by document count from high to low
        { "key": "java", "doc_count": 7, "doc_count_error_upper_bound": 0 },   // 7 documents have job = java
        { "key": "dba", "doc_count": 5, "doc_count_error_upper_bound": 0 },    // 5 documents have job = dba
        { "key": "html", "doc_count": 4, "doc_count_error_upper_bound": 0 }
      ]
    }
  }
}
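Conceptually, the terms aggregation on job is a frequency count of the field's values. The Python sketch below reproduces the three buckets locally with collections.Counter; it is an illustration over hand-copied sample data and deliberately ignores sharding, where each shard reports only its own top terms and the merged counts can therefore carry an error.

```python
from collections import Counter

# "job" values of the 16 sample employees, in _id order
jobs = ["java", "html", "java", "dba", "dba", "java", "dba", "html",
        "dba", "html", "java", "java", "html", "java", "dba", "java"]

size = 10  # same role as the "size" parameter: keep at most this many buckets

# most_common() orders by count, highest first, like the default terms ordering
buckets = [{"key": key, "doc_count": count}
           for key, count in Counter(jobs).most_common(size)]
print(buckets)
```

The result lists java (7), dba (5), and html (4), the same buckets as in the response.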
Filter Aggregation
Filter aggregation. It narrows the current set of documents to those matching a single condition and aggregates over the result.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "args_term": {
      "filter": {
        "match": { "job": "java" }
      },
      "aggs": {
        "avg_age": { "avg": { "field": "age" } }
      }
    }
  }
}
Response:
{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "args_term": {
      "doc_count": 7,
      "avg_age": { "value": 30 }
    }
  }
}
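Filters Aggregation
Filters aggregation. It applies several filter conditions to the current documents. Each filter defines one bucket that contains all documents matching it, so the same document may appear in more than one bucket. Filtering happens first, then aggregation. It is a multi-bucket aggregation.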
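The section gives no request for this aggregation, so here is a hedged sketch. The Python dict body shows what a filters request against the sample employee index could look like; the aggregation name job_groups and the filter names java and over_30 are hypothetical. The second half evaluates the same two conditions locally to show that buckets can overlap.

```python
# (job, age) of the 16 sample employees, in _id order
employees = [
    ("java", 21), ("html", 31), ("java", 24), ("dba", 26),
    ("dba", 29), ("java", 41), ("dba", 27), ("html", 22),
    ("dba", 20), ("html", 18), ("java", 32), ("java", 24),
    ("html", 23), ("java", 36), ("dba", 38), ("java", 32),
]

# Search request body with two named filters (names are illustrative)
body = {
    "size": 0,
    "aggs": {
        "job_groups": {
            "filters": {
                "filters": {
                    "java": {"term": {"job": "java"}},
                    "over_30": {"range": {"age": {"gt": 30}}},
                }
            }
        }
    },
}

# Local equivalent of the two filters: one predicate per bucket
predicates = {
    "java": lambda job, age: job == "java",
    "over_30": lambda job, age: age > 30,
}
doc_counts = {
    name: sum(1 for job, age in employees if matches(job, age))
    for name, matches in predicates.items()
}
# Documents that satisfy both filters are counted in both buckets
in_both = sum(1 for job, age in employees if job == "java" and age > 30)
print(doc_counts, in_both)
```

On the sample data the java bucket holds 7 documents and the over_30 bucket 6, with 4 documents present in both, which illustrates the overlap described above.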
Range aggregation
Range aggregation. Documents are bucketed by the ranges that a value (a field or a script result) falls into. Each range includes its from value and excludes its to value (a half-open interval). It is a multi-bucket aggregation.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "age_range": {
      "range": {
        "field": "age",
        "ranges": [
          { "to": 25 },
          { "from": 25, "to": 35 },
          { "from": 35 }
        ]
      },
      "aggs": {
        "bmax": { "max": { "field": "sal" } }
      }
    }
  }
}
Response:
{
  "took": 6,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "age_range": {
      "buckets": [
        { "key": "*-25.0", "to": 25, "doc_count": 7, "bmax": { "value": 22000 } },
        { "key": "25.0-35.0", "from": 25, "to": 35, "doc_count": 6, "bmax": { "value": 23000 } },
        { "key": "35.0-*", "from": 35, "doc_count": 3, "bmax": { "value": 20000 } }
      ]
    }
  }
}
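The half-open rule (from inclusive, to exclusive) and the per-bucket max sub-aggregation can be mimicked in a few lines. This Python sketch works on hand-copied (age, sal) pairs; the helper in_range is a hypothetical name, and None stands for an open end of a range.

```python
# (age, sal) of the 16 sample employees, in _id order
employees = [
    (21, 8000), (31, 18000), (24, 12000), (26, 15000),
    (29, 16000), (41, 20000), (27, 7000), (22, 14000),
    (20, 19000), (18, 22000), (32, 23000), (24, 2000),
    (23, 3000), (36, 6000), (38, 4500), (32, 23000),
]

# Same three ranges as in the request; None means "unbounded"
ranges = [(None, 25), (25, 35), (35, None)]

def in_range(value, lo, hi):
    """from is inclusive, to is exclusive."""
    return (lo is None or value >= lo) and (hi is None or value < hi)

buckets = []
for lo, hi in ranges:
    sals = [sal for age, sal in employees if in_range(age, lo, hi)]
    buckets.append({"from": lo, "to": hi,
                    "doc_count": len(sals), "bmax": max(sals)})

for bucket in buckets:
    print(bucket)
```

The document counts (7, 6, 3) and maximum salaries (22000, 23000, 20000) match the response, and an employee aged exactly 25 would land in the second bucket, not the first.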
Date range aggregation
Date range aggregation. Documents are bucketed by date ranges over a date-typed value. The range boundaries can be written as Date Math expressions. As with the range aggregation, from is included and to is excluded. It is a multi-bucket aggregation.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ]
      }
    }
  }
}
Response:
{
  "took": 19,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "range": {
      "buckets": [
        { "key": "*-01-2021", "to": 1609459200000, "to_as_string": "01-2021", "doc_count": 0 },
        { "key": "01-2021-*", "from": 1609459200000, "from_as_string": "01-2021", "doc_count": 0 }
      ]
    }
  }
}
Histogram and date histogram aggregations
1. Histogram aggregation. Buckets are computed dynamically from a numeric field of the documents. It is a multi-bucket aggregation.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "sal",            // field; must be numeric
        "interval": 50,            // bucket width
        "min_doc_count": 1,        // only buckets with at least this many documents are returned
        "extended_bounds": {       // extend the range of buckets
          "min": 0,
          "max": 500
        },
        "order": {
          "_count": "desc"         // bucket ordering; if the histogram has a direct metrics sub-aggregation, its result can be used for sorting
        },
        "keyed": true,             // return the buckets as a hash; by default they are returned as an array
        "missing": 0               // value to use for documents that lack the field
      }
    }
  }
}
Response:
{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 16, "relation": "eq" },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "prices": {
      "buckets": {
        "23000.0": { "key": 23000, "doc_count": 2 },
        "2000.0": { "key": 2000, "doc_count": 1 },
        "3000.0": { "key": 3000, "doc_count": 1 },
        "4500.0": { "key": 4500, "doc_count": 1 },
        "6000.0": { "key": 6000, "doc_count": 1 },
        "7000.0": { "key": 7000, "doc_count": 1 },
        "8000.0": { "key": 8000, "doc_count": 1 },
        "12000.0": { "key": 12000, "doc_count": 1 },
        "14000.0": { "key": 14000, "doc_count": 1 },
        "15000.0": { "key": 15000, "doc_count": 1 },
        "16000.0": { "key": 16000, "doc_count": 1 },
        "18000.0": { "key": 18000, "doc_count": 1 },
        "19000.0": { "key": 19000, "doc_count": 1 },
        "20000.0": { "key": 20000, "doc_count": 1 },
        "22000.0": { "key": 22000, "doc_count": 1 }
      }
    }
  }
}
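The "computed dynamically" part comes down to one rounding formula: each value is rounded down to the nearest multiple of the interval. The Python sketch below applies that formula to the sample salaries. The function name bucket_key is chosen for illustration, and the interval of 5000 in the second count is an assumption added here to show values actually sharing buckets; with the interval of 50 used in the request, every salary is already a multiple of the interval and becomes its own key.

```python
import math
from collections import Counter

def bucket_key(value, interval, offset=0):
    """Round value down to the start of its histogram bucket."""
    return math.floor((value - offset) / interval) * interval + offset

# "sal" values of the 16 sample employees, in _id order
sal = [8000, 18000, 12000, 15000, 16000, 20000, 7000, 14000,
       19000, 22000, 23000, 2000, 3000, 6000, 4500, 23000]

counts_50 = Counter(bucket_key(s, 50) for s in sal)      # interval from the request
counts_5000 = Counter(bucket_key(s, 5000) for s in sal)  # wider, illustrative interval

print(len(counts_50), counts_50[23000])
print(sorted(counts_5000.items()))
```

With interval 50 there are 15 buckets and only the 23000 bucket holds two documents, as in the response; with interval 5000 the same documents collapse into five buckets.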
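2. Date histogram aggregation.
Documents are bucketed by date interval over a date-typed field. The available interval units are year, quarter, month, week, day, hour, minute, and second. In Elasticsearch 7.2 and later the interval is given either as calendar_interval (a single calendar unit such as month) or as fixed_interval (a whole multiple of a fixed unit, such as 90m). The older interval parameter, which in early versions accepted fractional values for units other than year, quarter, and month, is deprecated.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month",
        "format": "yyyy-MM-dd",
        "time_zone": "+08:00"
      }
    }
  }
}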
Missing Aggregation
Missing aggregation. It creates a single bucket holding the documents that have no value for the given field.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "account_without_a_age": {
      "missing": { "field": "age" }
    }
  }
}
IP range aggregation
Documents are bucketed by IPv4 address ranges over an IPv4 field. It works like the Range Aggregation, except that the field must be of the IPv4 data type. It is a multi-bucket aggregation.
POST employee/_search
{
  "size": 0,
  "aggs": {
    "ip_ranges": {
      "ip_range": {
        "field": "ip",
        "ranges": [
          { "to": "10.0.0.5" },
          { "from": "10.0.0.5" }
        ]
      }
    }
  }
}
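Nested Aggregation
Nested aggregation. For a field of the nested data type, it collects the nested documents into a single bucket, after which further aggregations can be run on the nested data.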
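As the sample employee index has no nested field, the following request body is only a sketch under assumptions: it presumes a hypothetical index whose documents have a nested field named resellers, each entry carrying a numeric price. The body is written as a Python dict that any HTTP client could serialize and send to the index's _search endpoint.

```python
import json

# Hypothetical mapping assumed here: "resellers" is a nested field with a numeric "price"
body = {
    "size": 0,
    "aggs": {
        "resellers": {
            # Step into the nested documents stored under the "resellers" path
            "nested": {"path": "resellers"},
            "aggs": {
                # Sub-aggregation that runs on the nested documents, not on the parents
                "min_price": {"min": {"field": "resellers.price"}}
            },
        }
    },
}

print(json.dumps(body, indent=2))
```

The nested aggregation itself only switches context to the nested documents; the metric of interest is always expressed as a sub-aggregation that refers to fields by their full path.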
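Matrix aggregations
Matrix aggregation. This feature is experimental and may be changed or removed completely in a future release.
This family of aggregations operates on several fields at once and produces a matrix result from the values extracted from the requested document fields. Unlike metrics and bucket aggregations, it does not yet support scripting.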
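Pipeline aggregations
A pipeline aggregation works on the output of other aggregations (buckets, or particular metrics of those buckets) rather than on documents. It is a post-processing step computed over the buckets, and its purpose is to add useful information to the output.
Pipeline aggregations cannot contain sub-aggregations, but some types can be chained (for example, to compute the derivative of a derivative).
Pipeline aggregations fall roughly into two families:
- parent: the input is the output of its parent aggregation, which it processes further. It generally does not create new buckets but enriches the buckets of the parent.
- sibling: the input is the output of a sibling aggregation, and it computes a new aggregation at the same level as that sibling.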
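Pipeline aggregations use the buckets_path parameter to point at the metric they operate on. The buckets_path syntax is:
Aggregation separator = ">", expresses the parent/child relationship between aggregations, e.g. "my_bucket>my_stats.avg"
Metric separator = ".", selects a specific metric of an aggregation
Aggregation name = <name of the aggregation>, names an aggregation directly
Metric = <name of the metric>, names a metric directly
Full path = agg_name[> agg_name]*[. metrics], combines the elements above into a complete path
Special value = "_count", the document count of the bucket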
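Special cases:
- If the name of the aggregation or metric being referenced contains a dot, put that part in square brackets: "buckets_path": "my_percentile[99.9]"
- If the input contains empty buckets (buckets with no documents), the gap_policy parameter decides how they are handled; the possible values are skip and insert_zeros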
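To tie the buckets_path syntax to something concrete, here is a hedged sketch of a sibling pipeline aggregation on the sample data. The request body uses avg_bucket to average the per-job average salaries; the aggregation names by_job, avg_sal, and avg_of_job_avgs are hypothetical. The second half performs the same two-stage computation locally in Python on hand-copied (job, sal) pairs.

```python
from collections import defaultdict

# (job, sal) of the 16 sample employees, in _id order
employees = [
    ("java", 8000), ("html", 18000), ("java", 12000), ("dba", 15000),
    ("dba", 16000), ("java", 20000), ("dba", 7000), ("html", 14000),
    ("dba", 19000), ("html", 22000), ("java", 23000), ("java", 2000),
    ("html", 3000), ("java", 6000), ("dba", 4500), ("java", 23000),
]

# Request body: "avg_of_job_avgs" is a sibling of "by_job" and reads the
# "avg_sal" metric of each of its buckets through buckets_path
body = {
    "size": 0,
    "aggs": {
        "by_job": {
            "terms": {"field": "job"},
            "aggs": {"avg_sal": {"avg": {"field": "sal"}}},
        },
        "avg_of_job_avgs": {
            "avg_bucket": {
                "buckets_path": "by_job>avg_sal",
                "gap_policy": "skip",
            }
        },
    },
}

# Stage 1 (bucket + metric): average salary per job
by_job = defaultdict(list)
for job, sal in employees:
    by_job[job].append(sal)
avg_sal = {job: sum(values) / len(values) for job, values in by_job.items()}

# Stage 2 (pipeline): average over the bucket-level averages, not over documents
avg_of_job_avgs = sum(avg_sal.values()) / len(avg_sal)

print(avg_sal)
print(avg_of_job_avgs)
```

The average of the three bucket averages (about 13326.19) differs from the plain average salary over all documents (13281.25) because the buckets have different sizes, which shows that the pipeline step really operates on buckets.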
Reference
https://blog.csdn.net/alex_xfboy/article/details/86100037