Grouping and Aggregation in Presto


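All examples use the same five-row sample dataset with the columns name, date, and id (stored as a string). Each query builds it inline with a UNION ALL subquery.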

GROUP BY syntax

Aggregation over a dataset is done mainly with the GROUP BY clause, for example grouping by name to sum each person's id.
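The simple GROUP BY described above can be sketched as follows. This is a minimal illustration, not Presto itself: it assumes SQLite (through Python's standard `sqlite3` module) as a stand-in engine, which works here only because a plain GROUP BY is standard SQL. The table name `a` mirrors the subquery alias used in this article's queries.

```python
import sqlite3

# The five sample rows used throughout this article: (name, date, id).
ROWS = [
    ("张三", "2021-04-11", "100"),
    ("李四", "2021-04-09", "100"),
    ("赵四", "2021-04-16", "200"),
    ("张三", "2021-03-10", "300"),
    ("李四", "2020-01-01", "150"),
]

# Assumption: an in-memory SQLite table stands in for the Presto subquery "a".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (name TEXT, date TEXT, id TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", ROWS)

# Plain GROUP BY: one output row per distinct name.
query = """
    SELECT name, SUM(CAST(id AS DOUBLE)) AS num
    FROM a
    GROUP BY name
"""
totals = dict(conn.execute(query).fetchall())
print(totals)
```

With the sample rows, each name maps to the sum of its id values, for instance 100 + 300 for 张三.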

When you need more complex aggregation within a single SQL statement, however, you can use Presto's GROUPING SETS, CUBE, and ROLLUP syntax.

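A complex grouping operation is usually equivalent to a UNION ALL of simple GROUP BY expressions. That equivalence does not hold, however, when the data source for the aggregation is non-deterministic.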

GROUPING SETS syntax

select date,name,sum(cast(id as double))as num
from 
(
select '张三'as name,'2021-04-11' as date,'100'as id
union all 
select '李四'as name,'2021-04-09' as date,'100'as id
union all 
select '赵四'as name,'2021-04-16' as date,'200'as id
union all 
select '张三'as name,'2021-03-10'as date,'300'as id
union all 
select '李四'as name,'2020-01-01'as date,'150'as id
)a 
group by grouping sets((date),(date,name),(name))


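The query above uses GROUPING SETS to sum id three ways: grouped by date, by (date, name), and by name. In each output row, a column that is not part of that row's grouping set is returned as NULL.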
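To make the row-level effect concrete, here is a rough Python emulation of what the GROUPING SETS query computes. It is only a sketch: the helper name `aggregate` is hypothetical, Python's `None` plays the role of NULL, and the row order is not meant to match what Presto returns.

```python
from collections import defaultdict

# The five sample rows from the query above.
ROWS = [
    {"name": "张三", "date": "2021-04-11", "id": "100"},
    {"name": "李四", "date": "2021-04-09", "id": "100"},
    {"name": "赵四", "date": "2021-04-16", "id": "200"},
    {"name": "张三", "date": "2021-03-10", "id": "300"},
    {"name": "李四", "date": "2020-01-01", "id": "150"},
]


def aggregate(rows, grouping_sets):
    """Sum id once per grouping set; columns outside the set become None."""
    out = []
    for cols in grouping_sets:
        sums = defaultdict(float)
        for row in rows:
            # Build the (date, name) key, blanking columns not grouped on.
            key = tuple(row[c] if c in cols else None for c in ("date", "name"))
            sums[key] += float(row["id"])
        out.extend((d, n, total) for (d, n), total in sums.items())
    return out


result = aggregate(ROWS, [("date",), ("date", "name"), ("name",)])
for line in result:
    print(line)
```

Running one simple aggregation per grouping set and concatenating the results is the UNION ALL equivalence mentioned earlier, written out by hand.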

CUBE syntax

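The CUBE operator generates every possible grouping set (that is, every subset) of the given columns, 2^n in total for n columns. For example, cube(A,B) aggregates by (A), (A,B), (B), and ().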

select date,name,sum(cast(id as double))as num
from 
(
select '张三'as name,'2021-04-11' as date,'100'as id
union all 
select '李四'as name,'2021-04-09' as date,'100'as id
union all 
select '赵四'as name,'2021-04-16' as date,'200'as id
union all 
select '张三'as name,'2021-03-10'as date,'300'as id
union all 
select '李四'as name,'2020-01-01'as date,'150'as id
)a 
group by cube(date,name) -- equivalent to group by grouping sets((date),(date,name),(name),())
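The expansion that CUBE performs can be sketched in a few lines of Python. The function name `cube` is a hypothetical helper for illustration, and the ordering of the generated sets (largest first) is a choice made here rather than something Presto guarantees.

```python
from itertools import combinations


def cube(columns):
    """Return every subset of columns, largest first, as grouping sets."""
    sets = []
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


print(cube(["date", "name"]))
# [('date', 'name'), ('date',), ('name',), ()]
```

The number of sets doubles with each extra column, so CUBE over many columns becomes expensive quickly.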

ROLLUP syntax

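The ROLLUP operator generates all possible subtotals for a given list of columns: n+1 grouping sets for n columns. For example, rollup(A,B) aggregates by (A,B), (A), and ().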

select date,name,sum(cast(id as double))as num
from 
(
select '张三'as name,'2021-04-11' as date,'100'as id
union all 
select '李四'as name,'2021-04-09' as date,'100'as id
union all 
select '赵四'as name,'2021-04-16' as date,'200'as id
union all 
select '张三'as name,'2021-03-10'as date,'300'as id
union all 
select '李四'as name,'2020-01-01'as date,'150'as id
)a 
group by rollup(date,name)  -- note that column order matters: this differs from group by rollup(name,date)
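ROLLUP's grouping sets are simply the prefixes of the column list, which is why the order of the columns matters. A minimal Python sketch, with `rollup` as a hypothetical helper name:

```python
def rollup(columns):
    """Return the prefixes of columns, longest first, ending with ()."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]


print(rollup(["date", "name"]))
# [('date', 'name'), ('date',), ()]
print(rollup(["name", "date"]))
# [('name', 'date'), ('name',), ()]
```

Swapping the columns replaces the per-date subtotal with a per-name subtotal, while the full grouping and the grand total stay the same.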

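Combining multiple grouping expressions
Multiple grouping expressions in the same query are interpreted with cross-product semantics: each grouping set from one expression is combined with each grouping set from the others. The ALL and DISTINCT quantifiers determine whether duplicate grouping sets each produce distinct output rows. The default is ALL, which keeps every duplicate grouping set.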

select date,name,sum(cast(id as double))as num
from 
(
select '张三'as name,'2021-04-11' as date,'100'as id
union all 
select '李四'as name,'2021-04-09' as date,'100'as id
union all 
select '赵四'as name,'2021-04-16' as date,'200'as id
union all 
select '张三'as name,'2021-03-10'as date,'300'as id
union all 
select '李四'as name,'2020-01-01'as date,'150'as id
)a 
group by distinct rollup(date,name),cube(date,name)
-- equivalent to group by grouping sets((date),(date,name),(name),())

-- with ALL (the default), duplicate grouping sets are kept:
select date,name,sex,sum(cast(id as double))as num
from 
(
select '张三'as name,'2021-04-11' as date,'100'as id,'男'as sex
union all 
select '李四'as name,'2021-04-09' as date,'100'as id,'男'as sex
union all 
select '赵四'as name,'2021-04-16' as date,'200'as id,'女'as sex
union all 
select '张三'as name,'2021-03-10'as date,'300'as id,'男'as sex
union all 
select '李四'as name,'2020-01-01'as date,'150'as id,'男'as sex
)a 
group by all rollup(date,name),cube(date,sex)

is equivalent to

select date,name,sex,sum(cast(id as double))as num
from 
(
select '张三'as name,'2021-04-11' as date,'100'as id,'男'as sex
union all 
select '李四'as name,'2021-04-09' as date,'100'as id,'男'as sex
union all 
select '赵四'as name,'2021-04-16' as date,'200'as id,'女'as sex
union all 
select '张三'as name,'2021-03-10'as date,'300'as id,'男'as sex
union all 
select '李四'as name,'2020-01-01'as date,'150'as id,'男'as sex
)a 
group by grouping sets((date,name,sex),(date,name),(date,name,sex),(date,name),(date,sex),(date),(date,sex),(date),(date,sex),(date),(sex),())

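ROLLUP over 2 columns produces 2+1=3 grouping sets and CUBE over 2 columns produces 2^2=4, so combining them yields 3*4=12 grouping sets in total. Duplicates are included because ALL is in effect.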
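The cross-product rule and the effect of ALL versus DISTINCT can be checked with a small Python sketch. The helpers `rollup`, `cube`, and `combine` are hypothetical names introduced for illustration; they model only the set arithmetic, not how Presto executes the query.

```python
from itertools import combinations, product


def rollup(columns):
    """Prefixes of columns, longest first, ending with ()."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]


def cube(columns):
    """Every subset of columns, largest first."""
    sets = []
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


def combine(*groupings, distinct=False):
    """Cross-product the grouping sets, merging columns in each combination."""
    merged = []
    for combo in product(*groupings):
        cols = []
        for grouping_set in combo:
            for col in grouping_set:
                if col not in cols:
                    cols.append(col)
        merged.append(tuple(cols))
    if distinct:
        merged = list(dict.fromkeys(merged))  # drop repeats, keep first-seen order
    return merged


all_sets = combine(rollup(["date", "name"]), cube(["date", "sex"]))
distinct_sets = combine(rollup(["date", "name"]), cube(["date", "sex"]), distinct=True)
print(len(all_sets), len(distinct_sets))
# 12 6
```

Under ALL, a grouping set that appears three times, such as (date), contributes its rows three times to the result. DISTINCT collapses those repeats.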

HAVING syntax

HAVING is generally used together with GROUP BY to filter which groups are kept in the result.

select name,sum(cast(id as double))as num
from 
(
select '张三'as name,'2021-04-11' as date,'100'as id
union all 
select '李四'as name,'2021-04-09' as date,'100'as id
union all 
select '赵四'as name,'2021-04-16' as date,'200'as id
union all 
select '张三'as name,'2021-03-10'as date,'300'as id
union all 
select '李四'as name,'2020-01-01'as date,'150'as id
)a 
group by name
having sum(cast(id as double))>210 -- keep only the groups whose sum(id) exceeds 210
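The filtering effect of HAVING can be tried out with the sketch below. As an assumption made purely for runnability, it uses SQLite through Python's standard `sqlite3` module in place of Presto. The GROUP BY and HAVING clauses used here are standard SQL, so the query text is the same apart from the table standing in for the subquery.

```python
import sqlite3

# The five sample rows from the query above: (name, date, id).
ROWS = [
    ("张三", "2021-04-11", "100"),
    ("李四", "2021-04-09", "100"),
    ("赵四", "2021-04-16", "200"),
    ("张三", "2021-03-10", "300"),
    ("李四", "2020-01-01", "150"),
]

# Assumption: an in-memory SQLite table stands in for the Presto subquery "a".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (name TEXT, date TEXT, id TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", ROWS)

# HAVING is applied after grouping, so it can test the aggregate itself.
query = """
    SELECT name, SUM(CAST(id AS DOUBLE)) AS num
    FROM a
    GROUP BY name
    HAVING SUM(CAST(id AS DOUBLE)) > 210
"""
kept = dict(conn.execute(query).fetchall())
print(kept)
```

With the sample data, 赵四 has a total of 200 and is filtered out, while 张三 (400) and 李四 (250) remain.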

