1. partition by和group by區別和聯系
1)group by是分組函數,partition by是分析函數(然后像sum()等是聚合函數)
2)在執行順序上partition by應用在以上關鍵字之后,實際上就是在執行完select之后,在所得結果集之上進行partition,group by 使用常用sql關鍵字的優先級(from > where > group by > having > order by)
3)partition by相比較於group by,能夠在保留全部數據的基礎上,只對其中某些字段做分組排序,而group by則只保留參與分組的字段和聚合函數的結果
2. spark sql 使用group by
val df = Seq( ("ABC", "2019-02-10", "411626"), ("ABC", "2019-02-10", "411627"), ("BCD", "2020-04-01", "411626"), ("BCD", "2020-04-01", "411627"), ("BCD", "2020-04-02", "411626"), ("BCD", "2020-04-02", "411627"), ("DEF", "2019-01-09", "411626")) .toDF("user_id", "start_time", "end_time") df.groupBy(col("user_id"), col("start_time")) .agg(count(col("end_time")), sum(col("end_time"))) .show() +-------+----------+---------------+-------------+ |user_id|start_time|count(end_time)|sum(end_time)| +-------+----------+---------------+-------------+ | BCD|2020-04-02| 2| 823253.0| | ABC|2019-02-10| 2| 823253.0| | BCD|2020-04-01| 2| 823253.0| | DEF|2019-01-09| 1| 411626.0| +-------+----------+---------------+-------------+
2. spark sql 使用partition by
df.withColumn("rank",row_number().over(Window.partitionBy(col("user_id"), col("start_time")).orderBy(col("end_time")))) .show() +-------+----------+--------+----+ |user_id|start_time|end_time|rank| +-------+----------+--------+----+ | BCD|2020-04-02| 411626| 1| | BCD|2020-04-02| 411627| 2| | ABC|2019-02-10| 411626| 1| | ABC|2019-02-10| 411627| 2| | BCD|2020-04-01| 411626| 1| | BCD|2020-04-01| 411627| 2| | DEF|2019-01-09| 411626| 1| +-------+----------+--------+----+
partition by 返所有數據列
3. group by實現返所有數據列
df.groupBy(col("user_id"), col("start_time")) .agg(count(col("end_time")), sum(col("end_time")), collect_set(col("end_time"))(0).as("end_time")) .show() +-------+----------+---------------+-------------+--------+ |user_id|start_time|count(end_time)|sum(end_time)|end_time| +-------+----------+---------------+-------------+--------+ | BCD|2020-04-02| 2| 823253.0| 411627| | ABC|2019-02-10| 2| 823253.0| 411627| | BCD|2020-04-01| 2| 823253.0| 411627| | DEF|2019-01-09| 1| 411626.0| 411626| +-------+----------+---------------+-------------+--------+
使用 collect_set(去重)可以實現返回所有列