sparkSQL中partition by和group by區別及使用


1. partition by和group by區別和聯系

1)group by是分組函數,partition by是分析函數(然后像sum()等是聚合函數)

2)在執行順序上partition by應用在以上關鍵字之后,實際上就是在執行完select之后,在所得結果集之上進行partition,group by 使用常用sql關鍵字的優先級(from > where > group by > having > order by)

3)partition by相比較於group by,能夠在保留全部數據的基礎上,只對其中某些字段做分組排序,而group by則只保留參與分組的字段和聚合函數的結果

2. spark sql 使用group by

    val df = Seq(
      ("ABC", "2019-02-10", "411626"),
      ("ABC", "2019-02-10", "411627"),
      ("BCD", "2020-04-01", "411626"),
      ("BCD", "2020-04-01", "411627"),
      ("BCD", "2020-04-02", "411626"),
      ("BCD", "2020-04-02", "411627"),
      ("DEF", "2019-01-09", "411626"))
      .toDF("user_id", "start_time", "end_time")

   df.groupBy(col("user_id"), col("start_time"))
      .agg(count(col("end_time")), sum(col("end_time")))
      .show()

+-------+----------+---------------+-------------+
|user_id|start_time|count(end_time)|sum(end_time)|
+-------+----------+---------------+-------------+
|    BCD|2020-04-02|              2|     823253.0|
|    ABC|2019-02-10|              2|     823253.0|
|    BCD|2020-04-01|              2|     823253.0|
|    DEF|2019-01-09|              1|     411626.0|
+-------+----------+---------------+-------------+

2. spark sql 使用partition by

    df.withColumn("rank",row_number().over(Window.partitionBy(col("user_id"), col("start_time")).orderBy(col("end_time"))))
    .show()

+-------+----------+--------+----+
|user_id|start_time|end_time|rank|
+-------+----------+--------+----+
|    BCD|2020-04-02|  411626|   1|
|    BCD|2020-04-02|  411627|   2|
|    ABC|2019-02-10|  411626|   1|
|    ABC|2019-02-10|  411627|   2|
|    BCD|2020-04-01|  411626|   1|
|    BCD|2020-04-01|  411627|   2|
|    DEF|2019-01-09|  411626|   1|
+-------+----------+--------+----+

partition by 返所有數據列

3. group by實現返所有數據列

   df.groupBy(col("user_id"), col("start_time"))
      .agg(count(col("end_time")), sum(col("end_time")), collect_set(col("end_time"))(0).as("end_time"))
      .show()

+-------+----------+---------------+-------------+--------+
|user_id|start_time|count(end_time)|sum(end_time)|end_time|
+-------+----------+---------------+-------------+--------+
|    BCD|2020-04-02|              2|     823253.0|  411627|
|    ABC|2019-02-10|              2|     823253.0|  411627|
|    BCD|2020-04-01|              2|     823253.0|  411627|
|    DEF|2019-01-09|              1|     411626.0|  411626|
+-------+----------+---------------+-------------+--------+

使用 collect_set(去重)可以實現返回所有列

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM