spark 計算前后兩條記錄之間的差(diff),時間差等


有時候會遇到這樣的場景:有一個datafram,我們需要計算同一組對象中,前后兩條記錄之間的差值,此處並不僅限於時間,還可以是其他的數據類型
需要用到兩個工具:spark窗口函數Window對對象分組以及lag函數

val df = Seq(
    ("notebook","2019-01-01 00:00:00"),
    ("notebook", "2019-01-10 13:02:00"),
    ("notebook", "2019-01-10 13:15:22"),
    ("small_phone", "2019-01-30 09:30:00"),
    ("small_phone", "2019-01-15 12:00:00"),
    ("small_phone", "2019-01-30 09:50:00"),
    ("small_phone", "2019-01-30 09:32:00"),
    ("big_phone", "2019-01-2 09:30:00")
).toDF("device", "purchase_time").sort("device","purchase_time")

val sessionWindow = Window.partitionBy("device").orderBy("purchase_time")
val diffDf = df.withColumn("pre_time",
                          functions.lag($"purchase_time",1).over(sessionWindow))
diffDf.show()

val minitesDf = diffDf.withColumn("purchase_time",
                                  functions.to_timestamp(col("purchase_time"),"yyyy-mm-dd HH:mm:ss"))
                       .withColumn("pre_time",
                                 functions.to_timestamp(col("pre_time"),"yyyy-mm-dd HH:mm:ss"))
                       .withColumn("minitues_diff",
                                  round((col("purchase_time").cast(LongType)-col("pre_time").cast(LongType))/60))
minitesDf.show()


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM