Spark: computing the difference (diff) between consecutive records, such as time differences


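Sometimes you have a DataFrame and need to compute, within each group of records, the difference between each record and the one before it. The technique is not limited to timestamps; it works for other data types as well.
It relies on two tools: a Spark window specification (Window), which partitions and orders the rows, and the lag function, which fetches a value from the preceding row.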
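Before turning to Spark, it may help to see what "partition by device, order by time, lag by one" computes. The following is a minimal sketch in plain Scala collections, not Spark code; the names LagSketch, Purchase and minutesSincePrevious are hypothetical and exist only for this illustration. Where Spark's lag yields null for the first row of a partition, the sketch uses None.

```scala
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit

object LagSketch {
  // Hypothetical record type, for illustration only.
  final case class Purchase(device: String, time: LocalDateTime)

  // Group by device, sort each group by time, and pair every record with the
  // minutes elapsed since the previous record in its group (None for the first).
  def minutesSincePrevious(rows: Seq[Purchase]): Seq[(Purchase, Option[Long])] =
    rows.groupBy(_.device).values.toSeq.flatMap { group =>
      val sorted = group.sortWith((a, b) => a.time.isBefore(b.time))
      // Shift the group by one position: this mirrors what lag(..., 1) does.
      val previous: Seq[Option[LocalDateTime]] =
        None +: sorted.init.map(p => Option(p.time))
      sorted.zip(previous).map { case (cur, prev) =>
        // Seconds between the two records, rounded to whole minutes.
        cur -> prev.map(t => Math.round(ChronoUnit.SECONDS.between(t, cur.time) / 60.0))
      }
    }
}
```

Spark performs the same grouping, ordering and shifting in a distributed way, so the window version scales to data that does not fit in memory.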

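// Assumes a SparkSession named spark is in scope, as in spark-shell.
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{col, round}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.LongType
import spark.implicits._  // enables toDF and the $"..." column syntax

// Sample purchases; the timestamps are still plain strings at this point.
val df = Seq(
    ("notebook","2019-01-01 00:00:00"),
    ("notebook", "2019-01-10 13:02:00"),
    ("notebook", "2019-01-10 13:15:22"),
    ("small_phone", "2019-01-30 09:30:00"),
    ("small_phone", "2019-01-15 12:00:00"),
    ("small_phone", "2019-01-30 09:50:00"),
    ("small_phone", "2019-01-30 09:32:00"),
    ("big_phone", "2019-01-02 09:30:00")
).toDF("device", "purchase_time").sort("device","purchase_time")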

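// Partition rows by device and order each partition by purchase time.
// Strings in yyyy-MM-dd HH:mm:ss form sort chronologically as long as every field is zero-padded.
val sessionWindow = Window.partitionBy("device").orderBy("purchase_time")
// lag(..., 1) returns the previous row's value within the window, or null for the first row.
val diffDf = df.withColumn("pre_time",
                          functions.lag($"purchase_time",1).over(sessionWindow))
diffDf.show()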

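// Parse both columns as timestamps. The month pattern is uppercase MM; lowercase mm means minutes.
// Casting a timestamp to long yields epoch seconds, so dividing the difference by 60 gives minutes.
// The first row of each device has no previous record, so its minutes_diff is null.
val minutesDf = diffDf.withColumn("purchase_time",
                                  functions.to_timestamp(col("purchase_time"),"yyyy-MM-dd HH:mm:ss"))
                       .withColumn("pre_time",
                                 functions.to_timestamp(col("pre_time"),"yyyy-MM-dd HH:mm:ss"))
                       .withColumn("minutes_diff",
                                  round((col("purchase_time").cast(LongType)-col("pre_time").cast(LongType))/60))
minutesDf.show()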

