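I recently needed to compute the maximum number of consecutive login days per user (I use PrestoSQL here; Hive works as well). The login log table hive.traffic.access_user has only two fields: uid and day. The date helper table hive.ods.dim_date has a single field: day.
First, the idea:
uid | day | rownumber | day - rownumber (days) |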
---|---|---|---|
101 | 20190911 | 1 | 20190911-1=20190910 |
101 | 20190912 | 2 | 20190912-2=20190910 |
101 | 20190913 | 3 | 20190913-3=20190910 |
101 | 20190916 | 4 | 20190916-4=20190912 |
101 | 20190917 | 5 | 20190917-5=20190912 |
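As the table shows, as long as the logins are consecutive, the difference day - rownumber stays the same. The catch is that day is a yyyyMMdd number, so plain subtraction breaks across month or year boundaries: 20190930 and 20191001 are adjacent calendar days, yet the two numbers differ by 71. So we first map each date to a sequential number using the date helper table, which must contain every calendar day without gaps: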
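To make the month-boundary problem concrete, here is a minimal sketch on two inline rows. The VALUES rows are illustrative only and do not come from the real table.

```sql
-- Sketch: raw integer subtraction breaks at a month boundary.
-- 20190930 and 20191001 are consecutive calendar days, yet their
-- day - rownumber values differ, so they would fall into different groups.
SELECT day,
       rn,
       day - rn AS days
FROM (
    SELECT day,
           ROW_NUMBER() OVER (ORDER BY day) AS rn
    FROM (VALUES 20190930, 20191001) AS t(day)
) AS s
ORDER BY day;
-- 20190930 | 1 | 20190929
-- 20191001 | 2 | 20190999
```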
select day,ROW_NUMBER() OVER(ORDER BY day) daynum from hive.ods.dim_date
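If a gap-free calendar table is not available, one possible alternative (not the approach used in this post) is to derive the sequential number from the date itself. The sketch below assumes day is stored as an integer in yyyyMMdd form and uses Presto's date_parse and date_diff; the inline VALUES rows are illustrative.

```sql
-- Alternative sketch: number of days since 1970-01-01 as the sequential number.
-- Assumption: day is an integer such as 20190930 (yyyyMMdd).
SELECT day,
       date_diff(
           'day',
           DATE '1970-01-01',
           CAST(date_parse(CAST(day AS varchar), '%Y%m%d') AS date)
       ) AS daynum
FROM (VALUES 20190930, 20191001) AS t(day)
ORDER BY day;
-- The two daynum values differ by exactly 1, even across the month boundary.
```

The helper-table approach keeps the query free of date parsing, while this variant removes the dependency on hive.ods.dim_date being complete.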
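Next, we partition the login log by uid, order each user's rows by date, and compute the rownumber:
-- GROUP BY uid, day collapses multiple logins on the same day into one row before numbering
with a as (select uid,day from hive.traffic.access_user where day>=20190801 and uid<>'')
select uid,day,ROW_NUMBER() OVER(PARTITION BY uid ORDER BY day) rownum from a group by uid,day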
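The GROUP BY matters when a user logs in more than once on the same day: without it, the duplicate row would receive its own rownumber and break the streak. A minimal sketch with inline sample rows (illustrative data, with uid assumed to be a string):

```sql
-- Sketch: duplicate logins on 20190911 are collapsed before numbering,
-- because window functions are evaluated after GROUP BY.
WITH a AS (
    SELECT *
    FROM (VALUES
        ('101', 20190911),
        ('101', 20190911),
        ('101', 20190912)
    ) AS t(uid, day)
)
SELECT uid,
       day,
       ROW_NUMBER() OVER (PARTITION BY uid ORDER BY day) AS rownum
FROM a
GROUP BY uid, day
ORDER BY uid, day;
-- 101 | 20190911 | 1
-- 101 | 20190912 | 2
```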
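The last step is to compute the difference daynum - rownum. Rows of the same user that share the same difference belong to one run of consecutive login days, so grouping by uid and that difference yields the start date and length of every streak. The complete SQL:
with a as (select uid,day from hive.traffic.access_user where day>=20190801 and uid<>''),
b as (select uid,day,ROW_NUMBER() OVER(PARTITION BY uid ORDER BY day) rownum from a group by uid,day ),
c as (select day,ROW_NUMBER() OVER(ORDER BY day) daynum from hive.ods.dim_date),
d as (select uid,b.day,daynum,rownum,daynum-rownum days from b join c on b.day=c.day )
select uid,min(day) "streak_start_day",count(*) "streak_days" from d group by uid,days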
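The query above returns one row per streak. To arrive at the maximum consecutive login days per user, one more aggregation over the streak lengths is needed. The following is a self-contained sketch under stated assumptions: the inline rows for a reuse the sample data from the table at the top, cal is a hypothetical stand-in for hive.ods.dim_date, and the names streaks, streak_days and max_streak_days are introduced here for illustration.

```sql
WITH a AS (  -- sample logins for uid 101, taken from the table above
    SELECT *
    FROM (VALUES
        ('101', 20190911), ('101', 20190912), ('101', 20190913),
        ('101', 20190916), ('101', 20190917)
    ) AS t(uid, day)
),
cal AS (  -- stand-in for hive.ods.dim_date: every calendar day, no gaps
    SELECT *
    FROM (VALUES
        20190911, 20190912, 20190913, 20190914,
        20190915, 20190916, 20190917
    ) AS t(day)
),
b AS (  -- one row per user per day, numbered within each user
    SELECT uid, day,
           ROW_NUMBER() OVER (PARTITION BY uid ORDER BY day) AS rownum
    FROM a
    GROUP BY uid, day
),
c AS (  -- sequential number for each calendar day
    SELECT day, ROW_NUMBER() OVER (ORDER BY day) AS daynum
    FROM cal
),
d AS (  -- equal daynum - rownum means the days belong to one streak
    SELECT b.uid, b.day, c.daynum - b.rownum AS days
    FROM b
    JOIN c ON b.day = c.day
),
streaks AS (  -- one row per streak: start day and length
    SELECT uid, MIN(day) AS start_day, COUNT(*) AS streak_days
    FROM d
    GROUP BY uid, days
)
SELECT uid, MAX(streak_days) AS max_streak_days
FROM streaks
GROUP BY uid;
-- 101 | 3
```

With this sample data the streaks are 20190911 to 20190913 (3 days) and 20190916 to 20190917 (2 days), so the maximum for uid 101 is 3.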