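Everyday Hive data tasks (using window functions wherever possible)
SQL Functions
(Oracle's official documentation, which explains these functions very clearly.)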
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions001.htm#i81407
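⚠️ The full document (478 pages) can be downloaded as a PDF.
Background
Day-to-day work brings many data-processing requests. The workflow here is: take the request, implement it in Hive, and deliver the result.
Problems
Data source: the ods_sales_orders table in the adventure_ods database in Hive.
Table | Table comment | Column | Column comment | Notes |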
---|---|---|---|---|
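ods_sales_orders | Order detail table | sales_order_key | Order primary key | One order represents the sale of one product |
ods_sales_orders | Order detail table | create_date | Order date | |
ods_sales_orders | Order detail table | customer_key | Customer ID | |
ods_sales_orders | Order detail table | product_key | Product ID | |
ods_sales_orders | Order detail table | english_product_name | Product name | |
ods_sales_orders | Order detail table | cpzl_zw | Product subcategory | |
ods_sales_orders | Order detail table | cplb_zw | Product category | |
ods_sales_orders | Order detail table | unit_price | Product unit price | |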
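Problem 1: For each customer, compute the largest monthly transaction amount up to each month and the cumulative transaction amount through that month. The result should look like this:
customer_key | umonth (month) | ucount (orders in the month) | current_max (largest monthly amount so far) | current_sum (cumulative amount through the month) |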
---|---|---|---|---|
11009 | 2018-12 | 1 | 53.99 | 53.99 |
1358999 | 2019-2 | 1 | 28.99 | 28.99 |
1358999 | 2019-4 | 1 | 69.99 | 98.98 |
1359000 | 2019-1 | 1 | 2294.99 | 2294.99 |
1359002 | 2019-11 | 1 | 8.99 | 8.99 |
1359003 | 2020-1 | 1 | 1120.49 | 1120.49 |
1359005 | 2019-2 | 1 | 782.99 | 782.99 |
1359009 | 2019-1 | 1 | 2384.07 | 2384.07 |
1359014 | 2019-1 | 1 | 69.99 | 69.99 |
1359014 | 2019-2 | 1 | 69.99 | 94.98 |
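Approach:
- 1. Group the data by customer and year-month.
- 2. Within each group, compute the monthly sales total.
- 3. Use window functions, partitioned by customer and ordered by month, to get the running maximum (max) and the running total (sum).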
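Problem 2: Compute the repeat-purchase rate and the buy-back rate
Repeat-purchase rate: the share of a month's customers who purchased two or more times in that month.
Buy-back rate: the share of a month's customers who also purchased in the previous month.
Approach:
Repeat-purchase rate:
- Group by customer and month; aggregate with count (the number of purchases in the month).
- On that result, group by month; aggregate with count(condition) / count(*).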
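The two grouping steps for the repeat-purchase rate can be sketched on a few made-up rows. This is only an illustration, not the Hive answer itself: it uses Python's built-in sqlite3 so it runs without a cluster, and the `orders` table and its five rows are hypothetical.

```python
import sqlite3

# Hypothetical in-memory data; the real table is adventure_ods.ods_sales_orders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (sales_order_key INTEGER, customer_key INTEGER, create_date TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 101, "2020-01-05"),
    (2, 101, "2020-01-20"),
    (3, 102, "2020-01-11"),
    (4, 101, "2020-02-02"),
    (5, 103, "2020-02-09"),
])

sql = """
WITH per_customer AS (          -- step 1: purchases per customer and month
    SELECT customer_key,
           substr(create_date, 1, 7) AS umonth,
           count(*) AS purchase_num
    FROM orders
    GROUP BY customer_key, substr(create_date, 1, 7)
)
SELECT umonth,                  -- step 2: count(condition) and count(*) per month
       count(CASE WHEN purchase_num > 1 THEN 1 END) AS repeat_customers,
       count(*) AS all_customers
FROM per_customer
GROUP BY umonth
ORDER BY umonth
"""
rows = conn.execute(sql).fetchall()
repeat_rate = {month: repeat / total for month, repeat, total in rows}
print(repeat_rate)
```

In January customer 101 bought twice and customer 102 once, so one of two customers is a repeat buyer; in February nobody bought twice.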
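Buy-back rate:
- 1. Select the customers of the current month and of the previous month.
- 2. Join the current month to the previous month on customer id; a left join is recommended.
- 3. A customer id with purchase records on both sides counts as a buy-back customer.
-- 1. Group by customer and purchase month:
select customer_key, substr(create_date, 1, 7) as umonth
from adventure_ods.ods_sales_orders
group by customer_key, substr(create_date, 1, 7);
-- 2. Self-join: join the result of step 1 to itself and use the join condition
--    to keep only pairs of consecutive months (a is one month before b).
select *
from () a
left join () b
  on a.customer_key = b.customer_key
 and <month of a> = <month of b> - 1
-- 3. On the joined result, group by a's month and count.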
Problem 3: List the products bought by each user ID ⚠️ A fairly hard problem with little practical use.
User ID | Product | Purchase date |
---|---|---|
1 | A | 2019-12-23 |
1 | B | 2019-12-23 |
2 | C | 2019-12-23 |
2 | A | 2019-12-24 |
2 | B | 2019-12-23 |
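Required output format: user ID-product 1-product 2
Example: 1-A-B (in chronological order; products bought at the same time may appear in either order)
Reference: https://www.jianshu.com/p/90d0657c0218
Approach:
- 1. Use a window function, partitioned by user ID, to rank each user's products by purchase time.
- 2. Use a left join (or another method) to combine the rows, keeping ranks 1 and 2.
- 3. Use concat or a similar function to build the final string.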
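The three steps above can be tried on the five sample rows from the problem statement. The sketch below is an assumption-laden illustration rather than the Hive answer: it runs on Python's built-in sqlite3, the `purchases` table and its column names are made up, conditional aggregation stands in for the join in step 2, and `product` is added to the ordering only so that same-day ties come out the same way every run.

```python
import sqlite3

# Hypothetical table holding the sample rows from the problem statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, product TEXT, buy_date TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", [
    (1, "A", "2019-12-23"),
    (1, "B", "2019-12-23"),
    (2, "C", "2019-12-23"),
    (2, "A", "2019-12-24"),
    (2, "B", "2019-12-23"),
])

sql = """
WITH ranked AS (                -- step 1: rank each user's products by time
    SELECT user_id, product,
           row_number() OVER (PARTITION BY user_id
                              ORDER BY buy_date, product) AS rn
    FROM purchases
)
SELECT user_id || '-' ||        -- step 3: build "user-product1-product2"
       max(CASE WHEN rn = 1 THEN product END) || '-' ||
       max(CASE WHEN rn = 2 THEN product END) AS result
FROM ranked
WHERE rn <= 2                   -- step 2: keep ranks 1 and 2 only
GROUP BY user_id
ORDER BY user_id
"""
results = [row[0] for row in conn.execute(sql)]
print(results)
```

A user with a single purchase would produce NULL here, because concatenating with a missing second product yields NULL; handling that case is left out of the sketch.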
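Problem 4: Find the customers who purchased in May of each year, together with the total number of such customers
Approach:
- 1. Filter to the month of May.
- 2. Deduplicate or group by customer.
- 3. Compute the total count and attach it to each row.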
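Problem 5: Query customers' purchase details together with the monthly purchase total
Notes:
- Append one field to each record: the total purchase amount of the month the record falls in.
- Window functions can be used.
Problem 6: In the scenario above, accumulate unit_price by date
Reference: https://www.iteye.com/blog/yugouai-1908121
This involves ROWS UNBOUNDED PRECEDING.
Problem 7: Find each customer's previous purchase time
Hint: use the offset window function lag().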
LAG(value_expr [, offset ] [, default ])
OVER ([ query_partition_clause ] order_by_clause)
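- LAG is an analytic function. The query clause produces a set of rows, which are then ordered by order_by_clause.
- lag() looks backwards, at the rows that precede the current row.
- value_expr is the expression (typically a column) whose value is read from that earlier row.
- offset (default 1) says how many rows back to look: the function returns value_expr from the row that is offset rows before the current row.
- default is a user-supplied value returned when no such row exists (for example, on the first row of a partition); if it is omitted, NULL is returned.
Example:
SELECT last_name, hire_date, salary,
       LAG(salary, 1, 0) OVER (ORDER BY hire_date) AS prev_sal
FROM employees
WHERE job_id = 'PU_CLERK';

LAST_NAME                 HIRE_DATE     SALARY   PREV_SAL
------------------------- --------- ---------- ----------
Khoo                      18-MAY-95       3100          0
Tobias                    24-JUL-97       2800       3100
Baida                     24-DEC-97       2900       2800
Himuro                    15-NOV-98       2600       2900
Colmenares                10-AUG-99       2500       2600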
Problem 8: Query the most recent 20% of orders by time
Hint: use ntile(x), which splits the rows into x buckets; x is an integer.
NTILE(x)
OVER ([ query_partition_clause ] order_by_clause)
- Divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition.
- This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics.
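To see the bucket numbering concretely, here is a small sketch on ten made-up orders, one per day. It is an illustration under assumptions: it runs on Python's built-in sqlite3 rather than Hive, and the `orders` table is hypothetical. Ordering by date descending puts the newest rows in bucket 1, so bucket 1 of 5 is the most recent 20%.

```python
import sqlite3

# Ten hypothetical orders dated 2020-01-01 .. 2020-01-10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (sales_order_key INTEGER, create_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(k, f"2020-01-{k:02d}") for k in range(1, 11)],
)

sql = """
SELECT sales_order_key, create_date
FROM (
    SELECT sales_order_key, create_date,
           ntile(5) OVER (ORDER BY create_date DESC) AS bucket
    FROM orders
) t
WHERE bucket = 1                -- newest fifth of the rows
ORDER BY create_date DESC
"""
latest = conn.execute(sql).fetchall()
print(latest)
```

With ten rows and five buckets, each bucket holds two rows, so the query returns the two newest orders.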
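Answers
Problem 1: For each customer, compute the largest monthly transaction amount up to each month and the cumulative transaction amount through that month
Step 1: Select the needed columns, group by customer id and year-month, and compute each group's order count and spend.
select customer_key,
       substr(create_date, 1, 7) as umonth,
       count(sales_order_key) as ucount,
       sum(unit_price) as usum
from adventure_ods.ods_sales_orders
group by customer_key, substr(create_date, 1, 7)  -- ⚠️ group by accepts expressions but not select-list aliases; using an alias raises an error ❌
order by customer_key asc, umonth asc             -- ⚠️ keep this when the output order matters; without order by the row order is not guaranteed
limit 10;
Step 2: Use window functions, ordering each customer's rows by month, to get the running maximum and the cumulative amount.
select t.customer_key, t.umonth, t.ucount,
       max(usum) over(partition by t.customer_key order by umonth) as current_max,
       sum(usum) over(partition by t.customer_key order by umonth) as current_sum
from (select customer_key,
             substr(create_date, 1, 7) as umonth,
             count(sales_order_key) as ucount,
             sum(unit_price) as usum
      from adventure_ods.ods_sales_orders
      group by customer_key, substr(create_date, 1, 7)
     ) as t
limit 10;
⚠️ The column names above also work without the t. prefix.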
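Reference: Hive analytic window functions (1): SUM, AVG, MIN, MAX https://www.cnblogs.com/qingyunzong/p/8782794.html
Note: the same query can also be written with the with tmp as () syntax:
with tmp as (
  select customer_key,
         substr(create_date,1,7) as umonth,
         count(sales_order_key) as ucount,
         sum(unit_price) as income_per_month
  from adventure_ods.ods_sales_orders
  group by customer_key, substr(create_date,1,7)
)
select customer_key, umonth, ucount,
       max(income_per_month) over(partition by customer_key order by umonth) as current_max,
       sum(income_per_month) over(partition by customer_key order by umonth) as current_sum
from tmp
limit 10;
⚠️ A reminder about the rows a window function covers: with partition by but no order by, the function covers every row of the current partition. With order by and no explicit frame, it covers the rows from the start of the partition up to the current row (including rows tied with it on the order by key), which is why max and sum above are running values.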
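The effect of adding order by inside over() is easiest to see side by side. The sketch below is a toy illustration, not part of the original answer: it uses Python's built-in sqlite3 instead of Hive, and the `monthly` table with three rows for one customer is made up.

```python
import sqlite3

# Hypothetical pre-aggregated monthly totals for a single customer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (customer_key INTEGER, umonth TEXT, usum REAL)")
conn.executemany("INSERT INTO monthly VALUES (?, ?, ?)", [
    (1, "2019-01", 10.0),
    (1, "2019-02", 30.0),
    (1, "2019-03", 20.0),
])

sql = """
SELECT umonth,
       -- no ORDER BY: the window is the whole partition
       sum(usum) OVER (PARTITION BY customer_key) AS total_sum,
       -- with ORDER BY: the window runs from the first row to the current row
       sum(usum) OVER (PARTITION BY customer_key ORDER BY umonth) AS running_sum,
       max(usum) OVER (PARTITION BY customer_key ORDER BY umonth) AS running_max
FROM monthly
ORDER BY umonth
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

total_sum is 60.0 on every row, while running_sum grows 10.0, 40.0, 60.0 and running_max stays at 30.0 once February has been seen.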
Problem 2: Compute the repeat-purchase rate and the buy-back rate
Repeat-purchase rate:
1. Count each customer's orders in the month (2020-02 here):
select customer_key, count(sales_order_key) as ncount
from adventure_ods.ods_sales_orders
where substr(create_date, 1, 7) = '2020-02'
group by customer_key;
2. Compute the rate:
select count(ncount),
       count(if(ncount>1, 1, null)),
       count(if(ncount>1, 1, null))/count(ncount) as ratio
from (select customer_key, count(sales_order_key) as ncount
      from adventure_ods.ods_sales_orders
      where substr(create_date, 1, 7) = '2020-02'
      group by customer_key) t;
Note: computing the rate for one month and for every month works the same way. The query below computes it for every month:
with tmp as (
  select customer_key,
         substr(create_date,1,7) as umonth,
         count(sales_order_key) as purchase_num
  from adventure_ods.ods_sales_orders
  group by customer_key, substr(create_date,1,7)
)
select tmp.umonth,
       count(if(tmp.purchase_num > 1, 1, null)) as a,  -- customers who purchased more than once
       count(tmp.customer_key) as b,                   -- all customers who purchased
       concat(round((count(if(tmp.purchase_num > 1, 1, null))/count(tmp.customer_key))*100, 2), "%") as ratio  -- a / b: repeat-purchase rate
from tmp
group by tmp.umonth;
-- Result
-- umonth a b ratio
-- 2018-12 0 1 0.0%
-- 2019-01 635 11628 5.46%
-- 2019-02 304 10784 2.82%
-- 2019-03 257 12034 2.14%
-- 2019-04 174 11722 1.48%
-- 2019-05 153 12141 1.26%
-- 2019-06 112 11796 0.95%
-- 2019-07 104 12190 0.85%
-- 2019-08 92 12209 0.75%
-- 2019-09 63 11826 0.53%
-- 2019-10 68 12226 0.56%
-- 2019-11 78 13297 0.59%
-- 2019-12 61 12229 0.5%
-- 2020-01 49 12249 0.4%
-- 2020-02 41 11059 0.37%
-- Time taken: 3.967 seconds, Fetched: 15 row(s)
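Buy-back rate:
My calculation: the buy-back rate for 2020-02
-- number of customers who purchased in both January and February (same_num)
select count(*)
from (select customer_key
      from adventure_ods.ods_sales_orders
      where substr(create_date, 1, 7) = '2020-01'
      group by customer_key) a1
inner join
     (select customer_key
      from adventure_ods.ods_sales_orders
      where substr(create_date, 1, 7) = '2020-02'
      group by customer_key) a2
  on a1.customer_key = a2.customer_key;
-- number of customers who purchased in February
select count(*)
from (select customer_key
      from adventure_ods.ods_sales_orders
      where substr(create_date, 1, 7) = '2020-02'
      group by customer_key) a;
-- Finally, divide the first number by the second.
Buy-back rate for every month in the table:
Method 1: the straightforward approach (recommended) ✅
- tmp: one row per customer and purchase month.
- tmp2: the customers who purchased in two consecutive months, i.e. the buy-back customers.
- tmp3: the number of buy-back customers per month.
- tmp4: the number of customers per month.
- Finally, tmp3 / tmp4 gives the buy-back rate table.
with tmp as (
  select customer_key, substr(create_date,1,7) as umonth
  from adventure_ods.ods_sales_orders
  group by customer_key, substr(create_date,1,7)
),
tmp2 as (
  -- inner join: records of customers who purchased in two consecutive months
  select a1.*
  from tmp as a1
  inner join tmp as a2
    on a1.customer_key = a2.customer_key
   and substr(a1.umonth,6,2) = substr(a2.umonth,6,2) - 1  -- consecutive months (a1 is the earlier one)
   and substr(a1.umonth,1,4) = substr(a2.umonth,1,4)      -- same year
),
tmp3 as (
  -- numerator, derived from the two CTEs above ✅
  select umonth, count(customer_key) as active_customer
  from tmp2
  group by umonth
),
tmp4 as (
  -- denominator: group by month and count the customers who purchased in each month ✅
  select substr(create_date,1,7) as umonth, count(distinct customer_key) as num
  from adventure_ods.ods_sales_orders
  group by substr(create_date,1,7)
)
select tmp4.umonth,
       tmp3.active_customer as active_customer,
       tmp4.num as current_customer,
       concat(round((tmp3.active_customer / tmp4.num)*100, 2), "%") as ratio
from tmp4
left join tmp3 on tmp4.umonth = tmp3.umonth;
-- umonth   active_customer  current_customer  ratio
-- 2018-12  NULL  1      NULL
-- 2019-01  600   11628  5.16%
-- 2019-02  423   10784  3.92%
-- 2019-03  353   12034  2.93%
-- 2019-04  291   11722  2.48%
-- 2019-05  240   12141  1.98%
-- 2019-06  182   11796  1.54%
-- 2019-07  189   12190  1.55%
-- 2019-08  156   12209  1.28%
-- 2019-09  132   11826  1.12%
-- 2019-10  161   12226  1.32%
-- 2019-11  112   13297  0.84%
-- 2019-12  NULL  12229  NULL
-- 2020-01  86    12249  0.7%
-- 2020-02  NULL  11059  NULL
Note: because tmp2 keeps a1, the earlier month of each pair, every row here counts that month's customers who purchased again in the following month. 2020-02 is NULL because it is the last month in the data, and 2018-12 and 2019-12 are NULL because the same-year condition can never match a December with the following January. To follow the definition given in the problem statement (current month and previous month), attribute each pair to the later month instead, as the 2020-02 calculation above does.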
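Comparing the month digits and the year digits separately cannot pair a December with the following January. One possible workaround, shown here only as a sketch and not as the original author's method, is to turn each month into a single running index (year * 12 + month) and join on index - 1. The example runs on Python's built-in sqlite3 with a made-up `orders` table, and it attributes each pair to the later month, following the definition in the problem statement.

```python
import sqlite3

# Hypothetical orders spanning a year boundary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_key INTEGER, create_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2019-12-03"), (2, "2019-12-15"),
    (1, "2020-01-07"), (3, "2020-01-09"),
    (1, "2020-02-01"), (3, "2020-02-20"), (4, "2020-02-25"),
])

sql = """
WITH tmp AS (                   -- one row per customer and month
    SELECT DISTINCT customer_key,
           substr(create_date, 1, 7) AS umonth,
           CAST(substr(create_date, 1, 4) AS INTEGER) * 12
             + CAST(substr(create_date, 6, 2) AS INTEGER) AS month_idx
    FROM orders
)
SELECT cur.umonth,
       count(prev.customer_key) AS returning_customers,  -- also bought last month
       count(*) AS current_customers
FROM tmp AS cur
LEFT JOIN tmp AS prev
       ON prev.customer_key = cur.customer_key
      AND prev.month_idx = cur.month_idx - 1             -- works across years
GROUP BY cur.umonth
ORDER BY cur.umonth
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

Customer 1 bought in December 2019 and January 2020, and that pair is counted for 2020-01 even though the two months fall in different years.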
Method 2: a left join, without with.
Step 1:
-- Self-join: join the per-customer, per-month result to itself.
-- substr(a.umonth,6,2) = substr(b.umonth,6,2) - 1 pairs each month with the month that follows it.
select *
from (select customer_key, substr(create_date,1,7) as umonth
      from ods_sales_orders
      group by customer_key, substr(create_date,1,7)) a
left join
     (select customer_key, substr(create_date,1,7) as umonth
      from ods_sales_orders
      group by customer_key, substr(create_date,1,7)) b
  on a.customer_key = b.customer_key
 and substr(a.umonth,6,2) = (substr(b.umonth,6,2) - 1)
 and substr(a.umonth,1,4) = substr(b.umonth,1,4)  -- same year
limit 10;
Step 2:
-- Based on the join above, group by a.umonth and compute the rate.
select a.umonth,
       count(a.customer_key) as mcount,
       count(b.customer_key) as lcount,
       concat(round((count(b.customer_key)/count(a.customer_key))*100,2),"%") as ratio  -- an alias cannot be reused inside another expression of the same select list
from (select customer_key, substr(create_date,1,7) as umonth
      from ods_sales_orders
      group by customer_key, substr(create_date,1,7)) a
left join
     (select customer_key, substr(create_date,1,7) as umonth
      from ods_sales_orders
      group by customer_key, substr(create_date,1,7)) b
  on a.customer_key = b.customer_key
 and substr(a.umonth,6,2) = (substr(b.umonth,6,2) - 1)
 and substr(a.umonth,1,4) = substr(b.umonth,1,4)
group by a.umonth;
Problem 3: List the products bought by each user ID
with tmp as (
  select customer_key, cpzl_zw, order_num, cpzl_zw1
  from (select customer_key,
               cpzl_zw,
               row_number() over(partition by customer_key order by create_date asc) as order_num,
               lag(cpzl_zw, 1, null) over(partition by customer_key order by create_date asc) as cpzl_zw1
        from ods_sales_orders) as a
  -- keep a row when its subcategory differs from the previous row's;
  -- the first row of each customer has cpzl_zw1 = null and has to be kept explicitly
  where cpzl_zw1 is null or cpzl_zw != cpzl_zw1
),
tmp2 as (
  select customer_key, cpzl_zw, order_num, cpzl_zw1,
         row_number() over(partition by customer_key order by order_num) as cpzl_zw_num
  from tmp
)
-- note: collect_set does not guarantee the order of its elements
select concat(customer_key, '-', concat_ws('-', collect_set(cpzl_zw)))
from tmp2
where cpzl_zw_num < 3
group by customer_key;
- How the lag window function is used.
- How concat() is used.
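Problem 4: Find the customers who purchased in May of each year, together with the total number of such customers
over() specifies the window of rows the function works on; when over() is empty, the window is all rows.
-- The window function runs after group by, so count(*) over() counts the grouped rows, i.e. the distinct customers.
-- This query pools May of every year into one list.
select customer_key, count(*) over() as total_customers
from ods_sales_orders
where month(create_date) = 5
group by customer_key;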
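If the total is wanted per year rather than pooled over all years, one option is to deduplicate by year and customer first and then count inside a window partitioned by year. This is a hedged sketch under assumptions, not the original answer: it runs on Python's built-in sqlite3, and the `orders` table and its rows are made up.

```python
import sqlite3

# Hypothetical orders: two May customers in 2019, three in 2020.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_key INTEGER, create_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2019-05-02"), (1, "2019-05-20"),
    (2, "2019-05-11"),
    (3, "2019-06-01"),
    (2, "2020-05-03"), (4, "2020-05-09"), (5, "2020-05-30"),
])

sql = """
SELECT uyear, customer_key,
       count(*) OVER (PARTITION BY uyear) AS may_customers
FROM (
    -- one row per year and customer who bought in May
    SELECT DISTINCT substr(create_date, 1, 4) AS uyear, customer_key
    FROM orders
    WHERE substr(create_date, 6, 2) = '05'
) t
ORDER BY uyear, customer_key
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

Customer 1 bought twice in May 2019 but is listed once, and each row carries the number of May customers of its own year.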
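Problem 5: Query customers' purchase details together with the monthly purchase total
-- The partition is the month only, so month_total covers all customers in that month;
-- add customer_key to partition by if a per-customer monthly total is wanted.
select *, sum(unit_price) over(partition by substr(create_date,1,7)) as month_total
from ods_sales_orders
limit 10;
Problem 6: In the scenario above, accumulate unit_price by date
select *,
       sum(unit_price) over(sort by create_date rows between unbounded preceding and current row) as sumcost
from adventure_ods.ods_sales_orders;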
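The explicit rows frame matters when several orders share a date. The sketch below compares it with the default frame on four made-up rows. It is only an illustration: it runs on Python's built-in sqlite3 rather than Hive, the `orders` table is hypothetical, and sales_order_key is added to the ordering so that the rows-based result is deterministic.

```python
import sqlite3

# Hypothetical orders; keys 2 and 3 share the same date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (sales_order_key INTEGER, create_date TEXT, unit_price INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2020-01-01", 10),
    (2, "2020-01-02", 20),
    (3, "2020-01-02", 30),
    (4, "2020-01-03", 40),
])

sql = """
SELECT sales_order_key,
       -- ROWS frame: accumulates one physical row at a time
       sum(unit_price) OVER (ORDER BY create_date, sales_order_key
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS by_rows,
       -- default frame with ORDER BY: rows tied on create_date are added together
       sum(unit_price) OVER (ORDER BY create_date) AS by_default_range
FROM orders
ORDER BY sales_order_key
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

For the two orders dated 2020-01-02, by_rows gives 30 and then 60, while by_default_range gives 60 for both because tied rows enter the window together.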
Problem 7: Find each customer's previous purchase time
select *,
       lag(create_date,1) over(distribute by customer_key sort by create_date) as last_purchase_date
from ods_sales_orders
limit 10;
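What lag returns per customer can be checked on a handful of rows. This sketch is an illustration under assumptions: it runs on Python's built-in sqlite3 instead of Hive, uses partition by / order by where the Hive query uses distribute by / sort by, and the `orders` table is made up.

```python
import sqlite3

# Hypothetical orders: three for customer 1, one for customer 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_key INTEGER, create_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2020-01-05"),
    (1, "2020-02-10"),
    (1, "2020-03-15"),
    (2, "2020-01-20"),
])

sql = """
SELECT customer_key, create_date,
       lag(create_date, 1) OVER (PARTITION BY customer_key
                                 ORDER BY create_date) AS prev_date
FROM orders
ORDER BY customer_key, create_date
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

The first order of each customer has no earlier row in its partition, so prev_date is NULL there (None in Python) because no default value was supplied.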
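Problem 8: Query the most recent 20% of orders by time
Use ntile to split the orders into 5 buckets by order time.
-- sorting in descending order puts the most recent orders in bucket 1
select *
from (select *, ntile(5) over(sort by create_date desc) as five_num
      from adventure_ods.ods_sales_orders) t
where five_num = 1;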