I. DWS Layer
1. Overview
Data from the DWD layer is lightly aggregated by day into wide tables.
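| Table | Granularity |
|---|---|
| dws_uv_detail_daycount | One row per device |
| dws_user_action_daycount (only members who logged in today) | One row per member |
| dws_sku_action_daycount (only SKUs that were ordered, reviewed, paid for, added to a cart, or favorited) | One row per SKU |
| dws_coupon_use_daycount (only coupons that have not expired) | One row per coupon |
| dws_activity_info_daycount (all activities) | One row per activity |
| dws_sale_detail_daycount (daily purchase data) | One row per user per SKU |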
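2. dws_uv_detail_daycount (daily device behavior): one device can have several behaviors per day, so the distinct values in each column are concatenated into a single string
(1) Create table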
create external table dws_uv_detail_daycount (
-- the following fields are taken from the start log table dwd_start_log
`mid_id` string COMMENT 'Unique device id',
`user_id` string COMMENT 'User id',
`version_code` string COMMENT 'App version code',
`version_name` string COMMENT 'App version name',
`lang` string COMMENT 'System language',
`source` string COMMENT 'Channel id',
`os` string COMMENT 'Android OS version',
`area` string COMMENT 'Region',
`model` string COMMENT 'Phone model',
`brand` string COMMENT 'Phone brand',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT 'Screen width and height',
`app_time` string COMMENT 'Time the log was generated on the client',
`network` string COMMENT 'Network mode',
`lng` string COMMENT 'Longitude',
`lat` string COMMENT 'Latitude',
-- aggregated from dwd_start_log by mid_id, then count(*)
`login_count` bigint COMMENT 'Number of times active'
)
partitioned by (`dt` string)
(2) Load data
insert overwrite table dws_uv_detail_daycount PARTITION(dt='2020-05-06') select mid_id, concat_ws('|',collect_set(user_id)), concat_ws('|',collect_set(version_code)), concat_ws('|',collect_set(version_name)), concat_ws('|',collect_set(lang)), concat_ws('|',collect_set(source)), concat_ws('|',collect_set(os)), concat_ws('|',collect_set(area)), concat_ws('|',collect_set(model)), concat_ws('|',collect_set(brand)), concat_ws('|',collect_set(sdk_version)), concat_ws('|',collect_set(gmail)), concat_ws('|',collect_set(height_width)), concat_ws('|',collect_set(app_time)), concat_ws('|',collect_set(network)), concat_ws('|',collect_set(lng)), concat_ws('|',collect_set(lat)), count(*) FROM dwd_start_log where dt='2020-05-06' GROUP by mid_id
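To make the per-device roll-up above concrete, here is a minimal Python sketch of what `concat_ws('|', collect_set(col))` plus `count(*)` computes for each `mid_id`. The function name and the reduced column set (`user_id`, `area`) are assumptions for illustration only. Hive does not guarantee the order of `collect_set`, whereas this sketch keeps first-seen order.

```python
from collections import defaultdict

def daily_device_rollup(rows):
    """rows: iterable of (mid_id, user_id, area) tuples from one day's start log.

    Hypothetical helper mirroring GROUP BY mid_id with collect_set + concat_ws.
    """
    acc = defaultdict(lambda: {"user_id": [], "area": [], "login_count": 0})
    for mid_id, user_id, area in rows:
        agg = acc[mid_id]
        agg["login_count"] += 1  # count(*)
        for col, val in (("user_id", user_id), ("area", area)):
            # collect_set skips NULLs and keeps each distinct value once
            if val is not None and val not in agg[col]:
                agg[col].append(val)
    return {
        mid: {
            "user_id": "|".join(a["user_id"]),  # concat_ws('|', ...)
            "area": "|".join(a["area"]),
            "login_count": a["login_count"],
        }
        for mid, a in acc.items()
    }
```

A device that started the app three times from two regions therefore ends up as one row with `area` equal to something like `bj|sh` and `login_count` equal to 3.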
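3. dws_user_action_daycount (daily member behavior)
Per-user counts and amounts for logins, cart additions, orders, and payments.
The load uses WITH clauses, which behave like temporary tables or views scoped to the statement.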
with t1 as (select user_id, count(*) login_count from dwd_start_log where dt='2020-05-06' and user_id is not NULL GROUP BY user_id),
t2 as (select user_id, count(*) cart_count, sum(cart_price*sku_num) cart_amount from dwd_fact_cart_info where dt='2020-05-06' and date_format(create_time,'yyyy-MM-dd')='2020-05-06' GROUP by user_id),
t3 as (select user_id, count(*) order_count, sum(final_total_amount) order_amount from dwd_fact_order_info where dt='2020-05-06' GROUP by user_id),
t4 as (select user_id, count(*) payment_count, sum(payment_amount) payment_amount from dwd_fact_payment_info where dt='2020-05-06' GROUP by user_id)
insert overwrite TABLE dws_user_action_daycount PARTITION(dt='2020-05-06')
select t1.user_id, login_count,
nvl(cart_count,0), nvl(cart_amount,0),
nvl(order_count,0), nvl(order_amount,0),
nvl(payment_count,0), nvl(payment_amount,0)
from t1
left join t2 on t1.user_id=t2.user_id
left join t3 on t1.user_id=t3.user_id
left join t4 on t1.user_id=t4.user_id
4. dws_sku_action_daycount (daily SKU behavior)
Per-SKU counts of orders, payments, refunds, cart additions, positive reviews, and negative reviews
with t1 as (select sku_id, count(*) order_count, sum(sku_num) order_num, sum(total_amount) order_amount from dwd_fact_order_detail where dt='2020-05-06' GROUP by sku_id),
t2 as (select sku_id, sum(sku_num) payment_num, sum(total_amount) payment_amount, count(*) payment_count from (SELECT order_id, sku_id, sku_num, total_amount from dwd_fact_order_detail where dt='2020-05-06' or dt=date_sub('2020-05-06',1)) tmp1 join (select order_id from dwd_fact_payment_info where dt='2020-05-06') tmp2 on tmp1.order_id=tmp2.order_id GROUP by sku_id),
t3 as (SELECT sku_id, count(*) refund_count, sum(refund_num) refund_num, sum(refund_amount) refund_amount from dwd_fact_order_refund_info where dt='2020-05-06' GROUP by sku_id),
t4 as (select sku_id, count(*) cart_count, sum(sku_num) cart_num from dwd_fact_cart_info where dt='2020-05-06' and sku_num>0 GROUP by sku_id),
t5 as (SELECT sku_id, count(*) favor_count from dwd_fact_favor_info where dt='2020-05-06' and is_cancel=0 group by sku_id),
t6 as (SELECT sku_id, sum(if(appraise='1201',1,0)) appraise_good_count, sum(if(appraise='1202',1,0)) appraise_mid_count, sum(if(appraise='1203',1,0)) appraise_bad_count, sum(if(appraise='1204',1,0)) appraise_default_count from dwd_fact_comment_info where dt='2020-05-06' group by sku_id)
insert overwrite table dws_sku_action_daycount partition(dt='2020-05-06')
SELECT coalesce(t1.sku_id,t2.sku_id,t3.sku_id,t4.sku_id,t5.sku_id,t6.sku_id),
nvl(order_count,0), nvl(order_num,0), nvl(order_amount,0),
nvl(payment_count,0), nvl(payment_num,0), nvl(payment_amount,0),
nvl(refund_count,0), nvl(refund_num,0), nvl(refund_amount,0),
nvl(cart_count,0), nvl(cart_num,0),
nvl(favor_count,0),
nvl(appraise_good_count,0), nvl(appraise_mid_count,0), nvl(appraise_bad_count,0), nvl(appraise_default_count,0)
from t1
full join t2 on t1.sku_id=t2.sku_id
full join t3 on coalesce(t1.sku_id,t2.sku_id)=t3.sku_id
full join t4 on coalesce(t1.sku_id,t2.sku_id,t3.sku_id)=t4.sku_id
full join t5 on coalesce(t1.sku_id,t2.sku_id,t3.sku_id,t4.sku_id)=t5.sku_id
full join t6 on coalesce(t1.sku_id,t2.sku_id,t3.sku_id,t4.sku_id,t5.sku_id)=t6.sku_id
5. dws_coupon_use_daycount (daily coupon usage)
Coupon scope, product id, brand, category, and the number of times the coupon was claimed and used in orders
insert overwrite table dws_coupon_use_daycount PARTITION(dt='2020-05-06') select t1.id coupon_id,coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, spu_id, tm_id, category3_id, limit_num, get_count,using_count, used_count from (SELECT * from dwd_dim_coupon_info where dt='2020-05-06' and nvl(expire_time,'9999-99-99') >'2020-05-06') t1 left join (select coupon_id, sum(if(date_format(get_time,'yyyy-MM-dd')='2020-05-06',1,0)) get_count, sum(if(date_format(using_time,'yyyy-MM-dd')='2020-05-06',1,0)) using_count, sum(if(date_format(used_time,'yyyy-MM-dd')='2020-05-06',1,0)) used_count from dwd_fact_coupon_use GROUP by coupon_id) t2 on t1.id=t2.coupon_id
6. dws_activity_info_daycount (daily activity behavior)
Activity type, start and end times, and order and payment counts
with t1 as (select id,activity_name,activity_type, start_time,end_time,create_time from dwd_dim_activity_info where dt='2020-05-06' GROUP by id,activity_name,activity_type, start_time,end_time,create_time), t2 as (select activity_id,count(*) order_count from dwd_fact_order_info where dt='2020-05-06' GROUP by activity_id), t5 as (SELECT activity_id,count(*) payment_count from (SELECT order_id,id from dwd_fact_payment_info where dt='2020-05-06') t3 join (SELECT id,activity_id from dwd_fact_order_info WHERE dt='2020-05-06' or dt=date_sub('2020-05-06',1)) t4 on t3.order_id=t4.id GROUP by activity_id) insert overwrite table dws_activity_info_daycount partition(dt='2020-05-06') SELECT t1.id,activity_name, activity_type, start_time, end_time, create_time, nvl(order_count,0), nvl(payment_count,0) from t1 left join t2 on t1.id=t2.activity_id left join t5 on t1.id=t5.activity_id
7. dws_sale_detail_daycount (daily per-user purchase details)
User, product, SKU, quantity purchased, order count, and order amount
GROUP by user_id,sku_id) insert overwrite table dws_sale_detail_daycount PARTITION(dt='2020-05-06') SELECT t7.user_id, t7.sku_id, user_gender, user_age, user_level, order_price, sku_name, sku_tm_id, sku_category3_id, sku_category2_id, sku_category1_id, sku_category3_name, sku_category2_name, sku_category1_name, spu_id, sku_num, order_count, order_amount from (select nvl(t3.user_id,t4.user_id) user_id, nvl(t3.sku_id,t4.sku_id) sku_id, nvl(order_count,0) order_count, nvl(order_amount,0) order_amount, nvl(sku_num,0) sku_num FROM t3 full join t4 on t3.user_id=t4.user_id and t3.sku_id=t4.sku_id) t7 join t1 on t7.user_id=t1.user_id join t2 on t7.sku_id=t2.sku_id
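II. DWT Layer: merging the current table with the daily DWS summaries (merge and update)
1. Overview
The DWT layer accumulates the daily aggregates produced by the DWS layer.
DWT tables are not partitioned; each one is a cumulative full table.
Updating a cumulative full table: ① query the old data that will change ② query the new and changed data ③ join old and new, letting new values replace old ones ④ overwrite the table with the result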
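2. dwt_uv_topic
create external table dwt_uv_topic
User and device info, first active date, activity on the current day, last active date, cumulative active days
Old user who did not log in today: new.mid_id is null
Old user: old.mid_id is not null
New user: old.mid_id is null
Old user who logged in today: new.mid_id is not null and old.mid_id is not null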
insert overwrite table gmall.dwt_uv_topic select nvl(old.mid_id,new.mid_id), concat_ws('|',old.user_id,new.user_id), concat_ws('|',old.version_code,new.version_code), concat_ws('|',old.version_name,new.version_name), concat_ws('|',old.lang,new.lang), concat_ws('|',old.source,new.source), concat_ws('|',old.os,new.os), concat_ws('|',old.area,new.area), concat_ws('|',old.model,new.model), concat_ws('|',old.brand,new.brand), concat_ws('|',old.sdk_version,new.sdk_version), concat_ws('|',old.gmail,new.gmail), concat_ws('|',old.height_width,new.height_width), concat_ws('|',old.app_time,new.app_time), concat_ws('|',old.network,new.network), concat_ws('|',old.lng,new.lng), concat_ws('|',old.lat,new.lat), nvl(old.login_date_first,'2020-05-06') login_date_first, IF(new.mid_id is null,old.login_date_last,'2020-05-06') login_date_last, nvl(new.login_count,0) login_day_count, nvl(old.login_count,0)+if(new.login_count is not null,1,0) login_count from dwt_uv_topic old full join (select * from dws_uv_detail_daycount where dt='2020-05-06') new on old.mid_id=new.mid_id
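The full join above follows the four-step cumulative update from the overview. As a rough Python sketch of the same merge, reduced to the date and counter columns: the function name `merge_topic` and the dictionary shapes are assumptions for illustration, not part of the warehouse code.

```python
def merge_topic(old, new, today):
    """Merge the cumulative table (old) with today's daily partition (new).

    old: {mid_id: {"login_date_first", "login_date_last", "login_count"}}
    new: {mid_id: number of starts today}
    Returns the rows that would overwrite the cumulative table.
    """
    merged = {}
    for mid in old.keys() | new.keys():  # full join on mid_id
        o = old.get(mid)                 # None means a new device
        active_today = mid in new        # False means an old device that was idle today
        merged[mid] = {
            "login_date_first": o["login_date_first"] if o else today,
            "login_date_last": today if active_today else o["login_date_last"],
            "login_day_count": new.get(mid, 0),
            # cumulative active days grow by at most one per day
            "login_count": (o["login_count"] if o else 0) + (1 if active_today else 0),
        }
    return merged
```

Each of the three cases listed above (old and idle, old and active, new) takes a different branch of the conditional expressions, which is what `nvl` and `if` do in the SQL.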
3. dwt_user_topic
First and last login and order dates, active days, plus cumulative and last-30-day order and payment amounts and counts
insert overwrite table dwt_user_topic SELECT t1.user_id,login_date_first, login_date_last, login_count, nvl(login_last_30d_count,0), order_date_first, order_date_last, order_count, order_amount, nvl(order_last_30d_count,0), nvl(order_last_30d_amount,0), payment_date_first, payment_date_last, payment_count, payment_amount, nvl(payment_last_30d_count,0), nvl(payment_last_30d_amount,0) from (SELECT nvl(old.user_id,new.user_id) user_id, nvl(old.login_date_first,'2020-05-06') login_date_first, nvl(old.order_date_first,if(new.order_count>0,'2020-05-06',null)) order_date_first, nvl(old.payment_date_first,if(new.payment_count>0,'2020-05-06',null)) payment_date_first, if(new.user_id is null,old.login_date_last,'2020-05-06') login_date_last, if(new.order_count>0,'2020-05-06',old.order_date_last) order_date_last, if(new.payment_count>0,'2020-05-06',old.payment_date_last) payment_date_last, nvl(old.login_count,0)+if(new.user_id is not null,1,0) login_count, nvl(old.order_count,0)+nvl(new.order_count,0) order_count, nvl(old.order_amount,0)+nvl(new.order_amount,0) order_amount, nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count, nvl(old.payment_amount,0)+nvl(new.payment_amount,0) payment_amount from dwt_user_topic old full join (select * from dws_user_action_daycount where dt='2020-05-06') new on old.user_id=new.user_id) t1 left join ( SELECT user_id, sum(order_count) order_last_30d_count, sum(order_amount) order_last_30d_amount, sum(payment_count) payment_last_30d_count, sum(payment_amount) payment_last_30d_amount, count(*) login_last_30d_count FROM dws_user_action_daycount where dt BETWEEN date_sub('2020-05-06',29) and '2020-05-06' GROUP by user_id) t2 on t1.user_id=t2.user_id
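4. dwt_sku_topic
Last-30-day and cumulative counts of orders, payments, refunds, cart additions, favorites, and good, neutral, and bad reviews
create external table dwt_sku_topic (
sku_id string comment 'sku_id',
spu_id string comment 'spu_id',
-- from dws_sku_action_daycount, where (today minus 30 days) <= dt <= today, sum()
order_last_30d_count bigint comment 'Number of times ordered in the last 30 days',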
insert overwrite TABLE dwt_sku_topic SELECT t2.sku_id, t2.spu_id, nvl(order_last_30d_count,0),
5. dwt_coupon_topic
Coupon claim, order, and payment counts for the current day and cumulatively
insert overwrite table dwt_coupon_topic
select nvl(old.coupon_id,new.coupon_id) coupon_id,
nvl(new.get_count,0) get_day_count,
nvl(new.using_count,0) using_day_count,
nvl(new.used_count,0) used_day_count,
nvl(old.get_count,0)+nvl(new.get_count,0) get_count,
nvl(old.using_count,0)+nvl(new.using_count,0) using_count,
nvl(old.used_count,0)+nvl(new.used_count,0) used_count
from dwt_coupon_topic old
full join (select * from dws_coupon_use_daycount where dt='2020-05-06') new on old.coupon_id=new.coupon_id
6. dwt_activity_topic
Activity order and payment counts for the current day and cumulatively
insert overwrite table dwt_activity_topic select nvl(old.id,new.id) id, nvl(old.activity_name,new.activity_name) activity_name, nvl(new.order_count,0) order_day_count, nvl(new.payment_count,0) payment_day_count, nvl(old.order_count,0)+nvl(new.order_count,0) order_count, nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count from dwt_activity_topic old full join (select * from dws_activity_info_daycount where dt='2020-05-06')new on old.id=new.id
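III. ADS Layer
1. Overview
Requirements are grouped by the data source they query.
One table is created for each group of similar requirements.
All tables created in this layer are full tables.
2. Generating test data
Set the cluster clock to the day before the date of the data to be loaded.
Upload the jar.
Start the collection pipeline and start Hive.
Run the script.
3. Device topic
(1) Active devices (daily, weekly, monthly)
Data comes from dws_uv_detail_daycount or dwt_uv_topic.
Daily, weekly, and monthly actives (with flag fields for week end and month end): a device counts if it was active at least once in the period.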
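create external table ads_uv_count(
`dt` string COMMENT 'Statistics date',
-- current day, taken from the DWS layer (DWT also works)
`day_count` bigint COMMENT 'Number of users active today',
-- current week, taken from the DWS layer (DWT also works)
`wk_count` bigint COMMENT 'Number of users active this week',
-- current month, taken from the DWS layer (DWT also works)
`mn_count` bigint COMMENT 'Number of users active this month',
-- derived with next_day()
`is_weekend` string COMMENT 'Y/N: whether the date is the end of the week, used to pick the final result for the week',
-- derived with last_day()
`is_monthend` string COMMENT 'Y/N: whether the date is the end of the month, used to pick the final result for the month'
) COMMENT 'Active device counts'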
insert into table ads_uv_count SELECT '2020-05-06',day_count,wk_count,mn_count, if('2020-05-06'=date_sub(next_day('2020-05-06','MO'),1),'Y','N') is_weekend, if('2020-05-06'=last_day('2020-05-06'),'Y','N') is_monthend from (SELECT '2020-05-06' dt,count(*) day_count
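The two flag expressions rely on Hive's `next_day` (the first date strictly after the argument that falls on the given weekday) and `last_day` (the last day of the argument's month). A small Python sketch of the same date arithmetic may help when checking edge cases; the helper names below simply mirror the Hive functions and are otherwise illustrative.

```python
import calendar
from datetime import date, timedelta

def next_day(d, weekday):
    """First date strictly after d on the given weekday (0 = Monday)."""
    delta = (weekday - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=delta)

def last_day(d):
    """Last day of d's month."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

def uv_flags(d):
    """Return (is_weekend, is_monthend) as 'Y'/'N', as in ads_uv_count."""
    # Sunday of d's week is the day before the next Monday
    is_weekend = "Y" if d == next_day(d, 0) - timedelta(days=1) else "N"
    is_monthend = "Y" if d == last_day(d) else "N"
    return is_weekend, is_monthend
```

For example, `uv_flags(date(2020, 5, 6))` describes a Wednesday in mid-month, so neither flag is set, while a Sunday that is also the 31st sets both.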
(2) New devices per day: login_date_first = today
insert into ads_new_mid_count SELECT '2020-05-06' create_date, count(*) new_mid_count FROM dwt_uv_topic where login_date_first='2020-05-06';
(3) Silent users
Started the app only on the day of installation: login_date_first = login_date_last
That start was more than 7 days ago: login_date_last < today minus 7 days
insert into table ads_silent_count SELECT '2020-05-06', count(*) from dwt_uv_topic where login_date_first=login_date_last and login_date_last<date_sub('2020-05-06',7)
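(4) Returning users this week
Old users who logged in this week but did not log in last week
Left join this week's active set to last week's active set and keep the difference (active this week but not last week):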
on t1.mid_id=t2.mid_id
where t2.mid_id is null
login_date_last >= date_sub(next_day('2020-05-06','MO'),7)
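The left join with `t2.mid_id is null` is a set difference. A hedged Python sketch of the whole calculation follows; the input shapes (`topic` for dwt_uv_topic, `daily_active` for the daily partitions of dws_uv_detail_daycount) and the rule that a device first seen this week does not count as returning are assumptions made for illustration.

```python
from datetime import date, timedelta

def returning_devices(topic, daily_active, today):
    """Devices active this week that existed before this week and were idle last week.

    topic: {mid_id: (login_date_first, login_date_last)}
    daily_active: {date: set of mid_id active on that date}
    """
    monday = today - timedelta(days=today.weekday())  # date_sub(next_day(today,'MO'),7)
    # t1: old devices whose last activity falls in the current week
    this_week_old = {
        mid for mid, (first, last) in topic.items()
        if last >= monday and first < monday
    }
    # t2: every device active on any of the seven days of last week
    last_week = set()
    for i in range(1, 8):
        last_week |= daily_active.get(monday - timedelta(days=i), set())
    # left join ... where t2.mid_id is null
    return this_week_old - last_week
```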
(5) Churned users: devices that have not been active for 7 consecutive days
login_date_last<date_sub('2020-05-06',7)
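(6) Retention rate: retained users as a share of the users added on a given day
Of the users added on a given day, those still using the app n days later are called retained users
① The number of users added on a given day
② The retention interval: retention date = date added + retention days
③ The number of those users who are active on the retention date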
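The three steps above can be sketched in a few lines of Python. The function name and input shapes are hypothetical: `first_login` stands in for `login_date_first` in dwt_uv_topic, and `daily_active` for the daily device partitions.

```python
from datetime import date, timedelta

def retention_rate(first_login, daily_active, create_date, retention_day):
    """Return (new_count, retained_count, rate in percent or None).

    first_login: {mid_id: date the device was first active}
    daily_active: {date: set of mid_id active on that date}
    """
    # Step 1: devices added on create_date
    new_devices = {m for m, first in first_login.items() if first == create_date}
    # Step 2: retention date = date added + retention days
    retention_date = create_date + timedelta(days=retention_day)
    # Step 3: of those devices, the ones active on the retention date
    retained = new_devices & daily_active.get(retention_date, set())
    rate = round(len(retained) / len(new_devices) * 100, 2) if new_devices else None
    return len(new_devices), len(retained), rate
```

Devices that are active on the retention date but were not added on `create_date` are excluded by the intersection, which is the part that is easy to get wrong in SQL.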
(7) Users active in each of the last three weeks
A user must appear at least once in each of the three weeks
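One way to read this rule is as an intersection of three weekly active sets. The sketch below assumes weeks run Monday to Sunday and that the current, possibly partial, week is the most recent of the three; both are assumptions, as are the names.

```python
from datetime import date, timedelta

def active_three_weeks(daily_active, today):
    """Devices seen at least once in each of the last three calendar weeks.

    daily_active: {date: set of mid_id active on that date}
    """
    monday = today - timedelta(days=today.weekday())
    weekly_sets = []
    for w in range(3):
        start = monday - timedelta(weeks=w)
        days = [start + timedelta(days=i) for i in range(7)]
        # union of the seven daily sets gives that week's active devices
        weekly_sets.append(set().union(*(daily_active.get(d, set()) for d in days)))
    return set.intersection(*weekly_sets)
```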
(8) Users active on three consecutive days within the last seven days
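A common way to detect runs of consecutive days is to subtract each date's rank from the date itself: within a run the difference is constant. The Python sketch below applies that idea; it is one possible approach, not necessarily the one used in the original query, and the names are illustrative.

```python
from collections import Counter
from datetime import date, timedelta

def active_three_consecutive_days(daily_active, today):
    """Devices with at least 3 consecutive active days in the 7 days ending today.

    daily_active: {date: set of mid_id active on that date}
    """
    window = [today - timedelta(days=i) for i in range(6, -1, -1)]  # oldest first
    dates_by_mid = {}
    for d in window:
        for mid in daily_active.get(d, set()):
            dates_by_mid.setdefault(mid, []).append(d)  # stays sorted by date
    result = set()
    for mid, dates in dates_by_mid.items():
        # date minus rank is identical for all dates in one consecutive run
        runs = Counter(d - timedelta(days=rank) for rank, d in enumerate(dates))
        if max(runs.values()) >= 3:
            result.add(mid)
    return result
```

In SQL the same trick is usually written with `rank() over (partition by mid_id order by dt)` and `date_sub(dt, rank)`, followed by a `group by` with `having count(*) >= 3`.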
4. Member topic
(1) Member information
User freshness rate, activity rate, and paying rate
cast(sum(if(login_date_last='2020-05-19',1,0)) / count(*) * 100 as decimal(10,2)) day_users2users
(2) Conversion rates
Visit to cart, cart to order, order to payment
cast( sum(if(payment_count>0,1,0)) / sum(if(order_count>0,1,0)) * 100 as decimal(10,2))
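As a quick illustration of the funnel, the sketch below computes the three ratios from per-user daily counts. It assumes every input row is a visitor and that rows carry `cart_count`, `order_count`, and `payment_count`; the function name is hypothetical. An empty denominator yields `None`, in the same spirit as Hive returning NULL for division by zero.

```python
def conversion_rates(users):
    """users: list of dicts with cart_count, order_count, payment_count.

    Returns (visit_to_cart, cart_to_order, order_to_payment) in percent.
    """
    def pct(num, den):
        return round(num / den * 100, 2) if den else None

    visitors = len(users)  # assumption: one row per visiting user
    carted = sum(1 for u in users if u["cart_count"] > 0)
    ordered = sum(1 for u in users if u["order_count"] > 0)
    paid = sum(1 for u in users if u["payment_count"] > 0)
    return pct(carted, visitors), pct(ordered, carted), pct(paid, ordered)
```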
5. Product topic
(1) Product counts: the number of SKUs and SPUs
insert into table ads_product_info SELECT '2020-05-19' dt, count(*) sku_num, count(DISTINCT spu_id) spu_num from dwt_sku_topic
(2) Cumulative sales ranking
FROM dwt_sku_topic where payment_num>0 order by payment_num desc limit 10
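(3) Favorites ranking
(4) Add-to-cart ranking
(5) Refund rate over the last 30 days
(6) Negative-review rate ranking
6. Marketing topic
(1) Daily order statistics: ads_order_daycount
(2) Daily payment statistics: ads_payment_daycount
(3) Monthly brand repurchase rate: ads_sale_tm_category1_stat_mn
Single and multiple repurchase rates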
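The notes do not spell out the two definitions, so the sketch below assumes a common reading: the single repurchase rate is the share of a brand's buyers in the month who ordered at least twice, and the multiple repurchase rate is the share who ordered at least three times. Names and input shape are illustrative.

```python
def repurchase_rates(order_counts):
    """order_counts: {user_id: number of orders for one brand in the month}.

    Returns (single_rate, multiple_rate) in percent, or None with no buyers.
    Assumed definitions: >= 2 orders and >= 3 orders, each over all buyers.
    """
    def pct(num, den):
        return round(num / den * 100, 2) if den else None

    buyers = sum(1 for c in order_counts.values() if c >= 1)
    twice_or_more = sum(1 for c in order_counts.values() if c >= 2)
    three_or_more = sum(1 for c in order_counts.values() if c >= 3)
    return pct(twice_or_more, buyers), pct(three_or_more, buyers)
```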
IV. Summary
1. Data sources

2. Data sources and loading for each layer
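| Layer | Modeling | How data is loaded | Notes |
|---|---|---|---|
| HDFS | Stored in LZO-compressed format | | |
| ODS | Modeled to mirror the source data: the same fields with the same types | ODS tables must declare an input format that can read LZO-compressed files, and the LZO files must be indexed | |
| User-behavior DWD | Modeled from the field details of each type of user-behavior data | Start log: get_json_object. Event log: a custom UDF and UDTF parse each event into a base_event table, then get_json_object expands the event details | |
| Business-data DWD | Dimension tables: dimension degeneration, merging the fields of several dimensions of the same type into one table. Fact tables: star schema, based on the 3W principle, modeled in the order select the business line, declare the grain, select the dimensions, select the measures | Dimension tables: multi-table join. Fact tables: pick one fact table as the main table, join the dimension tables through foreign keys, select the dimension fields, then select the measures | |
| | | Transactional fact tables: take one day's partition from the ODS layer, join the dimension tables, select the dimension fields, then select the measures | |
| | | Periodic snapshot fact tables: loaded in full directly from the ODS layer (cart and favorites tables) | |
| | | Accumulating snapshot fact tables: partitioned by the date of the earliest event of the fact. ① Select all data in the old partitions to be overwritten ② select today's new and changed data ③ join old and new, letting new values replace old ones ④ overwrite the affected partitions | |
| | | Zipper tables (slowly changing dimensions): old left join new, set the end_date of expired rows in old to the day before the start_date in new, then union all new. Load into a temporary table, then into the original table | |
| DWS | Stays close to the requirements: similar requirements are grouped and classified, and a wide table is built per subject, where the subject is the statistical target (device, user, product, coupon, activity, purchase behavior) | Take each day's latest DWD partition and join multiple tables | |
| DWT | Stays close to the requirements: similar requirements are grouped and classified, and a wide table is built per subject, where the subject is the statistical target (device, user, product, coupon, activity, purchase behavior) | dwt full join the current day's dws partition: ① join old and new, letting new values replace old ones ② overwrite the original table | |
| ADS | Stays close to the requirements: similar requirements are grouped and classified by subject (user, product, member, marketing) | For a historical slice of a given day, read from the DWS layer; for current data or cumulative state, read from the DWT layer | |
| Export to MySQL | | update_mode: allowinsert, update-key: dt | |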
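
The zipper-table row in the table above compresses several steps into one sentence. The Python sketch below walks through them on in-memory rows. It assumes an open-ended row is marked with an `end_date` of 9999-12-31, a widespread convention that the notes themselves do not state, and the function and field names are illustrative.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)  # assumed sentinel for "still current"

def apply_zipper(history, changes, today):
    """history: list of dicts with id, value, start_date, end_date.
    changes: {id: new value} for rows that are new or changed today.
    Returns the new table contents without mutating history.
    """
    closed = []
    for row in history:
        if row["end_date"] == OPEN_END and row["id"] in changes:
            # old left join new: expire the current version the day before
            row = {**row, "end_date": today - timedelta(days=1)}
        closed.append(row)
    fresh = [
        {"id": k, "value": v, "start_date": today, "end_date": OPEN_END}
        for k, v in changes.items()
    ]
    return closed + fresh  # union all
```

Writing the result to a temporary table first, as the notes describe, avoids reading and overwriting the same table in one statement.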
