[Popular Enterprise Data Warehouse] Day02: DWS Layer (Daily-Partitioned Wide Tables), DWT Layer (Cumulative Full Tables), ADS Layer, and Summary


I. DWS Layer

1. Overview

Data from the DWD layer is lightly aggregated each day and stored in wide tables.

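Table                        Grain                                 Scope
dws_uv_detail_daycount       one row per device
dws_user_action_daycount     one row per member                    only members who logged in today
dws_sku_action_daycount      one row per SKU                       only SKUs that were ordered, reviewed, paid for, added to cart, or favorited
dws_coupon_use_daycount      one row per coupon                    only coupons that have not expired
dws_activity_info_daycount   one row per activity                  all activities
dws_sale_detail_daycount     one row per user and SKU purchased    daily purchase data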

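2. dws_uv_detail_daycount (daily device activity): a device can produce several start records in one day, so the distinct values of each column are concatenated into a single string

(1) Create the table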

create external table dws_uv_detail_daycount
(
    -- the following fields come from the start log table dwd_start_log
    `mid_id` string COMMENT 'unique device id',
    `user_id` string COMMENT 'user id',
    `version_code` string COMMENT 'app version code', 
    `version_name` string COMMENT 'app version name', 
    `lang` string COMMENT 'system language', 
    `source` string COMMENT 'channel id', 
    `os` string COMMENT 'Android OS version', 
    `area` string COMMENT 'region', 
    `model` string COMMENT 'phone model', 
    `brand` string COMMENT 'phone brand', 
    `sdk_version` string COMMENT 'sdkVersion', 
    `gmail` string COMMENT 'gmail', 
    `height_width` string COMMENT 'screen width and height',
    `app_time` string COMMENT 'time the client log was generated',
    `network` string COMMENT 'network mode',
    `lng` string COMMENT 'longitude',
    `lat` string COMMENT 'latitude',
    -- computed from dwd_start_log: group by mid_id, then count(*)
    `login_count` bigint COMMENT 'number of active sessions'
)
-- the load writes to PARTITION(dt=...), so the table must be partitioned by dt
partitioned by (`dt` string)

(2) Load the data

insert overwrite table dws_uv_detail_daycount PARTITION(dt='2020-05-06')
select 
    mid_id,
    concat_ws('|',collect_set(user_id)),
    concat_ws('|',collect_set(version_code)),
    concat_ws('|',collect_set(version_name)),
    concat_ws('|',collect_set(lang)),
    concat_ws('|',collect_set(source)),
    concat_ws('|',collect_set(os)),
    concat_ws('|',collect_set(area)),
    concat_ws('|',collect_set(model)),
    concat_ws('|',collect_set(brand)),
    concat_ws('|',collect_set(sdk_version)),
    concat_ws('|',collect_set(gmail)),
    concat_ws('|',collect_set(height_width)),
    concat_ws('|',collect_set(app_time)),
    concat_ws('|',collect_set(network)),
    concat_ws('|',collect_set(lng)),
    concat_ws('|',collect_set(lat)),
    count(*)
FROM dwd_start_log where dt='2020-05-06'
GROUP by mid_id

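3. dws_user_action_daycount (daily member activity)

Per-user counts and amounts for logins, cart additions, orders, and payments.

The load uses a WITH clause, which works much like defining temporary tables or views.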

with
t1 as (select user_id,count(*) login_count from dwd_start_log where dt='2020-05-06' and user_id is not NULL GROUP BY user_id),
t3 as (select user_id,count(*) order_count,sum(final_total_amount) order_amount from dwd_fact_order_info where dt='2020-05-06' GROUP by user_id),
t4 as (select user_id,count(*) payment_count,sum(payment_amount) payment_amount from dwd_fact_payment_info where dt='2020-05-06' GROUP by user_id),
t2 as (select user_id,count(*) cart_count,sum(cart_price*sku_num) cart_amount from dwd_fact_cart_info where dt='2020-05-06' and date_format(create_time,'yyyy-MM-dd')='2020-05-06' GROUP by user_id)
insert overwrite TABLE dws_user_action_daycount PARTITION(dt='2020-05-06')
select t1.user_id,login_count,
    nvl(cart_count,0),nvl(cart_amount,0),
    nvl(order_count,0),nvl(order_amount,0),
    nvl(payment_count,0),nvl(payment_amount,0)
from t1
left join t2 on t1.user_id=t2.user_id
left join t3 on t1.user_id=t3.user_id
left join t4 on t1.user_id=t4.user_id

4. dws_sku_action_daycount (daily SKU activity)

Per-SKU counts of orders, payments, refunds, cart additions, favorites, and good, neutral, and bad reviews.

with 
t1 as
(select  sku_id,count(*) order_count,sum(sku_num) order_num,
sum(total_amount) order_amount
from dwd_fact_order_detail where dt='2020-05-06'
GROUP by sku_id),
t2 as
(select 
    sku_id,
    sum(sku_num)  payment_num,sum(total_amount) payment_amount,
    count(*) payment_count
from
(SELECT order_id,sku_id,sku_num,total_amount from dwd_fact_order_detail 
where dt='2020-05-06' or dt=date_sub('2020-05-06',1)) tmp1
 join
(select order_id from dwd_fact_payment_info where dt='2020-05-06') tmp2
on tmp1.order_id=tmp2.order_id
GROUP by sku_id),
t3 as 
(SELECT sku_id,
        count(*) refund_count,sum(refund_num) refund_num,
        sum(refund_amount) refund_amount
from dwd_fact_order_refund_info where dt='2020-05-06'
GROUP by sku_id),
t4 as
(select sku_id,
        count(*) cart_count,sum(sku_num) cart_num
from dwd_fact_cart_info where dt='2020-05-06' and sku_num>0
GROUP by sku_id),
t5 as
(SELECT sku_id,
        count(*) favor_count      
from dwd_fact_favor_info where dt='2020-05-06' and is_cancel=0
group by sku_id),
t6 as
(SELECT sku_id,
sum(if(appraise='1201',1,0)) appraise_good_count,
sum(if(appraise='1202',1,0)) appraise_mid_count,
sum(if(appraise='1203',1,0)) appraise_bad_count,
sum(if(appraise='1204',1,0)) appraise_default_count
from  dwd_fact_comment_info where dt='2020-05-06'
group by sku_id)
insert overwrite table dws_sku_action_daycount partition(dt='2020-05-06')
SELECT
nvl(nvl(nvl(nvl(nvl(t1.sku_id,t2.sku_id),t3.sku_id),t4.sku_id),t5.sku_id),t6.sku_id),
    nvl(order_count,0), 
    nvl(order_num,0),
    nvl(order_amount,0),
    nvl(payment_count,0),
    nvl(payment_num,0),
    nvl(payment_amount,0),
    nvl(refund_count,0),
    nvl(refund_num,0),
    nvl(refund_amount,0),
    nvl(cart_count,0),
    nvl(cart_num,0),
    nvl(favor_count,0),
    nvl(appraise_good_count,0),
    nvl(appraise_mid_count,0),
    nvl(appraise_bad_count,0),
    nvl(appraise_default_count,0)
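from t1 
-- join on the coalesced key so a SKU missing from t1 still lands on a single row
full join t2 on t1.sku_id=t2.sku_id
full join t3 on coalesce(t1.sku_id,t2.sku_id)=t3.sku_id
full join t4 on coalesce(t1.sku_id,t2.sku_id,t3.sku_id)=t4.sku_id
full join t5 on coalesce(t1.sku_id,t2.sku_id,t3.sku_id,t4.sku_id)=t5.sku_id
full join t6 on coalesce(t1.sku_id,t2.sku_id,t3.sku_id,t4.sku_id,t5.sku_id)=t6.sku_id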

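5. dws_coupon_use_daycount (daily coupon usage)

Coupon scope, SPU id, brand, category, and the day's counts of coupons claimed, used in orders, and used in payments.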

insert overwrite table dws_coupon_use_daycount PARTITION(dt='2020-05-06')
select 
  t1.id coupon_id,coupon_name, coupon_type, condition_amount, 
condition_num, activity_id, benefit_amount, benefit_discount,
create_time, range_type, spu_id, tm_id, category3_id, limit_num,
    get_count,using_count, used_count
from
(SELECT *
from dwd_dim_coupon_info 
where dt='2020-05-06' and nvl(expire_time,'9999-99-99') >'2020-05-06') t1
left join
(select coupon_id,
sum(if(date_format(get_time,'yyyy-MM-dd')='2020-05-06',1,0)) get_count,
sum(if(date_format(using_time,'yyyy-MM-dd')='2020-05-06',1,0)) using_count,
sum(if(date_format(used_time,'yyyy-MM-dd')='2020-05-06',1,0)) used_count 
from dwd_fact_coupon_use
GROUP by coupon_id) t2
on t1.id=t2.coupon_id

6. dws_activity_info_daycount (daily activity statistics)

Activity type, start and end times, and the day's order and payment counts.

with t1 as
 (select 
    id,activity_name,activity_type,
    start_time,end_time,create_time  
 from dwd_dim_activity_info 
 where dt='2020-05-06'
 GROUP by id,activity_name,activity_type,
    start_time,end_time,create_time),
 t2 as 
 (select 
    activity_id,count(*) order_count
 from dwd_fact_order_info 
 where dt='2020-05-06'
 GROUP by activity_id),
  t5 as 
 (SELECT
    activity_id,count(*) payment_count
 from
 (SELECT order_id,id from dwd_fact_payment_info where dt='2020-05-06') t3
  join
 (SELECT id,activity_id from dwd_fact_order_info WHERE dt='2020-05-06' or dt=date_sub('2020-05-06',1)) t4
  on   t3.order_id=t4.id
  GROUP by activity_id)
  insert overwrite table dws_activity_info_daycount partition(dt='2020-05-06')
  SELECT
    t1.id,activity_name, activity_type, 
  start_time, end_time, create_time, 
  nvl(order_count,0),
  nvl(payment_count,0)
  from t1 
  left join t2 on t1.id=t2.activity_id
  left join t5 on t1.id=t5.activity_id

7. dws_sale_detail_daycount (daily purchase details per user and SKU)

User attributes, SKU attributes, quantity purchased, order count, and order amount.

GROUP by user_id,sku_id)
insert overwrite table dws_sale_detail_daycount PARTITION(dt='2020-05-06')
SELECT
    t7.user_id, t7.sku_id, user_gender, user_age, 
    user_level, order_price, sku_name, sku_tm_id,
    sku_category3_id, sku_category2_id, sku_category1_id, 
    sku_category3_name, sku_category2_name, 
    sku_category1_name, spu_id, sku_num, order_count,
    order_amount
from
(select 
    nvl(t3.user_id,t4.user_id) user_id,
    nvl(t3.sku_id,t4.sku_id) sku_id,
    nvl(order_count,0) order_count, 
    nvl(order_amount,0) order_amount,
    nvl(sku_num,0) sku_num 
FROM t3 full join t4 on 
t3.user_id=t4.user_id and t3.sku_id=t4.sku_id) t7
 join t1 on t7.user_id=t1.user_id 
 join t2 on t7.sku_id=t2.sku_id

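II. DWT Layer: combining the current table with the day's DWS data (merge and update)

1. Overview

The DWT layer accumulates the data that the DWS layer aggregates each day.

DWT tables are not partitioned; each one is a cumulative full table.

Loading a cumulative full table takes four steps: ① query the old data that may change; ② query the new and changed data; ③ join old and new, letting new values replace old ones; ④ overwrite the table with the result.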
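The four steps amount to a full outer join between the stored table and the day's rows, keyed by id. A minimal Python sketch of that merge, assuming in-memory dicts in place of Hive tables. The function name merge_cumulative and the dict layout are illustrative and not part of the original scripts; the field names follow the dwt_uv_topic load in this section.

```python
def merge_cumulative(old, new, today):
    """Merge the stored cumulative rows with today's rows.

    old: {mid_id: {"login_date_first", "login_date_last", "login_count"}}
    new: {mid_id: login count for today}, only devices active today
    """
    merged = {}
    for mid_id in old.keys() | new.keys():        # full outer join on the key
        prev = old.get(mid_id)                    # None means a new device
        active_today = mid_id in new              # False means not active today
        merged[mid_id] = {
            # nvl(old.login_date_first, today)
            "login_date_first": prev["login_date_first"] if prev else today,
            # if(new.mid_id is null, old.login_date_last, today)
            "login_date_last": today if active_today else prev["login_date_last"],
            # nvl(new.login_count, 0)
            "login_day_count": new.get(mid_id, 0),
            # nvl(old.login_count, 0) + if(new.login_count is not null, 1, 0)
            "login_count": (prev["login_count"] if prev else 0) + (1 if active_today else 0),
        }
    return merged                                 # step 4: the result replaces the old table
```

A device present only in old keeps its dates and gets a day count of 0; a device present only in new starts with today as both its first and last date.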

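2. dwt_uv_topic

create external table dwt_uv_topic

Columns: user and device attributes, first active date, last active date, active count for the day, and cumulative active days.

In the full join, old is the existing dwt_uv_topic row and new is the day's dws_uv_detail_daycount row:

Existing device that did not log in today: new.mid_id is null

Existing device: old.mid_id is not null

New device: old.mid_id is null

Existing device that logged in today: new.mid_id is not null and old.mid_id is not null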

insert overwrite table gmall.dwt_uv_topic
select 
    nvl(old.mid_id,new.mid_id),
    concat_ws('|',old.user_id,new.user_id),
    concat_ws('|',old.version_code,new.version_code),
    concat_ws('|',old.version_name,new.version_name),
    concat_ws('|',old.lang,new.lang),
    concat_ws('|',old.source,new.source),
    concat_ws('|',old.os,new.os),
    concat_ws('|',old.area,new.area),
    concat_ws('|',old.model,new.model),
    concat_ws('|',old.brand,new.brand),
    concat_ws('|',old.sdk_version,new.sdk_version),
    concat_ws('|',old.gmail,new.gmail),
    concat_ws('|',old.height_width,new.height_width),
    concat_ws('|',old.app_time,new.app_time),
    concat_ws('|',old.network,new.network),
    concat_ws('|',old.lng,new.lng),
    concat_ws('|',old.lat,new.lat),
    nvl(old.login_date_first,'2020-05-06') login_date_first,
    IF(new.mid_id is null,old.login_date_last,'2020-05-06') login_date_last, 
    nvl(new.login_count,0) login_day_count, 
    nvl(old.login_count,0)+if(new.login_count is not null,1,0) login_count
from
dwt_uv_topic old
full join
(select * from dws_uv_detail_daycount where dt='2020-05-06') new
on old.mid_id=new.mid_id

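3. dwt_user_topic

First and last login, order, and payment dates; cumulative login days; and cumulative and last-30-day order and payment counts and amounts.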

insert overwrite table dwt_user_topic
SELECT
    t1.user_id,login_date_first, 
login_date_last, login_count, nvl(login_last_30d_count,0), 
order_date_first, order_date_last, order_count, order_amount,
nvl(order_last_30d_count,0), nvl(order_last_30d_amount,0), payment_date_first, 
payment_date_last, payment_count, payment_amount, nvl(payment_last_30d_count,0), 
nvl(payment_last_30d_amount,0)
from
 (SELECT
    nvl(old.user_id,new.user_id) user_id,
    nvl(old.login_date_first,'2020-05-06') login_date_first,
    nvl(old.order_date_first,if(new.order_count>0,'2020-05-06',null)) order_date_first,
    nvl(old.payment_date_first,if(new.payment_count>0,'2020-05-06',null))  payment_date_first,
    if(new.user_id is null,old.login_date_last,'2020-05-06') login_date_last,
    if(new.order_count>0,'2020-05-06',old.order_date_last) order_date_last,
    if(new.payment_count>0,'2020-05-06',old.payment_date_last) payment_date_last,
    nvl(old.login_count,0)+if(new.user_id is not null,1,0) login_count,
    nvl(old.order_count,0)+nvl(new.order_count,0) order_count,
    nvl(old.order_amount,0)+nvl(new.order_amount,0) order_amount,
    nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count,
    nvl(old.payment_amount,0)+nvl(new.payment_amount,0) payment_amount 
 from
 dwt_user_topic old
 full join (select * from dws_user_action_daycount where dt='2020-05-06') new
 on old.user_id=new.user_id) t1
 left join
 ( 
  SELECT
    user_id,
    sum(order_count) order_last_30d_count,
    sum(order_amount) order_last_30d_amount,
    sum(payment_count) payment_last_30d_count,
    sum(payment_amount) payment_last_30d_amount,
    count(*) login_last_30d_count
  FROM dws_user_action_daycount
  where dt BETWEEN date_sub('2020-05-06',29) and '2020-05-06'
  GROUP by user_id) t2
  on t1.user_id=t2.user_id

4. dwt_sku_topic

Last-30-day and cumulative counts of orders, payments, refunds, cart additions, favorites, and good, neutral, and bad reviews.

create external table dwt_sku_topic
(
    sku_id string comment 'sku_id',
    spu_id string comment 'spu_id',
    -- from dws_sku_action_daycount: where (today - 30 days) <= dt <= today, then sum()
    order_last_30d_count bigint comment 'times ordered in the last 30 days',
insert overwrite TABLE dwt_sku_topic
SELECT
    t2.sku_id, t2.spu_id, 
    nvl(order_last_30d_count,0), 

5. dwt_coupon_topic

Daily and cumulative counts of coupons claimed, used in orders, and used in payments.

insert overwrite table dwt_coupon_topic select 
    nvl(old.coupon_id,new.coupon_id) coupon_id,
    nvl(new.get_count,0) get_day_count,
    nvl(new.using_count,0) using_day_count,
    nvl(new.used_count,0) used_day_count,
    nvl(old.get_count,0)+nvl(new.get_count,0) get_count, 
    nvl(old.using_count,0)+nvl(new.using_count,0) using_count,
    nvl(old.used_count,0)+nvl(new.used_count,0) used_count
from dwt_coupon_topic old 
full join (select * from dws_coupon_use_daycount where dt='2020-05-06')new 
on old.coupon_id=new.coupon_id

6. dwt_activity_topic

Daily and cumulative order and payment counts per activity.

insert overwrite table dwt_activity_topic
select 
    nvl(old.id,new.id) id,
    nvl(old.activity_name,new.activity_name) activity_name,
    nvl(new.order_count,0) order_day_count,
    nvl(new.payment_count,0) payment_day_count,
    nvl(old.order_count,0)+nvl(new.order_count,0) order_count, 
    nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count
from dwt_activity_topic old 
full join (select * from dws_activity_info_daycount where dt='2020-05-06')new 
on old.id=new.id

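III. ADS Layer

1. Overview

Requirements are grouped by the data source they query.

Requirements of the same kind share one table for their statistics.

All ADS tables are full tables.

2. Generating test data

Set the cluster clock to the day before the date whose data is to be loaded.

Upload the jar package.

Start the collection pipeline, then start Hive.

Run the scripts.

3. Device topic

(1) Active devices (daily, weekly, monthly)

Data comes from dws_uv_detail_daycount or dwt_uv_topic.

Daily, weekly, and monthly active counts (with flag columns for week end and month end): a device counts if it was active at least once in the period.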

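create external table ads_uv_count( 
    `dt` string COMMENT 'statistics date',
    -- taken from the day's DWS data; can also be taken from the DWT layer
    `day_count` bigint COMMENT 'number of users for the day',
    -- taken from the week's DWS data; can also be taken from the DWT layer
    `wk_count`  bigint COMMENT 'number of users for the week',
    -- taken from the month's DWS data; can also be taken from the DWT layer
    `mn_count`  bigint COMMENT 'number of users for the month',
    -- computed with next_day()
    `is_weekend` string COMMENT 'Y/N: whether the date ends the week; marks the final result for the week',
    -- computed with last_day()
    `is_monthend` string COMMENT 'Y/N: whether the date ends the month; marks the final result for the month' 
) COMMENT 'active device counts'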
insert into table ads_uv_count
 SELECT
    '2020-05-06',day_count,wk_count,mn_count,
    if('2020-05-06'=date_sub(next_day('2020-05-06','MO'),1),'Y','N') is_weekend,
    if('2020-05-06'=last_day('2020-05-06'),'Y','N') is_monthend
 from
 (SELECT '2020-05-06' dt,count(*)  day_count
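The two flag expressions can be checked outside Hive. A small Python sketch, assuming the week ends on Sunday as the is_weekend expression implies. next_day and last_day here are stand-ins written for this sketch that imitate the Hive functions of the same name, and period_flags is an illustrative helper.

```python
from datetime import date, timedelta

def next_day(d, weekday):
    """First date strictly after d that falls on weekday (Monday=0 ... Sunday=6)."""
    return d + timedelta(days=(weekday - d.weekday() - 1) % 7 + 1)

def last_day(d):
    """Last day of the month that d belongs to."""
    # day 28 plus 4 days always lands in the following month
    first_of_next_month = (d.replace(day=28) + timedelta(days=4)).replace(day=1)
    return first_of_next_month - timedelta(days=1)

def period_flags(d):
    """Return (is_weekend, is_monthend) as 'Y'/'N', mirroring the two if() expressions."""
    # if(d = date_sub(next_day(d,'MO'),1),'Y','N')
    is_weekend = "Y" if d == next_day(d, 0) - timedelta(days=1) else "N"
    # if(d = last_day(d),'Y','N')
    is_monthend = "Y" if d == last_day(d) else "N"
    return is_weekend, is_monthend
```

For 2020-05-06, a Wednesday in mid-month, both flags are 'N'; for 2020-05-31, a Sunday and the last day of May, both are 'Y'.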

(2) New devices per day: login_date_first = today

insert into ads_new_mid_count
SELECT 
    '2020-05-06' create_date, 
    count(*) new_mid_count
FROM dwt_uv_topic where login_date_first='2020-05-06';

(3) Silent users

Started only on the day of installation: login_date_first = login_date_last

Last start was more than 7 days ago: login_date_last < date_sub(today, 7)

insert into table ads_silent_count
SELECT
    '2020-05-06',
    count(*)
from dwt_uv_topic where login_date_first=login_date_last
      and
      login_date_last<date_sub('2020-05-06',7)

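(4) Returning users this week

The number of existing users who logged in this week but did not log in last week.

Take the set active this week and the set active last week, left join them, and keep the difference (active this week but not last week):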

on t1.mid_id=t2.mid_id
where t2.mid_id is null

login_date_last >= date_sub(next_day('2020-05-06','MO'),7)
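The left join followed by the is-null filter is a set difference. A minimal Python sketch under that reading, assuming weeks run Monday to Sunday and that "existing" means first active before this week's Monday. The function names and the dict layouts standing in for dwt_uv_topic and dws_uv_detail_daycount are illustrative only.

```python
from datetime import date, timedelta

def week_start(d):
    """Monday of d's week, the same date as date_sub(next_day(d,'MO'),7) in Hive."""
    return d - timedelta(days=d.weekday())

def returning_devices(topic, daily_active, today):
    """topic: {mid_id: (login_date_first, login_date_last)}
    daily_active: {date: set of mid_id active on that date}"""
    this_monday = week_start(today)
    last_monday = this_monday - timedelta(days=7)
    # existing devices active this week (first seen before this week)
    active_this_week = {
        mid for mid, (first, last) in topic.items()
        if last >= this_monday and first < this_monday
    }
    # every device active on any day of last week
    active_last_week = set()
    for dt, mids in daily_active.items():
        if last_monday <= dt < this_monday:
            active_last_week |= mids
    # left join ... where t2.mid_id is null
    return active_this_week - active_last_week
```

The returned set holds the returning devices; its size is the figure the report needs.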

(5) Churned users: devices that have been inactive for 7 consecutive days

login_date_last<date_sub('2020-05-06',7)

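(6) Retention rate: retained users as a share of the users added on a given day

Users added on a given day who are still using the app n days later are called retained users.

① Count the users added on a given day.

② Fix the retention span in days: retention date = date added + retention days.

③ Count how many of those users are active on the retention date.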
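The three steps can be sketched in Python as follows, assuming one dict of first-active dates and one dict of daily active sets in place of the Hive tables. The function name retention_rate and both dict layouts are illustrative and not taken from the original scripts.

```python
from datetime import date, timedelta

def retention_rate(first_login, daily_active, create_date, retention_days):
    """first_login: {mid_id: date first active}
    daily_active: {date: set of mid_id active on that date}
    Returns (new_count, retained_count, ratio in percent)."""
    # step 1: users added on create_date
    new_users = {mid for mid, d in first_login.items() if d == create_date}
    # step 2: retention date = date added + retention days
    retention_date = create_date + timedelta(days=retention_days)
    # step 3: of those users, the ones active on the retention date
    retained = new_users & daily_active.get(retention_date, set())
    ratio = round(len(retained) / len(new_users) * 100, 2) if new_users else 0.0
    return len(new_users), len(retained), ratio
```

Returning 0.0 when no users were added on the date is a choice made for this sketch; a report could equally store null for that case.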

(7) Users active in each of the last three consecutive weeks

A user must appear at least once in each of the three weeks.
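One way to express the rule is to label each active day with its week and keep the users who collect all three labels. A Python sketch, assuming weeks run Monday to Sunday and that the current week counts as the third week. The function name and the daily_active layout are illustrative.

```python
from datetime import date, timedelta

def active_three_weeks(daily_active, today):
    """daily_active: {date: set of mid_id active on that date}.
    Returns the ids active at least once in each of the last three weeks."""
    this_monday = today - timedelta(days=today.weekday())
    weeks_seen = {}
    for dt, mids in daily_active.items():
        if dt > today:
            continue
        monday_of_dt = dt - timedelta(days=dt.weekday())
        # 0 = this week, 1 = last week, 2 = two weeks ago
        week = (this_monday - monday_of_dt).days // 7
        if week <= 2:
            for mid in mids:
                weeks_seen.setdefault(mid, set()).add(week)
    # a user qualifies only with all three week labels
    return {mid for mid, weeks in weeks_seen.items() if len(weeks) == 3}
```

In SQL the same idea is usually a union of three weekly distinct sets, grouped by id and filtered with having count(*) = 3.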

(8) Users active on three consecutive days within the last seven days
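A common way to find consecutive days is to subtract each date's rank from the date: within an unbroken run the result stays constant. The original notes do not show their query, so the Python sketch below is only one possible approach, with an illustrative function name and daily_active layout.

```python
from datetime import date, timedelta

def active_three_consecutive_days(daily_active, today):
    """daily_active: {date: set of mid_id active on that date}.
    Returns ids active on at least 3 consecutive days in the 7 days ending today."""
    window_start = today - timedelta(days=6)
    days_by_device = {}
    for dt, mids in daily_active.items():
        if window_start <= dt <= today:
            for mid in mids:
                days_by_device.setdefault(mid, []).append(dt)
    result = set()
    for mid, days in days_by_device.items():
        run_lengths = {}
        for rank, dt in enumerate(sorted(days)):
            anchor = dt - timedelta(days=rank)   # constant inside a consecutive run
            run_lengths[anchor] = run_lengths.get(anchor, 0) + 1
        if max(run_lengths.values()) >= 3:
            result.add(mid)
    return result
```

The Hive equivalent of the anchor would be date_sub(dt, rank) with rank() taken over each device's dates, followed by a group by on the device and the anchor.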

4. Member topic

(1) Member information

Freshness rate (share of new members), active rate, and paying rate.

cast(sum(if(login_date_last='2020-05-19',1,0)) / count(*) * 100 as decimal(10,2)) day_users2users

(2) Conversion rates

Visit to cart, cart to order, and order to payment.

cast( sum(if(payment_count>0,1,0)) / sum(if(order_count>0,1,0)) * 100 as decimal(10,2))
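Each rate divides the number of users who reached a step by the number who reached the step before it. A Python sketch of the three ratios, assuming per-user daily rows with cart_count, order_count, and payment_count fields and a separately supplied visitor count. The function names and result keys are illustrative.

```python
def pct(numerator, denominator):
    """Percentage rounded to 2 decimals; None when the denominator is 0,
    which matches Hive returning NULL on division by zero."""
    return round(numerator / denominator * 100, 2) if denominator else None

def conversion_rates(visitors, rows):
    """visitors: number of visiting users for the day.
    rows: per-user dicts with cart_count, order_count, payment_count."""
    carted = sum(1 for r in rows if r["cart_count"] > 0)      # sum(if(cart_count>0,1,0))
    ordered = sum(1 for r in rows if r["order_count"] > 0)    # sum(if(order_count>0,1,0))
    paid = sum(1 for r in rows if r["payment_count"] > 0)     # sum(if(payment_count>0,1,0))
    return {
        "visit_to_cart": pct(carted, visitors),
        "cart_to_order": pct(ordered, carted),
        "order_to_payment": pct(paid, ordered),
    }
```

The counts are users, not events: a user with five orders still adds 1 to the order step.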

5. Product topic

(1) Product counts: the number of SKUs and SPUs

insert into table ads_product_info
SELECT
    '2020-05-19' dt,
    count(*) sku_num,
    count(DISTINCT spu_id) spu_num
from dwt_sku_topic

(2) Cumulative sales ranking

FROM dwt_sku_topic
where payment_num>0
order by payment_num desc
limit 10

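(3) Favorites ranking

(4) Add-to-cart ranking

(5) Refund rate over the last 30 days

(6) Bad-review rate ranking

6. Marketing topic

(1) Daily order statistics: ads_order_daycount

(2) Daily payment statistics: ads_payment_daycount

(3) Monthly repurchase rate by brand: ads_sale_tm_category1_stat_mn

Single-repurchase and multiple-repurchase rates.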

IV. Summary

1. Data sources

2. Data sources and loading for each layer

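Each entry lists the data source (layer), how it is modeled, how its data is loaded, and notes.

HDFS
    Notes: files are stored in LZO-compressed format.

ODS
    Modeling: mirror the source data, with the same fields and the same types.
    Loading: ODS tables must declare an input format that can read LZO-compressed files, and the LZO files need an index.

User-behavior DWD
    Modeling: each type of user-behavior data is modeled according to its own detailed fields.
    Loading: start log: get_json_object. Event log: a custom UDF and UDTF parse every event in the log into a base_event table, then get_json_object expands the event details.

Business-data DWD
    Modeling: dimension tables: dimension degeneration, merging the fields of several dimensions of the same kind into one table. Fact tables: star schema, based on the 3W principle, modeled in the order business process, grain, dimensions, measures.
    Loading: dimension tables: multi-table join. Fact tables: pick one fact table as the main table, join the dimension tables through foreign keys, select the dimension fields, then select the measures.
        Transactional fact tables: take one day's partition from the ODS layer, join the dimension tables, select the dimension fields, then select the measures.
        Periodic snapshot fact tables: full load directly from the ODS layer (cart and favorites tables).
        Accumulating snapshot fact tables: partition by the date of the earliest event of the fact. ① Select all data in the old partitions to be overwritten; ② select today's new and changed data; ③ merge old and new, letting new replace old; ④ overwrite the affected partitions.
        Zipper tables (slowly changing dimensions): old left join new, setting the end_date of expired rows in old to the day before the start_date in new; then union all new. Load into a temporary table, then into the original table.

DWS
    Modeling: follow the requirements closely. Group requirements of the same kind by subject (device, user, product, coupon, activity, purchase behavior) and build one wide table per subject.
    Loading: take the latest daily partition of the DWD layer and join the tables involved.

DWT
    Modeling: follow the requirements closely. Group requirements of the same kind by subject (device, user, product, coupon, activity, purchase behavior) and build one wide table per subject.
    Loading: dwt full join the day's DWS partition. ① Merge old and new, letting new replace old; ② overwrite the original table.

ADS
    Modeling: follow the requirements closely. Group requirements of the same kind by subject (user, product, member, marketing).
    Loading: for a historical slice of one day, read from the DWS layer; for current data or cumulative state, read from the DWT layer.

Export to MySQL
    Loading: update_mode: allowinsert; update-key: dt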

