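I. Data Warehouse
A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decisions.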
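- Subject-oriented: A traditional database is organized around transaction processing, whereas a data warehouse is organized around a particular business domain. A subject is a closely related set of data that users care about.
- Integrated: The data in a warehouse comes from separate business-system databases, external data, and unstructured data, and is consolidated into one integrated store.
- Relatively stable: Data in the warehouse should not be modified through ad hoc DML operations; it is processed in batches instead.
- Reflects history: The warehouse keeps every historical version of the data.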
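This article introduces one method a data warehouse can use to keep historical versions of data: the zipper table.
First, a brief overview of the architecture used in our data warehouse. It consists of a source layer, a detail layer, a summary layer, a data mart layer, a report layer, and a dimension layer: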
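- Source layer: Data collected from each business system is stored here first. The key decision is how the source data is extracted: incrementally or in full. A well-designed business system should support incremental extraction (a question to think about: what requirements must incremental extraction satisfy?), because it reduces the load that extraction places on both the warehouse and the business systems.
- Detail layer: This layer stores data in normalized form, and its input comes mainly from the source layer. Its goals are the basic tasks a warehouse has to handle: designing subject-oriented storage structures, integrating data from different business sources, unifying coding standards, and retaining historical data (zipper tables are mainly designed and implemented in this layer).
- Summary layer: Metrics that need aggregating are calculated from the integrated detail-layer data according to business definitions. The normalized detail tables are also joined into small wide tables as a first step of denormalization, which makes later processing easier.
- Data mart layer: Star schemas are designed for each group of consumers using dimensional modeling, so that reports and routine data pulls can be produced easily once this layer is complete. Some warehouse designers take a different approach and build large wide tables in the mart layer instead of star schemas, mainly on the grounds that wide tables are easier for people to use.
- Report layer: Produces reports according to the needs of each department.
- Dimension layer: Stores the warehouse's dimension tables in one place.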
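Data warehouse design currently falls into two main camps, the Kimball and Inmon architectures; this article does not describe either method in detail. In the projects I have worked on, applying either architecture in its extreme form gave the warehouse project a low chance of success. I therefore suggest combining the strengths of both: build the detail layer in third normal form and design the mart layer with star schemas. Combined sensibly, the two approaches avoid much of the tedious work that a business-driven design brings, as well as the later maintenance and extensibility problems that a requirements-driven design brings.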
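II. How a Zipper Table Works
The following made-up example illustrates how a zipper table works:
1. Suppose that on 2017-01-01 we initialize the user data in the data warehouse. We add start_date and end_date columns to the warehouse user table (customer) to mark the lifetime of each row, as follows: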
cus_id  job        start_date  end_date
---------------------------------------------
10001   oracle     2017-01-01  3000-12-31
10002   pgsql      2017-01-01  3000-12-31
10003   mysql      2017-01-01  3000-12-31
10004   java       2017-01-01  3000-12-31
10005   python     2017-01-01  3000-12-31
2. On 2017-01-02, user 10004 is deleted, users 10006 and 10007 are added, and the job of user 10003 changes from mysql to mongodb. The detail data becomes:
cus_id  job        start_date  end_date
---------------------------------------------
10001   oracle     2017-01-01  3000-12-31
10002   pgsql      2017-01-01  3000-12-31
10003   mysql      2017-01-01  2017-01-02
10003   mongodb    2017-01-02  3000-12-31
10004   java       2017-01-01  2017-01-02
10005   python     2017-01-01  3000-12-31
10006   docker     2017-01-02  3000-12-31
10007   redis      2017-01-02  3000-12-31
3. On 2017-01-03, user 10007 is deleted, the job of user 10006 changes from docker to openstack, the job of user 10003 changes from mongodb to hive, and user 10008 is added. The detail data becomes:
cus_id  job        start_date  end_date
---------------------------------------------
10001   oracle     2017-01-01  3000-12-31
10002   pgsql      2017-01-01  3000-12-31
10003   mysql      2017-01-01  2017-01-02
10003   mongodb    2017-01-02  2017-01-03
10003   hive       2017-01-03  3000-12-31
10004   java       2017-01-01  2017-01-02
10005   python     2017-01-01  3000-12-31
10006   docker     2017-01-02  2017-01-03
10006   openstack  2017-01-03  3000-12-31
10007   redis      2017-01-02  2017-01-03
10008   hadoop     2017-01-03  3000-12-31
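The three daily states above can be reproduced with a small sketch. This is only an illustration in plain Python, not the SQL implementation given later: the function name apply_changes, the OPEN_END constant, and the I/U/D operation codes (insert, update, delete) are assumptions introduced for the example. The rule it applies is the one the tables show: an update or delete closes the customer's open row by setting end_date to the change date, and an insert or update opens a new row ending at 3000-12-31.

```python
from typing import Dict, List, Tuple

OPEN_END = "3000-12-31"  # sentinel end_date that marks the current version

Row = Dict[str, str]
Change = Tuple[str, str, str]  # (cus_id, job, op), op is 'I', 'U' or 'D'


def apply_changes(table: List[Row], changes: List[Change], day: str) -> List[Row]:
    """Return a new zipper table with one day's changes applied."""
    closed_ids = {cus_id for cus_id, _, op in changes if op in ("U", "D")}
    result: List[Row] = []
    for row in table:
        row = dict(row)  # copy so the input table is not mutated
        # Close the open version of every updated or deleted customer.
        if row["end_date"] == OPEN_END and row["cus_id"] in closed_ids:
            row["end_date"] = day
        result.append(row)
    # Open a new version for every inserted or updated customer.
    for cus_id, job, op in changes:
        if op in ("I", "U"):
            result.append({"cus_id": cus_id, "job": job,
                           "start_date": day, "end_date": OPEN_END})
    return result


initial = [("10001", "oracle"), ("10002", "pgsql"), ("10003", "mysql"),
           ("10004", "java"), ("10005", "python")]
zipper = [{"cus_id": c, "job": j, "start_date": "2017-01-01", "end_date": OPEN_END}
          for c, j in initial]

zipper = apply_changes(zipper, [("10003", "mongodb", "U"), ("10004", "java", "D"),
                                ("10006", "docker", "I"), ("10007", "redis", "I")],
                       "2017-01-02")
zipper = apply_changes(zipper, [("10003", "hive", "U"), ("10007", "redis", "D"),
                                ("10006", "openstack", "U"), ("10008", "hadoop", "I")],
                       "2017-01-03")

for row in sorted(zipper, key=lambda r: (r["cus_id"], r["start_date"])):
    print(row["cus_id"], row["job"], row["start_date"], row["end_date"])
```

Running it prints the same 11 rows as the 2017-01-03 table, one line per version.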
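Analysis: taking user 10003 as an example, the timeline of changes to this user's data shows the following pattern:
2017-01-01 first registered, job is mysql;
2017-01-02 job changed, job becomes mongodb;
2017-01-03 job changed, job becomes hive.
On the timeline above, user 10003 has exactly one job at every point in time. From 2017-01-01 up to (but not including) 2017-01-02 the job is mysql, from 2017-01-02 up to 2017-01-03 it is mongodb, and from 2017-01-03 up to 3000-12-31 it is hive. Every record in the zipper table follows this rule, so each row is valid over the half-open interval [start_date, end_date). How, then, do we query a zipper table accurately?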
Querying the zipper table:
1. Current data:
select * from customer t where t.end_date = '3000-12-31';
2. Historical snapshot for 2017-01-01:
select * from customer t where t.start_date <= '2017-01-01' and t.end_date > '2017-01-01';
3. Historical snapshot for 2017-01-02:
select * from customer t where t.start_date <= '2017-01-02' and t.end_date > '2017-01-02';
4. Full history of user 10003:
select * from customer t where t.cus_id = '10003';
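The snapshot queries above differ only in the date literal, so they can be wrapped in one parameterized helper. The sketch below is an assumption-laden illustration: it uses Python's sqlite3 module with an in-memory database in place of Oracle, and the helper name snapshot is hypothetical. It also shows why storing dates as yyyy-mm-dd strings works here: strings in that format sort in chronological order, so plain string comparison gives the right answer.

```python
import sqlite3

OPEN_END = "3000-12-31"  # sentinel end_date that marks the current version

# State of the example table after the 2017-01-03 changes.
ROWS = [
    (10001, "oracle", "2017-01-01", OPEN_END),
    (10002, "pgsql", "2017-01-01", OPEN_END),
    (10003, "mysql", "2017-01-01", "2017-01-02"),
    (10003, "mongodb", "2017-01-02", "2017-01-03"),
    (10003, "hive", "2017-01-03", OPEN_END),
    (10004, "java", "2017-01-01", "2017-01-02"),
    (10005, "python", "2017-01-01", OPEN_END),
    (10006, "docker", "2017-01-02", "2017-01-03"),
    (10006, "openstack", "2017-01-03", OPEN_END),
    (10007, "redis", "2017-01-02", "2017-01-03"),
    (10008, "hadoop", "2017-01-03", OPEN_END),
]


def snapshot(conn, day):
    """Return (cus_id, job) for rows valid on `day`: start_date <= day < end_date."""
    sql = ("select cus_id, job from customer "
           "where start_date <= ? and end_date > ? order by cus_id")
    return conn.execute(sql, (day, day)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table customer "
             "(cus_id integer, job text, start_date text, end_date text)")
conn.executemany("insert into customer values (?, ?, ?, ?)", ROWS)

for day in ("2017-01-01", "2017-01-02", "2017-01-03"):
    print(day, snapshot(conn, day))
```

Because end_date is exclusive, a row closed on a given day no longer appears in that day's snapshot, which is why user 10004 is absent from the 2017-01-02 result.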
III. Implementing the Zipper Table
1. Prepare the data:
1) Initial data for 2017-01-01:
cus_id  job     start_date  end_date    dtype  dw_status  dw_ins_date
---------------------------------------------------------------------
10001   oracle  2017-01-01  3000-12-31  C      I          2017-01-01
10002   pgsql   2017-01-01  3000-12-31  C      I          2017-01-01
10003   mysql   2017-01-01  3000-12-31  C      I          2017-01-01
10004   java    2017-01-01  3000-12-31  C      I          2017-01-01
10005   python  2017-01-01  3000-12-31  C      I          2017-01-01
Here dtype marks a row as current (C) or historical (H), and dw_status records the operation that produced the row: I (insert), U (update), or D (delete).
2) Incremental data for 2017-01-02:
cus_id  job      dw_status  dw_ins_date
---------------------------------------
10003   mongodb  U          2017-01-02
10004   java     D          2017-01-02
10006   docker   I          2017-01-02
10007   redis    I          2017-01-02
3) Incremental data for 2017-01-03:
cus_id  job        dw_status  dw_ins_date
-----------------------------------------
10003   hive       U          2017-01-03
10007   redis      D          2017-01-03
10006   openstack  U          2017-01-03
10008   hadoop     I          2017-01-03
2. Load the data:
1) Initialize the customer table:
drop table customer;
create table customer(
cus_id int,
job varchar2(20),
start_date varchar2(10),
end_date varchar2(10),
dtype varchar2(1),
dw_status varchar2(1),
dw_ins_date varchar2(10)
)
partition by list(end_date)
(
partition cus_par20170101 values('2017-01-01') tablespace users,
partition cus_par20170102 values('2017-01-02') tablespace users,
partition cus_par20170103 values('2017-01-03') tablespace users,
partition cus_par30001231 values('3000-12-31') tablespace users
);
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10001,'oracle','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10002,'pgsql','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10003,'mysql','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10004,'java','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10005,'python','2017-01-01','3000-12-31','C','I','2017-01-01');
2) Initialize the incremental table for 2017-01-02:
create table customer_inc(
cus_id int,
job varchar2(20),
dw_status varchar2(1),
dw_ins_date varchar2(10)
);
truncate table customer_inc;
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10003,'mongodb','U','2017-01-02');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10004,'java','D','2017-01-02');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10006,'docker','I','2017-01-02');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10007,'redis','I','2017-01-02');
3) Create the intermediate table:
drop table customer_tmp0;
create table customer_tmp0(
cus_id int,
job varchar2(20),
start_date varchar2(10),
end_date varchar2(10),
dtype varchar2(1),
dw_status varchar2(1),
dw_ins_date varchar2(10)
)
partition by list(dtype)
(
partition cus_dtype_H values('H') tablespace users,
partition cus_dtype_C values('C') tablespace users
);
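3. Refresh customer from customer_inc (2017-01-02):
1) Join the latest (open) partition of customer with the update and delete rows in customer_inc, and process the rows in that partition that have changed: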
insert into customer_tmp0
select
t1.cus_id,
t1.job,
t1.start_date,
case when t2.cus_id is null then t1.end_date else '2017-01-02' end as end_date,
case when t2.cus_id is null then 'C' else 'H' end dtype,
case when t2.cus_id is null then t1.dw_status else t2.dw_status end dw_status,
case when t2.cus_id is null then t1.dw_ins_date else t2.dw_ins_date end as dw_ins_date
from customer t1 left join customer_inc t2 on t1.cus_id = t2.cus_id and t2.dw_status in ('D','U')
where t1.end_date = '3000-12-31'
order by cus_id asc
;
2) Insert the updated and newly inserted rows from customer_inc into the temporary table customer_tmp0:
insert into customer_tmp0
select
t1.cus_id,
t1.job,
'2017-01-02' as start_date,
'3000-12-31' as end_date,
'C' as dtype,
t1.dw_status,
'2017-01-03' as dw_ins_date
from customer_inc t1
where t1.dw_status in ('I','U')
;
3) Synchronize the temporary table into the customer fact table. This step could also be done with a partition exchange:
alter table customer truncate partition cus_par30001231;
insert into customer
select * from customer_tmp0;
4) Check the result:
SQL> select * from customer order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
---------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10003 mongodb 2017-01-02 3000-12-31 C U 2017-01-03
10004 java 2017-01-01 2017-01-02 H D 2017-01-02
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 docker 2017-01-02 3000-12-31 C I 2017-01-03
10007 redis 2017-01-02 3000-12-31 C I 2017-01-03
8 rows selected
SQL>
4. Refresh customer from customer_inc (2017-01-03):
1) Initialize the incremental table for 2017-01-03:
truncate table customer_inc;
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10003,'hive','U','2017-01-03');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10008,'hadoop','I','2017-01-03');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10006,'openstack','U','2017-01-03');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10007,'redis','D','2017-01-03');
2) Join the latest (open) partition of customer with the update and delete rows in customer_inc, and process the rows in that partition that have changed:
truncate table customer_tmp0;
insert into customer_tmp0
select
t1.cus_id,
t1.job,
t1.start_date,
case when t2.cus_id is null then t1.end_date else '2017-01-03' end as end_date,
case when t2.cus_id is null then 'C' else 'H' end dtype,
case when t2.cus_id is null then t1.dw_status else t2.dw_status end dw_status,
case when t2.cus_id is null then t1.dw_ins_date else t2.dw_ins_date end as dw_ins_date
from customer t1 left join customer_inc t2 on t1.cus_id = t2.cus_id and t2.dw_status in ('D','U')
where t1.end_date = '3000-12-31'
order by cus_id asc
;
3) Insert the updated and newly inserted rows from customer_inc into the temporary table customer_tmp0:
insert into customer_tmp0
select
t1.cus_id,
t1.job,
'2017-01-03' as start_date,
'3000-12-31' as end_date,
'C' as dtype,
t1.dw_status,
'2017-01-04' as dw_ins_date
from customer_inc t1
where t1.dw_status in ('I','U')
;
4) Synchronize the temporary table into the customer fact table. This step could also be done with a partition exchange:
alter table customer truncate partition cus_par30001231;
insert into customer
select * from customer_tmp0;
5) Check the result:
SQL> select * from customer order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
----------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mongodb 2017-01-02 2017-01-03 H U 2017-01-03
10003 hive 2017-01-03 3000-12-31 C U 2017-01-04
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10004 java 2017-01-01 2017-01-02 H D 2017-01-02
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 docker 2017-01-02 2017-01-03 H U 2017-01-03
10006 openstack 2017-01-03 3000-12-31 C U 2017-01-04
10007 redis 2017-01-02 2017-01-03 H D 2017-01-03
10008 hadoop 2017-01-03 3000-12-31 C I 2017-01-04
11 rows selected
SQL>
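The two refreshes above run the same three statements and differ only in their date literals, so the procedure can be written once with the dates as parameters. The following is a minimal sketch under stated assumptions, not the Oracle implementation: it uses Python's sqlite3 with an in-memory database, the function names refresh and load_inc are hypothetical, and because SQLite has no partitions, the "truncate partition cus_par30001231" step is replaced by deleting the rows whose end_date is 3000-12-31. The load_date parameter mirrors the dw_ins_date literals used in the statements above.

```python
import sqlite3

OPEN_END = "3000-12-31"
COLS = ("cus_id integer, job text, start_date text, end_date text, "
        "dtype text, dw_status text, dw_ins_date text")


def refresh(conn, biz_date, load_date):
    """Apply the rows in customer_inc to customer for one business date."""
    cur = conn.cursor()
    cur.execute("delete from customer_tmp0")
    # Step 1: copy the open rows, closing those matched by a U or D change.
    cur.execute("""
        insert into customer_tmp0
        select t1.cus_id, t1.job, t1.start_date,
               case when t2.cus_id is null then t1.end_date else ? end,
               case when t2.cus_id is null then 'C' else 'H' end,
               case when t2.cus_id is null then t1.dw_status else t2.dw_status end,
               case when t2.cus_id is null then t1.dw_ins_date else t2.dw_ins_date end
        from customer t1
        left join customer_inc t2
          on t1.cus_id = t2.cus_id and t2.dw_status in ('D', 'U')
        where t1.end_date = ?""", (biz_date, OPEN_END))
    # Step 2: open a new current row for every inserted or updated customer.
    cur.execute("""
        insert into customer_tmp0
        select cus_id, job, ?, ?, 'C', dw_status, ?
        from customer_inc
        where dw_status in ('I', 'U')""", (biz_date, OPEN_END, load_date))
    # Step 3: replace the open rows of customer with the rebuilt set.
    cur.execute("delete from customer where end_date = ?", (OPEN_END,))
    cur.execute("insert into customer select * from customer_tmp0")
    conn.commit()


def load_inc(conn, rows):
    """Replace the contents of customer_inc with one day's changes."""
    conn.execute("delete from customer_inc")
    conn.executemany("insert into customer_inc values (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute(f"create table customer ({COLS})")
conn.execute(f"create table customer_tmp0 ({COLS})")
conn.execute("create table customer_inc "
             "(cus_id integer, job text, dw_status text, dw_ins_date text)")
conn.executemany(
    "insert into customer values (?, ?, '2017-01-01', ?, 'C', 'I', '2017-01-01')",
    [(i, j, OPEN_END) for i, j in [(10001, "oracle"), (10002, "pgsql"),
                                   (10003, "mysql"), (10004, "java"),
                                   (10005, "python")]])

load_inc(conn, [(10003, "mongodb", "U", "2017-01-02"), (10004, "java", "D", "2017-01-02"),
                (10006, "docker", "I", "2017-01-02"), (10007, "redis", "I", "2017-01-02")])
refresh(conn, "2017-01-02", "2017-01-03")

load_inc(conn, [(10003, "hive", "U", "2017-01-03"), (10008, "hadoop", "I", "2017-01-03"),
                (10006, "openstack", "U", "2017-01-03"), (10007, "redis", "D", "2017-01-03")])
refresh(conn, "2017-01-03", "2017-01-04")

for row in conn.execute("select * from customer order by cus_id, start_date"):
    print(row)
```

Only the open rows are ever rewritten; closed rows stay where they are, which is what makes the partition-by-end_date layout in the Oracle version cheap to maintain.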
5. Query the zipper table:
1) Current data:
SQL> select * from customer where end_date = '3000-12-31' order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
--------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 hive 2017-01-03 3000-12-31 C U 2017-01-04
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 openstack 2017-01-03 3000-12-31 C U 2017-01-04
10008 hadoop 2017-01-03 3000-12-31 C I 2017-01-04
6 rows selected
SQL>
2) Historical snapshot for 2017-01-01:
SQL> select * from customer where start_date <= '2017-01-01' and end_date > '2017-01-01' order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
--------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10004 java 2017-01-01 2017-01-02 H D 2017-01-02
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
SQL>
3) Historical snapshot for 2017-01-02:
SQL> select * from customer where start_date <= '2017-01-02' and end_date > '2017-01-02' order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
---------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mongodb 2017-01-02 2017-01-03 H U 2017-01-03
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 docker 2017-01-02 2017-01-03 H U 2017-01-03
10007 redis 2017-01-02 2017-01-03 H D 2017-01-03
6 rows selected
SQL>
4) All data for user 10003:
SQL> select * from customer where cus_id = '10003';
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
---------- -------------------- ---------- ---------- ----- --------- -----------
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10003 mongodb 2017-01-02 2017-01-03 H U 2017-01-03
10003 hive 2017-01-03 3000-12-31 C U 2017-01-04
SQL>
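The queries above rely on the rule from section II that a customer has exactly one version at any point in time. A load that breaks this rule (for example, a refresh run twice for the same day) would make snapshots return duplicate rows, so a consistency check can be useful after each load. The sketch below is one possible check, written with Python's sqlite3 as a stand-in for Oracle; the function name find_overlaps and the deliberately bad 'spark' row are assumptions made for the illustration. Two half-open intervals overlap when each one starts before the other ends.

```python
import sqlite3

OPEN_END = "3000-12-31"


def find_overlaps(conn):
    """Return (cus_id, job_a, job_b) for each pair of versions of the same
    customer whose [start_date, end_date) intervals overlap."""
    sql = """
        select a.cus_id, a.job, b.job
        from customer a
        join customer b
          on a.cus_id = b.cus_id
         and a.rowid < b.rowid              -- report each pair once
         and a.start_date < b.end_date
         and b.start_date < a.end_date
        order by a.cus_id, a.start_date
    """
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table customer "
             "(cus_id integer, job text, start_date text, end_date text)")
conn.executemany("insert into customer values (?, ?, ?, ?)", [
    (10003, "mysql", "2017-01-01", "2017-01-02"),
    (10003, "mongodb", "2017-01-02", "2017-01-03"),
    (10003, "hive", "2017-01-03", OPEN_END),
])

clean = find_overlaps(conn)  # a well-formed history has no overlapping pairs
print(clean)

# Simulate a faulty load: a second open row starting on 2017-01-02.
conn.execute("insert into customer values (10003, 'spark', '2017-01-02', ?)",
             (OPEN_END,))
dirty = find_overlaps(conn)
print(dirty)
```

Adjacent versions that share a boundary date, such as mysql ending and mongodb starting on 2017-01-02, are not reported, because end_date is exclusive.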