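The previous post, Website Log Traffic Analysis System: Log Collection, covered collecting the data and landing it in HDFS. According to the architecture diagram in Website Log Traffic Analysis System, the next step is offline analysis: cleaning the data in HDFS, either with a hand-written MR program or with HQL. Because the cleaning logic is fairly simple, I chose Hive for the job (MR would work too). The cleaning and processing walkthrough is fairly long, so: be patient, please!
2. Server Planning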

3. Data Cleaning
All of the steps below are run on the hadoopalone host, which has Hadoop installed in pseudo-distributed mode.
(1) Enter the Hive command line and create the database logdb
hive> create database logdb;
(2) Create an external partitioned table to manage the data in HDFS
hive> use logdb;
hive> create external table logdemo
    > (url string,urlname string,title string,chset string,
    > scr string,col string,lg string,je string,ec string,
    > fv string,cn string,ref string,uagent string,
    > stat_uv string,stat_ss string,cip string)
    > partitioned by (reportTime string) row format delimited fields
    > terminated by '|' location '/logdemo';
(3) Add today's partition (the location clause ties the partition to its HDFS directory)
hive> alter table logdemo add partition(reportTime='2019-09-07') location '/logdemo/reportTime=2019-09-07';
(4) View the data
hive> select * from logdemo;

The data in HDFS is shown in the figure below:

(5) Create the cleaned-data table
hive> create table dataclear
    > (url string,urlname string,ref string,uagent string,
    > uvid string,ssid string,sscount string,sstime string,cip string)
    > partitioned by (reportTime string) row format delimited fields terminated by '|';
(6) Load the data from logdemo into the cleaned-data table (dataclear). The stat_ss field is split on '_' into ssid, sscount and sstime, and stat_uv becomes uvid.
hive> insert into dataclear partition(reportTime='2019-09-07')
    > select split(url,'-')[2],urlname,ref,uagent,stat_uv,split(stat_ss,'_')[0],
    > split(stat_ss,'_')[1],split(stat_ss,'_')[2],cip from logdemo
    > where reportTime = '2019-09-07';

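To make the split in step (6) concrete, here is a minimal Python sketch of the same row-level transformation, done outside Hive. The sample line and every value in it are made up for illustration, and returning None when url has fewer than three '-'-separated parts is a choice made for this sketch only.

```python
# Columns of the raw logdemo table, in file order.
RAW_COLUMNS = [
    "url", "urlname", "title", "chset", "scr", "col", "lg", "je", "ec",
    "fv", "cn", "ref", "uagent", "stat_uv", "stat_ss", "cip",
]


def clean_record(line):
    """Turn one '|'-delimited raw log line into the nine dataclear values."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(RAW_COLUMNS):
        raise ValueError(f"expected {len(RAW_COLUMNS)} fields, got {len(values)}")
    raw = dict(zip(RAW_COLUMNS, values))

    # Mirrors split(url,'-')[2]; None for short urls is this sketch's choice.
    url_parts = raw["url"].split("-")
    url = url_parts[2] if len(url_parts) > 2 else None

    # stat_ss packs three values joined by '_'.
    ss_parts = raw["stat_ss"].split("_")
    if len(ss_parts) != 3:
        raise ValueError(f"unexpected stat_ss value: {raw['stat_ss']!r}")

    # Same order as the dataclear columns (reportTime is the partition).
    return [url, raw["urlname"], raw["ref"], raw["uagent"], raw["stat_uv"],
            ss_parts[0], ss_parts[1], ss_parts[2], raw["cip"]]


# Hypothetical raw line with 16 fields.
SAMPLE = "|".join([
    "http-demo-/a.jsp", "a.jsp", "Page A", "UTF-8", "1920x1080", "24-bit",
    "en", "0", "1", "", "", "", "Mozilla/5.0", "uv001",
    "ss001_2_1567850000000", "10.0.0.1",
])
print(clean_record(SAMPLE))
```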
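(7) View the cleaned-data table (dataclear)
hive> select * from dataclear;

The next step is to compute the business metrics.
4. Data Processing
The business metrics are computed with Hive, the data warehouse tool.
(1) pv (page views)
pv: the number of visits within one day. Each log record represents one click, so the HQL is straightforward:
hive> select count(*) as pv from dataclear where reportTime='2019-09-07';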

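(2) uv (unique visitors)
uv: the number of unique visitors within one day. A visitor who comes back several times on the same day counts as one uv. Logic: count the distinct uvid values in that day's log.
hive> select count(distinct uvid) as uv from dataclear where reportTime='2019-09-07';

(3) vv (total sessions)
vv: the total number of sessions within one day. Logic: count the distinct ssid values in that day's log.
hive> select count(distinct ssid) as vv from dataclear where reportTime='2019-09-07';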

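The three counts above can be checked on a handful of rows without a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive; SQLite is not Hive, so it only illustrates the counting logic. The sample rows are hypothetical.

```python
import sqlite3

# Hypothetical page views: (reportTime, uvid, ssid), one tuple per log record.
ROWS = [
    ("2019-09-07", "u1", "s1"),
    ("2019-09-07", "u1", "s1"),
    ("2019-09-07", "u1", "s2"),
    ("2019-09-07", "u2", "s3"),
    ("2019-09-06", "u3", "s4"),  # a different day, must not be counted
]


def daily_counts(rows, day):
    """Return (pv, uv, vv) for one day from an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataclear (reportTime TEXT, uvid TEXT, ssid TEXT)")
    conn.executemany("INSERT INTO dataclear VALUES (?, ?, ?)", rows)

    def scalar(sql):
        # Every query here returns a single row with a single column.
        return conn.execute(sql, (day,)).fetchone()[0]

    pv = scalar("SELECT COUNT(*) FROM dataclear WHERE reportTime = ?")
    uv = scalar("SELECT COUNT(DISTINCT uvid) FROM dataclear WHERE reportTime = ?")
    vv = scalar("SELECT COUNT(DISTINCT ssid) FROM dataclear WHERE reportTime = ?")
    conn.close()
    return pv, uv, vv


print(daily_counts(ROWS, "2019-09-07"))  # (4, 2, 3)
```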
(4) br (bounce rate)
br: the share of one day's sessions that bounced. A bounced session is a session in which only one page was viewed.
Total sessions SQL: select count(distinct ssid) as vv_count from dataclear where reportTime='2019-09-07'
Bounced sessions SQL: select count(br_tab.ssid) as br_count from (select ssid from dataclear where reportTime='2019-09-07' group by ssid having count(*) = 1) as br_tab
HQL logic: bounced sessions / total sessions
hive> select round(br_left_tab.br_count / br_right_tab.vv_count,4) as br from
    > (select count(br_tab.ssid) as br_count from (select ssid from dataclear where reportTime='2019-09-07' group by ssid having count(*) = 1) as br_tab) as br_left_tab,
    > (select count(distinct ssid) as vv_count from dataclear where reportTime='2019-09-07') as br_right_tab;

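The bounce rate can also be expressed as a single pass over the per-session counts: average a 1/0 flag that marks sessions with exactly one page view. The sketch below shows that variant in SQLite via Python's sqlite3 module. It is an alternative formulation, not the query used in this post, and it has not been run on Hive. The sample rows are hypothetical.

```python
import sqlite3

# Hypothetical page views: (reportTime, ssid).
HITS = [
    ("2019-09-07", "s1"), ("2019-09-07", "s1"),
    ("2019-09-07", "s2"),
    ("2019-09-07", "s3"),
    ("2019-09-07", "s4"), ("2019-09-07", "s4"), ("2019-09-07", "s4"),
]


def bounce_rate(hits, day):
    """Bounced sessions / total sessions for one day, rounded to 4 places."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataclear (reportTime TEXT, ssid TEXT)")
    conn.executemany("INSERT INTO dataclear VALUES (?, ?)", hits)
    row = conn.execute(
        """
        SELECT ROUND(AVG(CASE WHEN page_views = 1 THEN 1.0 ELSE 0.0 END), 4)
        FROM (SELECT ssid, COUNT(*) AS page_views
              FROM dataclear
              WHERE reportTime = ?
              GROUP BY ssid) AS per_session
        """,
        (day,),
    ).fetchone()
    conn.close()
    return row[0]  # None when the day has no sessions at all


print(bounce_rate(HITS, "2019-09-07"))  # 0.5 (s2 and s3 bounced, out of 4)
```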
(5) newip (new IPs)
newip: the number of new IPs. Logic: the count of the day's distinct IPs that never appear in the historical data.
hive> select count(distinct dataclear.cip) as newip from dataclear
    > where dataclear.reportTime='2019-09-07'
    > and dataclear.cip not in
    > (select distinct inner_dataclear_tab.cip from dataclear as inner_dataclear_tab
    > where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0);

(6) newcust (new customers)
newcust: the number of new customers. Logic: the count of the day's distinct uvid values that never appear in the historical data.
hive> select count(distinct dataclear.uvid) as newcust from dataclear
    > where dataclear.reportTime='2019-09-07'
    > and dataclear.uvid not in
    > (select inner_dataclear_tab.uvid from dataclear as inner_dataclear_tab
    > where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0);

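One thing to watch with the "not in" pattern: under standard SQL semantics, a single NULL in the subquery result makes NOT IN match nothing. The sketch below demonstrates this in SQLite (via Python's sqlite3) and compares it with a NOT EXISTS formulation. Whether your Hive version behaves the same way is worth checking rather than assuming. The rows are hypothetical, and the ISO date string comparison stands in for the datediff(...) > 0 test.

```python
import sqlite3

# Hypothetical rows: (reportTime, cip). One historical row has no IP.
ROWS = [
    ("2019-09-06", "10.0.0.1"),
    ("2019-09-06", None),
    ("2019-09-07", "10.0.0.1"),
    ("2019-09-07", "10.0.0.2"),
    ("2019-09-07", "10.0.0.2"),
    ("2019-09-07", "10.0.0.3"),
]

NOT_IN_SQL = """
SELECT COUNT(DISTINCT t.cip) FROM dataclear AS t
WHERE t.reportTime = :day
  AND t.cip NOT IN (SELECT h.cip FROM dataclear AS h
                    WHERE h.reportTime < :day)
"""

NOT_EXISTS_SQL = """
SELECT COUNT(DISTINCT t.cip) FROM dataclear AS t
WHERE t.reportTime = :day
  AND NOT EXISTS (SELECT 1 FROM dataclear AS h
                  WHERE h.reportTime < :day AND h.cip = t.cip)
"""


def count_new_ips(rows, day, sql):
    """Run one of the two queries above against an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataclear (reportTime TEXT, cip TEXT)")
    conn.executemany("INSERT INTO dataclear VALUES (?, ?)", rows)
    result = conn.execute(sql, {"day": day}).fetchone()[0]
    conn.close()
    return result


print(count_new_ips(ROWS, "2019-09-07", NOT_IN_SQL))      # 0: the NULL hides every new IP
print(count_new_ips(ROWS, "2019-09-07", NOT_EXISTS_SQL))  # 2: 10.0.0.2 and 10.0.0.3
```

If the historical data can contain empty cip or uvid values, filtering them out of the subquery (for example with "is not null") avoids the problem while keeping the "not in" shape.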
(7) avgtime (average visit duration)
avgtime: the average duration of all sessions within one day. The duration of one session is the access time of its last page minus the access time of its first page.
hive> select avg(avgtime_tab.use_time) as avgtime from
    > (select max(sstime) - min(sstime) as use_time from dataclear
    > where reportTime='2019-09-07' group by ssid) as avgtime_tab;

(8) avgdeep (average visit depth)
avgdeep: the average visit depth of all sessions within one day. The depth of one session is the number of distinct addresses it visited.
hive> select round(avg(avgdeep_tab.deep),4) as avgdeep from (select count(distinct url) as deep from dataclear where
    > reportTime='2019-09-07' group by ssid) as avgdeep_tab;

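Both avgtime and avgdeep follow the same two-level shape: aggregate per session first, then average across sessions. The sketch below works through both on hypothetical rows, using SQLite through Python's sqlite3 as a stand-in for Hive. The CAST is an addition in this sketch so that the subtraction is clearly numeric; the queries in this post subtract the string column directly.

```python
import sqlite3

# Hypothetical page views: (reportTime, ssid, sstime as text, url).
ROWS = [
    ("2019-09-07", "s1", "1567850000000", "/a.jsp"),
    ("2019-09-07", "s1", "1567850030000", "/b.jsp"),
    ("2019-09-07", "s1", "1567850090000", "/a.jsp"),
    ("2019-09-07", "s2", "1567850500000", "/a.jsp"),
]


def session_averages(rows, day):
    """Return (avgtime, avgdeep) for one day."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataclear (reportTime TEXT, ssid TEXT, sstime TEXT, url TEXT)"
    )
    conn.executemany("INSERT INTO dataclear VALUES (?, ?, ?, ?)", rows)
    row = conn.execute(
        """
        SELECT AVG(use_time), ROUND(AVG(deep), 4)
        FROM (SELECT MAX(CAST(sstime AS INTEGER))
                     - MIN(CAST(sstime AS INTEGER)) AS use_time,
                     COUNT(DISTINCT url) AS deep
              FROM dataclear
              WHERE reportTime = ?
              GROUP BY ssid) AS per_session
        """,
        (day,),
    ).fetchone()
    conn.close()
    return row


# s1 lasts 90000 and sees 2 distinct urls; s2 lasts 0 and sees 1.
print(session_averages(ROWS, "2019-09-07"))  # (45000.0, 1.5)
```

Note that a single-page session contributes a duration of 0, which pulls avgtime down on days with many bounces.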
5. Writing the Metric Results to the Target Table
The goal is to write the 8 computed metrics (pv, uv, vv, br, newip, newcust, avgtime, avgdeep) into the target table (tongji1). There are two approaches to choose from:
Approach 1: treat each metric's HQL as a table, run a Cartesian-product query across these 8 tables, and write the result into tongji1. The HQL looks roughly like this:
insert into tongji1
select '2019-09-07',tab1.pv,tab2.uv,tab3.vv,tab4.br,tab5.newip,tab6.newcust,tab7.avgtime,tab8.avgdeep
from
(select count(*) as pv from dataclear where reportTime='2019-09-07') as tab1,
(select count(distinct uvid) as uv from dataclear where reportTime='2019-09-07') as tab2,
(select count(distinct ssid) as vv from dataclear where reportTime='2019-09-07') as tab3,
(select round(br_left_tab.br_count / br_right_tab.vv_count,4) as br from
  (select count(br_tab.ssid) as br_count from
    (select ssid from dataclear where reportTime='2019-09-07' group by ssid having count(*) = 1) as br_tab) as br_left_tab,
  (select count(distinct ssid) as vv_count from dataclear where reportTime='2019-09-07') as br_right_tab) as tab4,
(select count(distinct dataclear.cip) as newip from dataclear
  where dataclear.reportTime='2019-09-07' and dataclear.cip not in
    (select distinct inner_dataclear_tab.cip from dataclear as inner_dataclear_tab
     where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0)) as tab5,
(select count(distinct dataclear.uvid) as newcust from dataclear
  where dataclear.reportTime='2019-09-07' and dataclear.uvid not in
    (select inner_dataclear_tab.uvid from dataclear as inner_dataclear_tab
     where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0)) as tab6,
(select avg(avgtime_tab.use_time) as avgtime from
  (select max(sstime) - min(sstime) as use_time from dataclear where reportTime='2019-09-07' group by ssid) as avgtime_tab) as tab7,
(select round(avg(avgdeep_tab.deep),4) as avgdeep from
  (select count(distinct url) as deep from dataclear where reportTime='2019-09-07' group by ssid) as avgdeep_tab) as tab8;
This approach joins the query results and inserts them into tongji1 in one statement. It works, but joining this many tables is inefficient, and if any one of the MR jobs fails, the whole statement has to be recomputed, so reliability is low. For that reason the second approach below is the one used here.
Approach 2: use an intermediate temporary table to hold the intermediate results, then write the data into the target table (tongji1), as follows:
(1) Create the statistics table (tongji1)
hive> create table tongji1 (reportTime string,pv int,uv int,vv int,br double,newip int,newcust int,avgtime double,avgdeep double) row format delimited fields terminated by '|';
(2) Create the intermediate table (tongji1_temp)
hive> create table tongji1_temp (reportTime string,field string,value double) row format delimited fields terminated by '|';
(3) Write each business metric into the intermediate table (tongji1_temp), one at a time
hive> insert into tongji1_temp select '2019-09-07','pv',t1.pv from (select count(*) as pv from dataclear where reportTime='2019-09-07') as t1;
hive> insert into tongji1_temp select '2019-09-07','uv',t2.uv from (select count(distinct uvid) as uv from dataclear where reportTime='2019-09-07') as t2;
hive> insert into tongji1_temp select '2019-09-07','vv',t3.vv from (select count(distinct ssid) as vv from dataclear where reportTime='2019-09-07') as t3;
hive> insert into tongji1_temp select '2019-09-07','br',t4.br from (select round(br_left_tab.br_count / br_right_tab.vv_count,4) as br from (select count(br_tab.ssid) as br_count from (select ssid from dataclear where reportTime='2019-09-07' group by ssid having count(*) = 1) as br_tab) as br_left_tab, (select count(distinct ssid) as vv_count from dataclear where reportTime='2019-09-07') as br_right_tab) as t4;
hive> insert into tongji1_temp select '2019-09-07','newip',t5.newip from (select count(distinct dataclear.cip) as newip from dataclear where dataclear.reportTime='2019-09-07' and dataclear.cip not in (select distinct inner_dataclear_tab.cip from dataclear as inner_dataclear_tab where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0)) as t5;
hive> insert into tongji1_temp select '2019-09-07','newcust',t6.newcust from (select count(distinct dataclear.uvid) as newcust from dataclear where dataclear.reportTime='2019-09-07' and dataclear.uvid not in (select inner_dataclear_tab.uvid from dataclear as inner_dataclear_tab where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0)) as t6;
hive> insert into tongji1_temp select '2019-09-07','avgtime',t7.avgtime from (select avg(avgtime_tab.use_time) as avgtime from (select max(sstime) - min(sstime) as use_time from dataclear where reportTime='2019-09-07' group by ssid) as avgtime_tab) as t7;
hive> insert into tongji1_temp select '2019-09-07','avgdeep',t8.avgdeep from (select round(avg(avgdeep_tab.deep),4) as avgdeep from (select count(distinct url) as deep from dataclear where reportTime='2019-09-07' group by ssid) as avgdeep_tab) as t8;
(4) Load the data from the intermediate table (tongji1_temp) into the final target table (tongji1)
hive> insert into tongji1 select '2019-09-07',t1.pv,t2.uv,t3.vv,t4.br,t5.newip, t6.newcust, t7.avgtime, t8.avgdeep from
    > (select value as pv from tongji1_temp where field='pv' and reportTime='2019-09-07') as t1,
    > (select value as uv from tongji1_temp where field='uv' and reportTime='2019-09-07') as t2,
    > (select value as vv from tongji1_temp where field='vv' and reportTime='2019-09-07') as t3,
    > (select value as br from tongji1_temp where field='br' and reportTime='2019-09-07') as t4,
    > (select value as newip from tongji1_temp where field='newip' and reportTime='2019-09-07') as t5,
    > (select value as newcust from tongji1_temp where field='newcust' and reportTime='2019-09-07') as t6,
    > (select value as avgtime from tongji1_temp where field='avgtime' and reportTime='2019-09-07') as t7,
    > (select value as avgdeep from tongji1_temp where field='avgdeep' and reportTime='2019-09-07') as t8;
(5) View the target table (tongji1)
hive> select * from tongji1;

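(6) Summary of the approaches
The second approach computes each business metric separately, stores the intermediate results in a temporary table, and then loads tongji1 from that table. This reduces the complexity of the SQL and improves efficiency, and if any single HQL statement fails, only that statement needs to be rerun rather than the whole job. It does have drawbacks, such as the extra storage the intermediate table takes up.
There is of course a third option, such as using Hive transactional tables, but I have not looked into it.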
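Step (4) above turns the long-format rows of tongji1_temp (one row per metric) into one wide row. A common way to do this kind of pivot without joining eight subqueries is conditional aggregation: group by reportTime and pick each metric with a CASE expression. The sketch below shows the idea in SQLite through Python's sqlite3. It is an alternative, not the query used in this post, it has not been run on Hive, and the metric values are made up.

```python
import sqlite3

METRICS = ["pv", "uv", "vv", "br", "newip", "newcust", "avgtime", "avgdeep"]

# Hypothetical contents of tongji1_temp: (reportTime, field, value).
TEMP_ROWS = [
    ("2019-09-07", "pv", 4.0), ("2019-09-07", "uv", 2.0),
    ("2019-09-07", "vv", 3.0), ("2019-09-07", "br", 0.5),
    ("2019-09-07", "newip", 2.0), ("2019-09-07", "newcust", 2.0),
    ("2019-09-07", "avgtime", 45000.0), ("2019-09-07", "avgdeep", 1.5),
]


def pivot_sql():
    """Build the pivot query: one MAX(CASE ...) column per metric."""
    columns = ",\n       ".join(
        f"MAX(CASE WHEN field = '{name}' THEN value END) AS {name}"
        for name in METRICS
    )
    return (
        f"SELECT reportTime,\n       {columns}\n"
        "FROM tongji1_temp\nWHERE reportTime = ?\nGROUP BY reportTime"
    )


def pivot_metrics(temp_rows, day):
    """Return the wide row for one day, or None if the day has no rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tongji1_temp (reportTime TEXT, field TEXT, value REAL)")
    conn.executemany("INSERT INTO tongji1_temp VALUES (?, ?, ?)", temp_rows)
    row = conn.execute(pivot_sql(), (day,)).fetchone()
    conn.close()
    return row


print(pivot_metrics(TEMP_ROWS, "2019-09-07"))
```

A metric that is missing from the intermediate table shows up as an empty value in the wide row, whereas the eight-way join in step (4) would return no row at all.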
6. Writing the Cleaned and Processed Results to MySQL with Sqoop
(1) In MySQL, create the database logdb and the table tongji1. The key SQL is:
create database logdb;
use logdb;
create table tongji1(
  reportTime date,
  pv int,
  uv int,
  vv int,
  br double,
  newip int,
  newcust int,
  avgtime double,
  avgdeep double
);
(2) Go to Sqoop's bin directory and run the following (see the post on installing Sqoop and using its basic commands)
[root@hadoopalone bin]# ./sqoop export --connect jdbc:mysql://hadoopalone:3306/logdb --username root --password root --export-dir '/user/hive/warehouse/logdb.db/tongji1' --table tongji1 -m 1 --fields-terminated-by '|'
Note: if Sqoop reports that the MySQL driver is missing, upload a copy of mysql-connector-java-5.1.38-bin.jar to Sqoop's lib directory and run the command again.

(3) View the data in the tongji1 table in MySQL

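7. Summary
At this point the data cleaning and processing for the offline analysis is complete, and the results have been exported to MySQL with Sqoop. This case shows that extracting, cleaning and processing data (ETL) is a very important stage in big data processing, as the length of this post suggests. Thank you for reading to the end. The next step is to visualize the offline analysis results: Website Log Traffic Analysis System: Data Visualization.
You may have noticed that every date in this post is hard-coded as reportTime='2019-09-07'. In real development the date would certainly not be written by hand; the job would run automatically, which calls for a script that automates the HQL. I have already written a post on this, Website Log Traffic Analysis System: Offline Analysis (Automation Script), and I hope you will join me in discussing it. Thanks!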

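As a small illustration of removing the hard-coded date, the Python sketch below fills the day into the pv statement from this post and builds a command line for the Hive CLI's -e option. It is only a sketch of the idea, not the script from the automation post. The template constant and the function names are hypothetical, and defaulting to today's date is an assumption based on the "add today's partition" step.

```python
from datetime import date

# The pv statement from this post, with the date turned into a placeholder.
PV_HQL = (
    "insert into tongji1_temp "
    "select '{day}','pv',t1.pv from "
    "(select count(*) as pv from dataclear where reportTime='{day}') as t1;"
)


def render_hql(template, day=None):
    """Fill the report date (default: today) into an HQL template."""
    day = day or date.today()
    return template.format(day=day.isoformat())


def hive_argv(hql):
    """Argument list that passes one statement to the Hive CLI via -e."""
    return ["hive", "-e", hql]


hql = render_hql(PV_HQL, date(2019, 9, 7))
print(hive_argv(hql))
# To actually run it on a host with Hive installed:
#   import subprocess; subprocess.run(hive_argv(hql), check=True)
```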