Website Log Traffic Analysis System: Data Cleaning and Processing (Offline Analysis)

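Reposted article, dated 2019-09-07 01:32. Topic: website log traffic analysis.

The previous post, Website Log Traffic Analysis System: Log Collection, already collected the data and landed it in HDFS. According to the architecture diagram of the system, the next step is offline analysis: cleaning the data in HDFS either with an MR program or with hand-written HQL. Because the cleaning logic is fairly simple, I chose Hive to clean the data (MR would work as well). The cleaning and processing walkthrough is fairly long, so: be patient, please!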

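2. Server Planning

3. Data Cleaning

Because the cleaning logic for this test data is fairly simple, Hive is used for the cleaning (you could also hand-write an MR program). The steps below are run on the host hadoopalone, which has Hadoop installed in pseudo-distributed mode.

(1) Enter the Hive command line and create the database logdb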

hive> create database logdb;

(2) Create an external partitioned table to manage the data in HDFS

hive> use logdb;

hive> create external table logdemo
    > (url string,urlname string,title string,chset string,
    > scr string,col string,lg string,je string,ec string,
    > fv string,cn string,ref string,uagent string,
    > stat_uv string,stat_ss string,cip string)
    > partitioned by (reportTime string)  row format delimited fields 
    > terminated by '|' location '/logdemo';

(3) Add today's partition (the location clause links the partition to its HDFS directory)

hive> alter table logdemo add partition(reportTime='2019-09-07')  location '/logdemo/reportTime=2019-09-07';

(4) View the data

hive> select * from logdemo;

[Figure: the corresponding data in HDFS]

(5) Create the cleaned-data table

hive> create table dataclear 
    > (url string,urlname string,ref string,uagent string,
    > uvid string,ssid string,sscount string,sstime string,cip string) 
    > partitioned by (reportTime string) row format delimited fields terminated by '|';

(6) Load the data from logdemo into the cleaned-data table (dataclear)

hive> insert into dataclear partition(reportTime='2019-09-07') 
    > select split(url,'-')[2],urlname,ref,uagent,stat_uv,split(stat_ss,'_')[0],
    > split(stat_ss,'_')[1],split(stat_ss,'_')[2],cip from logdemo 
    > where reportTime = '2019-09-07';
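
To make the transformation in step (6) more concrete, here is a minimal Python sketch of how stat_ss is split into three columns. It assumes the field has the layout ssid_sscount_sstime, which is inferred from the target column names rather than confirmed by real log data; the sample value and the function name split_stat_ss are made up for illustration.

```python
from typing import Dict, Optional


def split_stat_ss(stat_ss: str) -> Dict[str, Optional[str]]:
    """Mirror split(stat_ss,'_')[0], [1] and [2] from the Hive insert."""
    parts = stat_ss.split("_")
    # Hive returns NULL for a missing array index; pad with None to mimic that.
    parts = parts + [None] * (3 - len(parts))
    return {"ssid": parts[0], "sscount": parts[1], "sstime": parts[2]}


# Hypothetical value, assuming the layout ssid_sscount_sstime.
example = split_stat_ss("3080338451_2_1537256745124")
print(example)
```

A malformed value with fewer than three parts ends up with None in the trailing columns, which is roughly what the Hive statement would store as NULL.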

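(7) View the cleaned-data table (dataclear)

hive> select * from dataclear;

Next, compute the business metrics.

4. Data Processing

Hive, the data warehouse tool, is used to compute the business metrics.

(1) Computing pv (page views)

pv: the number of page views within one day. Each log line represents one view, so the HQL is simple: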

hive> select count(*) as pv from dataclear where reportTime='2019-09-07';

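(2) Computing uv (unique visitors)

uv: the number of unique visitors within one day. A visitor who comes back several times on the same day counts as one uv. Logic: count the distinct uvid values in that day's logs.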

hive> select count(distinct uvid) as uv from dataclear where reportTime='2019-09-07';

(3) Computing vv (total sessions)

vv: the total number of sessions within one day. Logic: count the distinct ssid values in that day's logs.

hive> select count(distinct ssid) as vv from dataclear where reportTime='2019-09-07';
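
The three counts above can be checked by hand on a tiny data set. The following Python sketch is only an illustration of the counting logic, not a replacement for the Hive queries; the rows are invented and hold (uvid, ssid, url) for a single day.

```python
# Hypothetical cleaned rows for one day: (uvid, ssid, url).
rows = [
    ("u1", "s1", "/a.jsp"),
    ("u1", "s1", "/b.jsp"),
    ("u1", "s2", "/a.jsp"),
    ("u2", "s3", "/a.jsp"),
]

pv = len(rows)                           # count(*)
uv = len({uvid for uvid, _, _ in rows})  # count(distinct uvid)
vv = len({ssid for _, ssid, _ in rows})  # count(distinct ssid)
print(pv, uv, vv)
```

Visitor u1 opens two sessions and views three pages, so it adds three to pv, two to vv, and only one to uv.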

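(4) Computing br (bounce rate)

br: the bounce rate, i.e. the ratio of bounced sessions to all sessions within one day. A bounced session is a session in which only one page was viewed.

Total sessions SQL: select count(distinct ssid) as vv_count from dataclear where reportTime='2019-09-07'

Bounced sessions SQL: select count(br_tab.ssid) as br_count from (select ssid from dataclear where reportTime='2019-09-07' group by ssid having count(*) = 1) as br_tab

HQL logic: number of bounced sessions / total number of sessions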

hive> select round(br_left_tab.br_count / br_right_tab.vv_count,4) as br from   
    > (select count(br_tab.ssid) as br_count from (select ssid from dataclear where reportTime='2019-09-07' group by ssid having count(*) = 1) as br_tab) as br_left_tab, 
    > (select count(distinct ssid) as vv_count from dataclear where reportTime='2019-09-07') as br_right_tab;
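
As a cross-check of the bounce-rate logic, here is a small Python sketch under the same assumption as the HQL: every log row is one page view, so a session with exactly one row counts as bounced. The function name bounce_rate and the sample session ids are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def bounce_rate(ssids: List[str]) -> Optional[float]:
    """ssids holds one session id per log row of the day."""
    views_per_session = Counter(ssids)
    if not views_per_session:
        return None  # no sessions: the ratio is undefined
    # group by ssid having count(*) = 1
    bounced = sum(1 for n in views_per_session.values() if n == 1)
    return round(bounced / len(views_per_session), 4)


# Hypothetical day: s1 views two pages, s2 and s3 view one page each.
br = bounce_rate(["s1", "s1", "s2", "s3"])
print(br)
```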

(5) Computing newip (new IPs)

newip: the number of new IPs. Logic: the count of that day's distinct IPs that never appeared in the historical data.

hive> select count(distinct dataclear.cip) as newip from dataclear 
    > where dataclear.reportTime='2019-09-07' 
    > and dataclear.cip not in 
    > (select distinct inner_dataclear_tab.cip from dataclear as inner_dataclear_tab 
    > where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0);

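(6) Computing newcust (new customers)

newcust: the number of new customers. Logic: the count of that day's distinct uvid values that never appeared in the historical data.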

hive> select count(distinct dataclear.uvid) as newcust from dataclear 
    > where dataclear.reportTime='2019-09-07' 
    > and dataclear.uvid not in 
    > (select inner_dataclear_tab.uvid from dataclear as inner_dataclear_tab 
    > where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0);
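
Both newip and newcust follow the same pattern: today's distinct values minus everything seen on earlier days. A minimal Python sketch of that set difference, with invented (reportTime, value) pairs and a hypothetical helper count_new, might look like this:

```python
from typing import List, Tuple


def count_new(rows: List[Tuple[str, str]], day: str) -> int:
    """rows holds (reportTime, value); value is a cip or a uvid."""
    today = {value for report_time, value in rows if report_time == day}
    # ISO dates compare correctly as strings; this plays the role of
    # datediff(day, reportTime) > 0 in the HQL.
    history = {value for report_time, value in rows if report_time < day}
    return len(today - history)


# Hypothetical data: 10.0.0.2 was seen before, 10.0.0.3 is new.
rows = [
    ("2019-09-05", "10.0.0.1"),
    ("2019-09-06", "10.0.0.2"),
    ("2019-09-07", "10.0.0.2"),
    ("2019-09-07", "10.0.0.3"),
    ("2019-09-07", "10.0.0.3"),
]
newip = count_new(rows, "2019-09-07")
print(newip)
```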

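(7) Computing avgtime (average visit duration)

avgtime: the average visit duration, i.e. the mean duration of all sessions within one day. The duration of one session is the access time of its last page minus the access time of its first page.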

hive> select avg(avgtime_tab.use_time) as avgtime from 
    > (select max(sstime) - min(sstime) as use_time from dataclear 
    > where reportTime='2019-09-07' group by ssid) as avgtime_tab;
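
The per-session max minus min can be sketched in Python as follows. This is an illustration only: it assumes sstime is a numeric timestamp stored as text (the column is declared as string in dataclear), so the sketch converts it to int before comparing. The sample rows and the function name avg_session_time are made up.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def avg_session_time(rows: List[Tuple[str, str]]) -> Optional[float]:
    """rows holds (ssid, sstime) with sstime as a numeric string."""
    times = defaultdict(list)
    for ssid, sstime in rows:
        times[ssid].append(int(sstime))  # compare as numbers, not as text
    # max(sstime) - min(sstime) ... group by ssid
    durations = [max(t) - min(t) for t in times.values()]
    return sum(durations) / len(durations) if durations else None


# Hypothetical timestamps; s2 has a single page view, so its duration is 0.
rows = [("s1", "1000"), ("s1", "4000"), ("s2", "2000"), ("s3", "5000"), ("s3", "5600")]
avgtime = avg_session_time(rows)
print(avgtime)
```

Sessions with a single page view contribute a duration of 0 and pull the average down, which matches what the HQL computes.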

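(8) Computing avgdeep (average visit depth)

avgdeep: the mean visit depth of all sessions within one day. The visit depth of one session is the number of distinct URLs it visited.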

hive> select round(avg(avgdeep_tab.deep),4) as avgdeep from (select count(distinct url) as deep from dataclear where
    > reportTime='2019-09-07' group by ssid) as avgdeep_tab;
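
Visit depth is a distinct count per session followed by an average over sessions. A rough Python equivalent, using invented (ssid, url) rows and a hypothetical helper avg_depth, is sketched below.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def avg_depth(rows: List[Tuple[str, str]]) -> Optional[float]:
    """rows holds (ssid, url) for one day."""
    pages = defaultdict(set)
    for ssid, url in rows:
        pages[ssid].add(url)  # the set removes repeated URLs within a session
    if not pages:
        return None
    return round(sum(len(urls) for urls in pages.values()) / len(pages), 4)


# Hypothetical day: s1 visits /a.jsp twice and /b.jsp once, s2 visits /a.jsp once.
avgdeep = avg_depth([("s1", "/a.jsp"), ("s1", "/b.jsp"), ("s1", "/a.jsp"), ("s2", "/a.jsp")])
print(avgdeep)
```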

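5. Writing the Metric Results to the Target Table

The goal is to write the 8 computed metrics (pv, uv, vv, br, newip, newcust, avgtime, avgdeep) into the target table (tongji1). There are two options to choose from:

Option 1: treat the HQL of each metric as a table, i.e. query the Cartesian product of these 8 single-row tables and write the result into tongji1. The HQL looks roughly like this: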

insert into tongji1 
select  '2019-09-07',tab1.pv,tab2.uv,tab3.vv,tab4.br,tab5.newip,tab6.newcust,tab7.avgtime,tab8.avgdeep from 
(select count(*) as pv from dataclear where reportTime='2019-09-07') as tab1, 
(select count(distinct uvid) as uv from dataclear where reportTime='2019-09-07') as tab2, 
(select count(distinct ssid) as vv from dataclear where reportTime='2019-09-07') as tab3, 
(select round(br_left_tab.br_count / br_right_tab.vv_count,4) as br from   
(select count(br_tab.ssid) as br_count from (select ssid from dataclear 
where reportTime='2019-09-07' group by ssid having count(*) = 1) as br_tab) as br_left_tab, 
(select count(distinct ssid) as vv_count from dataclear where reportTime='2019-09-07') as br_right_tab) as tab4, 
(select count(distinct dataclear.cip) as newip from dataclear 
where dataclear.reportTime='2019-09-07' and dataclear.cip not in 
(select distinct inner_dataclear_tab.cip from dataclear as inner_dataclear_tab 
where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0)) as tab5, 
(select count(distinct dataclear.uvid) as newcust from dataclear 
where dataclear.reportTime='2019-09-07' and dataclear.uvid not in 
(select inner_dataclear_tab.uvid from dataclear as inner_dataclear_tab 
where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0)) as tab6, 
(select avg(avgtime_tab.use_time) as avgtime from (select max(sstime) - min(sstime) as use_time from dataclear 
where reportTime='2019-09-07' group by ssid) as avgtime_tab) as tab7, 
(select round(avg(avgdeep_tab.deep),4) as avgdeep from 
(select count(distinct url) as deep from dataclear 
where reportTime='2019-09-07' group by ssid) as avgdeep_tab) as tab8;

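This approach joins the results of several queries and inserts them into tongji1 in a single statement. It works, but joining this many tables is inefficient, and if any one of the MR jobs fails, the whole statement has to be recomputed, so reliability is low. The second option below is therefore used.

Option 2: use an intermediate temporary table to hold the intermediate data, then write the data into the target table (tongji1), as follows:

(1) Create the statistics table (tongji1)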

hive> create table tongji1 (reportTime string,pv int,uv int,vv int,br double,newip int,newcust int,avgtime double,avgdeep double) row format delimited fields terminated by '|';

(2) Create the intermediate table (tongji1_temp)

hive> create table tongji1_temp (reportTime string,field string,value double) row format delimited fields terminated by '|';

(3) Write each business metric into the intermediate table (tongji1_temp) in turn

hive> insert into tongji1_temp  select '2019-09-07','pv',t1.pv from (select count(*) as pv from dataclear where reportTime='2019-09-07') as t1;

hive> insert into tongji1_temp  select '2019-09-07','uv',t2.uv from (select count(distinct uvid) as uv from dataclear where reportTime='2019-09-07') as t2;

hive> insert into tongji1_temp  select '2019-09-07','vv',t3.vv from (select count(distinct ssid) as vv from dataclear where reportTime='2019-09-07') as t3;

hive> insert into tongji1_temp  select '2019-09-07','br',t4.br from (select round(br_left_tab.br_count / br_right_tab.vv_count,4) as br from   (select count(br_tab.ssid) as br_count from (select ssid from dataclear where reportTime='2019-09-07' group by ssid having count(*) = 1) as br_tab) as br_left_tab, (select count(distinct ssid) as vv_count from dataclear where reportTime='2019-09-07') as br_right_tab) as t4;

hive> insert into tongji1_temp  select '2019-09-07','newip',t5.newip from (select count(distinct dataclear.cip) as newip from dataclear where dataclear.reportTime='2019-09-07' and dataclear.cip not in (select distinct inner_dataclear_tab.cip from dataclear as inner_dataclear_tab where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0)) as t5;

hive> insert into tongji1_temp  select '2019-09-07','newcust',t6.newcust from (select count(distinct dataclear.uvid) as newcust from dataclear where dataclear.reportTime='2019-09-07' and dataclear.uvid not in (select inner_dataclear_tab.uvid from dataclear as inner_dataclear_tab where datediff('2019-09-07',inner_dataclear_tab.reportTime)>0)) as t6;

hive> insert into tongji1_temp  select '2019-09-07','avgtime',t7.avgtime from (select avg(avgtime_tab.use_time) as avgtime from (select max(sstime) - min(sstime) as use_time from dataclear where reportTime='2019-09-07' group by ssid) as avgtime_tab) as t7;

hive> insert into tongji1_temp  select '2019-09-07','avgdeep',t8.avgdeep from (select round(avg(avgdeep_tab.deep),4) as avgdeep from (select count(distinct url) as deep from dataclear where reportTime='2019-09-07' group by ssid) as avgdeep_tab) as t8;

(4) Load the data from the intermediate table (tongji1_temp) into the final target table (tongji1)

hive> insert into tongji1 select '2019-09-07',t1.pv,t2.uv,t3.vv,t4.br,t5.newip, t6.newcust, t7.avgtime, t8.avgdeep from  
    > (select value as pv from tongji1_temp where field='pv' and reportTime='2019-09-07') as t1, 
    > (select value as uv from tongji1_temp where field='uv' and reportTime='2019-09-07') as t2, 
    > (select value as vv from tongji1_temp where field='vv' and reportTime='2019-09-07') as t3, 
    > (select value as br from tongji1_temp where field='br' and reportTime='2019-09-07') as t4, 
    > (select value as newip from tongji1_temp where field='newip' and reportTime='2019-09-07') as t5, 
    > (select value as newcust from tongji1_temp where field='newcust' and reportTime='2019-09-07') as t6, 
    > (select value as avgtime from tongji1_temp where field='avgtime' and reportTime='2019-09-07') as t7, 
    > (select value as avgdeep from tongji1_temp where field='avgdeep' and reportTime='2019-09-07') as t8;
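
Step (4) is essentially a pivot: eight (reportTime, field, value) rows become one wide row. The Python sketch below illustrates that idea with made-up values; the helper pivot and the explicit checks are my own additions, not part of the Hive statement. The checks reflect two properties of a cross join: it yields no row at all if any metric is missing, and more than one row if a metric was inserted twice.

```python
from typing import List, Tuple

FIELDS = ["pv", "uv", "vv", "br", "newip", "newcust", "avgtime", "avgdeep"]


def pivot(temp_rows: List[Tuple[str, str, float]], day: str) -> tuple:
    """Turn (reportTime, field, value) rows into one wide row for `day`."""
    values = {}
    for report_time, field, value in temp_rows:
        if report_time != day:
            continue
        if field in values:
            # The Hive cross join would emit one output row per duplicate.
            raise ValueError(f"duplicate metric {field!r} for {day}")
        values[field] = value
    missing = [f for f in FIELDS if f not in values]
    if missing:
        # The Hive cross join would return no rows in this case.
        raise ValueError(f"missing metrics for {day}: {missing}")
    return (day, *(values[f] for f in FIELDS))


# Made-up values for illustration only.
temp_rows = [("2019-09-07", f, float(i)) for i, f in enumerate(FIELDS, start=1)]
wide_row = pivot(temp_rows, "2019-09-07")
print(wide_row)
```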


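(5) View the target table (tongji1)

hive> select * from tongji1;

(6) Summary of the options

With the second option, each business metric is computed separately, an intermediate table holds the temporary data, and the data is then loaded from that table into tongji1. This reduces the complexity of the SQL and improves efficiency. In addition, if any one HQL statement fails, only that statement has to be rerun, not the whole program. The option also has drawbacks, such as wasted storage space.

There is of course a third option, for example Hive transactional tables, which I have not looked into.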

6. Writing the Cleaned and Processed Results to MySQL with Sqoop

(1) In MySQL, create the database logdb and the table tongji1. The key SQL is:

create database logdb;
use logdb;
create table tongji1(
        reportTime date,
        pv int,
        uv int,
        vv int,
        br double,
        newip int,
        newcust int,
        avgtime double,
        avgdeep double
);


(2) Go to Sqoop's bin directory and run the following command (see the post on Sqoop installation and basic commands)

[root@hadoopalone bin]# ./sqoop export --connect jdbc:mysql://hadoopalone:3306/logdb --username root --password root --export-dir '/user/hive/warehouse/logdb.db/tongji1' --table tongji1 -m 1 --fields-terminated-by '|'


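Note: if Sqoop reports that the MySQL driver is missing, upload mysql-connector-java-5.1.38-bin.jar to the lib directory of the Sqoop installation and run the command again.

(3) View the data in the MySQL table tongji1

7. Summary

At this point we have finished the data cleaning and processing for the offline analysis and exported the results to MySQL through Sqoop. This case shows that extracting, cleaning, and processing data (ETL) is a very important stage of big data processing, as the length of this post suggests. Thank you for reading to the end. The next step is to visualize the results of the offline analysis: Website Log Traffic Analysis System: Data Visualization.

You may have noticed that the date is hard-coded throughout this post as reportTime='2019-09-07'. In real development it would certainly not be hard-coded; the program should run automatically, which calls for a script that executes the HQL automatically. I have already written a post on this, Website Log Traffic Analysis System: Offline Analysis (Automation Script). I hope we can discuss it together. Thanks!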
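
As a small illustration of removing the hard-coded date, the Python sketch below fills the report date into an HQL template. It is not the automation script from the follow-up post, which is not shown here; the template constant and the function build_pv_query are hypothetical names, and only the pv query is parameterized.

```python
from datetime import date

# Same pv query as in section 4, with the date left as a placeholder.
PV_TEMPLATE = (
    "select count(*) as pv from dataclear "
    "where reportTime='{report_time}';"
)


def build_pv_query(report_day: date) -> str:
    """Return the pv query for the given day, formatted as yyyy-MM-dd."""
    return PV_TEMPLATE.format(report_time=report_day.isoformat())


query = build_pv_query(date(2019, 9, 7))
print(query)
```

A scheduled job could build each statement this way and hand the resulting string to the Hive CLI, for example through its -e option.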
