Recommended reading for the website clickstream data analysis project:
For reference, you can also look at how Baidu implements this kind of analytics: https://tongji.baidu.com/web/welcome/login
Website clickstream data analysis, business background, recommended book: 《網站分析實戰——如何以數據驅動決策,提升網站價值》 (Website Analytics in Practice: Driving Decisions with Data to Increase Website Value), by 王彥平 and 吳盛鋒. http://download.csdn.net/download/biexiansheng/10160197
2: Overall technical workflow and architecture
2.1 Data processing workflow
This is a pure data analysis project, so its overall flow follows the data processing pipeline, which breaks down into the following major steps:
(1) Data collection
First, user access behavior is captured by JS code embedded in the pages and sent to the backend of the web service, which writes it to logs (here we assume this data has already been obtained). Then the clickstream logs generated on the various servers are aggregated into HDFS, either in real time or in batches. Of course, in a full analysis system the data sources may include not only clickstream data but also business data from databases (user, product, and order information, etc.) and any external data useful for the analysis.
(2) Data preprocessing
The collected clickstream data is preprocessed with MapReduce programs: cleaning, format normalization, filtering out dirty records, and so on. The result is a set of detail tables, i.e. wide tables (several of them), trading space for time.
(3) Data loading
The preprocessed data is loaded into the corresponding databases and tables in the Hive warehouse;
(4) Data analysis
The core of the project: ETL and analysis statements are developed according to the requirements to produce the various statistics;
(5) Data presentation
The analysis results are visualized;
2.2 Project structure
Because this is purely a data analysis project, its overall structure follows the analysis workflow and is not particularly complex, as shown in the diagram below.
One point worth emphasizing:
The analysis is not a one-off computation; it is repeated on a fixed schedule, so the stages of the processing chain have to be linked according to their dependencies. That means managing and scheduling a large number of task units, so the project needs a task scheduling module.
2.3 Data presentation
The goal of data presentation is to visualize the analysis results so that operations and decision-making staff can get to the data more easily and understand it more quickly.
3: Module development - data collection
3.1 Requirements
Broadly speaking, data collection has two parts.
1) Collecting user behavior on the page; the concrete development work is:
a. Develop the page-embedded tracking JS that captures user access behavior
b. A backend that receives the requests sent by the page JS and records them as logs. This part can also be considered the "data source", and it is usually built by the web development team (a minimal backend sketch is given after the list of collection options below).
2) Moving the logs from the web servers into HDFS; this is the data collection of the analysis system itself and is built by the data platform team. There are several ways to implement it:
Shell scripts
Pros: lightweight, easy to develop
Cons: hard to control fault handling during log collection
Java collector program
Pros: fine-grained control over the collection process
Cons: large development effort
Flume log collection framework
A mature open-source log collection system that is itself part of the Hadoop ecosystem, integrates naturally with the other Hadoop components, and is highly extensible
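For item 1)b above, the backend only needs to append one log line per beacon request, in a format close to the nginx access log, so that the same downstream pipeline can consume it. The following is a minimal illustrative sketch only, not the project's collector (which belongs to the web team): the /log.gif endpoint name and the output path are assumptions, and it uses nothing beyond the JDK's built-in HttpServer.

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.InetSocketAddress;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Illustrative beacon endpoint: every request sent by the page-embedded JS is appended
// to a log file in an nginx-like format. Endpoint name and log path are assumptions.
public class BeaconLogServer {

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/log.gif", new HttpHandler() {
            @Override
            public void handle(HttpExchange exchange) throws IOException {
                SimpleDateFormat fmt = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
                String line = String.format("%s - - [%s] \"GET %s HTTP/1.1\" 200 0 \"%s\" \"%s\"",
                        exchange.getRemoteAddress().getAddress().getHostAddress(),
                        fmt.format(new Date()),
                        exchange.getRequestURI(),
                        exchange.getRequestHeaders().getFirst("Referer"),
                        exchange.getRequestHeaders().getFirst("User-Agent"));
                // Append the line; a real collector would buffer writes and rotate files.
                PrintWriter out = new PrintWriter(new FileWriter("/home/hadoop/log/test.log", true));
                out.println(line);
                out.close();
                exchange.sendResponseHeaders(204, -1);   // the beacon needs no response body
                exchange.close();
            }
        });
        server.start();
    }
}

The log file it appends to is the same kind of file that the Flume agent below tails, which is why the format mirrors the nginx access log.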
3.2 Technology choice
In a clickstream log analysis scenario, the reliability and fault-tolerance requirements on the collection layer are usually not very strict, so the general-purpose Flume log collection framework is entirely sufficient.
This project therefore uses Flume for log collection.
3.3 Setting up the Flume log collection system
a. Data source information
The data analyzed in this project is the traffic log generated by nginx servers and stored on each nginx server; details are omitted.
b. Sample data content
The exact content of the data does not matter much at the collection stage.
Field breakdown of one sample log line:
1. Visitor IP address: 58.215.204.118
2. Visitor user info: - -
3. Request time: [18/Sep/2013:06:51:35 +0000]
4. Request method: GET
5. Requested URL: /wp-includes/js/jquery/jquery.js?ver=1.10.2
6. Protocol: HTTP/1.1
7. Response code: 304
8. Response bytes sent: 0
9. Referrer URL: http://blog.fens.me/nodejs-socketio-chat/
10. Visitor user agent: Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0
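For later reference in the preprocessing step, here is a minimal sketch (not the project's actual parser) of how one such nginx log line can be split into its nine raw columns with a regular expression; note that the request column still contains the method, URL, and protocol together.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: split one nginx access-log line into its raw fields.
public class NginxLogLineParser {
    // remote_addr - remote_user [time_local] "request" status body_bytes_sent "http_referer" "http_user_agent"
    private static final Pattern LOG_PATTERN = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.find()) {
            return null;                       // malformed line, to be filtered out later
        }
        String[] fields = new String[9];
        for (int i = 0; i < 9; i++) {
            fields[i] = m.group(i + 1);        // ip, ident, user, time, request, status, bytes, referer, agent
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "
                + "\"GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1\" 304 0 "
                + "\"http://blog.fens.me/nodejs-socketio-chat/\" "
                + "\"Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0\"";
        for (String f : parse(sample)) {
            System.out.println(f);
        }
    }
}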
Now for the hands-on part, learning as we go: collecting the data with Flume works as follows.
Since ready-made data is used directly, the step of producing the raw data is skipped:
(It is assumed that Hadoop, Flume, Hive, Azkaban, MySQL and the other required tools are all installed and configured.)
Step 1: assume the data has already been obtained. The data file from the course material is named access.log.fensi; it is renamed to access.log here.
Step 2: once the data is available, collect it with the Flume log collection system.
Step 3: the collection rules are configured as follows. The Flume configuration file is named tail-hdfs.conf; it picks up the data with the tail command and sinks it into HDFS.
Start command: bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1

########
# Name the components on this agent
# Give the three components (sources, sinks, channels) of this agent logical names; a1 stands for the agent itself.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source component r1.
# type=exec runs a command and collects its standard output.
a1.sources.r1.type = exec
# tail -F follows the file by name; give the path and name of the file to collect.
a1.sources.r1.command = tail -F /home/hadoop/log/test.log
a1.sources.r1.channels = c1

# Describe the sink component k1.
# Sink type hdfs writes the data into the HDFS distributed file system.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Target path; Flume substitutes the date/time escape sequences.
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
# File name prefix.
a1.sinks.k1.hdfs.filePrefix = events-
# Switch to a new directory every 10 minutes.
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Roll the file every 3 seconds so the effect is easy to observe (the default roll interval is 30 seconds).
a1.sinks.k1.hdfs.rollInterval = 3
# Roll the file once it reaches 500 bytes.
a1.sinks.k1.hdfs.rollSize = 500
# Roll the file after 20 events have been written.
a1.sinks.k1.hdfs.rollCount = 20
# Number of events written per batch.
a1.sinks.k1.hdfs.batchSize = 5
# Use the local timestamp.
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type; the default is SequenceFile, DataStream produces plain text.
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
# The channel type is memory.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
The concrete steps are as follows:
[root@master soft]# cd flume/conf/
[root@master conf]# ls
flume-conf.properties.template  flume-env.ps1.template  flume-env.sh  flume-env.sh.template  log4j.properties
[root@master conf]# vim tail-hdfs.conf
Its content is as follows:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data_hadoop/access.log
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 500
a1.sinks.k1.hdfs.rollCount = 20
a1.sinks.k1.hdfs.batchSize = 5
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Then start HDFS; YARN is not strictly required for collection, but both are started here:
[root@master hadoop]# start-dfs.sh
[root@master hadoop]# start-yarn.sh
Once they are up, check whether HDFS is working properly:
[root@master hadoop]# hdfs dfsadmin -report
Configured Capacity: 56104357888 (52.25 GB)
Present Capacity: 39446368256 (36.74 GB)
DFS Remaining: 39438364672 (36.73 GB)
DFS Used: 8003584 (7.63 MB)
DFS Used%: 0.02%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.199.130:50010 (master)
Hostname: master
Decommission Status : Normal
Configured Capacity: 18611974144 (17.33 GB)
DFS Used: 3084288 (2.94 MB)
Non DFS Used: 7680802816 (7.15 GB)
DFS Remaining: 10928087040 (10.18 GB)
DFS Used%: 0.02%
DFS Remaining%: 58.72%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Dec 16 13:31:03 CST 2017

Name: 192.168.199.132:50010 (slaver2)
Hostname: slaver2
Decommission Status : Normal
Configured Capacity: 18746191872 (17.46 GB)
DFS Used: 1830912 (1.75 MB)
Non DFS Used: 4413718528 (4.11 GB)
DFS Remaining: 14330642432 (13.35 GB)
DFS Used%: 0.01%
DFS Remaining%: 76.45%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Dec 16 13:31:03 CST 2017

Name: 192.168.199.131:50010 (slaver1)
Hostname: slaver1
Decommission Status : Normal
Configured Capacity: 18746191872 (17.46 GB)
DFS Used: 3088384 (2.95 MB)
Non DFS Used: 4563468288 (4.25 GB)
DFS Remaining: 14179635200 (13.21 GB)
DFS Used%: 0.02%
DFS Remaining%: 75.64%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Dec 16 13:31:03 CST 2017

[root@master hadoop]#
If HDFS started normally, use the tail command to pick up the data and sink it into HDFS:
Start the collection by launching the Flume agent as shown below (note: the -n parameter must be the agent name configured in the configuration file):
bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
[root@master conf]# cd /home/hadoop/soft/flume/
[root@master flume]# ls
bin  CHANGELOG  conf  DEVNOTES  docs  lib  LICENSE  NOTICE  README  RELEASE-NOTES  tools
[root@master flume]# bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
Output like the following means the agent has started and collection is running:
[root@master flume]# bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
Info: Sourcing environment configuration script /home/hadoop/soft/flume/conf/flume-env.sh
Info: Including Hadoop libraries found via (/home/hadoop/soft/hadoop-2.6.4/bin/hadoop) for HDFS access
Info: Excluding /home/hadoop/soft/hadoop-2.6.4/share/hadoop/common/lib/slf4j-api-1.7.5.jar from classpath
Info: Excluding /home/hadoop/soft/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar from classpath
Info: Including Hive libraries found via (/home/hadoop/soft/apache-hive-1.2.1-bin) for Hive access
+ exec /home/hadoop/soft/jdk1.7.0_65/bin/java -Xmx20m -cp '/home/hadoop/soft/flume/conf:/home/hadoop/soft/flume/lib/*:/home/hadoop/soft/hadoop-2.6.4/etc/hadoop:[... full Hadoop and Hive classpath omitted ...]:/home/hadoop/soft/apache-hive-1.2.1-bin/lib/*' -Djava.library.path=:/home/hadoop/soft/hadoop-2.6.4/lib/native org.apache.flume.node.Application -f conf/tail-hdfs.conf -n a1
Then check the result, either from the command line or in the browser:
As long as /home/hadoop/data_hadoop/access.log keeps receiving new log lines, new collected files keep appearing under the sink path.
[root@master hadoop]# hadoop fs -ls /flume/events/17-12-16
4: Module development - data preprocessing
4.1 Main goals:
Filter out records that do not conform to the expected format
Convert and normalize the format
According to the downstream statistics, filter and split out the base data for the different topics (different site sections / paths)
4.2 Implementation: develop a MapReduce program, WeblogPreProcess (code not pasted here; see the GitHub repository for details).
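The actual WeblogPreProcess code is in the GitHub repository and is not reproduced here. Purely as an illustration, a map-only cleaning job of this shape might look like the sketch below; it is an assumption-laden sketch, not the project's class. It reuses the NginxLogLineParser sketch from section 3.3 to split the line, marks records as invalid with a simple example static-resource rule, and joins the fields with '\001', the delimiter the Hive tables in section 6 expect.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative map-only preprocessing job: parse each raw log line, mark invalid records,
// and emit the cleaned fields joined by '\001'. Not the project's actual WeblogPreProcess class.
public class WeblogPreProcessSketch {

    static class PreMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Reuses the NginxLogLineParser sketch from section 3.3; returns null for dirty lines.
            String[] fields = NginxLogLineParser.parse(value.toString());
            if (fields == null) {
                return;                                    // drop lines that do not parse
            }
            // Example validity rule only: mark requests for static resources as invalid.
            boolean valid = !fields[4].contains("/js/") && !fields[4].contains(".css");
            StringBuilder sb = new StringBuilder(valid ? "true" : "false");
            for (String f : fields) {
                sb.append('\001').append(f);
            }
            outValue.set(sb.toString());
            context.write(outValue, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WeblogPreProcessSketch.class);
        job.setMapperClass(PreMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(0);                          // map-only, so the output is part-m-* files
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}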
The program is developed in Eclipse on Windows; the imported jars include the Hadoop jars (covered earlier, so not repeated here) and the Hive jars (the jars under apache-hive-1.2.1-bin\lib):
While studying you may want to browse the Hadoop source code. I had set this up before, but today Ctrl-clicking into Hadoop classes no longer worked and I had forgotten how I did it, so here is a note on the quickest way I found: right-click the project -> Build Path -> Configure Build Path -> Source -> Link Source, then select hadoop-2.6.4-src.
If a class still cannot be viewed: select the jar, open Properties -> Java Source Attachment -> External Location -> External Folder.
Once the program is written, run it to preprocess the data (i.e. clean the log data):
[root@master data_hadoop]# hadoop jar webLogPreProcess.java.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /flume/events/17-12-16 /flume/filterOutput
The execution output is as follows:
[root@master data_hadoop]# hadoop jar webLogPreProcess.java.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /flume/events/17-12-16 /flume/filterOutput
17/12/16 17:57:25 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.199.130:8032
17/12/16 17:57:57 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/12/16 17:58:03 INFO input.FileInputFormat: Total input paths to process : 3
17/12/16 17:58:08 INFO mapreduce.JobSubmitter: number of splits:3
17/12/16 17:58:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1513402019656_0001
17/12/16 17:58:19 INFO impl.YarnClientImpl: Submitted application application_1513402019656_0001
17/12/16 17:58:20 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1513402019656_0001/
17/12/16 17:58:20 INFO mapreduce.Job: Running job: job_1513402019656_0001
17/12/16 17:59:05 INFO mapreduce.Job: Job job_1513402019656_0001 running in uber mode : false
17/12/16 17:59:05 INFO mapreduce.Job:  map 0% reduce 0%
17/12/16 18:00:25 INFO mapreduce.Job:  map 100% reduce 0%
17/12/16 18:00:27 INFO mapreduce.Job: Job job_1513402019656_0001 completed successfully
17/12/16 18:00:27 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=318342
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1749
        HDFS: Number of bytes written=1138
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=6
    Job Counters
        Launched map tasks=3
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=212389
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=212389
        Total vcore-milliseconds taken by all map tasks=212389
        Total megabyte-milliseconds taken by all map tasks=217486336
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Input split bytes=381
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=3892
        CPU time spent (ms)=3820
        Physical memory (bytes) snapshot=160026624
        Virtual memory (bytes) snapshot=1093730304
        Total committed heap usage (bytes)=33996800
    File Input Format Counters
        Bytes Read=1368
    File Output Format Counters
        Bytes Written=1138
[root@master data_hadoop]#
The output can be inspected with the following commands:
[root@master data_hadoop]# hadoop fs -cat /flume/filterOutput/part-m-00000
[root@master data_hadoop]# hadoop fs -cat /flume/filterOutput/part-m-00001
[root@master data_hadoop]# hadoop fs -cat /flume/filterOutput/part-m-00002
At this point I realized I had confused myself. Since I never actually built the collection step, there is no real Flume collection here; when I ran Flume over access.log it only picked up a handful of records, and that is when it became clear that access.log is itself the already-collected data. The pipeline is data collection, data preprocessing, data loading, data analysis, and data presentation, and the collection step simply uses the ready-made access.log file, so the real work starts from preprocessing.
So, for preprocessing, the finished program can first be run in Eclipse on Windows; the result is shown below. (The Flume part above was really just practice; I left it in, but it is incomplete in this post, so read it selectively.)
2017-12-16 21:51:18,078 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1129)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-12-16 21:51:18,083 INFO [main] jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-12-16 21:51:18,469 WARN [main] mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(64)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-12-16 21:51:18,481 WARN [main] mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(171)) - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-12-16 21:51:18,616 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(281)) - Total input paths to process : 1
2017-12-16 21:51:18,719 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(199)) - number of splits:1
2017-12-16 21:51:18,931 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(288)) - Submitting tokens for job: job_local616550674_0001
2017-12-16 21:51:19,258 INFO [main] mapreduce.Job (Job.java:submit(1301)) - The url to track the job: http://localhost:8080/
2017-12-16 21:51:19,259 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1346)) - Running job: job_local616550674_0001
2017-12-16 21:51:19,261 INFO [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(471)) - OutputCommitter set in config null
2017-12-16 21:51:19,273 INFO [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(489)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2017-12-16 21:51:19,355 INFO [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for map tasks
2017-12-16 21:51:19,355 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(224)) - Starting task: attempt_local616550674_0001_m_000000_0
2017-12-16 21:51:19,412 INFO [LocalJobRunner Map Task Executor #0] util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(181)) - ProcfsBasedProcessTree currently is supported only on Linux.
2017-12-16 21:51:19,479 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:initialize(587)) - Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@75805410
2017-12-16 21:51:19,487 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:runNewMapper(753)) - Processing split: file:/C:/Users/bhlgo/Desktop/input/access.log.fensi:0+3025757
2017-12-16 21:51:20,273 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1367)) - Job job_local616550674_0001 running in uber mode : false
2017-12-16 21:51:20,275 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1374)) - map 0% reduce 0%
2017-12-16 21:51:21,240 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) -
2017-12-16 21:51:21,242 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1001)) - Task:attempt_local616550674_0001_m_000000_0 is done. And is in the process of committing
2017-12-16 21:51:21,315 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) -
2017-12-16 21:51:21,315 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:commit(1162)) - Task attempt_local616550674_0001_m_000000_0 is allowed to commit now
2017-12-16 21:51:21,377 INFO [LocalJobRunner Map Task Executor #0] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(439)) - Saved output of task 'attempt_local616550674_0001_m_000000_0' to file:/C:/Users/bhlgo/Desktop/output/_temporary/0/task_local616550674_0001_m_000000
2017-12-16 21:51:21,395 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - map
2017-12-16 21:51:21,395 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local616550674_0001_m_000000_0' done.
2017-12-16 21:51:21,395 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(249)) - Finishing task: attempt_local616550674_0001_m_000000_0
2017-12-16 21:51:21,405 INFO [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2017-12-16 21:51:22,303 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1374)) - map 100% reduce 0%
2017-12-16 21:51:22,304 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1385)) - Job job_local616550674_0001 completed successfully
2017-12-16 21:51:22,321 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1392)) - Counters: 18
    File System Counters
        FILE: Number of bytes read=3025930
        FILE: Number of bytes written=2898908
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=14619
        Map output records=14619
        Input split bytes=116
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=40
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=162529280
    File Input Format Counters
        Bytes Read=3025757
    File Output Format Counters
        Bytes Written=2647097
The generated files; remember that the output directory (e.g. output) is created automatically by the job:
4.3 Shaping the clickstream model data (the preprocessing program and the model-shaping programs produce three datasets in total; all of them are needed here and will be mapped to Hive tables. The preprocessing-stage MapReduce jobs are driven by a scheduling script.)
Because many metrics are much easier to compute from a clickstream model, MR programs are used in the preprocessing stage to generate the clickstream model data;
4.3.1 Clickstream pageviews model table: generating the pageviews model data
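The pageviews model assigns every request of the same visitor to a session, numbers the steps inside the session, and estimates how long each page was viewed from the gap to the next request. The sketch below is illustrative only: the 30-minute session cut-off and the 60-second default stay time are common conventions assumed here, not necessarily the exact rules of the project's MR program, which runs this logic in a reducer grouped by visitor IP.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// Sketch of the pageviews sessionization rule: one visitor's requests, sorted by time,
// are cut into sessions on a 30-minute gap; each page view gets a session id, a step
// number and an estimated stay time. The field layout of the output line is illustrative.
public class PageViewsModelSketch {

    static class Hit {
        long timeMillis;   // parsed time_local
        String url;        // request
        Hit(long t, String u) { timeMillis = t; url = u; }
    }

    static final long SESSION_TIMEOUT_MS = 30 * 60 * 1000L;

    static List<String> toPageViews(List<Hit> hitsOfOneVisitor, String ip) {
        List<Hit> hits = new ArrayList<Hit>(hitsOfOneVisitor);
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) { return Long.compare(a.timeMillis, b.timeMillis); }
        });

        List<String> out = new ArrayList<String>();
        String session = UUID.randomUUID().toString();
        int step = 1;
        for (int i = 0; i < hits.size(); i++) {
            Hit cur = hits.get(i);
            long stay = 60000L;              // default stay time for the last page of a session
            boolean lastOfSession = true;
            if (i + 1 < hits.size()) {
                long gap = hits.get(i + 1).timeMillis - cur.timeMillis;
                if (gap <= SESSION_TIMEOUT_MS) {
                    stay = gap;              // stay time = gap to the next hit in the same session
                    lastOfSession = false;
                }
            }
            out.add(session + "\001" + ip + "\001" + cur.url + "\001" + step + "\001" + stay / 1000);
            if (lastOfSession) {             // the next hit (if any) starts a new session
                session = UUID.randomUUID().toString();
                step = 1;
            } else {
                step++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit(0L, "/index.html"));
        hits.add(new Hit(120000L, "/blog/a.html"));             // 2 minutes later: same session
        hits.add(new Hit(3 * 60 * 60 * 1000L, "/blog/b.html")); // 3 hours later: new session
        for (String pv : toPageViews(hits, "58.215.204.118")) {
            System.out.println(pv);
        }
    }
}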
4.3.2 Clickstream visit model table
Note: one "visit" = N consecutive requests;
Deriving each visitor's per-visit information directly from the raw data with HQL is difficult, so a MapReduce program first derives the visit-level data from the raw data, and HQL is then used for further multi-dimensional statistics;
An MR program derives, from the pageviews data, the start/end time and entry/exit pages of each visit;
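A hedged sketch of this step: grouping the pageviews records by session in a reducer and keeping the earliest and latest hit yields the fields the click_stream_visit table in section 6 expects (inTime, outTime, inPage, outPage, referral, pageVisits). Only the reducer body is shown; the driver would be analogous to the preprocessing sketch in section 4.2. The positions of the fields inside the value are assumptions for illustration; the actual ClickStreamVisit class defines its own record layout.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: key = session id, values = that session's pageviews records,
// output = one visit record per session (inTime, outTime, inPage, outPage, referral, pageVisits).
public class ClickStreamVisitSketch extends Reducer<Text, Text, NullWritable, Text> {

    @Override
    protected void reduce(Text session, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Assumed value layout: time_local \001 request \001 http_referer
        List<String[]> pageViews = new ArrayList<String[]>();
        for (Text v : values) {
            pageViews.add(v.toString().split("\001"));
        }
        // Sort the session's pageviews chronologically (time_local assumed sortable as text).
        Collections.sort(pageViews, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) { return a[0].compareTo(b[0]); }
        });

        String[] first = pageViews.get(0);
        String[] last = pageViews.get(pageViews.size() - 1);
        String visit = session.toString() + "\001"
                + first[0] + "\001"          // inTime
                + last[0] + "\001"           // outTime
                + first[1] + "\001"          // inPage
                + last[1] + "\001"           // outPage
                + first[2] + "\001"          // referral of the entry page
                + pageViews.size();          // pageVisits
        context.write(NullWritable.get(), new Text(visit));
    }
}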
Method 1: run the jobs manually. At this point a small problem turned up; the finished programs can be executed by hand as follows:
hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /data/weblog/preprocess/input /data/weblog/preprocess/output
hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreValid /data/weblog/preprocess/input /data/weblog/preprocess/valid_output
hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.ClickStream /data/weblog/preprocess/output /data/weblog/preprocess/click_pv_out
hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.ClickStreamVisit /data/weblog/preprocess/click_pv_out /data/weblog/preprocess/click_visit_out
Method 2: schedule the jobs with Azkaban.
Next, start the Azkaban task scheduler:
[root@master flume]# cd /home/hadoop/azkabantools/server/
[root@master server]# nohup bin/azkaban-web-start.sh 1>/tmp/azstd.out 2>/tmp/azerr.out &
[root@master server]# jps
[root@master server]# cd ../executor/
[root@master executor]# bin/azkaban-executor-start.sh
Then log in to the Azkaban web UI at https://master:8443; the account and password are whatever you configured (mine are admin/admin).
Start the cluster first:
[root@master hadoop]# start-dfs.sh
[root@master hadoop]# start-yarn.sh
Create the input directory in advance (do not create the output directory, or the job will fail):
[root@master hadoop]# hadoop fs -mkdir -p /data/weblog/preprocess/input
Then upload the collected data into that input directory:
[root@master data_hadoop]# hadoop fs -put access.log /data/weblog/preprocess/input
I ran into a small problem with Azkaban here, so for now the data is processed manually. The problems just keep coming...
[root@master data_hadoop]# hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /data/weblog/preprocess/input /data/weblog/preprocess/output
Exception in thread "main" java.io.IOException: No FileSystem for scheme: C
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:498)
    at com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess.main(WeblogPreProcess.java:94)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The error above appeared during manual execution because the paths in my main method were still the hard-coded Windows-style paths used for local runs (shown commented out below); after switching back to the args-based paths and rebuilding the jar, the job runs.
From here on this post becomes meaningful; everything above was exploratory. The jobs here are still run manually in the most basic way; getting it working comes first.
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

//FileInputFormat.setInputPaths(job, new Path("c:/weblog/pageviews"));
//FileOutputFormat.setOutputPath(job, new Path("c:/weblog/visitout"));
The run output is as follows:
[root@master data_hadoop]# hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.WeblogPreProcess /data/weblog/preprocess/input /data/weblog/preprocess/output
17/12/17 14:37:29 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.199.130:8032
17/12/17 14:37:44 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/12/17 14:37:54 INFO input.FileInputFormat: Total input paths to process : 1
17/12/17 14:38:07 INFO mapreduce.JobSubmitter: number of splits:1
17/12/17 14:38:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1513489846377_0001
17/12/17 14:38:19 INFO impl.YarnClientImpl: Submitted application application_1513489846377_0001
17/12/17 14:38:19 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1513489846377_0001/
17/12/17 14:38:19 INFO mapreduce.Job: Running job: job_1513489846377_0001
17/12/17 14:39:51 INFO mapreduce.Job: Job job_1513489846377_0001 running in uber mode : false
17/12/17 14:39:51 INFO mapreduce.Job:  map 0% reduce 0%
17/12/17 14:40:16 INFO mapreduce.Job:  map 100% reduce 0%
17/12/17 14:40:29 INFO mapreduce.Job: Job job_1513489846377_0001 completed successfully
17/12/17 14:40:30 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=106127
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=3025880
        HDFS: Number of bytes written=2626565
        HDFS: Number of read operations=5
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=15389
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=15389
        Total vcore-milliseconds taken by all map tasks=15389
        Total megabyte-milliseconds taken by all map tasks=15758336
    Map-Reduce Framework
        Map input records=14619
        Map output records=14619
        Input split bytes=123
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=201
        CPU time spent (ms)=990
        Physical memory (bytes) snapshot=60375040
        Virtual memory (bytes) snapshot=364576768
        Total committed heap usage (bytes)=17260544
    File Input Format Counters
        Bytes Read=3025757
    File Output Format Counters
        Bytes Written=2626565
[root@master data_hadoop]#
Viewed in the browser, the output looks like this:
Shaping the clickstream model data:
Because many metrics are much easier to compute from the clickstream model, MR programs are used in the preprocessing stage to generate the clickstream model data.
[root@master data_hadoop]# hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.ClickStream /data/weblog/preprocess/output /data/weblog/preprocess/click_pv_out
17/12/17 14:47:33 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.199.130:8032
17/12/17 14:47:43 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/12/17 14:48:16 INFO input.FileInputFormat: Total input paths to process : 1
17/12/17 14:48:18 INFO mapreduce.JobSubmitter: number of splits:1
17/12/17 14:48:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1513489846377_0002
17/12/17 14:48:22 INFO impl.YarnClientImpl: Submitted application application_1513489846377_0002
17/12/17 14:48:22 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1513489846377_0002/
17/12/17 14:48:22 INFO mapreduce.Job: Running job: job_1513489846377_0002
17/12/17 14:48:44 INFO mapreduce.Job: Job job_1513489846377_0002 running in uber mode : false
17/12/17 14:48:45 INFO mapreduce.Job:  map 0% reduce 0%
17/12/17 14:48:58 INFO mapreduce.Job:  map 100% reduce 0%
17/12/17 14:49:39 INFO mapreduce.Job:  map 100% reduce 100%
17/12/17 14:49:42 INFO mapreduce.Job: Job job_1513489846377_0002 completed successfully
17/12/17 14:49:43 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=17187
        FILE: Number of bytes written=247953
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=2626691
        HDFS: Number of bytes written=18372
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=10414
        Total time spent by all reduces in occupied slots (ms)=38407
        Total time spent by all map tasks (ms)=10414
        Total time spent by all reduce tasks (ms)=38407
        Total vcore-milliseconds taken by all map tasks=10414
        Total vcore-milliseconds taken by all reduce tasks=38407
        Total megabyte-milliseconds taken by all map tasks=10663936
        Total megabyte-milliseconds taken by all reduce tasks=39328768
    Map-Reduce Framework
        Map input records=14619
        Map output records=76
        Map output bytes=16950
        Map output materialized bytes=17187
        Input split bytes=126
        Combine input records=0
        Combine output records=0
        Reduce input groups=53
        Reduce shuffle bytes=17187
        Reduce input records=76
        Reduce output records=76
        Spilled Records=152
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=327
        CPU time spent (ms)=1600
        Physical memory (bytes) snapshot=205991936
        Virtual memory (bytes) snapshot=730013696
        Total committed heap usage (bytes)=127045632
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=2626565
    File Output Format Counters
        Bytes Written=18372
[root@master data_hadoop]#
The execution result is as follows:
Clickstream visit model table:
Note: one "visit" = N consecutive requests.
Deriving each visitor's per-visit information directly from the raw data with HQL is difficult, so a MapReduce program first derives the visit-level data from the raw data, and HQL is then used for further multi-dimensional statistics.
An MR program derives, from the pageviews data, the start/end time and entry/exit pages of each visit:
[root@master data_hadoop]# hadoop jar webLogPreProcess.jar com.bie.dataStream.hive.mapReduce.pre.ClickStreamVisit /data/weblog/preprocess/click_pv_out /data/weblog/preprocess/click_visit_out
17/12/17 15:06:30 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.199.130:8032
17/12/17 15:06:32 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/12/17 15:06:33 INFO input.FileInputFormat: Total input paths to process : 1
17/12/17 15:06:33 INFO mapreduce.JobSubmitter: number of splits:1
17/12/17 15:06:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1513489846377_0003
17/12/17 15:06:35 INFO impl.YarnClientImpl: Submitted application application_1513489846377_0003
17/12/17 15:06:35 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1513489846377_0003/
17/12/17 15:06:35 INFO mapreduce.Job: Running job: job_1513489846377_0003
17/12/17 15:06:47 INFO mapreduce.Job: Job job_1513489846377_0003 running in uber mode : false
17/12/17 15:06:47 INFO mapreduce.Job:  map 0% reduce 0%
17/12/17 15:07:44 INFO mapreduce.Job:  map 100% reduce 0%
17/12/17 15:08:15 INFO mapreduce.Job:  map 100% reduce 100%
17/12/17 15:08:18 INFO mapreduce.Job: Job job_1513489846377_0003 completed successfully
17/12/17 15:08:18 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=6
        FILE: Number of bytes written=213705
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=18504
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=55701
        Total time spent by all reduces in occupied slots (ms)=22157
        Total time spent by all map tasks (ms)=55701
        Total time spent by all reduce tasks (ms)=22157
        Total vcore-milliseconds taken by all map tasks=55701
        Total vcore-milliseconds taken by all reduce tasks=22157
        Total megabyte-milliseconds taken by all map tasks=57037824
        Total megabyte-milliseconds taken by all reduce tasks=22688768
    Map-Reduce Framework
        Map input records=76
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=6
        Input split bytes=132
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=6
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=325
        CPU time spent (ms)=1310
        Physical memory (bytes) snapshot=203296768
        Virtual memory (bytes) snapshot=730161152
        Total committed heap usage (bytes)=126246912
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=18372
    File Output Format Counters
        Bytes Written=0
[root@master data_hadoop]#
The run result is as follows:
5: Module development - data warehouse design (note: a star schema is used; for data warehouse concepts and the difference between star and snowflake schemas, consult other references.)
A star schema consists of fact tables and dimension tables. The fact tables are defined below; the dimension tables are omitted here.
Raw data table: t_origin_weblog

| Field | Type | Description |
| valid | string | whether the record is valid |
| remote_addr | string | visitor IP |
| remote_user | string | visitor user info |
| time_local | string | request time |
| request | string | requested URL |
| status | string | response code |
| body_bytes_sent | string | response bytes |
| http_referer | string | referrer URL |
| http_user_agent | string | visitor user agent |
ETL intermediate table: t_etl_referurl

| Field | Type | Description |
| valid | string | whether the record is valid |
| remote_addr | string | visitor IP |
| remote_user | string | visitor user info |
| time_local | string | request time |
| request | string | requested URL |
| request_host | string | requested host (domain) |
| status | string | response code |
| body_bytes_sent | string | response bytes |
| http_referer | string | referrer URL |
| http_user_agent | string | visitor user agent |

Second field set under the same heading, with the referrer URL parsed into components:

| Field | Type | Description |
| valid | string | whether the record is valid |
| remote_addr | string | visitor IP |
| remote_user | string | visitor user info |
| time_local | string | request time |
| request | string | requested URL |
| status | string | response code |
| body_bytes_sent | string | response bytes |
| http_referer | string | external referrer URL |
| http_user_agent | string | visitor user agent |
| host | string | referrer URL host |
| path | string | referrer URL path |
| query | string | referrer URL query string |
| query_id | string | referrer URL query value |
Access log detail wide table: t_ods_access_detail

| Field | Type | Description |
| valid | string | whether the record is valid |
| remote_addr | string | visitor IP |
| remote_user | string | visitor user info |
| time_local | string | request time |
| request | string | full requested URL |
| request_level1 | string | level-1 section of the request |
| request_level2 | string | level-2 section of the request |
| request_level3 | string | level-3 section of the request |
| status | string | response code |
| body_bytes_sent | string | response bytes |
| http_referer | string | referrer URL |
| http_user_agent | string | visitor user agent |

Second field set under the same heading, with user agent and referrer parsed and with time and partition fields added:

| Field | Type | Description |
| valid | string | whether the record is valid |
| remote_addr | string | visitor IP |
| remote_user | string | visitor user info |
| time_local | string | request time |
| request | string | requested URL |
| status | string | response code |
| body_bytes_sent | string | response bytes |
| http_referer | string | external referrer URL |
| http_user_agent | string | full visitor user agent string |
| http_user_agent_browser | string | visitor browser |
| http_user_agent_sys | string | visitor operating system |
| http_user_agent_dev | string | visitor device |
| host | string | referrer URL host |
| path | string | referrer URL path |
| query | string | referrer URL query string |
| query_id | string | referrer URL query value |
| daystr | string | full date string |
| tmstr | string | full time string |
| month | string | month |
| day | string | day |
| hour | string | hour |
| minute | string | minute |
| mm | string | partition field: month |
| dd | string | partition field: day |
6: Module development - ETL
The data analysis for this project runs on the Hadoop cluster and mainly uses the Hive data warehouse tool, so the collected and preprocessed data must be loaded into the Hive warehouse for the subsequent mining and analysis.
6.1: Create the raw data tables:
-- Create the source-layer table ods_weblog_origin in the Hive warehouse.
Create the Hive database and tables as follows:
[root@master soft]# cd apache-hive-1.2.1-bin/
[root@master apache-hive-1.2.1-bin]# ls
[root@master apache-hive-1.2.1-bin]# cd bin/
[root@master bin]# ls
[root@master bin]# ./hive
hive> show databases;
hive> create database webLog;
hive> show databases;
# Partitioned by date
hive> create table ods_weblog_origin(valid string,remote_addr string,remote_user string,time_local string,request string,status string,body_bytes_sent string,http_referer string,http_user_agent string)
    > partitioned by (datestr string)
    > row format delimited
    > fields terminated by '\001';
hive> show tables;
hive> desc ods_weblog_origin;
# Clickstream pageviews model table: ods_click_pageviews
hive> create table ods_click_pageviews(
    > Session string,
    > remote_addr string,
    > remote_user string,
    > time_local string,
    > request string,
    > visit_step string,
    > page_staylong string,
    > http_referer string,
    > http_user_agent string,
    > body_bytes_sent string,
    > status string)
    > partitioned by (datestr string)
    > row format delimited
    > fields terminated by '\001';
hive> show tables;
# Clickstream visit model table: click_stream_visit
hive> create table click_stream_visit(
    > session string,
    > remote_addr string,
    > inTime string,
    > outTime string,
    > inPage string,
    > outPage string,
    > referal string,
    > pageVisits int)
    > partitioned by (datestr string);
hive> show tables;
6.2: Load the data, as follows:
1: Load the cleaned data into the source table ods_weblog_origin:

hive> load data inpath '/data/weblog/preprocess/output/part-m-00000' overwrite into table ods_weblog_origin partition(datestr='2017-12-17');

hive> show partitions ods_weblog_origin;
hive> select count(*) from ods_weblog_origin;
hive> select * from ods_weblog_origin;

2: Load the clickstream pageviews model data into ods_click_pageviews:

hive> load data inpath '/data/weblog/preprocess/click_pv_out/part-r-00000' overwrite into table ods_click_pageviews partition(datestr='2017-12-17');

hive> select count(1) from ods_click_pageviews;

3: Load the clickstream visit model data into click_stream_visit:

hive> load data inpath '/data/weblog/preprocess/click_visit_out/part-r-00000' overwrite into table click_stream_visit partition(datestr='2017-12-17');

hive> select count(1) from click_stream_visit;
To be continued...