Environment

IP            | Remarks
192.168.0.194 | Hadoop-Data01
192.168.0.195 | Hadoop-Data02
192.168.0.196 | Hadoop-Data03
Software versions:
CentOS release 6.6 (Final), jdk-8u131-linux-x64, Hadoop-2.7.3, Hive-2.1.1, apache-flume-1.7.0-bin
Download JDK, Hadoop, Hive, and Flume:
[root@Hadoop-Data01 soft]# wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
[root@Hadoop-Data01 soft]# wget http://apache.fayea.com/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
[root@Hadoop-Data01 soft]# wget http://apache.fayea.com/hive/hive-2.1.1/apache-hive-2.1.1-bin.tar.gz
Hadoop Deployment
Set the hostnames and edit the /etc/hosts file so that name resolution works correctly on every host:
[root@Hadoop-Data01 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.194 Hadoop-Data01
192.168.0.195 Hadoop-Data02
192.168.0.196 Hadoop-Data03
Note: the slave servers use the same /etc/hosts content.
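Before continuing, it is worth confirming that every node resolves the others the same way (a quick check, not part of the original write-up):
for h in Hadoop-Data01 Hadoop-Data02 Hadoop-Data03; do
    # getent goes through the same resolver path the Hadoop daemons will use
    getent hosts "$h" || echo "WARN: $h does not resolve"
done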
Configure passwordless (key-based) SSH login between the Hadoop-Master and Hadoop-Slave hosts:
[root@Hadoop-Data01 ~]# vim /etc/ssh/sshd_config
>>>>>
RSAAuthentication yes
PubkeyAuthentication yes
Note: this can also be done with sed: sed -i '47,48s/^#//g' /etc/ssh/sshd_config
[root@Hadoop-Data01 ~]# ssh-keygen -t rsa
[root@Hadoop-Data01 .ssh]# cat authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA2JGjCEwc+H3/5Y939DHSkhHYAO7qPjO86gyaqvlN2j1ZMUhdKhXUmTH0pBBwXIqp9jooTXxtIu55cuBvOeBD6eUKN5mH9rydRIXm8HEvb9nQzOvVghP1E9lBTGsGXkUWDo0KPkFYOhb2NguYibzVUgpUpAt0NY5iqdenXNqvDOWGhWqDsg/C6VnUzsxskiT9x2EROhddWQnYsObXxjOasgdGPngzZsJZPchRboS+HfvVF0uSyUjljtKsQqYOX2Nt0plO4t6VlcnZXvjDXKezJCNwGToFvvoiIHnjVu/akgtv/bpd8HZp1dZEj7cYnSFkqN5xdodg7TmtjAjobutU5Q== root@Hadoop-Data01
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAvQ3JZOtdfFvrsM/m6YwQQuGkOCpNt0+tw87tS4p1gB98ZAn+zaUnFMw5Gvo0i1KvHVaxmb0s1gqDjGDNVLQM5MB60emyVFHLs6DZBI5f4c0BiA17KfDRzlsfuTmuLdymmoj54OhPbEcH+mwo/N1UK9V0gqxAB9abC6UFT00MXXXJN1+qBkV9mUuFbXhn4m5/DCoEbIxvMlWghAsSrDtMaMtJYRumRvd7MLwwefdCYyQd8dZASE1Z8VP0K/BDRntWXCeKGCVMb4uJAnSdhN6ZcRme/Qlx0YCkPpQir3jgcblVW5RODNUyaIc+vUMp9UYagvK7nKKfWAGa/MPdyfu2nw== root@Hadoop-Data02
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA1pC5Py1aqbojVetakak3WmxJf4DgmTe1ci60tn9Hyq84kdAhw7z1lAQN544uPDDvl4XPki36Y13Hjl0P+S3g11iOi42FRugkBDmokqADZrUfp5tqWX8K9QvYMePoyiuQlnrGAyCpOiMmEAykBR6lVkNHgPAWThjU9eggt6dalMPiy/dDKZNemlWGHy8wdS5PyjVsIuDGgTtNLADn6OOaYcO/UWq78gqc1Nkq4mNxKSTYorh7taki9SKw4cq0NeggDFz7cZEewtgJdRla0W2ZKz8bgfuUSSntbN55/uCVUSgK+kurqRmklQ3sA3c9687BH1Lse5luDFJRaYo2wa5nlQ== root@Hadoop-Data03
Note: merge the /root/.ssh/id_rsa.pub files from all three servers into authorized_keys.
[root@Hadoop-Data01 .ssh]# scp authorized_keys root@192.168.0.195:/root/.ssh/
[root@Hadoop-Data01 .ssh]# scp authorized_keys root@192.168.0.196:/root/.ssh/
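A quick loop confirms that key-based login works toward every node before Hadoop relies on it (a minimal sketch using the IPs above):
for ip in 192.168.0.194 192.168.0.195 192.168.0.196; do
    # BatchMode makes ssh fail instead of prompting if key authentication is broken
    ssh -o BatchMode=yes -o ConnectTimeout=5 root@"$ip" hostname || echo "WARN: passwordless SSH to $ip failed"
done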
Install the JDK on every host
[root@Hadoop-Data01 soft]# tar -xf jdk-8u131-linux-x64.tar.gz
[root@Hadoop-Data01 soft]# \cp -r jdk1.8.0_131 /usr/local/
[root@Hadoop-Data01 soft]# cd /usr/local/
[root@Hadoop-Data01 local]# ln -s jdk1.8.0_131 jdk
[root@Hadoop-Data01 ~]# vim /etc/profile
>>>>>
ulimit -n 10240
export JAVA_HOME=/usr/local/jdk
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
[root@Hadoop-Data01 ~]# source /etc/profile
[root@Hadoop-Data03 ~]# java -version
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
Install Hadoop
[root@Hadoop-Data01 soft]# tar -xf hadoop-2.7.3.tar.gz
[root@Hadoop-Data01 soft]# mv hadoop-2.7.3 /usr/local/
[root@Hadoop-Data01 soft]# cd /usr/local/
[root@Hadoop-Data01 local]# ln -s hadoop-2.7.3 hadoop
The /usr/local/hadoop/etc/hadoop/core-site.xml configuration file:
[root@Hadoop-Data01 hadoop]# vim core-site.xml
>>>>>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.0.194:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>1024</value>
    </property>
</configuration>
Notes:
<fs.defaultFS>: the name of the default filesystem, in URI form. The URI scheme selects the filesystem implementation class (fs.SCHEME.impl), and the authority part carries the host, port, and so on; the default is the local filesystem. In an HA setup a service name is used here instead. Here it is set to hdfs://192.168.0.194:9000; HDFS clients need this parameter to reach HDFS.
<hadoop.tmp.dir>: Hadoop's temporary directory (a local path), from which other directories are derived. Only one value can be set; point it at a location with enough space rather than the default /tmp. It is a server-side parameter, and changing it requires a restart.
<io.file.buffer.size>: the buffer size used when reading and writing files; it should be a multiple of the memory page size (1 MB is recommended).
----------
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml configuration file:
[root@Hadoop-Data01 hadoop]# vim hdfs-site.xml
>>>>>
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.0.194:9001</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>
Notes:
<dfs.namenode.name.dir>: local directory where the NameNode stores its fsimage files. It may be a comma-separated list of directories, in which case the fsimage is written to all of them for redundancy; ideally the directories sit on different disks, and if one disk fails the system skips it rather than failing. With HA in use one directory is usually enough; set two if you are particularly concerned about safety.
<dfs.datanode.data.dir>: local directories where HDFS stores blocks. It may be a comma-separated list (typically one directory per disk); the directories are used round-robin, one block to the first directory, the next block to the next, and so on. Each block is stored only once on a given machine. Non-existent directories are ignored, so the directories must be created beforehand.
<dfs.replication>: the number of block replicas. It can be set when a file is created, chosen by the client, or changed later on the command line; different files may have different replication factors, and this value is the default used when none is specified.
<dfs.namenode.secondary.http-address>: the HTTP address of the SecondaryNameNode; if the port is 0 the service picks a free port at random. Once HA is in use the SecondaryNameNode is no longer needed.
<dfs.webhdfs.enabled>: enables WebHDFS (the REST API) on the NameNode and DataNodes.
----------
The /usr/local/hadoop/etc/hadoop/mapred-site.xml configuration file:
[root@Hadoop-Data01 hadoop]# cp mapred-site.xml.template mapred-site.xml
[root@Hadoop-Data01 hadoop]# vim mapred-site.xml
>>>>>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>192.168.0.194:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>192.168.0.194:19888</value>
    </property>
</configuration>
The /usr/local/hadoop/etc/hadoop/yarn-site.xml configuration file:
[root@Hadoop-Data01 hadoop]# vim yarn-site.xml
>>>>>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>192.168.0.194:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>192.168.0.194:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>192.168.0.194:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>192.168.0.194:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>192.168.0.194:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
    </property>
</configuration>
Notes:
<mapreduce.framework.name>: depending on job size and configuration, MapReduce offers two execution modes. ① Local mode (LocalJobRunner): with mapreduce.framework.name set to local, jobs run on the local node without allocating resources from the YARN cluster, so they cannot take advantage of the cluster and do not show up in the web UI. ② YARN mode (YARNRunner): with it set to yarn, the client talks to the server through YARNRunner, which in turn communicates with the ResourceManager over ClientRMProtocol to submit applications, query status, and so on.
<mapreduce.jobhistory.address> and <mapreduce.jobhistory.webapp.address>: Hadoop ships with a history server through which completed MapReduce jobs can be reviewed: how many map and reduce tasks they used, when they were submitted, started, and finished, and so on.
----------
Configure the Hadoop environment variables:
[root@Hadoop-Data01 hadoop]# vim /etc/profile
>>>>>
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
[root@Hadoop-Data01 hadoop]# vim hadoop-env.sh
>>>>>
export JAVA_HOME=/usr/local/jdk
Add the slave node IP to the slaves file:
[root@Hadoop-Data01 hadoop]# echo 192.168.0.195 > slaves
Copy the Hadoop installation to the slave host:
[root@Hadoop-Data01 local]# \scp -r hadoop-2.7.3 root@192.168.0.195:/usr/local/
Go into the Hadoop directory and start the services on the Hadoop-Master host:
① Format the NameNode:
[root@Hadoop-Data01 bin]# sh /usr/local/hadoop/bin/hdfs namenode -format
② Start the services:
[root@Hadoop-Data01 sbin]# sh /usr/local/hadoop/sbin/start-all.sh
③ Stop the services:
[root@Hadoop-Data01 sbin]# sh /usr/local/hadoop/sbin/stop-all.sh
④ Check the running components:
[root@Hadoop-Data01 sbin]# jps
6517 SecondaryNameNode
6326 NameNode
6682 ResourceManager
6958 Jps
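With the daemons up, a quick smoke test confirms that HDFS accepts reads and writes before Hive and Flume are layered on top (a minimal sketch; the /tmp/smoke path is just an example):
[root@Hadoop-Data01 ~]# hdfs dfsadmin -report                        # should list the live DataNode(s)
[root@Hadoop-Data01 ~]# echo "hello hdfs" > /tmp/hello.txt
[root@Hadoop-Data01 ~]# hdfs dfs -mkdir -p /tmp/smoke
[root@Hadoop-Data01 ~]# hdfs dfs -put /tmp/hello.txt /tmp/smoke/
[root@Hadoop-Data01 ~]# hdfs dfs -cat /tmp/smoke/hello.txt           # should print: hello hdfs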
Verify web access
Open in a browser: http://192.168.0.194:8088/ (the YARN ResourceManager web UI)
Open in a browser: http://192.168.0.194:50070/ (the HDFS NameNode web UI)
Deploy Hive
Unpack, deploy, and configure the environment variables:
[root@Hadoop-Data01 soft]# tar -xf apache-hive-2.1.1-bin.tar.gz
[root@Hadoop-Data01 soft]# mv apache-hive-2.1.1-bin /usr/local/
[root@Hadoop-Data01 soft]# cd /usr/local/
[root@Hadoop-Data01 local]# ln -s apache-hive-2.1.1-bin hive
[root@Hadoop-Data01 conf]# cp hive-env.sh.template hive-env.sh
[root@Hadoop-Data01 conf]# vim hive-env.sh
>>>>>
HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf
export HIVE_AUX_JARS_PATH=/usr/local/hive/lib
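The hive and schematool commands are invoked by name later on (the schematool output below shows /usr/local/hive/bin on the PATH), so presumably HIVE_HOME was also exported; a minimal sketch of that /etc/profile addition, assuming it was done the same way as for Hadoop:
export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH
Reload with source /etc/profile afterwards.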
Install and set up the MySQL environment
[root@Hadoop-Data01 conf]# yum install httpd php mysql mysql-server php-mysql -y
[root@Hadoop-Data01 conf]# /usr/bin/mysqladmin -u root password 'hadoopmysql'
[root@Hadoop-Data01 conf]# /usr/bin/mysqladmin -u root -h192.168.0.194 password 'hadoopmysql'
[root@Hadoop-Data01 conf]# mysql -uroot -phadoopmysql
mysql> create user 'hive' identified by 'hive';
mysql> grant all privileges on *.* to 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
mysql> grant all privileges on *.* to 'hive'@'%' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
mysql> create database hive;
Query OK, 1 row affected (0.00 sec)
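Before wiring it into Hive, it is worth confirming that the hive account can actually log in with the password just granted (a quick check, not in the original transcript):
[root@Hadoop-Data01 conf]# mysql -uhive -phive -e "show databases;"    # the hive database should appear in the list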
Edit the Hive configuration file:
[root@Hadoop-Data01 conf]# vim hive-site.xml
Line 44:
>>>>>
<name>hive.exec.local.scratchdir</name>
<value>/usr/local/hive/iotmp</value>
Bulk replace (vim): :%s/${system:java.io.tmpdir}/\/usr\/local\/hive\/iotmp/g
Line 486:
>>>>>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
Line 501:
>>>>>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
Line 686:
>>>>>
<name>hive.metastore.schema.verification</name>
<value>false</value>
Line 933:
>>>>>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
Line 957:
>>>>>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
Copy the JDBC driver into the lib directory:
[root@Hadoop-Data01 mysql-connector-java-5.1.42]# cp mysql-connector-java-5.1.42-bin.jar /usr/local/hive/lib/
Minimal hive-site.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <!-- database connection URL -->
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <!-- JDBC driver -->
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <!-- database user -->
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <!-- database password -->
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
    <property>
        <!-- Hive data warehouse directory; defaults to /user/hive/warehouse on HDFS -->
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <!-- Hive scratch (temporary) directory; defaults to /tmp/hive on HDFS -->
        <name>hive.exec.scratchdir</name>
        <value>/tmp/hive</value>
    </property>
</configuration>
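hive.exec.local.scratchdir and the other ${system:java.io.tmpdir} occurrences now point at /usr/local/hive/iotmp, which does not exist yet; creating it up front avoids scratch-directory errors at startup (an assumed step, not shown in the original transcript):
[root@Hadoop-Data01 conf]# mkdir -p /usr/local/hive/iotmp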
Initialize the MySQL metastore
[root@Hadoop-Data01 bin]# schematool -initSchema -dbType mysql    # after initialization, the hive schema appears in MySQL
which: no hbase in (/usr/local/hive/bin:/usr/local/hive/conf:/usr/local/hadoop/bin:/usr/local/jdk//bin:/usr/local/jdk//jre/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:     jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User:    hive
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.mysql.sql
Initialization script completed
schemaTool completed
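A quick look at MySQL confirms that the metastore schema was actually created (a minimal check; the exact table count depends on the schema version):
[root@Hadoop-Data01 bin]# mysql -uhive -phive hive -e "show tables;"    # should list the metastore tables (DBS, TBLS, SDS, ...)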
Start Hive
[root@Hadoop-Data01 bin]# ./hive
Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive>
hive> show functions;         -- list Hive's built-in functions
hive> desc function day;      -- show details for the day function
OK
day(param) - Returns the day of the month of date/timestamp, or day component of interval
Time taken: 0.039 seconds, Fetched: 1 row(s)
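Flume will later write events under /user/hive/warehouse/%{table_name}/inputdate=%Y-%m-%d, so a matching partitioned table is needed on the Hive side. A hedged sketch, assuming a hypothetical table named app_log that stores each raw log line as a single string column:
CREATE TABLE IF NOT EXISTS app_log (
  line STRING                        -- one raw log line per row
)
PARTITIONED BY (inputdate STRING)    -- matches the inputdate=%Y-%m-%d directories written by the Flume HDFS sink
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;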
Deploy Flume
1. Overview
- Flume is a distributed log-collection system: it gathers data and delivers it to a destination.
- Flume's core concept is the agent, a Java process that runs on the log-collection node.
- An agent contains three core components: source, channel, and sink. The source collects log data and can handle many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources. Data collected by the source is staged temporarily in the channel. The channel is the agent's temporary store; it can be backed by memory, jdbc, file, or a custom implementation, and data is removed from it only after the sink has delivered it successfully. The sink sends data on to its destination, which can be hdfs, logger, avro, thrift, ipc, file, null, hbase, solr, or a custom sink.
- What flows through the pipeline is the event; transactional guarantees are made at the event level.
- Flume supports multi-level (chained) agents as well as fan-in and fan-out topologies. A minimal single-agent sketch follows this list.
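To make the source / channel / sink wiring concrete before the real configurations below, here is a minimal single-agent sketch (not part of the original deployment; it uses only the built-in netcat source and logger sink):
# example.conf: one agent (a1) with one source, one channel, one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1
# netcat source: turns every line received on a TCP port into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 44444
# memory channel: buffers events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
# logger sink: prints events to the Flume log, handy for testing
a1.sinks.k1.type = logger
# wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run it with flume-ng agent -c conf -f example.conf -n a1 -Dflume.root.logger=INFO,console, then type lines into nc 127.0.0.1 44444 and they appear as events on the console.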
2. Installation
Unpack the Flume archive and move it to /usr/local/ (installed on the Hadoop server):
[root@Hadoop-Data01 soft]# \cp -r apache-flume-1.7.0-bin /usr/local/
[root@Hadoop-Data01 soft]# cd /usr/local/
[root@Hadoop-Data01 local]# ln -s apache-flume-1.7.0-bin flume
[root@Hadoop-Data01 conf]# cp flume-env.sh.template flume-env.sh
[root@Hadoop-Data01 conf]# vim flume-env.sh
>>>>>
export JAVA_HOME=/usr/local/jdk
[root@Hadoop-Data01 bin]# ./flume-ng version
Flume 1.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 511d868555dd4d16e6ce4fedc72c2d1454546707
Compiled by bessbd on Wed Oct 12 20:51:10 CEST 2016
From source with checksum 0d21b3ffdc55a07e1d08875872c00523
Download Flume onto the server to be collected from (a Windows machine in this case), then configure conf/flume-conf.properties:
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
# collect the files in this directory
a1.sources.r1.spoolDir = D:\\flume\\log
a1.sources.r1.fileHeader = true
a1.sources.r1.basenameHeader = true
a1.sources.r1.basenameHeaderKey = fileName
a1.sources.r1.ignorePattern = ^(.)*\\.tmp$
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# avro sink pointing at the receiving agent's address
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.0.194
a1.sinks.k1.port = 19949

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.keep-alive = 30

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the collector agent on the Windows side:
D:\apache-flume-1.7.0-bin\bin>flume-ng.cmd agent --conf ..\conf --conf-file ..\conf\flume-conf.properties --name a1
Configure the Linux-side agent in conf/flume-conf.properties:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

# avro source: the address on which this receiving agent listens
tier1.sources.source1.type = avro
tier1.sources.source1.bind = 192.168.0.194
tier1.sources.source1.port = 19949
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.channels.channel1.keep-alive = 30

tier1.sinks.sink1.channel = channel1
tier1.sources.source1.interceptors = e1 e2
tier1.sources.source1.interceptors.e1.type = com.huawei.flume.InterceptorsCommons$Builder
tier1.sources.source1.interceptors.e2.type = com.huawei.flume.InterceptorsFlows$Builder

# HDFS sink on the receiving agent; %{table_name} is the Hive table name set by the interceptors above
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = hdfs://192.168.0.194:9000/user/hive/warehouse/%{table_name}/inputdate=%Y-%m-%d
tier1.sinks.sink1.hdfs.writeFormat = Text
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.hdfs.fileSuffix = .log
tier1.sinks.sink1.hdfs.rollInterval = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.useLocalTimeStamp = true
tier1.sinks.sink1.hdfs.idleTimeout = 60
tier1.sinks.sink1.hdfs.rollSize = 125829120
tier1.sinks.sink1.hdfs.minBlockReplicas = 1
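The %{table_name} header in the HDFS sink path is filled in by the custom interceptors above (com.huawei.flume.InterceptorsCommons / InterceptorsFlows), which are not part of stock Flume. If those jars are not available, one hedged alternative is Flume's built-in static interceptor, which pins the header to a fixed value (app_log is the hypothetical table name used in the Hive example earlier):
# replace the custom interceptors with a static one that sets the table_name header
tier1.sources.source1.interceptors = e1
tier1.sources.source1.interceptors.e1.type = static
tier1.sources.source1.interceptors.e1.key = table_name
tier1.sources.source1.interceptors.e1.value = app_log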
Start the agent service on the Linux side:
[root@Hadoop-Data01 conf]# flume-ng agent -c /usr/local/flume/conf/ -f /usr/local/flume/conf/flume-conf.properties -n tier1 -Dflume.root.logger=DEBUG,console
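Once both agents are running and files are dropped into D:\flume\log, events should land under the Hive warehouse path. A hedged end-to-end check, assuming the hypothetical app_log table from the Hive example (partitions written directly to HDFS by Flume still have to be registered in the metastore):
[root@Hadoop-Data01 ~]# hdfs dfs -ls /user/hive/warehouse/app_log/
# expect partition directories such as inputdate=2017-06-01 (example date)
[root@Hadoop-Data01 ~]# hive -e "MSCK REPAIR TABLE app_log;"
# registers the partitions Flume created, after which they can be queried:
[root@Hadoop-Data01 ~]# hive -e "SELECT COUNT(*) FROM app_log WHERE inputdate = '2017-06-01';"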