CentOS 7 + Hadoop 2.7.2 (HA + Federation) + Hive 1.2.1 + Spark 2.1.0 Fully Distributed Cluster Installation


1  VM network configuration
2  CentOS configuration
2.1  Download address
2.2  Activating the network card
2.3  SecureCRT
2.4  Changing the hostname
2.5  yum proxy access
2.6  Installing ifconfig
2.7  Installing wget and its proxy
2.8  Installing VMware Tools
2.9  Miscellaneous
2.9.1  Problems
2.9.2  Settings
2.9.2.1  Removing the boot wait time
2.9.2.2  VM tuning
2.9.3  Commands
2.9.3.1  Shutdown and reboot
2.9.3.2  Stopping and disabling services
2.9.3.3  Finding large files and directories
2.9.3.4  Checking disk usage
2.9.3.5  Checking memory usage
3  Installing the JDK
4  Cloning the virtual machines
5  SSH passwordless login
5.1  How ordinary ssh works (password required)
5.2  How passwordless login works
5.3  Setting up passwordless SSH
6  HA + Federation server plan
7  zookeeper
7.1  Super-user access
7.2  Problems
8  Hadoop
8.1  hadoop-env.sh
8.2  hdfs-site.xml
8.3  core-site.xml
8.4  slaves
8.5  yarn-env.sh
8.6  mapred-site.xml
8.7  yarn-site.xml
8.8  Copying and adjusting
8.9  Starting ZK
8.10  Formatting zkfc
8.11  Starting the journalnodes
8.12  Formatting and starting the namenodes
8.13  Starting zkfc
8.14  Starting the datanodes
8.15  HDFS verification
8.16  HA verification
8.16.1  Manual failover
8.17  Starting yarn
8.18  MapReduce test
8.19  Scripts
8.19.1  Start and stop scripts
8.19.2  Reboot and shutdown
8.20  Eclipse plugin
8.20.1  Plugin installation
8.20.2  WordCount project
8.20.2.1  WordCount.java
8.20.2.2  yarn-default.xml
8.20.2.3  build.xml
8.20.2.4  log4j.properties
8.20.3  Packaging and running
8.20.4  Access permissions
8.21  Killing a job
8.22  Logs
8.22.1  Hadoop service logs
8.22.2  MapReduce logs
8.22.3  System.out
8.22.4  log4j
9  MySQL
10  HIVE installation
10.1  Three installation modes
10.2  Remote-mode installation
11  Scala installation
12  Spark installation
12.1  Tests
12.2  Hive startup problem
13  Cleanup and compression
14  Common hadoop 2.x ports
15  Linux commands
16  hadoop filesystem commands

 


 

 

This document records the installation of a Hadoop + Hive + Spark cluster, with HA (high availability) configured for both the NameNode and the ResourceManager, and with the NameNode scaled out horizontally via HDFS Federation.

 

1  VM network configuration

Set the subnet IP to 192.168.1.0.

Set the gateway to 192.168.1.2.

Disable DHCP.



After the configuration above, the IP of virtual adapter VMnet8 becomes 192.168.1.1.

It does not matter that the virtual machines and the physical host are not on the same subnet.

2  CentOS configuration

2.1  Download address

http://mirrors.neusoft.edu.cn/centos/7/isos/x86_64/CentOS-7-x86_64-Minimal-1511.iso

Download the minimal install edition (no desktop).

2.2  Activating the network card

Activate the network card and set the relevant IP addresses.

Set the gateway and DNS to the gateway configured for VMnet8 above.

2.3  SecureCRT

Once the network card is active, you can use a SecureCRT terminal to connect to the Linux hosts remotely, which makes the remaining steps easier. How to connect is omitted here;

after connecting, simply apply the following settings:

 

2.4  Changing the hostname

/etc/sysconfig/network

 

/etc/hostname

 

/etc/hosts

192.168.1.11   node1

192.168.1.12   node2

192.168.1.13   node3

192.168.1.14   node4
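
The edits themselves appear only as screenshots in the original; a minimal sketch of what they amount to, shown for node1 and using hostnamectl as the CentOS 7 shortcut (an assumption about what the screenshots contained):

[root@node1 ~]# hostnamectl set-hostname node1        # writes /etc/hostname
[root@node1 ~]# cat /etc/hostname
node1

Repeat on node2-node4 with their own names, and make sure the four /etc/hosts entries above are present on every node.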

 

2.5  yum proxy access

Because the corporate network goes out through a proxy, yum cannot reach the internet to search for packages.

Configure the yum proxy: vi /etc/yum.conf

 
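The /etc/yum.conf change is shown only as a screenshot; a minimal sketch of the line it adds, assuming the same proxy host that is used for wget below (10.19.110.55:8080):

# /etc/yum.conf -- add under the [main] section
proxy=http://10.19.110.55:8080
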

Run yum again, and it can now reach the internet and search for packages:

 

2.6  Installing ifconfig
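
The install command is shown only as a screenshot; on a CentOS 7 minimal install, ifconfig comes from the net-tools package, so the command was presumably:

[root@node1 ~]# yum -y install net-tools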

2.7  Installing wget and its proxy



After wget is installed, its configuration file wgetrc appears under /etc, where the wget proxy can be configured:

[root@node1 ~]# vi /etc/wgetrc

http_proxy = http://10.19.110.55:8080

https_proxy = http://10.19.110.55:8080

ftp_proxy = http://10.19.110.55:8080

2.8  Installing VMware Tools

Install VMware Tools so that the virtual machine clock stays in sync with the host.

 

[root@node1 opt]# yum -y install perl

[root@node1 ~]# mount /dev/cdrom /mnt

[root@node1 ~]# tar -zxvf /mnt/VMwareTools-9.6.1-1378637.tar.gz -C /root

[root@node1 ~]# umount /dev/cdrom

[root@node1 ~]# /root/vmware-tools-distrib/vmware-install.pl

[root@node1 ~]# rm -rf /root/vmware-tools-distrib

Note: do not install the file-sharing and mouse drag-and-drop features below, otherwise the installation runs into problems:

[root@node1 ~]# chkconfig --list | grep vmware

vmware-tools    0:    1:    2:    3:    4:    5:    6:

vmware-tools-thinprint  0:    1:    2:    3:    4:    5:    6:

[root@node1 ~]# chkconfig vmware-tools-thinprint off

[root@node1 ~]# find / -name *vmware-tools-thinprint* | xargs rm -rf

 

2.9  Miscellaneous

2.9.1  Problems

The following error message appears right after booting:

Editing the virtual machine configuration file node1.vmx fixes it:

vcpu.hotadd = "FALSE"

mem.hotadd = "FALSE"

 

2.9.2  Settings

2.9.2.1  Removing the boot wait time

[root@node1 ~]# vim /etc/default/grub

GRUB_TIMEOUT=0                                               # default is 5

 

[root@node1 ~]# grub2-mkconfig -o /boot/grub2/grub.cfg

2.9.2.2  VM tuning

Note: disable this when the VM has little memory.



Edit the node1.vmx file:

mainMem.useNamedFile = "FALSE"

 

 

To get a full-screen display and make command-line input easier, make the following adjustments:

and turn off the status bar:

2.9.3  Commands

2.9.3.1  Shutdown and reboot

[root@node1 ~]# reboot

[root@node1 ~]# shutdown -h now

2.9.3.2  Stopping and disabling services

# List the services enabled at boot

[root@node1 ~]# systemctl list-unit-files | grep enabled | sort

auditd.service                               enabled

crond.service                               enabled

dbus-org.freedesktop.NetworkManager.service enabled

dbus-org.freedesktop.nm-dispatcher.service  enabled

default.target                              enabled

dm-event.socket                             enabled

getty@.service                              enabled

irqbalance.service                          enabled

lvm2-lvmetad.socket                         enabled

lvm2-lvmpolld.socket                        enabled

lvm2-monitor.service                        enabled

microcode.service                           enabled

multi-user.target                           enabled

NetworkManager-dispatcher.service           enabled

NetworkManager.service                      enabled

postfix.service                             enabled

remote-fs.target                            enabled

rsyslog.service                             enabled

sshd.service                                enabled

systemd-readahead-collect.service           enabled

systemd-readahead-drop.service              enabled

systemd-readahead-replay.service            enabled

tuned.service                               enabled

 

[root@node1 ~]#  systemctl | grep running | sort 

crond.service                   loaded active running   Command Scheduler

dbus.service                    loaded active running   D-Bus System Message Bus

dbus.socket                     loaded active running   D-Bus System Message Bus Socket

getty@tty1.service              loaded active running   Getty on tty1

irqbalance.service              loaded active running   irqbalance daemon

lvm2-lvmetad.service            loaded active running   LVM2 metadata daemon

lvm2-lvmetad.socket             loaded active running   LVM2 metadata daemon socket

NetworkManager.service          loaded active running   Network Manager

polkit.service                  loaded active running   Authorization Manager

postfix.service                 loaded active running   Postfix Mail Transport Agent

rsyslog.service                 loaded active running   System Logging Service

session-1.scope                 loaded active running   Session 1 of user root

session-2.scope                 loaded active running   Session 2 of user root

session-3.scope                 loaded active running   Session 3 of user root

sshd.service                    loaded active running   OpenSSH server daemon

systemd-journald.service        loaded active running   Journal Service

systemd-journald.socket         loaded active running   Journal Socket

systemd-logind.service          loaded active running   Login Service

systemd-udevd-control.socket    loaded active running   udev Control Socket

systemd-udevd-kernel.socket     loaded active running   udev Kernel Socket

systemd-udevd.service           loaded active running   udev Kernel Device Manager

tuned.service                   loaded active running   Dynamic System Tuning Daemon

vmware-tools.service            loaded active running   SYSV: Manages the services needed to run VMware software

wpa_supplicant.service          loaded active running   WPA Supplicant daemon

 

# Check the status of a service

systemctl status auditd.service

 

# Enable a service at boot

systemctl enable auditd.service

 

# Disable a service at boot

systemctl disable auditd.service

systemctl disable postfix.service

systemctl disable rsyslog.service

systemctl disable wpa_supplicant.service

 

# Check whether a service is enabled at boot

systemctl is-enabled auditd.service

2.9.3.3  Finding large files and directories

find . -type f -size +10M  -print0 | xargs -0 du -h | sort -nr



List the 20 largest directories; --max-depth sets the directory depth, and if it is omitted all subdirectories are traversed:

du -hm --max-depth=5 / | sort -nr | head -20



find /etc -name '*srm*'  # search /etc for files whose names contain 'srm'

2.9.3.4  Checking disk usage

[root@node1 dev]# df -h

Filesystem               Size  Used Avail Use% Mounted on

/dev/mapper/centos-root   50G  1.5G   49G    3% /

devtmpfs                 721M     0  721M    0% /dev

tmpfs                    731M     0  731M    0% /dev/shm

tmpfs                    731M  8.5M  723M    2% /run

tmpfs                    731M     0  731M    0% /sys/fs/cgroup

/dev/mapper/centos-home   47G   33M   47G    1% /home

/dev/sda1                497M  106M  391M   22% /boot

tmpfs                    147M     0  147M    0% /run/user/0

2.9.3.5  Checking memory usage

[root@node1 dev]# top

3  Installing the JDK

Download address for all older JDK versions on the official site: http://www.oracle.com/technetwork/java/archive-139210.html



Download jdk-8u92-linux-x64.tar.gz online and save it under /root:

wget -O /root/jdk-8u92-linux-x64.tar.gz http://120.52.72.24/download.oracle.com/c3pr90ntc0td/otn/java/jdk/8u92-b14/jdk-8u92-linux-x64.tar.gz

 

 

[root@node1 ~]# tar -zxvf /root/jdk-8u92-linux-x64.tar.gz -C /root

 

[root@node1 ~]# vi /etc/profile

 

Append the following to the end of /etc/profile:

export JAVA_HOME=/root/jdk1.8.0_92
export PATH=.:$PATH:$JAVA_HOME/bin

export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

 

[root@node1 ~]# source /etc/profile

[root@node1 ~]# java -version

java version "1.8.0_92"

Java(TM) SE Runtime Environment (build 1.8.0_92-b14)

Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)

 

Use the env command to check that the environment variables are set correctly:

[root@node1 ~]# env | grep CLASSPATH

CLASSPATH=.:/root/jdk1.8.0_92/jre/lib/rt.jar:/root/jdk1.8.0_92/lib/dt.jar:/root/jdk1.8.0_92/lib/tools.jar

4  Cloning the virtual machines

So far only one machine, node1, has been installed; now clone node2, node3 and node4 from it.

node1    192.168.1.11
node2    192.168.1.12
node3    192.168.1.13
node4    192.168.1.14

Change the display name of each virtual machine:



When powering the clone on, choose "I copied it":





Change the hostname:

[root@node1 ~]# vi /etc/sysconfig/network

 

[root@node1 ~]# vi /etc/hostname

5  SSH passwordless login

RSA is a typical asymmetric encryption algorithm.

RSA can be used for data encryption (public-key encryption, private-key decryption) and for digital signatures or authentication (private-key encryption, public-key decryption).



5.1  How ordinary ssh works (password required)

The client sends a connection request to the server.

The server sends its public key to the client.

The client encrypts the login password with the server's public key and sends it to the server.

If the traffic is intercepted, the eavesdropper still cannot decrypt it: even knowing the public key and the content encrypted with it, decryption is impossible without the private key (the RSA property).

After receiving the ciphertext, the server decrypts it with its private key and obtains the login password.

5.2  How passwordless login works

First create a key pair on the client and place the public key on the server that needs to be accessed.

The client sends a request to the server, asking to be authenticated with your key.

After receiving the request, the server looks for your public key in your home directory on that server and compares it with the public key you sent. If the two keys match, the server encrypts a "challenge" with the public key and sends it to the client.

After receiving the "challenge", the client decrypts it with its private key and sends it back to the server.

The server compares the returned "challenge" with the original; if they match, it grants access and the session is established.

 

5.3  Setting up passwordless SSH

First delete any previously generated keys:

rm -rf /root/.ssh

Generate the keys:

[root@node1 ~]# ssh-keygen -t rsa

[root@node2 ~]# ssh-keygen -t rsa

[root@node3 ~]# ssh-keygen -t rsa

[root@node4 ~]# ssh-keygen -t rsa

The command "ssh-keygen -t rsa" generates a key pair using RSA; after running it you are prompted three times, and simply pressing Enter each time is fine.

 

Inspect the generated keys:

id_rsa.pub is the public key and id_rsa is the private key.



Copy the public keys between the servers:

ssh-copy-id -i /root/.ssh/id_rsa.pub <hostname>

This copies the local public key to the target host and automatically appends it to that host's authorized_keys file, creating the file if it does not exist. When a machine needs to ssh to itself, use its own hostname as the target.

[root@node1 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node1

[root@node1 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node2

[root@node1 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node3

[root@node1 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node4

 

[root@node2 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node1

[root@node2 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node2

[root@node2 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node3

[root@node2 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node4

 

[root@node3 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node1

[root@node3 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node2

[root@node3 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node3

[root@node3 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node4

 

[root@node4 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node1

[root@node4 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node2

[root@node4 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node3

[root@node4 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub node4

Note: if the cloned virtual machines turn out to have generated identical public keys, first delete the /etc/udev/rules.d/70-persistent-net.rules file, then delete the /root/.ssh directory and regenerate the keys.
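
The per-host commands above can also be scripted; a minimal sketch, assuming the four hostnames node1-node4 and that each password prompt is answered interactively:

#!/bin/bash
# Run on every node after ssh-keygen: push this node's public key to all
# four nodes (including itself) to build the full mesh of passwordless logins.
for host in node1 node2 node3 node4; do
    ssh-copy-id -i /root/.ssh/id_rsa.pub "$host"
done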

 

6  HA + Federation server plan

                                       node1          node2          node3          node4
Hadoop     NameNode                    Y (cluster1)   Y (cluster1)   Y (cluster2)   Y (cluster2)
           DataNode                    -              Y              Y              Y
           NodeManager                 -              Y              Y              Y
           JournalNode                 Y              Y              Y              -
           zkfc (DFSZKFailoverController)
                                       Y              Y              Y              Y
           ResourceManager             Y              Y              -              -
Zookeeper  ZooKeeper (QuorumPeerMain)  Y              Y              Y              -
HIVE       MySQL                       -              -              -              Y
           metastore (RunJar)          -              -              Y              -
           HIVE (RunJar)               Y              -              -              -
Spark      Scala                       Y              Y              Y              Y
           Spark-master                Y              -              -              -
           Spark-worker                -              Y              Y              Y

(A zkfc runs wherever there is a NameNode.)

Different NameNodes share the same set of DataNodes by using the same ClusterID.

(Figure: HDFS Federation Architecture)



NS-n unit:

(Figure: Hadoop HA)

(Figure: MapReduce NextGen Architecture)

 

7                      zookeeper

[root@node1 ~]# wget -O /root/zookeeper-3.4.9.tar.gz https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.4.9/zookeeper-3.4.9.tar.gz

 

[root@node1 ~]# tar -zxvf /root/zookeeper-3.4.9.tar.gz -C /root

 

[root@node1 conf]# cp /root/zookeeper-3.4.9/conf/zoo_sample.cfg /root/zookeeper-3.4.9/conf/zoo.cfg

 

[root@node1 conf]# vi /root/zookeeper-3.4.9/conf/zoo.cfg
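
The zoo.cfg edits are shown only as a screenshot; a minimal sketch of what they likely contain, given the three-node ensemble node1-node3, the clientPort 2181 referenced in core-site.xml later, and the dataDir shown in section 7.2 (the 2888/3888 peer ports are the usual defaults, assumed here):

tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
dataDir=/root/zookeeper-3.4.9/zkData
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888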

 

[root@node1 conf]# mkdir /root/zookeeper-3.4.9/zkData

[root@node1 conf]# touch /root/zookeeper-3.4.9/zkData/myid

[root@node1 conf]# echo 1 > /root/zookeeper-3.4.9/zkData/myid

 

[root@node1 conf]# scp -r /root/zookeeper-3.4.9 node2:/root

[root@node1 conf]# scp -r /root/zookeeper-3.4.9 node3:/root

[root@node2 conf]# echo 2 > /root/zookeeper-3.4.9/zkData/myid

[root@node3 conf]# echo 3 > /root/zookeeper-3.4.9/zkData/myid

7.1  Super-user access

[root@node1 ~]# vi /root/zookeeper-3.4.9/bin/zkServer.sh

Where the script launches Java below, add the startup argument "-Dzookeeper.DigestAuthenticationProvider.superDigest=super:Q9YtF+3h9Ko5UNT8apBWr8hovH4="; the digest after super: corresponds to the password (AAAaaa111):

 

[root@node1 ~]# /root/zookeeper-3.4.9/bin/zkCli.sh

[zk: localhost:2181(CONNECTED) 11] addauth digest super:AAAaaa111

Now any znode data can be deleted:

[zk: localhost:2181(CONNECTED) 15] rmr /rmstore/ZKRMStateRoot

7.2  Problems

zookeeper fails to start with "Unable to load database on disk":

 

[root@node3 ~]# more zookeeper.out

2017-01-24 11:31:31,827 [myid:3] - ERROR [main:QuorumPeer@557] - Unable to load database on disk

java.io.IOException: The accepted epoch, d is less than the current epoch, 17

        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:554)

        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:500)

        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)

        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)

        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

 

[root@node3 ~]# more /root/zookeeper-3.4.9/conf/zoo.cfg | grep dataDir

dataDir=/root/zookeeper-3.4.9/zkData

[root@node3 ~]# ls /root/zookeeper-3.4.9/zkData

myid  version-2  zookeeper_server.pid

Clear all the files under version-2:

[root@node3 ~]# rm -f /root/zookeeper-3.4.9/zkData/version-2/*.*

[root@node3 ~]# rm -rf /root/zookeeper-3.4.9/zkData/version-2/acceptedEpoch

[root@node3 ~]# rm -rf /root/zookeeper-3.4.9/zkData/version-2/currentEpoch

8                      Hadoop

[root@node1 ~]# wget -O /root/hadoop-2.7.2.tar.gz  http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz

[root@node1 ~]# tar -zxvf /root/hadoop-2.7.2.tar.gz -C /root

8.1       hadoop-env.sh

[root@node1 ~]# vi /root/hadoop-2.7.2/etc/hadoop/hadoop-env.sh

 

The location below where the PID files are stored must be changed, otherwise you may run into: XXX running as process 1609. Stop it first.
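
The hadoop-env.sh screenshot is not reproduced; a minimal sketch of the two lines it most likely sets (the PID directory path is an assumption -- any location outside /tmp avoids the "Stop it first" problem caused by stale PID files):

export JAVA_HOME=/root/jdk1.8.0_92
export HADOOP_PID_DIR=/root/hadoop-2.7.2/pids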

8.2       hdfs-site.xml

[root@node1 ~]# vi /root/hadoop-2.7.2/etc/hadoop/hdfs-site.xml

<configuration>

   

       <property>

               <name>dfs.replication</name>

               <value>2</value>

<description>Number of block replicas stored by the DataNodes. The default is 3; we now have 4 DataNodes, so any value not greater than 4 is fine.</description>

        </property>

 

<property>

  <name>dfs.blocksize</name>

  <value>134217728</value>

  <description>

      The default block size for new files, in bytes.

      You can use the following suffix (case insensitive):

      k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.),

      Or provide complete size in bytes (such as 134217728 for 128 MB).

      Note: in 1.x and earlier the default was 64 MB and the property was named dfs.block.size.

  </description>

</property>

 

<property>

     <name>dfs.permissions.enabled</name>

     <value>false</value>

     <description>Note: if permission problems remain, run "/root/hadoop-2.7.2/bin/hdfs dfs -chmod -R 777 /".</description>

</property>

 

<property>

  <name>dfs.nameservices</name>

  <value>cluster1,cluster2</value>

<description>With federation, two HDFS namespaces are used. The two NameServices declared here are simply aliases for these two groups of NameNodes; the names are arbitrary as long as they do not clash, and multiple entries are separated by commas. Note that this is only a logical namespace concept: cluster1 and cluster2 are not two separate clusters -- together they form one cluster, each being a part of it (further NameNode pairs could be added as a third part for more capacity). Whether cluster1 and cluster2 belong to the same physical cluster is determined by the clusterID, which is specified when formatting the NameNodes; see the section on namenode formatting and startup.</description>

</property>

<property>

  <name>dfs.ha.namenodes.cluster1</name>

  <value>nn1,nn2</value>

<description>Logical names of the NameNodes in cluster1. Note: these are arbitrary logical names, not the real NameNode hostnames; later properties bind them to hosts.</description>

</property>

<property>

  <name>dfs.ha.namenodes.cluster2</name>

  <value>nn3,nn4</value>

<description>Logical names of the NameNodes in cluster2.</description>

</property>

 

<!-- The following properties bind the logical names to physical hosts -->

<property>

  <name>dfs.namenode.rpc-address.cluster1.nn1</name>

  <value>node1:8020</value>

<description>8020 is the HDFS client access port (for both the command line and programs); some setups use 9000.</description>

</property>

<property>

  <name>dfs.namenode.rpc-address.cluster1.nn2</name>

  <value>node2:8020</value>

</property>

<property>

  <name>dfs.namenode.rpc-address.cluster2.nn3</name>

  <value>node3:8020</value>

</property>

<property>

  <name>dfs.namenode.rpc-address.cluster2.nn4</name>

  <value>node4:8020</value>

</property>

<property>

  <name>dfs.namenode.http-address.cluster1.nn1</name>

  <value>node1:50070</value>

<description>NameNode web UI address.</description>

</property>

<property>

  <name>dfs.namenode.http-address.cluster1.nn2</name>

  <value>node2:50070</value>

</property>

<property>

  <name>dfs.namenode.http-address.cluster2.nn3</name>

  <value>node3:50070</value>

</property>

<property>

  <name>dfs.namenode.http-address.cluster2.nn4</name>

  <value>node4:50070</value>

</property>

 

<property>

  <name>dfs.namenode.shared.edits.dir</name>

  <value>qjournal://node1:8485;node2:8485;node3:8485/cluster1</value>

<description>The JournalNode quorum used by cluster1's two NameNodes to share their edits directory.

Use this setting on hosts node1 and node2.</description>

</property>

<!--

<property>

  <name>dfs.namenode.shared.edits.dir</name>

  <value>qjournal://node1:8485;node2:8485;node3:8485/cluster2</value>

<description>The JournalNode quorum used by cluster2's two NameNodes to share their edits directory.

Use this setting on hosts node3 and node4.</description>

</property>

-->

 

<property>

<name>dfs.ha.automatic-failover.enabled.cluster1</name>

<value>true</value>

<description>Whether automatic failover is enabled for cluster1, i.e. whether to switch to the other NameNode automatically when one fails.</description>

</property>

<property>

<name>dfs.ha.automatic-failover.enabled.cluster2</name>

<value>true</value>

</property>

<property>

  <name>dfs.client.failover.proxy.provider.cluster1</name>

  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>

<description>The Java class responsible for performing failover for cluster1 when a fault occurs.</description>

</property>

<property>

  <name>dfs.client.failover.proxy.provider.cluster2</name>

  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>

</property>

 

<property>

  <name>dfs.journalnode.edits.dir</name>

  <value>/root/hadoop-2.7.2/tmp/journal</value>

<description>Local disk path where the JournalNode stores its own data.</description>

</property>

 

<property>

  <name>dfs.ha.fencing.methods</name>

  <value>sshfence</value>

  <description>Use SSH to fence the NameNodes during active/standby failover.</description>

</property>

<property>

  <name>dfs.ha.fencing.ssh.private-key-files</name>

  <value>/root/.ssh/id_rsa</value>

<description>When ssh fencing is used, the location of the private key used for the ssh connection.</description>

</property>

 

</configuration>

8.3       core-site.xml

[root@node1 ~]# vi /root/hadoop-2.7.2/etc/hadoop/core-site.xml

<configuration>

       <property>

                <name>fs.defaultFS</name>

                <value>hdfs://cluster1:8020</value>

                <description>The default address used when a client (or program) does not specify one explicitly; the value refers to a nameservice defined in hdfs-site.xml. Note: this setting is the same on all hosts.</description>

       </property>

       <property>

               <name>hadoop.tmp.dir</name>

               <value>/root/hadoop-2.7.2/tmp</value>

               <description>This path is the default common parent directory where the NameNode, DataNode, JournalNode and so on store their data.</description>

       </property>

<property>

   <name>ha.zookeeper.quorum</name>

   <value>node1:2181,node2:2181,node3:2181</value>

   <description>Addresses and ports of the ZooKeeper ensemble. Note: the number of nodes must be odd and no fewer than three.</description>

</property>

 

 

<!-- The following setting works around NameNode-to-JournalNode connection timeout exceptions -->

<property>

  <name>ipc.client.connect.retry.interval</name>

  <value>10000</value>

  <description>Indicates the number of milliseconds a client will wait for

    before retrying to establish a server connection.

  </description>

</property>

 

</configuration>

8.4       slaves

Specify the hosts that run DataNodes:

[root@node1 ~]# vi /root/hadoop-2.7.2/etc/hadoop/slaves
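
The file content is not shown; based on the server plan in section 6 (DataNodes on node2-node4), it presumably lists one hostname per line:

node2
node3
node4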

8.5       yarn-env.sh

[root@node1 ~]# vi /root/hadoop-2.7.2/etc/hadoop/yarn-env.sh
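
The yarn-env.sh change is not shown either; most likely it only pins JAVA_HOME, mirroring hadoop-env.sh (an assumption):

export JAVA_HOME=/root/jdk1.8.0_92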

8.6       mapred-site.xml

[root@node1 ~]# vi /root/hadoop-2.7.2/etc/hadoop/mapred-site.xml

<configuration>

          <property>

 <name>mapreduce.framework.name</name>

                <value>yarn</value>

<description>Run MapReduce on the YARN framework.</description>

           </property>

 

    <property>

       <name>mapreduce.jobhistory.address</name>

       <value>node1:10020</value>

<description>Note: this value differs per machine; change it to the local hostname (the port stays the same), e.g. node2:10020, node3:10020, node4:10020. Adjust it after copying the file to each host.</description>

    </property>

 

    <property>

       <name>mapreduce.jobhistory.webapp.address</name>

       <value>node1:19888</value>

       <description>Note: this value differs per machine; change it to the local hostname (the port stays the same), e.g. node2:19888, node3:19888, node4:19888. Adjust it after copying the file to each host.</description>

</property>

</configuration>

8.7       yarn-site.xml

[root@node1 ~]# vi /root/hadoop-2.7.2/etc/hadoop/yarn-site.xml

<configuration>

        <property>

               <name>yarn.nodemanager.aux-services</name>

               <value>mapreduce_shuffle</value>

        </property>

        <property>                                                               

<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>

               <value>org.apache.hadoop.mapred.ShuffleHandler</value>

        </property>

 

<property>

  <name>yarn.resourcemanager.ha.enabled</name>

  <value>true</value>

</property>

 

<property>

  <name>yarn.resourcemanager.cluster-id</name>

  <value>yarn-cluster</value>

</property>

<property>

  <name>yarn.resourcemanager.ha.rm-ids</name>

  <value>rm1,rm2</value>

</property>

<property>

  <name>yarn.resourcemanager.hostname.rm1</name>

  <value>node1</value>

</property>

<property>

  <name>yarn.resourcemanager.hostname.rm2</name>

  <value>node2</value>

</property>

<property>

  <name>yarn.resourcemanager.webapp.address.rm1</name>

  <value>node1:8088</value>

</property>

<property>

  <name>yarn.resourcemanager.webapp.address.rm2</name>

  <value>node2:8088</value>

</property>

<property>

  <name>yarn.resourcemanager.zk-address</name>

  <value>node1:2181,node2:2181,node3:2181</value>

</property>

 

<property>

<name>yarn.resourcemanager.recovery.enabled</name>

<value>true</value>

</property>

<property>

<name>yarn.resourcemanager.store.class</name>

<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>

<description>By default the RM state is stored under /rmstore on ZK; the path can be changed with yarn.resourcemanager.zk-state-store.parent-path.</description>

</property>

 

 

<property>

<name>yarn.log-aggregation-enable</name>  

<value>true</value>

<description>Enable log aggregation: the local log files produced on each machine that ran tasks are copied to a central location on HDFS, so job logs can be viewed from any machine in the cluster.</description>

</property>

 

<property>

  <name>yarn.log.server.url</name>

  <value>http://node1:19888/jobhistory/logs</value>

  <description>Note: this value differs per machine; change it to the local hostname (the port stays the same), e.g. http://node2:19888/jobhistory/logs, http://node3:19888/jobhistory/logs, http://node4:19888/jobhistory/logs. Adjust it after copying the file to each host.</description>

</property>

 

 

</configuration>

8.8  Copying and adjusting

[root@node1 ~]# scp -r /root/hadoop-2.7.2/ node2:/root

[root@node1 ~]# scp -r /root/hadoop-2.7.2/ node3:/root

[root@node1 ~]# scp -r /root/hadoop-2.7.2/ node4:/root

 

[root@node3 ~]# vi /root/hadoop-2.7.2/etc/hadoop/hdfs-site.xml

[root@node3 ~]# scp /root/hadoop-2.7.2/etc/hadoop/hdfs-site.xml node4:/root/hadoop-2.7.2/etc/hadoop

 

[root@node2 ~]# vi /root/hadoop-2.7.2/etc/hadoop/mapred-site.xml

[root@node3 ~]# vi /root/hadoop-2.7.2/etc/hadoop/mapred-site.xml

[root@node4 ~]# vi /root/hadoop-2.7.2/etc/hadoop/mapred-site.xml

 

[root@node2 ~]# vi /root/hadoop-2.7.2/etc/hadoop/yarn-site.xml

[root@node3 ~]# vi /root/hadoop-2.7.2/etc/hadoop/yarn-site.xml

[root@node4 ~]# vi /root/hadoop-2.7.2/etc/hadoop/yarn-site.xml

8.9  Starting ZK

[root@node1 bin]# /root/zookeeper-3.4.9/bin/zkServer.sh start

[root@node2 bin]# /root/zookeeper-3.4.9/bin/zkServer.sh start

[root@node3 bin]# /root/zookeeper-3.4.9/bin/zkServer.sh start

[root@node1 bin]# jps

1622 QuorumPeerMain

 

Check the status:

[root@node1 ~]# /root/zookeeper-3.4.9/bin/zkServer.sh status

ZooKeeper JMX enabled by default

Using config: /root/zookeeper-3.4.9/bin/../conf/zoo.cfg

Mode: follower

[root@node2 ~]# /root/zookeeper-3.4.9/bin/zkServer.sh status

ZooKeeper JMX enabled by default

Using config: /root/zookeeper-3.4.9/bin/../conf/zoo.cfg

Mode: leader

 

Inspect the znodes:

[root@node1 hadoop-2.7.2]# /root/zookeeper-3.4.9/bin/zkCli.sh

[zk: localhost:2181(CONNECTED) 0] ls /

[zookeeper]

8.10  Formatting zkfc

Run this on any one node of each cluster; it creates the corresponding HA znodes on the ZooKeeper ensemble.

 

[root@node1 ~]# /root/hadoop-2.7.2/bin/hdfs zkfc -formatZK

[root@node3 ~]# /root/hadoop-2.7.2/bin/hdfs zkfc -formatZK

 

After formatting, a znode named hadoop-ha is created on ZK:

[root@node1 ~]# /root/zookeeper-3.4.9/bin/zkCli.sh

8.11  Starting the journalnodes

[root@node1 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start journalnode

[root@node2 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start journalnode

[root@node3 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start journalnode

[root@node1 ~]# jps

1810 JournalNode

8.12  Formatting and starting the namenodes

[root@node1 ~]# /root/hadoop-2.7.2/bin/hdfs namenode -format -clusterId CLUSTER_UUID_1

[root@node1 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start namenode

[root@node1 ~]# jps

1613 NameNode

 

All members of the same cluster must have the same cluster ID (NameNodes, DataNodes, etc.):

[root@node2 ~]# /root/hadoop-2.7.2/bin/hdfs namenode -bootstrapStandby

[root@node2 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start namenode

 

[root@node3 ~]# /root/hadoop-2.7.2/bin/hdfs namenode -format -clusterId CLUSTER_UUID_1

[root@node3 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start namenode

[root@node4 ~]# /root/hadoop-2.7.2/bin/hdfs namenode -bootstrapStandby

[root@node4 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start namenode

8.13  Starting zkfc

ZKFC (ZooKeeper Failover Controller) monitors NameNode state and assists with active/standby NameNode failover; run it on every NameNode.

 

[root@node1 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc

[root@node2 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc

[root@node1 ~]# jps

5280 DFSZKFailoverController

 

Automatic failover succeeded:

 

[root@node3 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc

[root@node4 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc

 

8.14  Starting the datanodes

[root@node2 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start datanode

[root@node3 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start datanode

[root@node4 ~]# /root/hadoop-2.7.2/sbin/hadoop-daemon.sh start datanode

8.15  HDFS verification

Upload to the specified cluster2:

[root@node1 ~]# /root/hadoop-2.7.2/bin/hdfs dfs -put /root/hadoop-2.7.2.tar.gz hdfs://cluster2/

[root@node1 ~]# /root/hadoop-2.7.2/bin/hdfs dfs -put /root/test_upload.tar hdfs://cluster1:8020/

If no explicit path is given when uploading, the fs.defaultFS setting from core-site.xml is used by default:

[root@node1 ~]# /root/hadoop-2.7.2/bin/hdfs dfs -put /root/hadoop-2.7.2.tar.gz /

You can also target a specific host (but it must be the active one):

/root/hadoop-2.7.2/bin/hdfs dfs -put /root/hadoop-2.7.2.tar hdfs://node3:8020/

/root/hadoop-2.7.2/bin/hdfs dfs -put /root/hadoop-2.7.2.tar hdfs://node3/

 

8.16  HA verification

[root@node1 ~]# /root/hadoop-2.7.2/bin/hdfs haadmin -ns cluster1 -getServiceState nn1

active

[root@node1 ~]# /root/hadoop-2.7.2/bin/hdfs haadmin -ns cluster1 -getServiceState nn2

standby

[root@node1 ~]# jps

2448 NameNode

3041 DFSZKFailoverController

3553 Jps

2647 JournalNode

2954 QuorumPeerMain

[root@node1 ~]# kill 2448

[root@node1 ~]# /root/hadoop-2.7.2/bin/hdfs haadmin -ns cluster1 -getServiceState nn2

active

8.16.1  Manual failover

/root/hadoop-2.7.2/bin/hdfs haadmin -ns cluster1 -failover nn2 nn1

/root/hadoop-2.7.2/bin/hdfs haadmin -ns cluster2 -failover nn4 nn3

8.17  Starting yarn

[root@node1 ~]# /root/hadoop-2.7.2/sbin/yarn-daemon.sh start resourcemanager

[root@node2 ~]# /root/hadoop-2.7.2/sbin/yarn-daemon.sh start resourcemanager

 

[root@node2 ~]# /root/hadoop-2.7.2/sbin/yarn-daemon.sh start nodemanager

[root@node3 ~]# /root/hadoop-2.7.2/sbin/yarn-daemon.sh start nodemanager

[root@node4 ~]# /root/hadoop-2.7.2/sbin/yarn-daemon.sh start nodemanager

 

http://node1:8088/cluster/cluster

Note: use an address of the form http://XXXXX/cluster/cluster; otherwise, if you hit the standby, it automatically redirects to the active host.

http://node2:8088/cluster/cluster

 

Command to check the state:

[root@node4 logs]# /root/hadoop-2.7.2/bin/yarn rmadmin -getServiceState rm2

8.18  MapReduce test

[root@node4 ~]# /root/hadoop-2.7.2/bin/hdfs dfs -mkdir hdfs://cluster1/hadoop

[root@node4 ~]# /root/hadoop-2.7.2/bin/hdfs dfs -put /root/hadoop-2.7.2/etc/hadoop/*xml* hdfs://cluster1/hadoop

[root@node4 ~]# /root/hadoop-2.7.2/bin/hadoop jar /root/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount hdfs://cluster1:8020/hadoop/h* hdfs://cluster1:8020/hadoop/m* hdfs://cluster1/wordcountOutput

Note: the MapReduce output should be in the same namespace as its input. The job still succeeds when the output is placed in the other namespace, but the result files cannot be found when browsing through the web UI.

8.19  Scripts

The following scripts are placed on node1 and run from there.

8.19.1  Start and stop scripts

Automating the interactive prompt

expect is used when the script performs the manual RM failover:

[root@node1 ~]# yum install expect

 

[root@node1 ~]# vi /root/starthadoop.sh

#rm -rf /root/hadoop-2.7.2/logs/*.*

#ssh root@node2 'export BASH_ENV=/etc/profile;rm -rf /root/hadoop-2.7.2/logs/*.*'

#ssh root@node3 'export BASH_ENV=/etc/profile;rm -rf /root/hadoop-2.7.2/logs/*.*'

#ssh root@node4 'export BASH_ENV=/etc/profile;rm -rf /root/hadoop-2.7.2/logs/*.*'

 

/root/zookeeper-3.4.9/bin/zkServer.sh start

ssh root@node2 'export BASH_ENV=/etc/profile;/root/zookeeper-3.4.9/bin/zkServer.sh start'

ssh root@node3 'export BASH_ENV=/etc/profile;/root/zookeeper-3.4.9/bin/zkServer.sh start'

 

/root/hadoop-2.7.2/sbin/start-all.sh

ssh root@node2 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/yarn-daemon.sh start resourcemanager'

 

/root/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc

ssh root@node2 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc'

ssh root@node3 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc'

ssh root@node4 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc'

 

#ret=`/root/hadoop-2.7.2/bin/hdfs dfsadmin -safemode get | grep ON | head -1`

#while [ -n "$ret" ]

#do

#echo 'waiting to leave safe mode'

#sleep 1s

#ret=`/root/hadoop-2.7.2/bin/hdfs dfsadmin -safemode get | grep ON | head -1`

#done

 

/root/hadoop-2.7.2/bin/hdfs haadmin -ns cluster1 -failover nn2 nn1

/root/hadoop-2.7.2/bin/hdfs haadmin -ns cluster2 -failover nn4 nn3

echo 'Y' | ssh root@node1 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/bin/yarn rmadmin -transitionToActive --forcemanual rm1'

 

/root/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh start historyserver

ssh root@node2 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh start historyserver'

ssh root@node3 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh start historyserver'

ssh root@node4 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh start historyserver'

 

# This line starts Spark; remove it if only Hadoop is installed

/root/spark-2.1.0-bin-hadoop2.7/sbin/start-all.sh

 

echo '--------------node1---------------'

jps | grep -v Jps | sort  -k 2 -t ' '

echo '--------------node2---------------'

ssh root@node2 "export PATH=/usr/bin:$PATH;jps | grep -v Jps | sort  -k 2 -t ' '"

echo '--------------node3---------------'

ssh root@node3 "export PATH=/usr/bin:$PATH;jps | grep -v Jps | sort  -k 2 -t ' '"

echo '--------------node4---------------'

ssh root@node4 "export PATH=/usr/bin:$PATH;jps | grep -v Jps | sort  -k 2 -t ' '"

 

# The next two lines start Hive; remove them if Hive is not installed

ssh root@node4 'export BASH_ENV=/etc/profile;service mysql start'

ssh root@node3 'export BASH_ENV=/etc/profile;/root/hive-1.2.1/bin/hive --service metastore&'

[root@node1 ~]# vi /root/stophadoop.sh

# This line stops Spark; remove it if Spark is not installed

/root/spark-2.1.0-bin-hadoop2.7/sbin/stop-all.sh

# The next two lines stop HIVE; remove them if it is not installed

ssh root@node4 'export BASH_ENV=/etc/profile;service mysql stop'

ssh root@node3 'export BASH_ENV=/etc/profile;/root/jdk1.8.0_92/bin/jps | grep RunJar | head -1 |cut -f1 -d " "|  xargs kill'

 

ssh root@node2 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/yarn-daemon.sh stop resourcemanager'

/root/hadoop-2.7.2/sbin/stop-all.sh

 

/root/hadoop-2.7.2/sbin/hadoop-daemon.sh stop zkfc

ssh root@node2 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/hadoop-daemon.sh stop zkfc'

ssh root@node3 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/hadoop-daemon.sh stop zkfc'

ssh root@node4 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/hadoop-daemon.sh stop zkfc'

 

/root/zookeeper-3.4.9/bin/zkServer.sh stop

ssh root@node2 'export BASH_ENV=/etc/profile;/root/zookeeper-3.4.9/bin/zkServer.sh stop'

ssh root@node3 'export BASH_ENV=/etc/profile;/root/zookeeper-3.4.9/bin/zkServer.sh stop'

 

/root/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh stop historyserver

ssh root@node2 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh stop historyserver'

ssh root@node3 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh stop historyserver'

ssh root@node4 'export BASH_ENV=/etc/profile;/root/hadoop-2.7.2/sbin/mr-jobhistory-daemon.sh stop historyserver'

 

[root@node1 ~]# chmod 777 starthadoop.sh stophadoop.sh

8.19.2  Reboot and shutdown

[root@node1 ~]# vi /root/reboot.sh 

ssh root@node2 "export PATH=/usr/bin:$PATH;reboot"

ssh root@node3 "export PATH=/usr/bin:$PATH;reboot"

ssh root@node4 "export PATH=/usr/bin:$PATH;reboot"

reboot

 

[root@node1 ~]# vi /root/shutdown.sh

ssh root@node2 "export PATH=/usr/bin:$PATH;shutdown -h now"

ssh root@node3 "export PATH=/usr/bin:$PATH;shutdown -h now"

ssh root@node4 "export PATH=/usr/bin:$PATH;shutdown -h now"

shutdown -h now

 

[root@node1 ~]# chmod 777 /root/shutdown.sh /root/reboot.sh

 

8.20  Eclipse plugin

8.20.1  Plugin installation

1.  Extract hadoop-2.7.2.tar.gz (the CentOS build compiled earlier) to D:\hadoop, copy winutils.exe, hadoop.dll and the related files into the bin folder of the Hadoop installation directory, and also put hadoop.dll into C:\Windows and C:\Windows\System32.

2.  Add a HADOOP_HOME environment variable with the value D:\hadoop\hadoop-2.7.2, and append %HADOOP_HOME%\bin to the Path environment variable.

3.  Double-click winutils.exe; if a "missing MSVCR120.dll" message appears, install the VC++ 2013 runtime components.

4.  Copy the hadoop-eclipse-plugin-2.7.2.jar plugin (it also has to be compiled on Windows, which is very tedious -- better to find a prebuilt one) into the Eclipse plugins directory.

5.  Run Eclipse and configure it:

  • Map/Reduce(V2) Master: this port does not matter and does not affect remote job submission or execution. If it is configured correctly, the view below can monitor job execution directly in Eclipse (I fiddled with this for a long time here without success, though it did work on hadoop 1.2.1):

  • DFS Master: the NameNode IP and port, i.e. the port configured by dfs.namenode.rpc-address in hdfs-site.xml; this setting determines whether the tree on the left can connect to HDFS.

8.20.2  WordCount project

8.20.2.1         WordCount.java

package jzj;

import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.log4j.Logger;

public class WordCount {

	public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

		private final static IntWritable one = new IntWritable(1);
		private Text word = new Text();
		private Logger log = Logger.getLogger(TokenizerMapper.class);

		public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
			log.debug("[Thread=" + Thread.currentThread().hashCode() + "] map task, log4j output: wordcount, key=" + key + ", value=" + value);
			System.out.println("[Thread=" + Thread.currentThread().hashCode() + "] map task, System.out output: wordcount, key=" + key + ", value="
					+ value);
			StringTokenizer itr = new StringTokenizer(value.toString());
			while (itr.hasMoreTokens()) {
				word.set(itr.nextToken());
				context.write(word, one);
			}
		}
	}

	public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
		private IntWritable result = new IntWritable();
		private Logger log = Logger.getLogger(IntSumReducer.class);

		public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
			log.debug("[Thread=" + Thread.currentThread().hashCode() + "] reduce task, log4j output: wordcount, key=" + key + ", count=" + sum);
			System.out.println("[Thread=" + Thread.currentThread().hashCode() + "] reduce task, System.out output: wordcount, key=" + key + ", count="
					+ sum);
		}
	}

	public static void main(String[] args) throws Exception {
		Logger log = Logger.getLogger(WordCount.class);
		log.debug("JOB main method, log4j output: wordcount");
		System.out.println("JOB main method, System.out output: wordcount");
		Configuration conf = new Configuration();
		// Note: the job jar needs an empty yarn-default.xml, otherwise the job waits forever after remote submission. Why?
		conf.set("mapreduce.framework.name", "yarn"); // use the yarn framework
		conf.set("yarn.resourcemanager.address", "node1:8032"); // which machine the job is submitted to
		// Required, otherwise: java.io.IOException: The ownership on the staging
		// directory /tmp/hadoop-yarn/staging/15040078/.staging
		// is not as expected. It is owned by . The directory must be owned by
		// the submitter 15040078 or by 15040078
		conf.set("fs.defaultFS", "hdfs://node1:8020"); // specify the namenode
		// Required, otherwise: Stack trace: ExitCodeException exitCode=1: /bin/bash: line 0:
		// fg: no job control
		conf.set("mapreduce.app-submission.cross-platform", "true");

		// Do not change the key "mapred.jar"; the value is the jar exported from this
		// project. Without it the classes cannot be found.
		conf.set("mapred.jar", "wordcount.jar");

		Job job = Job.getInstance(conf, "wordcount");
		job.setJarByClass(WordCount.class);
		job.setMapperClass(TokenizerMapper.class);
		// If a Combiner is set here, reduce logs also appear on the map side: with a Combiner,
		// the map side runs a reduce pass after mapping, so seeing reduce logs there is expected.
		// job.setCombinerClass(IntSumReducer.class);
		job.setReducerClass(IntSumReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		// job.setNumReduceTasks(4);
		FileInputFormat.addInputPath(job, new Path("hdfs://node1/hadoop/core-site.xml"));
		FileInputFormat.addInputPath(job, new Path("hdfs://node1/hadoop/m*"));

		FileSystem fs = FileSystem.get(URI.create("hdfs://node1"), conf);
		fs.delete(new Path("/wordcountOutput"), true);

		FileOutputFormat.setOutputPath(job, new Path("hdfs://node1/wordcountOutput"));

		System.exit(job.waitForCompletion(true) ? 0 : 1);
		System.out.println(job.getStatus().getJobID());
	}
}

 

8.20.2.2      yarn-default.xml

Note: the yarn-default.xml in the project is an empty file, but testing shows it is definitely required.

8.20.2.3      build.xml

<project default="jar" name="Acid">
	<property name="lib.dir" value="D:/hadoop/hadoop-2.7.2/share/hadoop"/>
	<property name="src.dir" value="../src"/>
	<property name="classes.dir" value="../bin"/>

	<property name="output.dir" value=".."/>
	<property name="jarname" value="wordcount.jar"/>
	<property name="mainclass" value="jzj.WordCount"/>

	<!-- Path to the third-party jars -->
	<path id="lib-classpath">
		<fileset dir="${lib.dir}">
			<include name="**/*.jar"/>
		</fileset>
	</path>

	<!-- 1. Initialization, e.g. creating directories -->
	<target name="init">
		<mkdir dir="${classes.dir}"/>
		<mkdir dir="${output.dir}"/>
		<delete file="${output.dir}/wordcount.jar"/>
		<delete verbose="true" includeemptydirs="true">
			<fileset dir="${classes.dir}">
				<include name="**/*"/>
			</fileset>
		</delete>
	</target>

	<!-- 2. Compile -->
	<target name="compile" depends="init">
		<javac srcdir="${src.dir}" destdir="${classes.dir}" includeantruntime="on">
			<compilerarg line="-encoding GBK"/>
			<classpath refid="lib-classpath"/>
		</javac>
	</target>

	<!-- 3. Build the jar -->
	<target name="jar" depends="compile">
		<copy todir="${classes.dir}">
			<fileset dir="${src.dir}">
				<include name="**"/>
				<exclude name="build.xml"/>
				<!-- Note: do not exclude log4j.properties; it must be packaged as well, otherwise no logs appear at runtime.
				     This log4j configuration only affects the JOB, i.e. the logs produced on the client that submits the job,
				     while the TASKs (Map/Reduce tasks) are governed by /root/hadoop-2.7.2/etc/hadoop/log4j.properties -->
				<!-- exclude name="log4j.properties" -->
			</fileset>
		</copy>
		<!-- Output path of the jar file -->
		<jar destfile="${output.dir}/${jarname}" basedir="${classes.dir}">
			<manifest>
				<attribute name="Main-class" value="${mainclass}"/>
			</manifest>
		</jar>
	</target>
</project>

8.20.2.4      log4j.properties

log4j.rootLogger=info,stdout,R 

log4j.appender.stdout=org.apache.log4j.ConsoleAppender 

log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 

log4j.appender.stdout.layout.ConversionPattern=%5p-%m%n 

log4j.appender.R=org.apache.log4j.RollingFileAppender 

log4j.appender.R.File=mapreduce_test.log 

log4j.appender.R.MaxFileSize=1MB 

log4j.appender.R.MaxBackupIndex=1

log4j.appender.R.layout=org.apache.log4j.PatternLayout 

log4j.appender.R.layout.ConversionPattern=%p%t%c-%m%n 

 

log4j.logger.jzj =DEBUG

8.20.3  Packaging and running

Open the build.xml build file in the project and press SHIFT+ALT+X, Q to build the job jar under the project:

The jar structure is as follows:

Then open the WordCount.java source file in the project and click:

8.20.4  Access permissions

If the following exception is thrown at runtime:

 

Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=15040078, access=EXECUTE, inode="/tmp/hadoop-yarn/staging/15040078/.staging/job_1484039063795_0001":root:supergroup:drwxrwx---

       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)

       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)

       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)

       at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)

       at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1720)

       at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1704)

       at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkOwner(FSDirectory.java:1673)

       at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setPermission(FSDirAttrOp.java:61)

       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setPermission(FSNamesystem.java:1653)

       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setPermission(NameNodeRpcServer.java:695)

       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setPermission(ClientNamenodeProtocolServerSideTranslatorPB.java:453)

       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)

       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)

       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)

       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)

       at java.security.AccessController.doPrivileged(Native Method)

       at javax.security.auth.Subject.doAs(Subject.java:422)

       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)

       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

 

[root@node1 ~]# /root/hadoop-2.7.2/bin/hdfs dfs -chmod -R 777 /

8.21  Killing a job

If a submitted job makes no progress, it can be killed:

[root@node1 ~]# /root/hadoop-2.7.2/bin/hadoop job -list

[root@node1 ~]# /root/hadoop-2.7.2/bin/hadoop job -kill job_1475762778825_0008
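
The same can be done through YARN; a hedged alternative (in 2.x the hadoop job script is marked deprecated and delegates to mapred job), where the application ID is the job ID with the application_ prefix instead of job_:

[root@node1 ~]# /root/hadoop-2.7.2/bin/yarn application -list
[root@node1 ~]# /root/hadoop-2.7.2/bin/yarn application -kill application_1475762778825_0008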

8.22  Logs

8.22.1  Hadoop service logs

Logs from the built-in services (NameNode, secondarynamenode, historyserver, ResourceManager, DataNode, nodemanager, etc.) are stored under ${HADOOP_HOME}/logs by default, and can also be viewed through the web pages, like this:

http://node1:19888/logs/

 

These logs correspond to local log files on each host; log in to the host to see the raw files:

When a log file reaches a certain size it is rolled over into a new file; the larger the trailing number, the older the log. By default only the 20 most recent log files are kept. The location and size limits of these logs are configured in ${HADOOP_HOME}/etc/hadoop/log4j.properties, and the environment variables used in that file are set by the related configuration files under ${HADOOP_HOME}/etc/hadoop/.

*.out files: standard output is redirected here.

 

http://node2:19888/logs/

http://node3:19888/logs/

http://node4:19888/logs/

It can also be reached by clicking through here:

 

8.22.2  MapReduce logs

MapReduce logs fall into two kinds: job history logs and container logs.
  (1) Job history records contain how many map and reduce tasks a job used, when it was submitted, started and finished, and so on. This information is very useful for analysis: from the history you can tell how many jobs succeeded or failed each day, how many jobs ran in each queue, and so on. The history information is configured as follows:

Note: this class of log files is stored on HDFS.

(2) Container logs: these contain the ApplicationMaster log, the ordinary task logs, and so on.

YARN provides two ways of storing container logs:

1)  Aggregated on HDFS: if the log aggregation service is enabled (via yarn.log-aggregation-enable), container logs are copied to HDFS and the local copies are deleted; the target location is set by yarn.nodemanager.remote-app-log-dir in yarn-site.xml and defaults to the /tmp/logs directory on HDFS:

<property>

    <description>Where to aggregate logs to.</description>

    <name>yarn.nodemanager.remote-app-log-dir</name>

    <value>/tmp/logs</value>

  </property>

Default configuration of the subdirectory under /tmp/logs:

<property>

    <description>The remote log dir will be created at {yarn.nodemanager.remote-app-log-dir}/${user}/{thisParam}

    </description>

    <name>yarn.nodemanager.remote-app-log-dir-suffix</name>

    <value>logs</value>

  </property>

By default these log files are stored under the ${HADOOP_HOME}/logs/userlogs directory:

This can be changed with the following setting:

2)  Kept locally: when the log aggregation service is disabled (yarn.log-aggregation-enable is false), logs stay under $HADOOP_HOME/logs/userlogs on the machines that ran the tasks and are not moved to HDFS after the job finishes.

 

 

Via http://node1:8088/cluster/apps you can click through to the log information of both running and finished jobs:



Clicking the corresponding links shows the logs of every map and reduce task:

8.22.3              System.out

System.out in the job launcher's main method is printed on the terminal of the node that submits the job; if the job is submitted remotely from Eclipse, it appears in the Eclipse console:

 

 

If the job is submitted to a remote server, the output appears on the terminal of whichever node the job was launched from:

 

If the output comes from the Map or Reduce classes, it is written to files under the ${HADOOP_HOME}/logs/userlogs directory (if log aggregation is enabled, these files are moved to HDFS after the task finishes, so when experimenting look at them before the job completes):



These logs can also be viewed through the http://node1:8088/cluster/apps page.

 

8.22.4              log4j

When launched from Eclipse:

The logging in the job-submission code (the main method), as well as the Eclipse console output produced while the job runs, is governed by the log4j.properties file packaged inside the job jar:

Because log4j.properties configures a Console appender, the output is printed directly in the Eclipse console:

From the output you can see that, besides the logging from the main method, there is a large amount of logging produced by the framework while the job runs; this is also log4j output, and all of it (the main-method output plus the framework output) is written to the mapreduce_test.log file:



When the job is submitted to the servers, the effective configuration file is /root/hadoop-2.7.2/etc/hadoop/log4j.properties.

 

The log level of the Map/Reduce tasks is configured in mapred-site.xml; the defaults are:

<property>

  <name>mapreduce.map.log.level</name>

  <value>INFO</value>

  <description>The logging level for the map task. The allowed levels are:

  OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE and ALL.

  The setting here could be overridden if "mapreduce.job.log4j-properties-file"

  is set.

  </description>

</property>

 

<property>

  <name>mapreduce.reduce.log.level</name>

  <value>INFO</value>

  <description>The logging level for the reduce task. The allowed levels are:

  OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE and ALL.

  The setting here could be overridden if "mapreduce.job.log4j-properties-file"

  is set.

  </description>

</property>

 

log4j output from the Map and Reduce classes goes straight to the corresponding files under the ${HADOOP_HOME}/logs/userlogs directory (and, if log aggregation is enabled, is moved to HDFS once the task finishes), not to the log file configured in /root/hadoop-2.7.2/etc/hadoop/log4j.properties (whose default name is hadoop.log, though that file never turned up!?):



Note: if a Combiner is set, reduce logs also appear on the map side; with a Combiner the map side runs a reduce pass after mapping, so seeing reduce-task logs there is not surprising.

9  MySQL

1. Download the mysql repo package:

[root@node4 ~]# wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm



2. Install mysql-community-release-el7-5.noarch.rpm:

[root@node4 ~]# rpm -ivh mysql-community-release-el7-5.noarch.rpm

 

After installing this package you get two mysql yum repos: /etc/yum.repos.d/mysql-community.repo and /etc/yum.repos.d/mysql-community-source.repo.



3. Install mysql:

[root@node4 ~]# yum install mysql-server

 

4. Start the database:

[root@node4 /root]# service mysql start

 

5. Change the root password:

[root@node4 /root]# mysqladmin -u root password 'AAAaaa111'

 

6. Configure remote access. For security, only local logins are allowed by default and remote access from other IPs is restricted:

[root@node4 /root]# mysql -h localhost -u root -p

Enter password: AAAaaa111

mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'AAAaaa111' WITH GRANT OPTION;

mysql> flush privileges;

 

7. Check the database character sets:

mysql> show variables like 'character%';

8. Change the character set:

[root@node4 /root]# vi /etc/my.cnf

[client]

default-character-set=utf8

[mysql]

default-character-set=utf8

[mysqld]

character-set-server=utf8

 

9. Case-sensitivity configuration

Make table names case-insensitive:

[root@node4 /root]# vi /etc/my.cnf

[mysqld]

lower_case_table_names = 1

Here 0 means case-sensitive and 1 means case-insensitive.

 

10. Restart the service:

[root@node4 /root]# service mysql stop

[root@node4 /root]# service mysql start

 

11. Log in again: [root@node4 /root]# mysql -h localhost -u root -p

 

12. Check the character sets again after the change:

mysql> show variables like 'character%';

13. Create the database:

mysql> create database hive;

 

14. Show the databases:

mysql> show databases;

 

15. Connect to the database:

mysql> use hive;

 

16. Show the tables in the database:

mysql> show tables;

 

17. Exit:

mysql> exit;

 

10  HIVE installation

10.1  Three installation modes

Basic concept: the metastore consists of two parts, the service process and the storage of the metadata.

This figure from "Hadoop: The Definitive Guide, 2nd edition", page 374:

(Figure: http://attach.dataguru.cn/attachments/forum/201211/22/200716nfnqd4d334q2qr2q.jpg)

1. The top of the figure is embedded mode: the hive service and the metastore service run in the same process, and the derby database also runs in that process. This mode needs no special configuration.

2. The middle is local mode: the hive service and the metastore service still run in the same process, but mysql runs as a separate process, either on the same machine or on a remote one. This mode only requires pointing ConnectionURL in hive-site.xml at mysql and configuring the driver name and database credentials.

(Figure: http://attach.dataguru.cn/attachments/forum/201211/22/2012562j7x92wx1x723sxp.jpg)

3. The bottom is remote mode: the hive service and the metastore run in different processes, possibly on different machines. This mode requires setting hive.metastore.local to false and setting hive.metastore.uris to the metastore server URI(s), separated by commas if there are several. A metastore server URI has the form thrift://host:port (Thrift is hive's communication protocol):

<property>
<name>hive.metastore.uris</name>
<value>thrift://127.0.0.1:9083</value>
</property>

With this understood, it becomes clear that merely connecting to a remote mysql is not what makes it remote mode; "remote" refers to whether the metastore and the hive service run in the same process -- in other words, how far apart the metastore and the hive service are.

10.2  Remote-mode installation

Install hive on node1 and the metastore service on node3:

1.  Download address: http://apache.fayea.com/hive

The Hadoop version is 2.7.2, so download apache-hive-1.2.1-bin.tar.gz here:

[root@node1 ~]# wget http://apache.fayea.com/hive/stable/apache-hive-1.2.1-bin.tar.gz

2.  [root@node1 ~]# tar -zxvf apache-hive-1.2.1-bin.tar.gz

3.  [root@node1 ~]# mv apache-hive-1.2.1-bin hive-1.2.1

4.  [root@node1 ~]# vi /etc/profile

export HIVE_HOME=/root/hive-1.2.1

export PATH=.:$PATH:$JAVA_HOME/bin:$HIVE_HOME/bin

5.  [root@node1 ~]# source /etc/profile

6.  Put the mysql-connector-java-5.6-bin.jar driver under the /root/hive-1.2.1/lib/ directory

7.  [root@node1 ~]# cp /root/hive-1.2.1/conf/hive-env.sh.template /root/hive-1.2.1/conf/hive-env.sh

8.  [root@node1 ~]# vi /root/hive-1.2.1/conf/hive-env.sh
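
The hive-env.sh edit is not shown; most likely it just points at the Hadoop installation (an assumption):

HADOOP_HOME=/root/hadoop-2.7.2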

After the steps above, HIVE should start with the default configuration (using the embedded derby database). Note: Hadoop must be running before starting Hive:

[root@node1 ~]# hive

Logging initialized using configuration in jar:file:/root/hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties

hive>

9.  Copy the Hive installation on node1 to node3:

[root@node1 ~]# scp -r /root/hive-1.2.1 node3:/root

[root@node1 ~]# scp /etc/profile node3:/etc/profile

[root@node3 ~]# source /etc/profile

 

10.  [root@node1 ~]# vi /root/hive-1.2.1/conf/hive-site.xml

<configuration>

<property>

<name>hive.metastore.uris</name>

<value>thrift://node3:9083</value>

</property>   

</configuration>

 

11.  [root@node3 ~]# vi /root/hive-1.2.1/conf/hive-site.xml

<configuration>

    <property>

      <name>hive.metastore.warehouse.dir</name>

      <value>/user/hive/warehouse</value>

    </property>

 

    <property>

      <name>javax.jdo.option.ConnectionURL</name>

      <value>jdbc:mysql://node4:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>

    </property>

 

    <property>

      <name>javax.jdo.option.ConnectionDriverName</name>

      <value>com.mysql.jdbc.Driver</value>

    </property>

 

    <property>

      <name>javax.jdo.option.ConnectionUserName</name>

      <value>root</value>

    </property>

 

    <property>

      <name>javax.jdo.option.ConnectionPassword</name>

      <value>AAAaaa111</value>

    </property>

</configuration>

 

12.  Start the metastore service:

[root@node3 ~]# hive --service metastore&

[1] 2561

Starting Hive Metastore Server

[root@hadoop-slave1 /root]# jps

2561 RunJar

The & runs the metastore service in the background.



13.  Start the Hive Server:

[root@node1 ~]# hive --service hiveserver2 &

[1] 3310

[root@hadoop-master /root]# jps

3310 RunJar

The process name is also RunJar.



Note: do not use hive --service hiveserver to start the service, otherwise an exception is thrown:

Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.hive.service.HiveServer

 

Starting the shell with the hive command already starts a hiveserver as a side effect, so in remote mode only the metastore actually needs to be started separately; after that the shell works normally. This step can therefore be skipped -- just run hive to enter the shell.



14.  Start the hive command line:

[root@hadoop-master /root]# hive

Logging initialized using configuration in jar:file:/root/hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties

hive>

Note: starting hive also starts a hiveserver along the way, so there is no need to run hive --service hiveserver2 &.



15.  Verify hive:

[root@hadoop-master /root]# hive

 

Logging initialized using configuration in jar:file:/root/hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties

hive> show tables;

OK

Time taken: 1.011 seconds

hive> create table test(id int,name string);

One of the following two exceptions may occur:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.)

 

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataStoreException: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes

com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes

 

This is caused by the database character set; log in to mysql and change it:

[root@node4 /root]# mysql -h localhost -u root -p

mysql> alter database hive character set latin1;

 

16.  Log in to MySQL to inspect the metadata:

mysql> use hive;

Then check on hadoop:

[root@node1 ~]# hadoop-2.7.2/bin/hdfs dfs -ls /user/hive/warehouse

Found 1 items

drwxr-xr-x   - root supergroup          0 2017-01-22 23:45 /user/hive/warehouse/test

11  Scala installation

1.  [root@node1 ~]# wget -O /root/scala-2.12.1.tgz http://downloads.lightbend.com/scala/2.12.1/scala-2.12.1.tgz

2.  [root@node1 ~]# tar -zxvf /root/scala-2.12.1.tgz

3.  [root@node1 ~]# vi /etc/profile

export SCALA_HOME=/root/scala-2.12.1

export PATH=.:$PATH:$JAVA_HOME/bin:$HIVE_HOME/bin:$SCALA_HOME/bin

4.  [root@node1 ~]# source /etc/profile

5.  [root@node1 ~]# scala -version

Scala code runner version 2.12.1 -- Copyright 2002-2016, LAMP/EPFL and Lightbend, Inc.

 

[root@node1 ~]# scala

Welcome to Scala 2.12.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92).

Type in expressions for evaluation. Or try :help.

 

scala> 9*9;

res0: Int = 81

 

scala>

6、    [root@node1 ~]# scp -r /root/scala-2.12.1 node2:/root

[root@node1 ~]# scp -r /root/scala-2.12.1 node3:/root

[root@node1 ~]# scp -r /root/scala-2.12.1 node4:/root

[root@node1 ~]# scp /etc/profile node2:/etc

[root@node1 ~]# scp /etc/profile node3:/etc

[root@node1 ~]# scp /etc/profile node4:/etc

[root@node2 ~]# source /etc/profile

[root@node3 ~]# source /etc/profile

[root@node4 ~]# source /etc/profile

12               Spark Installation

1、    [root@node1 ~]# wget -O /root/spark-2.1.0-bin-hadoop2.7.tgz http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz

2、    [root@node1 ~]# tar -zxvf /root/spark-2.1.0-bin-hadoop2.7.tgz

3、    [root@node1 ~]# vi /etc/profile

export SPARK_HOME=/root/spark-2.1.0-bin-hadoop2.7

export PATH=.:$PATH:$JAVA_HOME/bin:$HIVE_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin

4、    [root@node1 ~]# source /etc/profile

5、    [root@node1 ~]# cp /root/spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh.template /root/spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh

6、    [root@node1 ~]# vi /root/spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh

export SCALA_HOME=/root/scala-2.12.1

export JAVA_HOME=/root/jdk1.8.0_92

export HADOOP_CONF_DIR=/root/hadoop-2.7.2/etc/hadoop

7、    [root@node1 ~]# cp /root/spark-2.1.0-bin-hadoop2.7/conf/slaves.template /root/spark-2.1.0-bin-hadoop2.7/conf/slaves

8、    [root@node1 ~]# vi /root/spark-2.1.0-bin-hadoop2.7/conf/slaves
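The contents of the slaves file are not shown in the original; given the Worker processes that later appear on node2, node3 and node4, it presumably lists those three hostnames, one per line:

node2

node3

node4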

9、    [root@node1 ~]# scp -r /root/spark-2.1.0-bin-hadoop2.7 node2:/root

[root@node1 ~]# scp -r /root/spark-2.1.0-bin-hadoop2.7 node3:/root

[root@node1 ~]# scp -r /root/spark-2.1.0-bin-hadoop2.7 node4:/root

[root@node1 ~]# scp /etc/profile node2:/etc

[root@node1 ~]# scp /etc/profile node3:/etc

[root@node1 ~]# scp /etc/profile node4:/etc

[root@node2 ~]# source /etc/profile

[root@node3 ~]# source /etc/profile

[root@node4 ~]# source /etc/profile

10、    [root@node1 conf]# /root/spark-2.1.0-bin-hadoop2.7/sbin/start-all.sh

[root@node1 ~]# jps

2569 Master

[root@node2 ~]# jps

2120 Worker

 [root@node3 ~]# jps

2121 Worker

[root@node4 ~]# jps

2198 Worker

 

12.1 Testing

Run a test directly in the Spark shell:

[root@node1 conf]# spark-shell

val file=sc.textFile("hdfs://node1/hadoop/core-site.xml")

val rdd = file.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)

rdd.collect()

rdd.foreach(println)
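Launched this way, spark-shell runs with a local master; to run the same test on the standalone cluster started above, the shell can be attached to the master explicitly (illustrative):

[root@node1 ~]# spark-shell --master spark://node1:7077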

 

Submit the WordCount example that ships with Hadoop through spark-submit:

[root@node1 ~]# spark-submit --master spark://node1:7077 --class org.apache.hadoop.examples.WordCount --name wordcount /root/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar hdfs://node1/hadoop/core-site.xml hdfs://node1/output

However, this still runs as a MapReduce job rather than a Spark job: the example jar is written in plain Java MapReduce and does not use Spark anywhere.

 

Test with the WordCount example that ships with Spark:

spark-submit --master spark://node1:7077 --class org.apache.spark.examples.JavaWordCount --name wordcount /root/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar hdfs://node1/hadoop/core-site.xml hdfs://node1/output

This example is also written in Java, but it is built on the Spark API, so it actually runs as a Spark job.
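Note that both submissions write to hdfs://node1/output; a second run will fail if that directory already exists, so it may need to be removed first (illustrative):

[root@node1 ~]# hdfs dfs -rm -r /output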

12.2 Hive Startup Issue

Fix for Hive failing to start against Spark 2.0.0+ with the error "cannot access ../lib/spark-assembly-*.jar: No such file or directory" (Spark 2.x no longer ships a single assembly jar):

 

[root@node1 ~]# vi /root/hive-1.2.1/bin/hive

  #sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`

  sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`

[root@node1 ~]# scp /root/hive-1.2.1/bin/hive node3:/root/hive-1.2.1/bin

 

13               Cleanup and Disk Compaction

yum stores downloaded packages and headers in its cache and does not delete them automatically. Clear the yum cache:

[root@node1 ~]# yum clean all

[root@node1 ~]# dd if=/dev/zero of=/0bits bs=20M       // fill the free space with zeros; it ends with a "no space left on device" error, which can be ignored

[root@node1 ~]# rm  /0bits                           // remove the zero-fill file created above

 

Shut down the virtual machine, open cmd, cd into the VMware installation folder (e.g. D:\BOE4), and run:

vmware-vdiskmanager -k  D:\hadoop\spark\VM\node1\node1.vmdk       // note: this must be the main vmdk file, not one of the split child files

14               Common hadoop2.x Ports

 

Component | Node | Default port | Configuration | Purpose

HDFS | DataNode | 50010 | dfs.datanode.address | DataNode service port, used for data transfer

HDFS | DataNode | 50075 | dfs.datanode.http.address | HTTP service port

HDFS | DataNode | 50475 | dfs.datanode.https.address | HTTPS service port

HDFS | DataNode | 50020 | dfs.datanode.ipc.address | IPC service port

HDFS | NameNode | 50070 | dfs.namenode.http-address | HTTP service port

HDFS | NameNode | 50470 | dfs.namenode.https-address | HTTPS service port

HDFS | NameNode | 8020 | fs.defaultFS | RPC port for client connections, used to fetch filesystem metadata

HDFS | JournalNode | 8485 | dfs.journalnode.rpc-address | RPC service

HDFS | JournalNode | 8480 | dfs.journalnode.http-address | HTTP service

HDFS | ZKFC | 8019 | dfs.ha.zkfc.port | ZooKeeper FailoverController, used for NameNode HA

YARN | ResourceManager | 8032 | yarn.resourcemanager.address | RM applications manager (ASM) port

YARN | ResourceManager | 8030 | yarn.resourcemanager.scheduler.address | IPC port of the scheduler component

YARN | ResourceManager | 8031 | yarn.resourcemanager.resource-tracker.address | IPC

YARN | ResourceManager | 8033 | yarn.resourcemanager.admin.address | IPC

YARN | ResourceManager | 8088 | yarn.resourcemanager.webapp.address | HTTP service port

YARN | NodeManager | 8040 | yarn.nodemanager.localizer.address | localizer IPC

YARN | NodeManager | 8042 | yarn.nodemanager.webapp.address | HTTP service port

YARN | NodeManager | 8041 | yarn.nodemanager.address | NM container manager port

YARN | JobHistory Server | 10020 | mapreduce.jobhistory.address | IPC

YARN | JobHistory Server | 19888 | mapreduce.jobhistory.webapp.address | HTTP service port

HBase | Master | 60000 | hbase.master.port | IPC

HBase | Master | 60010 | hbase.master.info.port | HTTP service port

HBase | RegionServer | 60020 | hbase.regionserver.port | IPC

HBase | RegionServer | 60030 | hbase.regionserver.info.port | HTTP service port

HBase | HQuorumPeer | 2181 | hbase.zookeeper.property.clientPort | HBase-managed ZK mode; not used with an external ZooKeeper cluster

HBase | HQuorumPeer | 2888 | hbase.zookeeper.peerport | HBase-managed ZK mode; not used with an external ZooKeeper cluster

HBase | HQuorumPeer | 3888 | hbase.zookeeper.leaderport | HBase-managed ZK mode; not used with an external ZooKeeper cluster

Hive | Metastore | 9083 | export PORT=<port> in /etc/default/hive-metastore to change the default |

Hive | HiveServer | 10000 | export HIVE_SERVER2_THRIFT_PORT=<port> in /etc/hive/conf/hive-env.sh to change the default |

ZooKeeper | Server | 2181 | clientPort=<port> in /etc/zookeeper/conf/zoo.cfg | Port serving client requests

ZooKeeper | Server | 2888 | server.x=[hostname]:nnnnn[:nnnnn] in /etc/zookeeper/conf/zoo.cfg (first port) | Used by followers to connect to the leader; only the leader listens on it

ZooKeeper | Server | 3888 | server.x=[hostname]:nnnnn[:nnnnn] in /etc/zookeeper/conf/zoo.cfg (second port) | Used for leader election; only needed when electionAlg is 1, 2 or 3 (the default)
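To confirm which of these ports are actually listening on a node, the open sockets can be listed (illustrative; ss comes with the iproute package on CentOS 7):

[root@node1 ~]# ss -lntp | grep java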

 

15               Linux Commands

Find files larger than 10 MB:

find . -type f -size +10M  -print0 | xargs -0 du -h | sort -nr

 

List the 20 largest directories; --max-depth sets the directory depth, and removing it traverses all subdirectories:

du -hm --max-depth=5 / | sort -nr | head -20

 

find /etc -name '*srm*'  # find all files under /etc whose names contain "srm"

 

 

Clear the yum cache

yum stores downloaded packages and headers in its cache and does not remove them automatically. If they take up too much disk space, they can be cleared with yum clean: yum clean headers removes cached headers, yum clean packages removes downloaded rpm packages, and yum clean all removes everything.

 

Change the owner recursively:

chown -R -v 15040078 /tmp

 

16               Hadoop Filesystem Commands

[root@node1 ~/hadoop-2.7.2/bin]# ./hdfs dfs -chmod -R 700 /tmp
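A few other frequently used filesystem commands (illustrative additions, not from the original list):

[root@node1 ~/hadoop-2.7.2/bin]# ./hdfs dfs -ls /user/hive/warehouse

[root@node1 ~/hadoop-2.7.2/bin]# ./hdfs dfs -du -h /user/hive/warehouse

[root@node1 ~/hadoop-2.7.2/bin]# ./hdfs dfs -rm -r /output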

 
