Three Ways to Deploy a Hadoop Cluster

Reposted article · 2019-04-13 15:47 · Big Data

1. Local (Standalone) Mode

  $ mkdir input
  $ cp etc/hadoop/*.xml input
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar grep input output 'dfs[a-z.]+'
  $ cat output/*

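  Breakdown of $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar grep input output 'dfs[a-z.]+'
  Files in the input folder: capacity-scheduler.xml  core-site.xml  hadoop-policy.xml  hdfs-site.xml  httpfs-site.xml  yarn-site.xml

  bin/hadoop    the hadoop command
  jar           run a program packaged in a jar file
  share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar  path to that jar
  grep          the grep example program inside the jar
  input         the folder the grep example reads from
  output        the folder the grep example writes its results to
  'dfs[a-z.]+'  the regular expression to match

  In plain terms: find every string matching 'dfs[a-z.]+' in the files under input and write the matches, with their counts, to the output folder.
  Files produced: part-r-00000  _SUCCESS
  cat part-r-00000: 1 dfsadmin
  The only match, dfsadmin, occurs once, in hadoop-policy.xml.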
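
For intuition, the same match-and-count can be approximated locally with standard Unix tools. This is only a sketch: the file sample.xml and its contents are made up for illustration, and the real example runs as a MapReduce job rather than as a shell pipeline.

```shell
# Local approximation of the Hadoop grep example using standard Unix tools.
# Assumption: sample.xml is an invented stand-in for the files under input/.
workdir=$(mktemp -d)
printf '%s\n' \
  '<name>security.admin.operations.protocol.acl</name>' \
  '<description>ACL for AdminOperationsProtocol, used by the dfsadmin command</description>' \
  > "$workdir/sample.xml"

# -o prints only the matched text, -h omits file names, -E enables extended regex.
# sort groups equal matches so that uniq -c can count them.
count_matches() {
  grep -ohE 'dfs[a-z.]+' "$1"/*.xml | sort | uniq -c | awk '{ print $1, $2 }'
}

count_matches "$workdir"   # prints: 1 dfsadmin
```

The output has the same "count match" shape as part-r-00000 above.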

2. Pseudo-Distributed Operation

Add the following properties to etc/hadoop/core-site.xml:
<configuration>
    <!-- hella-hadoop.chris.com is the host name; it is already mapped to the machine's IP address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hella-hadoop.chris.com:8020</value>
    </property>

    <!-- Override the default temporary directory; create it first with: mkdir -p data/tmp -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/modules/hadoop-2.5.0/data/tmp</value>
    </property>

    <!-- Trash: how long deleted files are kept, in minutes -->
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>

</configuration>
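
Because fs.trash.interval is given in minutes, a quick conversion helps when choosing a value. A minimal sketch (the helper name minutes_for_days is made up here) showing that 10080 minutes is seven days:

```shell
# Convert a retention period in days to the minutes that fs.trash.interval expects.
# minutes_for_days is a hypothetical helper, not part of Hadoop.
minutes_for_days() {
  echo $(( $1 * 24 * 60 ))
}

minutes_for_days 7   # prints: 10080
```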

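etc/hadoop/hdfs-site.xml: a pseudo-distributed setup keeps one replica of each block
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- HTTP address of the SecondaryNameNode; replace HOSTNAME with your host name -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>HOSTNAME:50090</value>
    </property>
</configuration>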

Format the NameNode metadata. Run this from the Hadoop installation directory:

bin/hdfs namenode -format

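Start the NameNode. All the daemon scripts live under sbin; run ls sbin/ to list them.

sbin/hadoop-daemon.sh start namenode    # start the NameNode daemon (the master, which holds the metadata)

sbin/hadoop-daemon.sh start datanode    # start the DataNode daemon (the worker, which stores the data)

The NameNode serves a web UI on port 50070.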

Create a directory on HDFS under your user name:

bin/hdfs dfs -mkdir -p /user/chris

List it:

bin/hdfs dfs -ls -R /    # -R lists recursively

Create local folders and a wc.input file with some text in it:

mkdir wcinput
mkdir wcoutput
vi wcinput/wc.input

Create the input directory on HDFS:

bin/hdfs dfs -mkdir -p /user/chris/mapreduce/wordcount/input

Upload the local file to HDFS:

bin/hdfs dfs -put wcinput/wc.input /user/chris/mapreduce/wordcount/input/

Run MapReduce on the data in HDFS:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /user/chris/mapreduce/wordcount/input /user/chris/mapreduce/wordcount/output

The first path argument is the input path.

The second path argument is the output path.

So the MapReduce result is written to the output directory on HDFS.
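
To see what the wordcount job computes, here is a local approximation with standard Unix tools. It is a sketch only: the file and its contents are invented, and the real job distributes the same split-and-count work across map and reduce tasks.

```shell
# Local approximation of wordcount: split on whitespace, then count each word.
# Assumption: the demo file and its contents are invented for this sketch.
demo=$(mktemp)
printf 'hadoop yarn\nhadoop hdfs\n' > "$demo"

wordcount() {
  # tr puts one word per line; sort groups equal words; uniq -c counts them.
  tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | sort | uniq -c |
    awk '{ printf "%s\t%s\n", $2, $1 }'
}

wordcount "$demo"
```

Each output line is a word, a tab, and its count, which is the same layout the job writes to its part-r-00000 file.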

YARN on a Single Node

/opt/modules/hadoop-2.5.0/etc/hadoop/yarn-site.xml (under the Hadoop installation path)
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hella-hadoop.chris.com</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

That completes the YARN configuration.

In the slaves file in the same directory, add the host name or IP address of each worker; the default entry is localhost.

In yarn-env.sh and mapred-env.sh, set JAVA_HOME explicitly to avoid startup errors:

export JAVA_HOME=/home/chris/software/jdk1.8.0_201
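
Before editing the env scripts, it can help to confirm that the path really points at a JDK. A minimal sketch, assuming a valid JAVA_HOME contains an executable bin/java; check_java_home is a made-up helper name:

```shell
# Succeeds only if the given directory contains an executable bin/java.
# check_java_home is a hypothetical helper, not part of Hadoop.
check_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if check_java_home "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks usable: $JAVA_HOME"
else
  echo "JAVA_HOME is unset or has no executable bin/java"
fi
```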

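Rename mapred-site.xml.template to mapred-site.xml and add the following configuration:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Delete /user/chris/mapreduce/wordcount/output/ first, because the job fails if the output directory already exists:

bin/hdfs dfs -rm -r /user/chris/mapreduce/wordcount/output

Then run the job again:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /user/chris/mapreduce/wordcount/input /user/chris/mapreduce/wordcount/output

That completes the pseudo-distributed setup: MapReduce now runs on YARN.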

3. Fully Distributed Mode

Build on the pseudo-distributed setup: configure one machine, then distribute the installation to the others.

step1: Map IP addresses to host names

vi /etc/hosts

192.168.178.110 hella-hadoop.chris.com hella-hadoop
192.168.178.111 hella-hadoop02.chris.com hella-hadoop02
192.168.178.112 hella-hadoop03.chris.com hella-hadoop03

Add the same entries on Windows, in this file:

C:\Windows\System32\drivers\etc\hosts

192.168.178.110 hella-hadoop.chris.com hella-hadoop
192.168.178.111 hella-hadoop02.chris.com hella-hadoop02
192.168.178.112 hella-hadoop03.chris.com hella-hadoop03

For details on Linux IP and host name mapping, see:

https://www.cnblogs.com/pickKnow/p/10701914.html
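
A missing mapping is easy to overlook, so a small check can be run on each machine. The sketch below works on a throwaway copy of the entries so it does not touch the real /etc/hosts; check_hosts and demo_hosts are names made up for this example.

```shell
# Assumption: demo_hosts is a temporary copy of the three entries used in this
# article. Point check_hosts at /etc/hosts to test a real machine.
demo_hosts=$(mktemp)
cat > "$demo_hosts" <<'EOF'
192.168.178.110 hella-hadoop.chris.com hella-hadoop
192.168.178.111 hella-hadoop02.chris.com hella-hadoop02
192.168.178.112 hella-hadoop03.chris.com hella-hadoop03
EOF

# Print every requested name that has no entry; succeed only if none is missing.
check_hosts() {
  hosts_file=$1; shift
  missing=0
  for name in "$@"; do
    if ! awk -v name="$name" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit found ? 0 : 1 }' "$hosts_file"; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

check_hosts "$demo_hosts" hella-hadoop hella-hadoop02 hella-hadoop03 && echo "all names mapped"
```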

step2: Deployment (assuming three machines)

Each machine runs a different set of daemons.

Deployment:

        hella-hadoop        hella-hadoop02      hella-hadoop03
HDFS:
        NameNode
        DataNode            DataNode            DataNode
                                                SecondaryNameNode
YARN:
                            ResourceManager
        NodeManager         NodeManager         NodeManager
MapReduce:
        JobHistoryServer

Configuration files:
     * hdfs
           hadoop-env.sh
           core-site.xml
           hdfs-site.xml
           slaves
     * yarn
           yarn-env.sh
           yarn-site.xml
           slaves
     * mapreduce
           mapred-env.sh
           mapred-site.xml

step3: Edit the configuration files

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hella-hadoop.chris.com:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/app/hadoop-2.5.0/data/tmp</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>10080</value>
    </property>

</configuration>

hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hella-hadoop03.chris.com:50090</value>
    </property>    
</configuration>

slaves

hella-hadoop.chris.com
hella-hadoop02.chris.com
hella-hadoop03.chris.com

yarn-site.xml

<configuration>
   <property>
     <name>yarn.resourcemanager.hostname</name>
     <value>hella-hadoop02.chris.com</value>
   </property>
   <property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce_shuffle</value>
   </property>
   <!-- NodeManager resources -->
   <property>
     <name>yarn.nodemanager.resource.memory-mb</name>
     <value>4096</value>
   </property>
   <property>
     <name>yarn.nodemanager.resource.cpu-vcores</name>
     <value>4</value>
   </property>
   
   
   <property>
     <name>yarn.log-aggregation-enable</name>
     <value>true</value>
   </property>
   <property>
     <name>yarn.log-aggregation-retain-seconds</name>
     <value>640800</value>
   </property>
   
</configuration>
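
The yarn.log-aggregation-retain-seconds value is in seconds, which is hard to read at a glance. A small sketch (describe_seconds is a made-up helper) converts it to days and hours. It shows that 640800 is 7 days and 10 hours, while exactly seven days would be 604800; the source does not say which was intended.

```shell
# Convert a duration in seconds into whole days and leftover hours.
# describe_seconds is a hypothetical helper, not part of Hadoop.
describe_seconds() {
  days=$(( $1 / 86400 ))
  hours=$(( ($1 % 86400) / 3600 ))
  echo "${days}d ${hours}h"
}

describe_seconds 640800   # prints: 7d 10h
describe_seconds 604800   # prints: 7d 0h
```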

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hella-hadoop.chris.com:10020</value>
    </property>
    
    <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>hella-hadoop.chris.com:19888</value>
    </property>
</configuration>

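step4: Use the same installation path and the same user name on every machine in the cluster.

step5: Distribute the Hadoop installation to every node

scp -rp SOURCE_PATH USER@TARGET_HOST:TARGET_PATH

To run scp without being asked for a password each time, set up passwordless (key-based) SSH login first; see this post:

https://www.cnblogs.com/pickKnow/p/10734642.html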
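
With several target machines, the copy is usually wrapped in a loop. The sketch below is a dry run that only prints the scp commands, so they can be reviewed before anything is copied. It assumes the installation lives at /opt/app/hadoop-2.5.0 (the path used in core-site.xml above) and that the target uses the same parent directory; distribute_cmds is a made-up helper name.

```shell
# Dry run: print one scp command per target host instead of executing it.
# distribute_cmds is a hypothetical helper; the path and hosts are the ones
# used in this article and should be adapted to the real cluster.
distribute_cmds() {
  src=$1; shift
  for host in "$@"; do
    echo "scp -rp $src $host:$(dirname "$src")/"
  done
}

distribute_cmds /opt/app/hadoop-2.5.0 hella-hadoop02.chris.com hella-hadoop03.chris.com
```

Piping the printed lines to sh would run them for real once they look right.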

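step6: Start the cluster and test MapReduce

If you hit a "No route to host" error, check the host name and IP configuration, and check whether the firewall has been turned off.

For how to stop, start, and query the firewall, see this post:

https://www.cnblogs.com/pickKnow/p/10670882.html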

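4. Fully Distributed + HA

HA here refers to "HDFS High Availability Using the Quorum Journal Manager": HDFS is made highly available by keeping the edit log on a set of JournalNodes.

A plain HDFS cluster has a single point of failure (SPOF). With only one NameNode, a NameNode failure makes the whole cluster unusable until the NameNode is restarted.

HDFS HA solves this by running two NameNodes, one Active and one Standby, so the cluster always has a hot standby. If the active machine crashes or needs upgrading or maintenance, the NameNode role can be switched quickly to the other machine.

Building on the distributed setup above, the HA configuration is as follows, again assuming three machines.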

Key points:

* Shared edits: JournalNode

* NameNode: Active, Standby

* Client: proxy

* Fence: isolation, so that only one NameNode serves clients at any moment

Cluster plan:

        hella-hadoop.chris.com      hella-hadoop02.chris.com    hella-hadoop03.chris.com
        NameNode                    NameNode
        JournalNode                 JournalNode                 JournalNode
        DataNode                    DataNode                    DataNode

Because there are two NameNodes, one of them acting as the standby, a SecondaryNameNode is no longer needed.

Configuration:

core-site.xml

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://ns1</value>
    </property>

hdfs-site.xml

    <!-- Logical name of the nameservice -->
    <property>
      <name>dfs.nameservices</name>
      <value>ns1</value>
    </property>

    <!-- ns1 has two NameNodes -->
    <property>
      <name>dfs.ha.namenodes.ns1</name>
      <value>nn1,nn2</value>
    </property>

    <!-- RPC address of each NameNode -->
    <property>
      <name>dfs.namenode.rpc-address.ns1.nn1</name>
      <value>hella-hadoop.chris.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns1.nn2</name>
      <value>hella-hadoop02.chris.com:8020</value>
    </property>

    <!-- Web UI address of each NameNode -->
    <property>
      <name>dfs.namenode.http-address.ns1.nn1</name>
      <value>hella-hadoop.chris.com:50070</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.ns1.nn2</name>
      <value>hella-hadoop02.chris.com:50070</value>
    </property>

    <!-- NameNode shared edits address, i.e. the JournalNode addresses -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://hella-hadoop.chris.com:8485;hella-hadoop02.chris.com:8485;hella-hadoop03.chris.com:8485/ns1</value>
    </property>
    <!-- Local directory where each JournalNode stores its edits -->
    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/opt/app/hadoop-2.5.0/data/dfs/jn</value>
    </property>

    <!-- HDFS client failover proxy -->
    <property>
      <name>dfs.client.failover.proxy.provider.ns1</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>

    <!-- Fencing: only one NameNode may be active at a time -->
    <!-- With sshfence, the NameNode machines must be able to log in to each other over SSH without a password -->
    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence</value>
    </property>
    <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/chris/.ssh/id_rsa</value>
    </property>
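
The dfs.namenode.shared.edits.dir value is easy to mistype, since the hosts are joined with semicolons and the nameservice comes last. A minimal sketch that assembles it from its parts; qjournal_uri is a made-up helper, and the hosts, port, and nameservice are the ones used in this article.

```shell
# Build a qjournal:// URI from a nameservice, a port, and a list of JournalNode hosts.
# qjournal_uri is a hypothetical helper, not part of Hadoop.
qjournal_uri() {
  nameservice=$1; port=$2; shift 2
  joined=""
  for host in "$@"; do
    # Prefix a semicolon only when something has already been added.
    joined="${joined:+$joined;}$host:$port"
  done
  echo "qjournal://$joined/$nameservice"
}

qjournal_uri ns1 8485 hella-hadoop.chris.com hella-hadoop02.chris.com hella-hadoop03.chris.com
```

The printed string can be compared character by character with the value in hdfs-site.xml.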

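With the configuration done, distribute it to the other two machines and start the cluster.

step1: On each JournalNode machine, start the journalnode service:

$ sbin/hadoop-daemon.sh start journalnode

step2: On [nn1], format the NameNode and start it:

$ bin/hdfs namenode -format

$ sbin/hadoop-daemon.sh start namenode

step3: On [nn2], sync the metadata from nn1:

$ bin/hdfs namenode -bootstrapStandby

step4: Start [nn2]:

$ sbin/hadoop-daemon.sh start namenode

step5: Switch [nn1] to Active:

$ bin/hdfs haadmin -transitionToActive nn1

step6: From [nn1], start all the DataNodes (hadoop-daemons.sh, plural, runs the command on every host listed in slaves):

$ sbin/hadoop-daemons.sh start datanode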
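
After these steps, exactly one NameNode should be active. The state of each one can be read with bin/hdfs haadmin -getServiceState nn1 (and nn2). The sketch below only checks the reported states, so it runs without a cluster; exactly_one_active is a made-up helper name.

```shell
# Succeeds only if exactly one of the given states is "active".
# exactly_one_active is a hypothetical helper. On a real cluster, pass it the
# output of: bin/hdfs haadmin -getServiceState nn1 (and likewise for nn2).
exactly_one_active() {
  active=0
  for state in "$@"; do
    if [ "$state" = "active" ]; then
      active=$((active + 1))
    fi
  done
  [ "$active" -eq 1 ]
}

# Example with hard-coded states standing in for the haadmin output.
if exactly_one_active active standby; then
  echo "HA state looks healthy"
fi
```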

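5. Fully Distributed + HA + ZooKeeper

HA on its own only gives manual failover. For automatic failover, ZooKeeper has to monitor the cluster's services.

What ZooKeeper does:

* Election: after startup both NameNodes are Standby, and ZooKeeper elects one of them as Active

* Monitoring: ZKFC (ZooKeeper Failover Controller)

The cluster's daemons are now:

        hella-hadoop.chris.com      hella-hadoop02.chris.com    hella-hadoop03.chris.com
        NameNode                    NameNode
        ZKFC                        ZKFC
        JournalNode                 JournalNode                 JournalNode
        DataNode                    DataNode                    DataNode

ZKFC is the process that monitors the NameNode.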

Configuration:

core-site.xml

    <!-- ZooKeeper ensemble -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>hella-hadoop.chris.com:2181,hella-hadoop02.chris.com:2181,hella-hadoop03.chris.com:2181</value>
    </property>

hdfs-site.xml

    <!-- Automatic failover; relies on the ZooKeeper ensemble configured in core-site.xml -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

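With the configuration done, start the cluster and verify:

step1: Stop all HDFS services: sbin/stop-dfs.sh

step2: Start the ZooKeeper ensemble (from the ZooKeeper installation directory on each ZooKeeper machine): bin/zkServer.sh start

step3: Initialize the HA state in ZooKeeper: bin/hdfs zkfc -formatZK

step4: Start the HDFS services: sbin/start-dfs.sh

step5: Start the DFSZKFailoverController on each NameNode machine. Whichever machine starts it first gets the Active NameNode.

sbin/hadoop-daemon.sh start zkfc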

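Verification:

Use jps to find the processes, then kill the Active NameNode process with kill -9 pid.

You can check the result in the web UI on port 50070, or from the command line, to confirm that automatic failover happened: the NameNode that was Standby should now be Active.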
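
Failover takes a moment, so a command-line check usually needs to poll. The sketch below retries a state query until it reports the wanted state. wait_for_state and fake_state are made-up names; fake_state stands in for the real query, bin/hdfs haadmin -getServiceState nn2, so the example runs without a cluster.

```shell
# Run a command up to N times, one second apart, until its output equals the
# wanted state. wait_for_state is a hypothetical helper, not part of Hadoop.
wait_for_state() {
  want=$1; tries=$2; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$("$@" 2>/dev/null)" = "$want" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Stand-in for: bin/hdfs haadmin -getServiceState nn2
fake_state() { echo active; }

if wait_for_state active 3 fake_state; then
  echo "standby has taken over"
fi
```

On a real cluster, the call would be wait_for_state active 30 bin/hdfs haadmin -getServiceState nn2, run after killing the Active NameNode.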
