Hadoop YARN Resource Management: the Capacity Scheduler (Yahoo!'s Capacity Scheduler)

Reposted article, 2020-07-27 01:27, YARN

Author: 尹正傑 (yinzhengjie)

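I. Queues and subqueues

1>. Overview of YARN resource schedulers

  Recommended reading from the author:
    https://www.cnblogs.com/yinzhengjie/p/13341939.html

2>. Queue overview

  The Capacity Scheduler relies on the concept of queues to control how resources are allocated in the cluster. A (job) queue is an ordered list of jobs. When a queue is created, it is assigned a share of the cluster's resources.

  User applications are then submitted to the queue in order to use its resources. A few points about queues are worth knowing:
    (1) A queue's capacity can be given both a soft limit and a hard limit;
    (2) Applications submitted to a queue run in FIFO order;
    (3) Once an application submitted to a queue starts running, it cannot be preempted (unless preemption is enabled), but as its tasks complete, any freed resources are assigned to other queues that are running below their allowed capacity;
    (4) If a queue does not use all the resources assigned to it, the surplus can be used by other queues in the cluster, which improves the cluster's overall resource utilization;

  The Capacity Scheduler supports hierarchical queues, so that the resources of an organization (in a multi-tenant setup, several organizations share the same cluster) are shared among that organization's child queues first, before other queues are allowed to use the free resources.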

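3>. The default Capacity Scheduler queue in Apache Hadoop

  Job queues are where everything starts. Queues are defined in "${HADOOP_HOME}/etc/hadoop/capacity-scheduler.xml", which by default lives in the "etc/hadoop/" directory under the Hadoop installation directory.

  As the figure shows, root is the predefined queue, and every queue created afterwards is a child of root (for example, Apache Hadoop ships with a default child queue named "default" under root).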

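4>. Naming rules for Capacity Scheduler queues

  Every queue is named relative to its queue path, which shows the queue's position in the hierarchy. The child queues of a queue are configured with the YARN property "yarn.scheduler.capacity.<queue-path>.queues", for example:
    yarn.scheduler.capacity.root.queues
    yarn.scheduler.capacity.root.default.queues
    yarn.scheduler.capacity.root.yinzhengjie.queues
    yarn.scheduler.capacity.root.yinzhengjie.op.queues

  Tips:
    root is always the top-level queue under which all queues are created. This cannot be changed: if you rename the top-level queue, the YARN cluster throws the exception shown in the figure at startup. A queue may or may not have child queues.
    Top-level child queues (such as "default" in the figure) sit directly under root. Child queues can in turn be created under each top-level child queue, so queues can be nested.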
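The naming rule can be illustrated with a minimal Python sketch that derives the ".queues" property names from a nested description of the hierarchy. The helper name `queue_properties` and the dict layout are assumptions made for illustration only; the queue names are the ones used later in this article.

```python
def queue_properties(children, path="root"):
    """Map each parent queue's '.queues' property to its comma-separated children.

    `children` is a nested dict: queue name -> dict of that queue's own children.
    Leaf queues (empty dicts) produce no '.queues' property.
    """
    props = {}
    if children:
        props[f"yarn.scheduler.capacity.{path}.queues"] = ",".join(children)
        for name, grandchildren in children.items():
            props.update(queue_properties(grandchildren, f"{path}.{name}"))
    return props


# Hypothetical in-memory description of the hierarchy configured in this article.
hierarchy = {
    "default": {},
    "yinzhengjie": {
        "operation": {"op_queue01": {}, "op_queue02": {}},
        "development": {},
        "testing": {},
    },
}

for key, value in queue_properties(hierarchy).items():
    print(f"{key} = {value}")
```

Note that the queue path is inserted directly after "yarn.scheduler.capacity."; the word "queues" appears only once, at the end of the property name.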

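5>. Hierarchical queues

  For finer-grained control over resource allocation, child queues can be configured under each queue. These are called hierarchical queues, and they let the applications of a given organization make effective use of all the resources allocated to it.

  A queue's surplus or idle resources are made available to other queues only after the resource demands of its own child queues have been met.

  Besides a queue's guaranteed share and maximum capacity, an administrator can also limit the following:
    (1) The maximum amount of resources a specific user may use;
    (2) The number of pending tasks per queue (or per user);
    (3) The number of active (or accepted) jobs per queue (or per user);
    (4) Capacity guarantees and elasticity;

  The figure shows a typical example of a Capacity Scheduler queue hierarchy.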

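6>. Capacity guarantees

  The main goal of the Capacity Scheduler is predictable resource sharing. It achieves this by providing capacity guarantees for the configured job queues. Applications submitted to a queue have access to that queue's capacity.

  Each queue is assigned a fraction of the cluster's capacity, so that capacity is reserved for the queue. The capacity assigned to a queue can be given a soft limit and an (optional) hard limit.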

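7>. Queue elasticity

  To make full use of cluster resources, the scheduler also allows queues a degree of elasticity: if the cluster has idle resources, a queue can use more than its configured capacity.

  Elasticity here means that, depending on whether resources are available, the cluster can allocate more (or less) than the originally configured amount. An overloaded job queue can therefore use the unused capacity of other queues in the cluster, which makes the best use of cluster resources.

  Of course, as other queues become busier and claim their guaranteed capacity, Hadoop reclaims the excess resources that were lent out. To stop a queue from using more than a chosen amount beyond its configured capacity, an upper limit can be set on its elasticity.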
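As a rough illustration of how the guaranteed capacity, idle capacity and the elasticity ceiling interact, here is a minimal Python sketch. It is a deliberately simplified model, not the scheduler's actual algorithm; the function name `usable_percent` and its parameters are assumptions made for this example, with all values expressed as percentages of the parent queue.

```python
def usable_percent(configured, idle_in_siblings, maximum=100.0):
    """Percent of the parent's capacity a queue may use right now.

    configured       -- the queue's guaranteed capacity (soft limit)
    idle_in_siblings -- capacity currently unused by sibling queues
    maximum          -- the elasticity ceiling (hard limit), 100 if unset
    """
    return min(configured + idle_in_siblings, maximum)


print(usable_percent(30, 70))              # siblings idle, no ceiling
print(usable_percent(30, 70, maximum=50))  # siblings idle, ceiling at 50
print(usable_percent(30, 0, maximum=50))   # siblings busy, only the guarantee
```

The middle case shows the trade-off: a ceiling keeps a queue from crowding out others, at the cost of leaving idle capacity unused.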

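8>. Elements of the Capacity Scheduler

  So far we have covered the basic configuration elements of the Capacity Scheduler. Next we look at how to set the scheduler up in a cluster, which takes two steps:
    (1) Define the queues;
    (2) Configure the capacity of each queue;

  The queue element in the Capacity Scheduler configuration file (${HADOOP_HOME}/etc/hadoop/capacity-scheduler.xml) is the scheduler's key scheduling unit, and everything revolves around it. Configuring the Capacity Scheduler therefore starts with configuring queues.

  The Capacity Scheduler can have multiple queues, each with the following attributes:
    (1) The queue name and the full queue path;
    (2) The list of child queues and applications;
    (3) The list of users and their resource allocation limits;
    (4) The queue's guaranteed capacity and maximum capacity;
    (5) The queue's state (running or stopped);
    (6) The queue's access control, expressed as an Access Control List (ACL);

  All of these attributes are specified in the scheduler's configuration file (${HADOOP_HOME}/etc/hadoop/capacity-scheduler.xml), which is normally located in the "etc/hadoop/" directory under the Hadoop installation directory.

  Tips:
    As the figure shows, the "yarn.admin.acl" property in "${HADOOP_HOME}/etc/hadoop/yarn-site.xml" controls who may run "yarn rmadmin -refreshQueues" to apply changes made to "capacity-scheduler.xml".
      <property>
        <name>yarn.admin.acl</name>
        <value>yinzhengjie</value>
        <description>ACL that specifies who may administer the YARN cluster. The default is "*", which means any user can administer it.</description>
      </property>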

[root@hadoop101.yinzhengjie.com ~]# yarn rmadmin -help
rmadmin is the command to execute YARN administrative commands.
The full syntax is: 

yarn rmadmin [-refreshQueues] [-refreshNodes [-g|graceful [timeout in seconds] -client|server]] [-refreshNodesResources] [-refreshSuperUserGroupsConfiguration] [-refreshUserToGroupsMappings
] [-refreshAdminAcls] [-refreshServiceAcl] [-getGroup [username]] [-addToClusterNodeLabels <"label1(exclusive=true),label2(exclusive=false),label3">] [-removeFromClusterNodeLabels <label1,label2,label3>] [-replaceLabelsOnNode <"node1[:port]=label1,label2 node2[:port]=label1"> [-failOnUnknownNodes]] [-directlyAccessNodeLabelStore] [-refreshClusterMaxPriority] [-updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]) [-help [cmd]]
   -refreshQueues: Reload the queues' acls, states and scheduler specific properties. 
        ResourceManager will reload the mapred-queues configuration file.
   -refreshNodes [-g|graceful [timeout in seconds] -client|server]: Refresh the hosts information at the ResourceManager. Here [-g|graceful [timeout in seconds] -client|server] is optional,
 if we specify the timeout then ResourceManager will wait for timeout before marking the NodeManager as decommissioned. The -client|server indicates if the timeout tracking should be handled by the client or the ResourceManager. The client-side tracking is blocking, while the server-side tracking is not. Omitting the timeout, or a timeout of -1, indicates an infinite timeout. Known Issue: the server-side tracking will immediately decommission if an RM HA failover occurs.   -refreshNodesResources: Refresh resources of NodeManagers at the ResourceManager.
   -refreshSuperUserGroupsConfiguration: Refresh superuser proxy groups mappings
   -refreshUserToGroupsMappings: Refresh user-to-groups mappings
   -refreshAdminAcls: Refresh acls for administration of ResourceManager
   -refreshServiceAcl: Reload the service-level authorization policy file. 
        ResourceManager will reload the authorization policy file.
   -getGroups [username]: Get the groups which given user belongs to.
   -addToClusterNodeLabels <"label1(exclusive=true),label2(exclusive=false),label3">: add to cluster node labels. Default exclusivity is true
   -removeFromClusterNodeLabels <label1,label2,label3> (label splitted by ","): remove from cluster node labels
   -replaceLabelsOnNode <"node1[:port]=label1,label2 node2[:port]=label1,label2"> [-failOnUnknownNodes] : replace labels on nodes (please note that we do not support specifying multiple lab
els on a single host for now.)        [-failOnUnknownNodes] is optional, when we set this option, it will fail if specified nodes are unknown.
   -directlyAccessNodeLabelStore: This is DEPRECATED, will be removed in future releases. Directly access node label store, with this option, all node label related operations will not conn
ect RM. Instead, they will access/modify stored node labels directly. By default, it is false (access via RM). AND PLEASE NOTE: if you configured yarn.node-labels.fs-store.root-dir to a local directory (instead of NFS or HDFS), this option will only work when the command run on the machine where RM is running.   -refreshClusterMaxPriority: Refresh cluster max priority
   -updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]): Update resource on specific node.
   -help [cmd]: Displays help for the given command or all commands if none is specified.

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

[root@hadoop101.yinzhengjie.com ~]#

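9>. Queue creation example

  The following example defines a Capacity Scheduler configuration for the hierarchy drawn in the figure of item 5 (hierarchical queues).

  Once the file is configured, run "yarn rmadmin -refreshQueues" to reload the queue configuration; the YARN cluster does not need to be restarted.

  Tips:
    (1) Jobs cannot be submitted directly to root or to any other parent queue; they can only be submitted to leaf queues.
    (2) Turning a running leaf queue into a parent queue (that is, creating child queues under a leaf queue whose state is RUNNING) requires restarting the YARN cluster;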

[root@hadoop101.yinzhengjie.com ~]# vim ${HADOOP_HOME}/etc/hadoop/capacity-scheduler.xml
[root@hadoop101.yinzhengjie.com ~]# 
[root@hadoop101.yinzhengjie.com ~]# cat ${HADOOP_HOME}/etc/hadoop/capacity-scheduler.xml
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

  <!-- root is always the top-level queue under which all queues are created, so we first create 2 top-level child queues under it. -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,yinzhengjie</value>
    <description>Child queues of the root queue. The default value is "default".</description>
  </property>

  <!-- Note: once the child queues of the top-level queue are defined, their capacities must be set next. If this step is skipped, the ResourceManager fails to start. -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.capacity</name>
    <value>80</value>
    <description>Capacity of the yinzhengjie child queue under root: 80% of the whole cluster's resources.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>20</value>
    <description>Capacity of the default child queue under root: 20% of the whole cluster's resources.</description>
  </property>

  <!--
    A top-level child queue can be divided further. Here the yinzhengjie queue is split into 3 child queues: "operation", "development" and "testing".

    The queues configured below are related as follows:
        (1) "yinzhengjie" is the parent queue of "operation", "development" and "testing";
        (2) "operation", "development" and "testing" are child queues of "yinzhengjie";

    Tips:
        Jobs cannot be submitted directly to a parent queue, only to leaf queues (queues that have no child queues).
    -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.queues</name>
    <value>operation,development,testing</value>
    <description>Defines three child queues under the top-level child queue "yinzhengjie": "operation", "development" and "testing".</description>
  </property>

  <!--
       Capacity is assigned by percentage to the 3 child queues of "yinzhengjie" ("operation", "development" and "testing"); the percentages add up to 100%.

       Note:
       The capacities of the child queues add up to the total capacity of the parent queue, which is itself limited by the resources of the queue above it;
       in other words, "operation", "development" and "testing" share a guaranteed capacity of 80% of the cluster, because that is the capacity configured for "yinzhengjie".
   -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.capacity</name>
    <value>85</value>
    <description>Capacity of the "operation" queue as a percentage of its parent queue "yinzhengjie": 85% of "yinzhengjie", i.e. 68% of the cluster.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.development.capacity</name>
    <value>10</value>
    <description>Capacity of the "development" queue as a percentage of its parent queue "yinzhengjie": 10% of "yinzhengjie", i.e. 8% of the cluster.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.testing.capacity</name>
    <value>5</value>
    <description>Capacity of the "testing" queue as a percentage of its parent queue "yinzhengjie": 5% of "yinzhengjie", i.e. 4% of the cluster.</description>
  </property>

  <!--
    What follows largely repeats the steps above:
        (1) Create 2 child queues under "root.yinzhengjie.operation": "op_queue01" and "op_queue02";
        (2) Configure the capacity percentages of these leaf queues;
   -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.queues</name>
    <value>op_queue01,op_queue02</value>
    <description>Defines two child queues under the "operation" queue: "op_queue01" and "op_queue02".</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue01.capacity</name>
    <value>70</value>
    <description>Capacity of the "op_queue01" leaf queue as a percentage of its parent queue "operation": 70% of "operation", i.e. 47.6% of the cluster.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.capacity</name>
    <value>30</value>
    <description>Capacity of the "op_queue02" leaf queue as a percentage of its parent queue "operation": 30% of "operation", i.e. 20.4% of the cluster.</description>
  </property>

</configuration>
[root@hadoop101.yinzhengjie.com ~]#
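Because the capacities of the child queues under each parent must add up to 100, it can be handy to check a configuration before refreshing the queues. The following Python sketch is one possible way to do that with the standard library; the helper names `load_properties` and `capacity_problems` and the shortened `SAMPLE` document are assumptions for illustration, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

PREFIX = "yarn.scheduler.capacity."
SUFFIX = ".queues"


def load_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.iter("property")
    }


def capacity_problems(props):
    """Return {parent queue path: sum of child capacities} where the sum is not 100."""
    problems = {}
    for key, value in props.items():
        if not (key.startswith(PREFIX) and key.endswith(SUFFIX)):
            continue
        parent = key[len(PREFIX):-len(SUFFIX)]
        # A child without a capacity entry counts as 0 in this sketch.
        total = sum(
            float(props.get(f"{PREFIX}{parent}.{child.strip()}.capacity", 0))
            for child in value.split(",")
        )
        if abs(total - 100.0) > 1e-6:
            problems[parent] = total
    return problems


# Shortened stand-in for capacity-scheduler.xml (top level of this article's example).
SAMPLE = """
<configuration>
  <property><name>yarn.scheduler.capacity.root.queues</name><value>default,yinzhengjie</value></property>
  <property><name>yarn.scheduler.capacity.root.default.capacity</name><value>20</value></property>
  <property><name>yarn.scheduler.capacity.root.yinzhengjie.capacity</name><value>80</value></property>
</configuration>
"""

print(capacity_problems(load_properties(SAMPLE)))
```

To check a real file, the text passed to `load_properties` could be read from "${HADOOP_HOME}/etc/hadoop/capacity-scheduler.xml"; an empty result means every parent's children sum to 100.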

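II. How the cluster allocates resources

1>. Overview of cluster resources

  The first principle of queue resource allocation is that Hadoop never lets capacity sit idle. If a queue is not using its configured capacity, other queues pick up those resources, which means they use more than their own configured capacity. This is the elasticity described above.

  Hadoop decides how to distribute resources among the cluster's queues according to how much of its configured capacity each queue is currently using. It allocates available resources first to the queue that is using the smallest fraction of its configured capacity.

  The lower a queue's current usage, the higher its priority for receiving additional resources from the cluster. Once a parent queue obtains additional resources, it applies exactly the same rule and hands them first to the leaf queue that is currently using the smallest fraction of its configured capacity.

  Consider an example. Suppose the cluster has 100 TB of memory, divided according to the capacity ratios of the configuration above, so that root is split into the two top-level child queues "yinzhengjie" and "default".
    Taking the "yinzhengjie" queue as the example, assume the following three conditions hold:
      (1) The "operation" queue has no jobs running, so all of its allocated capacity (about 68 TB of memory, which is 85% of the 80 TB of the "yinzhengjie" queue) is idle;
      (2) The other two queues, "development" and "testing", are making full use of their configured capacity (say more than 90% utilization);
      (3) Two DevOps engineers, users "Jason" and "Tom", each submit an application to the "op_queue02" leaf queue;
    The cluster then allocates resources as follows:
      (1) The configured capacity of "op_queue02" is only 30% of its parent queue ("operation"), about 20.4 TB in total;
      (2) However, because no jobs are running in the other leaf queue of the same parent ("op_queue01"), the scheduler can allocate 20.4 TB of memory to each of the two users, so "op_queue02" as a whole runs above its configured capacity;
      (3) If more users submit jobs to "op_queue02", they can go on to take up the remaining resources of the parent queue "operation", because nobody is submitting jobs to the other leaf queue ("op_queue01");
    In short, everything works well as long as nobody submits jobs to "op_queue01". What happens when somebody does?
      (1) Because the queue's resources are already in use by the jobs running in "op_queue02", the new job has to wait until those jobs start releasing capacity;
      (2) Over time, the resource usage of the two leaf queues converges to the configured 70:30 ratio ("op_queue01" is configured with 70% of the parent queue "operation", and "op_queue02" with 30%);
      (3) If you do not want users to wait for the capacity guaranteed to their queue, you must enable preemption (covered in detail later).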
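The figures in this example follow from multiplying the capacity percentages along each queue path. A minimal Python sketch of that arithmetic, assuming the 100 TB cluster and the percentages configured earlier (the helper name `absolute_share` is made up for illustration):

```python
def absolute_share(percent_chain):
    """Multiply the capacity percentages along a queue path into a cluster fraction."""
    share = 1.0
    for percent in percent_chain:
        share *= percent / 100.0
    return share


CLUSTER_TB = 100  # cluster memory assumed in the example

# Capacity percentages from root down to each queue, as configured earlier.
queues = {
    "root.default": [20],
    "root.yinzhengjie": [80],
    "root.yinzhengjie.operation": [80, 85],
    "root.yinzhengjie.development": [80, 10],
    "root.yinzhengjie.testing": [80, 5],
    "root.yinzhengjie.operation.op_queue01": [80, 85, 70],
    "root.yinzhengjie.operation.op_queue02": [80, 85, 30],
}

for path, chain in queues.items():
    print(f"{path}: {absolute_share(chain) * CLUSTER_TB:.1f} TB")
```

For "op_queue02" this gives 100 TB x 0.80 x 0.85 x 0.30, the roughly 20.4 TB quoted in the example, and 68 TB for its parent "operation".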

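  Tips:
    In my experience, the two most important aspects of using the Capacity Scheduler are capacity balance and elasticity. There is a trade-off between them: setting rigid capacity limits (a configured maximum capacity) makes a queue less elastic, which works against one of the key goals of the Capacity Scheduler.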

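2>. Limiting user capacity

  We now know how to create queues and leaf queues and how to configure their capacity. Next comes the most important entity in the cluster: the users who submit jobs to the configured queues.

  The Capacity Scheduler orders applications on a FIFO basis (note that this does not mean the FIFO scheduler), so jobs submitted earlier run ahead of jobs submitted later.

  The resources given to the users who run jobs in a leaf queue can be limited. The following parameters limit how much of a leaf queue's capacity a user may consume:
    [root@hadoop101.yinzhengjie.com ~]# vim ${HADOOP_HOME}/etc/hadoop/capacity-scheduler.xml
    ...
    <!-- Parameters that limit user capacity -->
    <property>
      <name>yarn.scheduler.capacity.root.yinzhengjie.operation.user-limit-factor</name>
      <value>2</value>
      <description>
        Sets the maximum capacity a single user may use in a leaf queue, as a multiple of the queue's configured capacity. Values between 0.0 and 1.0 restrict a user to that fraction of the configured capacity. The default is 1, which means a user can use at most the full configured capacity of the leaf queue.
        A value greater than 1 lets a user go beyond the leaf queue's configured capacity. For example, 2 means a user can use up to twice the configured capacity.
        A value of 0.25 means a user can use only a quarter of the queue's configured capacity.
      </description>
    </property>

    <property>
      <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.maximum-capacity</name>
      <value>50</value>
      <description>
        Sets a hard limit on capacity. The default is 100. Set this parameter to make sure the users of a queue cannot take all of the parent queue's capacity.
        Here, users who submit jobs to the "root.yinzhengjie.operation.op_queue02" leaf queue cannot use more than 50% of the capacity of the "root.yinzhengjie.operation" queue.
      </description>
    </property>

    <property>
      <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.minimum-user-limit-percent</name>
      <value>10</value>
      <description>
        Suppose a leaf queue is configured to use 500 GB of RAM. What if 20 users submit jobs to it? You could let the containers of all 20 users take 25 GB each, but that would be too slow.
        This parameter controls the minimum percentage of the leaf queue's resources allocated to each user.
        A value of 10 means that each user running applications in this queue is allocated at least 10% of the capacity configured for the op_queue02 leaf queue.
        The first user to submit jobs to this leaf queue can use 100% of its resources; as other users start submitting jobs, each user eventually settles at 10% of the queue's resources.
        As a result, only 10 users can use the queue at any one time; other users must wait until one of those 10 releases resources before their submitted jobs can run in turn.
      </description>
    </property>
    ...
    [root@hadoop101.yinzhengjie.com ~]#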
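The effect of the minimum user limit can be sketched with a little arithmetic. The Python below is a simplified model of the behaviour described above (equal shares among active users, never below the configured minimum), not the scheduler's real code; both function names are assumptions made for illustration.

```python
def per_user_share(active_users, min_user_limit_percent):
    """Approximate percent of the queue each active user can get.

    Users split the queue evenly, but a user's share never drops
    below the configured minimum percentage.
    """
    if active_users < 1:
        raise ValueError("at least one active user is required")
    return max(100.0 / active_users, float(min_user_limit_percent))


def max_concurrent_users(min_user_limit_percent):
    """Number of users that can hold their minimum share at the same time."""
    return int(100 // min_user_limit_percent)


for users in (1, 2, 3, 10, 20):
    print(users, "active user(s):", round(per_user_share(users, 10), 1), "% each")
print("users served concurrently:", max_concurrent_users(10))
```

With a minimum of 10, the share per user shrinks from 100% to 50% to about 33% as users arrive and then stays at 10%, so the eleventh user has to wait.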

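3>. Limiting the number of applications

  A single user or queue could monopolize the cluster's resources. To avoid overloading the cluster, you can limit the maximum number of applications that can be scheduled in the cluster at any given time.

  The key parameters for limiting the number of applications are:
    [root@hadoop101.yinzhengjie.com ~]# vim ${HADOOP_HOME}/etc/hadoop/capacity-scheduler.xml
    ...
    <!-- Limit the number of applications -->
    <property>
      <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.maximum-applications</name>
      <value>5000</value>
      <description>
        Sets an upper bound on the number of applications submitted to the Capacity Scheduler, that is, a hard limit on the number of applications that can be active at any given time. The default for the root queue is 10000.
        For top-level child queues and leaf queues the limit is derived from that value. For example, the maximum number of applications of the default queue is:
        default_max_applications = root_max_applications * (100 - yarn.scheduler.capacity.root.yinzhengjie.capacity) / 100
        which gives 2000 (10000 * (100 - 80) / 100 = 10000 * 0.2), because default holds the 20% of the cluster that is not assigned to yinzhengjie.
      </description>
    </property>

    <property>
      <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
      <value>0.1</value>
      <description>
        Sets the fraction of cluster resources that all running ApplicationMasters together may use, which in turn controls the number of concurrently running applications. The value is a float; the default is 0.1 (10%).
        With 0.1, all ApplicationMasters together cannot take more than 10% of the cluster's resources (this refers to the RAM allocated to the ApplicationMaster container, the first container created for an application).
      </description>
    </property>
    ...
    [root@hadoop101.yinzhengjie.com ~]#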
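The derived limits can be reproduced with a few lines of Python. This is a sketch of the arithmetic described above, assuming the cluster-wide default of 10000 applications and the capacities used in this article; the function names are hypothetical.

```python
def queue_max_applications(cluster_max_applications, percent_chain):
    """Derive a queue's application limit from its absolute capacity.

    percent_chain holds the capacity percentages from root down to the queue.
    """
    share = 1.0
    for percent in percent_chain:
        share *= percent / 100.0
    return round(cluster_max_applications * share)


def am_resource_limit(cluster_resource, max_am_resource_percent):
    """Resources that all ApplicationMasters together may occupy."""
    return cluster_resource * max_am_resource_percent


print(queue_max_applications(10000, [20]))          # root.default
print(queue_max_applications(10000, [80, 85, 30]))  # op_queue02, if no explicit value were set
print(am_resource_limit(100, 0.1))                  # TB of a 100 TB cluster, at 0.1
```

An explicit "maximum-applications" value on a queue, such as the 5000 configured above, replaces the derived figure for that queue.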

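4>. Preemption

  Preempting an application means that containers of other applications may have to be killed to make room for the new application (its ApplicationMaster).

  Use preemption if you do not want later applications to wait in a leaf queue because the other applications running there are holding all the allocated resources.

  In that situation the queue has a configured capacity, yet no resources are available to give it. Killing ApplicationMaster containers is only a last resort; containers that have not started executing are killed first.

  YARN can preempt jobs in the following two ways (both are expressed in Fair Scheduler terms, where a resource pool is the Fair Scheduler's name for a queue):
    Minimum share preemption:
      applies when a resource pool's usage is below its configured minimum share.
    Fair share preemption:
      applies when a resource pool is running below its fair share.
    Tips:
      (1) Of the two, minimum share preemption is the stricter. It kicks in as soon as a pool has been running below its minimum share for a certain time, which is set by the minimum share preemption timeout.
      (2) Fair share preemption is less aggressive. It only kicks in after a pool has been running below half of its fair share for a certain time, which is set by the fair share preemption timeout.
      (3) Once preemption starts, a pool below its minimum share can grow back to its minimum share, and a pool below 50% of its fair share keeps growing until it reaches its fair share.

  Several preemption-related parameters can be set in yarn-site.xml:
    (1) As the figure shows, Apache Hadoop disables preemption by default; setting "yarn.resourcemanager.scheduler.monitor.enable" to true enables it.
    (2) The "yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round" parameter controls the pace of preemption by setting the maximum percentage of resources preempted in one round of periodic monitoring.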
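To get a feel for what the per-round limit means, the sketch below estimates how many monitoring rounds it takes to reclaim a given amount of resources. It is a back-of-the-envelope model only: it assumes each round can reclaim at most the configured fraction of the whole cluster and ignores every other preemption setting. The function name and the default of 0.1 for the fraction are assumptions of this example.

```python
import math


def rounds_to_reclaim(cluster_total, to_reclaim, per_round_fraction=0.1):
    """Estimate the monitoring rounds needed to preempt `to_reclaim` resources.

    Each round is assumed to reclaim at most per_round_fraction of the cluster.
    """
    if not 0 < per_round_fraction <= 1:
        raise ValueError("per_round_fraction must be in (0, 1]")
    per_round = cluster_total * per_round_fraction
    return math.ceil(to_reclaim / per_round)


# Reclaiming the 20.4 TB lent out by op_queue01 in the 100 TB example.
print(rounds_to_reclaim(100, 20.4))
print(rounds_to_reclaim(100, 20.4, per_round_fraction=0.5))
```

A larger fraction returns guaranteed capacity sooner, but kills more containers in one go.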

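5>. Enabling the Capacity Scheduler

  The ResourceManager must be configured before the Capacity Scheduler can be used in the cluster.

  Set the "yarn.resourcemanager.scheduler.class" parameter in "${HADOOP_HOME}/etc/hadoop/yarn-site.xml" as follows to use the Capacity Scheduler.
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
      <description>The scheduler used by the ResourceManager (the Capacity Scheduler shown here is the default)</description>
    </property>

  As the figure shows, the Capacity Scheduler is the default scheduler of Apache Hadoop, so this step can be skipped unless you want to explicitly select a different scheduler such as the Fair Scheduler.

  Tips:
    The Capacity Scheduler (from Yahoo!) is the default scheduler of Apache Hadoop, whereas in some Hadoop distributions, such as Cloudera, the Fair Scheduler (from Facebook) is the default.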

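6>. Managing queue state

  A queue can be stopped or started at any time, at the root or at any other level of the hierarchy, and "yarn rmadmin -refreshQueues" applies the change without restarting the whole YARN cluster.

  A queue has two possible states, STOPPED and RUNNING; the default is RUNNING.

  Tips:
    (1) Stopping root or a parent queue makes its leaf queues inactive (that is, STOPPED).
    (2) When a running queue is stopped (changed from RUNNING to STOPPED), the applications already running in it continue until they complete, and no new applications can be submitted to the queue.
    (3) If a parent queue is STOPPED, its child queues cannot be configured as RUNNING; doing so throws the exception shown in the figure.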
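The way a stopped parent affects its children can be modelled in a few lines. The Python sketch below is an illustration of the rule described in the tips, not ResourceManager code; the function name `effective_state` and the `configured` dict are assumptions for this example.

```python
def effective_state(queue_path, configured):
    """Return the state a queue effectively has, given the configured states.

    A queue accepts applications only if it and every ancestor are RUNNING;
    queues without an explicit entry default to RUNNING.
    """
    parts = queue_path.split(".")
    for depth in range(1, len(parts) + 1):
        ancestor = ".".join(parts[:depth])
        if configured.get(ancestor, "RUNNING") == "STOPPED":
            return "STOPPED"
    return "RUNNING"


# Hypothetical situation: only the "operation" parent queue has been stopped.
configured = {"root.yinzhengjie.operation": "STOPPED"}

print(effective_state("root.yinzhengjie.operation.op_queue02", configured))
print(effective_state("root.yinzhengjie.development", configured))
```

In this model the leaf queues under "operation" become inactive, while "development" and "default" keep accepting applications.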

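7>. Recommended reading

  Recommended reading from the author:
    https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

III. A typical Capacity Scheduler example

1>. A complete Capacity Scheduler configuration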

[root@hadoop101.yinzhengjie.com ~]# cat ${HADOOP_HOME}/etc/hadoop/capacity-scheduler.xml
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

  <!-- root is always the top-level queue under which all queues are created, so we first create 2 top-level child queues under it. -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,yinzhengjie</value>
    <description>Child queues of the root queue. The default value is "default".</description>
  </property>

  <!-- Note: once the child queues of the top-level queue are defined, their capacities must be set next. If this step is skipped, the ResourceManager fails to start. -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.capacity</name>
    <value>80</value>
    <description>Capacity of the yinzhengjie child queue under root: 80% of the whole cluster's resources.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>20</value>
    <description>Capacity of the default child queue under root: 20% of the whole cluster's resources.</description>
  </property>

  <!--
    A top-level child queue can be divided further. Here the yinzhengjie queue is split into 3 child queues: "operation", "development" and "testing".

    The queues configured below are related as follows:
        (1) "yinzhengjie" is the parent queue of "operation", "development" and "testing";
        (2) "operation", "development" and "testing" are child queues of "yinzhengjie";

    Tips:
        Jobs cannot be submitted directly to a parent queue, only to leaf queues (queues that have no child queues).
    -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.queues</name>
    <value>operation,development,testing</value>
    <description>Defines three child queues under the top-level child queue "yinzhengjie": "operation", "development" and "testing".</description>
  </property>

  <!--
       Capacity is assigned by percentage to the 3 child queues of "yinzhengjie" ("operation", "development" and "testing"); the percentages add up to 100%.

       Note:
       The capacities of the child queues add up to the total capacity of the parent queue, which is itself limited by the resources of the queue above it;
       in other words, "operation", "development" and "testing" share a guaranteed capacity of 80% of the cluster, because that is the capacity configured for "yinzhengjie".
   -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.capacity</name>
    <value>85</value>
    <description>Capacity of the "operation" queue as a percentage of its parent queue "yinzhengjie": 85% of "yinzhengjie", i.e. 68% of the cluster.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.development.capacity</name>
    <value>10</value>
    <description>Capacity of the "development" queue as a percentage of its parent queue "yinzhengjie": 10% of "yinzhengjie", i.e. 8% of the cluster.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.testing.capacity</name>
    <value>5</value>
    <description>Capacity of the "testing" queue as a percentage of its parent queue "yinzhengjie": 5% of "yinzhengjie", i.e. 4% of the cluster.</description>
  </property>

  <!--
    What follows largely repeats the steps above:
        (1) Create 2 child queues under "root.yinzhengjie.operation": "op_queue01" and "op_queue02";
        (2) Configure the capacity percentages of these leaf queues;
   -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.queues</name>
    <value>op_queue01,op_queue02</value>
    <description>Defines two child queues under the "operation" queue: "op_queue01" and "op_queue02".</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue01.capacity</name>
    <value>70</value>
    <description>Capacity of the "op_queue01" leaf queue as a percentage of its parent queue "operation": 70% of "operation", i.e. 47.6% of the cluster.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.capacity</name>
    <value>30</value>
    <description>Capacity of the "op_queue02" leaf queue as a percentage of its parent queue "operation": 30% of "operation", i.e. 20.4% of the cluster.</description>
  </property>

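  <!-- Parameters that limit user capacity -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.user-limit-factor</name>
    <value>2</value>
    <description>
    Sets the maximum capacity a single user may use in a leaf queue, as a multiple of the queue's configured capacity. Values between 0.0 and 1.0 restrict a user to that fraction of the configured capacity. The default is 1, which means a user can use at most the full configured capacity of the leaf queue.
    A value greater than 1 lets a user go beyond the leaf queue's configured capacity. For example, 2 means a user can use up to twice the configured capacity.
    A value of 0.25 means a user can use only a quarter of the queue's configured capacity.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.maximum-capacity</name>
    <value>50</value>
    <description>
    Sets a hard limit on capacity. The default is 100. Set this parameter to make sure the users of a queue cannot take all of the parent queue's capacity.
    Here, users who submit jobs to the "root.yinzhengjie.operation.op_queue02" leaf queue cannot use more than 50% of the capacity of the "root.yinzhengjie.operation" queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.minimum-user-limit-percent</name>
    <value>10</value>
    <description>
    Suppose a leaf queue is configured to use 500 GB of RAM. What if 20 users submit jobs to it? You could let the containers of all 20 users take 25 GB each, but that would be too slow.
    This parameter controls the minimum percentage of the leaf queue's resources allocated to each user. A value of 10 means that each user running applications in this queue is allocated at least 10% of the capacity configured for the op_queue02 leaf queue.
    In short, this parameter sets the minimum resource percentage per user, while the maximum depends on the number of users running applications. It works as follows:
        (1) The first user to submit jobs to this leaf queue can use 100% of the leaf queue's resources;
        (2) When a second user submits jobs to this leaf queue, each user gets 50% of the queue's resources;
        (3) When a third user submits an application to the queue, every user is limited to 33% of the queue;
        (4) As more users submit jobs to the queue, each user eventually settles at 10% of the queue's resources and never drops below that value, which is the purpose of the minimum percentage;
        (5) Note that only 10 users can use the queue at any one time (because 10 users already take up all of the queue's resources); other users must wait until one of those 10 releases resources before their submitted jobs can run in turn;
    </description>
  </property>

  <!-- Limit the number of applications -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.maximum-applications</name>
    <value>5000</value>
    <description>
    Sets an upper bound on the number of applications submitted to the Capacity Scheduler, that is, a hard limit on the number of applications that can be active at any given time. The default for the root queue is 10000.
    For top-level child queues and leaf queues the limit is derived from that value. For example, the maximum number of applications of the default queue is:
        default_max_applications = root_max_applications * (100 - yarn.scheduler.capacity.root.yinzhengjie.capacity) / 100
    which gives 2000 (10000 * (100 - 80) / 100 = 10000 * 0.2), because default holds the 20% of the cluster that is not assigned to yinzhengjie.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.2</value>
    <description>
    Sets the fraction of cluster resources that all running ApplicationMasters together may use, which in turn controls the number of concurrently running applications. The value is a float; the default is 0.1 (10%).
    With 0.2, all ApplicationMasters together cannot take more than 20% of the cluster's resources (this refers to the RAM allocated to the ApplicationMaster container, the first container created for an application).
    </description>
  </property>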

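  <!-- Queue administration permissions -->
  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.acl_administer_queue</name>
    <value> hadoop_admin</value>
    <description>
    Specifies who may administer the root.yinzhengjie.operation.op_queue02 leaf queue. The value "*" means anyone can administer the queue.
    Queue administrators can be configured to perform queue administration tasks such as submitting applications to the queue, killing applications, stopping the queue and viewing queue information.
    An ACL value is a comma-separated list of users, then a space, then a comma-separated list of groups, so a value that names only a group starts with a space.
    The value configured here names the "hadoop_admin" group, which means every user in that group can administer the "root.yinzhengjie.operation.op_queue02" queue.
    </description>
  </property>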

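  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.acl_submit_applications</name>
    <value>jason,yinzhengjie</value>
    <description>
    ACL that controls which users may submit applications to the queue. If it is not specified, the ACL is derived from the parent queue in the hierarchy. The default for the root queue is "*", which means any user.
    Regular users cannot view or modify applications submitted by other users. As a cluster administrator, you can do the following to queues and jobs:
        (1) Change queue definitions and properties at run time;
        (2) Stop a queue to prevent new applications from being submitted;
        (3) Start a stopped queue back up;
    </description>
  </property>


  <!-- Map users to queues -->
  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <value>u:jason:op_queue02,g:hadoop_admin:op_queue01,u:yinzhengjie:%primary_group</value>
    <description>
    Maps users to specific queues, where u stands for a user and g for a group.
    "u:jason:op_queue02":
        maps the user jason to the op_queue02 queue.
    "g:hadoop_admin:op_queue01":
        maps the users in the hadoop_admin group to the op_queue01 queue.
    "u:yinzhengjie:%primary_group":
        maps the user yinzhengjie to the queue that has the same name as the user's primary group in Linux.
    Tips:
        YARN evaluates the mappings of this property from left to right and uses the first valid mapping it finds.
    </description>
  </property>

  <!-- Queue running state -->
  <property>
    <name>yarn.scheduler.capacity.root.state</name>
    <value>RUNNING</value>
    <description>
    A queue can be stopped or started at any time, at the root or at any other level of the hierarchy, and "yarn rmadmin -refreshQueues" applies the change without restarting the whole YARN cluster.
    A queue has two possible states, STOPPED and RUNNING; the default is RUNNING.
    Note:
        (1) Stopping root or a parent queue makes its leaf queues inactive (that is, STOPPED).
        (2) When a running queue is stopped, the applications already running in it continue until they complete, and no new applications can be submitted to the queue.
        (3) If a parent queue is STOPPED, its child queues cannot be configured as RUNNING; doing so throws an exception.
    Tips:
        The state and settings of the Capacity Scheduler queues can be monitored on the Scheduler page under the Application page of the ResourceManager Web UI.
    </description>
  </property>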

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.state</name>
    <value>RUNNING</value>
    <description>Sets the state of the "root.yinzhengjie" queue to "RUNNING"</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.state</name>
    <value>RUNNING</value>
    <description>Sets the state of the "root.yinzhengjie.operation" queue to "RUNNING"</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.development.state</name>
    <value>RUNNING</value>
    <description>Sets the state of the "root.yinzhengjie.development" queue to "RUNNING"</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.testing.state</name>
    <value>RUNNING</value>
    <description>Sets the state of the "root.yinzhengjie.testing" queue to "RUNNING"</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue01.state</name>
    <value>RUNNING</value>
    <description>Sets the state of the "root.yinzhengjie.operation.op_queue01" queue to "RUNNING"</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.yinzhengjie.operation.op_queue02.state</name>
    <value>RUNNING</value>
    <description>Sets the state of the "root.yinzhengjie.operation.op_queue02" queue to "RUNNING"</description>
  </property>

</configuration>
[root@hadoop101.yinzhengjie.com ~]# 
[root@hadoop101.yinzhengjie.com ~]#
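The left-to-right, first-match behaviour of "yarn.scheduler.capacity.queue-mappings" in the configuration above can be sketched in Python. This is a simplified model for illustration, not the ResourceManager's implementation: it does not check that the target queue exists, and the function name `resolve_queue` plus the sample users and group memberships are assumptions of this example.

```python
def resolve_queue(user, groups, mappings):
    """Return the queue chosen for `user`, or None if no mapping matches.

    groups   -- the user's groups, primary group first
    mappings -- the comma-separated value of the queue-mappings property
    """
    for entry in mappings.split(","):
        kind, name, queue = entry.strip().split(":")
        if kind == "u" and name in (user, "%user"):
            if queue == "%primary_group":
                return groups[0]
            if queue == "%user":
                return user
            return queue
        if kind == "g" and name in groups:
            return queue
    return None


MAPPINGS = "u:jason:op_queue02,g:hadoop_admin:op_queue01,u:yinzhengjie:%primary_group"

print(resolve_queue("jason", ["jason"], MAPPINGS))
print(resolve_queue("tom", ["tom", "hadoop_admin"], MAPPINGS))
print(resolve_queue("yinzhengjie", ["yinzhengjie"], MAPPINGS))
print(resolve_queue("yinzhengjie", ["yinzhengjie", "hadoop_admin"], MAPPINGS))
```

The last call shows why order matters: if yinzhengjie also belonged to hadoop_admin, the group mapping would match first and the "%primary_group" rule would never be reached.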

2>. Submit a job as the user jason and check its queue in the RM Web UI (as the figure shows, if the queue is op_queue02 the configuration has taken effect)

[root@hadoop101.yinzhengjie.com ~]# su -l jason
[jason@hadoop101.yinzhengjie.com ~]$ 
[jason@hadoop101.yinzhengjie.com ~]$ 
[jason@hadoop101.yinzhengjie.com ~]$ hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar wordcount /input /output
20/10/31 00:03:46 INFO client.RMProxy: Connecting to ResourceManager at hadoop101.yinzhengjie.com/172.200.6.101:8032
20/10/31 00:03:47 INFO input.FileInputFormat: Total input files to process : 1
20/10/31 00:03:47 INFO mapreduce.JobSubmitter: number of splits:1
20/10/31 00:03:47 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/10/31 00:03:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1604073793284_0001
20/10/31 00:03:47 INFO conf.Configuration: resource-types.xml not found
20/10/31 00:03:47 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/10/31 00:03:47 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
20/10/31 00:03:47 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
20/10/31 00:03:47 INFO impl.YarnClientImpl: Submitted application application_1604073793284_0001
20/10/31 00:03:47 INFO mapreduce.Job: The url to track the job: http://hadoop101.yinzhengjie.com:8088/proxy/application_1604073793284_0001/
20/10/31 00:03:47 INFO mapreduce.Job: Running job: job_1604073793284_0001
20/10/31 00:03:55 INFO mapreduce.Job: Job job_1604073793284_0001 running in uber mode : false
20/10/31 00:03:55 INFO mapreduce.Job:  map 0% reduce 0%
20/10/31 00:04:00 INFO mapreduce.Job:  map 100% reduce 0%
20/10/31 00:04:05 INFO mapreduce.Job:  map 100% reduce 100%
20/10/31 00:04:06 INFO mapreduce.Job: Job job_1604073793284_0001 completed successfully
20/10/31 00:04:07 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=1014
        FILE: Number of bytes written=417761
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=781
        HDFS: Number of bytes written=708
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3152
        Total time spent by all reduces in occupied slots (ms)=2538
        Total time spent by all map tasks (ms)=3152
        Total time spent by all reduce tasks (ms)=2538
        Total vcore-milliseconds taken by all map tasks=3152
        Total vcore-milliseconds taken by all reduce tasks=2538
        Total megabyte-milliseconds taken by all map tasks=6455296
        Total megabyte-milliseconds taken by all reduce tasks=5197824
    Map-Reduce Framework
        Map input records=3
        Map output records=99
        Map output bytes=1057
        Map output materialized bytes=1014
        Input split bytes=119
        Combine input records=99
        Combine output records=75
        Reduce input groups=75
        Reduce shuffle bytes=1014
        Reduce input records=75
        Reduce output records=75
        Spilled Records=150
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=329
        CPU time spent (ms)=1840
        Physical memory (bytes) snapshot=532664320
        Virtual memory (bytes) snapshot=7253307392
        Total committed heap usage (bytes)=419430400
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=662
    File Output Format Counters 
        Bytes Written=708
[jason@hadoop101.yinzhengjie.com ~]$
