Background:
To meet a business requirement, the data is split into 60 shards and 60 applications are launched, one per shard. The submission script for each application is as follows:
#!/bin/sh
#LANG=zh_CN.utf8
#export LANG
export SPARK_KAFKA_VERSION=0.10
export LANG=zh_CN.UTF-8

# Build a comma-separated list of dependency jars, then strip the trailing comma
jarspath=''
for file in `ls /home/dx/pro2.0/app01/sparkjars/*.jar`
do
  jarspath=${file},$jarspath
done
jarspath=${jarspath%?}
echo $jarspath

./bin/spark-submit.sh \
  --jars $jarspath \
  --properties-file ../conf/spark-properties.conf \
  --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name Streaming-$2-$3-$4-$5-$1-Agg-Parser \
  --driver-memory 9g \
  --driver-cores 1 \
  --num-executors 1 \
  --executor-cores 12 \
  --executor-memory 22g \
  --driver-java-options "-XX:+TraceClassPaths" \
  --class com.dx.app01.streaming.Main \
  /home/dx/pro2.0/app01/lib/app01-streaming-driver.jar $1 $2 $3 $4 $5
The cluster has 43 worker nodes, each configured with 24 VCores and 64 GB of memory.
YARN configuration:
Property | Description | Value |
yarn.scheduler.minimum-allocation-mb | Minimum memory a single container may request | 1 GB |
yarn.scheduler.maximum-allocation-mb | Maximum memory a single container may request | 51 GB |
yarn.nodemanager.resource.cpu-vcores | Total virtual CPU cores available to the NodeManager | 21 vcores |
yarn.nodemanager.resource.memory-mb | Maximum memory available on each node; the two RM allocation values above should not exceed it | 51 GB |
Problem:
The script above was used to launch 60 applications, but testing showed that at most 24 of them could actually be admitted; the remaining applications all stayed in the ACCEPTED state. Given the cluster size, at least 43 applications should be able to run.
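As a quick check, the stuck and the admitted applications can be listed with the standard YARN CLI (a minimal sketch; it only filters by application state):

# Applications still waiting for an ApplicationMaster container
yarn application -list -appStates ACCEPTED

# Applications that were actually granted resources
yarn application -list -appStates RUNNING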
The number of containers running on each node, as reported by the yarn node -list command, is shown below:
Node-Id | Node-State | Node-Http-Address | Number-of-Running-Containers |
node-53:45454 | RUNNING | node-53:8042 | 1 |
node-62:45454 | RUNNING | node-62:8042 | 4 |
node-44:45454 | RUNNING | node-44:8042 | 3 |
node-37:45454 | RUNNING | node-37:8042 | 0 |
node-35:45454 | RUNNING | node-35:8042 | 1 |
node-07:45454 | RUNNING | node-07:8042 | 0 |
node-30:45454 | RUNNING | node-30:8042 | 0 |
node-56:45454 | RUNNING | node-56:8042 | 2 |
node-47:45454 | RUNNING | node-47:8042 | 0 |
node-42:45454 | RUNNING | node-42:8042 | 2 |
node-03:45454 | RUNNING | node-03:8042 | 6 |
node-51:45454 | RUNNING | node-51:8042 | 2 |
node-33:45454 | RUNNING | node-33:8042 | 1 |
node-04:45454 | RUNNING | node-04:8042 | 1 |
node-48:45454 | RUNNING | node-48:8042 | 6 |
node-39:45454 | RUNNING | node-39:8042 | 0 |
node-60:45454 | RUNNING | node-60:8042 | 1 |
node-54:45454 | RUNNING | node-54:8042 | 0 |
node-45:45454 | RUNNING | node-45:8042 | 0 |
node-63:45454 | RUNNING | node-63:8042 | 1 |
node-09:45454 | RUNNING | node-09:8042 | 1 |
node-01:45454 | RUNNING | node-01:8042 | 1 |
node-36:45454 | RUNNING | node-36:8042 | 3 |
node-06:45454 | RUNNING | node-06:8042 | 0 |
node-61:45454 | RUNNING | node-61:8042 | 1 |
node-31:45454 | RUNNING | node-31:8042 | 0 |
node-40:45454 | RUNNING | node-40:8042 | 0 |
node-57:45454 | RUNNING | node-57:8042 | 1 |
node-59:45454 | RUNNING | node-59:8042 | 1 |
node-43:45454 | RUNNING | node-43:8042 | 1 |
node-52:45454 | RUNNING | node-52:8042 | 1 |
node-34:45454 | RUNNING | node-34:8042 | 1 |
node-38:45454 | RUNNING | node-38:8042 | 0 |
node-50:45454 | RUNNING | node-50:8042 | 4 |
node-46:45454 | RUNNING | node-46:8042 | 1 |
node-08:45454 | RUNNING | node-08:8042 | 1 |
node-55:45454 | RUNNING | node-55:8042 | 1 |
node-32:45454 | RUNNING | node-32:8042 | 0 |
node-41:45454 | RUNNING | node-41:8042 | 2 |
node-05:45454 | RUNNING | node-05:8042 | 1 |
node-02:45454 | RUNNING | node-02:8042 | 1 |
node-58:45454 | RUNNING | node-58:8042 | 0 |
node-49:45454 | RUNNING | node-49:8042 | 0 |
Clearly, part of the cluster is still idle, so resources are sufficient.
Therefore at least 43 applications should have been admitted, yet only 24 were running, and YARN reported the following diagnostic for the pending ones:
[Tue Jul 30 16:33:29 +0000 2019] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>;
AM Resource Request = <memory:9216MB(9G), vCores:1>; Queue Resource Limit for AM = <memory:454656MB(444G), vCores:1>; User AM Resource Limit of the queue = <memory:229376MB(224G), vCores:1>; Queue AM Resource Usage = <memory:221184MB(216G), vCores:24>;
Solution:
The diagnostic line "Queue AM Resource Usage = <memory:221184MB(216G), vCores:24>;" refers exactly to the 24 applications already running (in yarn-cluster mode each application has one driver, and the driver is the AM): each driver uses 1 vcore, 24 vcores in total, and each driver uses 9 GB of memory, 9 GB × 24 = 216 GB. Admitting a 25th driver would push AM usage to 225 GB, above the 224 GB user AM limit quoted below, which is why every application beyond the 24th stays in ACCEPTED.
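A minimal arithmetic check of why the count stops at 24 (all figures taken from the diagnostic above):

# AM memory already in use: 24 drivers x 9 GB each
echo $((24 * 9))    # 216 GB, matches "Queue AM Resource Usage"
# One more 9 GB driver would exceed the 224 GB user AM limit
echo $((216 + 9))   # 225 GB > 224 GB, so the 25th application stays ACCEPTED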
The diagnostic line "User AM Resource Limit of the queue = <memory:229376MB(224G), vCores:1>;" means that at most 224 GB of the cluster's resources may be used to run ApplicationMasters; this value is governed by the parameter yarn.scheduler.capacity.maximum-am-resource-percent.
yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent | Upper bound on the fraction of cluster resources that may be used to run ApplicationMasters; it is typically used to limit the number of concurrently active applications. The value is a float, default 0.1 (10%). The limit for all queues is set with yarn.scheduler.capacity.maximum-am-resource-percent (which acts as the default), and an individual queue can override it with yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent. |
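To see which value is currently in effect, the setting can be looked up in capacity-scheduler.xml on the ResourceManager host (the path below is the common default location and is only an assumption):

grep -B1 -A1 'maximum-am-resource-percent' /etc/hadoop/conf/capacity-scheduler.xml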
1) Increase yarn.scheduler.capacity.maximum-am-resource-percent
<property>
  <!-- Maximum fraction of resources to allocate to application masters.
       If this is too high, application masters can crowd out actual work. -->
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
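After editing capacity-scheduler.xml, the new limit can usually be applied without restarting the ResourceManager by refreshing the queues (a sketch; it assumes the Capacity Scheduler is in use, as the diagnostic suggests):

# Reload capacity-scheduler.xml so the higher AM resource limit takes effect
yarn rmadmin -refreshQueues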
2) Reduce the driver memory (see the sketch below).
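A sketch of the adjusted submit command, assuming the driver can run with less memory (the 4g value is only illustrative and must be validated against the application's actual needs):

# Same submit command as before; only --driver-memory is lowered from 9g.
# With 4 GB ApplicationMasters, 224 GB / 4 GB = 56 drivers could be admitted, enough for 43+.
./bin/spark-submit.sh \
  --jars $jarspath \
  --properties-file ../conf/spark-properties.conf \
  --master yarn \
  --deploy-mode cluster \
  --name Streaming-$2-$3-$4-$5-$1-Agg-Parser \
  --driver-memory 4g \
  --driver-cores 1 \
  --num-executors 1 \
  --executor-cores 12 \
  --executor-memory 22g \
  --class com.dx.app01.streaming.Main \
  /home/dx/pro2.0/app01/lib/app01-streaming-driver.jar $1 $2 $3 $4 $5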
For more, and more authoritative, information on the YARN Capacity Scheduler, refer to the official documentation: "Hadoop: Capacity Scheduler".