Background:
To meet a business requirement, the data is split into 60 shards and 60 applications are launched, one per shard. The submission script for each application is as follows:
#!/bin/sh
#LANG=zh_CN.utf8
#export LANG
export SPARK_KAFKA_VERSION=0.10
export LANG=zh_CN.UTF-8

# Build a comma-separated list of dependency jars, then strip the trailing comma
jarspath=''
for file in `ls /home/dx/pro2.0/app01/sparkjars/*.jar`
do
  jarspath=${file},$jarspath
done
jarspath=${jarspath%?}
echo $jarspath

./bin/spark-submit.sh \
  --jars $jarspath \
  --properties-file ../conf/spark-properties.conf \
  --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name Streaming-$2-$3-$4-$5-$1-Agg-Parser \
  --driver-memory 9g \
  --driver-cores 1 \
  --num-executors 1 \
  --executor-cores 12 \
  --executor-memory 22g \
  --driver-java-options "-XX:+TraceClassPaths" \
  --class com.dx.app01.streaming.Main \
  /home/dx/pro2.0/app01/lib/app01-streaming-driver.jar $1 $2 $3 $4 $5
The cluster has 43 worker nodes, each configured with 24 VCores and 64 GB of memory.
YARN configuration:
Property | Description | Value |
yarn.scheduler.minimum-allocation-mb | Minimum memory a single container may request | 1 GB |
yarn.scheduler.maximum-allocation-mb | Maximum memory a single container may request | 51 GB |
yarn.nodemanager.resource.cpu-vcores | Total virtual CPU cores available to the NodeManager | 21 vcores |
yarn.nodemanager.resource.memory-mb | Maximum memory available on each node; the two RM allocation values above should not exceed it | 51 GB |
Problem:
The script above was used to launch 60 applications, but testing showed that at most 24 of them could actually be admitted; the remaining applications all stayed in the ACCEPTED state. Given the cluster size, at least 43 applications should be able to run.
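As a quick check, the stuck and the admitted applications can be listed with the standard YARN CLI (a minimal sketch; it only filters by application state):

# Applications still waiting for an ApplicationMaster container
yarn application -list -appStates ACCEPTED

# Applications that were actually granted resources
yarn application -list -appStates RUNNING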
The number of containers running on each node, as reported by the yarn node -list command, is shown below:
Node-Id | Node-State | Node-Http-Address | Number-of-Running-Containers |
node-53:45454 | RUNNING | node-53:8042 | 1 |
node-62:45454 | RUNNING | node-62:8042 | 4 |
node-44:45454 | RUNNING | node-44:8042 | 3 |
node-37:45454 | RUNNING | node-37:8042 | 0 |
node-35:45454 | RUNNING | node-35:8042 | 1 |
node-07:45454 | RUNNING | node-07:8042 | 0 |
node-30:45454 | RUNNING | node-30:8042 | 0 |
node-56:45454 | RUNNING | node-56:8042 | 2 |
node-47:45454 | RUNNING | node-47:8042 | 0 |
node-42:45454 | RUNNING | node-42:8042 | 2 |
node-03:45454 | RUNNING | node-03:8042 | 6 |
node-51:45454 | RUNNING | node-51:8042 | 2 |
node-33:45454 | RUNNING | node-33:8042 | 1 |
node-04:45454 | RUNNING | node-04:8042 | 1 |
node-48:45454 | RUNNING | node-48:8042 | 6 |
node-39:45454 | RUNNING | node-39:8042 | 0 |
node-60:45454 | RUNNING | node-60:8042 | 1 |
node-54:45454 | RUNNING | node-54:8042 | 0 |
node-45:45454 | RUNNING | node-45:8042 | 0 |
node-63:45454 | RUNNING | node-63:8042 | 1 |
node-09:45454 | RUNNING | node-09:8042 | 1 |
node-01:45454 | RUNNING | node-01:8042 | 1 |
node-36:45454 | RUNNING | node-36:8042 | 3 |
node-06:45454 | RUNNING | node-06:8042 | 0 |
node-61:45454 | RUNNING | node-61:8042 | 1 |
node-31:45454 | RUNNING | node-31:8042 | 0 |
node-40:45454 | RUNNING | node-40:8042 | 0 |
node-57:45454 | RUNNING | node-57:8042 | 1 |
node-59:45454 | RUNNING | node-59:8042 | 1 |
node-43:45454 | RUNNING | node-43:8042 | 1 |
node-52:45454 | RUNNING | node-52:8042 | 1 |
node-34:45454 | RUNNING | node-34:8042 | 1 |
node-38:45454 | RUNNING | node-38:8042 | 0 |
node-50:45454 | RUNNING | node-50:8042 | 4 |
node-46:45454 | RUNNING | node-46:8042 | 1 |
node-08:45454 | RUNNING | node-08:8042 | 1 |
node-55:45454 | RUNNING | node-55:8042 | 1 |
node-32:45454 | RUNNING | node-32:8042 | 0 |
node-41:45454 | RUNNING | node-41:8042 | 2 |
node-05:45454 | RUNNING | node-05:8042 | 1 |
node-02:45454 | RUNNING | node-02:8042 | 1 |
node-58:45454 | RUNNING | node-58:8042 | 0 |
node-49:45454 | RUNNING | node-49:8042 | 0 |
Clearly, part of the cluster is still idle, so resources are sufficient.
Therefore at least 43 applications should have been admitted, yet only 24 were running, and YARN reported the following diagnostic for the pending ones:
[Tue Jul 30 16:33:29 +0000 2019] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>;
AM Resource Request = <memory:9216MB(9G), vCores:1>; Queue Resource Limit for AM = <memory:454656MB(444G), vCores:1>; User AM Resource Limit of the queue = <memory:229376MB(224G), vCores:1>; Queue AM Resource Usage = <memory:221184MB(216G), vCores:24>;
Solution:
The diagnostic line "Queue AM Resource Usage = <memory:221184MB(216G), vCores:24>;" refers exactly to the 24 applications already running (in yarn-cluster mode each application has one driver, and the driver is the AM): each driver uses 1 vcore, 24 vcores in total, and each driver uses 9 GB of memory, 9 GB × 24 = 216 GB. Admitting a 25th driver would push AM usage to 225 GB, above the 224 GB user AM limit quoted below, which is why every application beyond the 24th stays in ACCEPTED.
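A minimal arithmetic check of why the count stops at 24 (all figures taken from the diagnostic above):

# AM memory already in use: 24 drivers x 9 GB each
echo $((24 * 9))    # 216 GB, matches "Queue AM Resource Usage"
# One more 9 GB driver would exceed the 224 GB user AM limit
echo $((216 + 9))   # 225 GB > 224 GB, so the 25th application stays ACCEPTED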
The diagnostic line "User AM Resource Limit of the queue = <memory:229376MB(224G), vCores:1>;" means that at most 224 GB of the cluster's resources may be used to run ApplicationMasters; this value is governed by the parameter yarn.scheduler.capacity.maximum-am-resource-percent.
yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent | Upper bound on the fraction of cluster resources that may be used to run ApplicationMasters; it is typically used to limit the number of concurrently active applications. The value is a float, default 0.1 (10%). The limit for all queues is set with yarn.scheduler.capacity.maximum-am-resource-percent (which acts as the default), and an individual queue can override it with yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent. |
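To see which value is currently in effect, the setting can be looked up in capacity-scheduler.xml on the ResourceManager host (the path below is the common default location and is only an assumption):

grep -B1 -A1 'maximum-am-resource-percent' /etc/hadoop/conf/capacity-scheduler.xml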
1) Increase yarn.scheduler.capacity.maximum-am-resource-percent
<property>
  <!-- Maximum fraction of resources to allocate to application masters.
       If this is too high, application masters can crowd out actual work. -->
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
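After editing capacity-scheduler.xml, the new limit can usually be applied without restarting the ResourceManager by refreshing the queues (a sketch; it assumes the Capacity Scheduler is in use, as the diagnostic suggests):

# Reload capacity-scheduler.xml so the higher AM resource limit takes effect
yarn rmadmin -refreshQueues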
2) Reduce the driver memory (see the sketch below).
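A sketch of the adjusted submit command, assuming the driver can run with less memory (the 4g value is only illustrative and must be validated against the application's actual needs):

# Same submit command as before; only --driver-memory is lowered from 9g.
# With 4 GB ApplicationMasters, 224 GB / 4 GB = 56 drivers could be admitted, enough for 43+.
./bin/spark-submit.sh \
  --jars $jarspath \
  --properties-file ../conf/spark-properties.conf \
  --master yarn \
  --deploy-mode cluster \
  --name Streaming-$2-$3-$4-$5-$1-Agg-Parser \
  --driver-memory 4g \
  --driver-cores 1 \
  --num-executors 1 \
  --executor-cores 12 \
  --executor-memory 22g \
  --class com.dx.app01.streaming.Main \
  /home/dx/pro2.0/app01/lib/app01-streaming-driver.jar $1 $2 $3 $4 $5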
For more, and more authoritative, information on the YARN Capacity Scheduler, refer to the official documentation: "Hadoop: Capacity Scheduler".