Fixing slow job submission for Spark on YARN



Versions: spark-2.0.0, hadoop 2.7.2.

Submitting jobs in spark on yarn mode was noticeably slow: each submission took several minutes to start.

Submitting a job in cluster mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

The submission logged the following warning:

17/02/08 18:26:23 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/08 18:26:29 INFO yarn.Client: Uploading resource file:/tmp/spark-91508860-fdda-4203-b733-e19625ef23a0/__spark_libs__4918922933506017904.zip -> hdfs://dbmtimehadoop/user/fuxin.zhao/.sparkStaging/application_1486451708427_0392/__spark_libs__4918922933506017904.zip

After this log line, Spark spends roughly 30 seconds uploading the runtime jars, which is what makes submission so slow. The official documentation describes the cause and the fix:

To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. 
For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, 
Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.

In short: to make Spark's runtime jars accessible from the YARN side (the YARN nodes), specify spark.yarn.archive or spark.yarn.jars. If neither is set, Spark zips up everything under $SPARK_HOME/jars/ and uploads it to the distributed cache on every submission. That upload is why submission was so slow.
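To see how much data gets re-uploaded on every submission, you can check the size of the jars directory locally. A small diagnostic sketch (the /opt/spark fallback path is an assumption; point SPARK_HOME at your own install):

```shell
# How much is re-uploaded per submit when neither property is set.
# Assumption: SPARK_HOME points at your Spark install (falls back
# to /opt/spark here purely for illustration). For Spark 2.x the
# jars directory is typically a couple hundred MB.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
du -sh "$SPARK_HOME/jars" 2>/dev/null || echo "jars dir not found at $SPARK_HOME/jars"
ls "$SPARK_HOME"/jars/*.jar 2>/dev/null | wc -l   # number of jars uploaded each time
```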

The fix: upload the Spark runtime jars from $SPARK_HOME/jars/ to HDFS.

hadoop fs -mkdir hdfs://dbmtimehadoop/tmp/spark/lib_jars/
hadoop fs -put  $SPARK_HOME/jars/* hdfs://dbmtimehadoop/tmp/spark/lib_jars/
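Before pointing spark.yarn.jars at that directory, it is worth confirming the jars actually landed on HDFS. A quick check against the live cluster (reusing the paths above; this is an ops fragment, not runnable outside the cluster):

```shell
# Sanity check: list the uploaded jars and count them. The count
# should match the number of files in $SPARK_HOME/jars locally.
hadoop fs -ls hdfs://dbmtimehadoop/tmp/spark/lib_jars/ | head
hadoop fs -ls hdfs://dbmtimehadoop/tmp/spark/lib_jars/ | grep -c '\.jar$'
```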

vi $SPARK_HOME/conf/spark-defaults.conf
Add the following line:
spark.yarn.jars hdfs://dbmtimehadoop/tmp/spark/lib_jars/

Resubmitting the job then failed with the following exception:

Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)

The ResourceManager log for the failed container showed the underlying error: http://db-namenode01.host-mtime.com:19888/jobhistory/logs/db-datanode03.host-mtime.com:34545/container_e08_1486451708427_0346_02_000001/

Log Length: 191

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

This means the configuration above was wrong: the Spark jars were not loaded (a bare directory path without a wildcard does not work). After some experimentation, the following configurations all work:

spark.yarn.jars                  hdfs://dbmtimehadoop/tmp/spark/lib_jars/*.jar   # works
#spark.yarn.jars                 hdfs://dbmtimehadoop/tmp/spark/lib_jars/*       # also works
# Listing individual jars, comma-separated, also works:
#spark.yarn.jars                 hdfs://dbmtimehadoop/tmp/spark/lib_jars/activation-1.1.1.jar,hdfs://dbmtimehadoop/tmp/spark/lib_jars/antlr-2.7.7.jar,hdfs://dbmtimehadoop/tmp/spark/lib_jars/antlr4-runtime-4.5.3.jar,hdfs://dbmtimehadoop/tmp/spark/lib_jars/antlr-runtime-3.4.jar
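The documentation quoted earlier also mentions spark.yarn.archive, which points at a single zip of the runtime jars instead of a directory glob. A sketch of that variant, reusing this cluster's namenode; the /tmp/spark/archive path and archive name are assumptions for illustration, and the commands must run against the live cluster:

```shell
# Alternative per the Spark on YARN docs: package all runtime jars
# into one archive (jars at the root of the zip) and reference it
# with spark.yarn.archive instead of spark.yarn.jars.
cd "$SPARK_HOME/jars" && zip -q /tmp/spark-libs.zip ./*.jar
hadoop fs -mkdir -p hdfs://dbmtimehadoop/tmp/spark/archive/
hadoop fs -put /tmp/spark-libs.zip hdfs://dbmtimehadoop/tmp/spark/archive/

# then in $SPARK_HOME/conf/spark-defaults.conf:
# spark.yarn.archive    hdfs://dbmtimehadoop/tmp/spark/archive/spark-libs.zip
```

A single archive means one file to manage and one distributed-cache entry per application, which some teams find easier to version than a directory of individual jars.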

Resubmitting the job now succeeds quickly. Log lines like the following confirm the jars are being picked up from HDFS rather than re-uploaded:

17/02/08 19:28:21 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://dbmtimehadoop/tmp/spark/lib_jars/spark-mllib-local_2.11-2.0.0.jar
17/02/08 19:28:21 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://dbmtimehadoop/tmp/spark/lib_jars/spark-mllib_2.11-2.0.0.jar
17/02/08 19:28:21 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://dbmtimehadoop/tmp/spark/lib_jars/spark-network-common_2.11-2.0.0.jar
17/02/08 19:28:21 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://dbmtimehadoop/tmp/spark/lib_jars/spark-network-shuffle_2.11-2.0.0.jar

