Submitting Spark Jobs (Standalone and YARN)

Reposted article · 2018-11-24 13:16 · Spark

Submitting jobs in Spark Standalone mode

  Cluster mode:

./spark-submit \
--master spark://node01:7077 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar 100

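Execution flow

1. When an application is submitted in cluster mode, the client asks the Master to start a Driver (rather than to start the application directly).

2. The Master accepts the request and starts the Driver process on a randomly chosen node in the cluster.

3. Once started, the Driver requests resources for the application. The Master grants the resources and sends messages to the corresponding Worker nodes to start executor processes on them.

4. The Driver sends tasks to the Worker nodes for execution.

5. The Workers report task status and results back to the Driver, which monitors the tasks and collects the results.

Summary

1. When several applications are submitted from a client, their Drivers are started on randomly chosen Worker nodes, so the network traffic surge that would otherwise hit a single node's NIC is spread across the cluster. Task progress and results are not visible on the client; check the web UI instead. Cluster mode is suitable for production.

2. In cluster mode, the Driver is started first and the application afterwards.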

  Client mode:

./spark-submit \
--master spark://node01:7077 \
--class org.apache.spark.examples.SparkPi \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar 100

---------------------------------------------------------------------

./spark-submit \
--master spark://node01:7077 \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar 100
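
The two commands above are equivalent because --deploy-mode defaults to client (see the option reference later in this article). A minimal sketch of that fallback follows. It uses a hypothetical helper, submit_cmd, which only prints the command it would run instead of launching Spark; the master URL and jar path are the ones from the examples above.

```shell
# Hypothetical helper: print the spark-submit command for a given deploy mode.
# If no mode is passed, fall back to "client", mirroring spark-submit's default.
submit_cmd() {
    mode="${1:-client}"
    printf '%s\n' "./spark-submit --master spark://node01:7077 --deploy-mode $mode --class org.apache.spark.examples.SparkPi ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 100"
}

submit_cmd            # no mode given, so client is used
submit_cmd cluster    # explicit cluster mode
```

Printing the command first is a cheap way to review the flags before submitting for real.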

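Execution flow

1. When a job is submitted in client mode, the Driver process is started on the client.

2. The Driver asks the Master for the resources needed to start the application.

3. Once the resources are granted, the Driver sends tasks to the Workers for execution.

4. The Workers return the task results to the Driver.

Summary

1. Client mode is suited to testing and debugging. The Driver process starts on the client, which here means the node from which the application is submitted, so task progress is visible on the Driver side. Client mode should not be used in production: if 100 applications are submitted to the cluster, every Driver starts on the client, and the client's NIC sees a traffic surge for each of the 100 applications. Monitoring the running tasks occupies many ports (as in the result figure above), and the client's network capacity is consumed by task monitoring traffic.

2. Role of the Driver on the client side:

1. Requesting resources for the application.

2. Distributing tasks.

3. Collecting results.

4. Monitoring task execution.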

Submitting jobs in Spark on YARN mode

  Official documentation: http://spark.apache.org/docs/latest/running-on-yarn.html

  Spark on YARN, cluster mode:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--queue default \
lib/spark-examples*.jar \
10

---------------------------------------------------------------------------------------------------------------------------------
./bin/spark-submit --class cn.edu360.spark.day1.WordCount \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--queue default \
/home/bigdata/hello-spark-1.0.jar \
hdfs://node-1.edu360.cn:9000/wc hdfs://node-1.edu360.cn:9000/out-yarn-1

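  How Spark on YARN cluster mode works

    The Spark Driver first starts inside the YARN cluster as an ApplicationMaster. For every job the client submits to the ResourceManager, a unique ApplicationMaster is allocated on one of the cluster's NodeManager nodes, and that ApplicationMaster manages the application for its entire lifecycle. In detail:

1. The client submits a request to the ResourceManager and uploads the jar to HDFS.
  This involves four steps:
    a) Connect to the RM.
    b) Obtain metric, queue, and resource information from the RM's ASM (ApplicationsManager).
    c) Upload the app jar and the spark-assembly jar.
    d) Set up the runtime environment and the container context (launch-container.sh and similar scripts).
2. The ResourceManager requests resources from a NodeManager to create the Spark ApplicationMaster (each SparkContext has one ApplicationMaster).
3. The NodeManager starts the ApplicationMaster, which registers with the ResourceManager's ASM.
4. The ApplicationMaster locates the jar files on HDFS and starts the SparkContext, the DAGScheduler, and the YARN Cluster Scheduler.
5. The ApplicationMaster requests container resources from the ResourceManager's ASM.
6. The ResourceManager notifies the NodeManagers to allocate containers; at this point reports about the containers can be received from the ASM. (Each container corresponds to one executor.)
7. The Spark ApplicationMaster interacts directly with the containers (executors) to complete the distributed job.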
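
As a rough illustration of steps 6 and 7: each executor gets its own container and the ApplicationMaster occupies one more, so an application's container count can be estimated as below. This is a sketch that assumes a fixed number of executors (no dynamic allocation). total_containers is a hypothetical helper, and 2 is the --num-executors default listed in the option reference.

```shell
# One container per executor plus one for the ApplicationMaster.
# Assumes static allocation, i.e. a fixed --num-executors.
total_containers() {
    num_executors="${1:-2}"    # 2 is the --num-executors default
    echo $(( num_executors + 1 ))
}

total_containers       # default of 2 executors: prints 3
total_containers 10    # 10 executors: prints 11
```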

  Spark on YARN, client mode:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--queue default \
lib/spark-examples*.jar \
10

spark-shell must run in client mode:
./bin/spark-shell --master yarn --deploy-mode client

A real-world case: in YARN mode, the executor-cores and executor-memory settings have a major effect on how well the cluster's compute resources are scheduled.

$ ./bin/spark-submit \
  --class cn.cstor.face.BatchCompare \
  --master yarn \
  --deploy-mode client \
  --executor-memory 30G \
  --executor-cores 20 \
  --properties-file $BIN_DIR/conf/cstor-spark.properties \
  cstor-deep-1.0-SNAPSHOT.jar

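In client mode, the Driver runs on the client and obtains resources from the RM through the ApplicationMaster. The local Driver interacts with all executor containers and aggregates the final results. Closing the terminal is equivalent to killing the Spark application. In general, use this mode when the results only need to be returned to the terminal.

After the Driver on the client submits the application to YARN, YARN starts the ApplicationMaster and then the executors. Both run inside containers, and a container's default memory is 1 GB. The ApplicationMaster's memory follows driver-memory in cluster mode, where it hosts the Driver (in client mode it is sized by spark.yarn.am.memory instead), and each executor is allocated executor-memory. Because the Driver is on the client, the program's output can be shown on the client; the Driver exists as a process named SparkSubmit.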
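
The memory YARN reserves for an executor container is larger than the heap requested with --executor-memory, because Spark adds an off-heap overhead on top. The sketch below estimates the request, assuming the Spark 1.6-era default applies: overhead = max(384 MB, 10% of the heap), controlled by spark.yarn.executor.memoryOverhead. container_mb is a hypothetical helper, and YARN may round the result up further to its own allocation increment.

```shell
# Estimate the memory (in MB) requested from YARN for one executor container.
# Assumed overhead rule (Spark 1.6 defaults): max(384 MB, 10% of the heap).
container_mb() {
    heap_mb="$1"
    overhead_mb=$(( heap_mb / 10 ))
    if [ "$overhead_mb" -lt 384 ]; then
        overhead_mb=384
    fi
    echo $(( heap_mb + overhead_mb ))
}

container_mb 1024     # --executor-memory 1g:  prints 1408
container_mb 30720    # --executor-memory 30G: prints 33792
```

This gap between the heap size and the container size is worth keeping in mind when choosing executor-memory on a memory-constrained cluster.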

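When submitting jobs with Spark on YARN, cluster mode is the usual choice. In this mode the Driver runs inside the cluster, in the ApplicationMaster process; if that process fails, YARN restarts the ApplicationMaster (and with it the Driver). SparkSubmit then serves only to submit the job.

For an interactive command line, client mode is required. In this mode the Driver runs inside the SparkSubmit process, because the collected results must be returned to the command line (that is, to the machine where the command was started). This mode is generally used for testing or for running interactive shells such as spark-shell and spark-sql.

Note: running Spark on YARN in client mode may fail with an error out of the box.
To work around it, edit yarn-site.xml on every YARN node and add the following configuration, which disables the NodeManager's physical and virtual memory checks: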

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
 
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
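
After editing the file, it can be useful to confirm that both properties are actually set on a node (the change normally takes effect only once the NodeManagers are restarted). The following is a sketch, not a real XML parser: check_disabled is a hypothetical helper that assumes <name> and <value> sit on separate lines as in the snippet above, and the yarn-site.xml location is an assumption based on HADOOP_CONF_DIR.

```shell
# Hypothetical helper: succeed if property $2 in file $1 has the value "false".
# Assumes <name> and <value> are on separate lines, as in the snippet above.
check_disabled() {
    [ -f "$1" ] || return 1
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            if ($0 ~ /<value>false<\/value>/) ok = 1
            found = 0
        }
        END { exit (ok ? 0 : 1) }
    ' "$1"
}

# The path is an assumption; adjust it to where your cluster keeps yarn-site.xml.
site="${HADOOP_CONF_DIR:-/etc/hadoop/conf}/yarn-site.xml"
for prop in yarn.nodemanager.pmem-check-enabled yarn.nodemanager.vmem-check-enabled; do
    if check_disabled "$site" "$prop"; then
        echo "$prop is disabled"
    else
        echo "$prop is NOT disabled in $site"
    fi
done
```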

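  Differences between the two modes (YARN):

    Cluster mode: the Driver runs inside YARN, and the application's output cannot be shown on the client. It is therefore best suited to applications that write their final results to external storage (such as HDFS, Redis, or MySQL) rather than to stdout. The client terminal shows only a brief status of the application as a YARN job.

    Client mode: the Driver runs on the client, and the application's output is shown on the client, so it suits applications that produce output to view (such as spark-shell).

  spark-submit options in detail: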

  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.

  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, conf/spark-defaults.conf is used.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.

  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output
  --version,                  Print the version of current Spark

YARN-only:

Options:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

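Note: in cluster mode, anything the code writes to standard output is not visible on the client, so write results to HDFS instead. In client mode the output is visible.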
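
For debugging a cluster-mode job on YARN, the driver's stdout can usually still be retrieved afterwards with yarn logs -applicationId, assuming log aggregation is enabled on the cluster. The sketch below pulls the application id out of spark-submit's log output and prints the command to run. extract_app_id is a hypothetical helper, and the sample log line and its id are made up for illustration.

```shell
# Hypothetical helper: read spark-submit log output on stdin and print
# the first YARN application id found in it.
extract_app_id() {
    sed -n 's/.*\(application_[0-9][0-9]*_[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Illustrative log line; the id below is made up for the example.
sample_log='18/11/24 13:16:00 INFO impl.YarnClientImpl: Submitted application application_1543036560000_0001'

app_id=$(printf '%s\n' "$sample_log" | extract_app_id)

# Print (rather than run) the command that would fetch the aggregated logs.
echo "yarn logs -applicationId $app_id"
```

In practice the helper would be fed the saved output of a real spark-submit run instead of the sample line.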
