Spark 1.0 introduced spark-submit as the unified way to submit applications.
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]
--class: the entry point of the application;
--master: the master URL of the cluster;
--deploy-mode: the deployment mode of the driver (client or cluster);
application-jar: the jar containing the application code; it can be placed on HDFS or on the local file system;
Standalone mode example:
spark-submit \
  --name SparkSubmit_Demo \
  --class com.luogankun.spark.WordCount \
  --master spark://hadoop000:7077 \
  --executor-memory 1G \
  --total-executor-cores 1 \
  /home/spark/data/spark.jar \
  hdfs://hadoop000:8020/hello.txt
Here --master must be set to the Spark cluster's master URL (spark://host:port);
yarn-client mode example:
spark-submit \
  --name SparkSubmit_Demo \
  --class com.luogankun.spark.WordCount \
  --master yarn-client \
  --executor-memory 1G \
  --executor-cores 1 \
  /home/spark/data/spark.jar \
  hdfs://hadoop000:8020/hello.txt
yarn-cluster mode example:
spark-submit \
  --name SparkSubmit_Demo \
  --class com.luogankun.spark.WordCount \
  --master yarn-cluster \
  --executor-memory 1G \
  --executor-cores 1 \
  /home/spark/data/spark.jar \
  hdfs://hadoop000:8020/hello.txt
Note: submitting to YARN requires HADOOP_CONF_DIR to be configured so that spark-submit can find the cluster's configuration.
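For example, a minimal setup sketch; the path is an assumption and should point at the directory containing your cluster's core-site.xml and yarn-site.xml:

# Make the Hadoop/YARN client configuration visible to spark-submit
export HADOOP_CONF_DIR=/etc/hadoop/conf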
The difference between yarn-client and yarn-cluster comes down to where the Driver runs.
yarn-client:
The Client and the Driver run together, and the ApplicationMaster is only used to request resources. Results are printed to the client console in real time and log messages are easy to inspect, so this mode is recommended;
After submission to YARN, YARN first starts the ApplicationMaster and the Executors, both of which run inside Containers. Note: a container runs only one ExecutorBackend;
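A quick way to confirm the application is running on YARN is the yarn CLI (a sketch assuming the yarn command is configured on the client; the output includes the application id, state, and tracking URL):

# List applications known to the ResourceManager
yarn application -list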
yarn-cluster:
The Driver and the ApplicationMaster run together, so the results cannot be shown on the client console; they need to be written to HDFS or to a database instead (see the sketch at the end of this section).
Since the driver runs on the cluster, its status can be checked through the web UI.
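Because the driver's stdout is not attached to the client console in this mode, its output can also be retrieved after the application finishes with YARN's log CLI (a sketch assuming log aggregation is enabled; <application-id> is the id reported at submission time):

# Fetch the aggregated container logs, including the driver's output
yarn logs -applicationId <application-id>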