[Spark Core Study Notes 3] Spark Cluster Setup & spark-shell & Master HA


Environment
  Virtual machine: VMware 10
  Linux version: CentOS-6.5-x86_64
  SSH client: Xshell 4
  FTP client: Xftp 4
  jdk1.8
  scala-2.10.4 (requires jdk1.8)
  spark-1.6

 

I. Cluster Setup
Cluster plan:
Master: PCS101; slaves: PCS102, PCS103

Setup option 1: Standalone


Step 1: Unpack the archive and rename the directory

[root@PCS101 src]# tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz -C /usr/local
[root@PCS101 local]# mv spark-1.6.0-bin-hadoop2.6 spark-1.6.0

 

Step 2: Edit the configuration files
1. slaves.template: declare the slave nodes

[root@PCS101 conf]# cd /usr/local/spark-1.6.0/conf && mv slaves.template slaves && vi slaves

PCS102
PCS103

 

2. spark-config.sh: set JAVA_HOME

export JAVA_HOME=/usr/local/jdk1.8.0_65

 

3. spark-env.sh

[root@PCS101 conf]# mv spark-env.sh.template spark-env.sh && vi spark-env.sh

# SPARK_MASTER_IP: hostname/IP of the master
export SPARK_MASTER_IP=PCS101

# SPARK_MASTER_PORT: port that jobs are submitted to; default is 7077
export SPARK_MASTER_PORT=7077

# SPARK_WORKER_CORES: number of cores each worker node may use
export SPARK_WORKER_CORES=2

# SPARK_WORKER_MEMORY: amount of memory each worker node may use
export SPARK_WORKER_MEMORY=3g

# SPARK_MASTER_WEBUI_PORT: Spark master web UI port, default 8080 (can also be changed in sbin/start-master.sh)
export SPARK_MASTER_WEBUI_PORT=8080

 

Step 3: Distribute Spark to the other two nodes

[root@PCS101 local]# scp -r /usr/local/spark-1.6.0 root@PCS102:`pwd`
[root@PCS101 local]# scp -r /usr/local/spark-1.6.0 root@PCS103:`pwd`

 

Step 4: Start the cluster

[root@PCS101 sbin]# /usr/local/spark-1.6.0/sbin/start-all.sh
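After start-all.sh returns, a quick way to confirm the daemons came up is jps on each node (a minimal check; Master and Worker are the standard standalone process names):

[root@PCS101 sbin]# jps | grep Master     # the master node should list a Master process
[root@PCS102 ~]# jps | grep Worker        # each slave node should list a Worker process
[root@PCS103 ~]# jps | grep Worker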

 

Step 5: Stop the cluster

[root@PCS101 sbin]# /usr/local/spark-1.6.0/sbin/stop-all.sh

Master web UI: default port 8080

Application (job) web UI: port 4040

Step 6: Set up a client node
Copy the Spark installation directory, unchanged, to a new node; jobs can then be submitted from that node.
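For example, copying the installation to a new client node (a sketch; PCS104 is the client node used later in this post):

[root@PCS101 local]# scp -r /usr/local/spark-1.6.0 root@PCS104:/usr/local/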


Test a job submission:

[root@PCS101 bin]# ./spark-submit 
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.

  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output
  --version,                  Print the version of current Spark

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
      

 

(1) Submit from any of the three nodes.
# --master specifies the master; it takes one of four forms:
  standalone: spark://host:port
  mesos:      mesos://host:port
  yarn:       yarn
  local:      local
# --class specifies the application's main class, followed by the path to its jar
# --deploy-mode selects the submission mode: client (client mode) or cluster (cluster mode)

Example: SparkPi (the trailing 10000 is SparkPi's argument):

[root@PCS101 bin]# /usr/local/spark-1.6.0/bin/spark-submit --master spark://PCS101:7077 --class org.apache.spark.examples.SparkPi ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 10000

 

(2) Submit jobs from a dedicated client node:
Copy the entire spark-1.6.0 installation directory from PCS101 to a new node, PCS104, and submit the job from PCS104; the effect is the same.
PCS104 is not part of the Spark cluster at all; it only carries the Spark submission scripts, so it can serve as a Spark client.
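A sketch of such a submission from PCS104 (the paths match the copied installation; the resource flags are optional and only illustrate options from the help output above):

[root@PCS104 bin]# /usr/local/spark-1.6.0/bin/spark-submit --master spark://PCS101:7077 --deploy-mode client --class org.apache.spark.examples.SparkPi --executor-memory 1g --total-executor-cores 2 /usr/local/spark-1.6.0/lib/spark-examples-1.6.0-hadoop2.6.0.jar 1000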


Setup option 2: YARN

Note: Spark only depends on HDFS when the input data comes from HDFS or jobs are submitted through YARN; otherwise the two are unrelated.
(1) Steps 1, 2, 3, and 6 are the same as for standalone.
(2) On the client, configure:

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

 

Submit a job:

[root@PCS101 bin]# /usr/local/spark-1.6.0/bin/spark-submit --master yarn --class org.apache.spark.examples.SparkPi ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 10000
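The same example can also be submitted in cluster deploy mode, in which case the driver runs inside the YARN ApplicationMaster instead of on the client (a sketch using the same paths):

[root@PCS101 bin]# /usr/local/spark-1.6.0/bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 10000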

 

II. spark-shell

spark-shell is a quick prototyping tool that ships with Spark; it is Spark's Scala REPL (Read-Eval-Print Loop), i.e. an interactive shell for programming Spark interactively in Scala.

1. Running it
Step 1: Start the standalone cluster and the HDFS cluster (pseudo-distributed HDFS on PCS102), then start spark-shell:

[root@PCS101 bin]# /usr/local/spark-1.6.0/bin/spark-shell --master spark://PCS101:7077

 

Step 2: Run a word count:

scala>sc.textFile("hdfs://PCS102:9820/spark/test/wc.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).foreach(println)
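If wc.txt does not exist on HDFS yet, a minimal sketch to create it first (the file content is arbitrary; the path matches the command above):

[root@PCS102 ~]# hdfs dfs -mkdir -p /spark/test
[root@PCS102 ~]# echo "hello spark hello hadoop hello scala" > /tmp/wc.txt
[root@PCS102 ~]# hdfs dfs -put /tmp/wc.txt /spark/test/wc.txt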

 

2. Configuring the history server

(2.1) Temporary configuration, which applies only to the application being submitted:

[root@PCS101 bin]# /usr/local/spark-1.6.0/bin/spark-shell --master spark://PCS101:7077 --name myapp1 --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://PCS102:9820/spark/test

 

After the application stops, its history can be viewed in the web UI under Completed Applications, via the corresponding Application ID.
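To confirm that the event log was actually written, list the configured directory (a sketch; the entries are named after application IDs):

[root@PCS102 ~]# hdfs dfs -ls /spark/test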

(2.2) Configure the history server in spark-defaults.conf, which then applies to every submitted application.
To browse historical event logs, you can set up a dedicated HistoryServer for that purpose; it is independent of the current cluster.

Here we use the new client node PCS104: append the following to ../spark-1.6.0/conf/spark-defaults.conf:

# Enable event logging
spark.eventLog.enabled true
# Directory where event logs are written
spark.eventLog.dir hdfs://PCS102:9820/spark/test
# Directory from which the HistoryServer loads event logs
spark.history.fs.logDirectory hdfs://PCS102:9820/spark/test
# Optimization: compress the event logs
spark.eventLog.compress true

 

Start the HistoryServer:

[root@PCS104 sbin]# /usr/local/spark-1.6.0/sbin/start-history-server.sh
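A quick check that the daemon is up and its web UI responds (a sketch; curl only probes the port and prints the HTTP status code):

[root@PCS104 sbin]# jps | grep HistoryServer
[root@PCS104 sbin]# curl -s -o /dev/null -w "%{http_code}\n" http://PCS104:18080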

 

Visit the HistoryServer at PCS104:18080; from then on, the run history of every submitted application is recorded there. This HistoryServer is independent of the current Spark cluster.

III. Master HA

 

1. Master high availability can be backed by a file system (FILESYSTEM recovery mode) or by ZooKeeper (a distributed coordination service); ZooKeeper is the usual choice.


2. Setting up Master HA
Plan:
Spark cluster: PCS101, PCS102, PCS103
Primary master: PCS101
Standby master: PCS102

1) On the master node, configure the primary master in /usr/local/spark-1.6.0/conf/spark-env.sh:

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=PCS101:2181,PCS102:2181,PCS103:2181  -Dspark.deploy.zookeeper.dir=/sparkmaster0409"


2) Copy it to the other worker nodes:

[root@PCS101 conf]# scp spark-env.sh root@PCS102:`pwd`
[root@PCS101 conf]# scp spark-env.sh root@PCS103:`pwd`

3) On PCS102, configure the standby master by changing the master IP in its spark-env.sh:

export SPARK_MASTER_IP=PCS102

4) Start the ZooKeeper ensemble before starting the Spark cluster:
../zkServer.sh start
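Optionally confirm each node's role before continuing (a sketch; the zkServer.sh path depends on your ZooKeeper installation, assumed here to be /usr/local/zookeeper). In a three-node ensemble one node reports leader and the other two report follower:

[root@PCS101 ~]# /usr/local/zookeeper/bin/zkServer.sh status
[root@PCS102 ~]# /usr/local/zookeeper/bin/zkServer.sh status
[root@PCS103 ~]# /usr/local/zookeeper/bin/zkServer.sh status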

5) Start the Spark standalone cluster, then start the standby master:

[root@PCS101 sbin]# /usr/local/spark-1.6.0/sbin/start-all.sh
[root@PCS102 sbin]# /usr/local/spark-1.6.0/sbin/start-master.sh

6) Open the web UI of both the primary and the standby master and observe their status:

PCS101: Status: ALIVE
PCS102: Status: STANDBY

Kill the Master process on PCS101; PCS102's status then changes from STANDBY to ALIVE.
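A sketch of the kill on PCS101 (the PID is whichever one jps reports for the Master process):

[root@PCS101 ~]# jps | grep -w Master
[root@PCS101 ~]# kill -9 $(jps | grep -w Master | awk '{print $1}')

According to the Spark standalone HA documentation, recovery on the standby master typically takes one to two minutes, after which its web UI shows ALIVE.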

3. Notes
  (1) Applications cannot be submitted while a master failover is in progress.
  (2) A failover does not affect applications already running in the cluster, because Spark uses coarse-grained resource scheduling: resources are allocated when the application starts and do not go through the master again.

4. Verification
Submit the SparkPi program, kill the primary master, and observe what happens:

./spark-submit \
  --master spark://PCS101:7077,PCS102:7077 \
  --class org.apache.spark.examples.SparkPi \
  ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 10000

 

 

 

