Spark on YARN Installation and Deployment


Preparation

1. Basic environment configuration

[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat /etc/hosts

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost6 localhost6.localdomain6
10.16.5.162 rcf-ai-datafeed-spark-prd-01.wisers.com rcf-ai-datafeed-spark-prd-01     #master
10.16.5.177 rcf-ai-datafeed-spark-prd-02.wisers.com rcf-ai-datafeed-spark-prd-02     #slave
10.16.5.22 rcf-ai-datafeed-spark-prd-03.wisers.com rcf-ai-datafeed-spark-prd-03       #slave
10.16.5.243 rcf-ai-datafeed-spark-prd-04.wisers.com rcf-ai-datafeed-spark-prd-04     #slave

 

[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat /etc/profile

#java config
export JAVA_HOME=/data/server/jdk

#hadoop config
export HADOOP_HOME=/data/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export YARN_CONF_DIR=$HADOOP_CONF_DIR

#scala config
export SCALA_HOME=/data/server/scala-2.11.12

#spark config
export SPARK_HOME=/data/server/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf

export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$PATH

 

[ec2-user@rcf-ai-datafeed-spark-prd-01 data]$ ll

total 28
drwxrwxr-x 4 ec2-user ec2-user 4096 Apr 11 12:46 hadoop
drwx------ 2 root root 16384 Apr 10 15:48 lost+found
drwxrwxr-x 2 ec2-user ec2-user 4096 Apr 10 19:19 programs
drwxrwxr-x 6 ec2-user ec2-user 4096 Apr 10 19:19 server


[ec2-user@rcf-ai-datafeed-spark-prd-01 data]$ cd server/
[ec2-user@rcf-ai-datafeed-spark-prd-01 server]$ ll  (ln -s hadoop-2.7.7 hadoop)

total 16
lrwxrwxrwx 1 ec2-user ec2-user 12 Apr 10 19:15 hadoop -> hadoop-2.7.7
drwxr-xr-x 10 ec2-user ec2-user 4096 Apr 11 11:23 hadoop-2.7.7
lrwxrwxrwx 1 ec2-user ec2-user 12 Apr 10 19:15 jdk -> jdk1.8.0_191
drwxr-xr-x 7 ec2-user ec2-user 4096 Oct 6 2018 jdk1.8.0_191
drwxrwxr-x 6 ec2-user ec2-user 4096 Nov 10 2017 scala-2.11.12
lrwxrwxrwx 1 ec2-user ec2-user 25 Apr 10 19:15 spark -> spark-2.4.0-bin-hadoop2.7
drwxr-xr-x 13 ec2-user ec2-user 4096 Oct 29 14:36 spark-2.4.0-bin-hadoop2.7

Note: configure passwordless SSH between all 4 machines (ssh-keygen -t rsa).
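A minimal sketch of the passwordless SSH setup, run as ec2-user (ssh-copy-id assumes you can already log in to each target; otherwise append ~/.ssh/id_rsa.pub to the target's ~/.ssh/authorized_keys by hand):

[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ ssh-keygen -t rsa          # accept the defaults, empty passphrase
[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ for h in rcf-ai-datafeed-spark-prd-01 rcf-ai-datafeed-spark-prd-02 rcf-ai-datafeed-spark-prd-03 rcf-ai-datafeed-spark-prd-04; do ssh-copy-id ec2-user@$h; done    # copy the public key to every node, including this one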

2. Modify the following configuration files

 [ec2-user@rcf-ai-datafeed-spark-prd-02 ~]$ cd /data/server/hadoop/etc/hadoop/

[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ ls

capacity-scheduler.xml hadoop-metrics2.properties httpfs-signature.secret log4j.properties ssl-client.xml.example
configuration.xsl hadoop-metrics.properties httpfs-site.xml mapred-env.cmd ssl-server.xml.example
container-executor.cfg hadoop-policy.xml kms-acls.xml mapred-env.sh yarn-env.cmd
core-site.xml hdfs-site.xml kms-env.sh mapred-queues.xml.template yarn-env.sh
hadoop-env.cmd httpfs-env.sh kms-log4j.properties mapred-site.xml.template yarn-site.xml
hadoop-env.sh httpfs-log4j.properties kms-site.xml slaves

[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat slaves

rcf-ai-datafeed-spark-prd-02
rcf-ai-datafeed-spark-prd-03
rcf-ai-datafeed-spark-prd-04

[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ vim hadoop-env.sh 

export JAVA_HOME=/data/server/jdk  # change this line

[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat hdfs-site.xml

<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>rcf-ai-datafeed-spark-prd-01:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat core-site.xml 

<configuration>
  <property>
     <name>fs.defaultFS</name>
     <value>hdfs://rcf-ai-datafeed-spark-prd-01:9000/</value>
  </property>
  <property>
     <name>hadoop.tmp.dir</name>
     <value>file:/data/hadoop/tmp</value>
  </property>
</configuration>
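Creating the directories referenced by hadoop.tmp.dir, dfs.namenode.name.dir and dfs.datanode.data.dir up front on every node avoids permission surprises at startup (a sketch, assuming /data is writable by ec2-user as shown earlier):

[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ mkdir -p /data/hadoop/tmp /data/hadoop/dfs/name /data/hadoop/dfs/data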

[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat yarn-site.xml 

<configuration>
        <!--
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>rcf-ai-datafeed-spark-prd-01</value>
        </property>
        -->
        <property>
                <name>yarn.scheduler.maximum-allocation-vcores</name>
                <value>10</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>spark_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
                <value>org.apache.spark.network.yarn.YarnShuffleService</value>
        </property>
        <property>
                <name>yarn.nodemanager.resource.memory-mb</name>
                <value>32768</value>
        </property>
        <property>
                <name>yarn.nodemanager.resource.cpu-vcores</name>
                <value>16</value>
        </property>
        <property>
                <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
                <value>true</value>
        </property>
        <property>
                <name>yarn.nodemanager.vmem-check-enabled</name>
                <value>false</value>
                <description>Whether virtual memory limits will be enforced for containers</description>
        </property>
        <property>
                <name>yarn.nodemanager.vmem-pmem-ratio</name>
                <value>4</value>
                <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
        </property>
        <!--
        <property>
                <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        -->
        <property>
                <name>yarn.resourcemanager.webapp.address</name>
                <value>rcf-ai-datafeed-spark-prd-01:8088</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>rcf-ai-datafeed-spark-prd-01:8032</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>rcf-ai-datafeed-spark-prd-01:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>rcf-ai-datafeed-spark-prd-01:8035</value>
        </property>
        <property>
                <name>yarn.resourcemanager.admin.address</name>
                <value>rcf-ai-datafeed-spark-prd-01:8033</value>
        </property>
        <property>
                <name>spark.shuffle.service.port</name>
                <value>7337</value>
        </property>

</configuration>
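The configuration above is edited on one node, but every node in the cluster needs identical copies. A minimal sketch of pushing the directory from the node where it was edited to the other three (relies on the passwordless SSH configured earlier):

[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ for h in rcf-ai-datafeed-spark-prd-01 rcf-ai-datafeed-spark-prd-03 rcf-ai-datafeed-spark-prd-04; do rsync -a /data/server/hadoop/etc/hadoop/ $h:/data/server/hadoop/etc/hadoop/; done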

 

3. Start YARN

# cp /data/server/spark-2.4.0-bin-hadoop2.7/yarn/spark-2.4.0-yarn-shuffle.jar    /data/server/hadoop-2.7.7/share/hadoop/yarn/    # copy the Spark YARN shuffle jar that Hadoop lacks
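The shuffle service runs inside every NodeManager, so this jar also has to be present under share/hadoop/yarn/ on the slave nodes; a sketch of distributing it (host names from /etc/hosts above):

# for h in rcf-ai-datafeed-spark-prd-02 rcf-ai-datafeed-spark-prd-03 rcf-ai-datafeed-spark-prd-04; do scp /data/server/hadoop-2.7.7/share/hadoop/yarn/spark-2.4.0-yarn-shuffle.jar $h:/data/server/hadoop-2.7.7/share/hadoop/yarn/; done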

Format the NameNode: run hadoop namenode -format in a terminal (the hadoop command can be run directly because of the environment variables configured earlier).

3.1 Start DFS

Then run start-dfs.sh in a terminal to start HDFS. You will see messages as the related services start; once it finishes, use the jps command to check that the services came up correctly.

Running jps on the master shows the following services:

[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ jps

14129 NameNode
14356 SecondaryNameNode
20765 Jps

On a slave node, the services look like this:

[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ jps

6144 DataNode
7083 Jps
  • The NameNode and SecondaryNameNode services are required. If the master is not listed in the slaves file, no DataNode will run on it; otherwise a DataNode service will appear there as well.
  • Running jps on a slave node (e.g. slave1) shows only the DataNode service.
  • At this point you can open http://rcf-ai-datafeed-spark-prd-01:50070/ in a browser to see the HDFS web console (a command-line check is shown below).
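The cluster state can also be checked from the command line; the dfsadmin report lists every live DataNode together with its capacity and usage:

[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ hdfs dfsadmin -report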

3.2 Start YARN

  • Finally, run start-yarn.sh in a terminal to start YARN, then run jps again to check that the services started correctly.
  • The master node gains the ResourceManager service (and a NodeManager as well if the master is listed in the slaves file), which means the startup succeeded. Each slave node gains a NodeManager service. Open http://rcf-ai-datafeed-spark-prd-01:8088/cluster/nodes in a browser to view the YARN console; two further checks are shown below.
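Since yarn.nodemanager.aux-services is set to spark_shuffle above, each NodeManager should also start the external shuffle service on port 7337. Two quick checks (yarn node -list is run on the master; the ss check on a slave, assuming ss is available):

[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ yarn node -list
[ec2-user@rcf-ai-datafeed-spark-prd-02 ~]$ ss -lnt | grep 7337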

 

4. Configure Spark to work with Hadoop

4.1 Upload the Spark runtime jars to HDFS
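The spark.yarn.jars setting in 4.2 below points at hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/, so the jars shipped under $SPARK_HOME/jars need to be uploaded there first; a minimal sketch:

[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ hdfs dfs -mkdir -p /tmp/spark/lib_jars
[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ hdfs dfs -put /data/server/spark/jars/*.jar /tmp/spark/lib_jars/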

 

4.2 Configure Spark for Hadoop

[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ cd /data/server/spark/conf/
[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat spark-defaults.conf    # add the following

spark.yarn.jars hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/*.jar
spark.shuffle.service.enabled true
spark.shuffle.service.port 7337

spark.sql.hive.thriftServer.singleSession true

# These settings are required to use the external Spark shuffle service; for details see http://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service

[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat spark-env.sh     # add the following

export SPARK_DIST_CLASSPATH=$(/data/server/hadoop/bin/hadoop classpath)
export HADOOP_CONF_DIR=/data/server/hadoop/etc/hadoop
export YARN_CONF_DIR=/data/server/hadoop/etc/hadoop
# Directory containing Spark's configuration files; in this example, $SPARK_HOME/conf
export SPARK_CONF_DIR=/data/server/spark/conf

# Memory allocated to the Spark Thrift Server
export SPARK_DAEMON_MEMORY=1024m
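
With Spark pointed at the Hadoop configuration, the setup can be smoke-tested with the SparkPi example shipped in the Spark distribution before submitting real jobs (the examples jar name below assumes the spark-2.4.0-bin-hadoop2.7 build used earlier):

[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi /data/server/spark/examples/jars/spark-examples_2.11-2.4.0.jar 100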

 

5. Submit a Spark job on YARN

  • On spark01 (10.16.5.86), spark_home=/data/server/spark-2.4.0-bin-without-hadoop
  • Change into the Spark directory (cd ${spark_home}) and run the following command to submit the job; for the available command-line options, see the usage printed by the spark-submit script.
  • bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 1G --num-executors 2 --executor-cores 2 --executor-memory 2G --class com.wisers.cloud.spark.Main /home/ec2-user/app/character-conver.jar (your own application jar); commands for tracking the submitted job are shown below.
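
In cluster mode the driver runs inside YARN, so the job is tracked through YARN rather than the submitting terminal; the usual commands are (yarn logs requires YARN log aggregation to be enabled):

[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ yarn application -list
[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ yarn logs -applicationId <application id>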

 

