Preparation
- Download Spark: http://spark.apache.org/downloads.html
- Download the latest Spark build prebuilt *without* Hadoop; the advantage is that you are free to pair it with whichever (newer) Hadoop version you want
- Download Hadoop: https://hadoop.apache.org/releases.html
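As an example, the versions used below could be fetched and unpacked roughly like this (a sketch only: the archive mirror paths and the /data directories are assumptions, so verify them against the download pages above):

cd /data/programs
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-without-hadoop.tgz   # mirror path may differ
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz         # mirror path may differ
tar -xzf spark-2.4.0-bin-without-hadoop.tgz -C /data/server/
tar -xzf hadoop-2.7.7.tar.gz -C /data/server/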
1. Basic environment configuration
[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost6 localhost6.localdomain6
10.16.5.162 rcf-ai-datafeed-spark-prd-01.wisers.com rcf-ai-datafeed-spark-prd-01 #master
10.16.5.177 rcf-ai-datafeed-spark-prd-02.wisers.com rcf-ai-datafeed-spark-prd-02 #slave
10.16.5.22  rcf-ai-datafeed-spark-prd-03.wisers.com rcf-ai-datafeed-spark-prd-03 #slave
10.16.5.243 rcf-ai-datafeed-spark-prd-04.wisers.com rcf-ai-datafeed-spark-prd-04 #slave
[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat /etc/profile
#java config
export JAVA_HOME=/data/server/jdk
#hadoop config
export HADOOP_HOME=/data/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export YARN_CONF_DIR=$HADOOP_CONF_DIR
#scala config
export SCALA_HOME=/data/server/scala-2.11.12
#spark config
export SPARK_HOME=/data/server/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf
export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$PATH
[ec2-user@rcf-ai-datafeed-spark-prd-01 data]$ ll
total 28
drwxrwxr-x 4 ec2-user ec2-user  4096 Apr 11 12:46 hadoop
drwx------ 2 root     root     16384 Apr 10 15:48 lost+found
drwxrwxr-x 2 ec2-user ec2-user  4096 Apr 10 19:19 programs
drwxrwxr-x 6 ec2-user ec2-user  4096 Apr 10 19:19 server
[ec2-user@rcf-ai-datafeed-spark-prd-01 data]$ cd server/
[ec2-user@rcf-ai-datafeed-spark-prd-01 server]$ ll    # symlinks created with e.g. ln -s hadoop-2.7.7 hadoop
total 16
lrwxrwxrwx  1 ec2-user ec2-user   12 Apr 10 19:15 hadoop -> hadoop-2.7.7
drwxr-xr-x 10 ec2-user ec2-user 4096 Apr 11 11:23 hadoop-2.7.7
lrwxrwxrwx  1 ec2-user ec2-user   12 Apr 10 19:15 jdk -> jdk1.8.0_191
drwxr-xr-x  7 ec2-user ec2-user 4096 Oct  6  2018 jdk1.8.0_191
drwxrwxr-x  6 ec2-user ec2-user 4096 Nov 10  2017 scala-2.11.12
lrwxrwxrwx  1 ec2-user ec2-user   25 Apr 10 19:15 spark -> spark-2.4.0-bin-hadoop2.7
drwxr-xr-x 13 ec2-user ec2-user 4096 Oct 29 14:36 spark-2.4.0-bin-hadoop2.7
Note: configure passwordless SSH between all 4 machines (ssh-keygen -t rsa); a sketch follows.
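A minimal sketch of that setup, run on the master (assuming the same ec2-user account exists on every node):

ssh-keygen -t rsa                       # accept the defaults, empty passphrase
for host in rcf-ai-datafeed-spark-prd-01 rcf-ai-datafeed-spark-prd-02 \
            rcf-ai-datafeed-spark-prd-03 rcf-ai-datafeed-spark-prd-04; do
    ssh-copy-id ec2-user@$host          # appends the public key to the remote ~/.ssh/authorized_keys
done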
2. Edit the following configuration files
[ec2-user@rcf-ai-datafeed-spark-prd-02 ~]$ cd /data/server/hadoop/etc/hadoop/
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ ls
capacity-scheduler.xml  hadoop-metrics2.properties  httpfs-signature.secret  log4j.properties            ssl-client.xml.example
configuration.xsl       hadoop-metrics.properties   httpfs-site.xml          mapred-env.cmd              ssl-server.xml.example
container-executor.cfg  hadoop-policy.xml           kms-acls.xml             mapred-env.sh               yarn-env.cmd
core-site.xml           hdfs-site.xml               kms-env.sh               mapred-queues.xml.template  yarn-env.sh
hadoop-env.cmd          httpfs-env.sh               kms-log4j.properties     mapred-site.xml.template    yarn-site.xml
hadoop-env.sh           httpfs-log4j.properties     kms-site.xml             slaves
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat slaves
rcf-ai-datafeed-spark-prd-02
rcf-ai-datafeed-spark-prd-03
rcf-ai-datafeed-spark-prd-04
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ vim hadoop-env.sh
export JAVA_HOME=/data/server/jdk   # change this line
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>rcf-ai-datafeed-spark-prd-01:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://rcf-ai-datafeed-spark-prd-01:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/data/hadoop/tmp</value>
  </property>
</configuration>
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat yarn-site.xml
<configuration>
  <!--
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rcf-ai-datafeed-spark-prd-01</value>
  </property>
  -->
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>10</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>32768</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
  </property>
  <!--
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  -->
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>rcf-ai-datafeed-spark-prd-01:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>rcf-ai-datafeed-spark-prd-01:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>rcf-ai-datafeed-spark-prd-01:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>rcf-ai-datafeed-spark-prd-01:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>rcf-ai-datafeed-spark-prd-01:8033</value>
  </property>
  <property>
    <name>spark.shuffle.service.port</name>
    <value>7337</value>
  </property>
</configuration>
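The same Hadoop configuration has to be present on all four nodes. A minimal sketch to push it out from the node where the files were edited (here prd-02, so it is excluded from the loop; assuming identical install paths everywhere):

for host in rcf-ai-datafeed-spark-prd-01 rcf-ai-datafeed-spark-prd-03 rcf-ai-datafeed-spark-prd-04; do
    rsync -av /data/server/hadoop/etc/hadoop/ $host:/data/server/hadoop/etc/hadoop/
done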
3. Start YARN
# cp /data/server/spark-2.4.0-bin-hadoop2.7/yarn/spark-2.4.0-yarn-shuffle.jar /data/server/hadoop-2.7.7/share/hadoop/yarn/   # copy the Spark YARN shuffle jar that Hadoop does not ship with (see the sketch below)
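The shuffle service runs inside every NodeManager, so the same jar also has to land on each slave node; a sketch using the paths from this setup:

for host in rcf-ai-datafeed-spark-prd-02 rcf-ai-datafeed-spark-prd-03 rcf-ai-datafeed-spark-prd-04; do
    scp /data/server/spark-2.4.0-bin-hadoop2.7/yarn/spark-2.4.0-yarn-shuffle.jar \
        $host:/data/server/hadoop-2.7.7/share/hadoop/yarn/
done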
Format the NameNode by running: hadoop namenode -format (the hadoop command can be run directly thanks to the environment variables configured earlier).
3.1 Start HDFS
Then run start-dfs.sh in a terminal to start HDFS; you will see messages as the related services come up. Once it finishes, use the jps command to check whether the services started normally.
Running jps on the master shows the following services:
[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ jps
14129 NameNode
14356 SecondaryNameNode
20765 Jps
On a slave node you will see:
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ jps
6144 DataNode
7083 Jps
- NameNode and SecondaryNameNode are required on the master; a DataNode only appears there if the master itself is listed in the slaves file, otherwise the master runs no DataNode service
- Running jps on a slave node shows only the DataNode service
- At this point, open http://rcf-ai-datafeed-spark-prd-01:50070/ in a browser to see the HDFS console (a command-line check is sketched below)
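If the web UI is not reachable, the same health check can be done from the command line, e.g.:

hdfs dfsadmin -report    # lists the live DataNodes and their capacity
hdfs dfs -ls /           # confirms the NameNode answers on hdfs://rcf-ai-datafeed-spark-prd-01:9000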
3.2 Start YARN
- Finally, run start-yarn.sh in a terminal to start YARN, then run jps again to check that everything came up correctly
- The master node now also runs a ResourceManager (plus a NodeManager only if the master is listed in slaves), and each slave node gains a NodeManager service; that indicates a successful start. Open http://rcf-ai-datafeed-spark-prd-01:8088/cluster/nodes in a browser to view the YARN console (a command-line check is sketched below)
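The registered NodeManagers can also be listed from the command line:

yarn node -list          # the three slave nodes should show up in RUNNING state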
4. Configure Spark to work with Hadoop
4.1 Upload the Spark runtime jars to HDFS
- Create the target directory (if needed): hadoop fs -mkdir -p hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/
- Upload the jars (run from $SPARK_HOME): hadoop fs -put jars/* hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/
- Check that the upload succeeded: hadoop fs -ls hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/ (confirm the Spark jars now appear in HDFS)
4.2 Configure Spark to use Hadoop
[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ cd /data/server/spark/conf/
[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat spark-defaults.conf   # add the following
spark.yarn.jars                            hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/*.jar
spark.shuffle.service.enabled              true
spark.shuffle.service.port                 7337
spark.sql.hive.thriftServer.singleSession  true
# These settings enable the external Spark shuffle service; for details see: http://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat spark-env.sh   # add the following
export SPARK_DIST_CLASSPATH=$(/data/server/hadoop/bin/hadoop classpath)
export HADOOP_CONF_DIR=/data/server/hadoop/etc/hadoop
export YARN_CONF_DIR=/data/server/hadoop/etc/hadoop
# Directory holding Spark's configuration files, here $SPARK_HOME/conf
export SPARK_CONF_DIR=/data/server/spark/conf
# Memory allocated to the Spark Thrift Server daemon
export SPARK_DAEMON_MEMORY=1024m
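After restarting YARN with this configuration, a quick way to confirm the external shuffle service is actually up on a NodeManager host is to check that something listens on the port configured above (7337); which tool is available depends on the OS image:

ss -lnt | grep 7337      # or: netstat -lnt | grep 7337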
5. Submit a Spark job on YARN
- On spark01 (10.16.5.86), spark_home=/data/server/spark-2.4.0-bin-without-hadoop
- cd into the Spark directory (cd ${spark_home}) and run the following command to submit the job; the spark-submit script's usage output documents the available command-line options
- bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 1G --num-executors 2 --executor-cores 2 --executor-memory 2G --class com.wisers.cloud.spark.Main /home/ec2-user/app/character-conver.jar (your own application jar)
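Once submitted, the application can be tracked either in the YARN web UI above or from the command line, e.g.:

yarn application -list                        # shows the application ID and its state
yarn logs -applicationId <application_id>     # aggregated container logs, if log aggregation is enabled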