准備
- Download Spark from http://spark.apache.org/downloads.html
- Download the latest Spark release prebuilt without Hadoop ("without hadoop" package); the advantage is that you are free to pair it with the latest Hadoop version
- Download Hadoop from https://hadoop.apache.org/releases.html (example download commands follow this list)
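For example, the archives used later in this guide can be fetched from the Apache archive (the mirror choice is an assumption; any mirror listed on the download pages works):
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-without-hadoop.tgz
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz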
1. Basic environment configuration
[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost6 localhost6.localdomain6
10.16.5.162 rcf-ai-datafeed-spark-prd-01.wisers.com rcf-ai-datafeed-spark-prd-01   #master
10.16.5.177 rcf-ai-datafeed-spark-prd-02.wisers.com rcf-ai-datafeed-spark-prd-02   #slave
10.16.5.22  rcf-ai-datafeed-spark-prd-03.wisers.com rcf-ai-datafeed-spark-prd-03   #slave
10.16.5.243 rcf-ai-datafeed-spark-prd-04.wisers.com rcf-ai-datafeed-spark-prd-04   #slave
[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat /etc/profile
#java config
export JAVA_HOME=/data/server/jdk
#hadoop config
export HADOOP_HOME=/data/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export YARN_CONF_DIR=$HADOOP_CONF_DIR
#scala config
export SCALA_HOME=/data/server/scala-2.11.12
#spark config
export SPARK_HOME=/data/server/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf
export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$PATH
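After adding these lines to /etc/profile (on every node), reload the file in the current shell so the new variables take effect:
source /etc/profile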
[ec2-user@rcf-ai-datafeed-spark-prd-01 data]$ ll
total 28
drwxrwxr-x 4 ec2-user ec2-user  4096 Apr 11 12:46 hadoop
drwx------ 2 root     root     16384 Apr 10 15:48 lost+found
drwxrwxr-x 2 ec2-user ec2-user  4096 Apr 10 19:19 programs
drwxrwxr-x 6 ec2-user ec2-user  4096 Apr 10 19:19 server
[ec2-user@rcf-ai-datafeed-spark-prd-01 data]$ cd server/
[ec2-user@rcf-ai-datafeed-spark-prd-01 server]$ ll    # symlinks created with ln -s, e.g. ln -s hadoop-2.7.7 hadoop
total 16
lrwxrwxrwx  1 ec2-user ec2-user   12 Apr 10 19:15 hadoop -> hadoop-2.7.7
drwxr-xr-x 10 ec2-user ec2-user 4096 Apr 11 11:23 hadoop-2.7.7
lrwxrwxrwx  1 ec2-user ec2-user   12 Apr 10 19:15 jdk -> jdk1.8.0_191
drwxr-xr-x  7 ec2-user ec2-user 4096 Oct  6  2018 jdk1.8.0_191
drwxrwxr-x  6 ec2-user ec2-user 4096 Nov 10  2017 scala-2.11.12
lrwxrwxrwx  1 ec2-user ec2-user   25 Apr 10 19:15 spark -> spark-2.4.0-bin-hadoop2.7
drwxr-xr-x 13 ec2-user ec2-user 4096 Oct 29 14:36 spark-2.4.0-bin-hadoop2.7
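A minimal sketch of how this layout could be produced; the archive path below is an illustrative assumption (only the resulting directory and symlink names are taken from the listing above):
cd /data/server
tar -xzf /path/to/hadoop-2.7.7.tar.gz        # likewise extract the JDK, Scala and Spark archives here
ln -s hadoop-2.7.7 hadoop
ln -s jdk1.8.0_191 jdk
ln -s spark-2.4.0-bin-hadoop2.7 spark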
Note: configure passwordless SSH between the 4 machines (ssh-keygen -t rsa); see the sketch below.
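A minimal sketch of the passwordless-SSH setup, run as ec2-user on each of the four machines (ssh-copy-id is one common way; appending the public key to ~/.ssh/authorized_keys on each peer by hand works too):
ssh-keygen -t rsa
for h in rcf-ai-datafeed-spark-prd-01 rcf-ai-datafeed-spark-prd-02 rcf-ai-datafeed-spark-prd-03 rcf-ai-datafeed-spark-prd-04; do
  ssh-copy-id ec2-user@$h
done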
2. Modify the following configuration files
[ec2-user@rcf-ai-datafeed-spark-prd-02 ~]$ cd /data/server/hadoop/etc/hadoop/
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ ls
capacity-scheduler.xml  hadoop-metrics2.properties  httpfs-signature.secret  log4j.properties            ssl-client.xml.example
configuration.xsl       hadoop-metrics.properties   httpfs-site.xml          mapred-env.cmd              ssl-server.xml.example
container-executor.cfg  hadoop-policy.xml           kms-acls.xml             mapred-env.sh               yarn-env.cmd
core-site.xml           hdfs-site.xml               kms-env.sh               mapred-queues.xml.template  yarn-env.sh
hadoop-env.cmd          httpfs-env.sh               kms-log4j.properties     mapred-site.xml.template    yarn-site.xml
hadoop-env.sh           httpfs-log4j.properties     kms-site.xml             slaves
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat slaves
rcf-ai-datafeed-spark-prd-02
rcf-ai-datafeed-spark-prd-03
rcf-ai-datafeed-spark-prd-04
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ vim hadoop-env.sh
export JAVA_HOME=/data/server/jdk   # change this line
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>rcf-ai-datafeed-spark-prd-01:9001</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/data/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://rcf-ai-datafeed-spark-prd-01:9000/</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/data/hadoop/tmp</value>
</property>
</configuration>
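Both files reference local paths under /data/hadoop. As an extra step not shown in the original session, you can pre-create those directories on every node so that they are owned by ec2-user:
mkdir -p /data/hadoop/dfs/name /data/hadoop/dfs/data /data/hadoop/tmp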
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ cat yarn-site.xml
<configuration>
<!-- <property>
<name>yarn.resourcemanager.hostname</name>
<value>rcf-ai-datafeed-spark-prd-01</value>
</property>-->
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>10</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>32768</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>16</value>
</property>
<property>
<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
<!--<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>-->
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>rcf-ai-datafeed-spark-prd-01:8088</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>rcf-ai-datafeed-spark-prd-01:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>rcf-ai-datafeed-spark-prd-01:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>rcf-ai-datafeed-spark-prd-01:8035</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>rcf-ai-datafeed-spark-prd-01:8033</value>
</property>
<property>
<name>spark.shuffle.service.port</name>
<value>7337</value>
</property>
</configuration>
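The edited Hadoop configuration has to be identical on all four machines. A hedged sketch for pushing it from the node where it was edited (prd-02 in the prompts above) to the others; rsync is an assumption, scp works just as well:
for h in rcf-ai-datafeed-spark-prd-01 rcf-ai-datafeed-spark-prd-03 rcf-ai-datafeed-spark-prd-04; do
  rsync -a /data/server/hadoop/etc/hadoop/ $h:/data/server/hadoop/etc/hadoop/
done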
3. Start HDFS and YARN
# cp /data/server/spark-2.4.0-bin-hadoop2.7/yarn/spark-2.4.0-yarn-shuffle.jar /data/server/hadoop-2.7.7/share/hadoop/yarn/   (copy the Spark YARN shuffle jar that Hadoop ships without; see the sketch below for copying it to the slaves as well)
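Because yarn.nodemanager.aux-services above registers spark_shuffle on the NodeManagers, the same jar also has to sit in Hadoop's yarn directory on every slave. A hedged sketch (host names taken from /etc/hosts above):
for h in rcf-ai-datafeed-spark-prd-02 rcf-ai-datafeed-spark-prd-03 rcf-ai-datafeed-spark-prd-04; do
  scp /data/server/spark-2.4.0-bin-hadoop2.7/yarn/spark-2.4.0-yarn-shuffle.jar $h:/data/server/hadoop-2.7.7/share/hadoop/yarn/
done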
Format the NameNode: run hadoop namenode -format in a terminal (the hadoop command is found via the environment variables configured earlier).
3.1 Start HDFS
Then run start-dfs.sh in a terminal to start HDFS; as it runs you will see messages about the services being started. Once it has finished, use the jps command to check whether the services came up correctly.
Running jps on the master shows the following services:
[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ jps
14129 NameNode
14356 SecondaryNameNode
20765 Jps
On a slave node, jps shows:
[ec2-user@rcf-ai-datafeed-spark-prd-02 hadoop]$ jps
6144 DataNode
7083 Jps
- The NameNode and SecondaryNameNode services are required on the master. A DataNode only runs there if the master itself is listed in the slaves file; since it is not in this setup, no DataNode appears on the master.
- Running jps on the slave1 node shows only the DataNode service.
- You can now open http://rcf-ai-datafeed-spark-prd-01:50070/ in a browser to see the HDFS web console (a command-line check follows below).
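As an optional extra check (not part of the original session), hdfs dfsadmin -report lists the DataNodes registered with the NameNode, so you can confirm all three slaves joined:
hdfs dfsadmin -report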
3.2 Start YARN
- Finally, run start-yarn.sh in a terminal to start YARN, then run jps again to check whether the services started correctly.
- The master node now shows an additional ResourceManager service (plus a NodeManager if the master is also listed in slaves), and each slave node shows an additional NodeManager service, which means the startup succeeded. Open http://rcf-ai-datafeed-spark-prd-01:8088/cluster/nodes in a browser to see the YARN console (a command-line check follows below).
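Similarly, yarn node -list shows the NodeManagers registered with the ResourceManager (again an optional check, not part of the original notes):
yarn node -list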
4. Configure Spark to work with Hadoop
4.1 Upload the Spark runtime jars to HDFS
- Create the directory (if needed): hadoop fs -mkdir -p hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/
- Upload the jars: hadoop fs -put jars/* hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/
- Verify the upload: hadoop fs -ls hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/ (confirm that the Spark jars are now in HDFS)
4.2 Configure Spark for Hadoop
[ec2-user@rcf-ai-datafeed-spark-prd-01 ~]$ cd /data/server/spark/conf/
[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat spark-defaults.conf   # add the following
spark.yarn.jars                             hdfs://rcf-ai-datafeed-spark-prd-01:9000/tmp/spark/lib_jars/*.jar
spark.shuffle.service.enabled               true
spark.shuffle.service.port                  7337
spark.sql.hive.thriftServer.singleSession   true
# These settings are needed because the external Spark shuffle service is used; for details see http://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
[ec2-user@rcf-ai-datafeed-spark-prd-01 conf]$ cat spark-env.sh   # add the following
export SPARK_DIST_CLASSPATH=$(/data/server/hadoop/bin/hadoop classpath)
export HADOOP_CONF_DIR=/data/server/hadoop/etc/hadoop
export YARN_CONF_DIR=/data/server/hadoop/etc/hadoop
# Directory containing the Spark configuration files, in this example $SPARK_HOME/conf
export SPARK_CONF_DIR=/data/server/spark/conf
# Memory allocated to the Spark Thrift Server daemon
export SPARK_DAEMON_MEMORY=1024m
5. Submit a Spark job on YARN
- On spark01 (10.16.5.86), spark_home=/data/server/spark-2.4.0-bin-without-hadoop
- Go to the Spark directory (cd ${spark_home}) and run the following command to submit the job; for the command-line options, see the spark-submit script's usage help.
- bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 1G --num-executors 2 --executor-cores 2 --executor-memory 2G --class com.wisers.cloud.spark.Main /home/ec2-user/app/character-conver.jar (your own application jar; a way to follow the job afterwards is shown below)
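To follow the job after submission (optional, not part of the original notes), the YARN CLI can list running applications and fetch logs once the job finishes; the application id below is a placeholder printed by spark-submit and shown in the YARN console:
yarn application -list
yarn logs -applicationId <application_id>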
