At first this looked simple: some articles suggest just dropping in the Apache-built 2.4.0 package as a replacement, but after a lot of fiddling spark-sql still would not work.
So I decided to compile it myself, following guides from around the web.
Software versions: JDK 1.8, Maven 3.6.3, Scala 2.11.12, Spark 3.1.2
1. Download the software
```
wget http://distfiles.macports.org/scala2.11/scala-2.11.12.tgz
wget https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2.tgz
```
Put the tarballs under /opt, extract them all, and set the environment variables for the JDK, Scala, and Maven:
```
#####java#####
export JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib

######maven#######
export PATH=/opt/apache-maven-3.6.3/bin:$PATH

####scala#####
export SCALA_HOME=/opt/scala-2.11.12
export PATH=${SCALA_HOME}/bin:$PATH
```
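After sourcing the profile, a quick sanity check confirms each tool resolves to the intended version (a minimal sketch; it assumes the exports above were added to /etc/profile, and the expected outputs match the versions listed earlier):

```
source /etc/profile
java -version      # should report 1.8.0_181
mvn -v             # Apache Maven 3.6.3
scala -version     # Scala code runner version 2.11.12
```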
2. Compile Spark 3
Edit the Spark 3 POM (/opt/spark-3.1.2/pom.xml) and add the Cloudera Maven repository:
```xml
<repositories>
  <repository>
    <id>central</id>
    <!-- This should be at top, it makes maven try the central repo
         first and then others and hence faster dep resolution -->
    <name>Maven Repository</name>
    <url>https://repo1.maven.org/maven2</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
```
Change the Hadoop version in the POM.
It defaults to Hadoop 3.2; set the hadoop.version property to 3.0.0-cdh6.3.2.
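In pom.xml this is a one-line change under `<properties>`; a sketch of the edit (the stock default in Spark 3.1.2 should be 3.2.0, but the exact value you replace may differ by release):

```xml
<properties>
  <!-- was: <hadoop.version>3.2.0</hadoop.version> -->
  <hadoop.version>3.0.0-cdh6.3.2</hadoop.version>
</properties>
```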
Note 2: give Maven enough memory. If you compile with Maven directly, set the Maven memory options first; if you use make-distribution.sh, set them inside the /opt/spark-3.1.2/dev/make-distribution.sh script instead.
I used an Xmx of 4g and a ReservedCodeCacheSize of 2g; with less than that the build kept failing:
```
export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2g"
```
Note 3: switch the Scala build before compiling if needed. (Caveat: the two invocations below come from Spark 2.x-era guides; Spark 3.x dropped Scala 2.10/2.11 support, and the dev/change-scala-version.sh shipped with 3.1.2 only accepts 2.12 or 2.13.) For Scala 2.10.x:

```
cd /opt/spark-3.1.2
./dev/change-scala-version.sh 2.10
```

For Scala 2.11.x:

```
cd /opt/spark-3.1.2
./dev/change-scala-version.sh 2.11
```
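To see which Scala versions your source tree's switch script actually accepts before running it, you can inspect the script directly (assuming the stock 3.1.2 layout, where the list is kept in a VALID_VERSIONS array):

```
cd /opt/spark-3.1.2
grep -n "VALID_VERSIONS" dev/change-scala-version.sh
```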
Note 4:
The following command is recommended for the build:
```
./dev/make-distribution.sh \
  --name 3.0.0-cdh6.3.2 \
  --tgz \
  -Pyarn -Phadoop-3.0 \
  -Phive -Phive-thriftserver \
  -Dhadoop.version=3.0.0-cdh6.3.2 \
  -X
```
This uses Spark's make-distribution.sh script, which itself drives Maven under the hood:
- --tgz packages the result as a .tgz archive
- --name takes our Hadoop version string, and the generated tarball carries it as the suffix of its file name (see the naming sketch after this list, or the make-distribution.sh source)
- -Pyarn enables the YARN profile
- -Dhadoop.version=3.0.0-cdh6.3.2 pins the exact Hadoop version to build against
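For reference, the naming logic inside make-distribution.sh is roughly the following (paraphrased from the 3.1.2 script, not a verbatim quote):

```
# --name becomes $NAME; the staging dir and output tarball are named
# spark-<VERSION>-bin-<NAME>, which is why ours ends up as
# spark-3.1.2-bin-3.0.0-cdh6.3.2.tgz
TARDIR_NAME=spark-$VERSION-bin-$NAME
tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME"
```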
The build failed with the following error:
```
/root/spark-3.1.2/build/mvn: line 212:  6877 Killed    "${MVN_BIN}" -DzincPort=${ZINC_PORT} "$@"
```
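A bare "Killed" with no Maven stack trace usually means the kernel's OOM killer terminated the process rather than Maven failing on its own; one way to confirm (run as root) is to check the kernel log:

```
dmesg | grep -iE "out of memory|killed process"
```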
Fix:
Edit ./dev/make-distribution.sh and point the MVN variable at the Maven installed on the system instead of Spark's bundled build/mvn:
```
# cat make-distribution.sh
# Figure out where the Spark framework is installed
SPARK_HOME="$(cd "`dirname "$0"`/.."; pwd)"
DISTDIR="$SPARK_HOME/dist"

MAKE_TGZ=false
MAKE_PIP=false
MAKE_R=false
NAME=none
#MVN="$SPARK_HOME/build/mvn"
MVN="/opt/apache-maven-3.6.3/bin/mvn"
```
The build takes quite a while. Once it succeeds, the packaged Spark distribution in the source root is:
spark-3.1.2-bin-3.0.0-cdh6.3.2.tgz
3. Deploy
Extract the tarball and move it into place:

```
tar zxvf spark-3.1.2-bin-3.0.0-cdh6.3.2.tgz
mv spark-3.1.2-bin-3.0.0-cdh6.3.2 /opt/cloudera/parcels/CDH/lib/spark3
```
Copy the CDH cluster's spark-env.sh into /opt/cloudera/parcels/CDH/lib/spark3/conf:

```
cp /etc/spark/conf/spark-env.sh /opt/cloudera/parcels/CDH/lib/spark3/conf
```
Then update SPARK_HOME in it:
```
[root@master1 conf]# cat spark-env.sh
#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##
SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
  export SPARK_CONF_DIR="$SELF"
fi
#export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3
```
Copy hive-site.xml from a gateway node into the spark3/conf directory; it needs no changes:

```
cp /etc/hive/conf/hive-site.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/
```
Check the yarn.resourcemanager settings: verify in your CDH YARN configuration that the required options are enabled (the original post shows them in a screenshot, omitted here).
Under normal circumstances the ResourceManager enables them by default.
Create the spark-sql wrapper script:
```
cat /opt/cloudera/parcels/CDH/bin/spark-sql
#!/bin/bash
# Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

SOURCE="${BASH_SOURCE[0]}"
BIN_DIR="$( dirname "$SOURCE" )"
while [ -h "$SOURCE" ]
do
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
done
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
LIB_DIR=$BIN_DIR/../lib

export HADOOP_HOME=$LIB_DIR/hadoop

# Autodetect JAVA_HOME if not defined
. $LIB_DIR/bigtop-utils/bigtop-detect-javahome

exec $LIB_DIR/spark3/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"
```
Set up the command shortcut:
```
alternatives --install /usr/bin/spark-sql spark-sql /opt/cloudera/parcels/CDH/lib/spark3/bin/spark-sql 1
```
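To confirm the link is registered, alternatives can display the current target:

```
alternatives --display spark-sql
# /usr/bin/spark-sql should point at /opt/cloudera/parcels/CDH/lib/spark3/bin/spark-sql
```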
Test:
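The original post shows the result as a screenshot; a minimal smoke test along the same lines (assuming the Hive metastore and YARN are reachable from this node) would be:

```
spark-sql --master yarn -e "show databases;"
spark-sql --master yarn -e "select 1;"
```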
References:
https://its401.com/article/qq_26502245/120355741
https://blog.csdn.net/Mrheiiow/article/details/123007848