When setting up Hive on Spark with Hive 3.1.2 and Spark 3.1.2, it turns out that the official Hive 3.1.2 and Spark 3.1.2 releases are incompatible: Hive 3.1.2 is built against Spark 2.3.0, while Spark 3.1.2 is built against Hadoop 3.2.0.
So, to use the newer versions of Hive and Hadoop, we have to recompile Hive so that it is compatible with Spark 3.1.2.
1. Environment preparation
The build here is done on a Mac; the machine needs Java, Maven, and IntelliJ IDEA.
Note: the build does not work on Windows 10 (the required .bat files are missing); as an alternative, use a virtual machine running CentOS 7 with a desktop environment.
Download the Hive 3.1.2 source code in advance and open it in IDEA.
https://github.com/gitlbo/hive/tree/3.1.2
The source from GitHub is used here because both Hadoop 3.3.0 (installed on the cluster) and Hive 3.1.2 depend on Guava: Hadoop 3.3.0 ships guava-27.0-jre, while the official Hive 3.1.2 release ships guava-19.0.
Since Hive loads Hadoop's dependencies at runtime, this version mismatch causes a dependency conflict. Simply changing the guava version to 27.0-jre in the pom.xml of the official source package makes the build fail, so the GitHub fork is used instead.
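As a quick way to see the two Guava versions side by side on a cluster (a sketch; the paths assume the standard Hadoop layout and the Hive install directory used later in this article):

ls $HADOOP_HOME/share/hadoop/common/lib/ | grep guava
ls /opt/module/hive-3.1.2/lib/ | grep guava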
Note: after the dependencies have been downloaded, the IDE may flag many errors in the pom files; these are not necessarily real errors. Verify the build environment with the official packaging command instead.
2. Build test
Run the build command
Open a terminal and run the following command to package the project and verify that the build environment works:
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true
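Here -Pdist activates the profile that assembles the binary distribution, while -DskipTests and -Dmaven.javadoc.skip=true skip the tests and javadoc generation to keep the build fast.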
3. Problems you may encounter
If you don't run into any of these, skip this section.
Maven build error:
[ERROR] Failed to execute goal on project hive-upgrade-acid: Could not resolve dependencies for project org.apache.hive:hive-upgrade-acid:jar:3.1.2: Failure to find org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde in http://maven.aliyun.com/nexus/content/groups/public/ was cached in the local repository, resolution will not be reattempted until the update interval of alimaven has elapsed or updates are forced -> [Help 1]
3.1 Fixing the missing jar
pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar is missing.
Option 1
Add the following two mirror entries inside the <mirrors> element of Maven's settings.xml:
<mirror>
    <id>aliyunmaven</id>
    <mirrorOf>*</mirrorOf>
    <name>spring-plugin</name>
    <url>https://maven.aliyun.com/repository/spring-plugin</url>
</mirror>
<mirror>
    <id>repo2</id>
    <name>Mirror from Maven Repo2</name>
    <url>https://repo.spring.io/plugins-release/</url>
    <mirrorOf>central</mirrorOf>
</mirror>
Re-run the build command.
Option 2
Manually download the jar and put it into the corresponding directory of the local Maven repository.
Jar download link
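Once downloaded, one way to put the jar into the local repository is mvn install:install-file, using the coordinates from the error message above (a sketch; assumes the jar sits in the current directory):

mvn install:install-file \
  -DgroupId=org.pentaho \
  -DartifactId=pentaho-aggdesigner-algorithm \
  -Dversion=5.1.5-jhyde \
  -Dpackaging=jar \
  -Dfile=pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar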
Re-run the build command.
Try these two approaches a few times; one of them usually resolves the problem.
3.2 error in opening zip file
If you see an error like "error reading \XX\XXX..jar; error in opening zip file", the cached jar is corrupt: locate the corresponding directory in the local repository, delete the jar, and let Maven download it again.
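To hunt down every corrupt archive in one pass, a small sketch like this can help (assumes unzip is installed; it only prints the offenders, delete them afterwards):

find ~/.m2/repository -name "*.jar" | while read f; do
  unzip -t "$f" > /dev/null 2>&1 || echo "corrupt: $f"
done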
3.3 After correcting the problems, you can resume the build with the command
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.4:javadoc (resourcesdoc.xml) on project hive-webhcat: An error has occurred in JavaDocs report generation:Unable to find javadoc command: The environment variable JAVA_HOME is not correctly set. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :hive-webhcat
This is caused by JAVA_HOME not being set. If the build still fails after setting it, restart IDEA.
It can also happen when the Maven JDK in IDEA is set to "USE JAVA_HOME"; switching it to the project JDK lets the project build successfully.
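On a Mac, JAVA_HOME can be set in the shell like this before launching IDEA or Maven (a sketch assuming JDK 8 is installed):

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
export PATH=$JAVA_HOME/bin:$PATH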
Output of a successful build:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Hive 3.1.2:
[INFO]
[INFO] Hive Upgrade Acid .................................. SUCCESS [  5.537 s]
[INFO] Hive ............................................... SUCCESS [  0.232 s]
[INFO] Hive Classifications ............................... SUCCESS [  0.393 s]
[INFO] Hive Shims Common .................................. SUCCESS [  1.809 s]
[INFO] Hive Shims 0.23 .................................... SUCCESS [  2.859 s]
[INFO] Hive Shims Scheduler ............................... SUCCESS [  1.573 s]
[INFO] Hive Shims ......................................... SUCCESS [  1.018 s]
[INFO] Hive Common ........................................ SUCCESS [  7.054 s]
[INFO] Hive Service RPC ................................... SUCCESS [  2.797 s]
[INFO] Hive Serde ......................................... SUCCESS [  4.794 s]
[INFO] Hive Standalone Metastore .......................... SUCCESS [ 27.884 s]
[INFO] Hive Metastore ..................................... SUCCESS [  2.779 s]
[INFO] Hive Vector-Code-Gen Utilities ..................... SUCCESS [  0.237 s]
[INFO] Hive Llap Common ................................... SUCCESS [  3.263 s]
[INFO] Hive Llap Client ................................... SUCCESS [  2.194 s]
[INFO] Hive Llap Tez ...................................... SUCCESS [  2.383 s]
[INFO] Hive Spark Remote Client ........................... SUCCESS [  2.915 s]
[INFO] Hive Query Language ................................ SUCCESS [ 52.792 s]
[INFO] Hive Llap Server ................................... SUCCESS [  5.707 s]
[INFO] Hive Service ....................................... SUCCESS [  5.299 s]
[INFO] Hive Accumulo Handler .............................. SUCCESS [  3.621 s]
[INFO] Hive JDBC .......................................... SUCCESS [ 18.186 s]
[INFO] Hive Beeline ....................................... SUCCESS [  3.277 s]
[INFO] Hive CLI ........................................... SUCCESS [  2.593 s]
[INFO] Hive Contrib ....................................... SUCCESS [  2.074 s]
[INFO] Hive Druid Handler ................................. SUCCESS [ 13.076 s]
[INFO] Hive HBase Handler ................................. SUCCESS [  4.767 s]
[INFO] Hive JDBC Handler .................................. SUCCESS [  2.537 s]
[INFO] Hive HCatalog ...................................... SUCCESS [  0.439 s]
[INFO] Hive HCatalog Core ................................. SUCCESS [  4.441 s]
[INFO] Hive HCatalog Pig Adapter .......................... SUCCESS [  2.914 s]
[INFO] Hive HCatalog Server Extensions .................... SUCCESS [  2.732 s]
[INFO] Hive HCatalog Webhcat Java Client .................. SUCCESS [  2.935 s]
[INFO] Hive HCatalog Webhcat .............................. SUCCESS [  5.959 s]
[INFO] Hive HCatalog Streaming ............................ SUCCESS [  3.133 s]
[INFO] Hive HPL/SQL ....................................... SUCCESS [  4.280 s]
[INFO] Hive Streaming ..................................... SUCCESS [  2.540 s]
[INFO] Hive Llap External Client .......................... SUCCESS [  2.564 s]
[INFO] Hive Shims Aggregator .............................. SUCCESS [  0.051 s]
[INFO] Hive Kryo Registrator .............................. SUCCESS [  2.208 s]
[INFO] Hive TestUtils ..................................... SUCCESS [  0.156 s]
[INFO] Hive Packaging ..................................... SUCCESS [ 55.564 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:33 min
[INFO] Finished at: 2021-06-12T23:32:27+08:00
[INFO] ------------------------------------------------------------------------
After a successful build, the finished distribution can be found under the **/packaging/target directory.
4. Integrating Spark 3.1.2
4.1 Modify pom.xml
Change line 201 of pom.xml from
<spark.version>3.0.0</spark.version>
to
<spark.version>3.1.2</spark.version>
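To confirm the edit, a quick check from the source root:

grep -n "<spark.version>" pom.xml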
4.2 Recompile
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true
The build usually finishes within a few minutes.
If you build the official 3.1.2 release, this step fails and the source has to be modified; the GitHub fork already contains those modifications.
5. Hive on Spark configuration
Adjust the paths below to match your own environment.
5.1 Extract spark-3.1.2-bin-without-hive.tgz
[bigdata@bigdata-node00001 software]$ tar -xzf spark-3.1.2-bin-without-hive.tgz -C /opt/module
[bigdata@bigdata-node00001 software]$ cd /opt/module/
[bigdata@bigdata-node00001 module]$ mv spark-3.1.2-bin-without-hive/ spark-3.1.2
5.2 Configure the SPARK_HOME environment variable
[bigdata@bigdata-node00001 module]$ sudo vim /etc/profile.d/my_env.sh
Add the following:
#SPARK_HOME
export SPARK_HOME=/opt/module/spark-3.1.2
export PATH=$PATH:$SPARK_HOME/bin
Apply it with source:
[bigdata@bigdata-node00001 module]$ source /etc/profile.d/my_env.sh
Note: my_env.sh is a custom configuration file; the system default is /etc/profile.
5.3 Configure the Spark runtime environment
[bigdata@bigdata-node00001 module]$ cd spark-3.1.2/
[bigdata@bigdata-node00001 spark-3.1.2]$ cp conf/spark-env.sh.template conf/spark-env.sh
[bigdata@bigdata-node00001 spark-3.1.2]$ vim conf/spark-env.sh
Add the following:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
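The without-hive build ships no Hadoop classes, so Spark takes them from the local Hadoop installation at launch; you can inspect what will be added with:

hadoop classpath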
5.4 Link the Spark jars into Hive (skip if Hive already has them)
Three files are involved: scala-library-2.12.10.jar, spark-core_2.12-3.1.2.jar, and spark-network-common_2.12-3.1.2.jar.
To create symbolic links, use ln -s; here the jars are simply copied into the target directory (a symlink sketch follows the copy commands below).
[bigdata@bigdata-node00001 software]$ cp /opt/module/spark-3.1.2/jars/scala-library-2.12.10.jar /opt/module/hive-3.1.2/lib/
[bigdata@bigdata-node00001 software]$ cp /opt/module/spark-3.1.2/jars/spark-core_2.12-3.1.2.jar /opt/module/hive-3.1.2/lib/
[bigdata@bigdata-node00001 software]$ cp /opt/module/spark-3.1.2/jars/spark-network-common_2.12-3.1.2.jar /opt/module/hive-3.1.2/lib/
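If you prefer symbolic links over copies, the equivalent would be (a sketch; same source and target paths as above):

ln -s /opt/module/spark-3.1.2/jars/scala-library-2.12.10.jar /opt/module/hive-3.1.2/lib/
ln -s /opt/module/spark-3.1.2/jars/spark-core_2.12-3.1.2.jar /opt/module/hive-3.1.2/lib/
ln -s /opt/module/spark-3.1.2/jars/spark-network-common_2.12-3.1.2.jar /opt/module/hive-3.1.2/lib/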
5.5 Create a Spark configuration file
[bigdata@bigdata-node00001 software]$ vim /opt/module/hive-3.1.2/conf/spark-defaults.conf
Add the following:
spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://node00001:8020/spark-history
spark.driver.memory              4g
spark.executor.memory            4g
Note: adjust these parameters to suit your own cluster.
5.6 Create the following path on HDFS
hadoop fs -mkdir /spark-history
5.7 Upload the Spark dependencies to HDFS
[bigdata@bigdata-node00001 software]$ hadoop fs -mkdir /spark-jars
[bigdata@bigdata-node00001 software]$ hadoop fs -put /opt/module/spark-3.1.2/jars/* /spark-jars
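A quick check that the upload worked:

hadoop fs -ls /spark-jars | head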
5.8 Modify hive-site.xml
<!-- Spark dependency location -->
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://node00001:8020/spark-jars/*</value>
</property>
<!-- Hive execution engine -->
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
<!-- Hive-to-Spark connection timeout -->
<property>
    <name>hive.spark.client.connect.timeout</name>
    <value>10000ms</value>
</property>
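After restarting Hive, the engine setting can be sanity-checked from the Hive CLI (it should print spark):

hive (default)> set hive.execution.engine;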
6. Testing Hive on Spark
6.1 Start the environment
Start ZooKeeper, the Hadoop cluster, and Hive, then open the Hive client; a startup sketch follows.
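As a sketch, assuming the standard ZooKeeper and Hadoop start scripts are on the PATH (add your own metastore/HiveServer2 startup if you run them as services):

zkServer.sh start      # on each ZooKeeper node
start-dfs.sh           # HDFS, on the NameNode host
start-yarn.sh          # YARN, on the ResourceManager host
hive                   # Hive CLI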
6.2 Insert test data
Create a test table:
hive (default)> create external table student(id int, name string) location '/student';
Insert one test row:
hive (default)> insert into table student values(1,'abc');
Execution output:
Query ID = bigdata_20210613144232_7bafd4ac-0552-4d67-b53c-dc04b2a6f45c
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1623236770182_0016
Kill Command = /opt/module/hadoop-3.3.0/bin/yarn application -kill application_1623236770182_0016
Hive on Spark Session Web UI URL: http://node00001:39673
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      1          1        0        0       0
Stage-1 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 5.05 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 5.05 second(s)
Loading data to table default.student
OK
Time taken: 21.836 seconds
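As a final check, querying the table should return the row just inserted:

hive (default)> select * from student;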