[Original] Big Data Basics: Hive (5) Hive on Spark


hive 2.3.4 on spark 2.4.0

 

Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine.

set hive.execution.engine=spark;

1 version

Hive on Spark is only tested with a specific version of Spark, so a given version of Hive is only guaranteed to work with a specific version of Spark. Other versions of Spark may work with a given version of Hive, but that is not guaranteed. Below is a list of Hive versions and their corresponding compatible Spark versions.

The version pairings in that table (see the Hive on Spark wiki pages linked in the references) have been tested; other versions may also work, but they would need to be tested.

2 yarn

Instead of the capacity scheduler, the fair scheduler is required.  This fairly distributes an equal share of resources for jobs in the YARN cluster.

yarn-site.xml

    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
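
After changing the scheduler class, the ResourceManager has to be restarted for the new scheduler to take effect. A minimal sketch, assuming the standard Hadoop 2.x daemon scripts under $HADOOP_HOME/sbin:

# restart the ResourceManager so it picks up the FairScheduler setting
$ $HADOOP_HOME/sbin/yarn-daemon.sh stop resourcemanager
$ $HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager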

3 spark

$ export SPARK_HOME=...

Note that you must have a version of Spark which does not include the Hive jars, meaning one which was not built with the Hive profile. If you will use Parquet tables, it's recommended to also enable the "parquet-provided" profile. Otherwise there could be conflicts in the Parquet dependency.

You cannot simply reuse an existing Spark installation directory: the Hive dependency and the Parquet dependency bundled in it both easily cause conflicts.
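
A quick way to tell whether a given Spark directory was built with the Hive profile is to look for Hive (and Parquet) jars under its jars directory; an illustrative check, not part of the original steps:

# list any bundled Hive/Parquet jars; an empty result means the build is "without hive"
$ ls $SPARK_HOME/jars | grep -iE 'hive|parquet'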

4 library

$ ln -s $SPARK_HOME/jars/scala-library-2.11.8.jar $HIVE_HOME/lib/scala-library-2.11.8.jar
$ ln -s $SPARK_HOME/jars/spark-core_2.11-2.0.2.jar $HIVE_HOME/lib/spark-core_2.11-2.0.2.jar
$ ln -s $SPARK_HOME/jars/spark-network-common_2.11-2.0.2.jar $HIVE_HOME/lib/spark-network-common_2.11-2.0.2.jar

Prior to Hive 2.2.0, link the spark-assembly jar to HIVE_HOME/lib

Versions before Spark 2 ship a single spark-assembly.jar; link that jar directly into HIVE_HOME/lib.
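
The jar names above use the versions from the wiki example (Scala 2.11.8, Spark 2.0.2); with Spark 2.4.0 the file names differ. A minimal sketch that links the same three jars without hard-coding versions, assuming exactly one match per pattern in $SPARK_HOME/jars:

# link scala-library, spark-core and spark-network-common into Hive's lib directory
$ ln -s "$SPARK_HOME"/jars/scala-library-*.jar "$HIVE_HOME"/lib/
$ ln -s "$SPARK_HOME"/jars/spark-core_*.jar "$HIVE_HOME"/lib/
$ ln -s "$SPARK_HOME"/jars/spark-network-common_*.jar "$HIVE_HOME"/lib/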

5 hive

$ hive
hive> set hive.execution.engine=spark;

By default spark.master=yarn. Additional configuration options:

set spark.master=<Spark Master URL>
set spark.eventLog.enabled=true;
set spark.eventLog.dir=<Spark event log folder (must exist)>
set spark.executor.memory=512m;
set spark.executor.instances=10;
set spark.executor.cores=1;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;

The settings above can be executed directly like any other Hive configuration property, placed in hive-site.xml, or put in a spark-defaults.conf file under HIVE_CONF_DIR.

This can be done either by adding a file "spark-defaults.conf" with these properties to the Hive classpath, or by setting them on Hive configuration (hive-site.xml).
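
For example, the same settings in spark-defaults.conf form (a minimal sketch; the spark.eventLog.dir value below is only illustrative and must point to an existing directory):

# $HIVE_CONF_DIR/spark-defaults.conf
spark.master                  yarn
spark.eventLog.enabled        true
spark.eventLog.dir            hdfs:///spark-logs
spark.executor.memory         512m
spark.executor.instances      10
spark.executor.cores          1
spark.serializer              org.apache.spark.serializer.KryoSerializer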

6 errors

Running SQL in Hive fails with:

FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client

The Hive execution log is located at /tmp/$user/hive.log.
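
Here $user stands for the OS user running Hive; to follow the log while reproducing the problem (an illustrative command):

# tail the log of the current OS user ($USER is the shell's current-user variable)
$ tail -f /tmp/$USER/hive.log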

Detailed error log:

2019-03-05 11:06:43 ERROR ApplicationMaster:91 - User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)

This happens because the Spark package was built with the Hive dependency included; try a package built without Hive:

https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.4-without-hive.tgz
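
Download and unpack it (illustrative commands):

$ wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.4-without-hive.tgz
$ tar xzf spark-2.0.0-bin-hadoop2.4-without-hive.tgz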

Running again, a Parquet version conflict is reported:

Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$MessageTypeBuilder.addFields([Lorg/apache/parquet/schema/Type;)Lorg/apache/parquet/schema/Types$BaseGroupBuilder;

The only option left is to build Spark from source.

1) Spark 2.0-2.2

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

This produces spark-2.0.2-bin-hadoop2-without-hive.tgz.

2) Spark 2.3 and later

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"

This produces spark-2.4.0-bin-hadoop2-without-hive.tgz.
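
A simple sanity check on the resulting tarball (an illustrative command, not from the original post): with the Hive and Parquet dependencies excluded, the grep below should normally print nothing.

# look for hive-* or parquet-* jars inside the packaged jars/ directory
$ tar tzf spark-2.4.0-bin-hadoop2-without-hive.tgz | grep -iE 'jars/(hive|parquet)'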

 

Running again with spark-2.0.2-bin-hadoop2-without-hive.tgz, there is still an error:

2019-03-05T17:10:55,537 ERROR [901dc3cf-a990-4e8b-95ec-fcf6a9c9002c main] ql.Driver: FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.

Detailed error log:

2019-03-05T17:08:37,364 INFO [stderr-redir-1] client.SparkClientImpl: Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream

Jars are missing: the hadoop-provided build does not bundle the Hadoop jars and their dependencies, so copy them from a full Spark binary distribution (here spark-2.4.0-bin-hadoop2.6):

$ cd spark-2.0.2-bin-hadoop2-without-hive
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/hadoop-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/slf4j-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/log4j-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/guava-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/commons-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/protobuf-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/htrace-* jars/

This time it works; the SQL execution output:

Query ID = hadoop_20190305180847_e8b638c8-394c-496d-a43e-26a0a17f9e18
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Spark Job = d5fea72c-c67c-49ec-9f4c-650a795c74c3
Running with YARN Application = application_1551754784891_0008
Kill Command = $HADOOP_HOME/bin/yarn application -kill application_1551754784891_0008

Query Hive on Spark job[1] stages: [2, 3]

Status: Running (Hive on Spark job[1])
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-2 ........         0      FINISHED    275        275        0        0       0
Stage-3 ........         0      FINISHED   1009       1009        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 149.58 s
--------------------------------------------------------------------------------------
Status: Finished successfully in 149.58 seconds
OK

Using spark-2.4.0-bin-hadoop2-without-hive.tgz also works without problems.

 

References:

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

 

