Hive on Spark is honestly a real pain to get working!!!
I. Software preparation
Maven 3.3.9
Spark 2.0.0
Hive 2.3.3
Hadoop 2.7.6
II. Download the Spark 2.0.0 source and build it
Download from: http://archive.apache.org/dist/spark/spark-2.0.0/
Build: ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
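Putting the whole step together, here is a minimal sketch of the download-and-build sequence (the working directory, and Maven 3.3.9 plus a JDK already on the PATH, are assumptions):
wget http://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0.tgz
tar -zxvf spark-2.0.0.tgz && cd spark-2.0.0
# builds a Spark distribution without Hive jars and without bundled Hadoop;
# the result, spark-2.0.0-bin-hadoop2-without-hive.tgz, lands in the current directory
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"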
III. Extract the built spark-2.0.0-bin-hadoop2-without-hive.tgz (tar -zxvf) into an install directory
Set the $SPARK_HOME environment variable in /etc/profile, then run ". /etc/profile" so the variable takes effect.
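A sketch of this step, assuming /usr/share as the install prefix (to match the Hive path used below); adjust the paths to your own layout:
tar -zxvf spark-2.0.0-bin-hadoop2-without-hive.tgz -C /usr/share/
# append SPARK_HOME to /etc/profile (requires root), then reload it
echo 'export SPARK_HOME=/usr/share/spark-2.0.0-bin-hadoop2-without-hive' >> /etc/profile
. /etc/profile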
Next, configure Hive, Spark, and YARN.
1) Configure Hive
1. Copy the following jars from $SPARK_HOME/jars into Hive's lib directory (the cp commands below are run from $SPARK_HOME/jars):
- cp scala-library-2.11.8.jar /usr/share/hive-2.3.3/lib/
- cp spark-core_2.11-2.0.0.jar /usr/share/hive-2.3.3/lib/
- cp spark-network-common_2.11-2.0.0.jar /usr/share/hive-2.3.3/lib/
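The same copy as a single command, as a convenience sketch (the Hive lib path is the one used above):
cd $SPARK_HOME/jars && cp scala-library-2.11.8.jar spark-core_2.11-2.0.0.jar spark-network-common_2.11-2.0.0.jar /usr/share/hive-2.3.3/lib/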
2. Create a file spark-defaults.conf under Hive's conf directory. Note that spark-defaults.conf takes "key value" pairs rather than Hive's "set key=value;" syntax, and that hive.execution.engine=spark is a Hive property, so it belongs in hive-site.xml or is set per session in the Hive CLI. The Spark properties:
spark.master                 yarn
spark.submit.deployMode      client
spark.eventLog.enabled       true
spark.executor.memory        2g
spark.serializer             org.apache.spark.serializer.KryoSerializer
3. Modify hive-site.xml
Purpose: let YARN cache the jars Spark depends on onto each NodeManager node, so they are not redistributed every time an application runs.
Upload all jars under $SPARK_HOME/jars to an HDFS directory (for example hdfs://bi/spark-jars/):
1) hdfs dfs -put ../jars /spark-jars    # upload the Spark dependency jars to the /spark-jars directory on HDFS (run from a subdirectory of $SPARK_HOME, hence ../jars)
2) Add the following to hive-site.xml:
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://bi/spark-jars/*</value>
</property>
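For reference, the same upload written with explicit paths, as a sketch; "bi" is the HA nameservice from the value above, so adjust it to your cluster:
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put $SPARK_HOME/jars/* /spark-jars/
hdfs dfs -ls hdfs://bi/spark-jars/ | head    # verify the jars are in place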
2) Configure Spark
cp spark-env.sh.template spark-env.sh
Add the following to spark-env.sh:
export SPARK_DIST_CLASSPATH=$(/usr/share/hadoop-HA/hadoop-2.7.6/bin/hadoop classpath)
export HADOOP_HOME=/usr/share/hadoop-HA/hadoop-2.7.6
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop/
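A quick sanity check, as a sketch (paths are the ones assumed above): after sourcing spark-env.sh, SPARK_DIST_CLASSPATH should list the Hadoop jars that the "hadoop-provided" build expects to find at runtime:
source $SPARK_HOME/conf/spark-env.sh
echo "$SPARK_DIST_CLASSPATH" | tr ':' '\n' | grep -i hadoop | head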
IV. Testing
Start the metastore: nohup hive --service metastore &
Start HiveServer2: nohup hive --service hiveserver2 &
Then, in a Hive session, switch the execution engine and run a query:
set hive.execution.engine=spark;
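For example, a minimal end-to-end check through beeline (the connection URL, user, and table name are assumptions; adjust to your setup). A successful run should show up as a Spark application in the YARN web UI:
beeline -u jdbc:hive2://localhost:10000 -n hive \
  -e "set hive.execution.engine=spark; select count(*) from your_test_table;"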