1. Hive execution engines
Hive uses MapReduce as its default execution engine (Hive on MR). Hive can also run on Tez or Spark, known as Hive on Tez and Hive on Spark respectively. Because MapReduce writes every intermediate result to disk while Spark keeps intermediate data in memory, Spark is generally much faster than MapReduce overall.
By default, Hive on Spark supports Spark running in YARN mode.
2. Prerequisites: JDK 1.8, hadoop-2.7.2, etc. are already installed (see my earlier posts).
3. Download the hive-2.1.1 source (hive-2.1.1.src.tar.gz) and extract it. Open pom.xml and note that the Spark version is 1.6.0. The official documentation states that the versions must match for the two to be compatible, e.g. Hive 2.1.1 with Spark 1.6.0.
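For example, the Spark version Hive was built against can be read straight from the extracted source tree (a quick sketch; the directory name depends on how the tarball unpacks on your machine):
grep -m1 "<spark.version>" apache-hive-2.1.1-src/pom.xml
# expected output: <spark.version>1.6.0</spark.version>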
4. Download the spark-1.6.0.tgz source (the prebuilt packages available online all bundle Hive support, so Spark has to be rebuilt without it).
5. Upload it to the Linux server and extract it.
6. Build from source
#cd spark-1.6.0
# Edit make-distribution.sh and change the MVN path to /usr/app/maven/bin/mvn ### check the Maven version required by pom.xml and install it
#./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
# The build takes an hour or more; make sure the machine can reach the internet. If dependencies fail to download because the external network is unreachable, set up outbound access or configure the Aliyun Maven mirror, then rebuild.
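If repeated dependency failures are the problem, Maven can be pointed at the Aliyun mirror. A minimal sketch, assuming Maven reads ~/.m2/settings.xml and that the Aliyun public mirror URL below is still current (verify both first; note the heredoc overwrites any existing settings.xml):
cat > ~/.m2/settings.xml <<'EOF'
<settings>
  <mirrors>
    <mirror>
      <id>aliyun</id>
      <mirrorOf>central</mirrorOf>
      <name>Aliyun public mirror</name>
      <url>https://maven.aliyun.com/repository/public</url>
    </mirror>
  </mirrors>
</settings>
EOF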
7. Configuration
# vim /etc/hosts, add: 192.168.66.66 xinfang
# Extract spark-1.6.0-bin-hadoop2-without-hive.tgz and rename the directory to spark
# Download hive-2.1.1 from the official site, extract it and rename it to hive (for detailed Hive configuration, see http://blog.csdn.net/xinfang520/article/details/77774522)
# Download scala-2.10.5 from the official site, extract it and rename it to scala
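The extract-and-rename steps amount to something like the following (a sketch; adjust the tarball file names to the packages actually downloaded or built):
cd /usr/app
tar -zxvf spark-1.6.0-bin-hadoop2-without-hive.tgz && mv spark-1.6.0-bin-hadoop2-without-hive spark
tar -zxvf apache-hive-2.1.1-bin.tar.gz && mv apache-hive-2.1.1-bin hive
tar -zxvf scala-2.10.5.tgz && mv scala-2.10.5 scala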
#chmod -R 755 /usr/app/spark /usr/app/hive /usr/app/scala
# Configure the environment variables: vim /etc/profile
#set hive
export HIVE_HOME=/usr/app/hive
export PATH=$PATH:$HIVE_HOME/bin
#set spark
export SPARK_HOME=/usr/app/spark
export PATH=$SPARK_HOME/bin:$PATH
#set scala
export SCALA_HOME=/usr/app/scala
export PATH=$SCALA_HOME/bin:$PATH
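After saving /etc/profile, reload it so the new variables take effect in the current shell, then do a quick sanity check, for example:
source /etc/profile
echo $HIVE_HOME $SPARK_HOME $SCALA_HOME
scala -version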
# Configure /spark/conf/spark-env.sh
export JAVA_HOME=/usr/app/jdk1.8.0
export SCALA_HOME=/usr/app/scala
export HADOOP_HOME=/usr/app/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_WORKER_MEMORY=512m
export SPARK_DRIVER_MEMORY=512m
export SPARK_MASTER_IP=192.168.66.66
#export SPARK_EXECUTOR_MEMORY=512M
export SPARK_HOME=/usr/app/spark
export SPARK_LIBRARY_PATH=/usr/app/spark/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_WORKER_DIR=/usr/app/spark/work
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_LOG_DIR=/usr/app/spark/logs
export SPARK_PID_DIR='/usr/app/spark/run'
# Configure /spark/conf/spark-defaults.conf
spark.master spark://xinfang:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://xinfang:9000/spark-log
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory 512m
spark.driver.memory 512m
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# Edit hive-site.xml (for a detailed Hive deployment, see http://blog.csdn.net/xinfang520/article/details/77774522)
<configuration>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.66.66:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>1</value>
  </property>
  <!--
  <property>
    <name>hive.hwi.listen.host</name>
    <value>192.168.66.66</value>
  </property>
  <property>
    <name>hive.hwi.listen.port</name>
    <value>9999</value>
  </property>
  <property>
    <name>hive.hwi.war.file</name>
    <value>lib/hive-hwi-2.1.1.war</value>
  </property>
  -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/user/hive/tmp</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/user/hive/log</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>192.168.66.66</value>
  </property>
  <property>
    <name>hive.server2.webui.host</name>
    <value>192.168.66.66</value>
  </property>
  <property>
    <name>hive.server2.webui.port</name>
    <value>10002</value>
  </property>
  <property>
    <name>hive.server2.long.polling.timeout</name>
    <value>5000</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
  <!-- hive on mr -->
  <!--
  <property>
    <name>mapred.job.tracker</name>
    <value>http://192.168.66.66:9001</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  -->
  <!-- hive on spark or spark on yarn -->
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
  </property>
  <property>
    <name>spark.home</name>
    <value>/usr/app/spark</value>
  </property>
  <property>
    <name>spark.master</name>
    <value>spark://xinfang:7077</value> <!-- or yarn-cluster / yarn-client -->
  </property>
  <property>
    <name>spark.submit.deployMode</name>
    <value>client</value>
  </property>
  <property>
    <name>spark.eventLog.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.eventLog.dir</name>
    <value>hdfs://xinfang:9000/spark-log</value>
  </property>
  <property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
  </property>
  <property>
    <name>spark.executor.memory</name>
    <value>512m</value>
  </property>
  <property>
    <name>spark.driver.memory</name>
    <value>512m</value>
  </property>
  <property>
    <name>spark.executor.extraJavaOptions</name>
    <value>-XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"</value>
  </property>
</configuration>
# Create the required directories
hadoop fs -mkdir -p /spark-log
hadoop fs -chmod 777 /spark-log
mkdir -p /usr/app/spark/work /usr/app/spark/logs /usr/app/spark/run
mkdir -p /usr/app/hive/logs
# Copy hive-site.xml into spark/conf (this step is critical)
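Since spark.master points at spark://xinfang:7077, the standalone master and a worker need to be running before any query is submitted. A minimal sketch using the scripts shipped with the distribution built above:
/usr/app/spark/sbin/start-master.sh
/usr/app/spark/sbin/start-slave.sh spark://xinfang:7077
jps   # a Master and a Worker process should now be listed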
# Start the Hive CLI
hive> set hive.execution.engine=spark;   (set the execution engine to Spark; the default is mr. The setting is lost when you leave the Hive CLI; to make Spark the default engine permanently, set it in hive-site.xml)
hive> create table test(ts BIGINT, line STRING);   (create a table)
hive> select count(*) from test;
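A freshly created table is empty, so the count comes back as 0. To watch Spark do some real work, a few rows can be loaded first (a sketch; /tmp/test.txt is a hypothetical sample file with one record per line):
hive> load data local inpath '/tmp/test.txt' into table test;
hive> select count(*) from test;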
If the whole process completes without errors and returns the correct result, Hive on Spark is configured successfully.
# Spark master web UI: http://192.168.66.66:18080
8. Workarounds collected from other posts
Pitfall 1: the simplest way to let Hive use the Spark execution engine is to copy the spark-assembly-1.5.0-hadoop2.4.0.jar directly into $HIVE_HOME/lib.
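For a Spark 1.x build that amounts to something like the following (a sketch; match the jar name to the assembly your build actually produced):
cp /usr/app/spark/lib/spark-assembly-*.jar /usr/app/hive/lib/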
Pitfall 2: version mismatch. I initially assumed Hive could use any Spark version, but Hive is in fact strict about the Spark version. To find the matching version, download the Hive source and look up the Spark version in its pom.xml. With a mismatched version, Hive reports an error after startup, such as:
Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
Pitfall 3: after building with ./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4", Spark fails to start with a class-not-found error.
The fix is to add the following to spark-env.sh: export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# If duplicate logging jars are reported at startup, delete the redundant one
# Adjust hive/bin/hive as needed (from Spark 2 onward the assembly jar has been split into separate jars):
sparkAssemblyPath='ls ${SPARK_HOME}/lib/spark-assembly-*.jar'
Change it to: sparkAssemblyPath='ls ${SPARK_HOME}/jars/*.jar'
# For Spark 1.x, copy spark/lib/spark-* into /usr/app/hive/lib
9. References
# See http://spark.apache.org/docs/latest/building-spark.html
# See http://www.cnblogs.com/linbingdong/p/5806329.html
# See http://blog.csdn.net/pucao_cug/article/details/72773564
# See https://cwiki.apache.org//confluence/display/Hive/Hive+on+Spark:+Getting+Started