1. Versions
Note: Hive on Spark is strict about version compatibility. The versions listed below have been verified to work together.
a) apache-hive-2.3.2-bin.tar.gz
b) hadoop-2.7.2.tar.gz
c) jdk-8u144-linux-x64.tar.gz
d) mysql-5.7.19-1.el7.x86_64.rpm-bundle.tar
e) mysql-connector-java-5.1.43-bin.jar
f) spark-2.0.0.tgz (the Spark source package; it must be built from source)
g) Red Hat Linux 7.4, 64-bit
2. Install Linux and the JDK, and Disable the Firewall
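A minimal sketch of this step, assuming a Red Hat 7 host with firewalld and the installation directory /root/training used throughout this guide:
tar -zxvf jdk-8u144-linux-x64.tar.gz -C /root/training/
# add the two lines below to ~/.bash_profile, then run: source ~/.bash_profile
export JAVA_HOME=/root/training/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH
# disable the firewall
systemctl stop firewalld.service
systemctl disable firewalld.service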
3. Install and Configure the MySQL Database
a) Unpack the MySQL installation bundle
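A sketch of the unpack step; the target directory /root/tools/mysql is a hypothetical choice:
mkdir -p /root/tools/mysql
tar -xvf mysql-5.7.19-1.el7.x86_64.rpm-bundle.tar -C /root/tools/mysql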

b) Install MySQL
yum remove mysql-libs
rpm -ivh mysql-community-common-5.7.19-1.el7.x86_64.rpm
rpm -ivh mysql-community-libs-5.7.19-1.el7.x86_64.rpm
rpm -ivh mysql-community-client-5.7.19-1.el7.x86_64.rpm
rpm -ivh mysql-community-server-5.7.19-1.el7.x86_64.rpm
rpm -ivh mysql-community-devel-5.7.19-1.el7.x86_64.rpm (optional)
c) Start MySQL
systemctl start mysqld.service
d) Look up and change the root user's password
Find the temporary root password: cat /var/log/mysqld.log | grep password
Log in and change the password: alter user 'root'@'localhost' identified by 'Welcome_1';
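A sketch of the login step, using the temporary password found in the log above:
mysql -uroot -p
# at the mysql> prompt, run the alter user statement shown above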
e) Create the hive database and the hiveowner user:
- Create a new database: create database hive;
- Create a new user:
create user 'hiveowner'@'%' identified by 'Welcome_1';
- Grant privileges to the user:
grant all on hive.* TO 'hiveowner'@'%';
grant all on hive.* TO 'hiveowner'@'localhost' identified by 'Welcome_1';
4. Install Hadoop (Pseudo-Distributed Mode as an Example)
Because Hive on Spark uses Spark on YARN by default, Hadoop has to be configured first.
a) Preparation:
- Configure the hostname (edit the /etc/hosts file)
- Configure passwordless SSH login (see the sketch after this list)
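A minimal sketch of both preparation steps, assuming the hostname hive77 used throughout this guide and a hypothetical IP address 192.168.157.77:
# /etc/hosts
192.168.157.77   hive77
# passwordless SSH login to the local host
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub root@hive77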
b) Edit the following Hadoop configuration files:
| File | Parameter | Value | Notes |
|---|---|---|---|
| hadoop-env.sh | JAVA_HOME | /root/training/jdk1.8.0_144 | |
| hdfs-site.xml | dfs.replication | 1 | Block replication factor; the default is 3 |
| hdfs-site.xml | dfs.permissions | false | Whether HDFS permission checking is enabled |
| core-site.xml | fs.defaultFS | hdfs://hive77:9000 | Address of the NameNode |
| core-site.xml | hadoop.tmp.dir | /root/training/hadoop-2.7.2/tmp/ | Directory where HDFS data is stored |
| mapred-site.xml | mapreduce.framework.name | yarn | |
| yarn-site.xml | yarn.resourcemanager.hostname | hive77 | |
| yarn-site.xml | yarn.nodemanager.aux-services | mapreduce_shuffle | |
| yarn-site.xml | yarn.resourcemanager.scheduler.class | org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler | Spark on YARN needs the fair scheduler so that all jobs on the YARN cluster receive an equal share of resources |
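As an illustration, the yarn-site.xml entries from the table above would look roughly like this (the property names and values are taken from the table; only the surrounding XML is added):
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hive77</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
</configuration>
The other files (hdfs-site.xml, core-site.xml, mapred-site.xml) use the same <property> format.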
c) Start Hadoop
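A sketch of the start-up commands, assuming HDFS has not been formatted yet and Hadoop's bin and sbin directories are on the PATH:
hdfs namenode -format      # only needed the first time
start-dfs.sh
start-yarn.sh
jps                        # NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager should be running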

d) Check in the YARN web console (by default at http://hive77:8088) that the fair scheduler is in effect

5. Build Spark from Source
(Maven is required; the Spark source package ships with its own copy of Maven.)
a) Run the command below to build Spark (the build takes a long time, so be patient):
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
b) A successful build produces spark-2.0.0-bin-hadoop2-without-hive.tgz.
c) Install and configure Spark
1. The directory structure after unpacking is as follows:
2. Add the following settings to spark-env.sh:
export JAVA_HOME=/root/training/jdk1.8.0_144
export HADOOP_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop
export YARN_CONF_DIR=/root/training/hadoop-2.7.2/etc/hadoop
export SPARK_MASTER_HOST=hive77
export SPARK_MASTER_PORT=7077
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_DRIVER_MEMORY=512m
export SPARK_WORKER_MEMORY=512m
3. Copy the Hadoop jars into Spark's jars directory:
cp ~/training/hadoop-2.7.2/share/hadoop/common/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/common/lib/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/hdfs/lib/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/mapreduce/lib/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/yarn/*.jar jars/
cp ~/training/hadoop-2.7.2/share/hadoop/yarn/lib/*.jar jars/
4. Create a directory named /spark-jars on HDFS and upload Spark's jars to it, so the jars do not have to be shipped to the cluster every time an application runs (see the note after the commands below):
- hdfs dfs -mkdir /spark-jars
- hdfs dfs -put jars/*.jar /spark-jars
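For the uploaded jars to actually be used, Spark has to be told where they are. One common way to do this (not shown in the original steps) is the spark.yarn.jars property in conf/spark-defaults.conf; a sketch assuming the NameNode address hdfs://hive77:9000 configured earlier:
spark.yarn.jars  hdfs://hive77:9000/spark-jars/*.jar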
d) Start Spark with sbin/start-all.sh and verify that Spark is configured correctly.
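One way to verify, as a sketch: check the daemons with jps and run the bundled SparkPi example (the exact examples jar name may differ in this custom build):
jps     # should show Master and Worker
bin/spark-submit --master spark://hive77:7077 --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.0.0.jar 10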
6. Install and Configure Hive
a) Unpack the Hive installation package and copy the MySQL JDBC driver (mysql-connector-java-5.1.43-bin.jar) into Hive's lib directory.
b) Set Hive's environment variables:
HIVE_HOME=/root/training/apache-hive-2.3.2-bin
export HIVE_HOME
PATH=$HIVE_HOME/bin:$PATH
export PATH
c) Copy the following Spark jars into Hive's lib directory (see the sketch after this list):
- scala-library
- spark-core
- spark-network-common
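A sketch of the copy commands, assuming the jar names produced by the Spark 2.0.0 build above (the exact version suffixes may differ):
cd /root/training/spark-2.0.0-bin-hadoop2-without-hive/jars
cp scala-library-2.11.8.jar $HIVE_HOME/lib/
cp spark-core_2.11-2.0.0.jar $HIVE_HOME/lib/
cp spark-network-common_2.11-2.0.0.jar $HIVE_HOME/lib/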
d) Create a directory /sparkeventlog on HDFS to hold the Spark event logs:
hdfs dfs -mkdir /sparkeventlog
e) Configure hive-site.xml as follows:
| Parameter | Reference value |
|---|---|
| javax.jdo.option.ConnectionURL | jdbc:mysql://localhost:3306/hive?useSSL=false |
| javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver |
| javax.jdo.option.ConnectionUserName | hiveowner |
| javax.jdo.option.ConnectionPassword | Welcome_1 |
| hive.execution.engine | spark |
| hive.enable.spark.execution.engine | true |
| spark.home | /root/training/spark-2.0.0-bin-hadoop2-without-hive |
| spark.master | yarn-client |
| spark.eventLog.enabled | true |
| spark.eventLog.dir | hdfs://hive77:9000/sparkeventlog |
| spark.serializer | org.apache.spark.serializer.KryoSerializer |
| spark.executor.memory | 512m |
| spark.driver.memory | 512m |
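In hive-site.xml, each parameter above is written as a <property> element; a sketch showing the first entry and the execution-engine entry (the remaining parameters follow the same pattern):
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?useSSL=false</value>
    </property>
    <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
    </property>
</configuration>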
f) Initialize the MySQL metastore: schematool -dbType mysql -initSchema
g) Start the Hive shell and create an employee table to hold the employee data (an example definition is sketched below).
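The original table definition is not shown; the DDL below is a hypothetical sketch that matches a typical emp.csv layout and the sal column used in the query further down:
create table emp1
(empno int, ename string, job string, mgr int, hiredate string, sal int, comm int, deptno int)
row format delimited fields terminated by ',';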
h) Load the emp.csv file:
load data local inpath '/root/temp/emp.csv' into table emp1;
i) Run a query that sorts employees by salary (it fails):
select * from emp1 order by sal;
j) Check the YARN web console
The error is caused by the way YARN accounts for virtual memory. In yarn-site.xml, set yarn.nodemanager.vmem-check-enabled to false to disable the virtual-memory check:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
k) Restart Hadoop, Spark, and Hive, then run the query again.
One final note: because Spark on YARN is configured, the standalone Spark cluster does not need to be started when running Hive; the resources are managed by YARN.