Download
wget https://mirrors.bfsu.edu.cn/apache/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
Extract
tar -vxf spark-3.1.1-bin-hadoop2.7.tgz -C /opt/module/
Create the configuration files from the templates (run in Spark's conf directory)
cp spark-env.sh.template spark-env.sh
cp workers.template workers
Edit the configuration files
[datalink@slave3 conf]$ vim spark-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_131
export HADOOP_HOME=/opt/module/hadoop-3.1.4
export SCALA_HOME=/opt/module/scala-2.12.13
export HADOOP_CONF_DIR=/opt/module/hadoop-3.1.4/etc/hadoop
export SPARK_MASTER_HOST=slave2
export SPARK_EXECUTOR_MEMORY=1G
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_PORT=7078
export SPARK_MASTER_PORT=7077
[datalink@slave3 conf]$ vim workers
slave1
slave3
slave4
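Copy the startup scripts under new names (Hadoop's sbin directory also ships start-all.sh and stop-all.sh, so distinct names avoid confusion)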
[datalink@slave3 sbin]$ cp start-all.sh start-spark-all.sh
[datalink@slave3 sbin]$ cp stop-all.sh stop-spark-all.sh
Distribute to the other servers
scp -r spark-3.1.1-bin-hadoop2.7/ datalink@slave2:/opt/module/
scp -r spark-3.1.1-bin-hadoop2.7/ datalink@slave1:/opt/module/
scp -r spark-3.1.1-bin-hadoop2.7/ datalink@slave4:/opt/module/
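The same distribution step can be written as a loop, which makes it harder to repeat or skip a host by accident. This is only a sketch: the host list mirrors the scp targets above, and the names HOSTS, SPARK_DIR, DEST, DRY_RUN and distribute_spark are made up for illustration. With DRY_RUN=1 it only prints the commands it would run.

```shell
# Hypothetical helper; HOSTS, SPARK_DIR, DEST and DRY_RUN are illustrative names.
HOSTS="slave1 slave2 slave4"          # targets from the scp commands above
SPARK_DIR="spark-3.1.1-bin-hadoop2.7" # directory to copy, relative to /opt/module
DEST="/opt/module/"

# Print (DRY_RUN=1) or run one scp per target host; stop at the first failure.
distribute_spark() {
  for host in $HOSTS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "scp -r ${SPARK_DIR}/ datalink@${host}:${DEST}"
    else
      scp -r "${SPARK_DIR}/" "datalink@${host}:${DEST}" || return 1
    fi
  done
}

# Preview the commands without copying anything.
DRY_RUN=1 distribute_spark
```

Run distribute_spark without DRY_RUN from /opt/module on the host that holds the edited configuration to perform the actual copy.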
Start the cluster
[datalink@slave2 sbin]$ ./start-spark-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/module/spark-3.1.1-bin-hadoop2.7/logs/spark-datalink-org.apache.spark.deploy.master.Master-1-slave2.out
slave4: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark-3.1.1-bin-hadoop2.7/logs/spark-datalink-org.apache.spark.deploy.worker.Worker-1-slave4.out
slave3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark-3.1.1-bin-hadoop2.7/logs/spark-datalink-org.apache.spark.deploy.worker.Worker-1-slave3.out
slave1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark-3.1.1-bin-hadoop2.7/logs/spark-datalink-org.apache.spark.deploy.worker.Worker-1-slave1.out
[datalink@slave2 sbin]$
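The log lines above only say that the daemons were launched. One way to confirm that they are still running is to look for them in the output of the JDK's jps tool, which prints one "pid class-name" pair per line. The helper below is a sketch; the function name has_daemon is hypothetical.

```shell
# Hypothetical helper: succeed if the given class name appears as the
# second column of jps-style output read from standard input.
has_daemon() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage on the master host (slave2 in this setup):
#   jps | has_daemon Master && echo "Master is running"
# Usage on each worker host:
#   jps | has_daemon Worker && echo "Worker is running"
```

The standalone Master also serves a web UI, by default on port 8080, that lists the registered workers.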
Integrating Spark SQL with Hive
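The main purpose of Spark SQL is to let users run SQL on Spark. Its data sources can be RDDs or external sources such as text files, Hive, or JSON. One branch of Spark SQL is Spark on Hive, which keeps Hive's HQL syntax and table metadata; to a first approximation, only the physical execution changes, from MapReduce jobs to Spark jobs. (In Spark 3.x, parsing and plan optimization are done by Spark's own Catalyst engine rather than by Hive.) Integrating Spark SQL with Hive therefore means reading the metadata of the Hive tables and then operating on the data through Spark SQL.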
Copy hive-site.xml from Hive's conf directory into Spark's conf directory, so that Spark can use it to locate the Hive metastore and the data storage location.
[datalink@slave3 conf]$ cp hive-site.xml /opt/module/spark-3.1.1-bin-hadoop2.7/conf/
If the Hive metadata is stored in MySQL, the MySQL JDBC driver must also be made available to Spark:
[datalink@slave3 module]$ cp mysql-connector-java-8.0.23.jar /opt/module/spark-3.1.1-bin-hadoop2.7/jars/
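Before testing, it can help to check that both files from the two steps above are in place under the Spark installation. The sketch below assumes the driver jar keeps a name beginning with mysql-connector-java-, as in the cp command above; the function name check_hive_files is hypothetical.

```shell
# Hypothetical helper: report whether hive-site.xml and a MySQL JDBC driver
# jar exist under the given Spark home; return non-zero if either is missing.
check_hive_files() {
  spark_home="$1"
  status=0
  if [ -f "${spark_home}/conf/hive-site.xml" ]; then
    echo "ok: conf/hive-site.xml"
  else
    echo "missing: conf/hive-site.xml"
    status=1
  fi
  # Assumes the jar name starts with mysql-connector-java-
  if ls "${spark_home}"/jars/mysql-connector-java-*.jar >/dev/null 2>&1; then
    echo "ok: MySQL JDBC driver jar"
  else
    echo "missing: MySQL JDBC driver jar"
    status=1
  fi
  return "$status"
}

# Usage: check_hive_files /opt/module/spark-3.1.1-bin-hadoop2.7
```

Because spark-sql is launched from slave2 in the test that follows, the check is most useful on that host.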
Test whether the Spark SQL and Hive integration works
[datalink@slave2 bin]$ ./spark-sql --master spark://slave2:7077 --executor-memory 1g --total-executor-cores 4
.....................
spark-sql (default)> show databases;
namespace
default
testdb
zqgamedb
Time taken: 3.768 seconds, Fetched 3 row(s)
Spark SQL compared with Hive on the same query. In this single run Spark SQL took 1.798 s and Hive took 46.151 s:
spark-sql (default)> select count(1) from fact_login;
count(1)
8529410
Time taken: 1.798 seconds, Fetched 1 row(s)
hive (zqgamedb)> select count(1) from fact_login;
...........................
8529410
Time taken: 46.151 seconds, Fetched: 1 row(s)
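To put a number on the gap, the ratio of the two "Time taken" values above can be computed directly. This is a small sketch; the function name speedup is hypothetical. The result describes one run of one query only, and the Hive time includes the cost of launching its job, so it is not a general benchmark.

```shell
# Hypothetical helper: ratio of the slower time to the faster time, in seconds.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# Hive time, then Spark SQL time, both taken from the output above.
speedup 46.151 1.798   # prints 25.7
```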