SparkSQL Data Source: Hive Database

This article is a repost (2020-06-30 22:36). Tag: Spark

Author: Yin Zhengjie (尹正傑)

I. Using Hive

1>. Embedded Hive

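Apache Hive is the SQL engine on Hadoop. Spark SQL can be compiled with or without Hive support. A Spark SQL build that includes Hive support can access Hive tables, use UDFs (user-defined functions), and run the Hive query language (HiveQL/HQL).

Note that including the Hive libraries in Spark SQL does not require Hive to be installed beforehand. In general it is best to compile Spark SQL with Hive support so that these features are available. If you downloaded a binary distribution of Spark, it should already have been built with Hive support.

To connect Spark SQL to an existing Hive deployment, copy hive-site.xml into Spark's configuration directory ($SPARK_HOME/conf). Spark SQL also runs without any Hive deployment.

If no Hive deployment is configured, Spark SQL creates its own Hive metastore, named metastore_db, in the current working directory.

In addition, tables created with the HiveQL statement CREATE TABLE (as opposed to CREATE EXTERNAL TABLE) are stored in the warehouse directory on your default filesystem. If a configured hdfs-site.xml is on the classpath, the default filesystem is HDFS; otherwise it is the local filesystem. Hive's traditional warehouse directory is /user/hive/warehouse. Since Spark 2.0 the location is controlled by spark.sql.warehouse.dir, which defaults to a spark-warehouse directory under the current working directory; this is why the session below reports file:/root/spark-warehouse/test.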
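The difference between a managed table (plain CREATE TABLE) and an external one can be read back from the catalog. The following is a minimal sketch, assuming a Spark build with Hive support on the classpath. The object name, the table names `managed_demo` and `external_demo`, and the scratch directory are hypothetical, chosen only for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ManagedVsExternal {
  // Creates one managed and one external table and returns their catalog types.
  def tableTypes(spark: SparkSession): (String, String) = {
    // Hypothetical scratch directory that holds the external table's data.
    val extDir = Files.createTempDirectory("external_demo").toUri.toString

    spark.sql("DROP TABLE IF EXISTS managed_demo")
    spark.sql("DROP TABLE IF EXISTS external_demo")

    // Managed table: the data lives under the warehouse directory.
    spark.sql("CREATE TABLE managed_demo(id INT)")
    // External table: the metastore only records the given location.
    spark.sql(s"CREATE EXTERNAL TABLE external_demo(id INT) LOCATION '$extDir'")

    (spark.catalog.getTable("managed_demo").tableType,
     spark.catalog.getTable("external_demo").tableType)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ManagedVsExternal")
      .master("local[*]") // assumption: local run for illustration
      .enableHiveSupport()
      .getOrCreate()
    println(tableTypes(spark))
    spark.stop()
  }
}
```

With Hive support enabled, the first element should report a managed table and the second an external one. Dropping a managed table also deletes its data, whereas dropping an external table leaves the files at its location in place.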

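To use the embedded Hive, nothing needs to be set up; it works out of the box. You can also specify the warehouse location on first launch by passing: --conf spark.sql.warehouse.dir=hdfs://hadoop101.yinzhengjie.org.cn:9000/spark-wearhouse

Tip:
    With the embedded Hive, since Spark 2.0 spark.sql.warehouse.dir specifies the warehouse location. If you need an HDFS path, add core-site.xml and hdfs-site.xml to Spark's conf directory. Otherwise the warehouse directory is created only on the master node, and queries fail with file-not-found errors. If that has already happened and you want to switch to HDFS, delete the metastore and restart the cluster.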
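To confirm which warehouse location and default filesystem a session actually picked up, both values can be read back at runtime. A minimal sketch, assuming Hive support is available; the object name `WarehouseCheck` and the `local[*]` master are illustrative choices, not part of the original setup.

```scala
import org.apache.spark.sql.SparkSession

object WarehouseCheck {
  // Returns (warehouse directory, default filesystem) as seen by the session.
  def settings(spark: SparkSession): (String, String) = {
    val warehouse = spark.conf.get("spark.sql.warehouse.dir")
    // fs.defaultFS is taken from core-site.xml when that file is on the classpath.
    val defaultFs = spark.sparkContext.hadoopConfiguration.get("fs.defaultFS")
    (warehouse, defaultFs)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WarehouseCheck")
      .master("local[*]") // assumption: local run for illustration
      .enableHiveSupport()
      .getOrCreate()
    val (warehouse, defaultFs) = settings(spark)
    println(s"warehouse = $warehouse, default filesystem = $defaultFs")
    spark.stop()
  }
}
```

If the warehouse path starts with file: although you expected HDFS, the Hadoop configuration files were most likely not picked up from Spark's conf directory.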

[root@hadoop105.yinzhengjie.org.cn ~]# vim /tmp/id.txt        # create the test data
[root@hadoop105.yinzhengjie.org.cn ~]# 
[root@hadoop105.yinzhengjie.org.cn ~]# cat /tmp/id.txt
100
200
3
400
500
[root@hadoop105.yinzhengjie.org.cn ~]#

scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+


scala> spark.sql("create table test(id int)")
20/07/15 04:10:36 WARN HiveMetaStore: Location: file:/root/spark-warehouse/test specified for non-external table:test
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|     test|      false|
+--------+---------+-----------+


scala> spark.sql("load data local inpath '/tmp/id.txt' into table test")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test").show
+---+
| id|
+---+
|100|
|200|
|  3|
|400|
|500|
+---+


scala>

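2>. External Hive

To connect to an external Hive that is already deployed, follow these steps.
    (1) Copy or symlink Hive's hive-site.xml into the conf directory of the Spark installation.
    (2) Start spark-shell and include the JDBC driver for the Hive metastore database, as shown below. If the driver jar is already in the jars directory of the Spark installation, the "--jars" option can be omitted.
        [root@hadoop105.yinzhengjie.org.cn ~]# spark-shell  --jars mysql-connector-java-5.1.36-bin.jar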
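Once the shell is up, a quick check helps confirm that the session talks to the external metastore rather than a freshly created local one. This is a sketch that assumes it is pasted into spark-shell, where `spark` is predefined; the object name `CatalogCheck` and the helper `describeCatalog` are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

object CatalogCheck {
  // Returns the catalog implementation ("hive" or "in-memory")
  // together with the names of all databases visible to the session.
  def describeCatalog(spark: SparkSession): (String, Seq[String]) = {
    val impl = spark.conf.get("spark.sql.catalogImplementation")
    val dbs = spark.catalog.listDatabases().collect().map(_.name).toSeq
    (impl, dbs)
  }
}

// In spark-shell, where `spark` already exists:
//   CatalogCheck.describeCatalog(spark)
```

If hive-site.xml was picked up, the database list should match what the Hive deployment itself shows. If only default appears and a new metastore_db directory has shown up in the working directory, the session is probably still using the embedded metastore.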

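II. Running the Spark SQL CLI

The Spark SQL CLI is a convenient way to run the Hive metastore service locally and to execute queries from the command line. The result is equivalent to running the same SQL statements through spark.sql("...") in spark-shell.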

[root@hadoop105.yinzhengjie.org.cn ~]# spark-sql 
20/07/15 04:28:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/15 04:28:35 INFO SparkContext: Running Spark version 2.4.6
20/07/15 04:28:35 INFO SparkContext: Submitted application: SparkSQL::172.200.4.105
20/07/15 04:28:35 INFO SecurityManager: Changing view acls to: root
20/07/15 04:28:35 INFO SecurityManager: Changing modify acls to: root
20/07/15 04:28:35 INFO SecurityManager: Changing view acls groups to: 
20/07/15 04:28:35 INFO SecurityManager: Changing modify acls groups to: 
20/07/15 04:28:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/07/15 04:28:35 INFO Utils: Successfully started service 'sparkDriver' on port 33260.
20/07/15 04:28:35 INFO SparkEnv: Registering MapOutputTracker
20/07/15 04:28:35 INFO SparkEnv: Registering BlockManagerMaster
20/07/15 04:28:35 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/15 04:28:35 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/15 04:28:35 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ae5de0c5-5282-4cc6-8ce6-d1dbe34e82e9
20/07/15 04:28:35 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/15 04:28:35 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/15 04:28:36 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/15 04:28:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hadoop105.yinzhengjie.org.cn:4040
20/07/15 04:28:36 INFO Executor: Starting executor ID driver on host localhost
20/07/15 04:28:36 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 29320.
20/07/15 04:28:36 INFO NettyBlockTransferService: Server created on hadoop105.yinzhengjie.org.cn:29320
20/07/15 04:28:36 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/15 04:28:36 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hadoop105.yinzhengjie.org.cn, 29320, None)
20/07/15 04:28:36 INFO BlockManagerMasterEndpoint: Registering block manager hadoop105.yinzhengjie.org.cn:29320 with 366.3 MB RAM, BlockManagerId(driver, hadoop105.yinzhengjie.org.cn, 29320, None)
20/07/15 04:28:36 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hadoop105.yinzhengjie.org.cn, 29320, None)
20/07/15 04:28:36 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, hadoop105.yinzhengjie.org.cn, 29320, None)
20/07/15 04:28:36 INFO EventLoggingListener: Logging events to hdfs://hadoop101.yinzhengjie.org.cn:9000/yinzhengjie/spark/jobhistory/local-1594758516077
20/07/15 04:28:36 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/root/spark-warehouse/').
20/07/15 04:28:36 INFO SharedState: Warehouse path is 'file:/root/spark-warehouse/'.
20/07/15 04:28:37 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/07/15 04:28:37 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/07/15 04:28:37 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is file:/root/spark-warehouse/
20/07/15 04:28:37 INFO metastore: Mestastore configuration hive.metastore.warehouse.dir changed from /user/hive/warehouse to file:/root/spark-warehouse/
20/07/15 04:28:37 INFO HiveMetaStore: 0: Shutting down the object store...
20/07/15 04:28:37 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=Shutting down the object store...    
20/07/15 04:28:37 INFO HiveMetaStore: 0: Metastore shutdown complete.
20/07/15 04:28:37 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=Metastore shutdown complete.    
20/07/15 04:28:37 INFO HiveMetaStore: 0: get_database: default
20/07/15 04:28:37 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_database: default    
20/07/15 04:28:37 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
20/07/15 04:28:37 INFO ObjectStore: ObjectStore, initialize called
20/07/15 04:28:37 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
20/07/15 04:28:37 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
20/07/15 04:28:37 INFO ObjectStore: Initialized ObjectStore
Spark master: local[*], Application Id: local-1594758516077
20/07/15 04:28:37 INFO SparkSQLCLIDriver: Spark master: local[*], Application Id: local-1594758516077
spark-sql> show tables;              # list the existing tables
20/07/15 04:29:19 INFO HiveMetaStore: 0: get_database: global_temp
20/07/15 04:29:19 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_database: global_temp    
20/07/15 04:29:19 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
20/07/15 04:29:19 INFO HiveMetaStore: 0: get_database: default
20/07/15 04:29:19 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_database: default    
20/07/15 04:29:19 INFO HiveMetaStore: 0: get_database: default
20/07/15 04:29:19 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_database: default    
20/07/15 04:29:19 INFO HiveMetaStore: 0: get_tables: db=default pat=*
20/07/15 04:29:19 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_tables: db=default pat=*    
20/07/15 04:29:19 INFO CodeGenerator: Code generated in 184.459346 ms
default    test    false              # only one table exists so far
Time taken: 1.518 seconds, Fetched 1 row(s)
20/07/15 04:29:19 INFO SparkSQLCLIDriver: Time taken: 1.518 seconds, Fetched 1 row(s)
spark-sql>

III. Using Hive in Code

1>. Add the dependencies

The spark-hive version should match the Spark version you run against (the sessions above ran Spark 2.4.6); the versions below are those given in the original example.

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>

2>. Enable Hive support when creating the SparkSession

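    import java.io.File
    import org.apache.spark.sql.SparkSession

    // Absolute path of the warehouse directory used by the embedded Hive.
    val warehouseLocation: String = new File("spark-warehouse").getAbsolutePath

    /**
      *   When using an external Hive, hive-site.xml must be on the classpath.
      */
    val spark = SparkSession
      .builder()
      .appName("Spark Hive Example")
      // The embedded Hive needs a warehouse location; with an external Hive this can be omitted.
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()    // enable Hive support
      .getOrCreate()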
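
Putting the pieces together, the spark-shell session shown earlier can be reproduced from a standalone program. This is a sketch under assumptions: it runs with a local master against the embedded metastore, writes the same sample IDs to a temporary file instead of /tmp/id.txt, and uses a hypothetical table name `test_ids`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object HiveRoundTrip {
  // Creates a Hive table, loads a local text file into it, and returns the ids.
  def loadIds(spark: SparkSession): Seq[Int] = {
    // Same sample values as the test file used earlier, written to a temp file.
    val data = Files.createTempFile("id", ".txt")
    Files.write(data, "100\n200\n3\n400\n500\n".getBytes(StandardCharsets.UTF_8))

    // Start from a clean table so repeated runs do not append duplicates.
    spark.sql("DROP TABLE IF EXISTS test_ids")
    spark.sql("CREATE TABLE test_ids(id INT)")
    spark.sql(s"LOAD DATA LOCAL INPATH '${data.toAbsolutePath}' INTO TABLE test_ids")

    spark.sql("SELECT id FROM test_ids ORDER BY id")
      .collect()
      .map(_.getInt(0))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveRoundTrip")
      .master("local[*]") // assumption: local run for illustration
      .enableHiveSupport()
      .getOrCreate()
    println(loadIds(spark).mkString(", "))
    spark.stop()
  }
}
```

The query sorts by id, so the values come back in ascending order rather than in file order.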
