1. Hive overview
The logo is a creature with a bee's body and an elephant's head; quite cute.
Hive is a data warehouse tool for processing structured data in Hadoop. It is built on top of Hadoop, belongs to the big-data stack, and makes querying and analysis convenient. It provides a simple SQL-like query language and translates SQL statements into MapReduce tasks for execution.
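In other words, the user writes plain SQL and Hive compiles it into MapReduce behind the scenes. A minimal sketch (the logs table and its status column are hypothetical):

hive> SELECT status, COUNT(*) AS cnt
    > FROM logs
    > GROUP BY status;

The GROUP BY maps naturally onto a MapReduce job: mappers emit (status, 1) pairs and reducers sum the counts per key.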
The term "big data" refers to collections of large datasets characterized by huge volume, high velocity, and an ever-growing variety of data. Traditional data-management systems have difficulty processing data at that scale, so the Apache Software Foundation introduced Hadoop, a framework for solving big-data management and processing problems.
Hive originated at Facebook (the American social networking service). Facebook had huge amounts of data, and Hadoop, an open-source MapReduce implementation, could process it with ease. But while MapReduce programs are easy enough for Java programmers to write, they are inconvenient for users of other languages. Facebook therefore began developing Hive, which makes it possible to query Hadoop with SQL (the SQL is translated into MapReduce under the hood), so non-Java programmers can use Hadoop more conveniently. Hive's original purpose was precisely to analyze and process massive volumes of logs.
Official site: http://hive.apache.org/
2. Hive architecture
As the diagram above shows, Hadoop and MapReduce are the foundation of the Hive architecture. The Hive architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, WEB GUI, Metastore, and Driver (Compiler, Optimizer, and Executor). These components can be divided into two broad categories: server-side components and client-side components.
2.1 Server-side components:
Driver component: comprises the Compiler, Optimizer, and Executor. It parses, compiles, and optimizes the HiveQL (SQL-like) statements we write, generates an execution plan, and then invokes the underlying MapReduce framework to run it.
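You can inspect the plan the Driver produces with EXPLAIN, which prints the stage graph without executing the query. A small sketch (the users table matches the one created later in this article):

hive> EXPLAIN SELECT name FROM users WHERE id = 1;

The output lists the plan's stages and operators (filter, select, and so on) that the Executor would hand to the execution engine.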
Metastore component: the metadata service. It stores Hive's metadata in a relational database; the databases Hive supports for this are Derby and MySQL. Because the metadata is so important to Hive, the metastore service can be split out and installed on a remote server cluster, decoupling the Hive service from the metastore service and keeping Hive robust.
Thrift service: Thrift is a software framework developed by Facebook for building scalable, cross-language services. Hive integrates it so that programs written in different languages can call Hive's interfaces.
2.2 Client-side components:
CLI: command line interface.
Thrift client: the architecture diagram above does not show the Thrift client explicitly, but many of Hive's client interfaces, including JDBC and ODBC, are built on top of it (see the Beeline example after this list).
WEB GUI: a web-based way to access the services Hive provides. It corresponds to Hive's HWI component (Hive Web Interface), and the HWI service must be started before use.
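As a sketch of the Thrift-based client path, Beeline (shipped with Hive) speaks JDBC to HiveServer2, which exposes Hive over Thrift. Assuming the server runs on node02 and HiveServer2 listens on its default port 10000:

[root@node02 ~]# nohup hive --service hiveserver2 &
[root@node04 ~]# beeline -u jdbc:hive2://node02:10000 -n root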
A closer look at the metastore:
Hive's metastore component is the central store for Hive's metadata. It consists of two parts: the metastore service and the backing data store. The backing store is a relational database, for example Hive's default embedded on-disk database Derby, or MySQL. The metastore service sits on top of that backing store and is the component the Hive service talks to. By default the metastore service is installed together with the Hive service and runs in the same process. The metastore service can also be split out of the Hive service and installed in its own cluster, with Hive calling it remotely. That way the metadata layer can be placed behind a firewall: clients talk to the Hive service, and the Hive service reaches the metadata layer for them, which gives better manageability and security. Running the metastore service and the Hive service in different processes also improves Hive's stability and the efficiency of the Hive service.
2.3 Hive's execution flow in detail:
Simply put, the SQL (HiveQL) is interpreted, compiled, and optimized by Hive into a query plan, and in the normal case that plan is turned into MapReduce tasks for execution.
The metadata for tables created in Hive lives in a relational database (either the bundled Derby database or a database the user installs), while the table contents live in HDFS. After the user submits an SQL statement it is compiled, matched against the corresponding templates in the template library and assembled, and finally handed to YARN to run. A diagram explaining how YARN executes a MapReduce task is attached at the end.
3. Hive installation
Hive has three deployment modes: embedded mode, local (single-user) mode, and remote (multi-user) mode.
3.1 Embedded mode
Embedded mode: uses the built-in Derby database (a single session connection at a time; commonly used for simple testing). Derby here runs embedded inside the Hive process.
Install it as follows:
1. Download Hive
(Before downloading, be sure to check the Hadoop/Hive version compatibility table on the official site, http://hive.apache.org/downloads.html, and pick the release that matches your setup.) Download from: http://mirror.bit.edu.cn/apache/hive/
My Hadoop version is 3.1.1, so the Hive I chose is also 3.1.1.
2. Configure
Unpack the archive:
[root@node01 software]# tar xf apache-hive-3.1.1-bin.tar.gz -C /opt/
Append to /etc/profile:
export HIVE_HOME=/opt/apache-hive-3.1.1-bin
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$HIVE_HOME/bin
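After editing /etc/profile, reload it and check that the hive command resolves (a quick sanity check):

[root@node01 software]# source /etc/profile
[root@node01 software]# hive --version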
hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
  </property>
</configuration>
Append two lines to the bottom of hive-env.sh:
HADOOP_HOME=/opt/hadoop-3.1.1
HIVE_CONF_DIR=/opt/apache-hive-3.1.1-bin/conf
3. Start and verify
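A minimal sketch of this step, assuming HDFS and YARN are already running (on Hive 3.x the Derby schema normally has to be initialized with schematool before the first start):

[root@node01 ~]# schematool -dbType derby -initSchema
[root@node01 ~]# hive
hive> show tables;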
4. Notes
Note: with the Derby storage mode, running hive creates a derby.log file and a metastore_db directory in the current working directory. The drawback of this mode is that only one Hive client at a time can use the database from a given directory; a second client is greeted with an error like this:
hive> show tables;
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to start database 'metastore_db', see the next exception for details.
NestedThrowables:
java.sql.SQLException: Failed to start database 'metastore_db', see the next exception for details.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
3.2 Single-user (local) mode (MySQL)
In single-user mode the client and the server sit on the same node and connect to a database over the network. This is the most commonly used mode.
1. Install MySQL
Install it directly with yum (steps omitted).
Once the installation finishes, start the service, log in to MySQL, and grant privileges to the user, as sketched below.
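The grant typically looks like the following (a sketch for MySQL 5.7; the root user and the password hive match what is used later in this article, and '%' allows connections from any host):

mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'hive' WITH GRANT OPTION;
mysql> FLUSH PRIVILEGES;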
The embedded mode above was installed on node01; for single-user mode we install on node02.
2. Upload the Hive package and the MySQL connector driver jar to node02
[root@node02 software]# tar xf apache-hive-3.1.1-bin.tar.gz -C /opt/
3. Copy the connector driver jar into Hive's lib directory
[root@node02 ~]# cp /software/mysql-connector-java-5.1.47-bin.jar /opt/apache-hive-3.1.1-bin/lib/
4. Configure
mv hive-default.xml.template hive-site.xml
Edit hive-site.xml; the key properties are sketched below.
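The properties for this mode look roughly like this (a sketch: the connection URL and the warehouse path match values that appear later in this article; the driver class is the one shipped in mysql-connector-java 5.1.x; user name and password are assumptions matching the grant above):

<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive_remote/warehouse</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://node02/hive_remote?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>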
5. Initialize the metastore schema
[root@node02 ~]# schematool -dbType mysql -initSchema
Output ending with "Initialization script completed" ... indicates the initialization succeeded.
6. Launch hive
7. Some issues and their fixes
The following warning appears repeatedly:
Tue Jan 08 01:58:53 CST 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
The property before the change:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://node02/hive_remote?createDatabaseIfNotExist=true</value>
</property>
The property after the change (note that '&' must be escaped as '&amp;' inside the XML):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://node02/hive_remote?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
Another error you may encounter:
Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /tmp/hive/root/df1ae4e0-6cdc-4100-aec7-3cce1e75efa7. Name node is in safe mode.
Turn off safe mode on node01:
hdfs dfsadmin -safemode leave
3.3 Multi-user (remote) mode
The client (node04) and the server (node02) sit on different nodes, and the client connects remotely.
The steps on the client node04 are basically the same as on the server; the difference is that the client does not need to initialize the schema.
1. Configure environment variables (same as in single-user mode; omitted)
2. Copy the MySQL connector driver jar into Hive's lib directory
3. Install MySQL
4. Modify the configuration
The configuration on client node04 is as follows:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://node02:9083</value>
  </property>
</configuration>
5. Start the metastore service in the background on the server:
nohup hive --service metastore &
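To confirm the metastore came up, check that it is listening on port 9083, the port referenced by hive.metastore.uris in the client configuration (on older systems replace ss with netstat -nltp):

[root@node02 ~]# ss -nlt | grep 9083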
6. Run Hive operations on the client
[root@node04 conf]# hive
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/java/jdk1.8.0_191-amd64/bin:/opt/hadoop-3.1.1/bin:/opt/hadoop-3.1.1/sbin:/opt/zookeeper-3.4.10/bin:/opt/apache-hive-3.1.1-bin//bin:/root/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.1.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 6f032213-071d-4b09-81d1-8e6020efd285

Logging initialized using configuration in jar:file:/opt/apache-hive-3.1.1-bin/lib/hive-common-3.1.1.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 6c91ed66-2839-4d1e-b5a5-4c7cc33c20b1
hive> show tables;
OK
test02
Time taken: 4.764 seconds, Fetched: 1 row(s)
hive> create table users(id int,name string);
OK
Time taken: 5.851 seconds
hive> insert into users values(1,'benjamin');
Query ID = root_20190108055650_32bc4f70-0564-4378-b459-15d13f2e7625
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1546878395750_0002, Tracking URL = http://node03:8088/proxy/application_1546878395750_0002/
Kill Command = /opt/hadoop-3.1.1/bin/mapred job -kill job_1546878395750_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-01-08 05:59:01,834 Stage-1 map = 0%, reduce = 0%
2019-01-08 05:59:22,068 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.99 sec
2019-01-08 06:00:38,929 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.99 sec
2019-01-08 06:01:48,323 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.99 sec
2019-01-08 06:02:51,343 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.99 sec
2019-01-08 06:03:52,405 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.99 sec
2019-01-08 06:04:10,893 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 4.04 sec
2019-01-08 06:04:14,173 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.63 sec
MapReduce Total cumulative CPU time: 4 seconds 630 msec
Ended Job = job_1546878395750_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://mycluster/user/hive_remote/warehouse/users/.hive-staging_hive_2019-01-08_05-56-50_443_4859297556177527153-1/-ext-10000
Loading data to table default.users
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.63 sec   HDFS Read: 15225 HDFS Write: 243 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 630 msec
OK
Time taken: 450.721 seconds
hive>
The insert above shows that a Hive operation is ultimately executed as a MapReduce job, which confirms what was said earlier.
7. Inspect the database on the server
[root@node02 ~]# mysql -uroot -phive
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 221
Server version: 5.7.24 MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use hive_remote;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from TBLS;
+--------+-------------+-------+------------------+-------+------------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | OWNER_TYPE | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
+--------+-------------+-------+------------------+-------+------------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
|      2 | 1546898176  |     1 |                0 | root  | USER       |         0 |     2 | users    | MANAGED_TABLE | NULL               | NULL               |                    |
+--------+-------------+-------+------------------+-------+------------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
1 row in set (0.00 sec)

mysql>
What the server-side MySQL stores is the metadata; the table data itself lives in HDFS, as shown below.
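To see the other half of that split, list the table's data files in HDFS; the path matches the warehouse directory shown in the job output above (the default field delimiter in the file is the non-printing character \001):

[root@node04 ~]# hdfs dfs -ls /user/hive_remote/warehouse/users
[root@node04 ~]# hdfs dfs -cat /user/hive_remote/warehouse/users/*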