The main contents of this post:
. Common Hive syntax
. Internal (managed) tables
. External tables
. What happens when an internal table is dropped?
. What happens when an external table is dropped?
. Where are internal and external tables stored?
. Creating temporary tables to store intermediate results
. Appending intermediate result data to temporary tables
. Partitioned tables (both internal and external partitioned tables)
. Hive's structure and internals
. Hive's architecture and design
Using Hive
To use Hive on a Hadoop cluster: start the Hadoop cluster first, then start the MySQL service, and then start Hive.
1. In the Hadoop installation directory, run sbin/start-all.sh.
2. From any path, run service mysql start (CentOS) or sudo /etc/init.d/mysql start (Ubuntu).
3. In bin under the Hive installation directory, run ./hive.
To use Hive on a Spark cluster: start the Hadoop cluster first, then the Spark cluster, then the MySQL service, and then start Hive.
1. In the Hadoop installation directory, run sbin/start-all.sh.
2. In the Spark installation directory, run sbin/start-all.sh.
3. From any path, run service mysql start (CentOS) or sudo /etc/init.d/mysql start (Ubuntu).
4. In bin under the Hive installation directory, run ./hive.
[hadoop@weekend110 bin]$ pwd
/home/hadoop/app/hive-0.12.0/bin
[hadoop@weekend110 bin]$ mysql -uhive -hweekend110 -phive
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 110
Server version: 5.1.73 Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| test               |
+--------------------+
4 rows in set (0.00 sec)
mysql> quit;
Bye
[hadoop@weekend110 bin]$
[hadoop@weekend110 bin]$ pwd
/home/hadoop/app/hive-0.12.0/bin
[hadoop@weekend110 bin]$ ./hive
16/10/10 22:36:25 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/10/10 22:36:25 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/10/10 22:36:25 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/10/10 22:36:25 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/10/10 22:36:25 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/10/10 22:36:25 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/10/10 22:36:25 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
Logging initialized using configuration in jar:file:/home/hadoop/app/hive-0.12.0/lib/hive-common-0.12.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/app/hadoop-2.4.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/app/hive-0.12.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive> SHOW DATABASES;
OK
default
hive
Time taken: 12.226 seconds, Fetched: 2 row(s)
hive> quit;
[hadoop@weekend110 bin]$
To sum up: compared with what Hive shows, the MySQL instance simply has its own extra databases (such as mysql itself).
For example:
CREATE TABLE page_view(
viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User'
)
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;   -- or STORED AS TEXTFILE
The reason is explained below, using some sample data:
0000101 iphone6pluse 64G 6888
0000102 xiaominote 64G 2388
CREATE TABLE t_order(id int,name string,rongliang string,price double)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
;
Now, let's try it out.
[hadoop@weekend110 bin]$ pwd
/home/hadoop/app/hive-0.12.0/bin
[hadoop@weekend110 bin]$ ./hive
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/10/10 10:16:38 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
Logging initialized using configuration in jar:file:/home/hadoop/app/hive-0.12.0/lib/hive-common-0.12.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/app/hadoop-2.4.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/app/hive-0.12.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive>
I ran into the following problem:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Reference: http://blog.163.com/songyalong1117@126/blog/static/1713918972014124481752/
(a practical collection of solutions to common Hive problems)
Open hive-site.xml in vi (press Esc first, then Shift, then . + /) and locate the following property:
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
<description>
Enforce metastore schema version consistency.
True: Verify that version information stored in metastore matches with one from Hive jars. Also disable automatic
schema migration attempt. Users are required to manully migrate schema after Hive upgrade which ensures
proper metastore schema migration. (Default)
False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
</description>
</property>
Change it to:
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description>
Enforce metastore schema version consistency.
True: Verify that version information stored in metastore matches with one from Hive jars. Also disable automatic
schema migration attempt. Users are required to manully migrate schema after Hive upgrade which ensures
proper metastore schema migration. (Default)
False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
</description>
</property>
Many people write the statement like this:
CREATE TABLE t_order(
id int,
name string,
rongliang string,
price double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
hive> CREATE TABLE t_order(id int,name string,rongliang string,price double)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> ;
OK
Time taken: 28.755 seconds
hive>
Test the connection first,
then make the actual connection.
Success!
Here I recommend a new piece of software as a starting point:
Downloading, installing, and using Navicat for MySQL
After that, you can move on to more advanced tools; see:
My personal recommendations for good MySQL client tools:
Downloading, installing, and using MySQL Workbench (a MySQL client tool)
Downloading, installing, and using MySQL Server-type MySQL client tools
The prerequisite is that Hive has been started.
Note: in step 1, after entering the information, do not click "OK"; switch directly to the "General" tab.
For step 2, check the user and password configured in hive-site.xml under your Hive installation directory; if you configured root there, use the root user in step 2.
After completing step 2, click "OK" at the end.
You can list the databases with show databases;. By default the only database is default.
hive> CREATE DATABASE hive;            -- create the hive database; the name just happens to be 'hive'
OK
Time taken: 1.856 seconds
hive> SHOW DATABASES;
OK
default
hive
Time taken: 0.16 seconds, Fetched: 2 row(s)
hive> use hive;                        -- switch to the hive database
OK
Time taken: 0.276 seconds
hive> CREATE TABLE t_order(id int,name string,rongliang string,price double)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> ;
OK
Time taken: 0.713 seconds
hive> SHOW TABLES;
OK
t_order
Time taken: 0.099 seconds, Fetched: 1 row(s)
hive>
Correspondingly, in the metastore database, the TBLS table (that is, TABLES) records the table names and related metadata.
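As a rough illustration (this assumes the Hive 0.12 metastore schema and the MySQL metastore database named hive shown earlier), you can peek at TBLS directly from the mysql client:

-- run inside the mysql client, against the metastore database
USE hive;
SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM TBLS;   -- TBL_TYPE distinguishes MANAGED_TABLE from EXTERNAL_TABLE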
OK, now let's load some data.
Create a new data file:
[hadoop@weekend110 ~]$ ls
app c.txt flowArea.jar jdk1.7.0_65 wc.jar
a.txt data flow.jar jdk-7u65-linux-i586.tar.gz words.log
blk_1073741856 download flowSort.jar Link to eclipse workspace
blk_1073741857 eclipse HTTP_20130313143750.dat qingshu.txt
b.txt eclipse-jee-luna-SR2-linux-gtk-x86_64.tar.gz ii.jar report.evt
[hadoop@weekend110 ~]$ mkdir hiveTestData
[hadoop@weekend110 ~]$ cd hiveTestData/
[hadoop@weekend110 hiveTestData]$ ls
[hadoop@weekend110 hiveTestData]$ vim XXX.data
0000101 iphone6pluse 64G 6888
0000102 xiaominote 64G 2388
0000103 iphone5s 64G 6888
0000104 mi4 64G 2388
0000105 mi3 64G 6388
0000106 meizu 64G 2388
0000107 huawei 64G 6888
0000108 zhongxing 64G 6888
The local file's path is:
[hadoop@weekend110 hiveTestData]$ pwd
/home/hadoop/hiveTestData
[hadoop@weekend110 hiveTestData]$ ls
XXX.data
[hadoop@weekend110 hiveTestData]$
[hadoop@weekend110 bin]$ ./hive
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/10/10 17:23:09 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
Logging initialized using configuration in jar:file:/home/hadoop/app/hive-0.12.0/lib/hive-common-0.12.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/app/hadoop-2.4.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/app/hive-0.12.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive> SHOW DATABASES;
OK
default
hive
Time taken: 15.031 seconds, Fetched: 2 row(s)
hive> use hive;
OK
Time taken: 0.109 seconds
hive> LOAD DATA LOCAL INPATH '/home/hadoop/hiveTestData/XXX.data' INTO TABLE t_order;
Copying data from file:/home/hadoop/hiveTestData/XXX.data
Copying file: file:/home/hadoop/hiveTestData/XXX.data
Failed with exception File /tmp/hive-hadoop/hive_2016-10-10_17-24-21_574_6921522331212372447-1/-ext-10000/XXX.data could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1441)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2702)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.CopyTask
hive>
http://blog.itpub.net/29050044/viewspace-2098563/
http://blog.sina.com.cn/s/blog_75353ff40102v0d3.html
http://jingyan.baidu.com/article/7082dc1c65a76be40a89bd09.html (the answer was finally found here)
The error was:
Failed with exception File /tmp/hive-hadoop/hive_2016-10-10_17-54-30_887_2531771020597467111-1/-ext-10000/XXX.data could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
Solution:
This kind of error is mainly caused by a DataNode problem: either the disk has run out of space or the DataNode server is down. Check the DataNode and restart Hadoop to resolve it.
In my case the error is still unresolved!
Here, t_order_wk corresponds to my t_order; only the table name differs.
Wow, how clear that is!
This single statement is so much more powerful than the equivalent MapReduce code! In fact, Hive is built on top of MapReduce; as a data warehouse it simply provides a more convenient interface.
Common Hive syntax
We have already seen how convenient Hive is to use: it translates SQL statements into MapReduce jobs.
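For instance, even a one-line aggregate like the following is compiled into a MapReduce job behind the scenes (t_order is the table created above):

-- this single statement replaces a hand-written MapReduce counting job
select count(*) from t_order;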
From this we can see that xxx.data was loaded into the Hive table with LOAD, i.e. from the local Linux filesystem into the Hive warehouse,
while yyy.data was placed into the table's directory with hadoop fs -put, i.e. from within HDFS into the Hive warehouse.
Either way, once a file lands under /user/hive/warehouse/t_order_wk, Hive can read it.
So: LOAD DATA LOCAL INPATH loads a file that sits on the local filesystem, i.e. in Linux.
LOAD DATA INPATH (without LOCAL) loads a file that is already in HDFS.
Note that when the file is already in HDFS (e.g. uuu.data), the load behaves like a cut-and-paste: the file is moved into the table's warehouse directory.
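Here is a minimal sketch of the two variants, using the t_order_wk table mentioned above; the local path is the one used earlier, while the HDFS path of uuu.data is only an assumed example:

-- LOCAL: copy a file from the Linux filesystem into the table's warehouse directory
LOAD DATA LOCAL INPATH '/home/hadoop/hiveTestData/XXX.data' INTO TABLE t_order_wk;
-- without LOCAL: the file is already in HDFS, so it is moved (cut) into the warehouse directory
LOAD DATA INPATH '/uuu.data' INTO TABLE t_order_wk;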
This creates a problem: if the file is produced by a business system that keeps reading it from a fixed, hard-coded path, moving the file will disrupt that system.
The way to solve this is to declare the table EXTERNAL; that is the benefit of external tables.
//external
CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
ip STRING,
country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/external/hive';
So let's now create an EXTERNAL table and associate jjj.data with it. Aha, now it finally makes sense.
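A rough sketch of the idea, assuming jjj.data has been copied into the table's LOCATION directory /external/hive: no LOAD statement is needed, because an external table reads whatever files sit under its LOCATION:

-- jjj.data placed under /external/hive is immediately visible to the external table
SELECT * FROM tab_ip_ext LIMIT 5;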
What happens when an internal (managed) table is dropped?
Not only the metadata: the files inside the managed table t_order_wk (xxx.data, yyy.data, zzz.data, jjj.data) are all deleted as well.
What happens when an external table is dropped?
Only the metadata is removed; the data stays in its user-defined location (here, the hive_ext directory under /).
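A minimal sketch of the contrast, using the two tables from this post:

-- managed table: the metadata AND the files under /user/hive/warehouse/t_order_wk are deleted
DROP TABLE t_order_wk;
-- external table: only the metadata is deleted; the files under its LOCATION stay in place
DROP TABLE tab_ip_ext;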
Where are internal and external tables stored? A managed table lives under the warehouse directory (by default /user/hive/warehouse/<table name>), while an external table lives at whatever LOCATION you specified.
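If in doubt, Hive itself can report where a table's data lives; a quick check on the tables above (before dropping them) looks like this:

-- the Location and Table Type fields show the HDFS directory and MANAGED_TABLE vs EXTERNAL_TABLE
DESCRIBE FORMATTED t_order_wk;
DESCRIBE FORMATTED tab_ip_ext;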
Creating temporary tables to store intermediate results
CTAS stands for CREATE TABLE AS SELECT.
// CTAS: create a temporary table to store intermediate results
CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip,country new_country
FROM tab_ip_ext
SORT BY new_id;
Appending intermediate result data to a temporary table
// insert ... select: write intermediate result data into another table
create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like
select * from tab_ip;
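Note that INSERT OVERWRITE replaces whatever is already in the target table; to genuinely append, a sketch like the following (same tables as above) would be used instead:

-- appends rows to tab_ip_like instead of replacing its contents
insert into table tab_ip_like
select * from tab_ip;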
This is not demonstrated here.
Partitioned tables
//PARTITION
create table tab_ip_part(
id int,
name string,
ip string,
country string
)
partitioned by (part_flag string)
row format delimited fields terminated by ',';
LOAD DATA LOCAL INPATH '/home/hadoop/ip.txt' OVERWRITE INTO TABLE tab_ip_part PARTITION(part_flag='part1');
LOAD DATA LOCAL INPATH '/home/hadoop/ip_part2.txt' OVERWRITE INTO TABLE tab_ip_part PARTITION(part_flag='part2');
select * from tab_ip_part;
select * from tab_ip_part where part_flag='part2';
select count(*) from tab_ip_part where part_flag='part2';
alter table tab_ip change id id_alter string;
ALTER TABLE tab_cts ADD PARTITION (partCol = 'dt') location '/external/hive/dt';
show partitions tab_ip_part;
Order records are generated every month, and we want statistics over them: which products are the hottest, which sell the most, which get the most clicks, and which are most often bought together.
If analyzing the entire set of orders at once is too heavy, partition the table when you create it to improve efficiency.
That gives you an extra choice: you can still query the whole table, or query just one partition; see the sketch below.
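As a rough sketch of that idea, here is a hypothetical t_order_month table partitioned by month; the table name, columns, and file path are made up for illustration:

-- hypothetical order table, one partition per month
create table t_order_month(id int, name string, price double)
partitioned by (order_month string)
row format delimited fields terminated by '\t';

-- load one month's records into its own partition
load data local inpath '/home/hadoop/orders_201610.txt'
into table t_order_month partition(order_month='2016-10');

-- statistics restricted to a single partition: only that month's data is scanned
select name, count(*) as cnt, sum(price) as total
from t_order_month
where order_month='2016-10'
group by name;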
That is all the demonstration for this part.
Hive's structure and internals
Hive's architecture and design