Hive is a data warehouse tool for working with structured data in Hadoop. It sits on top of the Hadoop stack and lets users manipulate that data through SQL.
Hive query execution strictly follows the Hadoop MapReduce job model: Hive compiles the user's HiveQL statement into MapReduce jobs and submits them to the Hadoop cluster; Hadoop monitors job execution and returns the results to the user. Hive was not designed for online transaction processing, and it provides neither real-time queries nor row-level data updates. Its best use case is batch processing over large data sets.
Below is a summary of commonly used SQL syntax for Hive operations:
Anything enclosed in "[ ]" is an optional clause that may be written or omitted.
Creating a database

CREATE DATABASE name;
- Show/describe commands

show tables;                            -- list tables
show databases;                         -- list databases
show partitions table_name;             -- list all partitions of the table named table_name
show functions;                         -- list all functions
describe extended table_name col_name;  -- inspect a column of a table
DDL (Data Definition Language)
- Creating a table

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
- CREATE TABLE creates a table with the given name. If a table of the same name already exists, an error is raised; the IF NOT EXISTS option ignores that error
- The EXTERNAL keyword creates an external table, specifying a path to the actual data (LOCATION) at creation time
- LIKE copies an existing table's schema without copying its data
- COMMENT attaches a description to the table or a column
- ROW FORMAT specifies how row data is delimited
DELIMITED [FIELDS TERMINATED BY char]
          [COLLECTION ITEMS TERMINATED BY char]
          [MAP KEYS TERMINATED BY char]
          [LINES TERMINATED BY char]
| SERDE serde_name
  [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
- STORED AS
SEQUENCEFILE
| TEXTFILE
| RCFILE
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
If the file data is plain text, use STORED AS TEXTFILE.
If the data needs to be compressed, use STORED AS SEQUENCEFILE.
Creating a simple table:

CREATE TABLE person(name STRING, age INT);
Creating an external table:

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'table description goes here'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS TEXTFILE
 LOCATION '<hdfs_location>';
Creating a partitioned table:

CREATE TABLE par_table(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(date STRING, pos STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
   LINES TERMINATED BY '\n'
 STORED AS SEQUENCEFILE;
Creating a bucketed table:

CREATE TABLE bucket_table(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(date STRING, pos STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
   LINES TERMINATED BY '\n'
 STORED AS SEQUENCEFILE;
Creating a table with an index field:

CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (dindex STRING);
Copying an empty table (schema only, no data):

CREATE TABLE empty_key_value_store LIKE key_value_store;
Showing all tables:

SHOW TABLES;
Showing tables that match a regular expression:

SHOW TABLES '.*s';
Adding a column to a table:

ALTER TABLE pokes ADD COLUMNS (new_col INT);
Adding a column with a comment:

ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
Dropping columns (REPLACE COLUMNS keeps only the listed columns):

ALTER TABLE test REPLACE COLUMNS(id BIGINT, name STRING);
Renaming a table:

ALTER TABLE events RENAME TO new_events;
Adding and dropping partitions

-- add:
ALTER TABLE table_name ADD [IF NOT EXISTS]
  partition_spec [ LOCATION 'location1' ]
  partition_spec [ LOCATION 'location2' ] ...
-- where partition_spec is:
--   PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

-- drop:
ALTER TABLE table_name DROP partition_spec, partition_spec, ...
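As a concrete sketch of the syntax above (the table name and daily `ds` partition column here are hypothetical, chosen only for illustration):

```sql
-- add a daily partition, optionally pointing it at an explicit HDFS path
ALTER TABLE page_view_log ADD IF NOT EXISTS
  PARTITION (ds='2008-08-15')
  LOCATION '/user/hive/warehouse/page_view_log/ds=2008-08-15';

-- drop the same partition again
ALTER TABLE page_view_log DROP PARTITION (ds='2008-08-15');
```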
Changing a table's file format and physical organization:

ALTER TABLE table_name SET FILEFORMAT file_format;
ALTER TABLE table_name CLUSTERED BY(userid) SORTED BY(viewTime) INTO num_buckets BUCKETS;
-- these commands change the table's physical storage properties
Creating and dropping views:

-- create a view:
CREATE VIEW [IF NOT EXISTS] view_name
  [ (column_name [COMMENT column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS select_statement;

-- drop a view:
DROP VIEW view_name;
DML (Data Manipulation Language)

The data manipulation language covers the three modifying operations on a database: INSERT, UPDATE, and DELETE.
Loading files into a table:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
  INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
-- a LOAD is a pure copy/move: the data file is moved to the location of the Hive table

-- load from the local file system
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

-- load from HDFS, specifying the partition at the same time
LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
Inserting query results into a Hive table:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
  select_statement1 FROM from_statement;

-- multi-insert:
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...

-- dynamic-partition insert:
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...)
  select_statement FROM from_statement;
Writing query results to the HDFS file system:

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...

-- multi-directory form:
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2]
INSERT INTO
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
  select_statement1 FROM from_statement;
The difference between INSERT OVERWRITE and INSERT INTO:
- INSERT OVERWRITE replaces existing data: the target table's current rows are removed first, then the new rows are written.
- INSERT INTO simply appends, ignoring whatever the table already holds; afterwards the table contains both the original rows and the newly inserted ones.
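A small sketch of that difference; the tables `t` and `src` are hypothetical, and `t` is assumed to already contain some rows:

```sql
-- OVERWRITE: removes t's existing data first, then writes the query result
INSERT OVERWRITE TABLE t SELECT * FROM src WHERE src.key < 100;

-- INTO: appends; afterwards t holds its old rows plus the new ones
INSERT INTO TABLE t SELECT * FROM src WHERE src.key < 100;
```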
DQL (Data Query Language): SELECT operations
Structure of a SELECT query:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[ CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]
]
[LIMIT number]
- ALL and DISTINCT control how duplicate rows are handled. The default is ALL, which returns every row; DISTINCT removes duplicates
- The WHERE condition works like the WHERE clause in traditional SQL
- ORDER BY performs a global sort, which forces a single reduce task
- SORT BY sorts only within each reducer, so the result is not globally ordered
- LIMIT restricts the number of rows returned (and, where supported, the starting offset)
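To make the ORDER BY / SORT BY distinction concrete, here is a hedged sketch; the `sales` table and its `region` and `amount` columns are hypothetical:

```sql
-- global sort: every row passes through a single reducer
SELECT * FROM sales ORDER BY amount DESC LIMIT 10;

-- per-reducer sort: rows are first distributed to reducers by region,
-- then sorted by amount within each reducer; there is no global order
SELECT * FROM sales DISTRIBUTE BY region SORT BY amount DESC;
```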
Writing query output to an HDFS directory:

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
Writing query output to a local directory:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
Inserting the results of one table into another:

FROM invites a
INSERT OVERWRITE TABLE events
SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;

INSERT OVERWRITE TABLE events
SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

-- with a JOIN
FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar)
INSERT OVERWRITE TABLE events
SELECT t1.bar, t1.foo, t2.foo;
Inserting from one table into multiple destinations

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 AND src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
Hive supports only equality joins, outer joins, and left semi joins. Non-equality joins are not supported, because they are very hard to translate into map/reduce jobs.
- The LEFT, RIGHT, and FULL OUTER keywords handle rows with no match on the other side of the join
- LEFT SEMI JOIN is a more efficient way to express an IN/EXISTS subquery
- During a join, each map/reduce task works like this: the reducer buffers the records of every table in the join sequence except the last one, then streams the last table through and serializes the results to the file system
In practice, prefer joining a small table against a large one, i.e. list the smaller tables first so that the large table is the one streamed.
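Building on the small-table tip above, Hive can also be told to load the small table into memory and perform a map-side join via the MAPJOIN hint. This is standard Hive syntax, but the table and column names here are hypothetical:

```sql
-- small_dim is read into memory on every mapper and big_fact is streamed,
-- so the join needs no reduce phase at all
SELECT /*+ MAPJOIN(small_dim) */ f.key, f.value, d.name
FROM big_fact f JOIN small_dim d ON (f.key = d.key);
```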
Points to note in join queries:

-- only equality joins are supported
SELECT a.* FROM a JOIN b ON (a.id = b.id)
-- multiple equality conditions are allowed
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
-- more than two tables can be joined
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
-- if every table in the join uses the same join key, the join is compiled into a single map/reduce job
The LEFT, RIGHT, and FULL OUTER keywords

-- left outer join
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
-- right outer join
SELECT a.val, b.val FROM a RIGHT OUTER JOIN b ON (a.key=b.key)
-- full outer join
SELECT a.val, b.val FROM a FULL OUTER JOIN b ON (a.key=b.key)
The LEFT SEMI JOIN keyword

-- the restriction on LEFT SEMI JOIN is that the right-hand table may only be
-- filtered in the ON clause, never in the WHERE clause, the SELECT list, or anywhere else
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
-- can be rewritten as:
SELECT a.key, a.value FROM a LEFT SEMI JOIN b ON (a.key = b.key)
UNION and UNION ALL

-- combine the results of multiple SELECTs; the SELECTs' column lists must match
select_statement UNION ALL select_statement UNION ALL select_statement ...
-- difference between UNION and UNION ALL:
-- UNION returns only the distinct rows across the inputs; duplicates are removed
-- UNION ALL returns every row from every input, duplicates included
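A concrete sketch of the two variants; the tables `t1` and `t2` are hypothetical and assumed to share an identical schema (id INT, name STRING):

```sql
-- duplicates kept
SELECT id, name FROM t1
UNION ALL
SELECT id, name FROM t2;

-- duplicates removed; on older Hive versions that accept only UNION ALL,
-- the same result can be produced explicitly with DISTINCT over a subquery
SELECT DISTINCT id, name FROM (
  SELECT id, name FROM t1
  UNION ALL
  SELECT id, name FROM t2
) u;
```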