關於 Hive DML 語法,你可以參考 apache 官方文檔的說明:Hive Data Manipulation Language。
apache的hive版本現在應該是 0.13.0,而我使用的 hadoop 版本是 CDH5.0.1,其對應的 hive 版本是 0.12.0。故只能參考apache官方文檔來看 cdh5.0.1 實現了哪些特性。
因為 hive 版本會持續升級,故本篇文章不一定會和最新版本保持一致。
1. 准備測試數據
首先創建普通表:
create table test(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
創建分區表:
CREATE EXTERNAL TABLE test_p(id int,name string)partitioned by (date STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\,' LINES TERMINATED BY '\n'STORED AS TEXTFILE;
准備數據文件:
[/tmp]# cat test.txt
1,a
2,b
3,c
4,d
2.加載數據
語法如下:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
說明:
- filepath 可能是:
- 一個相對路徑
- 一個絕對路徑,例如:
/root/project/data1
- 一個url地址,可選的可以帶上授權信息,例如:
hdfs://namenode:9000/user/hive/project/data1
- 目標可能是一個表或者分區,如果該表是分區,則必須制定分區列。
- filepath 可以是一個文件也可以是目錄
- 如果指定了
LOCAL
,則:load
命令會在本地查找 filepath。如果 filepath 是相對路徑,則相對於當前路徑,也可以指定一個 url 或者本地文件,例如:file:///user/hive/project/data1
- 如果沒有指定
LOCAL
,則hive會使用全路徑的url,url 中如果沒有制定 schema,則默認使用fs.default.name
的值;如果該路徑不是絕對路徑,則會相對於/user/<username>
- 如果使用
OVERWRITE
,則會刪除原來的數據,然后導入新的數據,否則,就是追加數據。
需要注意的:
filepath
中不能包括子目錄- 如果沒有指定
LOCAL
,則filepath
指向目標表或者分區所在的文件系統。 - 如果需要壓縮,則參考 CompressedStorage
2.1 測試
2.1.1 加載本地文件
a) 加載到普通表中
hive> load data local inpath '/tmp/test.txt' into table test;Copying data from file:/tmp/test.txtCopying file: file:/tmp/test.txtLoading data to table default.testTable default.test stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 16, raw_data_size: 0]OKTime taken: 0.572 seconds
查看hdfs上的數據:
$ hadoop fs -ls /user/hive/warehouse/test
Found 1 items
-rwxrwxrwt 3 hive hadoop 16 2014-06-09 18:36 /user/hive/warehouse/test/test.txt
查看表中數據:
hive> select * from test;OK1 a2 b3 c4 dTime taken: 0.562 seconds, Fetched: 4 row(s)
b) 加載文件到分區表
通常是直接使用 load 命令加載:
LOAD DATA LOCAL INPATH "/tmp/test.txt" INTO TABLE test_p PARTITION (date=20140722)
注意:如果沒有加上
overwrite
關鍵字,則加載相同文件最后會存在多個文件
還有一種方法是:創建分區目錄,手動上傳文件,最后再添加新的分區,代碼如下:
hadoop fs -mkdir /user/hive/warehouse/test/date=20140320
ALTER TABLE test_p ADD IF NOT EXISTS PARTITION (date=20140320);
hive hadoop fs -rm /user/hive/warehouse/test/date=20140320/test.txt
hadoop fs -put /tmp/test.txt /user/hive/warehouse/test/date=20140320
同樣,你也可以查看 hdfs 和表中的數據。
2.1.2 加載hdfs上的文件
拷貝 test.txt 為test_1.txt 並將其上傳到 /user/hive/warehouse
:
$ cp test.txt test_1.txt
$ sudo -u hive hadoop fs -put test_1.txt /user/hive/warehouse
然后將 /user/hive/warehouse/test_1.txt
導入到test表中:
hive> load data inpath '/user/hive/warehouse/test_1.txt' into table test;Loading data to table default.testTable default.test stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 16, raw_data_size: 0]OKTime taken: 2.941 seconds
查看hdfs上的數據:
$ hadoop fs -ls /user/hive/warehouse/test
Found 2 items
-rwxr-xr-x 3 hive hadoop 16 2014-06-09 18:48 /user/hive/warehouse/test/test.txt
-rwxr-xr-x 3 hive hadoop 16 2014-06-09 18:45 /user/hive/warehouse/test/test_1.txt
查看表中數據:
hive> select * from test;OK1 a2 b3 c4 d1 a2 b3 c4 dTime taken: 0.302 seconds, Fetched: 8 row(s)
3. 插入數據
標准語法:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
擴展語法(多個insert):
FROM from_statementINSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2][INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;FROM from_statementINSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2][INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
擴展語法(動態分區insert):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
說明:
- INSERT OVERWRITE 會覆蓋存在的數據
- 輸出的格式和序列化類取決於表的元數據
- hive 0.13.0之后,select語句可以使用 CTEs 表達式,語法請參考 SELECT syntax,示例見 Common Table Expression
Dynamic Partition Inserts
dynamic partition inserts在hive 0.6.0中引入。相關的配置參數有:
hive.exec.dynamic.partition
hive.exec.dynamic.partition.mode
hive.exec.max.dynamic.partitions.pernode
hive.exec.max.dynamic.partitions
hive.exec.max.created.files
hive.error.on.empty.partition
一個示例:
FROM page_view_stg pvsINSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt
4. 導出數據
標准語法:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
擴展語法(多個insert):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
row_format相關語法:
DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char][MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char][NULL DEFINED AS char](Note: Only available starting with Hive 0.13)
說明:
- Directory 可以是一個全路徑的 url。
- 如果指定
LOCAL
,則會將數據寫到本地文件系統。 - 輸出的數據序列化為 text 格式,分隔符為
^A
,行於行之間通過換行符連接。如果存在不是基本類型的列,則這些列將被序列化為 JSON 格式。 - 在 Hive 0.11.0 可以輸出字段的分隔符,之前版本的默認為
^A
。
4.1 測試;
4.1.1 導出到本地文件系統
hive> insert overwrite local directory '/tmp/test' select * from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1402248601715_0016, Tracking URL = http://cdh1:8088/proxy/application_1402248601715_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1402248601715_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-09 19:25:12,896 Stage-1 map = 0%, reduce = 0%
2014-06-09 19:25:20,380 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
2014-06-09 19:25:21,433 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
MapReduce Total cumulative CPU time: 990 msec
Ended Job = job_1402248601715_0016
Copying data to local directory /tmp/test
Copying data to local directory /tmp/test
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.99 sec HDFS Read: 305 HDFS Write: 32 SUCCESS
Total MapReduce CPU Time Spent: 990 msec
OK
Time taken: 18.438 seconds
導出后的數據預覽如下:
[/tmp]# vim test/000000_0
1^Aa
2^Ab
3^Ac
4^Ad
1^Aa
2^Ab
3^Ac
4^Ad
可以看到數據中的列與列之間的分隔符是^A
(ascii碼是\00001
),如果想修改分隔符,可以做如下修改:
hive> insert overwrite local directory '/tmp/test' row format delimited fields terminated by ',' select * from test;
再來查看數據:
vim test/000000_3
1,a
2,b
3,c
4,d
1,a
2,b
3,c
4,d
4.1.2 導出到 HDFS 中
hive> insert overwrite directory '/user/hive/tmp' select * from test;
注意:
和導出文件到本地文件系統的HQL少一個local,數據的存放路徑不一樣了。
4.1.3 導出到Hive的另一個表中
在實際情況中,表的輸出結果可能太多,不適於顯示在控制台上,這時候,將Hive的查詢輸出結果直接存在一個新的表中是非常方便的,我們稱這種情況為CTAS( create table .. as select
)如下:
hive> create table test2 as select * from test;
----EOF