Loading and Exporting Data in Hive


For the Hive DML syntax, refer to the official Apache documentation: Hive Data Manipulation Language.

The current Apache Hive release is 0.13.0, while the Hadoop distribution I am using is CDH5.0.1, which ships Hive 0.12.0. I therefore read the Apache documentation to see which of these features CDH5.0.1 actually implements.

Since Hive keeps evolving, this article will not necessarily stay in step with the latest release.

1. Preparing Test Data

First, create an ordinary table:

create table test(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Then create a partitioned table:

CREATE EXTERNAL TABLE test_p(id int, name string) PARTITIONED BY (date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Prepare a data file:

[/tmp]# cat test.txt 
1,a
2,b
3,c
4,d

2. Loading Data

The syntax is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Notes:

  • filepath can be:
    • a relative path
    • an absolute path, e.g. /root/project/data1
    • a full URL, optionally carrying authority information, e.g. hdfs://namenode:9000/user/hive/project/data1
  • The load target can be a table or a partition; if the table is partitioned, you must specify the partition columns.
  • filepath can be a single file or a directory.
  • If LOCAL is specified:
    • the load command looks for filepath on the local file system. A relative filepath is resolved against the current working directory; you can also give a URL for a local file, e.g. file:///user/hive/project/data1
  • If LOCAL is not specified, Hive uses the full URL of filepath; if the URL has no scheme, the value of fs.default.name is used, and a non-absolute path is resolved relative to /user/<username>.
  • With OVERWRITE, the existing data is deleted first and the new data is loaded; without it, the data is appended (see the sketch after this list).
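
For example, a minimal sketch of the append-versus-overwrite difference, reusing the test table and data file from section 1:

-- Appends another copy of the rows each time it is run:
LOAD DATA LOCAL INPATH '/tmp/test.txt' INTO TABLE test;
-- Replaces all existing data in the table with the file's contents:
LOAD DATA LOCAL INPATH '/tmp/test.txt' OVERWRITE INTO TABLE test;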

Things to note:

  • filepath must not contain subdirectories.
  • If LOCAL is not specified, filepath must refer to the same file system as the target table or partition.
  • If you need compression, refer to CompressedStorage (a small sketch follows this list).
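
As a small sketch of the compression note above: a gzip-compressed text file can usually be loaded as-is into a TEXTFILE table, and Hive will decompress it when reading (see CompressedStorage for the authoritative details):

$ gzip -c /tmp/test.txt > /tmp/test.txt.gz

hive> load data local inpath '/tmp/test.txt.gz' into table test;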

2.1 Tests

2.1.1 Loading a local file

a) Loading into an ordinary table

hive> load data local inpath '/tmp/test.txt' into table test;
Copying data from file:/tmp/test.txt
Copying file: file:/tmp/test.txt
Loading data to table default.test
Table default.test stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 16, raw_data_size: 0]
OK
Time taken: 0.572 seconds

Check the data on HDFS:

$ hadoop fs -ls /user/hive/warehouse/test
Found 1 items
-rwxrwxrwt   3 hive hadoop         16 2014-06-09 18:36 /user/hive/warehouse/test/test.txt

Check the data in the table:

hive> select * from test;
OK
1   a
2   b
3   c
4   d
Time taken: 0.562 seconds, Fetched: 4 row(s)

b) Loading a file into a partitioned table

Usually you load directly with the load command:

LOAD DATA LOCAL INPATH "/tmp/test.txt" INTO TABLE test_p PARTITION (date='20140722');

Note: without the OVERWRITE keyword, loading the same file repeatedly leaves multiple copies of it in the partition, and thus duplicate rows.

Another approach is to create the partition directory yourself, upload the file by hand, and then add the partition, like this:

hadoop fs -mkdir /user/hive/warehouse/test_p/date=20140320
ALTER TABLE test_p ADD IF NOT EXISTS PARTITION (date='20140320');

hadoop fs -rm /user/hive/warehouse/test_p/date=20140320/test.txt
hadoop fs -put /tmp/test.txt /user/hive/warehouse/test_p/date=20140320

As before, you can check the data on HDFS and in the table.
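
Alternatively, instead of adding each partition by hand, more recent Hive releases offer MSCK REPAIR TABLE, which scans the table directory and registers any partitions missing from the metastore. A sketch, assuming the command is available in your version:

hive> MSCK REPAIR TABLE test_p;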

2.1.2 Loading a file from HDFS

Copy test.txt to test_1.txt and upload it to /user/hive/warehouse:

$ cp test.txt test_1.txt
$ sudo -u hive hadoop fs -put test_1.txt /user/hive/warehouse

Then load /user/hive/warehouse/test_1.txt into the test table:

hive> load data inpath '/user/hive/warehouse/test_1.txt' into table test;
Loading data to table default.test
Table default.test stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 16, raw_data_size: 0]
OK
Time taken: 2.941 seconds

Check the data on HDFS:

$ hadoop fs -ls /user/hive/warehouse/test
Found 2 items
-rwxr-xr-x   3 hive hadoop         16 2014-06-09 18:48 /user/hive/warehouse/test/test.txt
-rwxr-xr-x   3 hive hadoop         16 2014-06-09 18:45 /user/hive/warehouse/test/test_1.txt

Check the data in the table:

hive> select * from test;
OK
1   a
2   b
3   c
4   d
1   a
2   b
3   c
4   d
Time taken: 0.302 seconds, Fetched: 8 row(s)

3. Inserting Data

Standard syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
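
For instance, a minimal sketch that copies the rows of the test table into one partition of test_p (the partition value is arbitrary):

INSERT INTO TABLE test_p PARTITION (date='20140722') SELECT id, name FROM test;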

Extended syntax (multiple inserts):

FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;

FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
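
The multi-insert form reads the source once and writes to several targets. A sketch against the tables created above (the partition values and predicates are made up for illustration):

FROM test
INSERT OVERWRITE TABLE test_p PARTITION (date='20140601') SELECT id, name WHERE id <= 2
INSERT INTO TABLE test_p PARTITION (date='20140602') SELECT id, name WHERE id > 2;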

Extended syntax (dynamic partition inserts):

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

Notes:

Dynamic Partition Inserts

Dynamic partition inserts were introduced in Hive 0.6.0. The related configuration parameters are:

hive.exec.dynamic.partition
hive.exec.dynamic.partition.mode
hive.exec.max.dynamic.partitions.pernode
hive.exec.max.dynamic.partitions
hive.exec.max.created.files
hive.error.on.empty.partition
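
Dynamic partitioning is disabled by default in Hive of this vintage, so a session typically enables it first. A sketch (the limit shown is illustrative, not tuning advice):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=100;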

An example:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

4. Exporting Data

Standard syntax:

INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...

Extended syntax (multiple inserts):

FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

The row_format syntax:

DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
  [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
  [NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)

Notes:

  • directory1 can be a full URL.
  • If LOCAL is specified, the data is written to the local file system.
  • The exported data is serialized as text with ^A as the column separator and a newline between rows. Columns that are not of a primitive type are serialized to JSON.
  • Starting with Hive 0.11.0 the field separator can be customized; earlier versions always use ^A.

4.1 Tests

4.1.1 Exporting to the local file system

hive> insert overwrite local directory '/tmp/test' select * from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1402248601715_0016, Tracking URL = http://cdh1:8088/proxy/application_1402248601715_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1402248601715_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-09 19:25:12,896 Stage-1 map = 0%,  reduce = 0%
2014-06-09 19:25:20,380 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2014-06-09 19:25:21,433 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
MapReduce Total cumulative CPU time: 990 msec
Ended Job = job_1402248601715_0016
Copying data to local directory /tmp/test
Copying data to local directory /tmp/test
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.99 sec   HDFS Read: 305 HDFS Write: 32 SUCCESS
Total MapReduce CPU Time Spent: 990 msec
OK
Time taken: 18.438 seconds

A preview of the exported data:

[/tmp]# vim test/000000_0 
1^Aa
2^Ab
3^Ac
4^Ad
1^Aa
2^Ab
3^Ac
4^Ad

As you can see, the separator between columns is ^A (ASCII code \001). To use a different separator, modify the statement like this:

hive> insert overwrite local directory '/tmp/test' row format delimited fields terminated by ',' select * from test;

Check the data again:

vim test/000000_3 
1,a
2,b
3,c
4,d
1,a
2,b
3,c
4,d

4.1.2 Exporting to HDFS

hive> insert overwrite directory '/user/hive/tmp' select * from test;

Note:

Compared with the statement that exports to the local file system, this one simply drops the local keyword, and the data ends up under a different path, on HDFS.
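
You can then inspect the result directly on HDFS. The 000000_0 file name below is the usual output of a single-map job and may differ in your run:

$ hadoop fs -cat /user/hive/tmp/000000_0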

4.1.3 Exporting into another Hive table

In practice a query's result set is often too large to be useful on the console; in that case it is convenient to store the query output directly in a new table. This is known as CTAS (CREATE TABLE ... AS SELECT):

hive> create table test2 as select * from test;
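
CTAS also lets you choose the row format and storage format of the result table. A sketch, where test3 is a hypothetical table name:

hive> create table test3 row format delimited fields terminated by ',' stored as textfile as select * from test;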

----EOF

