Importing and Exporting Hive Table Data (load data & insert overwrite)


1. Preparing Test Data

First, create an ordinary (non-partitioned) table:

create table test(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; 

Then create a partitioned table:

CREATE EXTERNAL TABLE test_p( id int, name string ) partitioned by (date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Prepare a data file:

[/tmp]# cat test.txt
1,a
2,b
3,c
4,d
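If you want to follow along, the sample file can be recreated with a heredoc (using the same /tmp/test.txt path as above):

```shell
# Recreate the four-row sample file used throughout this post.
cat > /tmp/test.txt <<'EOF'
1,a
2,b
3,c
4,d
EOF
```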

2. Loading Data

The syntax is:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] 

Notes:

  • filepath may be:
  • a relative path
  • an absolute path, e.g. /root/project/data1
  • a full URL, optionally including authority information, e.g. hdfs://namenode:9000/user/hive/project/data1
  • The target may be a table or a partition; if the table is partitioned, the partition columns must be specified.
  • filepath may be a single file or a directory.
  • If LOCAL is specified:
  • the load command looks for filepath on the local file system; a relative path is resolved against the current working directory. A URL to a local file may also be given, e.g. file:///user/hive/project/data1
  • If LOCAL is not specified, Hive uses the full URL; if the URL does not specify a scheme, the value of fs.default.name is used; if the path is not absolute, it is resolved relative to /user/<username>.
  • If OVERWRITE is used, the existing data is deleted before the new data is loaded; otherwise, the new data is appended.

Things to note:

  • filepath must not contain subdirectories.
  • If LOCAL is not specified, filepath must refer to the same file system as the target table or partition.
  • For compressed data, see CompressedStorage.
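The path-resolution rules above can be condensed into a short sketch. This is an illustrative model of the behavior described, not Hive's actual implementation; the fs.default.name value, username, and working directory are assumed values:

```python
# Illustrative model of how LOAD DATA resolves 'filepath' (not Hive source code).
def resolve_load_path(filepath, local=False,
                      fs_default_name="hdfs://namenode:9000",  # assumed fs.default.name
                      username="hive",                         # assumed current user
                      cwd="/tmp"):                             # assumed working directory
    if "://" in filepath:                 # already a full URL: use as-is
        return filepath
    if local:                             # LOCAL: resolve on the local file system
        path = filepath if filepath.startswith("/") else cwd + "/" + filepath
        return "file://" + path
    if filepath.startswith("/"):          # absolute path: prepend the default scheme
        return fs_default_name + filepath
    # relative path without LOCAL: resolved under /user/<username>
    return f"{fs_default_name}/user/{username}/{filepath}"
```

For example, `resolve_load_path("data1")` resolves to a path under /user/hive on the default file system, while `resolve_load_path("/tmp/test.txt", local=True)` yields a file:// URL.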

2.1 Tests

2.1.1 Loading a local file

a) Into an ordinary table

hive> load data local inpath '/tmp/test.txt' into table test;
Copying data from file:/tmp/test.txt
Copying file: file:/tmp/test.txt
Loading data to table default.test
Table default.test stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 16, raw_data_size: 0]
OK
Time taken: 0.572 seconds

Check the data on HDFS:

$ hadoop fs -ls /user/hive/warehouse/test
Found 1 items
-rwxrwxrwt   3 hive hadoop         16 2014-06-09 18:36 /user/hive/warehouse/test/test.txt

Check the data in the table:

hive> select * from test;
OK
1	a
2	b
3	c
4	d
Time taken: 0.562 seconds, Fetched: 4 row(s)

b) Into a partitioned table

Usually you load directly with the load command:

LOAD DATA LOCAL INPATH "/tmp/test.txt" INTO TABLE test_p PARTITION (date=20140722) 

Note: without the overwrite keyword, loading the same file repeatedly leaves multiple copies of it in the partition directory.

An alternative is to create the partition directory, upload the file by hand, and finally add the new partition, as follows:

hadoop fs -mkdir /user/hive/warehouse/test_p/date=20140320
hive> ALTER TABLE test_p ADD IF NOT EXISTS PARTITION (date=20140320);
hadoop fs -rm /user/hive/warehouse/test_p/date=20140320/test.txt
hadoop fs -put /tmp/test.txt /user/hive/warehouse/test_p/date=20140320

As before, you can check the data on HDFS and in the table.

2.1.2 Loading a file from HDFS

Copy test.txt to test_1.txt and upload it to /user/hive/warehouse:

$ cp test.txt test_1.txt
$ sudo -u hive hadoop fs -put test_1.txt /user/hive/warehouse

Then load /user/hive/warehouse/test_1.txt into the test table:

hive> load data inpath '/user/hive/warehouse/test_1.txt' into table test;
Loading data to table default.test
Table default.test stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 16, raw_data_size: 0]
OK
Time taken: 2.941 seconds

Check the data on HDFS:

$ hadoop fs -ls /user/hive/warehouse/test
Found 2 items
-rwxr-xr-x   3 hive hadoop         16 2014-06-09 18:48 /user/hive/warehouse/test/test.txt
-rwxr-xr-x   3 hive hadoop         16 2014-06-09 18:45 /user/hive/warehouse/test/test_1.txt

Check the data in the table:

hive> select * from test;
OK
1	a
2	b
3	c
4	d
1	a
2	b
3	c
4	d
Time taken: 0.302 seconds, Fetched: 8 row(s)

3. Inserting Data

Standard syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

Extended syntax (multiple inserts):

FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

Extended syntax (dynamic-partition insert):

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

Notes:

Dynamic Partition Inserts

Dynamic partition inserts were introduced in Hive 0.6.0. The related configuration parameters are:

hive.exec.dynamic.partition
hive.exec.dynamic.partition.mode
hive.exec.max.dynamic.partitions.pernode
hive.exec.max.dynamic.partitions
hive.exec.max.created.files
hive.error.on.empty.partition
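A typical session setup before running a dynamic-partition insert looks like this (the values shown are illustrative, not recommendations):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partition columns to be dynamic
SET hive.exec.max.dynamic.partitions.pernode=100;
SET hive.exec.max.dynamic.partitions=1000;
```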

An example:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

4. Exporting Data

Standard syntax:

INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...

Extended syntax (multiple inserts):

FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

row_format syntax:

DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
  [COLLECTION ITEMS TERMINATED BY char]
  [MAP KEYS TERMINATED BY char]
  [LINES TERMINATED BY char]
  [NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)

Notes:

  • directory can be a full URL.
  • If LOCAL is specified, the data is written to the local file system.
  • The exported data is serialized as text, with ^A as the field separator and newlines between rows. Columns that are not of a primitive type are serialized as JSON.
  • Starting with Hive 0.11.0, the field delimiter can be specified on export; earlier versions always use ^A.
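As a quick illustration of this default serialization, a ^A-delimited export can be parsed with a few lines of code (a minimal sketch; the sample string below stands in for the contents of a real export file):

```python
# Parse Hive's default text export: fields separated by ^A (\x01), rows by newlines.
def parse_hive_export(text, delimiter="\x01"):
    return [line.split(delimiter) for line in text.splitlines() if line]

# Sample string mimicking the layout of an exported file.
exported = "1\x01a\n2\x01b\n3\x01c\n4\x01d\n"
rows = parse_hive_export(exported)  # [['1', 'a'], ['2', 'b'], ['3', 'c'], ['4', 'd']]
```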

4.1 Tests

4.1.1 Exporting to the local file system

hive> insert overwrite local directory '/tmp/test' select * from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1402248601715_0016, Tracking URL = http://cdh1:8088/proxy/application_1402248601715_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1402248601715_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-09 19:25:12,896 Stage-1 map = 0%, reduce = 0%
2014-06-09 19:25:20,380 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
2014-06-09 19:25:21,433 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
MapReduce Total cumulative CPU time: 990 msec
Ended Job = job_1402248601715_0016
Copying data to local directory /tmp/test
Copying data to local directory /tmp/test
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 0.99 sec   HDFS Read: 305 HDFS Write: 32 SUCCESS
Total MapReduce CPU Time Spent: 990 msec
OK
Time taken: 18.438 seconds

A preview of the exported data:

[/tmp]# vim test/000000_0
1^Aa
2^Ab
3^Ac
4^Ad
1^Aa
2^Ab
3^Ac
4^Ad

As you can see, the column separator in the data is ^A (ASCII code \001). To change the delimiter, modify the statement as follows:

hive> insert overwrite local directory '/tmp/test' row format delimited fields terminated by ',' select * from test; 

Check the data again:

vim test/000000_3
1,a
2,b
3,c
4,d
1,a
2,b
3,c
4,d

4.1.2 Exporting to HDFS

hive> insert overwrite directory '/user/hive/tmp' select * from test; 

Note:

Compared with the export to the local file system, this HQL only omits the local keyword, but the data ends up in a different location (on HDFS).

4.1.3 Exporting into another Hive table

In practice, a query may return too many rows to display comfortably on the console. In that case it is convenient to store the query result directly in a new table, a pattern known as CTAS (CREATE TABLE ... AS SELECT):

hive> create table test2 as select * from test;

