Hive文件格式

本文轉載自查看原文 2014-03-20 14:23 22780 Hadoop I

hive文件存儲格式包括以下幾類：

1、TEXTFILE

2、SEQUENCEFILE

3、RCFILE

4、ORCFILE(0.11以后出現)

其中TEXTFILE為默認格式，建表時不指定默認為這個格式，導入數據時會直接把數據文件拷貝到hdfs上不進行處理；

SEQUENCEFILE，RCFILE，ORCFILE格式的表不能直接從本地文件導入數據，數據要先導入到textfile格式的表中，然后再從表中用insert導入SequenceFile,RCFile,ORCFile表中。

前提創建環境：

hive 0.8

創建一張testfile_table表，格式為textfile。

create table if not exists testfile_table( site string, url string, pv bigint, label string) row format delimited fields terminated by '\t' stored as textfile;

load data local inpath '/app/weibo.txt' overwrite into table textfile_table;

一、TEXTFILE
默認格式，數據不做壓縮，磁盤開銷大，數據解析開銷大。
可結合Gzip、Bzip2使用(系統自動檢查，執行查詢時自動解壓)，但使用這種方式，hive不會對數據進行切分，
從而無法對數據進行並行操作。
示例：

create table if not exists textfile_table(
site string,
url  string,
pv   bigint,
label string)
row format delimited
fields terminated by '\t'
stored as textfile;
插入數據操作：
set hive.exec.compress.output=true;  
set mapred.output.compress=true;  
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  
insert overwrite table textfile_table select * from textfile_table;

二、SEQUENCEFILE
SequenceFile是Hadoop API提供的一種二進制文件支持，其具有使用方便、可分割、可壓縮的特點。
SequenceFile支持三種壓縮選擇：NONE，RECORD，BLOCK。Record壓縮率低，一般建議使用BLOCK壓縮。
示例：

create table if not exists seqfile_table(
site string,
url  string,
pv   bigint,
label string)
row format delimited
fields terminated by '\t'
stored as sequencefile;
插入數據操作：
set hive.exec.compress.output=true;  
set mapred.output.compress=true;  
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  
SET mapred.output.compression.type=BLOCK;
insert overwrite table seqfile_table select * from textfile_table;

三、RCFILE
RCFILE是一種行列存儲相結合的存儲方式。首先，其將數據按行分塊，保證同一個record在一個塊上，避免讀一個記錄需要讀取多個block。其次，塊數據列式存儲，有利於數據壓縮和快速的列存取。
RCFILE文件示例：

create table if not exists rcfile_table(
site string,
url  string,
pv   bigint,
label string)
row format delimited
fields terminated by '\t'
stored as rcfile;
插入數據操作：
set hive.exec.compress.output=true;  
set mapred.output.compress=true;  
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  
insert overwrite table rcfile_table select * from textfile_table;

四、ORCFILE()
五、再看TEXTFILE、SEQUENCEFILE、RCFILE三種文件的存儲情況：

[hadoop@node3 ~]$ hadoop dfs -dus /user/hive/warehouse/*
hdfs://node1:19000/user/hive/warehouse/hbase_table_1    0
hdfs://node1:19000/user/hive/warehouse/hbase_table_2    0
hdfs://node1:19000/user/hive/warehouse/orcfile_table    0
hdfs://node1:19000/user/hive/warehouse/rcfile_table    102638073
hdfs://node1:19000/user/hive/warehouse/seqfile_table   112497695
hdfs://node1:19000/user/hive/warehouse/testfile_table  536799616
hdfs://node1:19000/user/hive/warehouse/textfile_table  107308067
[hadoop@node3 ~]$ hadoop dfs -ls /user/hive/warehouse/*/
-rw-r--r--   2 hadoop supergroup   51328177 2014-03-20 00:42 /user/hive/warehouse/rcfile_table/000000_0
-rw-r--r--   2 hadoop supergroup   51309896 2014-03-20 00:43 /user/hive/warehouse/rcfile_table/000001_0
-rw-r--r--   2 hadoop supergroup   56263711 2014-03-20 01:20 /user/hive/warehouse/seqfile_table/000000_0
-rw-r--r--   2 hadoop supergroup   56233984 2014-03-20 01:21 /user/hive/warehouse/seqfile_table/000001_0
-rw-r--r--   2 hadoop supergroup  536799616 2014-03-19 23:15 /user/hive/warehouse/testfile_table/weibo.txt
-rw-r--r--   2 hadoop supergroup   53659758 2014-03-19 23:24 /user/hive/warehouse/textfile_table/000000_0.gz
-rw-r--r--   2 hadoop supergroup   53648309 2014-03-19 23:26 /user/hive/warehouse/textfile_table/000001_1.gz

總結:
相比TEXTFILE和SEQUENCEFILE，RCFILE由於列式存儲方式，數據加載時性能消耗較大，但是具有較好的壓縮比和查詢響應。數據倉庫的特點是一次寫入、多次讀取，因此，整體來看，RCFILE相比其余兩種格式具有較明顯的優勢。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 hive存儲的文件格式對比 Hive支持的文件格式和壓縮格式及各自特點 Hive文件存儲格式和hive數據壓縮數倉工具hive(四)：Hive文件存儲格式以及優缺點 053 關於hive的存儲格式【HIVE】各種時間格式處理 Hive壓縮格式 Hive之存儲格式 Hive對JSON格式的支持研究 hive表的存儲格式; ORC格式的使用