Hive Compression Formats

Reposted article, 2015-08-18 18:42. Category: Hive

TextFile

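The default format for Hive tables. Storage layout: row-oriented.
It can be compressed with Gzip, Bzip2, and similar codecs. A Gzip-compressed file cannot be split, so a single task must read the whole file (Bzip2 is block-based, and Hadoop releases with a splittable BZip2Codec can split it).
During deserialization, every character must be checked to see whether it is a field delimiter or a line terminator, so by this article's estimate deserialization costs tens of times more than it does for SequenceFile.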

-- Create the table:
create table if not exists textfile_table(
  site string,
  url string,
  pv bigint,
  label string)
row format delimited fields terminated by '\t'
stored as textfile;

-- Insert data:
-- Enable compressed output.
set hive.exec.compress.output=true;
set mapred.output.compress=true;
-- Write the output with the Gzip codec.
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table textfile_table select * from T_Name;
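If the compressed text output needs to stay splittable, one variation is to switch the codec to Bzip2. The following is only a sketch: the table name textfile_bz2_table is hypothetical, and it assumes a Hadoop release whose BZip2Codec supports splitting. Bzip2 usually compresses more slowly than Gzip, so this trades write time for parallel reads.

```sql
-- Hypothetical table name; same schema as textfile_table.
create table if not exists textfile_bz2_table(
  site string,
  url string,
  pv bigint,
  label string)
row format delimited fields terminated by '\t'
stored as textfile;

set hive.exec.compress.output=true;
set mapred.output.compress=true;
-- BZip2Codec ships with Hadoop; splitting support is assumed here.
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
insert overwrite table textfile_bz2_table select * from T_Name;
```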

SequenceFile

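A binary file format provided by the Hadoop API, in which records are serialized as <key,value> pairs. Storage layout: row-oriented.
It supports three compression types: NONE, RECORD, and BLOCK. RECORD compresses each value on its own and yields a low compression ratio, so BLOCK compression is generally recommended.
One advantage is that the files are compatible with MapFile from the Hadoop API.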

create table if not exists seqfile_table(
  site string,
  url string,
  pv bigint,
  label string)
row format delimited fields terminated by '\t'
stored as sequencefile;

-- Insert data:
-- Enable compressed output.
set hive.exec.compress.output=true;
set mapred.output.compress=true;
-- Write the output with the Gzip codec.
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
-- Use BLOCK compression.
SET mapred.output.compression.type=BLOCK;
insert overwrite table seqfile_table select * from T_Name;
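On Hadoop 2 and later, the mapred.* names used above are deprecated aliases. A minimal sketch of the same SequenceFile load with the newer property names, assuming a Hadoop 2.x or later cluster:

```sql
set hive.exec.compress.output=true;
-- Newer names for mapred.output.compress, .compression.codec and .compression.type.
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
insert overwrite table seqfile_table select * from T_Name;
```

The old names are generally still accepted, so this is a matter of avoiding deprecation warnings, not a change in behavior.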

RCFile

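Storage layout: data is partitioned horizontally into row groups, and each row group is stored column by column. This combines the advantages of row storage and column storage:

First, RCFile guarantees that all the data of a row is located on the same node, so the cost of tuple reconstruction is low.
Second, like a column store, RCFile can compress data along the column dimension and can skip reading columns that a query does not need.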

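A row group in RCFile consists of three parts:

The first part is the "sync marker" at the head of the row group, used mainly to separate two consecutive row groups within an HDFS block.
The second part is the row group's "metadata header", which stores information about the row group: the number of records it holds, the number of bytes in each column, and the number of bytes in each field of a column.
The third part is the "table data section", that is, the actual column-stored data. In this section, all fields of the same column are stored contiguously.
As the figure in the original article shows, all fields of column A are stored first, then all fields of column B, and so on.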

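Data append: RCFile does not support arbitrary write operations and offers only an append interface, because the underlying HDFS currently supports only appending data to the end of a file.
Row group size: a larger row group improves compression efficiency but can hurt read performance, because it raises the cost of lazy decompression. A larger row group also uses more memory, which affects other MapReduce jobs running concurrently. Weighing storage space against query efficiency, Facebook chose 4 MB as the default row group size, and users can configure a different value.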

create table if not exists rcfile_table(
  site string,
  url string,
  pv bigint,
  label string)
row format delimited fields terminated by '\t'
stored as rcfile;

-- Insert data:
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table rcfile_table select * from T_Name;
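The row group size and column skipping described in this section can be illustrated with a short sketch. It assumes that your Hive version exposes the hive.io.rcfile.record.buffer.size property for the row-group buffer; the query is only an illustration of a projection that touches two of the four columns.

```sql
-- Assumption: this property controls the RCFile row-group buffer size.
-- 4194304 bytes = 4 MB, the default mentioned in the text.
set hive.io.rcfile.record.buffer.size=4194304;

-- Only site and pv are referenced, so the url and label columns
-- of each row group do not need to be read or decompressed.
select site, sum(pv) as total_pv
from rcfile_table
group by site;
```

Set the property before running the insert, since it applies when the file is written.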

ORCFile

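Storage layout: data is partitioned into row blocks, and each block is stored column by column.
Fast compression and fast column access.
More efficient than RCFile; it is an improved version of RCFile.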
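For symmetry with the other formats, here is a minimal sketch of an ORC table. The table name orcfile_table is hypothetical, and the sketch assumes a Hive version that supports `stored as orc` (0.11 or later). ORC handles compression itself, configured through the orc.compress table property (NONE, ZLIB, or SNAPPY), so the mapred output settings are not used here.

```sql
-- Hypothetical table name; same schema as the other examples.
create table if not exists orcfile_table(
  site string,
  url string,
  pv bigint,
  label string)
stored as orc
tblproperties ("orc.compress"="ZLIB");

insert overwrite table orcfile_table select * from T_Name;
```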

Custom formats

Users can define their own input and output formats by implementing InputFormat and OutputFormat.
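The session that follows uses Base64 classes from the org.apache.hadoop.hive.contrib package, which normally ship in a separate hive-contrib jar. If that jar is not already on the classpath, it may need to be registered first. The path below is a placeholder, not a real location.

```sql
-- Placeholder path: point this at the hive-contrib jar of your installation.
add jar /path/to/hive-contrib.jar;
```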

hive> create table myfile_table(str STRING)
    > stored as
    > inputformat 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
    > outputformat 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat';
OK
Time taken: 0.399 seconds

hive> load data local inpath '/root/hive/myfile_table'
    > overwrite into table myfile_table;  -- load the data

hive> dfs -text /user/hive/warehouse/myfile_table/myfile_table;  -- contents of the data file, in encoded form
aGVsbG8saGl2ZQ==
aGVsbG8sd29ybGQ=
aGVsbG8saGFkb29w

hive> select * from myfile_table;  -- decoded through the custom format
OK
hello,hive
hello,world
hello,hadoop
Time taken: 0.117 seconds, Fetched: 3 row(s)

Summary:

Data warehouses are written once and read many times, so overall ORCFile has a fairly clear advantage over the other formats.

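TextFile: the default format and the fastest to load. It can be compressed with Gzip, bzip2, and similar codecs; a Gzip-compressed file cannot be split, so a single file cannot be processed in parallel.
SequenceFile: the lowest compression ratio of the three formats loaded above and average query speed; three compression types are available: NONE, RECORD, and BLOCK.
RCFile: the highest compression ratio and the fastest queries of the three, but the slowest data loading.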
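To check these trade-offs on your own data, one option is to compare the on-disk size of each table from the Hive CLI. This is a sketch that assumes the default warehouse directory /user/hive/warehouse and a Hadoop release whose `-du` accepts the `-s` and `-h` flags.

```sql
-- Run inside the Hive CLI; paths assume the default warehouse directory.
dfs -du -s -h /user/hive/warehouse/textfile_table;
dfs -du -s -h /user/hive/warehouse/seqfile_table;
dfs -du -s -h /user/hive/warehouse/rcfile_table;

-- Shows the InputFormat, OutputFormat and SerDe recorded for a table.
describe formatted rcfile_table;
```

Comparing these sizes against the uncompressed source table gives a rough compression ratio for each format.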
