Hive file formats, and creating and using ORC tables

This article is a repost. 2018-05-16 02:10

Original source: https://blog.csdn.net/longshenlmj/article/details/51702343

Hive tables can be stored in several source file formats:

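1. TEXTFILE

The default format: a table created without an explicit format uses it. When data is loaded, the data file is copied to HDFS as-is, with no processing. The source file can be viewed directly with hadoop fs -cat.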
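As a minimal sketch of this behaviour, the statements below create a TEXTFILE table and load a local file into it. The table name test_text, the tab delimiter and the path /tmp/test_text.txt are assumptions made up for illustration, not part of the original article.

```sql
-- Hypothetical table. STORED AS TEXTFILE could be omitted when the default format is unchanged.
CREATE TABLE IF NOT EXISTS test_text (
  str STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- LOAD DATA places the file in the table directory without converting it.
-- The local path is an assumption.
LOAD DATA LOCAL INPATH '/tmp/test_text.txt' INTO TABLE test_text;
```

After the load, the file in the table's directory on HDFS should still be plain text and readable with hadoop fs -cat.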

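2. SEQUENCEFILE: a binary file format provided by the Hadoop API. It is easy to use, splittable and compressible.

SEQUENCEFILE serializes data into the file as <key,value> pairs. Serialization and deserialization use Hadoop's standard Writable interface. The key is left empty and the value holds the actual data, which avoids the sort in the map phase.

There are three compression options: NONE, RECORD and BLOCK. RECORD gives a low compression ratio, so BLOCK compression is generally recommended. Set the following parameters when using it: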

SET hive.exec.compress.output=true;

SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK

create table test2(str STRING) STORED AS SEQUENCEFILE;

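3. RCFILE

A storage layout that combines row and column storage. First, it splits the data into blocks by row, which guarantees that a given record lives in a single block and avoids reading several blocks to fetch one record. Second, the data inside each block is stored by column, which helps compression and fast column access.

In theory it offers high query efficiency (although, according to the Hive project, the gain is not noticeable and it only saves about 10% of storage space, so it is of limited use and can be skipped).

RCFile combines the fast queries of row storage with the space savings of column storage:

1) Data belonging to the same row sits on the same node, so the cost of tuple reconstruction is low;

2) Column storage inside a block allows column-wise compression and lets reads skip unneeded columns.

During a query, columns that are not needed are skipped at the IO level. In practice, the map phase still copies the whole data block from the remote node to a local directory, and columns are not skipped outright: skipping is done by scanning the header definition of each row group.

However, the header at the level of the whole HDFS block does not record at which row group each column starts and ends. So when all columns are read, RCFile actually performs worse than SequenceFile.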
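A minimal sketch of an RCFILE table and a query that benefits from column skipping. The table name test_rc and its columns are assumptions chosen for illustration.

```sql
-- Hypothetical RCFILE table with three columns.
CREATE TABLE IF NOT EXISTS test_rc (
  advertiser_id STRING,
  ad_plan_id    STRING,
  cnt           BIGINT
)
STORED AS RCFILE;

-- Only cnt is referenced, so the other two columns can be skipped
-- inside each row group when the data is read.
SELECT SUM(cnt) FROM test_rc;
```

By contrast, a SELECT * over the same table reads every column and gets no benefit from the columnar layout inside the row groups.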

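4. ORC: a newer format introduced by Hive, an upgraded version of RCFILE.

5. Custom formats: when a user's data files are in a format that the current Hive does not recognize, custom input and output formats can be defined by implementing InputFormat and OutputFormat.

Reference code: .\hive-0.8.1\src\contrib\src\java\org\apache\hadoop\hive\contrib\fileformat\base64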
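A sketch of how a custom format is typically wired into a table definition, using the base64 format from the contrib package referenced above. The class names are taken from that package as I understand it and should be checked against the Hive version in use; the jar path and the table name test_base64 are assumptions.

```sql
-- The jar location is an assumption; point it at the local hive-contrib jar.
ADD JAR /path/to/hive-contrib.jar;

-- Hypothetical table that reads and writes base64-encoded text lines
-- through the custom InputFormat / OutputFormat pair.
CREATE TABLE IF NOT EXISTS test_base64 (
  str STRING
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat';
```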

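For an introduction to the formats above and their CREATE TABLE statements, see: http://www.cnblogs.com/ggjucheng/archive/2013/01/03/2843318.html

Note:

Only TEXTFILE tables can have plain data files loaded into them directly, because loading copies files without converting them. Loading local data with load, and pointing an external table directly at data under a path, both require a TEXTFILE table.

Going one step further, compressed files in the formats Hive supports by default (that is, the compression formats Hadoop supports by default) can likewise only be read directly through a TEXTFILE table. Other table formats will not work. Such files can be loaded into a TEXTFILE table and then inserted into other tables.

In other words, SequenceFile and RCFile tables cannot load data directly. The data must first be imported into a textfile table and then written from that table into the SequenceFile or RCFile table with insert ... select from.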
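The two-step load described here can be sketched as follows. All table names and the local path are hypothetical.

```sql
-- Step 1: stage the raw text file in a TEXTFILE table.
CREATE TABLE IF NOT EXISTS stage_text (str STRING) STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/data.txt' INTO TABLE stage_text;

-- Step 2: create the target tables in the binary formats.
CREATE TABLE IF NOT EXISTS target_seq (str STRING) STORED AS SEQUENCEFILE;
CREATE TABLE IF NOT EXISTS target_rc  (str STRING) STORED AS RCFILE;

-- INSERT ... SELECT runs a job that rewrites the rows in the target table's format.
INSERT OVERWRITE TABLE target_seq SELECT str FROM stage_text;
INSERT OVERWRITE TABLE target_rc  SELECT str FROM stage_text;
```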

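The source files of SequenceFile and RCFile tables cannot be viewed directly; use select in Hive instead. RCFile source files can be viewed with hive --service rcfilecat /xxxxxxxxxxxxxxxxxxxxxxxxxxx/000000_0, but the layout differs from the table and the output is messy.

For the compressed file formats Hive supports by default, see http://blog.csdn.net/longshenlmj/article/details/50550580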

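ORC format

ORC is an upgraded version of RCFile, with a large performance improvement.

Data can also be stored compressed. The compression ratio is roughly on par with LZO, and compared with text files the space saving can reach 70%. Read performance is very high as well, which makes efficient queries possible.

For details, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC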
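As a small sketch of choosing the ORC compression codec at table creation time, the orc.compress table property can be set. The table name test_orc_snappy is an assumption.

```sql
-- Hypothetical ORC table compressed with Snappy.
-- When orc.compress is not set, ORC uses ZLIB; NONE disables compression.
CREATE TABLE IF NOT EXISTS test_orc_snappy (
  str STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```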

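The CREATE TABLE statements are as follows.

They also change the representation of NULL values in the ORC table from the default \N to ''.

Method 1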

create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT
) partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
STORED AS ORC;

alter table test_orc set serdeproperties('serialization.null.format' = '');


Check the result
hive> show create table test_orc;
CREATE  TABLE `test_orc`(
  `advertiser_id` string, 
  `ad_plan_id` string, 
  `cnt` bigint)
PARTITIONED BY ( 
  `day` string, 
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', 
  `hour` tinyint)
ROW FORMAT DELIMITED 
  NULL DEFINED AS '' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'last_modified_by'='pmp_bi', 
  'last_modified_time'='1465992624', 
  'transient_lastDdlTime'='1465992624')

TBLPROPERTIES is only descriptive information about the table.

Method 2

drop table test_orc;
create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT
) partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
with serdeproperties('serialization.null.format' = '')
STORED AS ORC;


Check the result
hive> show create table test_orc;
CREATE  TABLE `test_orc`(
  `advertiser_id` string, 
  `ad_plan_id` string, 
  `cnt` bigint)
PARTITIONED BY ( 
  `day` string, 
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', 
  `hour` tinyint)
ROW FORMAT DELIMITED 
  NULL DEFINED AS '' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1465992726')

Method 3

drop table test_orc;
create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT
) partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
ROW FORMAT DELIMITED
  NULL DEFINED AS ''
STORED AS ORC;

Check the result
hive> show create table test_orc;
CREATE TABLE `test_orc`(
  `advertiser_id` string,
  `ad_plan_id` string,
  `cnt` bigint)
PARTITIONED BY (
  `day` string,
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck',
  `hour` tinyint)
ROW FORMAT DELIMITED
  NULL DEFINED AS ''
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1465992916')
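
Since an ORC table cannot take a raw text file directly, a table like test_orc would normally be filled with insert ... select. The following is a minimal sketch only: the source table test_orc_src and the partition values are assumptions made up for illustration.

```sql
-- test_orc_src is a hypothetical TEXTFILE table with the same three data columns.
-- The partition values below are made-up examples.
INSERT OVERWRITE TABLE test_orc PARTITION (day = '2016-06-15', type = 0, hour = 10)
SELECT advertiser_id, ad_plan_id, cnt
FROM test_orc_src;

-- Read the partition back to check the stored rows.
SELECT advertiser_id, ad_plan_id, cnt
FROM test_orc
WHERE day = '2016-06-15' AND type = 0 AND hour = 10
LIMIT 10;
```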
