hive中使用rcfile

本文轉載自查看原文 2014-09-19 17:46 4748 Hive

（1）建student & student1 表：（hive 托管）
create table student(id INT, age INT, name STRING)
partitioned by(stat_date STRING)
clustered by(id) sorted by(age) into 4 buckets
row format delimited fields terminated by ',';

create table studentrc(id INT, age INT, name STRING)
partitioned by(stat_date STRING)
clustered by(id) sorted by(age) into 4 buckets
row format delimited fields terminated by ',' stored as rcfile;

create table studentlzo(id INT, age INT, name STRING)
partitioned by(stat_date STRING)
clustered by(id) sorted by(age) into 4 buckets
row format delimited fields terminated by ',' stored as rcfile;

文件格式 textfile， sequencefile， rcfile
（2）設置環境變量：
set hive.enforce.bucketing = true;
（3）插入數據：
LOAD DATA local INPATH '/home/hadoop/hivetest1.txt' OVERWRITE INTO TABLE student partition(stat_date="20120802");

(CPU使用率很高)
from student
insert overwrite table student1 partition(stat_date="20120802")
select id,age,name where stat_date="20120802" sort by age;

查看數據
select id, age, name from student distribute by id ; // distribute相當於mapreduce中的key

抽選數據(一般測試的情況下使用)
select * from student tablesample(bucket 1 out of 2 on id);
TABLESAMPLE(BUCKET x OUT OF y)
其中, x必須比y小, y必須是在創建表的時候bucket on的數量的因子或者倍數, hive會根據y的大小來決定抽樣多少, 比如原本分了32分, 當y=16時, 抽取32/16=2分, 這時TABLESAMPLE(BUCKET 3 OUT OF 16) 就意味着要抽取第3和第16+3=19分的樣品. 如果y=64，這要抽取 32/64=1/2份數據, 這時TABLESAMPLE(BUCKET 3 OUT OF 64) 意味着抽取第3份數據的一半來進行.

rcfile操作

// 導入(gzip壓縮)
set hive.enforce.bucketing=true;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
from student
insert overwrite table studentrc partition(stat_date="20120802")
select id,age,name where stat_date="20120802" sort by age;

// lzo壓縮
set hive.io.rcfile.record.buffer.size = 16777216; // 16 * 1024 * 1024
set io.file.buffer.size = 131072; // 緩沖區大小 128 * 1024

set hive.enforce.bucketing=true;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
set io.compression.codecs=com.hadoop.compression.lzo.LzoCodec;
from student
insert overwrite table studentlzo partition(stat_date="20120802")
select id,age,name where stat_date="20120802" sort by age;

// sequencefile導入
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite table studentseq select * from student;

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 hive中使用變量 Hive中使用python 在Hive中使用Avro MR案例：MR和Hive中使用Lzo壓縮 Hive開發中使用變量的兩種方法 hive的使用 + hive的常用語法 Hive常用函數的使用 Hive的安裝和使用 hive中partition如何使用 hive partition 分區使用