Hive Partitioned Tables and Bucketed Tables: Overview and Differences

This article is a repost. 2021-07-22 18:57, category: Hive

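Hive Partitioning

A partitioned table stores each partition as a subdirectory, named after the partition, under the table's directory.

Purpose: partition pruning avoids full table scans, reduces the amount of data MapReduce has to process, and improves efficiency.

In a typical company's Hive warehouse, nearly all tables are partitioned, usually by date or by region.

When querying a partitioned table, remember to filter on the partition column.

More partitions is not always better: partitioning usually should not go deeper than three levels, weighed against the actual business needs.

Creating a partitioned table: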

create table students_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(pt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Adding a partition:

alter table students_pt add partition(pt='20210622');
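
Since each partition is a subdirectory of the table directory, one way to see where the new partition lives is to ask Hive for its metadata. This is a minimal sketch; the exact path depends on the warehouse configuration, so no concrete location is assumed here.

```sql
-- Print the metadata of a single partition; the "Location" field in the
-- output points at the partition's directory, which typically ends in
-- pt=20210622 under the table's own directory.
describe formatted students_pt partition(pt='20210622');
```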

Dropping a partition:

alter table students_pt drop partition(pt='20210112');
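
When these statements run inside a scheduled script, it can help to make them safe to re-run. The sketch below uses the optional IF NOT EXISTS / IF EXISTS clauses so that a repeated run does not fail; the partition values are the same illustrative ones used above.

```sql
-- Adding a partition that already exists is silently skipped
alter table students_pt add if not exists partition(pt='20210622');

-- Dropping a partition that is already gone is silently skipped.
-- Note: for a managed (non-external) table, dropping a partition
-- also removes the partition's data.
alter table students_pt drop if exists partition(pt='20210112');
```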

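Listing all partitions of a table:

show partitions students_pt; -- recommended: reads the partition info directly from the metastore

select distinct pt from students_pt; -- not recommended: runs a query over the table

Inserting data into a partition: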

insert into table students_pt partition(pt='20210101') select * from students;
load data local inpath '/usr/local/soft/data/students.txt' into table students_pt partition(pt='20210111');
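
Both statements above append, so running them twice duplicates the rows in the partition. A possible variant, assuming the students table has exactly the five columns id, name, age, gender and clazz (the source only uses select *), replaces the partition's contents instead:

```sql
-- Overwrite the partition rather than append to it, so a re-run
-- leaves exactly one copy of the data. Columns are listed explicitly
-- (assumed column names) instead of relying on select *.
insert overwrite table students_pt partition(pt='20210101')
select id, name, age, gender, clazz
from students;
```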

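Querying the data in a partition:

-- full table scan: not recommended, inefficient
select count(*) from students_pt;

-- a where condition on the partition column enables partition pruning and avoids the full table scan
select count(*) from students_pt where pt='20210101';

-- range (non-equality) predicates also work in the where clause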
select count(*) from students_pt where pt<='20210112' and pt>='20210110';
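
To check that a filter really prunes partitions, and to guard against accidental full scans, something like the following could be used. This is a sketch under assumptions: the strict-check property name applies to newer Hive releases (2.x and later, as far as I know), while older releases used hive.mapred.mode=strict for the same purpose.

```sql
-- List the tables and partitions the query would read, without running it;
-- with pruning in effect only pt=20210101 should appear among the inputs.
explain dependency
select count(*) from students_pt where pt='20210101';

-- Optionally reject queries on partitioned tables that have no partition
-- filter (assumption: a Hive version that supports this property).
set hive.strict.checks.no.partition.filter=true;
```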

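Hive Dynamic Partitioning

Sometimes the data in the source table contains a date column dt, and we need to split the rows into partitions according to the distinct dates in dt, turning the source table into a partitioned table.

Depending on the Hive version, dynamic partitioning may be disabled by default, and the default mode is strict, so a fully dynamic insert needs the settings below.

Dynamic partitioning: partitions are created according to the distinct values of one or more columns in the data.

Enabling dynamic partition support in Hive

-- enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true;
-- dynamic partition mode: strict (at least one static partition must be given) or nonstrict
-- strict example: insert into table students_pt partition(dt='anhui',pt) select ......,pt from students;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
-- maximum number of dynamic partitions each mapper or reducer node may create, set to 1000 here; tune it for your workload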
hive> set hive.exec.max.dynamic.partitions.pernode=1000;
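
The per-node limit is not the only cap on dynamic partitions. As a hedged sketch, two related settings can be raised together with it when a job creates many partitions; the values shown are the commonly documented defaults, not recommendations from the original article.

```sql
-- Total number of dynamic partitions a single statement may create
set hive.exec.max.dynamic.partitions=1000;

-- Total number of HDFS files all mappers and reducers of a job may create
set hive.exec.max.created.files=100000;
```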

Creating the source table and loading data

create table students_dt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
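
To fill the source table, a load statement along these lines could be used. The file name students_dt.txt is a hypothetical example that does not appear in the original; the file is assumed to be comma-delimited with dt as the sixth field.

```sql
-- Hypothetical input file: each line is id,name,age,gender,clazz,dt
load data local inpath '/usr/local/soft/data/students_dt.txt' into table students_dt;

-- Quick sanity check that dt was parsed into its own column
select id, name, dt from students_dt limit 5;
```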

Creating the partitioned table and loading data

create table students_dt_p
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

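Inserting data with dynamic partitioning

-- Partition columns must come last in the select list (the same holds for multiple partition columns); they are matched by position, not by name
insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,dt from students_dt;
-- For example, the statement below uses age as the partition value instead of the dt column of students_dt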
insert into table students_dt_p partition(dt) select id,name,age,gender,dt,age from students_dt;
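
Because matching is positional, a mistake like the one above does not raise an error; it just creates unexpected partitions. A simple way to catch it is to inspect the result after the insert:

```sql
-- Partition names should look like dt=<date>; values such as dt=22
-- would indicate that age was picked up as the partition column.
show partitions students_dt_p;

-- Row count per partition, to confirm the rows were split as expected
select dt, count(*) as cnt
from students_dt_p
group by dt;
```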

Multi-level partitioning

create table students_year_month
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    year string,
    month string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

create table students_year_month_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(year string,month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

insert into table students_year_month_pt partition(year,month) select id,name,age,gender,clazz,year,month from students_year_month;
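
The fully dynamic insert above requires nonstrict mode. Under strict mode, a mixed form with a static leading partition and a dynamic trailing one should still be accepted. The value '2021' below is only an illustrative assumption about the data.

```sql
-- year is fixed (static), month is taken from the last select column (dynamic).
-- The static partition column is not repeated in the select list.
insert into table students_year_month_pt partition(year='2021', month)
select id, name, age, gender, clazz, month
from students_year_month
where year='2021';
```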

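Hive Bucketing

Bucketing is a further split of the files (the data) themselves.

In Hive 1.x and earlier, bucketing enforcement is off by default.

Purpose: when rows are inserted into a bucketed table, Hive hashes the column named in clustered by and takes the result modulo the number of buckets, which splits the data into that many files. This spreads the data evenly, makes it convenient to take samples, and improves map join efficiency.

The bucketing column should be chosen according to the business; a good choice can mitigate data skew.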
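
The hash-then-modulo idea can be previewed with Hive's built-in hash() function before a bucketing column is chosen. This is only an illustrative sketch: it assumes 12 buckets and a students table with a clazz column, and the hash Hive actually uses when writing bucket files depends on the Hive version, so the numbers below may not match the real file assignment.

```sql
-- Mask off the sign bit, then take the remainder by the bucket count.
-- A very uneven cnt per bucket_no suggests the column would skew the buckets.
select (hash(clazz) & 2147483647) % 12 as bucket_no,
       count(*) as cnt
from students
group by (hash(clazz) & 2147483647) % 12;
```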

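Turning on the bucketing switch

-- needed on Hive 1.x and earlier; Hive 2.0 removed this setting and always enforces bucketing
hive> set hive.enforce.bucketing=true;

Creating a bucketed table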

create table students_buks
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
CLUSTERED BY (clazz) into 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

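Inserting data into a bucketed table

-- load data on its own does not distribute the rows into buckets
load data local inpath '/usr/local/soft/data/students.txt' into table students_buks;

-- insert through a query, as below, so that the bucketed table actually takes effect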
insert into students_buks select * from students;
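
Once the table has been populated through insert, the sampling benefit mentioned earlier can be tried with tablesample. A minimal sketch against the 12-bucket table defined above:

```sql
-- Read only the first of the 12 buckets. Because the sample is taken on the
-- bucketing column, Hive can read just that bucket instead of the whole table.
select * from students_buks tablesample(bucket 1 out of 12 on clazz);
```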
