Hive: Bucketed Tables Explained, with Creation Examples

This article is a repost. 2017-11-09 14:41

Let's look at bucketed tables. Partitioning and bucketing are both fairly hard for beginners to grasp, yet once understood they turn out to be quite simple.

First we create a bucketed table and try loading a data file into it directly.

create table student4(sno int, sname string, sex string, sage int, sdept string)
clustered by (sno) into 3 buckets
row format delimited fields terminated by ',';
-- Force bucketing.
set hive.enforce.bucketing = true;
load data local inpath '/home/hadoop/hivedata/students.txt' overwrite into table student4;

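As we can see, even with bucketing enforced, the student4 table directory contains only one file, the students file we loaded. LOAD DATA simply copies the file into the table directory without running a MapReduce job. A bucket corresponds to a MapReduce partition, and the number of partitions equals the number of output files, so the approach above did not bucket the data at all.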
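One way to check this from the Hive CLI is sketched below. The warehouse path is an assumption (the default hive.metastore.warehouse.dir location); adjust it if your installation stores tables elsewhere.

```sql
-- Show the table metadata; the output should report 3 buckets on column sno.
describe formatted student4;

-- List the files under the table directory (assumed default warehouse path).
-- After LOAD DATA there should be a single file rather than 3 bucket files.
dfs -ls /user/hive/warehouse/student4;
```

The metadata says the table has 3 buckets, while the directory holds just one file. That mismatch is the sign that the data was never actually bucketed.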

Now we use an INSERT to write the same data into another bucketed table.

-- Create a second bucketed table.
create table stu_buck(sno int, sname string, sex string, sage int, sdept string)
clustered by (sno)
sorted by (sno DESC)
into 4 buckets
row format delimited
fields terminated by ',';

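-- Enable bucketing enforcement and set the number of reducers to the number of buckets.
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
-- Start inserting data into the bucketed table (the inserted rows must already be distributed into buckets and sorted).
-- Use distribute by (sno) sort by (sno desc) so the sort order matches the table's sorted by (sno DESC) declaration.
-- When the bucketing column and the sort column are the same and the order is ascending, cluster by (column) can be used instead.
-- Note: cluster by is equivalent to distribute by plus sort by on the same column.
insert into table stu_buck
select sno, sname, sex, sage, sdept from student distribute by (sno) sort by (sno desc);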

Query ID = root_20171109145012_7088af00-9356-46e6-a988-f1fc5f6d2e13
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1510197346181_0014, Tracking URL = http://server71:8088/proxy/application_1510197346181_0014/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1510197346181_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2017-11-09 14:50:59,642 Stage-1 map = 0%,  reduce = 0%
2017-11-09 14:51:38,682 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.04 sec
2017-11-09 14:52:31,935 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 7.91 sec
2017-11-09 14:52:33,467 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 15.51 sec
2017-11-09 14:52:39,420 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 22.5 sec
2017-11-09 14:52:40,953 Stage-1 map = 100%,  reduce = 92%, Cumulative CPU 25.86 sec
2017-11-09 14:52:42,243 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 28.01 sec
MapReduce Total cumulative CPU time: 28 seconds 10 msec
Ended Job = job_1510197346181_0014
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=4, numRows=22, totalSize=527, rawDataSize=505]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 28.01 sec   HDFS Read: 18642 HDFS Write: 819 SUCCESS
Total MapReduce CPU Time Spent: 28 seconds 10 msec
OK
Time taken: 153.794 seconds

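We set the number of reducers to 4. Anyone who has studied MapReduce will know that the number of reducers equals the number of partitions, which in turn equals the number of output files. The job statistics above confirm this with numFiles=4.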
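To see the four bucket files on HDFS, something like the following can be run from the Hive CLI. Both the warehouse path and the file name 000000_0 are assumptions based on common defaults; the actual names depend on your installation and on how the data was written.

```sql
-- List the bucket files; one file per reducer is expected (4 in total).
dfs -ls /user/hive/warehouse/stu_buck;

-- Print the contents of one bucket file (assumed name) to inspect
-- which sno values ended up together and in what order.
dfs -cat /user/hive/warehouse/stu_buck/000000_0;
```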

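There are two reasons to divide a table or partition into buckets:

1. Faster joins. Buckets impose extra structure on a table, so when joining tables that are bucketed on the same column, a map-side join can be used and is more efficient.

2. More efficient sampling. Without buckets, sampling requires scanning the entire dataset.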

hive> create table bucketed_user (id int, name string)
    > clustered by (id) sorted by (id asc) into 4 buckets;

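Key point 1: CLUSTERED BY specifies the column(s) used for bucketing and the number of buckets. Hive takes the hash of the key modulo the number of buckets to choose a bucket, which spreads rows across all buckets, roughly evenly as long as the key values hash evenly.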
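The hash-modulo rule can be approximated with Hive's built-in hash() and pmod() functions, as in the sketch below. It assumes a non-negative integer key, for which the hash is typically the integer itself, and the older hashing scheme. Newer Hive versions may use a different hash function for bucketing, so treat the result as an illustration rather than the exact internal computation.

```sql
-- Approximate the bucket each row would land in for a 4-bucket table on sno.
select sno,
       hash(sno)          as hash_value,
       pmod(hash(sno), 4) as bucket_id
from student;
```

As a hypothetical worked case, a key of 10 would map to bucket 10 mod 4 = 2, and a key of 11 to bucket 3.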

Key point 2: SORTED BY additionally sorts the rows within each bucket by one or more columns.
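As a minimal sketch of filling such a table, the statements below populate bucketed_user from a source table. The table users is hypothetical (it does not appear in the original article) and is assumed to have the columns id int and name string.

```sql
-- `users` is a hypothetical unbucketed source table with columns (id int, name string).
set hive.enforce.bucketing = true;
set mapreduce.job.reduces = 4;

-- cluster by id = distribute by id + sort by id (ascending),
-- which matches "clustered by (id) sorted by (id asc) into 4 buckets".
insert overwrite table bucketed_user
select id, name from users cluster by id;
```

Because the table declares an ascending sort on the bucketing column itself, CLUSTER BY is enough here. A descending sort or a different sort column would need DISTRIBUTE BY together with SORT BY instead.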

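Summary: a bucket is essentially the same concept as a partition in MapReduce. Physically, each bucket is one file in the table (or partition) directory, and the number of buckets (output files) a job produces equals the number of reduce tasks.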

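A partitioned table, by contrast, is a new concept. A partition represents where the data is stored, that is, a directory. Each directory can hold different data files, and a query can reach those files through the directory. The directory itself, however, has nothing to do with the contents of the data files: the partition value lives in the directory name, not in the files.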
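For comparison, here is a small sketch of a partitioned table. The table name stu_part and the department value 'CS' are assumptions made for illustration; only the student table and its columns come from the article.

```sql
-- stu_part is a hypothetical table. sdept is the partition column,
-- so it is not listed among the regular columns.
create table stu_part(sno int, sname string, sex string, sage int)
partitioned by (sdept string)
row format delimited fields terminated by ',';

-- Static partition insert; 'CS' is an assumed department value.
insert into table stu_part partition (sdept = 'CS')
select sno, sname, sex, sage from student where sdept = 'CS';

-- List the partitions; each one maps to a directory such as .../stu_part/sdept=CS/
show partitions stu_part;
```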

Buckets, in contrast, split the data by the value of a column: one large file is hashed into several smaller files.

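These smaller files can each be sorted independently. If another table has been split into files by the same rule, then joining the two tables does not require scanning each table in full. Only the data in corresponding buckets needs to be matched, which greatly improves efficiency.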
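A bucket map join along these lines might look like the sketch below. The table score_buck and its columns are hypothetical, and whether Hive actually chooses a bucket map join depends on the Hive version and its optimizer settings, so this is an illustration of the idea rather than a guaranteed execution plan.

```sql
-- Hypothetical second table, bucketed on the same column into the same number of buckets.
create table score_buck(sno int, course string, grade int)
clustered by (sno) into 4 buckets
row format delimited fields terminated by ',';

-- Ask Hive to consider joining bucket against bucket on the map side.
set hive.optimize.bucketmapjoin = true;

select /*+ MAPJOIN(b) */ a.sno, a.sname, b.course, b.grade
from stu_buck a
join score_buck b on a.sno = b.sno;
```

The join key must be the bucketing column of both tables, and the bucket counts should be equal or multiples of each other for buckets to line up.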

Likewise, sampling does not need to scan the whole dataset. It only needs to extract part of the data from the buckets by the same rule.
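Hive exposes this through the TABLESAMPLE clause. A minimal sketch against the stu_buck table created earlier, which has 4 buckets on sno:

```sql
-- Read only the first of the 4 buckets (roughly a quarter of the rows).
select * from stu_buck tablesample(bucket 1 out of 4 on sno);

-- Read every second bucket starting from the first (buckets 1 and 3, roughly half of the rows).
select * from stu_buck tablesample(bucket 1 out of 2 on sno);
```

Because the sampling column matches the bucketing column, Hive can pick whole bucket files instead of scanning the full table.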
