The differences between order by, sort by, distribute by, and cluster by in Hive (**very detailed**)

Reposted article. 2019-06-17 17:55. Tag: hive

Hive query syntax

select [all | distinct] select_expr, select_expr, ...
from table_name a
[join table_other b on a.id=b.id]
[where where_condition]
[group by col_list [having condition]]
[order by col_list]
[cluster by col_list | [distribute by col_list] [sort by col_list]]
[limit number]

Prepare the data:

create table if not exists stu_test(id int,name string,sex string,age int)
row format delimited fields terminated by ','
;

insert into stu_test values
(1,'zs','m',18)
,(2,'ls','m',19)
,(3,'ww','m',20)
,(4,'zq','f',18)
,(5,'ll','f',21)
,(6,'hl','f',19)
,(7,'xh','f',20)
,(8,'cl','f',22)
,(9,'fj','m',19)
,(10,'wb','m',23)
,(11,'wf','f',24)
,(12,'jj','m',21)
,(13,'yy','m',20)
,(14,'ld','f',18)
,(15,'ch','f',22)
;

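1. order by col_list:

order by performs a global sort, ascending (asc) by default. To guarantee a total order, all rows go through a single reducer, so there is only one reduce task and one output file (for example 000000_0).
With large inputs, this can take a long time to compute.

If hive.mapred.mode=strict is set (nonstrict is the default in Hive 0.x and 1.x), order by must be accompanied by a limit clause. The reason: all rows are processed on a single reducer, and with a large data set the query may never produce a result, so strict mode requires the number of output rows to be bounded.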
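The strict-mode rule above could be exercised roughly as follows. This is a minimal sketch that assumes a Hive version where hive.mapred.mode is still honored (for example Hive 1.x); the limit value 5 is arbitrary.

```sql
-- Assumption: this Hive version still honors hive.mapred.mode.
set hive.mapred.mode=strict;

-- Expected to be rejected in strict mode: order by without limit.
-- select * from stu_test order by age desc;

-- Accepted: the limit caps the number of rows returned.
select * from stu_test order by age desc limit 5;
```

With the sample data, this should return the five oldest students; the order among rows with equal age is not guaranteed.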

Example: sort stu_test by age

select * from stu_test order by age desc;

Result:

id name sex age
11 wf f 24 
10 wb m 23 
15 ch f 22 
8  cl f 22 
5  ll f 21 
12 jj m 21 
13 yy m 20 
7  xh f 20 
3  ww m 20 
9  fj m 19 
6  hl f 19 
2  ls m 19 
14 ld f 18 
1  zs m 18 
4  zq f 18

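2. sort by col_list:

sort by performs a local sort: rows are sorted before they are fed to each reducer. So if you sort with sort by and set mapred.reduce.tasks>1 (named mapreduce.job.reduces in newer Hadoop versions),
sort by only guarantees that each reducer's output is ordered; it does not guarantee a global order.
Each reduce task produces its own sorted output. When there is only one reduce task, the result is the same as with order by.
Note: the column given to sort by is used only for sorting. It does not decide which reduce task a row is sent to, so which rows end up in which output file is effectively random.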

select * from stu_test sort by age desc;

Result:

id name sex age
11 wf f 24 
10 wb m 23 
15 ch f 22 
8  cl f 22 
5  ll f 21 
12 jj m 21 
13 yy m 20 
7  xh f 20 
3  ww m 20 
9  fj m 19 
6  hl f 19 
2  ls m 19 
14 ld f 18 
1  zs m 18 
4  zq f 18

With a single reducer (set mapreduce.job.reduces=1), the sort by result is the same as the order by result.
With 3 reduce tasks (set mapreduce.job.reduces=3), the results differ:

set mapreduce.job.reduces=3;
select * from stu_test sort by age desc;

Result:

id name sex age
10 wb m 23 
15 ch f 22 
8 cl f 22 
5 ll f 21 
7 xh f 20 
9 fj m 19 
6 hl f 19

11 wf f 24 
12 jj m 21 
3 ww m 20 
2 ls m 19 
14 ld f 18 
4 zq f 18

13 yy m 20 
1 zs m 18

The rows are spread randomly across 3 output files, and each file is sorted on its own.
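One way to look at the per-reducer files directly is to write the query result to a directory. The sketch below assumes the job really runs with three reducers; the local path /tmp/stu_sort_by is hypothetical.

```sql
set mapreduce.job.reduces=3;

-- '/tmp/stu_sort_by' is a hypothetical local path chosen for illustration.
-- Each reducer is expected to write one file into it (e.g. 000000_0, 000001_0, ...).
insert overwrite local directory '/tmp/stu_sort_by'
row format delimited fields terminated by ','
select * from stu_test sort by age desc;
```

Each file in the directory should be sorted by age in descending order, while the files taken together are not globally sorted.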

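3. distribute by col_list:

distribute by sends rows to different reducers according to the specified column; the distribution algorithm is hashing.
This is similar to bucketing: the hash of the distribute by column is taken modulo the number of reduce tasks to pick the reducer. The rows are only partitioned, not sorted.
The following is an error, because distribute by does not sort and therefore does not accept desc: select * from stu_test distribute by age desc;

set mapreduce.job.reduces=3;
select * from stu_test distribute by age;

Result: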

id name sex age
000000_0 age%3=0
14 ld f 18 
1  zs m 18 
4  zq f 18 
12 jj m 21 
11 wf f 24 
5  ll f 21 
000001_0 age%3=1
15 ch f 22 
9  fj m 19 
6  hl f 19 
2  ls m 19 
8  cl f 22 
000002_0 age%3=2
13 yy m 20 
7  xh f 20 
3  ww m 20 
10 wb m 23

set mapreduce.job.reduces=2;

select * from stu_test distribute by age;

Result:

000000_0 age%2=0
15 ch f 22 
14 ld f 18 
13 yy m 20 
11 wf f 24 
8 cl f 22 
7 xh f 20 
4 zq f 18 
3 ww m 20 
1 zs m 18
000001_0 age%2=1
12 jj m 21 
6 hl f 19 
10 wb m 23 
9 fj m 19 
5 ll f 21 
2 ls m 19
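The modulo grouping shown in the two results above can be reproduced with an ordinary query. This sketch mirrors the age%3 and age%2 labels in the listings and rests on the assumption that the hash of an int key is the int value itself; the column aliases are made up for illustration, and other Hive versions or column types may hash differently.

```sql
-- Assumption: for an int column, the reducer index is simply age mod N.
-- reducer_of_3 and reducer_of_2 are hypothetical alias names.
select id, name, sex, age,
       pmod(age, 3) as reducer_of_3,  -- predicted reducer when mapreduce.job.reduces=3
       pmod(age, 2) as reducer_of_2   -- predicted reducer when mapreduce.job.reduces=2
from stu_test;
```

Rows that share the same value in reducer_of_3 should appear together in the same file of the three-reducer result, and likewise for reducer_of_2.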

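4. cluster by col_list:

In addition to doing what distribute by does, cluster by also sorts each reducer's rows by the same column:
cluster by id = distribute by id + sort by id
Note: 1) cluster by and sort by cannot be used in the same query.
2) cluster by applies only when the distribution column and the sort column are the same column.
When they are different columns, do not use cluster by; write distribute by and sort by explicitly.

set mapreduce.job.reduces=3;
select * from stu_test cluster by age;

Result: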

id name sex age
000000_0 age%3=0
14 ld f 18 
4  zq f 18 
1  zs m 18 
12 jj m 21 
5  ll f 21 
11 wf f 24 
000001_0 age%3=1
6  hl f 19 
2  ls m 19 
9  fj m 19 
15 ch f 22 
8  cl f 22 
000002_0 age%3=2
3  ww m 20 
13 yy m 20 
7  xh f 20 
10 wb m 23
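To make the relationship between the clauses concrete, here is a short sketch using the same stu_test table. The first query spells out what cluster by age stands for; the second covers the case where the distribution column and the sort column differ, which cluster by cannot express. The choice of sex as the distribution column is only an illustration.

```sql
set mapreduce.job.reduces=3;

-- Same column for distribution and sorting: equivalent to "cluster by age".
select * from stu_test distribute by age sort by age;

-- Different columns: rows with the same sex go to the same reducer,
-- and each reducer's rows are sorted by age in descending order.
select * from stu_test distribute by sex sort by age desc;
```

In the second query, the two sex values may or may not land on different reducers, depending on how their hashes fall modulo the reducer count.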
