Sorting and Clustering in Hive

This article is a repost. 2015-08-17 14:21, filed under Hive.

//The five clauses must appear in this strict order:
where → group by → having → order by → limit
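
As a rough illustration of that order, here is a single query that uses all five clauses. It is only a sketch: it assumes the tea table (with id and age columns) that appears in the other examples of this article, and the limit of 10 is arbitrary.

```sql
-- Assumes a table tea(id, age, ...) as used elsewhere in this article.
select age, count(*) as c   -- c is computed per group
from tea
where id > 18               -- 1. filter the raw rows
group by age                -- 2. form one group per age
having count(*) > 2         -- 3. drop groups with 2 rows or fewer
order by c desc             -- 4. sort the remaining groups
limit 10;                   -- 5. keep the first 10 rows
```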

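//Difference between where and having:
//where filters first and groups afterwards (it filters the raw rows), so a where condition cannot use aggregate functions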
hive> select count(*),age from tea where id>18 group by age;

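//having groups first and filters afterwards (it filters whole groups; the condition can use grouping columns, aggregate functions, or an alias from the select list, as below)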
hive> select age,count(*) c from tea group by age having c>2;

//A column that is not listed after group by must not appear in the select list either (aggregate functions excepted)
hive> select ip,sum(load) as c from logs  group by ip sort by c desc limit 5;

//The distinct keyword returns unique values (here each distinct combination of age and id is returned once)
hive> select distinct age,id from tea;
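
One way to read distinct over several columns is as a group by over the same columns. The sketch below, which reuses the tea table from the example above, should return the same set of rows as the distinct query, although the row order may differ.

```sql
-- Same result set as: select distinct age, id from tea;
select age, id
from tea
group by age, id;
```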

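//hive before 1.2.0 supports only Union All, not Union (Union with duplicate removal was added in 1.2.0)
//Union All in hive differs somewhat from standard sql: both selects must return the same number of columns with matching column names, but the column types need not be identical (presumably because of implicit conversion)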
select name,age from tea where id<80
union all
select name,age from stu where age>18;
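
On a Hive version without plain union, duplicates can still be removed by wrapping the union all in a subquery and applying distinct. This is a sketch built on the tea and stu tables from the example above; the subquery alias t is arbitrary, but Hive requires that a subquery in the from clause has one.

```sql
-- Emulates UNION (duplicate removal) with UNION ALL plus DISTINCT.
select distinct name, age
from (
  select name, age from tea where id < 80
  union all
  select name, age from stu where age > 18
) t;
```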

Order By:

Sorts the data globally. Only one reducer task is used, so it is slow on large inputs.
Unlike order by in MySQL, in strict mode a limit must be specified, otherwise the query fails.

• Run set hive.mapred.mode; to show the current mode
• Run set hive.mapred.mode=strict; to set the mode

hive> select * from logs where date='2015-01-02' order by te;
FAILED: SemanticException 1:52 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'te'

For a partitioned table in strict mode, the query must also filter explicitly on a partition column.

hive> select * from logs order by te limit 5;
FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "logs" Table "logs"
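
For comparison, a query that should pass both strict-mode checks combines the partition filter and the limit. This is a sketch that assumes date is the partition column of logs, as the two error messages above suggest. The column name is quoted with backticks because some Hive versions treat date as a reserved word.

```sql
set hive.mapred.mode=strict;

-- The partition predicate and LIMIT are both present, so neither
-- strict-mode check should reject the query.
select *
from logs
where `date` = '2015-01-02'
order by te
limit 5;
```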

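Sort By:

There can be multiple reduce tasks. Hive normally chooses the number itself, and it can also be set by hand: set mapred.reduce.tasks=4;
The data inside each reduce task is sorted, but there is no global order.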

set mapred.reduce.tasks = 2;
insert overwrite local directory '/root/hive/b'
    select * from logs sort by te;

The query above writes its result to the local directory /root/hive/b, which then contains two result files: 000000_0 and 000001_0. Each file is sorted by the te column.

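Distribute By:

Splits the rows by the given columns, so that they end up in different reduce output files.
distribute by corresponds to the partitioner in MR and is hash-based by default.
distribute by is usually combined with sort by.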

set mapred.reduce.tasks = 2;
insert overwrite local directory '/root/hive/b'
    select * from logs distribute by date sort by te;
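
A common use of the combination is to keep all rows for one key on the same reducer and order them there. The sketch below assumes logs has the ip column used in an earlier example and that date is its partition column. Every row for a given ip lands in the same output file, and inside each file the rows are ordered by ip and then by te, largest first.

```sql
set mapred.reduce.tasks = 2;

-- Rows with the same ip go to the same reducer (DISTRIBUTE BY);
-- each reducer then sorts its own rows (SORT BY).
select ip, te
from logs
where `date` = '2015-01-02'
distribute by ip
sort by ip, te desc;
```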

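Cluster By:

If Sort By and Distribute By use exactly the same columns, the two can be shortened to Cluster By, which sets the columns for both at once.
Note that columns given to cluster by are always sorted in ascending order; asc and desc cannot be specified. It is mostly used with bucketed tables.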

set mapred.reduce.tasks = 2;
insert overwrite local directory '/root/hive/b'
    select * from logs cluster by date;
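
To make the shorthand concrete, the first two queries below should distribute and sort the rows identically. The third shows the longer form that is needed as soon as a descending sort is wanted. This is a sketch that again assumes date is the partition column of logs and reuses the te column from the earlier examples.

```sql
-- Short form: distribute and sort by te, ascending only.
select * from logs where `date` = '2015-01-02' cluster by te;

-- Equivalent long form.
select * from logs where `date` = '2015-01-02' distribute by te sort by te;

-- A descending sort must be written with the long form.
select * from logs where `date` = '2015-01-02' distribute by te sort by te desc;
```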
