1. Sorting in Hive

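order by: performs a global sort over the input, so the final sort runs in a single reducer.
- When hive.mapred.mode = strict, order by must be combined with limit; otherwise the query fails with an error.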
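As a minimal sketch of the strict-mode rule, assuming a table shaped like over_table from section 2.1 (columns id, name, age):

```sql
-- Enable strict mode (the property name is the one used in this post).
set hive.mapred.mode=strict;

-- Rejected in strict mode: a global sort with no limit.
-- select id, name, age from over_table order by age;

-- Accepted: the limit bounds what the single reducer has to return.
select id, name, age
from over_table
order by age
limit 5;
```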
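sort by: not a global sort. Rows are sorted before they are fed to each reducer, so the ordering holds only within a reducer.
- With sort by and mapred.reduce.tasks > 1, each reducer's output is sorted, but the overall result is not globally ordered. With mapred.reduce.tasks = 1 the result is the same as order by.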
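A short sketch of the per-reducer behaviour, again assuming a table like over_table from section 2.1. The reducer count of 3 is an arbitrary value chosen for illustration:

```sql
-- Ask for several reducers so the difference from order by is visible.
set mapred.reduce.tasks=3;

-- Each reducer emits its own rows sorted by age;
-- the concatenated result is not guaranteed to be globally sorted.
select id, name, age
from over_table
sort by age;
```

Which rows land on which reducer is not controlled here; that is what distribute by is for.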
distribute by: (similar to bucketing) sends all rows with the same key to the same reducer. The data is split across the reducers' output files according to the columns named in distribute by.
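One way to see the effect is to write the result to a directory and inspect the files, one per reducer. This is only a sketch: the output path is a hypothetical location, and the table is assumed to look like over_table from section 2.1.

```sql
set mapred.reduce.tasks=3;

-- Rows with the same age end up in the same reducer, hence the same output file.
-- '/tmp/over_table_by_age' is a made-up path used only for illustration.
insert overwrite local directory '/tmp/over_table_by_age'
select id, name, age
from over_table
distribute by age;
```

distribute by alone says nothing about the order of rows inside each file.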
cluster by:
- cluster by column = distribute by column + sort by column (both apply to the same column; the sort is always ascending, and neither asc nor desc can be specified).
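The equivalence can be sketched as follows, assuming a table like over_table from section 2.1:

```sql
-- These two queries distribute and sort the rows the same way.
select id, name, age from over_table cluster by age;
select id, name, age from over_table distribute by age sort by age;

-- cluster by cannot sort in descending order;
-- for that, spell out distribute by and sort by separately.
select id, name, age from over_table distribute by age sort by age desc;
```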

2. Window functions

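Aggregate functions (sum(), avg(), max(), and so on) compute over a defined set of rows (a group) and return a single value per group.
Window functions also compute over a defined set of rows, but can return multiple values per group: one for every row. This lets a query show the original rows alongside the aggregated result.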
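The contrast can be sketched on the over_table sample data introduced in section 2.1 (eight rows with ages 10, 10, 10, 20, 20, 20, 20, 30):

```sql
-- Aggregate function: one output row per group.
select age, count(*) as n
from over_table
group by age;
-- age  n     (row order may vary)
-- 10   3
-- 20   4
-- 30   1

-- Window function: all eight rows are kept, each carrying its group's count.
select id, name, age,
       count(*) over (partition by age) as n
from over_table;
```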

2.1 over()

  over (order by col1)                     -- sort by col1
  over (partition by col1)                 -- partition by col1
  over (partition by col1 order by col2)   -- partition by col1, sort by col2 within each partition

  -- with a window frame
  over (partition by col1 order by col2 ROWS <frame>)   -- partition by col1, sort by col2, and limit each row's window to the given frame

Sample data in over_table:

id    name  age
1     a1    10
2     a2    10
3     a3    10
4     a4    20
5     a5    20
6     a6    20
7     a7    20
8     a8    30

-- Create the table
CREATE TABLE over_table(
id int,
name string,
age int 
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
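-- The window is the whole table.
-- Rows are ordered by age. With order by and no explicit frame, the default frame is
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so each row counts every row
-- whose age is less than or equal to its own (rows with equal age share the same count).
select 
id, name, age,
count(*) over (order by age) as n
from over_table;  
-- Result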
1    a1    10  3
2    a2    10  3
3    a3    10  3
4    a4    20  7
5    a5    20  7
6    a6    20  7
7    a7    20  7
8    a8    30  8

------------------------------------
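-- The window is the partition of rows that share the same age.
-- Within a partition the rows are ordered by age, so they all tie
-- and every row gets the full partition count.
select 
id, name, age,
count(*) over (partition by age order by age) as n
from over_table;  
-- Result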
1    a1    10  3
2    a2    10  3
3    a3    10  3
4    a4    20  4
5    a5    20  4
6    a6    20  4
7    a7    20  4
8    a8    30  1

----------------------------------
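-- The window is the partition of rows that share the same age.
-- Within a partition the rows are ordered by id, so the count
-- becomes a running count inside each partition.
select 
id, name, age,
count(*) over (partition by age order by id) as n
from over_table;  
-- Result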
1    a1    10  1
2    a2    10  2
3    a3    10  3
4    a4    20  1
5    a5    20  2
6    a6    20  3
7    a7    20  4
8    a8    30  1
--------------------------------------

2.2 Ranking (sequence) functions

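row_number: assigns a different number to every row; the numbers are unique and consecutive, e.g. 1, 2, 3, 4, 5.
rank: assigns the same number to equal values and leaves a gap before the next number, e.g. 1, 1, 3, 3, 5.
dense_rank: assigns the same number to equal values without leaving a gap, e.g. 1, 1, 2, 2, 3.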
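Here is a small sketch of the three functions side by side on the over_table sample data from section 2.1. The column aliases rn, rk and drk are names picked for this example; row_number is ordered by (age, id) so that ties on age are broken deterministically.

```sql
select id, name, age,
       row_number() over (order by age, id) as rn,   -- unique and consecutive
       rank()       over (order by age)     as rk,   -- ties share a number, gaps follow
       dense_rank() over (order by age)     as drk   -- ties share a number, no gaps
from over_table;
-- id  name  age  rn  rk  drk    (output row order may vary)
-- 1   a1    10   1   1   1
-- 2   a2    10   2   1   1
-- 3   a3    10   3   1   1
-- 4   a4    20   4   4   2
-- 5   a5    20   5   4   2
-- 6   a6    20   6   4   2
-- 7   a7    20   7   4   2
-- 8   a8    30   8   8   3
```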

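Difference between partition by and distribute by in over

partition by [key...] order by [key...] can only be used inside a window function.
distribute by [key...] sort by [key...] can be used both inside a window function and in a plain select.
Inside a window function the two forms behave identically.
partition by cannot be used as a query-level clause after where.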
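A quick way to check the claim that both spellings behave the same inside over is to compute them in one query. This sketch uses the over_table sample data from section 2.1; n1 and n2 are illustrative aliases.

```sql
select id, name, age,
       count(*) over (partition by age order by id)  as n1,
       count(*) over (distribute by age sort by id)  as n2
from over_table;
-- n1 and n2 should match on every row.

-- Outside over, only the distribute by / sort by form is valid:
select id, name, age
from over_table
where age >= 20
distribute by age sort by id;
```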

2.3 The window clause (ROWS)

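ROWS is the row selector of a window function:
  rows between <start> and <end>
where <start> and <end> are each one of: n preceding, unbounded preceding, n following, unbounded following, current row (unbounded preceding is only valid as the start, unbounded following only as the end).
Parameters:
- n: a number of rows
- unbounded: no limit on the number of rows
- preceding: the n rows before the current row
- following: the n rows after the current row
- current row: the current row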
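Two common frames, sketched on the over_table sample data from section 2.1. The aliases running_total and neighbour_sum are names chosen for this example.

```sql
select id, name, age,
       -- everything from the first row up to the current one
       sum(age) over (order by id
                      rows between unbounded preceding and current row) as running_total,
       -- the previous row, the current row and the next row
       sum(age) over (order by id
                      rows between 1 preceding and 1 following) as neighbour_sum
from over_table;
-- id  name  age  running_total  neighbour_sum    (output row order may vary)
-- 1   a1    10   10             20
-- 2   a2    10   20             30
-- 3   a3    10   30             40
-- 4   a4    20   50             50
-- 5   a5    20   70             60
-- 6   a6    20   90             60
-- 7   a7    20   110            70
-- 8   a8    30   140            50
```

At the edges of the table the frame simply contains fewer rows, which is why ids 1 and 8 sum only two values in neighbour_sum.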
