Advanced Hive: Functions

Reposted article, 2020-03-06 09:19

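Functions

Built-in functions
User-defined functions

CLI commands

show functions [like "<pattern>"] lists all functions, including user-defined ones. The pattern can be used to filter the list.
desc function fun_name: shows a short description.
desc function extended fun_name: shows a detailed description, including examples.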

hive> desc function extended concat;
OK
tab_name
concat(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN or concat(bin1, bin2, ... binN) - returns the concatenation of bytes in binary data  bin1, bin2, ... binN
Returns NULL if any argument is NULL.
Example:
  > SELECT concat('abc', 'def') FROM src LIMIT 1;
  'abcdef'
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat
Function type:BUILTIN

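Simple functions

Computation granularity: a single record (row).

Special functions

Window functions
Analytic functions
Mixed functions
UDTF

Window

Analytic functions:

Mixed:

Built-in functions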

get_json_object

hive> select get_json_object('{"name":"jack", "age":"22"}', '$.name');
OK
_c0
jack
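To make the path lookup concrete, here is a minimal Python sketch of what '$.name' does. It is only a stand-in written for these notes, not Hive's implementation: the Python function get_json_object below handles only simple '$.key.key' paths and returns Python values rather than Hive strings.

```python
import json

def get_json_object(json_text, path):
    """Rough stand-in for Hive's get_json_object, limited to '$.key.key' paths."""
    if not path.startswith("$"):
        return None
    try:
        value = json.loads(json_text)
    except (TypeError, ValueError):
        return None  # invalid JSON yields NULL in Hive, None here
    # "$.name" -> ["name"]; walk down one key at a time
    for key in filter(None, path[1:].split(".")):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

print(get_json_object('{"name":"jack", "age":"22"}', "$.name"))  # jack
```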

parse_url() extracts one part of a URL.

concat

# Concatenate the value of the type column with "123"
hive> select concat(type, '123') from winfunc;
_c0
abc123
bcd123
cde123
def123
...

concat_ws

Concatenation with a separator; the separator is the first argument.

hive> select concat_ws('.',type, '123') from winfunc;
OK
_c0
abc.123
bcd.123
cde.123
def.123
abc.123

split(string, delimiter)

Returns an array. The delimiter is interpreted as a regular expression.

hive> select split("abc", "");
OK
_c0
["a","b","c",""]

hive> select concat_ws('.',split(type,"")) from winfunc;
OK
_c0
a.b.c.
b.c.d.
c.d.e.

collect_set

Returns a set (duplicates removed).

collect_list

Returns an array (duplicates kept).

select collect_list(id) from winfunc;
... ["1001","1001","1001","1001","1002","1002","1002","1002","1002","1002","1003","1003","1004"]

select collect_set(id) from winfunc;

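⚠️: In Hive SQL these two functions play a role similar to MySQL's aggregate function group_concat(). group_concat() is the more complete of the two:

It has a DISTINCT keyword for deduplication
It has an ORDER BY clause for sorting
It has a SEPARATOR clause for the delimiter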

GROUP_CONCAT(
    DISTINCT expression     
    ORDER BY expression
    SEPARATOR sep          
);

By comparison, in HQL the four functions collect_set, collect_list, concat, and concat_ws have to be combined to get the functionality of group_concat.
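The combination can be sketched in Python. The three functions below are stand-ins written for these notes that mimic the Hive functions of the same name. In Hive itself the corresponding expression would be along the lines of concat_ws(',', collect_set(id)). Hive does not guarantee the element order of collect_set, so the first-seen order used here is an assumption of the sketch.

```python
def collect_list(values):
    """Keep every value, duplicates included."""
    return list(values)

def collect_set(values):
    """Keep each distinct value once (first-seen order, an assumption of this sketch)."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen

def concat_ws(sep, items):
    """Join the items with the separator, like concat_ws(sep, array)."""
    return sep.join(items)

# The id column of winfunc, as shown by collect_list(id) above
ids = ["1001"] * 4 + ["1002"] * 6 + ["1003"] * 2 + ["1004"]

# Roughly GROUP_CONCAT(DISTINCT id SEPARATOR ',')
print(concat_ws(",", collect_set(ids)))   # 1001,1002,1003,1004
# Roughly GROUP_CONCAT(id SEPARATOR ',')
print(concat_ws(",", collect_list(ids)))
```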

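Window functions: similar to the window functions in newer MySQL versions

Hive SQL window functions are used in essentially the same way as MySQL window functions.

For detailed usage, see the MySQL notes: https://www.cnblogs.com/chentianwei/p/12145280.html

Addendum, 2020-04: window functions revisited:

A window function consists of three parts:

partition by: first splits the table into groups (partitions). [optional]
order by: sorts the rows within each group. [optional]
function + frame clause: the operation performed for each row and the range of rows it covers. With order by present, the default range is the current row plus all rows before it in the same group.

Concepts:

Current row: the row for which the function is being evaluated is called the current row.
Window of the current row: the query rows that the function uses when evaluating the current row form a window (its extent is set by the frame). In other words, the window is the set of query rows involved in the computation for the current row.
over() has three parts inside: partition by, order by, and finally a frame clause.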

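Format:

function() OVER ([PARTITION BY expr, ...] [ORDER BY expr [ASC|DESC], ...] [window frame clause]) AS alias

🌿 If ORDER BY is given but the window frame clause is omitted, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default.
If both ORDER BY and the window frame clause are omitted, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is used.
Difference between RANGE and ROWS in a frame:

RANGE: takes the ORDER BY value of the current row as the basis; neighbouring rows with the same value are treated as peers of the current row, and the frame boundaries are determined from there. Rows in the logical sense.
ROWS: the frame is determined by row position. Rows in the physical sense.

- ROWS: The frame is defined by beginning and ending row positions. Offsets are differences in row numbers from the current row number. (Row positions: rows in the physical sense.)
- RANGE: The frame is defined by rows within a value range. Offsets are differences in row values from the current row value.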

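Window functions:

lead, lag
first_value, last_value

Standard aggregate functions that work with OVER (⚠️ aggregate functions can also be used inside the OVER clause, Hive 2.1 and later):

count
sum
min, max
avg

Analytic functions:

rank, dense_rank
row_number: the row number.
cume_dist: cumulative distribution, the number of rows up to and including the current row's value divided by the total number of rows.
percent_rank: similar to cume_dist, but computed as (rank - 1) / (total rows - 1).
ntile: splits the data into a given number of roughly equal slices.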
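The difference between cume_dist and percent_rank is easier to see with numbers. A minimal Python sketch follows, assuming the values of one partition are already sorted ascending by the order by column. The function names mirror the SQL ones but are written for these notes.

```python
def cume_dist(values):
    """Rows with value <= current value, divided by total rows."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / n for x in values]

def percent_rank(values):
    """(rank - 1) / (total rows - 1); rank - 1 is the count of strictly smaller values."""
    n = len(values)
    if n == 1:
        return [0.0]
    return [sum(1 for v in values if v < x) / (n - 1) for x in values]

# money values of one partition, already ordered
money = [100, 150, 150, 200]
print(cume_dist(money))     # [0.25, 0.75, 0.75, 1.0]
print(percent_rank(money))  # 0.0, then 1/3 twice, then 1.0
```

Tied rows get the same result from both functions. cume_dist never returns 0, while percent_rank starts at 0 for the first row.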

Example:

Data:

col_name    data_type
id          string
money       int
type        string

hive> select * from winfunc;
OK
winfunc.id    winfunc.money    winfunc.type
1001    100    abc
1001    150    bcd
1001    200    cde
1001    150    def
1002    200    abc
1002    200    abc
1002    100    bcd
1002    300    cde
1002    50    def
1002    400    efg
1003    400    abc
1003    50    bdc
1004    60    abc

Partition by id, sort each partition by money, then pick first_value(money).

hive> select id, money, first_value(money) over(partition by id order by money)
    > from winfunc;

id money first_value_window_0
1001 100 100
1001 150 100
1001 150 100
1001 200 100
1002 50 50
1002 100 50
1002 200 50
1002 200 50
1002 300 50
1002 400 50
1003 50 50
1003 400 50
1004 60 60

Now look at this example, which uses sum():

hive> select id, money,
    > sum(money) over (partition by id order by money)
    > from winfunc;
# intermediate output omitted
id    money    sum_window_0
1001    100    100
1001    150    400
1001    150    400
1001    200    600
1002    50    50
1002    100    150
1002    200    550
1002    200    550
1002    300    850
1002    400    1250
1003    50    50
1003    400    450
1004    60    60

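Now add the frame clause:

🌿 When the window frame clause is omitted, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default, so the query below returns the same result as the query above 👆.

⚠️ Rows with equal values are summed together: row 2 (150) and row 3 (150) both show 400 in the third column, i.e. 100+150+150.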

hive> select id, money,
    > sum(money) over (partition by id order by money RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    > from winfunc;
# output omitted
id    money    sum_window_0
    NULL    NULL
1001    100    100
1001    150    400
1001    150    400
1001    200    600
1002    50    50
1002    100    150
1002    200    550
1002    200    550
1002    300    850
1002    400    1250
1003    50    50
1003    400    450
1004    60    60

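However, replacing range with rows gives a different result:

⚠️ rows does not take equal values into account, so here sum behaves as a plain running total.

hive> select id, money,
    > sum(money) over (partition by id order by money rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    > from winfunc;
# output omitted
id    money    sum_window_0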
    NULL    NULL
1001    100    100
1001    150    250
1001    150    400
1001    200    600
1002    50    50
1002    100    150
1002    200    350
1002    200    550
1002    300    850
1002    400    1250
1003    50    50
1003    400    450
1004    60    60

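From the MySQL documentation on ROWS and RANGE:

ROWS: The frame is defined by beginning and ending row positions. Offsets are differences in row numbers from the current row number. That is, the frame runs from a starting row position to an ending row position; what counts is the row number.
RANGE: The frame is defined by rows within a value range. Offsets are differences in row values from the current row value. That is, the frame is a range of values; what counts is the row's value. Rows adjacent to the current row that have the same value as the current row are included in the frame.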
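The two frame types can be reproduced outside Hive. Below is a minimal Python sketch of sum(money) over the id = 1002 partition with the frame BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, once as ROWS and once as RANGE. It assumes the values are already sorted by the order by column; the helper names are made up for these notes.

```python
def running_sum_rows(values):
    """ROWS frame: sum of everything up to the current row position."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

def running_sum_range(values):
    """RANGE frame: sum of every row whose value is <= the current value,
    so rows tied with the current row (its peers) are included too."""
    return [sum(v for v in values if v <= x) for x in values]

# money for id = 1002, ordered by money
money_1002 = [50, 100, 200, 200, 300, 400]

print(running_sum_rows(money_1002))   # [50, 150, 350, 550, 850, 1250]
print(running_sum_range(money_1002))  # [50, 150, 550, 550, 850, 1250]
```

The two lists differ only at the first of the two tied 200 rows (350 versus 550), which is the same difference visible in the Hive output above.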

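lead(column, offset [, default])

Returns the value from the row that is offset rows after the current row in the partition. If there is no such row, default is returned (NULL when no default is given).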

hive> select id, money,
    > lead(money,2) over (partition by id order by money)
    > from winfunc;
#
id    money    lead_window_0
    NULL    NULL
1001    100    150
1001    150    200
1001    150    NULL
1001    200    NULL
1002    50    200
1002    100    200
1002    200    300
1002    200    400
1002    300    NULL
1002    400    NULL
1003    50    NULL
1003    400    NULL
1004    60    NULL

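lag(column, offset [, default])

Returns the value from offset rows before the current row. If there is no such row, the default argument is returned; default can be set to any value you choose.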
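lead and lag are just index shifts within one ordered partition. A minimal Python sketch follows; the function names mirror the SQL ones but are written for these notes, and None plays the role of NULL.

```python
def lead(values, offset=1, default=None):
    """Value from `offset` rows after the current row, else default."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default for i in range(n)]

def lag(values, offset=1, default=None):
    """Value from `offset` rows before the current row, else default."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

# money for id = 1002, ordered by money
money_1002 = [50, 100, 200, 200, 300, 400]

print(lead(money_1002, 2))    # [200, 200, 300, 400, None, None]
print(lag(money_1002, 2, 0))  # [0, 0, 50, 100, 200, 200]
```

The lead result matches the id = 1002 rows of the Hive output shown earlier for lead(money,2).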

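ntile(N) + over clause

Splits the partition into N groups/buckets, assigns each row its bucket number, and returns the bucket number of the current row within its partition.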
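A minimal Python sketch of the bucket assignment follows. It assumes the usual SQL convention that, when the rows do not divide evenly, the earlier buckets each take one extra row; that detail is not stated in these notes. The function takes a row count, which is a simplification made for the sketch.

```python
def ntile(num_rows, n):
    """Bucket number (1..n) for each of num_rows ordered rows."""
    base, extra = divmod(num_rows, n)
    out = []
    for bucket in range(1, n + 1):
        # the first `extra` buckets absorb the remainder, one row each
        size = base + (1 if bucket <= extra else 0)
        out.extend([bucket] * size)
    return out

print(ntile(6, 4))  # [1, 1, 2, 2, 3, 4]
print(ntile(4, 2))  # [1, 1, 2, 2]
```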

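rank()

Ranking; equal values receive the same rank, and gaps are left after ties, i.e. 1, 2, 2, 4. The two rows ranked 2 have the same value.

dense_rank()

Ranking; equal values receive the same rank, but no gaps are left after ties, i.e. 1, 2, 2, 3. The two rows ranked 2 have the same value.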
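The three numbering functions can be compared side by side. A minimal Python sketch follows, assuming the values of one partition are already sorted ascending; the function names mirror the SQL ones but are written for these notes.

```python
def row_number(values):
    """Consecutive numbers, ties broken by position."""
    return list(range(1, len(values) + 1))

def rank(values):
    """1 + number of strictly smaller values: ties share a rank, gaps follow."""
    return [1 + sum(1 for v in values if v < x) for x in values]

def dense_rank(values):
    """Position of the value among the distinct values: ties share a rank, no gaps."""
    distinct = sorted(set(values))
    return [distinct.index(x) + 1 for x in values]

# money for id = 1001, ordered by money
money_1001 = [100, 150, 150, 200]

print(row_number(money_1001))  # [1, 2, 3, 4]
print(rank(money_1001))        # [1, 2, 2, 4]
print(dense_rank(money_1001))  # [1, 2, 2, 3]
```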

Mixed functions

java_method does the same thing as reflect: it calls a method of a Java class.

hive> select reflect("java.lang.Math", "sqrt", cast(id as double)) from winfunc;

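Argument 1 is the class
Argument 2 is the method
Any remaining arguments are passed to the method.

Table functions (Hive wiki): lateral view

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*

udtf(): user-defined table-generating function. Here it refers specifically to the built-in table-generating functions, such as explode(expression).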

Usage:

Used together with a built-in table-generating function.

A UDTF generates zero or more output rows for each input row.

Lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias.

First, the UDTF is applied to each row of the base table.
Then, the resulting output rows are joined to the input rows to form a virtual table.

Built-in table-generating functions

For example explode(), which expands its argument according to the data type passed in: array or map.

Example:

hive> select id, adid
    > from winfunc
    > lateral view explode(split(type, "b")) tt as adid;
OK
id    adid
1001    a
1001    c
1001
1001    cd
1001    cde
1001    def
1002    a
1002    c
1002    a
1002    c
1002
1002    cd
1002    cde
1002    def
1002    efg
1003    a
1003    c
1003
1003    dc
1004    a
1004    c
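The apply-then-join behaviour can be mimicked in a few lines of Python. The sketch below covers only the id = 1001 rows and uses Python's plain-string str.split as a stand-in for Hive's regex-based split, which gives the same pieces for the delimiter "b". The function name lateral_view_explode is made up for these notes.

```python
# (id, type) rows for id = 1001 from winfunc
rows = [("1001", "abc"), ("1001", "bcd"), ("1001", "cde"), ("1001", "def")]

def lateral_view_explode(rows, delimiter):
    """Apply split + explode to each row, then pair every piece with its source row."""
    out = []
    for row_id, type_value in rows:
        # a row whose split produced no pieces would contribute no output rows
        for part in type_value.split(delimiter):
            out.append((row_id, part))
    return out

for row_id, adid in lateral_view_explode(rows, "b"):
    print(row_id, adid)
```

The empty adid in the output comes from "bcd": splitting it on "b" yields an empty string before "cd".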

# Count male ('男') and female ('女') employees per department with sum(if(...))
select dept_id, sum(if(sex='男',1,0)) as male_count, sum(if(sex='女',1,0)) as female_count from emp_sex group by dept_id;

Regular expressions

regexp_replace(string a, string b, string c)

a is the original string
b is the regular expression
c is the string that replaces every match.

hive> select regexp_replace('foobar', 'oo|ar', "");
# returns
fb
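The same replacement can be tried in Python with re.sub. This is only an analogue: Hive uses Java regular expressions, whose syntax differs from Python's in some details, and the wrapper name regexp_replace is just a stand-in that keeps Hive's argument order.

```python
import re

def regexp_replace(text, pattern, replacement):
    """Replace every match of pattern in text (Hive argument order)."""
    return re.sub(pattern, replacement, text)

print(regexp_replace("foobar", "oo|ar", ""))  # fb
```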

regexp_extract()

a rlike b

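nvl function

Converts null to a specified value: for example, nvl(column, -1) turns the nulls in the column into -1.

⚠️ It is effectively a variant of if(expression, result if true, result if false).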
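In Python terms the relationship looks like this. It is a minimal sketch with None standing in for NULL, and the function nvl is a stand-in written for these notes.

```python
def nvl(value, default):
    """Return default when value is None (NULL), otherwise value itself."""
    # same shape as if(value is null, default, value)
    return default if value is None else value

money = [100, None, 0, 50]
print([nvl(m, -1) for m in money])  # [100, -1, 0, 50]
```

Only None is replaced; a real 0 is kept as 0.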
