HIVE-----Workaround for count(distinct ) over() not working
While using Hive, I found that count(distinct ) over() fails with an error:
hive> with da as (
    > select 1 a, 'a' b union all
    > select 1 a, 'a' b union all
    > select 2 a, 'a' b union all
    > select 2 a, 'a' b union all
    > select 2 a, 'a' b union all
    > select 3 a, 'b' b union all
    > select 3 a, 'b' b union all
    > select 3 a, 'b' b union all
    > select 3 a, 'b' b union all
    > select 3 a, 'b' b union all
    > select 3 a, 'b' b union all
    > select 3 a, 'b' b
    > )
    > select
    > a
    > ,b
    > ,sum(a) over(partition by b)
    > , count(distinct a) over(partition by b)
    > from da;
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 18:26 Expression not in GROUP BY key 'b'
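Testing showed that the following query, which selects only the count(distinct ) over() column, runs without error: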
with da as (
  select 1 a, 'a' b union all
  select 1 a, 'a' b union all
  select 2 a, 'a' b union all
  select 2 a, 'a' b union all
  select 2 a, 'a' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b
)
select
  count(distinct a) over(partition by b)
from da
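count(distinct ) over() works if and only if it is the only column in the select list. The cause may be a bug in how distinct is implemented internally. I don't know whether this depends on the version; the version used here is Hive 1.1.0.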
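Workaround: as shown below, use collect_set(a) over(partition by b) to gather the values of each partition into a set, then count the elements of that set.
Because collect_set() does not keep duplicate elements, applying size() to the set gives the same result as count(distinct ).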
with da as (
  select 1 a, 'a' b union all
  select 1 a, 'a' b union all
  select 2 a, 'a' b union all
  select 2 a, 'a' b union all
  select 2 a, 'a' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b
)
select
  a
  ,b
  ,sum(a) over(partition by b)
  ,size(collect_set(a) over(partition by b))
from da
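Result: for the sample data, every row with b = 'a' shows 8 (sum) and 2 (distinct count) in the last two columns, and every row with b = 'b' shows 21 and 1.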
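If collect_set() is not an option (for example, when the sets would get very large), another possible workaround is to compute the distinct count in a plain GROUP BY and join it back to the detail rows. The query below is only a sketch of that idea and was not run on Hive 1.1.0; the CTE name distinct_cnt and the aliases sum_a, distinct_a and d are illustrative and do not come from the original queries.

```sql
-- Sketch: count distinct values per group in a separate aggregate,
-- then join the per-group count back onto every detail row.
-- distinct_cnt, sum_a, distinct_a and d are hypothetical names.
with da as (
  select 1 a, 'a' b union all
  select 1 a, 'a' b union all
  select 2 a, 'a' b union all
  select 2 a, 'a' b union all
  select 2 a, 'a' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b union all
  select 3 a, 'b' b
),
distinct_cnt as (
  -- ordinary aggregate: count(distinct ) is fine here because no over() is involved
  select b, count(distinct a) as distinct_a
  from da
  group by b
)
select
  da.a
  ,da.b
  ,sum(da.a) over(partition by da.b) as sum_a
  ,d.distinct_a
from da
join distinct_cnt d on da.b = d.b;
```

One difference to keep in mind: the inner join drops rows whose b is NULL, whereas the window version keeps them as their own partition. The sample data has no NULLs, so both approaches should return the same values here.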