Impala reports an error when a single SELECT contains multiple COUNT(DISTINCT) expressions over different columns. For example, running
select key, count(distinct column_a), count(distinct column_b) from test_table group by key
fails with:
Query submitted at: 2019-09-28 00:34:20 (Coordinator: http://DataOne-001:25000)
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT column_a);
deviating function: count(DISTINCT column_b)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to
perform this rewrite automatically.
There are several ways to work around this:
1 Use approximate values
1.1 set APPX_COUNT_DISTINCT = true
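1.2 Replace count distinct with ndv, e.g. ndv(column_a)
Both approaches share the same underlying implementation: with APPX_COUNT_DISTINCT set, Impala automatically rewrites count distinct to ndv. NDV stands for "number of distinct values". It is a cardinality estimate, and the underlying implementation is a probabilistic algorithm similar to HLLC (HyperLogLog Counting); see the references for details. The Impala documentation describes it as follows: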
An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the "number of distinct values". It is much faster than the combination of COUNT and DISTINCT, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.
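As a minimal sketch of the two approximate options, the failing query from above could be written as follows. The column aliases approx_count_a and approx_count_b are illustrative names, not part of the original query, and the returned values are estimates rather than exact counts.

```sql
-- Option 1.1: enable the query option so that Impala rewrites
-- COUNT(DISTINCT ...) to NDV(...) automatically for this session.
SET APPX_COUNT_DISTINCT=true;
SELECT key, COUNT(DISTINCT column_a), COUNT(DISTINCT column_b)
FROM test_table
GROUP BY key;

-- Option 1.2: call NDV() explicitly; no query option is needed.
-- approx_count_a / approx_count_b are hypothetical alias names.
SELECT key,
       NDV(column_a) AS approx_count_a,
       NDV(column_b) AS approx_count_b
FROM test_table
GROUP BY key;
```

The query option applies to every COUNT(DISTINCT) in the session, so calling NDV() explicitly may be the safer choice when only some queries can tolerate estimated counts.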
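2 Use exact values
Rewrite the query as multiple subqueries, each with a single count distinct, and join them on the grouping key, for example: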
select a.key, a.count_a, b.count_b from
(select key, count(distinct column_a) count_a from test_table group by key) a join
(select key, count(distinct column_b) count_b from test_table group by key) b on a.key = b.key
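One caveat with the join above: if key can be NULL, each subquery produces a NULL group, but a.key = b.key never matches NULL to NULL, so that group is dropped from the result. A possible variant, assuming an Impala version that supports the null-safe equality operator <=> (equivalent to IS NOT DISTINCT FROM), is sketched below.

```sql
-- Exact counts via two subqueries, joined with null-safe equality
-- so that the group where key IS NULL is preserved.
-- Assumes the <=> operator is available in the Impala version in use.
SELECT a.key, a.count_a, b.count_b
FROM
  (SELECT key, COUNT(DISTINCT column_a) AS count_a
   FROM test_table
   GROUP BY key) a
JOIN
  (SELECT key, COUNT(DISTINCT column_b) AS count_b
   FROM test_table
   GROUP BY key) b
ON a.key <=> b.key;
```

Both subqueries group the same table by the same key, so they return the same set of groups and an inner join loses no rows apart from the NULL case handled here. The cost is that test_table is scanned once per subquery.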
References:
ndv
http://impala.apache.org/docs/build/html/topics/impala_ndv.html#ndv
APPX_COUNT_DISTINCT
http://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html
