[Original] Uncle's Experience Sharing (83): Impala and multiple count distinct in one select

Reposted article. 2019-09-28 01:13. Tags: BigData / Impala / SQL

Impala raises an error when a single select contains several count distinct aggregates over different columns. For example, running

select key, count(distinct column_a), count(distinct column_b) from test_table group by key

fails with

Query submitted at: 2019-09-28 00:34:20 (Coordinator: http://DataOne-001:25000)
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT column_a);
deviating function: count(DISTINCT column_b)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to perform this rewrite automatically.

There are a few ways to handle this:

1 Use approximate values

1.1 set APPX_COUNT_DISTINCT = true
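1.2 Replace count distinct with ndv, i.e. ndv(column_a)
Both methods share the same underlying implementation: setting APPX_COUNT_DISTINCT makes Impala rewrite count distinct to ndv automatically. ndv stands for "number of distinct values". It is a cardinality estimate, and the underlying implementation is a probabilistic algorithm similar to HLLC (HyperLogLog Counting); see the references for details. The Impala documentation describes ndv as follows: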

An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the "number of distinct values". It is much faster than the combination of COUNT and DISTINCT, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.
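
The two approximate options above can be sketched as follows, reusing test_table, key, column_a and column_b from the failing query. The aliases approx_count_a and approx_count_b are made up for illustration, and the set statement is written the way it would be typed in impala-shell. Treat this as a minimal sketch: the returned numbers are estimates and may differ from the exact distinct counts.

```sql
-- Option 1.1: turn on the query option for the current session, so that
-- Impala rewrites each count(distinct ...) to ndv(...) on its own.
set APPX_COUNT_DISTINCT=true;
select key, count(distinct column_a), count(distinct column_b)
from test_table
group by key;

-- Option 1.2: call ndv() explicitly. No query option is needed.
-- approx_count_a / approx_count_b are illustrative alias names.
select key,
       ndv(column_a) as approx_count_a,
       ndv(column_b) as approx_count_b
from test_table
group by key;
```

Option 1.2 makes the approximation visible in the query text itself, which can be easier to reason about than a session-wide option that silently changes what count(distinct ...) returns.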

2 Use exact values

Rewrite the query as several subqueries, one per count distinct, and join them on the grouping key, for example

select a.key, a.count_a, b.count_b from
(select key, count(distinct column_a) count_a from test_table group by key) a join
(select key, count(distinct column_b) count_b from test_table group by key) b on a.key = b.key
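
One caveat with the join rewrite: if key can be NULL, the NULL group is produced by both subqueries but dropped by the join, because NULL = NULL does not evaluate to true. A possible variant, assuming an Impala version that supports the NULL-safe equality operator <=> (an alias for IS NOT DISTINCT FROM), is sketched below. Whether the operator is available on a given cluster should be checked first.

```sql
-- Same rewrite as above, but the join condition uses <=> so that the
-- group where key is NULL also finds its match on the other side.
-- Assumption: the Impala version in use supports the <=> operator.
select a.key, a.count_a, b.count_b from
(select key, count(distinct column_a) count_a from test_table group by key) a join
(select key, count(distinct column_b) count_b from test_table group by key) b
on a.key <=> b.key;
```

If key is never NULL, the plain a.key = b.key condition and this variant return the same rows.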

References:

ndv

http://impala.apache.org/docs/build/html/topics/impala_ndv.html#ndv

APPX_COUNT_DISTINCT

http://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html

Other

https://stackoverflow.com/questions/39236076/impala-all-distinct-aggregate-functions-need-to-have-the-same-set-of-parameters
