Measurement Error in the Elasticsearch Cardinality Aggregation


I had not been using ES for long when, today, I noticed anomalous data in production (the ES version is 2.1.2; other versions behave similarly). Queries through ES's HTTP API returned numbers that did not match those from the Java client API, so at first I suspected my code logic and the ES query tooling. Digging through the official documentation, I found the following description:

Precision control

This aggregation also supports the precision_threshold option:

Warning

The precision_threshold option is specific to the current internal implementation of the cardinality agg, which may change in the future

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author_hash",
                "precision_threshold": 100 
            }
        }
    }
}

The precision_threshold option allows trading memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value depends on the number of parent aggregations that create multiple buckets (such as terms or histograms).

Counts are approximate

Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.
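As a concrete illustration of the exact approach the docs describe (a minimal sketch, not ES internals; the sample values are made up):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExactCount {
    // Exact distinct count: load every value into a hash set and return its size.
    // Memory grows with the number of unique values, which is exactly what makes
    // this approach unscalable on high-cardinality fields -- the problem the
    // cardinality agg's HyperLogLog++ sidesteps.
    static int exactCardinality(List<String> values) {
        Set<String> unique = new HashSet<>(values);
        return unique.size();
    }

    public static void main(String[] args) {
        // Hypothetical author hashes; duplicates are counted once
        List<String> authorHashes = List.of("alice", "bob", "alice", "carol", "bob");
        System.out.println(exactCardinality(authorHashes)); // 3
    }
}
```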

This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:

  • configurable precision, which decides on how to trade memory for accuracy,
  • excellent accuracy on low-cardinality sets,
  • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.

For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.
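The memory formula above is easy to work through (a small sketch based only on the "about c * 8 bytes" and 40000-cap statements in the quoted docs):

```java
public class CardinalityMemory {
    // Per the docs: for a precision threshold of c the implementation
    // requires about c * 8 bytes, and thresholds above 40000 behave
    // like a threshold of 40000.
    static long approxBytes(long precisionThreshold) {
        long c = Math.min(precisionThreshold, 40000);
        return c * 8;
    }

    public static void main(String[] args) {
        System.out.println(approxBytes(100));    // 800 bytes
        System.out.println(approxBytes(3000));   // 24000 bytes
        System.out.println(approxBytes(100000)); // capped at 40000 -> 320000 bytes
    }
}
```

So even the maximum threshold costs only a few hundred kilobytes per bucket, which is why raising it is usually a cheap fix.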

The following chart shows how the error varies before and after the threshold:

[Figure: cardinality_error.png — relative error vs. actual unique count for three precision thresholds]

For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items.
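To put the "under 5%" claim in concrete terms, relative error is just |estimate - actual| / actual (the numbers below are hypothetical, for illustration only):

```java
public class RelativeError {
    // Relative error of an approximate count: |estimate - actual| / actual.
    static double relativeError(long estimated, long actual) {
        return Math.abs(estimated - actual) / (double) actual;
    }

    public static void main(String[] args) {
        // Hypothetical: an estimate of 1,020,000 for 1,000,000 true uniques
        // is a 2% relative error -- within the ~5% bound noted above.
        System.out.println(relativeError(1_020_000L, 1_000_000L)); // 0.02
    }
}
```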

In short: cardinality aggregation results are approximate, with an error within about 5%, and the accuracy can be tuned through the "precision_threshold" parameter.

So I went back through the query code and added the part below, which resolved the problem. When the parameter is not set on the query, its default value is 3000.

  

private void buildSearchQueryForAgg(NativeSearchQueryBuilder nativeSearchQueryBuilder) {
    // Build the terms aggregation
    TermsBuilder agg = AggregationBuilders.terms(aggreName)
            .field(XXX.XXX)
            .size(Integer.MAX_VALUE);
    // Build the filter condition
    BoolQueryBuilder packBoolQuery = QueryBuilders.boolQuery();
    FilterAggregationBuilder packAgg = AggregationBuilders.filter(xxx).filter(packBoolQuery);
    // Cardinality sub-aggregation with an explicit precision value
    packAgg.subAggregation(AggregationBuilders.cardinality(xxx)
            .field(ZZZZ.XXX)
            .precisionThreshold(CARDINALITY_PRECISION_THRESHOLD));
    agg.subAggregation(packAgg);
    nativeSearchQueryBuilder.addAggregation(agg);
}

 

