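I haven't been using ES (Elasticsearch) for long. Today I noticed abnormal data in our production environment, which runs ES 2.1.2 (other versions behave similarly). Querying through the ES HTTP API returned results that did not match those from the Java client API, so I began to suspect both my code logic and the ES query tools. Looking through the official documentation, I found the following description: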
Precision control
This aggregation also supports the precision_threshold option:
The precision_threshold option is specific to the current internal implementation of the cardinality agg, which may change in the future.
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author_hash",
                "precision_threshold": 100
            }
        }
    }
}
The precision_threshold option allows to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. Default value depends on the number of parent aggregations that create multiple buckets (such as terms or histograms).
Counts are approximate
Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.
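To make the "hash set" approach described above concrete, here is a minimal sketch of exact distinct counting in plain Java. The class and method names are hypothetical and it is not Elasticsearch code; it only illustrates why memory grows with the number of distinct values.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not part of Elasticsearch.
final class ExactDistinctCount {

    // Exact cardinality: remember every distinct value, then report the set size.
    // Memory grows linearly with the number of distinct values seen.
    static long countDistinct(Iterable<String> values) {
        Set<String> seen = new HashSet<>();
        for (String value : values) {
            seen.add(value);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> authors = List.of("alice", "bob", "alice", "carol", "bob");
        System.out.println(countDistinct(authors)); // prints 3
    }
}
```

In a sharded cluster each shard would also have to ship its whole set to the coordinating node to be merged, which is the second cost the documentation mentions.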
This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:
- configurable precision, which decides on how to trade memory for accuracy,
- excellent accuracy on low-cardinality sets,
- fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
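The properties listed above can be seen in a toy version of the idea. The sketch below is a simplified, classic HyperLogLog, not the HyperLogLog++ variant that Elasticsearch implements; the class name, the choice of the MurmurHash3 64-bit finalizer as hash function, and the parameter `p` (number of index bits) are all assumptions made for illustration.

```java
// Simplified classic HyperLogLog for illustration only (hypothetical class).
final class SimpleHyperLogLog {
    private final int p;             // number of hash bits used to pick a register
    private final byte[] registers;  // 2^p one-byte registers: the only state kept

    SimpleHyperLogLog(int p) {
        if (p < 7 || p > 18) {
            throw new IllegalArgumentException("p must be in [7, 18]");
        }
        this.p = p;
        this.registers = new byte[1 << p];
    }

    // 64-bit finalizer from MurmurHash3, used here as a cheap hash function.
    static long mix64(long z) {
        z ^= z >>> 33;
        z *= 0xff51afd7ed558ccdL;
        z ^= z >>> 33;
        z *= 0xc4ceb9fe1a85ec53L;
        z ^= z >>> 33;
        return z;
    }

    void add(long value) {
        long hash = mix64(value);
        int index = (int) (hash >>> (64 - p));  // top p bits select a register
        long rest = hash << p;                  // remaining bits, left-aligned
        // Rank = position of the first 1-bit in the remaining bits (1-based).
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    double estimate() {
        int m = registers.length;
        double sum = 0.0;
        int emptyRegisters = 0;
        for (byte r : registers) {
            sum += Math.scalb(1.0, -r);  // 2^(-r)
            if (r == 0) {
                emptyRegisters++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);  // bias constant for m >= 128
        double raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        if (raw <= 2.5 * m && emptyRegisters > 0) {
            return m * Math.log((double) m / emptyRegisters);
        }
        return raw;
    }

    int registerCount() {
        return registers.length;
    }

    public static void main(String[] args) {
        SimpleHyperLogLog hll = new SimpleHyperLogLog(14);
        for (long i = 1; i <= 1_000_000; i++) {
            hll.add(i);
        }
        System.out.println("estimate:  " + Math.round(hll.estimate()));
        System.out.println("registers: " + hll.registerCount());
    }
}
```

The register array never grows, whether one thousand or one million values are added, and a larger `p` buys a smaller expected error at the cost of more registers. That is the memory-for-accuracy trade the list describes.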
For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.
The following chart shows how the error varies before and after the threshold:
For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items.
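As a quick sanity check on the numbers quoted above, the `c * 8` bytes rule of thumb and the 40000 cap can be turned into a few lines of arithmetic. The class and method names below are hypothetical; the figures are only the approximation stated in the documentation, not a measurement.

```java
// Hypothetical helper: rough memory per cardinality aggregation instance,
// using only the "about c * 8 bytes" rule and the 40000 cap quoted from the docs.
final class PrecisionThresholdMemory {
    static final int MAX_THRESHOLD = 40_000;

    // Thresholds above the maximum behave like the maximum.
    static int effectiveThreshold(int requested) {
        return Math.min(requested, MAX_THRESHOLD);
    }

    static long approxBytes(int requested) {
        return (long) effectiveThreshold(requested) * 8;
    }

    public static void main(String[] args) {
        for (int threshold : new int[] {100, 3_000, 40_000, 100_000}) {
            System.out.println(threshold + " -> about " + approxBytes(threshold) + " bytes");
        }
        // 100 -> about 800 bytes
        // 3000 -> about 24000 bytes
        // 40000 -> about 320000 bytes
        // 100000 -> about 320000 bytes
    }
}
```

Note that this is the cost of one aggregation instance. When the cardinality aggregation sits under a bucket aggregation such as terms, one instance is created per bucket, so the total grows with the bucket count.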
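In other words: the results of a cardinality aggregation are approximate. According to the documentation the error stays within about 5%, and the accuracy can be tuned through the "precision_threshold" parameter.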
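So I went back through the query code, and adding the part shown below fixed the problem. When the parameter is not set on the query, its default value is 3000 (although, per the documentation quoted above, the default in this version also depends on the number of parent aggregations that create multiple buckets).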
private void buildSearchQueryForAgg(NativeSearchQueryBuilder nativeSearchQueryBuilder) {
    // Set up the terms aggregation
    TermsBuilder agg = AggregationBuilders.terms(aggreName).field(XXX.XXX).size(Integer.MAX_VALUE);
    // Build the query condition
    BoolQueryBuilder packBoolQuery = QueryBuilders.boolQuery();
    FilterAggregationBuilder packAgg = AggregationBuilders.filter(xxx).filter(packBoolQuery);
    // Set the precision threshold explicitly on the cardinality sub-aggregation
    packAgg.subAggregation(
            AggregationBuilders.cardinality(xxx).field(ZZZZ.XXX)
                    .precisionThreshold(CARDINALITY_PRECISION_THRESHOLD));
    agg.subAggregation(packAgg);
    nativeSearchQueryBuilder.addAggregation(agg);
}