Data Queries
Druid has three main kinds of aggregation query:
- Timeseries
- TopN
- GroupBy
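Generally speaking, the core capability of an OLAP system is the GroupBy query, and Druid is no exception. However, GroupBy queries consume a lot of resources. TopN and Timeseries are useful complements to GroupBy and can improve query performance. Our recommendation: if TopN or Timeseries can cover the business scenario, use one of them rather than GroupBy.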
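As a rough illustration of this recommendation, the sketch below picks the cheapest query type that can express a request. The function name `choose_query_type` and its selection rule are assumptions made for this example, not part of Druid: no grouping dimension maps to Timeseries, a single dimension ranked by a metric maps to TopN, and everything else falls back to GroupBy.

```python
def choose_query_type(dimensions, needs_ranking=False):
    """Pick a Druid query type using a simple rule of thumb (hypothetical helper).

    dimensions    -- list of dimension names the caller wants to group by
    needs_ranking -- True when the caller wants the top results by some metric
    """
    if len(dimensions) == 0:
        # Aggregating over time only: Timeseries is sufficient.
        return "timeseries"
    if len(dimensions) == 1 and needs_ranking:
        # One dimension, sorted by a metric: TopN is sufficient.
        return "topN"
    # Several dimensions, or no ranking requested: fall back to GroupBy.
    return "groupBy"


print(choose_query_type([]))
print(choose_query_type(["country"], needs_ranking=True))
print(choose_query_type(["country", "device"]))
```

Real workloads may need a finer rule (for example, TopN results are approximate), so treat this only as a starting point.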
Druid exposes a RESTful query interface; users express a query as a JSON document.
Query command:
curl -X POST 'broker:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>
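The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_query_request` and `run_query` are made up for this example, and the broker address in the usage note is a placeholder for your own host and port.

```python
import json
import urllib.request


def build_query_request(broker_url, query):
    """Build (but do not send) a POST request carrying a Druid JSON query."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/?pretty",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(broker_url, query, timeout=30):
    """Send the query to the broker and decode the JSON response."""
    request = build_query_request(broker_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like `run_query("http://broker:8082", query)`, where `query` is a dict with the same structure as the JSON examples on this page and the address is a placeholder.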
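Notes
Filter conditions can appear in every kind of Druid query, and there are some usage techniques that deserve particular attention. See the Filters documentation.
Metric aggregation is just as important. The Aggregations documentation gives a systematic introduction, so we do not repeat it here. We do want to point out that the Filtered Aggregator described on that page enables very powerful queries, for example tailoring a metric to specific dimension values at query time.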
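To make the Filtered Aggregator idea concrete, here is a small sketch that wraps a sum aggregator in a `filtered` aggregator so that it only counts rows matching one dimension value. The helper `filtered_sum` and the output names `apple_usage` and `samsung_usage` are hypothetical; the dimension `make` and the field `user_count` are borrowed from the GroupBy example on this page.

```python
import json


def filtered_sum(name, field_name, dimension, value, agg_type="longSum"):
    """Return a sum aggregator that only sees rows where dimension == value."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": dimension, "value": value},
        "aggregator": {"type": agg_type, "name": name, "fieldName": field_name},
    }


# One metric per device maker, both computed by the same query.
aggregations = [
    filtered_sum("apple_usage", "user_count", "make", "Apple"),
    filtered_sum("samsung_usage", "user_count", "make", "Samsung"),
]
print(json.dumps(aggregations, indent=2))
```

The resulting list can be used as the `aggregations` field of a query, which yields per-value metrics side by side without running one query per value.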
GroupBy
Example
- dimensions: the dimension columns to group by
- limitSpec: the limit clause
- filter: the filter conditions
- aggregations: the metric columns to return
- postAggregations: optional
- intervals: the time range covered by this query
- having: the having clause, optional

    {
      "queryType": "groupBy",
      "dataSource": "sample_datasource",
      "granularity": "day",
      "dimensions": ["country", "device"],
      "limitSpec": {
        "type": "default",
        "limit": 5000,
        "columns": ["country", "data_transfer"]
      },
      "filter": {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "carrier", "value": "AT&T" },
          {
            "type": "or",
            "fields": [
              { "type": "selector", "dimension": "make", "value": "Apple" },
              { "type": "selector", "dimension": "make", "value": "Samsung" }
            ]
          }
        ]
      },
      "aggregations": [
        { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
        { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
      ],
      "postAggregations": [
        {
          "type": "arithmetic",
          "name": "avg_usage",
          "fn": "/",
          "fields": [
            { "type": "fieldAccess", "fieldName": "data_transfer" },
            { "type": "fieldAccess", "fieldName": "total_usage" }
          ]
        }
      ],
      "intervals": ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"],
      "having": {
        "type": "greaterThan",
        "aggregation": "total_usage",
        "value": 100
      }
    }
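Nested filters like the one in this GroupBy query are easier to read when assembled from small helpers. The sketch below rebuilds the same filter; the helper names `selector`, `and_` and `or_` are invented for this example and are not a Druid client API.

```python
def selector(dimension, value):
    """Match rows whose dimension equals value."""
    return {"type": "selector", "dimension": dimension, "value": value}


def and_(*fields):
    """Match rows that satisfy every sub-filter."""
    return {"type": "and", "fields": list(fields)}


def or_(*fields):
    """Match rows that satisfy at least one sub-filter."""
    return {"type": "or", "fields": list(fields)}


# carrier == "AT&T" AND (make == "Apple" OR make == "Samsung")
query_filter = and_(
    selector("carrier", "AT&T"),
    or_(selector("make", "Apple"), selector("make", "Samsung")),
)
print(query_filter)
```

The resulting dict can be placed under the `filter` key of any of the three query types.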
Timeseries
Example
- descending: whether to return results in descending time order
- filter: the filter conditions
- aggregations: the metric columns to return
- postAggregations: optional
- intervals: the time range covered by this query

    {
      "queryType": "timeseries",
      "dataSource": "sample_datasource",
      "granularity": "day",
      "descending": true,
      "filter": {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
          {
            "type": "or",
            "fields": [
              { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
              { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
            ]
          }
        ]
      },
      "aggregations": [
        { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
        { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
      ],
      "postAggregations": [
        {
          "type": "arithmetic",
          "name": "sample_divide",
          "fn": "/",
          "fields": [
            { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
            { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
          ]
        }
      ],
      "intervals": ["2012-01-01T00:00:00.000/2012-01-04T00:00:00.000"]
    }
By default, a Timeseries query returns 0 for time buckets in the query interval that contain no data.
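If the zero-filled buckets are not wanted, the Druid query context offers a `skipEmptyBuckets` flag for Timeseries queries. The sketch below adds it to an existing query dict; the helper name `skip_empty_buckets` is hypothetical, and the flag should be verified against the documentation of the Druid version in use.

```python
import copy


def skip_empty_buckets(query):
    """Return a copy of a timeseries query that omits empty time buckets."""
    if query.get("queryType") != "timeseries":
        raise ValueError("skipEmptyBuckets only applies to timeseries queries")
    updated = copy.deepcopy(query)  # leave the caller's query untouched
    updated.setdefault("context", {})["skipEmptyBuckets"] = True
    return updated


query = {"queryType": "timeseries", "dataSource": "sample_datasource"}
print(skip_empty_buckets(query))
```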
TopN
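- A TopN query groups by a single dimension, sorts the groups by a metric, and returns the top results.
- To improve execution efficiency, TopN is an approximate query (in our experience, the results are generally quite accurate).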
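To see why a TopN result can be approximate, consider a toy model in which every data segment reports only its own top `k` values and the merging step ranks what it receives. This is a simplified sketch of the general idea, not Druid's actual implementation; the function names and the per-segment counts are hypothetical and chosen so that the two strategies disagree.

```python
from collections import Counter


def exact_top_n(segments, threshold):
    """Merge every value from every segment, then rank: always correct."""
    totals = Counter()
    for segment in segments:
        totals.update(segment)
    return [dim for dim, _ in totals.most_common(threshold)]


def approximate_top_n(segments, threshold, k):
    """Each segment reports only its local top k; the merger ranks those."""
    totals = Counter()
    for segment in segments:
        local_top = Counter(segment).most_common(k)
        totals.update(dict(local_top))
    return [dim for dim, _ in totals.most_common(threshold)]


# Hypothetical per-segment counts: "c" is never a local leader,
# yet it has the largest total (8 + 5 = 13).
segments = [
    {"a": 10, "b": 9, "c": 8},
    {"d": 7, "e": 6, "c": 5},
]
print(exact_top_n(segments, threshold=1))             # ['c']
print(approximate_top_n(segments, threshold=1, k=2))  # ['a']
print(approximate_top_n(segments, threshold=1, k=3))  # ['c']
```

With a per-segment cutoff far larger than the requested threshold, such misses become rare, which is consistent with the observation above that results are usually accurate in practice.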
Example
- dimension: the dimension column to group by
- metric: the metric column used for sorting
- filter: the filter conditions
- aggregations: the metric columns to return
- postAggregations: post-processing logic, optional
- intervals: the time range covered by this query

    {
      "queryType": "topN",
      "dataSource": "sample_data",
      "dimension": "sample_dim",
      "threshold": 5,
      "metric": "count",
      "granularity": "all",
      "filter": {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "dim1", "value": "some_value" },
          { "type": "selector", "dimension": "dim2", "value": "some_other_val" }
        ]
      },
      "aggregations": [
        { "type": "longSum", "name": "count", "fieldName": "count" },
        { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" }
      ],
      "postAggregations": [
        {
          "type": "arithmetic",
          "name": "sample_divide",
          "fn": "/",
          "fields": [
            { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" },
            { "type": "fieldAccess", "name": "count", "fieldName": "count" }
          ]
        }
      ],
      "intervals": ["2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"]
    }