Foreword

I have recently been working on real-time data queries with Druid. This post is a short walkthrough of the official documentation combined with my own understanding, along with the problems I ran into and how I solved them, written down so I don't forget.
All demo examples in this post come from the official documentation.
Overview of Druid Queries
Druid queries are RESTful HTTP requests sent to a query-serving node; the client posts a JSON object to the query endpoint. Queries can be issued from a shell script or by building the POST request with Google's ARC (Advanced REST Client) plugin.
Building a Query
Shell script
curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>
Here <queryable_host>:<port> is the IP address and service port of the machine running the broker, historical, or realtime process, and query_json_file is the path to the JSON query file.
ARC plugin
See the figure below.

Druid Queries
1 Druid Query Types
Different query scenarios call for different query types. Druid offers many query types, each configured through its own query spec. They can be grouped into the following three categories:
1. Aggregation queries: Timeseries, TopN, and GroupBy
2. Metadata queries: Time Boundary, Segment Metadata, and DataSource
3. Search queries (Search)
Aggregation queries are used most often; the other query types are simpler and needed in fewer scenarios, and the demos on the official site can be followed directly, so this post focuses on aggregation queries. In general, Timeseries and TopN perform better than GroupBy; GroupBy is the most flexible but also the most expensive. Timeseries is noticeably faster than GroupBy because it does not have to group on any dimension, and when grouping and ordering on a single dimension, TopN outperforms GroupBy.
2 Main Query Attributes
A Druid query mainly contains the following attributes (a minimal combined example follows this list):
1. queryType: the query type, e.g. timeseries, topN, groupBy;
2. dataSource: the data source, similar to a table in MySQL;
3. granularity: the aggregation granularity, e.g. none, all, week, day, hour;
4. filter: filter conditions, similar to a WHERE clause in MySQL;
5. aggregations: the aggregation functions (aggregators), similar to COUNT and SUM in MySQL.
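To see how these attributes fit together, here is a minimal sketch of a timeseries query; the dataSource, dimension, and interval values are placeholders chosen for illustration, and intervals is an additional required attribute covered in the query sections below:
{
  "queryType": "timeseries",
  "dataSource": "my_dataSource",
  "granularity": "hour",
  "filter": { "type": "selector", "dimension": "language", "value": "en" },
  "aggregations": [ { "type": "count", "name": "count" } ],
  "intervals": [ "2013-08-31T00:00Z/2013-09-04T00:00Z" ]
}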
2.1 granularity
2.1.1 Simple granularities
The simple granularities are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, and year. The result granularity of a query is limited by the finest granularity at which the data was built in Druid: if the data was built at hourly granularity, querying at minute granularity still returns hourly data.
Suppose the data stored in Druid was built at millisecond granularity, with records like the following:
{"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}
{"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}
{"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
{"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}
Submit an hourly groupBy query, with the query spec as follows:
{
  "queryType": "groupBy",
  "dataSource": "my_dataSource",
  "granularity": "hour",
  "dimensions": [ "language" ],
  "aggregations": [ { "type": "count", "name": "count" } ],
  "intervals": [ "2000-01-01T00:00Z/3000-01-01T00:00Z" ]
}
In the result of an hourly groupBy query, the timestamp is truncated to the hour, with all finer-grained parts zero-filled; likewise, for a daily query the hour and finer parts are zero-filled. Timestamps are in UTC. The result is:
[
  { "version" : "v1", "timestamp" : "2013-08-31T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-01T01:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-02T23:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-03T03:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
]
If the granularity is set to day, the data is aggregated by day and the result is:
[
  { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-01T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-03T00:00:00.000Z", "event" : { "count" : 1, "language" : "en" } }
]
If the granularity is set to none, the data is returned at the finest granularity at which it was built in Druid, i.e. no time aggregation is performed; if the data was built at millisecond granularity, the result is also at millisecond granularity:
[
  { "version" : "v1", "timestamp" : "2013-08-31T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-01T01:02:33.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-02T23:32:45.000Z", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-03T03:32:45.000Z", "event" : { "count" : 1, "language" : "en" } }
]
If the granularity is set to all, a single row is returned, i.e. all data in the queried interval is rolled up into one summary:
[ {
"version" : "v1",
"timestamp" : "2000-01-01T00:00:00.000Z",
"event" : {
"count" : 4,
"language" : "en"
}
} ]
2.1.2 Duration granularities
A duration granularity buckets data into intervals of a fixed length in milliseconds and returns timestamps in UTC. It supports an optional origin attribute; when origin is not specified, buckets start from 1970-01-01T00:00:00Z by default.
Buckets of two hours, starting from 1970-01-01T00:00:00:
{"type": "duration", "duration": 7200000}
2.1.3 Period granularities
Period granularities are a convenient way to express common time periods such as years, months, days, and hours, using the ISO 8601 duration format. Unless specified otherwise, years start in January, months start on the 1st, and weeks start on Monday.
The general format is shown below; timeZone is optional and defaults to UTC, and origin is optional and defaults to 1970-01-01T00:00:00:
{"type": "period", "period": "P2D", "timeZone": "America/Los_Angeles"}
Typical period values are:
month: P2M means buckets of 2 months;
week: P2W means buckets of 2 weeks;
day: P1D means buckets of 1 day;
hour: PT1H means buckets of 1 hour;
second: PT0.750S means buckets of 750 milliseconds;
For example, submit a groupBy query with a 1-day period granularity:
{
  "queryType": "groupBy",
  "dataSource": "my_dataSource",
  "granularity": { "type": "period", "period": "P1D", "timeZone": "America/Los_Angeles" },
  "dimensions": [ "language" ],
  "aggregations": [ { "type": "count", "name": "count" } ],
  "intervals": [ "1999-12-31T16:00:00.000-08:00/2999-12-31T16:00:00.000-08:00" ]
}
The result of this query is:
[
  { "version" : "v1", "timestamp" : "2013-08-30T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-08-31T00:00:00.000-07:00", "event" : { "count" : 1, "language" : "en" } },
  { "version" : "v1", "timestamp" : "2013-09-02T00:00:00.000-07:00", "event" : { "count" : 2, "language" : "en" } }
]
The official example uses the America/Los_Angeles time zone; for China (UTC+8) the granularity is usually written as follows, and other time zones can be looked up via the linked time zone list:
"granularity": {
  "period": "PT1H",
  "timeZone": "+08:00",
  "type": "period"
}
2.2 filter
A filter is a JSON object that represents one filter condition, equivalent to a WHERE condition in MySQL. The main filter types are: selector filters, regular expression filters, logical expression filters (AND, OR, NOT), in filters, bound filters, search filters, JavaScript filters, and extraction filters.
2.2.1 Selector filter
Equivalent to WHERE <dimension_string> = '<dimension_value_string>'
JSON format:
"filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }
2.2.2 Regular expression filter
Similar to the selector filter, except matching is done with a regular expression; the pattern follows the standard Java regular expression syntax.
"filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }
2.2.3 Logical expression filters
AND
"filter": { "type": "and", "fields": [<filter>, <filter>, ...] }
OR
"filter": { "type": "or", "fields": [<filter>, <filter>, ...] }
NOT
"filter": { "type": "not", "field": <filter> }
IN
Equivalent to:
SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')
{
  "type": "in",
  "dimension": "outlaw",
  "values": ["Good", "Bad", "Ugly"]
}
BOUND
Numeric: 21 <= age <= 31
{ "type": "bound", "dimension": "age", "lower": "21", "upper": "31" , "ordering": "numeric" }
Numeric: 21 < age < 31
{ "type": "bound", "dimension": "age", "lower": "21", "lowerStrict": true, "upper": "31" , "upperStrict": true, "ordering": "numeric" }
String: 'foo' <= name <= 'hoo'
{ "type": "bound", "dimension": "name", "lower": "foo", "upper": "hoo" }
2.3 aggregations
Aggregations describe how data is summarized, both when it is ingested into Druid and at query time. The commonly used aggregator types are: count, sum, min/max, approximate aggregators, and miscellaneous aggregators.
2.3.1 Count aggregator
Counts the number of Druid rows that match the query filters, similar to COUNT in MySQL:
{ "type" : "count", "name" : <output_name> }
Note: the count returned by Druid is not necessarily equal to the number of raw events ingested, because Druid rolls data up at ingestion time according to the configured aggregations, and the count aggregator counts the rolled-up Druid rows.
2.3.2 Sum aggregator
Choose the sum type that matches the type of the underlying column in the Druid table.
longSum
{ "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }
doubleSum
{ "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }
2.3.3 MIN/MAX aggregator
doubleMin
{ "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }
doubleMax
{ "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }
The long variants are analogous and sketched below; for other aggregator types, see the examples in the official documentation.
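For completeness, a sketch of the long variants, which have the same shape as the double aggregators above:
{ "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }
{ "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }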
2.4 Aggregation Queries
2.4.1 Timeseries query
query
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "descending": "true",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
          { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
    { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
        { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}
Query attributes
| Attribute | Description | Required |
| --- | --- | --- |
| queryType | String; must be "timeseries" | Yes |
| dataSource | String; the data source to query | Yes |
| descending | Whether to sort results descending; defaults to false | No |
| intervals | The time range(s) to query | Yes |
| granularity | Aggregation granularity (see above) | Yes |
| filter | Filter conditions (see above) | No |
| aggregations | Aggregations (see above) | Yes |
| postAggregations | Post-aggregations | No |
| context | Query context | No |
The result of the above query:
[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
  }
]
2.4.2 TopN query
A TopN query returns an ordered set of results for a single dimension according to the given spec; conceptually it can be thought of as an ordered groupBy over one dimension. In some cases TopN queries are faster than groupBy queries. A TopN query returns a JSON array of objects. TopN is approximate: each processing node ranks its own top K results and returns only those top K, and in Druid K defaults to max(1000, threshold). In practice, if you ask for the top 1000 items in order, the ordering of items 1 through 999 is 100% correct, but the ordering of results beyond that is not guaranteed. You can increase the threshold to make the ordering more accurate.
query
{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "dim1", "value": "some_value" },
      { "type": "selector", "dimension": "dim2", "value": "some_other_val" }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" },
    { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" },
        { "type": "fieldAccess", "name": "count", "fieldName": "count" }
      ]
    }
  ],
  "intervals": [ "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000" ]
}
Query attributes
| Attribute | Description | Required |
| --- | --- | --- |
| queryType | String; must be "topN" | Yes |
| dataSource | String; the data source to query | Yes |
| dimension | The dimension to group and rank by | Yes |
| intervals | The time range(s) to query | Yes |
| granularity | Aggregation granularity (see above) | Yes |
| filter | Filter conditions (see above) | No |
| aggregations | Aggregations (see above) | Yes |
| postAggregations | Post-aggregations | No |
| threshold | The N in top N | Yes |
| metric | A string or JSON object specifying the metric to sort the top N results by (see the sketch below this table) | No |
| context | Query context | No |
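When metric is a plain string, as in the query above, the top N results are sorted descending by that metric. Druid also accepts a JSON metric spec; a sketch of sorting ascending by count, assuming the numeric and inverted metric spec types:
"metric": {
  "type": "inverted",
  "metric": { "type": "numeric", "metric": "count" }
}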
The result of the above query looks like:
[
  {
    "timestamp": "2013-08-31T00:00:00.000Z",
    "result": [
      { "dim1": "dim1_val", "count": 111, "some_metrics": 10669, "average": 96.11711711711712 },
      { "dim1": "another_dim1_val", "count": 88, "some_metrics": 28344, "average": 322.09090909090907 },
      { "dim1": "dim1_val3", "count": 70, "some_metrics": 871, "average": 12.442857142857143 },
      { "dim1": "dim1_val4", "count": 62, "some_metrics": 815, "average": 13.14516129032258 },
      { "dim1": "dim1_val5", "count": 60, "some_metrics": 2787, "average": 46.45 }
    ]
  }
]
2.4.3 GroupBy query
Similar to a GROUP BY query in MySQL.
query
{
  "queryType": "groupBy",
  "dataSource": "sample_data",
  "dimensions": [ "sample_dim" ],
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "dim1", "value": "some_value" },
      { "type": "selector", "dimension": "dim2", "value": "some_other_val" }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" },
    { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" },
        { "type": "fieldAccess", "name": "count", "fieldName": "count" }
      ]
    }
  ],
  "intervals": [ "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000" ]
}
Query attributes
| Attribute | Description | Required |
| --- | --- | --- |
| queryType | String; must be "groupBy" | Yes |
| dataSource | String; the data source to query | Yes |
| dimensions | The dimensions to group by | Yes |
| intervals | The time range(s) to query | Yes |
| granularity | Aggregation granularity (see above) | Yes |
| filter | Filter conditions (see above) | No |
| aggregations | Aggregations (see above) | Yes |
| postAggregations | Post-aggregations | No |
| limitSpec | Limits and orders the returned rows, similar to LIMIT in MySQL (see the sketch below this table) | No |
| having | Filters on aggregated values, similar to a HAVING clause in MySQL (see the sketch below this table) | No |
| context | Query context | No |
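A sketch of what limitSpec and having could look like for the query above; the limit, ordering column, and threshold values are illustrative, not from the original post:
"limitSpec": { "type": "default", "limit": 5, "columns": [ { "dimension": "count", "direction": "descending" } ] },
"having": { "type": "greaterThan", "aggregation": "count", "value": 50 }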
The result of the above query looks like:
[
  { "version": "v1", "timestamp": "2013-08-31T00:00:00.000Z", "event": { "dim1": "dim1_val", "count": 111, "some_metrics": 10669, "average": 96.11711711711712 } },
  { "version": "v1", "timestamp": "2013-08-31T00:00:00.000Z", "event": { "dim1": "another_dim1_val", "count": 88, "some_metrics": 28344, "average": 322.09090909090907 } },
  { "version": "v1", "timestamp": "2013-08-31T00:00:00.000Z", "event": { "dim1": "dim1_val3", "count": 70, "some_metrics": 871, "average": 12.442857142857143 } },
  { "version": "v1", "timestamp": "2013-08-31T00:00:00.000Z", "event": { "dim1": "dim1_val4", "count": 62, "some_metrics": 815, "average": 13.14516129032258 } },
  { "version": "v1", "timestamp": "2013-08-31T00:00:00.000Z", "event": { "dim1": "dim1_val5", "count": 60, "some_metrics": 2787, "average": 46.45 } }
]
groupBy on multi-value dimensions
Dimensions in Druid can be multi-valued. For a groupBy over a multi-value dimension, a row matches the filter as long as any one of its values matches, and the result then contains a group for every value of the multi-value field in the matching rows. The example below illustrates this.
Underlying data:
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1
{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]} #row2 {"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3 {"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4
Query:
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [ "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z" ],
  "filter": { "type": "selector", "dimension": "tags", "value": "t3" },
  "granularity": { "type": "all" },
  "dimensions": [ { "type": "default", "dimension": "tags", "outputName": "tags" } ],
  "aggregations": [ { "type": "count", "name": "count" } ]
}
Result: rows 1 and 2 match the filter:
[
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 1, "tags": "t1" } },
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 1, "tags": "t2" } },
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 2, "tags": "t3" } },
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 1, "tags": "t4" } },
  { "timestamp": "1970-01-01T00:00:00.000Z", "event": { "count": 1, "tags": "t5" } }
]
Problems Encountered and Solutions
Problem 1: querying in the Beijing time zone
Solution: set the timeZone of the period granularity:
"granularity": {
  "period": "PT1H",
  "timeZone": "+08:00",
  "type": "period"
}
Problem 2: getting the latest build time of a dataSource
Solution: set requireMessageWatermark=true in the query context and read the value from a header of the HTTP response.
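A minimal sketch of the context portion of such a query, following the description above; requireMessageWatermark is taken from this post, not from the standard Druid context options:
"context": {
  "requireMessageWatermark": "true"
}
The watermark can then be read from a header of the HTTP response, as in the Java snippet below; DRUID_WATERMARK_HEADER and DRUID_MESSAGE_WATERMARK are constants defined elsewhere in the surrounding code.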
if (HttpStatus.SC_OK == httpResponse.getStatusLine().getStatusCode()) {
    // Read the response body as the query result
    HttpEntity entity = httpResponse.getEntity();
    result.setResultStr(EntityUtils.toString(entity, "utf-8"));
    // The watermark is carried in a response header as a JSON string
    String waterMark = httpResponse.getHeaders(DRUID_WATERMARK_HEADER)[0].getValue();
    if (StringUtil.isNotEmpty(waterMark)) {
        JSONObject obj = JSONObject.parseObject(waterMark);
        waterMark = obj.getString(DRUID_MESSAGE_WATERMARK);
        result.setDruidWatermarkStr(waterMark);
    }
}
Problem 3: setting different priorities for different queries
Solution: set the priority attribute in the query context.
Problem 4: setting a query timeout
Solution: set the timeout attribute in the query context; a sketch covering both is shown below.
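Both priority and timeout are standard Druid query context parameters; a sketch with illustrative values (timeout is in milliseconds):
"context": {
  "priority": 1,
  "timeout": 60000
}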
Problem 5: returning only the desired values of a multi-value dimension in a groupBy
Solution: use a listFiltered dimension spec to select only the desired values of the multi-value column:
{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [ "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z" ],
  "filter": { "type": "selector", "dimension": "tags", "value": "t3" },
  "granularity": { "type": "all" },
  "dimensions": [
    {
      "type": "listFiltered",
      "delegate": { "type": "default", "dimension": "tags", "outputName": "tags" },
      "values": ["t3"]
    }
  ],
  "aggregations": [ { "type": "count", "name": "count" } ]
}
Appendix: link to the official Druid documentation
