1. How slow ORDER BY gets at the hundred-million-row scale
In ClickHouse, once a table grows to hundreds of millions of rows, sorting a query's result with ORDER BY becomes very slow.
For example, I have a table with 300 million rows, structured as:
{
    dvid       -- device ID,
    json       -- per-device information,
    dt         -- time,
    filter     -- filter conditions,
    order_key  -- sort key
}
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(toDate(dt))
PRIMARY KEY (dt, dvid)
ORDER BY (dt, dvid)
SETTINGS index_granularity = 256
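One detail worth noticing in this schema: index_granularity = 256 is far finer than ClickHouse's default of 8192, so the sparse primary index carries many more entries. A rough back-of-the-envelope check (plain Python; the 300-million row count comes from the article, the rest is arithmetic):

```python
# Rough sparse-index size estimate for the schema above.
# Assumption: 300 million rows, comparing the table's granularity (256)
# against ClickHouse's default (8192).
rows = 300_000_000

def granule_count(rows: int, granularity: int) -> int:
    """Number of granules (sparse primary-index entries) for a granularity."""
    return (rows + granularity - 1) // granularity  # ceiling division

fine = granule_count(rows, 256)      # granularity used by this table
default = granule_count(rows, 8192)  # ClickHouse's default
print(fine, default)                 # 1171875 36622
```

At granularity 256 the sparse index holds over a million marks, roughly 32x more than the default, which is itself a cost every query has to pay.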
Now we want to count the devices over a 3-month window (about 150 million rows), sorted by order_key.
The SQL:
select count(1) from
(select dt, toUInt32(dvid) as dvid, json from <table> where <filter conditions> order by order_key)
A very simple query, but it runs slowly:
+---------+
| count() |
+---------+
| 1262100 |
+---------+
1 row in set (37.50 sec)
But remove the ORDER BY, and the same query over the same data is much faster:
+---------+
| count() |
+---------+
| 1262100 |
+---------+
1 row in set (1.74 sec)
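The gap is no surprise once you consider what ORDER BY adds: the count only needs a scan, while sorting is O(n log n) extra work over every matching row, and it never changes the count. A toy illustration (pure Python, not ClickHouse):

```python
import random
import time

# Toy illustration: the count is identical with or without sorting --
# for this query the sort is pure extra work.
rows = [random.random() for _ in range(1_000_000)]

t0 = time.perf_counter()
n_plain = len(rows)                   # "count without ORDER BY"
t_plain = time.perf_counter() - t0

t0 = time.perf_counter()
n_sorted = len(sorted(rows))          # "count with ORDER BY": sort, then count
t_sorted = time.perf_counter() - t0

print(n_plain == n_sorted)  # True -- sorting never changes the count
```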
2. Inspecting the execution plans to see why it is slow
2.1. First, the plan without ORDER BY:
+---------------------------------------------------------------------------------------+
| explain                                                                               |
+---------------------------------------------------------------------------------------+
| Expression ((Projection + Before ORDER BY))                                           |
| Aggregating                                                                           |
| Expression ((Before GROUP BY + Projection))                                           |
| SettingQuotaAndLimits (Set limits and quota after reading from storage)               |
| Union                                                                                 |
| Expression ((Convert block structure for query from local replica + Before ORDER BY)) |
| SettingQuotaAndLimits (Set limits and quota after reading from storage)               |
| ReadFromStorage (MergeTree)                                                           |
| ReadFromPreparedSource (Read from remote replica)                                     |
+---------------------------------------------------------------------------------------+
9 rows in set (0.05 sec)
Roughly, this plan says:
Step 1: read the data from each server's disk
Step 2: union the data read from all servers
Step 3: compute the count
Total time without ORDER BY: 1.7 s
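For a plain count, these steps reduce to each replica returning a local partial count that the initiator simply sums; no global sort is involved. A sketch (the per-shard numbers below are made up for illustration; only their total, 1,262,100, comes from the article):

```python
# Hypothetical per-replica partial counts; the "union + count" of the
# plan above reduces to summing them -- no global sort needed.
shard_counts = [412_700, 455_100, 394_300]
total = sum(shard_counts)
print(total)  # 1262100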
2.2. The plan with ORDER BY:
+-------------------------------------------------------------------------+
| explain                                                                 |
+-------------------------------------------------------------------------+
| Expression ((Projection + Before ORDER BY))                             |
| Aggregating                                                             |
| Expression ((Before GROUP BY + Projection))                             |
| MergingSorted (Merge sorted streams for ORDER BY)                       |
| SettingQuotaAndLimits (Set limits and quota after reading from storage) |
| Union                                                                   |
| Expression (Convert block structure for query from local replica)       |
| FinishSorting                                                           |
| Expression (Before ORDER BY)                                            |
| SettingQuotaAndLimits (Set limits and quota after reading from storage) |
| Expression (Remove unused columns after reading from storage)           |
| Union                                                                   |
| MergingSorted (Merge sorting mark ranges)                               |
| Expression (Calculate sorting key prefix)                               |
| ReadFromStorage (MergeTree with order)                                  |
| MergingSorted (Merge sorting mark ranges)                               |
| Expression (Calculate sorting key prefix)                               |
| ReadFromStorage (MergeTree with order)                                  |
| MergingSorted (Merge sorting mark ranges)                               |
| Expression (Calculate sorting key prefix)                               |
| ReadFromStorage (MergeTree with order)                                  |
...................
(N identical lines omitted)
...................
Roughly:
1. Read the needed data from disk (a portion at a time, because ORDER BY sorts in memory and cannot load everything at once)
2. Sort it by the ORDER BY key
3. Merge the many sorted chunks, then sort the merged result again (this step itself may be split into several passes depending on data volume)
4. Finally aggregate to compute the count
Total time with ORDER BY: 37.5 s
3. Optimization
3.1. Switching the engine
Since the previous step was so slow, I suspected ClickHouse's MergeTree engine is simply not well suited to this sorting pattern, so I tried various engines; the best fit for ORDER BY turned out to be ReplicatedAggregatingMergeTree.
Recreate the table:
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/<db>/<table>', '{replica}')
PARTITION BY toYYYYMMDD(toDate(dt))
PRIMARY KEY dvid
ORDER BY dvid
SETTINGS index_granularity = 8196
Then grow the table from the original 150 million rows to 350 million.
Run the same statements and measure:
1. Without ORDER BY:
+---------+
| count() |
+---------+
| 3290193 |
+---------+
1 row in set (3.58 sec)
2. With ORDER BY:
+---------+
| count() |
+---------+
| 3290193 |
+---------+
1 row in set (4.16 sec)
3.2. Execution plans after switching to ReplicatedAggregatingMergeTree
1. Plan without ORDER BY:
+---------------------------------------------------------------------------------------+
| explain                                                                               |
+---------------------------------------------------------------------------------------+
| Expression ((Projection + Before ORDER BY))                                           |
| Aggregating                                                                           |
| Expression ((Before GROUP BY + Projection))                                           |
| SettingQuotaAndLimits (Set limits and quota after reading from storage)               |
| Union                                                                                 |
| Expression ((Convert block structure for query from local replica + Before ORDER BY)) |
| SettingQuotaAndLimits (Set limits and quota after reading from storage)               |
| ReadFromStorage (MergeTree)                                                           |
| ReadFromPreparedSource (Read from remote replica)                                     |
+---------------------------------------------------------------------------------------+
9 rows in set (0.05 sec)
The plan without ORDER BY is identical to the MergeTree one, so nothing changed for this case; the query time grew from 1.74 s to 3.58 s only because the base data grew from the original 150 million rows to about 330 million.
2. Plan with ORDER BY:
+-------------------------------------------------------------------------+
| explain                                                                 |
+-------------------------------------------------------------------------+
| Expression ((Projection + Before ORDER BY))                             |
| Aggregating                                                             |
| Expression ((Before GROUP BY + Projection))                             |
| MergingSorted (Merge sorted streams for ORDER BY)                       |
| SettingQuotaAndLimits (Set limits and quota after reading from storage) |
| Union                                                                   |
| Expression (Convert block structure for query from local replica)       |
| MergingSorted (Merge sorted streams for ORDER BY)                       |
| MergeSorting (Merge sorted blocks for ORDER BY)                         |
| PartialSorting (Sort each block for ORDER BY)                           |
| Expression (Before ORDER BY)                                            |
| SettingQuotaAndLimits (Set limits and quota after reading from storage) |
| ReadFromStorage (MergeTree)                                             |
| ReadFromPreparedSource (Read from remote replica)                       |
+-------------------------------------------------------------------------+
14 rows in set (0.08 sec)
Compared with the MergeTree ORDER BY plan, the ReplicatedAggregatingMergeTree one is much simpler.
The optimization is easy to see: data is read and sorted per server / per partition / per segment and then merged, with the sorting done block by block:
1. On each server, PartialSorting sorts each block independently (it sorts blocks, not the whole data set at once)
2. MergeSorting merges the sorted blocks (since every block is already ordered, the merge only needs to compare block boundaries)
3. Once the local results are sorted, a distributed MergingSorted step produces the final global order
4. Compute the count
This is essentially merge sort, which is well suited to sorting large data volumes in a distributed setting.
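The PartialSorting / MergeSorting pair maps directly onto the classic sort-each-block-then-k-way-merge pattern; a minimal Python model (toy data, not ClickHouse internals):

```python
import heapq

data = [9, 1, 8, 2, 7, 3, 6, 4, 5, 0]
block_size = 3

# PartialSorting: sort each block independently
blocks = [sorted(data[i:i + block_size]) for i in range(0, len(data), block_size)]

# MergeSorting / MergingSorted: single-pass k-way merge of the
# already-sorted blocks, comparing only the current head of each block
merged = list(heapq.merge(*blocks))
print(merged)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```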
4. Optimizing the SQL
[The previous step already nearly satisfies the requirement, but our goal is not merely a count: we need the first occurrence of something within a user-defined time range.]
With the engine switched, query speed improved fundamentally; next we optimize the SQL itself. If the original query spans 3 months, it can be further split into segments, doing one ORDER BY per 15 days or per month, for example.
Original SQL:
select count(1) from
(select dt, toUInt32(dvid) as dvid, json from <table> where <filter conditions> and dt >= '2021-01-01' and dt <= '2021-03-31' order by order_key)
Optimized SQL:
select count(1) from
(
    select dt, toUInt32(dvid) as dvid, json from <table>
    where <filter conditions> and dt >= '2021-01-01' and dt <= '2021-01-31'
    order by order_key
    union all
    select dt, toUInt32(dvid) as dvid, json from <table>
    where <filter conditions> and dt >= '2021-02-01' and dt <= '2021-02-28'
    order by order_key
    union all
    select dt, toUInt32(dvid) as dvid, json from <table>
    where <filter conditions> and dt >= '2021-03-01' and dt <= '2021-03-31'
    order by order_key
)
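The month-by-month split can be generated instead of hand-written, which also guarantees valid month-end dates (e.g. February ends on the 28th in 2021, not the 31st). A sketch; the table, filter, and order_key names are placeholders standing in for the article's real ones:

```python
from calendar import monthrange
from datetime import date, timedelta

def month_ranges(start: date, end: date):
    """Yield (first_day, last_day) pairs covering [start, end] month by month."""
    cur = start
    while cur <= end:
        month_end = date(cur.year, cur.month, monthrange(cur.year, cur.month)[1])
        yield cur, min(month_end, end)
        cur = month_end + timedelta(days=1)

# Render one ORDER BY sub-select per month, then UNION ALL the pieces.
subqueries = [
    f"select dt, toUInt32(dvid) as dvid, json from <table> "
    f"where <filter> and dt >= '{a}' and dt <= '{b}' order by order_key"
    for a, b in month_ranges(date(2021, 1, 1), date(2021, 3, 31))
]
sql = "select count(1) from ( " + " union all ".join(subqueries) + " )"
print(len(subqueries))  # 3
```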
Query speed:
+---------+
| count() |
+---------+
| 3290193 |
+---------+
1 row in set (2.74 sec)
With the engine change plus the SQL rewrite, the total query time drops from the original 37.5 s to 2.74 s.