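Original article; please credit the source when reposting.
1. Intersection and difference functions
1) Introduction
bitmapAnd: intersection of two bitmaps
bitmapAndnot: difference of two bitmaps
groupArray: aggregate function that collects the values of a query result into an array
bitmapBuild: builds a bitmap object from an array of unsigned integers
xxHash32: xxHash is an extremely fast non-cryptographic hash algorithm that works at RAM speed limits. It comes in four variants (XXH32, XXH64, XXH3_64bits and XXH3_128bits). The latest variant, XXH3, improves performance across the board, especially on small data.
cityHash64: an efficient hash algorithm developed by Google
arrayIntersect: returns the elements that are present in all of the given arrays
arrayConcat: concatenates all arrays passed as arguments
2) Official documentation links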
groupArray :https://clickhouse.tech/docs/zh/sql-reference/aggregate-functions/reference/#agg_function-grouparray
bitmapBuild :https://clickhouse.tech/docs/zh/sql-reference/functions/bitmap-functions/#bitmapbuild
xxHash32 cityHash64 :https://clickhouse.tech/docs/zh/sql-reference/functions/hash-functions/
arrayConcat arrayIntersect :https://clickhouse.tech/docs/zh/sql-reference/functions/array-functions/#arrayintersectarr
2. Implementation using bitmaps
Table structure
create table q_imis_query
(
task_id UInt64,
imsi String,
insert_time DateTime
);
1) Intersection query
with
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *  -- select list is missing in the original; * is assumed
from q_imis_query
where
cityHash64(task_id)%5000=cityHash64(1270)%5000
and task_id=1270
and has(bitmapToArray(bitmapAnd(bitmapAnd(a01,a02),a03)),xxHash32(imsi))=1;
2) Difference query
with
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *  -- select list is missing in the original; * is assumed
from q_imis_query
where
cityHash64(task_id)%5000=cityHash64(1270)%5000
and task_id=1270
and has(bitmapToArray(bitmapAndnot(bitmapAndnot(a01,a02),a03)),xxHash32(imsi))=1;
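To make the set logic of the two queries above easier to follow, here is a minimal sketch in plain Python rather than ClickHouse. Python sets stand in for bitmaps, the helper names `bitmap_and` and `bitmap_andnot` are illustrative models (not ClickHouse APIs), and the integer values are hypothetical hashed imsi values.

```python
# Plain-Python model of the two bitmap expressions; sets stand in for bitmaps.

def bitmap_and(a, b):
    """Model of bitmapAnd: elements present in both bitmaps."""
    return a & b


def bitmap_andnot(a, b):
    """Model of bitmapAndnot: elements of a that are not in b."""
    return a - b


# Hypothetical hashed imsi values for each task_id.
a01 = {1, 2, 3, 4, 5}  # task 1270
a02 = {2, 3, 4, 9}     # task 1271
a03 = {3, 4, 5, 7}     # task 8888

# bitmapAnd(bitmapAnd(a01, a02), a03): values present in all three tasks.
intersection = bitmap_and(bitmap_and(a01, a02), a03)

# bitmapAndnot(bitmapAndnot(a01, a02), a03): values only in task 1270.
difference = bitmap_andnot(bitmap_andnot(a01, a02), a03)

print(sorted(intersection))  # [3, 4]
print(sorted(difference))    # [1]
```

The outer query then keeps the rows of task 1270 whose hashed imsi is a member of the resulting set.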
3. Implementation using arrays
1) Intersection query
with
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *  -- select list is missing in the original; * is assumed
from q_imis_query
where
cityHash64(task_id)%5000=cityHash64(1270)%5000
and task_id=1270
and has(arrayConcat(arrayIntersect(a01,a02),arrayIntersect(a01,a03)),cityHash64(imsi))=1;
2) Difference query
with
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *  -- select list is missing in the original; * is assumed
from q_imis_query
where
cityHash64(task_id)%5000=cityHash64(1270)%5000
and task_id=1270
and has(arrayConcat(arrayIntersect(a01,a02),arrayIntersect(a01,a03)),cityHash64(imsi))=0;
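Both array queries in this section test membership in the same expression, arrayConcat(arrayIntersect(a01,a02),arrayIntersect(a01,a03)). The sketch below models that expression in plain Python to show which rows each variant keeps. The helpers `array_intersect` and `array_concat` are illustrative models of the ClickHouse functions (element order is not modelled), and the values are hypothetical.

```python
# Plain-Python model of the array expression used by both queries.

def array_intersect(*arrays):
    """Model of arrayIntersect: elements present in every array."""
    result = set(arrays[0])
    for arr in arrays[1:]:
        result &= set(arr)
    return sorted(result)


def array_concat(*arrays):
    """Model of arrayConcat: all arrays joined end to end."""
    out = []
    for arr in arrays:
        out.extend(arr)
    return out


# Hypothetical hashed imsi values for each task_id.
a01 = [1, 2, 3, 4, 5]  # task 1270
a02 = [2, 3, 4, 9]     # task 1271
a03 = [3, 4, 5, 7]     # task 8888

combined = array_concat(array_intersect(a01, a02), array_intersect(a01, a03))

# has(combined, x) = 1: values of a01 that also appear in a02 OR a03.
kept_when_1 = [x for x in a01 if x in combined]

# has(combined, x) = 0: values of a01 that appear in neither a02 nor a03.
kept_when_0 = [x for x in a01 if x not in combined]

# For comparison: values present in all three arrays.
three_way = array_intersect(a01, a02, a03)

print(kept_when_1)  # [2, 3, 4, 5]
print(kept_when_0)  # [1]
print(three_way)    # [3, 4]
```

In this model the `=0` variant matches the bitmap difference query, while the `=1` variant returns a01 intersected with the union of a02 and a03, which is wider than the three-way intersection computed by the bitmap query. If a strict three-way intersection is wanted, passing all three arrays to arrayIntersect in a single call would be one way to get it.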
4. Earlier implementation: the result was 0 on large data volumes
Intersection example
with
(select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *  -- select list is missing in the original; * is assumed
from q_imis_query
where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270 and has(bitmapToArray(bitmapAnd(bitmapAnd(a01,a02),a03)),cityHash64(imsi))=1;
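Cause
A bitmap object is stored in a Set while it has 32 or fewer members, and in a RoaringBitmap once it has more than 32. The RoaringBitmap stores UInt32 values, so once the number of elements exceeds 32, the 64-bit cityHash64 values are truncated. The filter still compares against the full cityHash64(imsi), which no longer matches the truncated values, so the query returns nothing.
Bitmaps support 32-bit values; as of ClickHouse 20.6 they do not yet support 64-bit values, so cityHash64 was replaced with xxHash32.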
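The effect of the truncation can be illustrated with a small Python sketch. It assumes that truncation keeps the low 32 bits of each value (the text above only says the values are truncated), and the hash values are hypothetical constants rather than real cityHash64 output.

```python
# Model of storing 64-bit hashes in a container that only holds UInt32 values.

MASK32 = 0xFFFFFFFF


def truncate_to_uint32(value):
    """Assumed truncation: keep only the low 32 bits."""
    return value & MASK32


# Hypothetical 64-bit hash values, all larger than 2**32 - 1.
hashes64 = [0x1A2B3C4D5E6F7081, 0x0000000100000002, 0xFFFFFFFF00000003]

# What the 32-bit container ends up holding.
stored = {truncate_to_uint32(h) for h in hashes64}

# The filter looks up the full 64-bit value, so nothing matches.
matches64 = [h for h in hashes64 if h in stored]

# With a 32-bit hash, stored and looked-up values are the same width.
hashes32 = [truncate_to_uint32(h) for h in hashes64]
matches32 = [h for h in hashes32 if h in stored]

print(matches64)       # []
print(len(matches32))  # 3
```

This is why switching the hash to xxHash32, which already produces 32-bit values, makes the lookup consistent with what the bitmap stores.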
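5. Bitmap functions
Bitmap functions compute over two bitmap objects; the set operations such as and, or, xor and not each return a new bitmap object.
There are two ways to construct a bitmap object: with the aggregate function groupBitmapState, or from an Array object. A bitmap object can also be converted back to an array.
RoaringBitmap is the underlying storage for bitmap objects. When the cardinality is 32 or less, a Set is used; when the cardinality is greater than 32, a RoaringBitmap is used. This is why storing low-cardinality sets is faster.
RoaringBitmap: https://github.com/RoaringBitmap/CRoaring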
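The small-set/large-container switch described above can be pictured with a toy Python model. This is only a sketch of the idea, not ClickHouse's or CRoaring's actual implementation: the class `HybridBitmap` is hypothetical, and a Python integer used as a bitset stands in for the RoaringBitmap.

```python
# Toy model of a container that switches representation at cardinality 32.

SMALL_LIMIT = 32  # threshold stated in the text


class HybridBitmap:
    """Holds a plain set up to SMALL_LIMIT values, then a bitset stand-in."""

    def __init__(self):
        self.small = set()
        self.large = None  # stand-in for a RoaringBitmap (int used as bitset)

    def add(self, value):
        if self.large is not None:
            self.large |= 1 << value
            return
        self.small.add(value)
        if len(self.small) > SMALL_LIMIT:
            # Cardinality exceeded the limit: move everything to the bitset.
            self.large = 0
            for v in self.small:
                self.large |= 1 << v
            self.small = set()

    def contains(self, value):
        if self.large is not None:
            return (self.large >> value) & 1 == 1
        return value in self.small

    def kind(self):
        return "large" if self.large is not None else "small"


bm = HybridBitmap()
for v in range(32):
    bm.add(v)
kind_at_32 = bm.kind()

bm.add(32)  # the 33rd distinct value triggers the switch
kind_at_33 = bm.kind()

print(kind_at_32, kind_at_33)            # small large
print(bm.contains(5), bm.contains(100))  # True False
```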
6. Speed comparison of several hash algorithms
https://cyan4973.github.io/xxHash/
https://clickhouse.tech/docs/zh/sql-reference/functions/hash-functions/