Original article. Please credit the source when reposting.
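1. Intersection and difference functions
1) Introduction
bitmapAnd: intersection of two bitmaps
bitmapAndnot: difference of two bitmaps (elements of the first bitmap that are not in the second)
groupArray: aggregates the values of a query result into an array
bitmapBuild: builds a bitmap object from an array of unsigned integers
xxHash32: xxHash is a very fast non-cryptographic hash algorithm that runs at RAM speed limits. It comes in four variants (XXH32, XXH64, XXH3_64bits, and XXH3_128bits). The latest variant, XXH3, improves performance across the board, especially on small inputs. xxHash32 returns a UInt32 value.
cityHash64: an efficient non-cryptographic hash algorithm developed by Google; returns a UInt64 value
arrayIntersect: returns the elements that are present in all of the given arrays
arrayConcat: concatenates all arrays passed as arguments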
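As a quick illustration of the functions listed above, the following sketch applies them to small literal arrays rather than to real table data. The input values are arbitrary and chosen only for the example.

```sql
-- Intersection and difference of two small bitmaps built from literal arrays.
SELECT bitmapToArray(bitmapAnd(bitmapBuild([1, 2, 3]), bitmapBuild([3, 4, 5]))) AS intersection;  -- [3]
SELECT bitmapToArray(bitmapAndnot(bitmapBuild([1, 2, 3]), bitmapBuild([3, 4, 5]))) AS difference; -- [1,2]

-- The array counterparts: elements common to all arrays, and plain concatenation.
SELECT arrayIntersect([1, 2], [1, 3], [1, 4]) AS common;   -- [1]
SELECT arrayConcat([1, 2], [3, 4], [5, 6]) AS joined;      -- [1,2,3,4,5,6]
```

Note that arrayConcat keeps duplicates and performs no set logic of its own; it only appends one array to another.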
2) Official documentation links
groupArray: https://clickhouse.tech/docs/zh/sql-reference/aggregate-functions/reference/#agg_function-grouparray
bitmapBuild: https://clickhouse.tech/docs/zh/sql-reference/functions/bitmap-functions/#bitmapbuild
xxHash32, cityHash64: https://clickhouse.tech/docs/zh/sql-reference/functions/hash-functions/
arrayConcat, arrayIntersect: https://clickhouse.tech/docs/zh/sql-reference/functions/array-functions/#arrayintersectarr
2. Implementation using bitmaps
Table structure
create table q_imis_query
(
task_id UInt64,
imsi String,
insert_time DateTime
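)
-- The original omits the engine clause; MergeTree ordered by task_id is an assumption.
engine = MergeTree order by task_id;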
1) Intersection query
with
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
cityHash64(task_id)%5000=cityHash64(1270)%5000
and task_id=1270
and has(bitmapToArray(bitmapAnd(bitmapAnd(a01,a02),a03)),xxHash32(imsi))=1;
2) Difference query
with
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
cityHash64(task_id)%5000=cityHash64(1270)%5000
and task_id=1270
and has(bitmapToArray(bitmapAndnot(bitmapAndnot(a01,a02),a03)),xxHash32(imsi))=1;
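The has(bitmapToArray(...)) check in the two queries above first expands the bitmap into an array. As a hedged alternative, ClickHouse also provides bitmapContains(bitmap, needle), which tests membership on the bitmap directly. The sketch below uses literal values only; whether it is faster on real data, and whether the installed version includes the function, should be verified rather than assumed.

```sql
-- Membership test directly on a bitmap, without converting it to an array.
SELECT bitmapContains(bitmapBuild([1, 5, 7, 9]), toUInt32(9)) AS found;  -- 1

-- The same test applied to the intersection of two literal bitmaps.
SELECT bitmapContains(
    bitmapAnd(bitmapBuild([1, 2, 3]), bitmapBuild([3, 4, 5])),
    toUInt32(3)
) AS in_both;  -- 1
```

Under that assumption, the last condition of the intersection query could read bitmapContains(bitmapAnd(bitmapAnd(a01,a02),a03), xxHash32(imsi)), since xxHash32 already returns the UInt32 needle the function expects.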
3. Implementation using arrays
1) Intersection query
with
-- plain arrays here: arrayIntersect operates on arrays, not on bitmap objects
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
cityHash64(task_id)%5000=cityHash64(1270)%5000
and task_id=1270
-- arrayIntersect accepts any number of arrays, so one call yields the elements common to a01, a02 and a03
and has(arrayIntersect(a01,a02,a03),cityHash64(imsi))=1;
2) Difference query
with
-- plain arrays here: arrayIntersect operates on arrays, not on bitmap objects
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
cityHash64(task_id)%5000=cityHash64(1270)%5000
and task_id=1270
-- keep rows of task 1270 whose imsi appears in neither a02 nor a03
and has(arrayConcat(arrayIntersect(a01,a02),arrayIntersect(a01,a03)),cityHash64(imsi))=0;
4. The earlier implementation returned 0 rows when the data volume was large
Intersection example
with
(select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01 ,
(select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
(select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270 and has(bitmapToArray(bitmapAnd(bitmapAnd(a01,a02),a03)),cityHash64(imsi))=1;
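Cause
A bitmap object is stored as a Set while it has 32 or fewer members, and as a RoaringBitmap once it has more than 32. RoaringBitmap stores UInt32 values, so as soon as the number of elements exceeds 32 the 64-bit cityHash64 values are truncated to 32 bits and no longer match cityHash64(imsi) in the has() check.
Bitmaps support 32-bit values; ClickHouse 20.6 did not yet support 64-bit bitmaps, so cityHash64 was replaced with xxHash32.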
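The width mismatch described above can be checked directly by asking ClickHouse for the return type of each hash function. The IMSI string below is a made-up sample value, not data from the table.

```sql
-- Return types of the two hash functions for the same (hypothetical) input.
SELECT
    toTypeName(cityHash64('460001234567890')) AS city_type,  -- UInt64
    toTypeName(xxHash32('460001234567890'))   AS xx_type;    -- UInt32
```

Because xxHash32 already produces a UInt32, its values fit a 32-bit bitmap without truncation, which is why the bitmap queries in section 2 use it.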
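5. Bitmap functions
Bitmap functions perform calculations on two bitmap objects and return either a new bitmap or a cardinality, for example and, or, xor, and not.
A bitmap object can be constructed in two ways: with the aggregate function groupBitmapState, or from an Array object (with bitmapBuild). A bitmap object can also be converted back to an array object (with bitmapToArray).
ClickHouse uses RoaringBitmap as the actual storage for bitmap objects. When the cardinality is less than or equal to 32 it uses a Set; when the cardinality is greater than 32 it uses a RoaringBitmap. This is why low-cardinality sets are stored faster.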
RoaringBitmap:https://github.com/RoaringBitmap/CRoaring
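The two construction paths described in this section can be sketched as follows. The example relies on the built-in numbers() table function for input, so it needs no user table; the value ranges are arbitrary.

```sql
-- Path 1: build a bitmap with the aggregate function groupBitmapState.
-- 100 distinct values, so the cardinality exceeds the 32-element Set threshold.
SELECT bitmapCardinality(groupBitmapState(toUInt32(number))) AS n_from_aggregate
FROM numbers(100);  -- 100

-- Path 2: build a bitmap from an array, then convert it back to an array.
SELECT bitmapToArray(bitmapBuild([1, 2, 3, 4, 5])) AS from_array;  -- [1,2,3,4,5]

-- A function that returns a cardinality rather than a bitmap.
SELECT bitmapCardinality(bitmapOr(bitmapBuild([1, 2, 3]), bitmapBuild([3, 4, 5]))) AS union_size;  -- 5
```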
6. Speed comparison of several hash algorithms
https://cyan4973.github.io/xxHash/
https://clickhouse.tech/docs/zh/sql-reference/functions/hash-functions/