ClickHouse (20.6): Implementing Intersection and Difference


Original article; please credit the source when reposting.

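1. Intersection and difference functions

1) Overview

bitmapAnd: bitmap intersection function; returns the intersection of two bitmaps

bitmapAndnot: bitmap difference function; returns the members of the first bitmap that are not in the second

groupArray: aggregate function that collects the values of a query result into an array

bitmapBuild: builds a bitmap object from an array of unsigned integers

xxHash32: xxHash is a very fast non-cryptographic hash algorithm that works at RAM speed limits. It comes in four variants (XXH32, XXH64, XXH3_64bits and XXH3_128bits). The latest variant, XXH3, improves performance across the board, especially on small data.

cityHash64: an efficient hash algorithm developed by Google

arrayIntersect: returns the elements that are present in all of the given arrays

arrayConcat: concatenates all arrays passed as arguments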

2) Official documentation links

groupArray: https://clickhouse.tech/docs/zh/sql-reference/aggregate-functions/reference/#agg_function-grouparray

bitmapBuild: https://clickhouse.tech/docs/zh/sql-reference/functions/bitmap-functions/#bitmapbuild

xxHash32, cityHash64: https://clickhouse.tech/docs/zh/sql-reference/functions/hash-functions/

arrayConcat, arrayIntersect: https://clickhouse.tech/docs/zh/sql-reference/functions/array-functions/#arrayintersectarr

2. Implementation with bitmaps

Table structure

create table q_imis_query
(
  task_id UInt64,
  imsi String,
  insert_time DateTime
);

1) Intersection query

with
  -- one bitmap of 32-bit IMSI hashes per task
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
  cityHash64(task_id)%5000=cityHash64(1270)%5000
  and task_id=1270
  -- keep rows whose IMSI hash is present in all three bitmaps
  and has(bitmapToArray(bitmapAnd(bitmapAnd(a01,a02),a03)),xxHash32(imsi))=1;

2) Difference query

with
  -- one bitmap of 32-bit IMSI hashes per task
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
  cityHash64(task_id)%5000=cityHash64(1270)%5000
  and task_id=1270
  -- keep rows whose IMSI hash is in a01 but in neither a02 nor a03
  and has(bitmapToArray(bitmapAndnot(bitmapAndnot(a01,a02),a03)),xxHash32(imsi))=1;
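
To make the set semantics of the two queries above concrete, here is a minimal sketch in Python (chosen only because it can be run without a ClickHouse server). It models each bitmap as a plain Python set; the hash values are made up for illustration and are not real xxHash32 outputs.

```python
# Toy model: each "bitmap" is a Python set of 32-bit IMSI hashes.
# The values below are hypothetical, chosen only for illustration.
a01 = {11, 22, 33, 44}   # task_id = 1270
a02 = {22, 33, 44, 55}   # task_id = 1271
a03 = {33, 44, 66}       # task_id = 8888

# Models bitmapAnd(bitmapAnd(a01, a02), a03): members present in all three sets.
intersection = (a01 & a02) & a03

# Models bitmapAndnot(bitmapAndnot(a01, a02), a03):
# members of a01 that appear in neither a02 nor a03.
difference = (a01 - a02) - a03

print(sorted(intersection))  # [33, 44]
print(sorted(difference))    # [11]
```

In the queries, the final has(...) = 1 filter then keeps only the rows of task 1270 whose hash falls into the resulting set.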

3. Implementation with arrays

1) Intersection query

with
  -- plain arrays of 64-bit IMSI hashes; arrayIntersect takes arrays, so bitmapBuild is not used here
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
  cityHash64(task_id)%5000=cityHash64(1270)%5000
  and task_id=1270
  -- arrayIntersect accepts any number of arrays and returns the elements common to all of them
  and has(arrayIntersect(a01,a02,a03),cityHash64(imsi))=1;

2) Difference query

with
  -- plain arrays of 64-bit IMSI hashes; arrayIntersect takes arrays, so bitmapBuild is not used here
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
  cityHash64(task_id)%5000=cityHash64(1270)%5000
  and task_id=1270
  -- the concatenation holds every a01 value that also occurs in a02 or a03; =0 keeps the rest
  and has(arrayConcat(arrayIntersect(a01,a02),arrayIntersect(a01,a03)),cityHash64(imsi))=0;
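
The arrayConcat expression in the difference query is easy to misread, so here is a small Python sketch of what it selects. The helper array_intersect is a hypothetical stand-in for arrayIntersect (it sorts its result, which ClickHouse does not promise), and the numbers are made-up hash values.

```python
# Hypothetical hash values standing in for cityHash64(imsi).
a01 = [1, 2, 3, 4, 5]
a02 = [2, 3, 9]
a03 = [3, 4, 8]

def array_intersect(*arrays):
    """Values present in every array (a stand-in for arrayIntersect)."""
    common = set(arrays[0])
    for arr in arrays[1:]:
        common &= set(arr)
    return sorted(common)

# Models arrayConcat(arrayIntersect(a01, a02), arrayIntersect(a01, a03)).
concat = array_intersect(a01, a02) + array_intersect(a01, a03)

# has(concat, x) = 0 keeps the values of a01 that are in neither a02 nor a03.
kept = [x for x in a01 if x not in concat]

# has(concat, x) = 1 would keep a01 ∩ (a02 ∪ a03),
# which is wider than the three-way intersection.
matched = [x for x in a01 if x in concat]

print(kept)                            # [1, 5]
print(matched)                         # [2, 3, 4]
print(array_intersect(a01, a02, a03))  # [3]
```

So the =0 form yields the same difference as chaining bitmapAndnot, while a strict three-way intersection needs all three arrays passed to arrayIntersect.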

4. The previous implementation, whose result is 0 when the data volume is large

Intersection example

with
  -- bitmaps built from 64-bit cityHash64 values (the problematic part)
  (select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270 and has(bitmapToArray(bitmapAnd(bitmapAnd(a01,a02),a03)),cityHash64(imsi))=1;

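Cause

A bitmap object is stored as a Set while it has 32 or fewer members, and as a RoaringBitmap once it has more than 32. The RoaringBitmap holds UInt32 values, so as soon as the number of elements exceeds 32, the 64-bit hash values are truncated and no longer match the cityHash64(imsi) values they are compared against.

Bitmaps support 32-bit values; ClickHouse 20.6 does not yet support 64-bit values, which is why cityHash64 was replaced with xxHash32.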
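
The effect described above can be reproduced with a toy model in Python. This is only a sketch of the article's explanation, not ClickHouse's actual code: the function names are hypothetical, the hash values are made up, and it assumes that truncation keeps the low 32 bits of each value.

```python
SMALL_SET_LIMIT = 32    # threshold described in the article
MASK_32 = 0xFFFFFFFF    # assumption: truncation keeps the low 32 bits

def toy_bitmap_build(values):
    """Toy model of bitmapBuild on 64-bit inputs, following the article's explanation."""
    distinct = set(values)
    if len(distinct) <= SMALL_SET_LIMIT:
        return distinct                      # small set: values are kept intact
    return {v & MASK_32 for v in distinct}   # large set: values are cut to 32 bits

def count_matches(values):
    """Count inputs still found in the bitmap built from them (models the has(...) = 1 filter)."""
    stored = toy_bitmap_build(values)
    return sum(1 for v in values if v in stored)

def fake_hashes(n):
    """Made-up 64-bit 'hashes'; every value lies above the 32-bit range."""
    return [((i + 1) << 40) | i for i in range(n)]

print(count_matches(fake_hashes(32)))    # 32: small set, every value still matches
print(count_matches(fake_hashes(1000)))  # 0: truncated values no longer match
```

Under these assumptions, small test data works while a large table returns nothing, which matches the symptom in section 4; hashing with a 32-bit function such as xxHash32 avoids the mismatch.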

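5. Bitmap functions

Bitmap functions perform calculations on two bitmap objects; operations such as and, or, xor and not each return a new bitmap object.

A bitmap object can be constructed in two ways: by the aggregate function groupBitmapState, or from an Array object. A bitmap object can also be converted back into an Array object.

ClickHouse uses RoaringBitmap as the actual storage for bitmap objects. When the cardinality is less than or equal to 32, a Set is used; when the cardinality is greater than 32, a RoaringBitmap is used. This is why low-cardinality sets are stored faster.

RoaringBitmap: https://github.com/RoaringBitmap/CRoaring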

6. Speed comparison of several hash algorithms

https://cyan4973.github.io/xxHash/

https://clickhouse.tech/docs/zh/sql-reference/functions/hash-functions/

