ClickHouse (20.6): Implementing Intersection and Difference


Original article. Please credit the source when reposting.

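1. Intersection and difference functions

1) Overview

bitmapAnd: intersection of two bitmaps

bitmapAndnot: difference of two bitmaps (members of the first that are not in the second)

groupArray: aggregate function that collects the values of a query result into an array

bitmapBuild: builds a bitmap object from an array of unsigned integers

xxHash32: xxHash is an extremely fast non-cryptographic hash algorithm that works at RAM speed limits. It comes in four variants (XXH32, XXH64, XXH3_64bits and XXH3_128bits). The newest one, XXH3, improves performance across the board, especially on small data.

cityHash64: a fast non-cryptographic hash algorithm developed by Google

arrayIntersect: returns the elements that are present in all of the given arrays

arrayConcat: concatenates all arrays passed as arguments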
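As a quick illustration of the set functions listed above, here is a minimal sketch that runs them on small literal arrays. The input values are made up for the example and have nothing to do with the tables used later.

```sql
-- Intersection and difference on bitmaps built from literal arrays.
SELECT
    bitmapToArray(bitmapAnd(bitmapBuild([1, 2, 3]), bitmapBuild([3, 4, 5])))    AS bitmap_intersection, -- [3]
    bitmapToArray(bitmapAndnot(bitmapBuild([1, 2, 3]), bitmapBuild([3, 4, 5]))) AS bitmap_difference;   -- [1,2]

-- The array counterparts.
SELECT
    arrayIntersect([1, 2], [1, 3], [1, 4]) AS array_intersection, -- [1]
    arrayConcat([1, 2], [3, 4], [5, 6])    AS array_concat;       -- [1,2,3,4,5,6]
```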

 

2) Official documentation links

groupArray: https://clickhouse.tech/docs/zh/sql-reference/aggregate-functions/reference/#agg_function-grouparray

bitmapBuild: https://clickhouse.tech/docs/zh/sql-reference/functions/bitmap-functions/#bitmapbuild

xxHash32, cityHash64: https://clickhouse.tech/docs/zh/sql-reference/functions/hash-functions/

arrayConcat, arrayIntersect: https://clickhouse.tech/docs/zh/sql-reference/functions/array-functions/#arrayintersectarr

2. Implementation with bitmaps

Table structure

create table q_imis_query
(
    task_id UInt64,
    imsi String,
    insert_time DateTime
);

1) Intersection query

-- a01, a02, a03: one bitmap of 32-bit imsi hashes per task
with
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
  cityHash64(task_id)%5000=cityHash64(1270)%5000
  and task_id=1270
  -- keep rows whose imsi hash is in all three bitmaps
  and has(bitmapToArray(bitmapAnd(bitmapAnd(a01,a02),a03)),xxHash32(imsi))=1;
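The membership test above first expands the bitmap into an array with bitmapToArray and then calls has. One possible alternative is bitmapContains, which checks a UInt32 value against the bitmap directly. The sketch below shows it on made-up literal data rather than on the tables above; whether it is faster on real data was not measured here.

```sql
SELECT
    bitmapContains(bitmapAnd(bitmapBuild([1, 2, 3]), bitmapBuild([2, 3, 4])), toUInt32(3)) AS in_both,       -- 1
    bitmapContains(bitmapAnd(bitmapBuild([1, 2, 3]), bitmapBuild([2, 3, 4])), toUInt32(1)) AS only_in_first; -- 0
```

Applied to the intersection query, the last condition would then read bitmapContains(bitmapAnd(bitmapAnd(a01,a02),a03), xxHash32(imsi)).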

2) Difference query

-- a01, a02, a03: one bitmap of 32-bit imsi hashes per task
with
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select bitmapBuild(groupArray(xxHash32(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
  cityHash64(task_id)%5000=cityHash64(1270)%5000
  and task_id=1270
  -- keep rows whose imsi hash is in a01 but in neither a02 nor a03
  and has(bitmapToArray(bitmapAndnot(bitmapAndnot(a01,a02),a03)),xxHash32(imsi))=1;
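Chaining bitmapAndnot twice, as in the query above, removes from the first bitmap everything that appears in either of the other two. The sketch below checks this on made-up literal data by comparing the chained form with subtracting the union (bitmapOr) in one step.

```sql
SELECT
    -- {1,2,3,4} minus {2}, then minus {4} leaves {1,3}
    bitmapCardinality(bitmapAndnot(bitmapAndnot(bitmapBuild([1, 2, 3, 4]), bitmapBuild([2])), bitmapBuild([4]))) AS chained,   -- 2
    -- {1,2,3,4} minus ({2} union {4}) leaves the same two members
    bitmapCardinality(bitmapAndnot(bitmapBuild([1, 2, 3, 4]), bitmapOr(bitmapBuild([2]), bitmapBuild([4]))))     AS via_union; -- 2
```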

3. Implementation with arrays

1) Intersection query

-- a01, a02, a03: one plain array of 64-bit imsi hashes per task
-- (no bitmapBuild here: arrayIntersect works on arrays, not on bitmap objects)
with
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
  cityHash64(task_id)%5000=cityHash64(1270)%5000
  and task_id=1270
  -- keep rows whose imsi hash is in a01 and in at least one of a02, a03
  and has(arrayConcat(arrayIntersect(a01,a02),arrayIntersect(a01,a03)),cityHash64(imsi))=1;
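Note that concatenating two pairwise intersections is not the same as a three-way intersection: it also keeps values that are in a01 and in only one of the other two arrays. If the intent is the same three-way intersection as in the bitmap version, arrayIntersect accepts all three arrays at once. A small sketch on made-up literal arrays (a, b, c are hypothetical names):

```sql
WITH
    [1, 2, 3] AS a,
    [2, 3, 4] AS b,
    [3, 4, 5] AS c
SELECT
    has(arrayConcat(arrayIntersect(a, b), arrayIntersect(a, c)), 2) AS concat_of_pairs, -- 1: 2 is in a and b, although not in c
    has(arrayIntersect(a, b, c), 2)                                 AS three_way;       -- 0: 2 is missing from c
```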

2) Difference query

-- a01, a02, a03: one plain array of 64-bit imsi hashes per task
with
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select groupArray(cityHash64(imsi)) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where
  cityHash64(task_id)%5000=cityHash64(1270)%5000
  and task_id=1270
  -- keep rows whose imsi hash is shared with neither a02 nor a03
  and has(arrayConcat(arrayIntersect(a01,a02),arrayIntersect(a01,a03)),cityHash64(imsi))=0;

4. The earlier implementation: returns 0 rows when the data volume is large

Intersection example

-- problematic version: 64-bit cityHash64 values are put into the bitmaps
with
  (select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270) as a01,
  (select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(1271)%5000 and task_id=1271) as a02,
  (select bitmapBuild(groupArray(cityHash64(imsi))) from testa where cityHash64(task_id)%5000=cityHash64(8888)%5000 and task_id=8888) as a03
select *
from q_imis_query
where cityHash64(task_id)%5000=cityHash64(1270)%5000 and task_id=1270 and has(bitmapToArray(bitmapAnd(bitmapAnd(a01,a02),a03)),cityHash64(imsi))=1;

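Cause

A bitmap object keeps its members in a Set while it has 32 or fewer of them, and switches to a RoaringBitmap once it has more than 32. The RoaringBitmap holds UInt32 values, so as soon as the number of elements exceeds 32, the 64-bit hashes are truncated and no longer match cityHash64(imsi).

Bitmaps support 32-bit values; ClickHouse 20.6 does not yet support 64-bit ones. That is why cityHash64 was replaced with xxHash32.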
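To make the truncation concrete, the sketch below assumes that a 64-bit value stored in a 32-bit bitmap keeps only its low 32 bits. That is an assumption about the exact mechanism, used here only to show why the lookup stops matching; the constant is made up for the example.

```sql
SELECT
    toUInt64(4294967297)              AS v64,           -- 2^32 + 1, does not fit in UInt32
    bitAnd(v64, toUInt64(4294967295)) AS low_32,        -- keep only the low 32 bits: 1
    v64 = low_32                      AS still_matches; -- 0: the stored value no longer equals the probe value
```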

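5. Bitmap functions

Bitmap functions compute over two bitmap objects and return a new bitmap or a cardinality, for example and, or, xor, not, and so on.

A bitmap object can be constructed in two ways: by the aggregate function groupBitmapState, or from an Array object. A bitmap object can also be converted back into an array.

ClickHouse uses RoaringBitmap for the actual storage of bitmap objects. When the cardinality is 32 or less, it uses a Set; when the cardinality is greater than 32, it uses a RoaringBitmap. This is why low-cardinality sets are stored faster.

RoaringBitmap: https://github.com/RoaringBitmap/CRoaring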
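The two construction routes and the conversion back to an array can be sketched as follows. The numbers(5) table function stands in for real data, and the values are cast to UInt32 to stay within what the bitmap supports.

```sql
SELECT
    bitmapCardinality(groupBitmapState(toUInt32(number)))        AS from_aggregate, -- 5: built by the aggregate function
    bitmapCardinality(bitmapBuild(groupArray(toUInt32(number)))) AS from_array,     -- 5: built from an array
    length(bitmapToArray(groupBitmapState(toUInt32(number))))    AS back_to_array   -- 5: converted back into an array
FROM numbers(5);
```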

6. Speed comparison of several hash algorithms

https://cyan4973.github.io/xxHash/

https://clickhouse.tech/docs/zh/sql-reference/functions/hash-functions/

