Introduction to the Roaring Bitmaps Data Structure


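Background:

  BitMap is a commonly used data structure. Bitmap indexes are widely used in databases and search engines: they can quickly determine whether a given value is present, they store sets of integers compactly, and they can significantly speed up queries. However, a plain BitMap can still take a lot of memory (its size grows linearly with the range of values it covers), so in practice we usually need to compress it. Roaring Bitmaps (RBM for short) is one such compression scheme.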
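  To make the memory cost concrete, here is a minimal sketch of an uncompressed bitmap. The class name `SimpleBitmap` and its methods are hypothetical, chosen only for illustration; they are not taken from any library.

```python
class SimpleBitmap:
    """Uncompressed bitmap over the range [0, size). Illustrative only."""

    def __init__(self, size: int) -> None:
        self.size = size
        # One bit per possible value, rounded up to whole bytes.
        self.bits = bytearray((size + 7) // 8)

    def add(self, value: int) -> None:
        if not 0 <= value < self.size:
            raise ValueError("value out of range")
        self.bits[value >> 3] |= 1 << (value & 7)

    def __contains__(self, value: int) -> bool:
        if not 0 <= value < self.size:
            return False
        return bool(self.bits[value >> 3] & (1 << (value & 7)))

    def nbytes(self) -> int:
        # Memory depends on the covered range, not on how many values are set.
        return len(self.bits)


bm = SimpleBitmap(1_000_000)
bm.add(42)
bm.add(999_999)
```

  Here `bm` holds only two values but still occupies 125,000 bytes. Covering the full 32-bit range the same way would take 2^32 / 8 bytes = 512 MiB, however few values are actually present, which is the problem RBM addresses.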

  In short: BitMap is a data structure / compression technique, and RBM is a data structure / compression technique built on the BitMap idea.

How it works:

  Here is an excerpt from the original paper:

  1. We partition the range of 32-bit indexes ([0; n)) into chunks of 2^16 integers sharing the same 16 most significant digits. We use specialized containers to store their 16 least significant bits.
  2. When a chunk contains no more than 4096 integers, we use a sorted array of packed 16-bit integers. When there are more than 4096 integers, we use a 2^16-bit bitmap. Thus, we have two types of containers: an array container for sparse chunks and a bitmap container for dense chunks. The 4096 threshold insures that at the level of the containers, each integer uses no more than 16 bits: we either use 2^16 bits for more than 4096 integers, using less than 16 bits/integer, or else we use exactly 16 bits/integer.
  3. The containers are stored in a dynamic array with the shared 16 most-significant bits: this serves as a first-level index. The array keeps the containers sorted by the 16 most-significant bits. We expect this first-level index to be typically small: when n = 1 000 000, it contains at most 16 entries. Thus it should often remain in the CPU cache. The containers themselves should never use much more than 8 kB.
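
  The "at most 16 entries" figure in point 3 follows from dividing the range into chunks of 2^16 values. A small check of that arithmetic (the function name is an assumption made for this sketch):

```python
CHUNK_SIZE = 1 << 16  # 65536 values share the same high 16 bits


def max_first_level_entries(n: int) -> int:
    """Upper bound on distinct high-16-bit keys for values in [0, n)."""
    return (n + CHUNK_SIZE - 1) // CHUNK_SIZE  # ceiling division


entries_for_one_million = max_first_level_entries(1_000_000)  # 16
entries_for_full_range = max_first_level_entries(1 << 32)     # 65536
```

  Even in the worst case, the first-level index for the full 32-bit range has only 65536 entries.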

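  In plain language:

  1. Split each 32-bit value in [0, n) into two parts: the high 16 bits and the low 16 bits.

  2. The high 16 bits are used to locate where the data is stored, and the low 16 bits are stored in a container. (This is essentially a map-like structure, similar in spirit to a HashMap, except that the keys are kept in a sorted array rather than hashed.)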
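
  The split into high and low halves is plain bit arithmetic. A minimal sketch, assuming unsigned 32-bit inputs (the helper names `split` and `join` are hypothetical):

```python
from typing import Tuple


def split(value: int) -> Tuple[int, int]:
    """Return (high 16 bits, low 16 bits) of an unsigned 32-bit integer."""
    if not 0 <= value < (1 << 32):
        raise ValueError("expected an unsigned 32-bit integer")
    return value >> 16, value & 0xFFFF


def join(high: int, low: int) -> int:
    """Inverse of split: rebuild the 32-bit value."""
    return (high << 16) | low


high, low = split(131075)  # 131075 = 2 * 65536 + 3, so (2, 3)
```

  The high part selects the container and the low part is what the container actually stores.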

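  A note on containers: the containers are held in a dynamic array. When a chunk holds no more than 4096 values, they are stored in a sorted array of 16-bit shorts; when it holds more than 4096, they are stored in a 2^16-bit BitMap.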
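
  The two-level layout can be sketched as follows. This is a toy illustration under simplifying assumptions (add and membership only, no removal, unsigned 32-bit inputs), not the implementation of any real Roaring library; the class `ToyRoaring` and its method names are made up for this example.

```python
from array import array
from bisect import bisect_left

ARRAY_MAX = 4096      # threshold quoted in the paper excerpt
BITMAP_BYTES = 8192   # 2^16 bits


class ToyRoaring:
    """Toy structure: sorted high-16-bit keys -> container of low 16 bits."""

    def __init__(self) -> None:
        self.keys = []        # first-level index, kept sorted
        self.containers = []  # containers[i] belongs to keys[i]

    def _find(self, high: int) -> int:
        """Index of `high` in self.keys, or -1 if absent."""
        i = bisect_left(self.keys, high)
        return i if i < len(self.keys) and self.keys[i] == high else -1

    def add(self, value: int) -> None:
        high, low = value >> 16, value & 0xFFFF
        i = self._find(high)
        if i < 0:
            # New chunk: start with an empty array container.
            i = bisect_left(self.keys, high)
            self.keys.insert(i, high)
            self.containers.insert(i, array("H"))
        c = self.containers[i]
        if isinstance(c, bytearray):
            c[low >> 3] |= 1 << (low & 7)  # bitmap container: set one bit
            return
        j = bisect_left(c, low)
        if j < len(c) and c[j] == low:
            return  # already present
        c.insert(j, low)  # keep the array container sorted
        if len(c) > ARRAY_MAX:
            # More than 4096 values: convert to a 2^16-bit bitmap.
            bitmap = bytearray(BITMAP_BYTES)
            for x in c:
                bitmap[x >> 3] |= 1 << (x & 7)
            self.containers[i] = bitmap

    def __contains__(self, value: int) -> bool:
        high, low = value >> 16, value & 0xFFFF
        i = self._find(high)
        if i < 0:
            return False
        c = self.containers[i]
        if isinstance(c, bytearray):
            return bool(c[low >> 3] & (1 << (low & 7)))
        j = bisect_left(c, low)
        return j < len(c) and c[j] == low

    def container_kind(self, high: int) -> str:
        i = self._find(high)
        if i < 0:
            raise KeyError(high)
        return "bitmap" if isinstance(self.containers[i], bytearray) else "array"


rbm = ToyRoaring()
for v in (1, 5, 70000):
    rbm.add(v)
```

  After these calls `rbm.keys` is `[0, 1]`: the values 1 and 5 share the key 0, and 70000 falls in the chunk with key 1. Both chunks are still array containers because each holds far fewer than 4096 values.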

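  Why use two different data structures for the low 16 bits?

    short array: 2 bytes * 4096 = 8192 bytes = 8 KB

    BitMap: covers the whole 16-bit range, 65536 bits / 8 = 8192 bytes = 8 KB, regardless of how many values are stored

  So below 4096 values the short array takes less space; at exactly 4096 the two are the same size, and beyond that the BitMap is smaller.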
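
  The break-even point can be checked with a few lines of arithmetic. The function names below are assumptions made for this sketch, and the sizes count only the payload (no per-object overhead):

```python
BITMAP_CONTAINER_BYTES = (1 << 16) // 8  # 8192 bytes, fixed


def array_container_bytes(count: int) -> int:
    """Payload size of an array container: 2 bytes per stored value."""
    return 2 * count


def cheaper_container(count: int) -> str:
    """Pick the smaller container; ties go to the array, as in the paper."""
    if array_container_bytes(count) <= BITMAP_CONTAINER_BYTES:
        return "array"
    return "bitmap"


for count in (100, 4096, 4097, 65536):
    print(count, array_container_bytes(count), cheaper_container(count))
```

  With 100 values the array needs only 200 bytes against the bitmap's fixed 8192; at 4097 values the array would need 8194 bytes, so the bitmap wins from that point on.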

  

 

