Bloom Filter (a filtering algorithm for hundreds of millions of records)

This article is a repost. Posted 2020-12-03 19:12. Tags: Redis, Bloom filter

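Introduction

We will get to know the Bloom filter step by step. Start with a question: how does a crawler decide whether it has already seen a URL? The first idea is probably to put every URL into a set, but once the data grows large, keeping every URL in a set is no longer practical.

The next idea is an array plus a hash function.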

index = hash(URL) % table.length

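That is, take the hash of the URL modulo the array length to get an array index, then set table[index] = 1. Every element of the array starts at 0.

Each time a new URL arrives, compute its index and look at table[index]. If the value is 0, the URL has definitely not been seen. If it is 1, the URL may have been seen, because a hash collision is possible. For example, the first URL gives hash(www.baidu.com) % table.length = 1, so table[1] = 1. A second URL gives hash(www.javashitang.com) % table.length = 1, and table[1] is already 1, so the system concludes that www.javashitang.com has been crawled when in fact it has not.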
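
The single-hash table described above can be sketched in a few lines of Java. This is only an illustration: the class name SingleHashTable, its method names, and the example.com URLs are assumptions made for the sketch, and String.hashCode() stands in for whatever hash function a real crawler would use.

```java
// Hypothetical sketch: a table of 0/1 flags indexed by one hash of the URL.
class SingleHashTable {
    private final byte[] table; // every slot starts at 0

    SingleHashTable(int length) {
        this.table = new byte[length];
    }

    // index = hash(URL) % table.length; floorMod keeps the index non-negative
    int indexOf(String url) {
        return Math.floorMod(url.hashCode(), table.length);
    }

    void markSeen(String url) {
        table[indexOf(url)] = 1;
    }

    // false: definitely never marked; true: marked, or sharing a slot with a marked URL
    boolean mightHaveSeen(String url) {
        return table[indexOf(url)] == 1;
    }

    public static void main(String[] args) {
        SingleHashTable seen = new SingleHashTable(16);
        seen.markSeen("www.baidu.com");
        System.out.println(seen.mightHaveSeen("www.baidu.com"));

        // With only 16 slots, a URL that was never marked soon lands on the same slot.
        for (int i = 0; i < 1000; i++) {
            String other = "www.example.com/page" + i;
            if (seen.mightHaveSeen(other)) {
                System.out.println("false positive: " + other);
                break;
            }
        }
    }
}
```

The table is kept tiny on purpose so that a collision appears quickly; a larger table makes collisions rarer but never rules them out.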

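From this we can draw a basic conclusion: the fewer the hash collisions, the lower the false-positive rate.

How can hash collisions be reduced?

Increase the array length
Improve the hashing by checking several hash functions for each element

An element is considered present only when the positions computed by all of the hash functions hold 1; if any one of them holds 0, the element is considered absent. This lowers the chance that a collision produces a wrong answer.

Are more hash functions always better? No. Each extra hash function sets more positions to 1 per element, so the array fills up faster and collisions increase. The number of hash functions should be neither too large nor too small.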
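
The trade-off can be made concrete with the standard approximation for a Bloom filter's false-positive probability, p ≈ (1 − e^(−kn/m))^k, for m bits, n inserted elements and k hash functions. The formula does not appear in the original text, and the class and method names below are illustrative; treat this as a rough sketch for comparing values of k, not a precise model.

```java
// Hypothetical helper for comparing different numbers of hash functions.
class BloomMath {
    // Approximate false-positive probability: (1 - e^(-k*n/m))^k
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // The k that minimizes the approximation: (m / n) * ln 2, rounded, at least 1
    static int optimalHashCount(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long m = 10_000_000L; // bits in the array
        long n = 1_000_000L;  // elements inserted
        for (int k : new int[]{1, 3, 7, 15, 30}) {
            System.out.printf("k=%d -> %.5f%n", k, falsePositiveRate(m, n, k));
        }
        System.out.println("optimal k = " + optimalHashCount(m, n));
    }
}
```

With 10 bits per element the estimate falls as k grows from 1 to about 7 and then rises again, which matches the point that too many hash functions hurt as much as too few.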

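You may not have noticed, but you now understand how a Bloom filter works. The only difference is that a Bloom filter stores its 0s and 1s as bits rather than as array elements. A bit array with 1,000,000 positions takes only 1000000 bits / 8 = 125000 bytes = 125000 / 1024 KB ≈ 122 KB. Quite a bargain.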
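
The arithmetic above is easy to check in code. The sketch below uses java.util.BitSet as the bit array; the class name BitArraySize and the helper bytesForBits are assumptions made for illustration, and the figure covers only the bits themselves, not the JVM's object overhead.

```java
import java.util.BitSet;

// Hypothetical helper that reproduces the size calculation for a bit array.
class BitArraySize {
    // Bytes needed to hold the given number of bits, rounded up to whole bytes
    static long bytesForBits(long bits) {
        return (bits + 7) / 8;
    }

    public static void main(String[] args) {
        long bits = 1_000_000L;
        long bytes = bytesForBits(bits);
        // prints "125000 bytes, about 122 KB"
        System.out.println(bytes + " bytes, about " + bytes / 1024 + " KB");

        // An int[] of the same length would need 4 bytes per element instead of 1 bit.
        System.out.println("int[] of the same length: " + bits * 4 + " bytes");

        BitSet flags = new BitSet((int) bits);
        flags.set(42);
        System.out.println(flags.get(42) + " " + flags.get(43)); // prints "true false"
    }
}
```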

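To summarize, a Bloom filter has these properties:

When a Bloom filter says an element is present, the element may in fact be absent, because hash collisions cause false positives
When a Bloom filter says an element is absent, it is definitely absent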
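
Both properties can be observed with a deliberately undersized filter. The sketch below is not the filter built later in the article: TinyBloom, its double-hashing scheme (h1 + i * h2), and the "seen-"/"new-" keys are all assumptions chosen to keep the example short.

```java
import java.util.BitSet;

// Hypothetical miniature Bloom filter, kept small so false positives are easy to provoke.
class TinyBloom {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    TinyBloom(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // i-th position via double hashing; this scheme is an assumption of the sketch
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd step so successive positions differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String s) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(s, i));
        }
    }

    boolean mightContain(String s) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(s, i))) {
                return false; // one 0 bit is enough: definitely absent
            }
        }
        return true; // all bits are 1: present, or a false positive
    }

    public static void main(String[] args) {
        TinyBloom filter = new TinyBloom(4096, 3);
        for (int i = 0; i < 500; i++) {
            filter.add("seen-" + i);
        }
        int falseNegatives = 0;
        for (int i = 0; i < 500; i++) {
            if (!filter.mightContain("seen-" + i)) falseNegatives++;
        }
        int falsePositives = 0;
        for (int i = 0; i < 10_000; i++) {
            if (filter.mightContain("new-" + i)) falsePositives++;
        }
        System.out.println("false negatives: " + falseNegatives); // always 0
        System.out.println("false positives out of 10000: " + falsePositives);
    }
}
```

The false-negative count is always 0 because adding an element sets exactly the bits that a later lookup checks. The false-positive count depends on the filter size, the number of hash functions and the keys.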

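Use cases

Checking whether a given item exists in a massive data set, for example to prevent cache penetration
Checking in a crawler whether a URL has already been processed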
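
For the cache penetration case, the filter sits in front of both the cache and the database, so that keys which cannot exist are rejected without touching either. The sketch below is a hypothetical illustration: GuardedLookup, the in-memory maps standing in for the cache and the database, and the Predicate standing in for a Bloom filter's contains check are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical read path guarded by a "might exist" check.
class GuardedLookup {
    private final Predicate<String> mightExist;             // stands in for bloomFilter.contains(key)
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database;             // stands in for the slow backing store
    private int databaseReads = 0;

    GuardedLookup(Predicate<String> mightExist, Map<String, String> database) {
        this.mightExist = mightExist;
        this.database = database;
    }

    Optional<String> get(String key) {
        if (!mightExist.test(key)) {
            return Optional.empty(); // "absent" is certain: skip cache and database
        }
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        databaseReads++;
        String value = database.get(key); // can still be null after a false positive
        if (value != null) {
            cache.put(key, value);
        }
        return Optional.ofNullable(value);
    }

    int databaseReads() {
        return databaseReads;
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("user:1", "Alice");
        // An exact containsKey check is used here for brevity; a real Bloom filter
        // would occasionally answer true for a missing key.
        GuardedLookup lookup = new GuardedLookup(db::containsKey, db);
        System.out.println(lookup.get("user:1"));
        System.out.println(lookup.get("user:999"));
        System.out.println("database reads: " + lookup.databaseReads());
    }
}
```

A false positive only costs one unnecessary database read, while a key that the filter rejects never reaches the database at all.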

A hand-written Bloom filter:

 
import java.util.BitSet;

public class MyBloomFilter {

    // Size of the bit array: 2 << 24 = 33,554,432 bits (about 4 MB)
    private static final int DEFAULT_SIZE = 2 << 24;

    // Seeds for the hash functions, one hash function per seed
    private static final int[] SEEDS = new int[]{3, 13, 46};

    // Bit array; each position can only be 0 or 1
    private final BitSet bits = new BitSet(DEFAULT_SIZE);

    // Hash functions
    private final SimpleHash[] func = new SimpleHash[SEEDS.length];

    public MyBloomFilter() {
        for (int i = 0; i < SEEDS.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, SEEDS[i]);
        }
    }

    // Add an element: set the bit chosen by every hash function
    public void add(Object value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    // Check whether an element may be present in the bit array
    public boolean contains(Object value) {
        for (SimpleHash f : func) {
            // If any hash function points at a 0 bit, return false right away
            if (!bits.get(f.hash(value))) {
                return false;
            }
        }
        return true;
    }

    // Hash function class
    public static class SimpleHash {

        private final int cap;
        private final int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        public int hash(Object value) {
            if (value == null) {
                return 0;
            }
            int h = value.hashCode();
            // Mix the high bits into the low bits, scale by the seed, then mask with
            // cap - 1 (cap is a power of two) so the index always falls in [0, cap)
            return (cap - 1) & (seed * (h ^ (h >>> 16)));
        }
    }

    public static void main(String[] args) {
        Integer value1 = 13423;
        Integer value2 = 22131;
        MyBloomFilter filter = new MyBloomFilter();
        // false
        System.out.println(filter.contains(value1));
        // false
        System.out.println(filter.contains(value2));
        filter.add(value1);
        filter.add(value2);
        // true
        System.out.println(filter.contains(value1));
        // true
        System.out.println(filter.contains(value2));
    }
}

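Implementing a Bloom filter with Google's Guava library:

In production you would normally not use a hand-written Bloom filter; the utility class already written by Google's engineers is enough.

Add the following dependency: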

 
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>27.0.1-jre</version>
</dependency>

 
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Create the Bloom filter: at most 500 expected elements, target false-positive probability 0.01
BloomFilter<Integer> filter = BloomFilter.create(
        Funnels.integerFunnel(), 500, 0.01);

// Check whether an element may be present
// false
System.out.println(filter.mightContain(1));
// false
System.out.println(filter.mightContain(2));

// Add elements to the Bloom filter
filter.put(1);
filter.put(2);

// true
System.out.println(filter.mightContain(1));
// true
System.out.println(filter.mightContain(2));

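Using the Bloom filter in Redis:

Since version 4.0, Redis provides a Bloom filter in the form of a plugin (module). Here is a quick demonstration.

Install and start it with docker: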

 
docker pull redislabs/rebloom
docker run -itd --name redis -p 6379:6379 redislabs/rebloom
docker exec -it redis /bin/bash
redis-cli

Commonly used commands:

127.0.0.1:6379> bf.add test 1
(integer) 1
127.0.0.1:6379> bf.add test 2
(integer) 1
127.0.0.1:6379> bf.exists test 1
(integer) 1
127.0.0.1:6379> bf.exists test 3
(integer) 0
127.0.0.1:6379> bf.exists test 4
(integer) 0
