Bloom Filters in Guava
Example:
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class GuavaBloomFilter {
    public static void main(String[] args) {
        // Size the filter for 100,000 expected insertions and a 1% false positive rate.
        BloomFilter<String> bloomFilter =
                BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), 100000, 0.01);
        bloomFilter.put("shenzhen");
        // Never inserted: false unless a false positive occurs.
        System.out.println(bloomFilter.mightContain("guangzhou"));
        // Inserted: always true, a Bloom filter has no false negatives.
        System.out.println(bloomFilter.mightContain("shenzhen"));
    }
}
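Result:
false
true
The walkthrough below uses the Guava 27.0.1 source, where the Bloom filter (BF) logic lives in the com.google.common.hash.BloomFilter class. Let's start reading the code.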
Member fields of the BloomFilter class
There are only four.
/** The bit set of the BloomFilter (not necessarily power of 2!) */
private final LockFreeBitArray bits;

/** Number of hashes per element */
private final int numHashFunctions;

/** The funnel to translate Ts to bytes */
private final Funnel<? super T> funnel;

/** The strategy we employ to map an element T to {@code numHashFunctions} bit indexes. */
private final Strategy strategy;
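- bits is the bit array of length m described earlier, wrapped in the LockFreeBitArray type.
- numHashFunctions is the number of hash functions, k.
- funnel is an instance of a Funnel implementation. A Funnel decomposes an input of arbitrary type T into Java primitive values (byte, int, char, and so on); here the data ends up as bytes that are fed to the hash function.
- strategy is the filter's hashing strategy, i.e. how an element is mapped to positions in the bit array. The concrete implementations are in the BloomFilterStrategies enum.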
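To make the role of the bits field more concrete, here is a minimal sketch of a bit array that can be updated without locks, built only on the JDK's AtomicLongArray. The class name SketchBitArray and all of its details are illustrative assumptions; this is not Guava's LockFreeBitArray source, only the general compare-and-set idea such a structure can rely on.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Illustrative lock-free bit array (hypothetical, not Guava's class). */
final class SketchBitArray {
    private final AtomicLongArray words;
    private final long bitSize;

    SketchBitArray(long bits) {
        if (bits <= 0) {
            throw new IllegalArgumentException("bits must be positive");
        }
        // Round up to whole 64-bit words, so the usable size is a multiple of 64.
        int wordCount = Math.toIntExact((bits + 63) / 64);
        this.words = new AtomicLongArray(wordCount);
        this.bitSize = (long) wordCount * 64;
    }

    long bitSize() {
        return bitSize;
    }

    /** Sets the bit at index; returns true only if this call flipped it from 0 to 1. */
    boolean set(long index) {
        int word = (int) (index >>> 6);
        long mask = 1L << index; // a long shift uses only the low 6 bits of index
        long oldValue;
        long newValue;
        do {
            oldValue = words.get(word);
            newValue = oldValue | mask;
            if (oldValue == newValue) {
                return false; // bit was already set
            }
        } while (!words.compareAndSet(word, oldValue, newValue)); // retry on contention
        return true;
    }

    boolean get(long index) {
        return (words.get((int) (index >>> 6)) & (1L << index)) != 0;
    }

    public static void main(String[] args) {
        SketchBitArray bits = new SketchBitArray(100);
        System.out.println(bits.bitSize()); // 128
        System.out.println(bits.set(5));    // true
        System.out.println(bits.set(5));    // false
        System.out.println(bits.get(6));    // false
    }
}
```

Rounding the size up to whole 64-bit words is one way a bit array can end up with a length that is not a power of two, as the field comment notes.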
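Constructing a BloomFilter
The constructor of this class is private. Instances are created through the static create() factory methods. There are five overloads in total, and all of them eventually delegate to the one below, which is itself package-private and annotated with @VisibleForTesting.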
@VisibleForTesting
static <T> BloomFilter<T> create(
        Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    checkNotNull(funnel);
    checkArgument(
            expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
    checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
    checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
    checkNotNull(strategy);

    if (expectedInsertions == 0) {
        expectedInsertions = 1;
    }
    /*
     * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
     * is proportional to -log(p), but there is not much of a point after all, e.g.
     * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
     */
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
        return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
        throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
}
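The method takes four parameters: funnel is the Funnel for the inserted data, expectedInsertions is the expected total number of inserted elements n, fpp is the desired false positive probability p, and strategy is the hashing strategy.
As the code shows, the bit array length m and the number of hash functions k are estimated by optimalNumOfBits() and optimalNumOfHashFunctions() respectively.
Estimating the optimal m and k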
@VisibleForTesting
static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
        p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}

@VisibleForTesting
static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
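To understand these two methods, we have to continue the derivation from the previous section.
From the approximate formula for the false positive rate, for given m and n the false positive rate is minimized when k is:
$$k = \frac{m}{n} \ln 2$$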
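This is the logic of optimalNumOfHashFunctions(). So how is m estimated?
Substituting k = (m/n) ln 2 into the formula from the previous section and simplifying gives the relationship between the expected false positive rate p and m, n:
$$p = \left(\frac{1}{2}\right)^{k} = 2^{-\frac{m}{n} \ln 2}$$
That is:
$$m = -\frac{n \ln p}{(\ln 2)^2}$$
This is the logic of optimalNumOfBits().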
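As a worked example, the sizing for the filter created at the top of the article (n = 100000, p = 0.01) can be reproduced with plain JDK code. The sketch below copies the two formulas from the listing above into a hypothetical BloomSizing class (the class name is an assumption for illustration) and leaves out the p == 0 guard for brevity.

```java
/** Re-computes Bloom filter sizing with the same formulas as the Guava listing. */
final class BloomSizing {

    /** m = -n * ln(p) / (ln 2)^2, truncated to a whole number of bits. */
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** k = (m / n) * ln 2, rounded to the nearest integer and at least 1. */
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 100_000;  // expected insertions
        double p = 0.01;   // desired false positive probability
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.println("m = " + m + " bits, k = " + k);
        // prints: m = 958505 bits, k = 7
    }
}
```

So a 1% false positive rate costs a little under 10 bits per expected element, about 117 KiB in total here, with 7 bit positions probed per element.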
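The following can also be concluded from the above:
- For a given expected false positive rate p, the optimal m grows linearly with the expected number of elements n.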
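- The optimal k depends only on p, not on m or n: combining $k = \frac{m}{n} \ln 2$ with $m = -\frac{n \ln p}{(\ln 2)^2}$ gives $k = -\frac{\ln p}{\ln 2} = -\log_2 p$.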
So when creating a BloomFilter, it is important to choose suitable values for p and n.
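To see why n matters so much, the sketch below estimates the false positive rate of an already sized filter as more elements are inserted than planned. It assumes the standard approximation p ≈ (1 - e^(-kn/m))^k, taken here to be the formula the previous section refers to; the class name BloomFppEstimate and the chosen values of n are illustrative.

```java
/** Estimates the false positive rate of a fixed-size Bloom filter as it fills up. */
final class BloomFppEstimate {

    /** Standard approximation: p = (1 - e^(-k * n / m))^k. */
    static double estimateFpp(long m, int k, long n) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 958_505; // bits, sized for n = 100,000 and p = 0.01
        int k = 7;        // hash functions for that sizing
        for (long n : new long[] {50_000, 100_000, 200_000, 400_000}) {
            System.out.printf("n = %d -> estimated fpp = %.4f%n", n, estimateFpp(m, k, n));
        }
    }
}
```

With these numbers the estimate stays close to 1% at the planned 100,000 insertions but climbs well above 10% once twice as many elements are inserted, so underestimating n quickly defeats the chosen p.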
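Hashing strategies
The BloomFilterStrategies enum defines two hashing strategies, MURMUR128_MITZ_32 and MURMUR128_MITZ_64, both built on the well-known MurmurHash algorithm (murmur3_128 in the code). The former is a simplified version, so we will look at how the latter is implemented.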
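Before the listing, a minimal sketch of the index derivation that its put() and mightContain() methods share may help: k bit positions are derived from just two 64-bit hash values by repeatedly adding the second to the first. The class DoubleHashingSketch and the hand-picked hash values are assumptions for illustration; in the real code the two values come from the 128-bit Murmur3 hash of the element.

```java
import java.util.Arrays;

/** Derives k bit indexes from two hash values, mirroring the loop in the listing. */
final class DoubleHashingSketch {

    static long[] bitIndexes(long hash1, long hash2, int numHashFunctions, long bitSize) {
        long[] indexes = new long[numHashFunctions];
        long combinedHash = hash1;
        for (int i = 0; i < numHashFunctions; i++) {
            // Clear the sign bit so the modulo yields a valid non-negative index.
            indexes[i] = (combinedHash & Long.MAX_VALUE) % bitSize;
            combinedHash += hash2;
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Tiny, hand-picked values so the arithmetic is easy to follow.
        System.out.println(Arrays.toString(bitIndexes(5L, 3L, 4, 10L)));
        // prints: [5, 8, 1, 4]
    }
}
```

put() sets every one of these positions, while mightContain() returns false as soon as one of them is unset.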
MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(
            T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
        long bitSize = bits.bitSize();
        byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
        long hash1 = lowerEight(bytes);
        long hash2 = upperEight(bytes);

        boolean bitsChanged = false;
        long combinedHash = hash1;
        for (int i = 0; i < numHashFunctions; i++) {
            // Make the combined hash positive and indexable
            bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
            combinedHash += hash2;
        }
        return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(
            T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
        long bitSize = bits.bitSize();
        byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
        long hash1 = lowerEight(bytes);
        long hash2 = upperEight(bytes);

        long combinedHash = hash1;
        for (int i = 0; i < numHashFunctions; i++) {
            // Make the combined hash positive and indexable
            if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
                return false;
            }
            combinedHash += hash2;
        }
        return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
        return Longs.fromBytes(
                bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[