Guava's BloomFilter

Reposted article, 2021-04-25 09:17. Tags: algorithms, cache, guava, bloomfilter

Bloom filters in Guava

Example:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class GuavaBloomFilter {
    public static void main(String[] args) {
        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), 100000, 0.01);

        bloomFilter.put("shenzhen");

        System.out.println(bloomFilter.mightContain("guangzhou"));
        System.out.println(bloomFilter.mightContain("shenzhen"));
    }
}

Output:

false
true

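The walkthrough below uses the source code of Guava 27.0.1. The Bloom filter (BF) logic lives in the class com.google.common.hash.BloomFilter. Let's start reading the code.

Fields of the BloomFilter class

There are only four of them.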

  /** The bit set of the BloomFilter (not necessarily power of 2!) */
  private final LockFreeBitArray bits;

  /** Number of hashes per element */
  private final int numHashFunctions;

  /** The funnel to translate Ts to bytes */
  private final Funnel<? super T> funnel;

  /** The strategy we employ to map an element T to {@code numHashFunctions} bit indexes. */
  private final Strategy strategy;

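bits is the bit array of length m described earlier, wrapped in the LockFreeBitArray type.
numHashFunctions is the number of hash functions, k.
funnel is an instance of a Funnel implementation. It converts input of an arbitrary type T into Java primitive data (byte, int, char and so on); here the data ends up as bytes.
strategy is the Bloom filter's hashing strategy, that is, how an element is mapped onto the bit array. The concrete implementations are in the BloomFilterStrategies enum.

Constructing a BloomFilter

The constructor of this class is private. Instances are created through the static create() methods. There are five overloads in total, and all of them end up in the following logic.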

  @VisibleForTesting
  static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    checkNotNull(funnel);
    checkArgument(
        expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
    checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
    checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
    checkNotNull(strategy);

    if (expectedInsertions == 0) {
      expectedInsertions = 1;
    }
    /*
     * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
     * is proportional to -log(p), but there is not much of a point after all, e.g.
     * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
     */
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
      return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
  }

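The method takes four parameters: funnel is the Funnel for the inserted data, expectedInsertions is the expected total number of inserted elements n, fpp is the desired false positive probability p, and strategy is the hashing strategy.

As the code shows, the bit array length m and the number of hash functions k are estimated by optimalNumOfBits() and optimalNumOfHashFunctions() respectively.

Estimating the optimal m and k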

  @VisibleForTesting
  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  @VisibleForTesting
  static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

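To understand these two methods, we have to continue the derivation from the previous section.

From the standard approximation of the false positive rate, $p \approx (1 - e^{-kn/m})^k$, it follows that for given m and n the false positive rate is smallest when k is:

$$k = \frac{m}{n} \ln 2$$

This is exactly the logic of optimalNumOfHashFunctions(). How, then, should m be estimated?

Substituting this k into the formula from the previous section gives $e^{-kn/m} = 1/2$, so after simplification the expected false positive rate p relates to m and n as:

$$p = \left(\frac{1}{2}\right)^{k} = 2^{-\frac{m}{n} \ln 2}$$

That is:

$$m = -\frac{n \ln p}{(\ln 2)^2}$$

This is exactly the logic of optimalNumOfBits().

Two further conclusions follow:

If the expected false positive rate p is fixed, the optimal m is linear in the expected number of elements n.
The optimal k in fact depends only on p, not on m or n:

$$k = -\frac{\ln p}{\ln 2} = -\log_2 p$$

So choosing suitable values of p and n when creating a BloomFilter is important.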
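The two conclusions can be checked numerically. The following is a minimal sketch that re-implements the two sizing formulas with plain JDK math; the class name BloomSizingSketch and the chosen values of n are illustrative assumptions, not part of Guava.

```java
/** Sketch of the sizing formulas; class name and sample values are illustrative only. */
final class BloomSizingSketch {

    /** m = -n * ln(p) / (ln 2)^2, truncated to a whole number of bits. */
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** k = (m / n) * ln 2, rounded to the nearest integer and never less than 1. */
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        double p = 0.01;
        for (long n : new long[] {1_000L, 100_000L, 10_000_000L}) {
            long m = optimalNumOfBits(n, p);
            int k = optimalNumOfHashFunctions(n, m);
            // For a fixed p, bits per element stays constant (m is linear in n)
            // and k does not change with n.
            System.out.printf("n=%d m=%d bits/element=%.3f k=%d%n", n, m, (double) m / n, k);
        }
    }
}
```

With p = 0.01 the filter needs roughly 9.6 bits per element and 7 hash functions, whatever n is.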

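Hashing strategies

The BloomFilterStrategies enum defines two hashing strategies, both based on the well-known MurmurHash algorithm: MURMUR128_MITZ_32 and MURMUR128_MITZ_64. The former is a simplified version, so let's look at how the latter is implemented.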

  MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
          return false;
        }
        combinedHash += hash2;
      }
      return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
    }

    private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
    }
  };

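Here put() inserts an element into the Bloom filter, and mightContain() checks whether an element may be present. Let's walk through the flow using put() as the example.

Hash the data emitted by the funnel with MurmurHash, producing a 128-bit (16-byte) array.
Take the low 8 bytes as the first hash value hash1 and the high 8 bytes as the second hash value hash2.
Loop k times. Each iteration takes a combined hash built from hash1 and hash2, reduces it modulo m, and sets the corresponding bit in the bit array to 1.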
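The three steps can be sketched end to end with nothing but the JDK. This is a simplified, single-threaded illustration and not Guava's code: the class name MiniBloomSketch is made up, elements are restricted to String instead of going through a Funnel, and MD5 stands in for murmur3_128 purely because it ships with the JDK and also yields 128 bits.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Single-threaded sketch of the put()/mightContain() flow; not Guava's implementation. */
final class MiniBloomSketch {
    private final long[] words;   // the bit array, 64 bits per long
    private final long bitSize;   // m, rounded up to a whole number of words
    private final int numHashes;  // k

    MiniBloomSketch(long bits, int numHashes) {
        this.words = new long[(int) ((bits + 63) / 64)];
        this.bitSize = this.words.length * 64L;
        this.numHashes = numHashes;
    }

    /** ASSUMPTION: MD5 replaces murmur3_128 here only to avoid a Guava dependency. */
    private static long[] twoHashes(String s) {
        try {
            byte[] digest =
                MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            ByteBuffer buf = ByteBuffer.wrap(digest).order(ByteOrder.LITTLE_ENDIAN);
            return new long[] {buf.getLong(), buf.getLong()}; // low 8 bytes, then high 8 bytes
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Sets k bits; returns true if at least one of them was previously 0. */
    boolean put(String s) {
        long[] h = twoHashes(s);
        long combined = h[0];
        boolean changed = false;
        for (int i = 0; i < numHashes; i++) {
            long index = (combined & Long.MAX_VALUE) % bitSize;
            int word = (int) (index >>> 6);
            long mask = 1L << index; // only the low 6 bits of the shift distance are used
            changed |= (words[word] & mask) == 0;
            words[word] |= mask;
            combined += h[1];
        }
        return changed;
    }

    /** Returns false as soon as one of the k bits is 0. */
    boolean mightContain(String s) {
        long[] h = twoHashes(s);
        long combined = h[0];
        for (int i = 0; i < numHashes; i++) {
            long index = (combined & Long.MAX_VALUE) % bitSize;
            if ((words[(int) (index >>> 6)] & (1L << index)) == 0) {
                return false;
            }
            combined += h[1];
        }
        return true;
    }

    public static void main(String[] args) {
        MiniBloomSketch filter = new MiniBloomSketch(1L << 20, 7);
        filter.put("shenzhen");
        System.out.println(filter.mightContain("shenzhen"));
        System.out.println(filter.mightContain("guangzhou"));
    }
}
```

An inserted element is always reported as possibly present, because put() and mightContain() visit exactly the same k positions.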

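Two things are worth noting here.

First, the loop applies the idea of double hashing: two hash functions are used to simulate k of them, with i = 0, 1, ..., k - 1 as the step index:

$$g_i(x) = \big(h_1(x) + i \cdot h_2(x)\big) \bmod m$$

The same technique is often used in open-addressing hash tables to reduce collisions.

Second, the combined hash can be negative, and a negative number cannot address a position in the bit array. The hash is therefore ANDed bitwise with Long.MAX_VALUE, which clears its highest bit (the sign bit) and makes the value non-negative.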
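The sign-bit point is easy to verify in isolation. The sketch below is only an illustration (the class name SignMaskSketch and the helper toIndex are made up); it contrasts the masked expression from the strategy with two alternatives that do not work for every input.

```java
/** Sketch: why the combined hash is masked before the modulo. Names are illustrative. */
final class SignMaskSketch {

    /** Same expression as in the strategy: clear the sign bit, then reduce modulo m. */
    static long toIndex(long combinedHash, long bitSize) {
        return (combinedHash & Long.MAX_VALUE) % bitSize;
    }

    public static void main(String[] args) {
        long bitSize = 1_000L;
        long negative = -7L;

        // Java's % keeps the sign of the dividend, so the result is negative
        // and cannot be used as a bit index.
        System.out.println(negative % bitSize);

        // Math.abs is not a safe fix: abs(Long.MIN_VALUE) overflows and stays negative.
        System.out.println(Math.abs(Long.MIN_VALUE) < 0);

        // Masking the sign bit always yields an index in [0, bitSize).
        System.out.println(toIndex(negative, bitSize));
        System.out.println(toIndex(Long.MIN_VALUE, bitSize));
    }
}
```

Masking costs a single AND and has no special cases, which is why it is preferable to an absolute value here.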

Bit array implementation

Here is part of the code of the LockFreeBitArray class.

  static final class LockFreeBitArray {
    private static final int LONG_ADDRESSABLE_BITS = 6;
    final AtomicLongArray data;
    private final LongAddable bitCount;

    LockFreeBitArray(long bits) {
      this(new long[Ints.checkedCast(LongMath.divide(bits, 64, RoundingMode.CEILING))]);
    }

    // Used by serialization
    LockFreeBitArray(long[] data) {
      checkArgument(data.length > 0, "data length is zero!");
      this.data = new AtomicLongArray(data);
      this.bitCount = LongAddables.create();
      long bitCount = 0;
      for (long value : data) {
        bitCount += Long.bitCount(value);
      }
      this.bitCount.add(bitCount);
    }

    /** Returns true if the bit changed value. */
    boolean set(long bitIndex) {
      if (get(bitIndex)) {
        return false;
      }

      int longIndex = (int) (bitIndex >>> LONG_ADDRESSABLE_BITS);
      long mask = 1L << bitIndex; // only cares about low 6 bits of bitIndex

      long oldValue;
      long newValue;
      do {
        oldValue = data.get(longIndex);
        newValue = oldValue | mask;
        if (oldValue == newValue) {
          return false;
        }
      } while (!data.compareAndSet(longIndex, oldValue, newValue));

      // We turned the bit on, so increment bitCount.
      bitCount.increment();
      return true;
    }

    boolean get(long bitIndex) {
      return (data.get((int) (bitIndex >>> 6)) & (1L << bitIndex)) != 0;
    }

    // ....
  }

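It should now be clear why it is called a "LockFree" BitArray: it stores the bit array in the atomic type AtomicLongArray, so no locking is needed. There is also a counter of type LongAddable, a Guava-specific type, which tracks the number of bits set to 1.

Besides the concurrency benefit, the more important point of using AtomicLongArray is that it can represent a very long bit array. A long occupies 64 bits, so data[0] covers bits 0 to 63, data[1] covers bits 64 to 127, data[2] covers bits 128 to 191, and so on. With this layout, an unsigned right shift of the index i by 6 gives the position in the data array, and the mask 1L << i selects the bit within that word, because Java uses only the low 6 bits of the shift distance when shifting a long.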
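The index arithmetic and the compare-and-set loop can be tried out in a stripped-down form. The sketch below is a simplified stand-in rather than Guava's LockFreeBitArray: the class name AtomicBitArraySketch and the helpers wordIndex and mask are made up for illustration, and the bit counter is left out.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Sketch of a CAS-based bit array; a simplified stand-in, not Guava's LockFreeBitArray. */
final class AtomicBitArraySketch {
    private final AtomicLongArray data;

    AtomicBitArraySketch(long bits) {
        this.data = new AtomicLongArray((int) ((bits + 63) >>> 6)); // ceil(bits / 64) words
    }

    /** Which long holds the bit: bitIndex / 64. */
    static int wordIndex(long bitIndex) {
        return (int) (bitIndex >>> 6);
    }

    /** Which bit inside that long: Java shifts a long by (bitIndex & 63). */
    static long mask(long bitIndex) {
        return 1L << bitIndex;
    }

    /** Returns true if this call flipped the bit from 0 to 1. */
    boolean set(long bitIndex) {
        int word = wordIndex(bitIndex);
        long mask = mask(bitIndex);
        long oldValue;
        long newValue;
        do {
            oldValue = data.get(word);
            newValue = oldValue | mask;
            if (oldValue == newValue) {
                return false; // already set, possibly by another thread
            }
            // Retry if another thread changed the word between get() and compareAndSet().
        } while (!data.compareAndSet(word, oldValue, newValue));
        return true;
    }

    boolean get(long bitIndex) {
        return (data.get(wordIndex(bitIndex)) & mask(bitIndex)) != 0;
    }

    public static void main(String[] args) {
        AtomicBitArraySketch bits = new AtomicBitArraySketch(256);
        System.out.println(wordIndex(130));                 // 130 / 64
        System.out.println(Long.toBinaryString(mask(130))); // bit 130 % 64 of that word
        System.out.println(bits.set(130) + " " + bits.set(130) + " " + bits.get(130));
    }
}
```

Bit 130 lives in data[2] at position 2, so the first set() flips it and the second one reports no change.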

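One last remark: the code above uses Long.bitCount() to count the 1 bits in the binary representation of a long, which is one of the neatest bit tricks in the Java standard library: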

  public static int bitCount(long i) {
    // HD, Figure 5-14
    i = i - ((i >>> 1) & 0x5555555555555555L);
    i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
    i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
    i = i + (i >>> 8);
    i = i + (i >>> 16);
    i = i + (i >>> 32);
    return (int)i & 0x7f;
  }
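For comparison, a population count can also be done one bit at a time. The sketch below (the class name BitCountSketch and the method naiveBitCount are made up for illustration) checks a simple loop against Long.bitCount() on a few values. The JDK version instead sums the bits in parallel inside 2-bit, 4-bit and 8-bit fields and then folds the byte sums together, so it needs no loop at all.

```java
/** Sketch comparing a one-bit-at-a-time population count with Long.bitCount. */
final class BitCountSketch {

    /** Clears the lowest set bit on every iteration, so it loops once per 1 bit. */
    static int naiveBitCount(long value) {
        int count = 0;
        while (value != 0) {
            value &= value - 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 0b1011L, -1L, Long.MIN_VALUE, 0x5555555555555555L};
        for (long s : samples) {
            System.out.println(Long.toHexString(s) + " -> " + Long.bitCount(s)
                + " (naive: " + naiveBitCount(s) + ")");
        }
    }
}
```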
