If your everyday toolbox is hashmap, tree, set, vector, queue/stack/heap, linked list, and graph, you might think that's all there is to data structures. With the new year upon us, 卿哥 is here to share a higher-tier family of structures built specifically for big data: probabilistic data structures, also called approximation algorithms or online algorithms. Whether you're doing system design, chatting with your boss and coworkers, or sitting in a job interview, these will make people take notice. Today I'll teach you five moves: HyperLogLog for set cardinality, Bloom filter for set membership, MinHash for set similarity, count-min sketch for frequency counting, and t-digest for streaming quantiles. Master these five and you're ready to roam the jianghu and cross swords with the best.
Set Cardinality -- HyperLogLog
Set cardinality is a very common need: for example, how many unique IPs visited my site over some period? Any question that boils down to counting unique elements. The straightforward approach? Build a hashset, throw everything in, and return its size at the end. But what if the data is so large the hashset no longer fits in memory? Then you spread it across multiple machines with memcached or redis, and spill to disk when memory runs out; in short, you brute-force it.
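Just to make that brute-force baseline concrete, a minimal sketch (the sample IPs below are made up):

seen = set()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]:  # hypothetical visitor log
    seen.add(ip)
# memory grows with the number of distinct IPs
print "exact unique count %d" % len(seen)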
At this point you can flip the problem around and trade a little accuracy for a big memory saving. That is exactly what HyperLogLog does! The example below uses a 1% error bound to get an approximate count; it is extremely fast and uses very little memory. Here we use the implementation from https://github.com/svpcom/hyperloglog:
#!/usr/bin/python
import re

jabber_text = """
`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"
He took his vorpal sword in hand;
Long time the manxome foe he sought-
So rested he by the Tumtum tree
And stood awhile in thought.
And, as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!
One, two! One, two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.
"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
He chortled in his joy.
`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
"""

packer_text = """
My answers are inadequate
To those demanding day and date
And ever set a tiny shock
Through strangers asking what's o'clock;
Whose days are spent in whittling rhyme-
What's time to her, or she to Time?
"""

def clean_words(text):
    return filter(lambda x: len(x) > 0, re.sub("[^A-Za-z]", " ", text).split(" "))

jabber_words = clean_words(jabber_text.lower())
#print jabber_words

packer_words = clean_words(packer_text.lower())
#print packer_words

jabber_uniq = sorted(set(jabber_words))
#print jabber_uniq

import hyperloglog

hll = hyperloglog.HyperLogLog(0.01)
for word in jabber_words:
    hll.add(word)

print "prob count %d, true count %d" % (len(hll), len(jabber_uniq))
print "observed error rate %0.2f" % (abs(len(hll) - len(jabber_uniq))/float(len(jabber_uniq)))
Output:
prob count 90, true count 91
observed error rate 0.01
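A quick back-of-the-envelope on why the memory stays tiny (based on the standard HyperLogLog analysis, not the internals of this particular library): the relative error is roughly 1.04/sqrt(m), where m is the number of registers. A 1% error bound therefore needs m ≈ (1.04/0.01)^2 ≈ 10816 registers, which rounds up to the next power of two, 2^14 = 16384. At a few bits per register that is on the order of tens of kilobytes, whether you add a hundred items or a billion.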
Set Membership -- Bloom Filter
The Bloom filter is the one you should master above all the others; readers of my earlier "聊聊cassandra" post will find it familiar. In Cassandra's read path, if the row isn't in the memtable you consult the Bloom filter; if the Bloom filter says "no", you stop right there, it's really not there; if it says "yes", you go on to check the key cache and so on. The property that "no means definitely no" is incredibly powerful. The other half is that "yes" only means "maybe", but the false-positive rate can be bounded, here at 0.001. Below is an example using https://github.com/jaybaird/python-bloomfilter (I'll skip the code repeated from above):
from pybloom import BloomFilter

bf = BloomFilter(capacity=1000, error_rate=0.001)

for word in packer_words:
    bf.add(word)

intersect = set([])
for word in jabber_words:
    if word in bf:
        intersect.add(word)

print intersect
Output:
set(['and', 'in', 'o', 'to', 'through', 'time', 'my', 'day'])
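To see where the "no means no, yes means maybe" behavior comes from, here is a minimal sketch of the underlying mechanism, a bit array plus k hash functions. This is my own illustration, not how pybloom is implemented; BITS, K, and the salted-md5 hashing are arbitrary choices:

import hashlib

BITS = 1 << 16          # size of the bit array
K = 3                   # number of hash functions
bits = [0] * BITS

def _hashes(item):
    # derive K hash values by salting md5 with the hash index
    for i in xrange(K):
        yield int(hashlib.md5("%d:%s" % (i, item)).hexdigest(), 16) % BITS

def bf_add(item):
    for pos in _hashes(item):
        bits[pos] = 1

def bf_might_contain(item):
    # all K bits set -> "maybe there"; any bit unset -> definitely never added
    return all(bits[pos] for pos in _hashes(item))

bf_add("jabberwock")
print bf_might_contain("jabberwock")    # True
print bf_might_contain("bandersnatch")  # almost certainly False

A false positive happens only when every one of an item's K positions was already set by other items, which is why the error rate can be driven down by sizing the bit array and K appropriately.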
Set Similarity -- MinHash
This answers the question "how similar are these two documents?", which sounds useful for things like catching plagiarism. Let's look at it through the implementation at https://github.com/ekzhu/datasketch:
from datasketch import MinHash

def mh_digest(data):
    m = MinHash(num_perm=512)  # number of permutations
    for d in data:
        m.update(d.encode('utf8'))
    return m

m1 = mh_digest(set(jabber_words))
m2 = mh_digest(set(packer_words))
print "Jaccard similarity %f" % m1.jaccard(m2), "estimated"

s1 = set(jabber_words)
s2 = set(packer_words)
actual_jaccard = float(len(s1.intersection(s2)))/float(len(s1.union(s2)))
print "Jaccard similarity %f" % actual_jaccard, "actual"
Output:
Jaccard similarity 0.060547 estimated
Jaccard similarity 0.069565 actual
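The idea behind the estimate: run the same family of hash functions over both sets and keep only the minimum hash value per function; the probability that the two minima agree equals the Jaccard similarity. A rough sketch of that mechanism, reusing jabber_words and packer_words from the first listing (the salted-md5 hash family and NUM_PERM are my own choices, not datasketch's internals):

import hashlib

NUM_PERM = 128

def minhash_signature(items):
    # one minimum per "permutation" (here: per salted hash function)
    sig = []
    for i in xrange(NUM_PERM):
        sig.append(min(int(hashlib.md5("%d:%s" % (i, it)).hexdigest(), 16)
                       for it in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return float(matches) / NUM_PERM

sig1 = minhash_signature(set(jabber_words))
sig2 = minhash_signature(set(packer_words))
print "sketched Jaccard %f" % estimate_jaccard(sig1, sig2)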
Frequency Summaries -- count-min sketch
Frequency counting is typically used for leaderboards: who currently holds 🏆, who holds 🥈, who holds 🥉. We care about the events that occur very often; we don't much care who is ranked n versus n+1, and an error like whether you scored 1000 or 1001 rings at the shooting range doesn't matter either. Another application is detecting the language a text is written in: pull out the most frequent words and you can tell which language the article uses. Let's look at it concretely via https://github.com/IsaacHaze/countminsketch:
from collections import Counter
from yacms import CountMinSketch

counts = Counter()
# 200 is hash width in bits, 3 is number of hash functions
cms = CountMinSketch(200, 3)

for word in jabber_words:
    counts[word] += 1
    cms.update(word, 1)

for word in ["the", "he", "and", "that"]:
    print "word %s counts %d" % (word, cms.estimate(word))

for e in counts:
    if counts[e] != cms.estimate(e):
        print "missed %s counter: %d, sketch: %d" % (e, counts[e], cms.estimate(e))
Output:
word the counts 19
word he counts 7
word and counts 14
word that counts 2
missed two counter: 2, sketch: 3
missed chortled counter: 1, sketch: 2
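The mechanism behind those occasional over-counts: a small 2-D array of counters, one row per hash function. Every update increments one counter per row, and the estimate is the minimum across rows, so hash collisions can only inflate a count, never lose one. A minimal sketch of that idea, again reusing jabber_words; the salted-md5 hashing and the WIDTH/DEPTH values are my own illustrative choices, not how the yacms library works internally:

import hashlib

WIDTH, DEPTH = 200, 3
table = [[0] * WIDTH for _ in xrange(DEPTH)]

def _slot(row, item):
    return int(hashlib.md5("%d:%s" % (row, item)).hexdigest(), 16) % WIDTH

def cms_update(item, count=1):
    for row in xrange(DEPTH):
        table[row][_slot(row, item)] += count

def cms_estimate(item):
    # collisions only ever add, so the smallest row gives the tightest bound
    return min(table[row][_slot(row, item)] for row in xrange(DEPTH))

for word in jabber_words:
    cms_update(word, 1)
print "the ~ %d" % cms_estimate("the")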
Streaming Quantiles -- tdigest
Streaming quantiles -- this one is impressive. Suppose you have a huge stream of data pouring in endlessly, say transaction data, and you're asked to flag transactions that might be credit-card fraud; then statistics such as the typical and the extreme transaction amounts matter a lot. Computing them in real time adds to the processing burden, and that's where t-digest makes its entrance. Below we see how to get the 5th percentile, the median, and the 95th percentile in real time, using https://github.com/trademob/t-digest:
from tdigest import TDigest
import random

td = TDigest()
for x in xrange(0, 1000):
    td.add(random.random(), 1)

for q in [0.05, 0.5, 0.95]:
    print "%f @ %f" % (q, td.quantile(q))
Output:
0.050000 @ 0.052331
0.500000 @ 0.491775
0.950000 @ 0.955989
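As a sanity check on those numbers, one could keep the same samples around and compare the digest's answers against exact quantiles obtained by sorting the whole data set, something the streaming setting normally can't afford. A small sketch of that comparison (my own addition, not part of the library's examples):

from tdigest import TDigest
import random

td = TDigest()
samples = []
for x in xrange(0, 1000):
    v = random.random()
    samples.append(v)
    td.add(v, 1)

samples.sort()
for q in [0.05, 0.5, 0.95]:
    exact = samples[int(q * (len(samples) - 1))]   # nearest-rank exact quantile
    print "q=%0.2f digest %f exact %f" % (q, td.quantile(q), exact)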
That wraps up today's tour of probabilistic data structures. If you're interested, take a look at how these libraries are actually implemented; the code is short in every case, but the math behind it rewards careful study.