A New Year's Treat: Learn Probabilistic Data Structures in 10 Minutes


You use hashmaps, trees, sets, vectors, queues/stacks/heaps, linked lists, and graphs all the time, so it is easy to feel that data structures have nothing new to offer. With the new year here, let me share a family of higher-level data structures built for big data: probabilistic data structures, also known as approximation algorithms or online algorithms. Whether you are doing system design, chatting with your boss and coworkers, or interviewing for a job, these will make people take notice. Today I will teach you five of them: HyperLogLog for set cardinality, the Bloom filter for set membership, MinHash for set similarity, count-min sketch for frequency estimation, and t-digest for streaming quantiles. Master these five moves and you can hold your own in any technical duel.

Set Cardinality -- HyperLogLog

Cardinality counting is a very common need, for example: how many unique IPs visited my site over some period? These are count-unique problems. The straightforward approach is to build a hashset, insert everything into it, and return its size at the end. But what do you do when the data is so large that the hashset no longer fits in memory? Shard it across machines with memcached or Redis, and spill to disk when even that runs out; in short, brute force.
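
To make the baseline concrete, here is that brute-force approach as a minimal sketch (the function name and the sample IPs are made up for illustration):

# A minimal sketch of the exact approach: keep every distinct element
# in a set and report its size. Memory grows with the number of unique
# elements, which is exactly the cost HyperLogLog avoids.
def exact_unique_count(items):
    seen = set()
    for item in items:
        seen.add(item)
    return len(seen)

print(exact_unique_count(["1.2.3.4", "5.6.7.8", "1.2.3.4"]))  # 2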

At this point it is worth changing the approach: trade a little accuracy for big memory savings. That is exactly what HyperLogLog does! The example below uses a 1% error bound to get an approximate answer; it is extremely fast and uses remarkably little memory. It relies on the implementation at https://github.com/svpcom/hyperloglog:

#!/usr/bin/python

import re

jabber_text = """
`Twas brillig, and the slithy toves 
      Did gyre and gimble in the wabe: 
All mimsy were the borogoves, 
      And the mome raths outgrabe. 

"Beware the Jabberwock, my son! 
      The jaws that bite, the claws that catch! 
Beware the Jubjub bird, and shun 
      The frumious Bandersnatch!" 

He took his vorpal sword in hand; 
      Long time the manxome foe he sought- 
So rested he by the Tumtum tree 
      And stood awhile in thought. 

And, as in uffish thought he stood, 
      The Jabberwock, with eyes of flame, 
Came whiffling through the tulgey wood, 
      And burbled as it came! 

One, two! One, two! And through and through 
      The vorpal blade went snicker-snack! 
He left it dead, and with its head 
      He went galumphing back. 

"And hast thou slain the Jabberwock? 
      Come to my arms, my beamish boy! 
O frabjous day! Callooh! Callay!" 
      He chortled in his joy. 

`Twas brillig, and the slithy toves 
      Did gyre and gimble in the wabe: 
All mimsy were the borogoves, 
      And the mome raths outgrabe.
"""

packer_text = """
My answers are inadequate
To those demanding day and date
And ever set a tiny shock
Through strangers asking what's o'clock;
Whose days are spent in whittling rhyme-
What's time to her, or she to Time? 
"""

def clean_words(text):
    # strip everything but letters, then drop the empty strings left by the split
    return filter(lambda x: len(x) > 0, re.sub("[^A-Za-z]", " ", text).split(" "))

jabber_words = clean_words(jabber_text.lower())
#print jabber_words

packer_words = clean_words(packer_text.lower())
#print packer_words

jabber_uniq = sorted(set(jabber_words))
#print jabber_uniq

import hyperloglog

# ask for a cardinality estimate within a 1% error bound
hll = hyperloglog.HyperLogLog(0.01)

for word in jabber_words:
    hll.add(word)

print "prob count %d, true count %d" % (len(hll),len(jabber_uniq))
print "observed error rate %0.2f" % (abs(len(hll) - len(jabber_uniq))/float(len(jabber_uniq)))

Output:

prob count 90, true count 91
observed error rate 0.01
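
How does HyperLogLog manage that? Very roughly: hash every element and watch the bit patterns. Below is a toy sketch of the intuition only; it is not the real algorithm and not part of the hyperloglog library, which keeps many registers and combines them with a bias-corrected harmonic mean to hit the requested error bound.

import hashlib

# Toy intuition: if the longest run of leading zero bits seen among
# the hashes is k, roughly 2**k distinct elements have gone by.
# One register like this is wildly noisy; HLL averages many of them.
def rough_cardinality(items, bits=32):
    max_zeros = 0
    for item in items:
        h = int(hashlib.md5(item.encode("utf8")).hexdigest(), 16) & ((1 << bits) - 1)
        zeros = bits if h == 0 else bits - h.bit_length()
        if zeros > max_zeros:
            max_zeros = zeros
    return 2 ** max_zeros

print(rough_cardinality(str(i) for i in range(1000)))  # right order of magnitude only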

Set Membership -- Bloom Filter

The Bloom filter is the one structure in this family you really should master; if you have read my earlier post on Cassandra, it will sound familiar. On Cassandra's read path, if the memtable misses, the Bloom filter is consulted next; if the Bloom filter says the key is not there, the read stops right there, because it genuinely is not there. If it says the key may be there, the read continues to the key cache, etc. That "no means no" property is tremendously useful. The other half of the story is that "yes" only means "maybe", but the false-positive rate can be tuned, here to 0.001. Here is an example built on https://github.com/jaybaird/python-bloomfilter (I will not repeat the code from above):

from pybloom import BloomFilter

bf = BloomFilter(capacity=1000, error_rate=0.001)

for word in packer_words:
    bf.add(word)

# collect the words from the jabber text that the filter (probably) saw in the packer text
intersect = set()

for word in jabber_words:
    if word in bf:
        intersect.add(word)

print intersect

Output:

set(['and', 'in', 'o', 'to', 'through', 'time', 'my', 'day'])
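
Where does "no means no" come from? A from-scratch sketch of the mechanism makes it clear (illustrative only; pybloom handles sizing and hashing for you). Adding an element sets k bits in a bit array; a lookup that finds any of those bits unset proves the element was never added, while finding them all set might be a collision, hence false positives but never false negatives.

import hashlib

class TinyBloom(object):
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # derive k positions by salting the hash with the function index
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, item)).encode("utf8")).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

tb = TinyBloom()
tb.add("jabberwock")
print("jabberwock" in tb)    # True
print("bandersnatch" in tb)  # False, except for a rare collision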

Set Similarity -- MinHash

MinHash estimates how similar two documents are, which sounds useful for catching plagiarism and the like. Let us look at it through the implementation at https://github.com/ekzhu/datasketch:

from datasketch import MinHash

def mh_digest(data):
    m = MinHash(num_perm=512)  # number of permutations

    for d in data:
        m.update(d.encode('utf8'))

    return m

m1 = mh_digest(set(jabber_words))
m2 = mh_digest(set(packer_words))

print "Jaccard simularity %f" % m1.jaccard(m2), "estimated"

s1 = set(jabber_words)
s2 = set(packer_words)
actual_jaccard = float(len(s1.intersection(s2)))/float(len(s1.union(s2)))

print "Jaccard simularity %f" % actual_jaccard, "actual"

Output:

Jaccard similarity 0.060547 estimated
Jaccard similarity 0.069565 actual
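
The trick behind MinHash: for a random hash function, the probability that two sets share the same minimum hash value is exactly their Jaccard similarity, so averaging that event over many independent hash functions estimates it. Here is a small self-contained sketch of the idea (salted md5 stands in for the random hash functions; datasketch's implementation is far more efficient):

import hashlib

def min_hash(items, seed):
    # minimum value of a seed-salted hash over the set
    return min(int(hashlib.md5(("%d:%s" % (seed, x)).encode("utf8")).hexdigest(), 16)
               for x in items)

def estimate_jaccard(s1, s2, num_hashes=256):
    # fraction of hash functions on which the two minima agree
    agree = sum(1 for seed in range(num_hashes)
                if min_hash(s1, seed) == min_hash(s2, seed))
    return agree / float(num_hashes)

print(estimate_jaccard(set("abcdef"), set("cdefgh")))  # true Jaccard is 4/8 = 0.5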

Frequency Summaries -- count-min sketch

Frequency estimation is usually used for leaderboards: who holds gold, silver, and bronze right now. We care about comparing the events that occur most often; we do not much care who ranks nth versus (n+1)th, and an error like scoring 1000 rings versus 1001 at the shooting range does not matter either. Another application is language detection: pull out the most frequent words in a text and you can tell which language it is written in. Let us make this concrete with https://github.com/IsaacHaze/countminsketch:

from collections import Counter
from yacms import CountMinSketch

counts = Counter()

# 200 counters per row (width), 3 hash functions (depth)
cms = CountMinSketch(200, 3)

# maintain an exact Counter alongside the sketch for comparison
for word in jabber_words:
    counts[word] += 1
    cms.update(word, 1)

for word in ["the", "he", "and", "that"]:
    print "word %s counts %d" % (word, cms.estimate(word))

for e in counts:
    if counts[e] != cms.estimate(e):
        print "missed %s counter: %d, sketch: %d" % (e, counts[e], cms.estimate(e))

Output:

word the counts 19
word he counts 7
word and counts 14
word that counts 2
missed two counter: 2, sketch: 3
missed chortled counter: 1, sketch: 2
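
The mechanism is easy to sketch from scratch (illustrative only; yacms is the real implementation): keep d rows of w counters, each row with its own hash function; update() increments one counter per row, and estimate() takes the minimum across rows. Collisions can only inflate a counter, which is why the sketch over-counts "two" and "chortled" above but never under-counts.

import hashlib

class TinyCMS(object):
    def __init__(self, w=200, d=3):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row, item):
        h = hashlib.md5(("%d:%s" % (row, item)).encode("utf8")).hexdigest()
        return int(h, 16) % self.w

    def update(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # the least-collided row gives the tightest upper bound
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))

toy = TinyCMS()
for word in ["the", "the", "he"]:
    toy.update(word)
print(toy.estimate("the"))  # 2 (or more, if rows collide)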

Streaming Quantiles -- tdigest

Streaming quantiles, now this one is impressive. Suppose you have a huge stream that never stops, say transaction data, and you must flag transactions that may be credit-card fraud; then statistics such as the average and maximum transaction amounts matter a lot, and computing them in real time makes everything harder. This is where t-digest shines. Below we pull out the bottom 5%, the top 5%, and the median in real time, using https://github.com/trademob/t-digest:

from tdigest import TDigest
import random

td = TDigest()

# stream in 1,000 uniform random values, each with weight 1
for x in xrange(0, 1000):
    td.add(random.random(), 1)

for q in [0.05, 0.5, 0.95]:
    print "%f @ %f" % (q, td.quantile(q))

Output:

0.050000 @ 0.052331
0.500000 @ 0.491775
0.950000 @ 0.955989
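
As a sanity check, and assuming the same tdigest module as above, we can feed a small sample into a fresh digest and compare its answers with exact quantiles obtained by sorting. Sorting is feasible here only because the sample is tiny; the whole point of t-digest is never having to hold or sort the full stream.

from tdigest import TDigest
import random

sample = [random.random() for _ in range(1000)]

td2 = TDigest()
for x in sample:
    td2.add(x, 1)

exact = sorted(sample)
for q in [0.05, 0.5, 0.95]:
    # compare the sketch against the exact empirical quantile
    print("q=%.2f tdigest=%.4f exact=%.4f"
          % (q, td2.quantile(q), exact[int(q * (len(exact) - 1))]))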

That wraps up today's tour of probabilistic data structures. If you are interested, go read the actual implementations: each one is only a small amount of code, yet the mathematical tricks behind them are well worth savoring.

