The simhash algorithm: deduplicating massive datasets (tens of millions of records)

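References on the simhash algorithm and how it works:

A simple, easy-to-follow explanation of the simhash algorithm (hashing): https://blog.csdn.net/le_le_name/article/details/51615931

A brief introduction to the simhash algorithm and its principles: https://blog.csdn.net/lengye7/article/details/79789206

Using SimHash to deduplicate massive amounts of text: https://www.cnblogs.com/maybe2030/p/5203186.html#_label3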

Python implementations:

Comparing text similarity with simhash in Python (full code shown): https://blog.csdn.net/weixin_43750200/article/details/84789361

A Python implementation of simhash: https://blog.csdn.net/gzt940726/article/details/80460419

Using the Python simhash library

For details, see: https://leons.im/posts/a-python-implementation-of-simhash-algorithm/

(1) Viewing a simhash value

>>> from simhash import Simhash
>>> print('%x' % Simhash('I am very happy'.split()).value)
9f8fd7efdb1ded7f

Simhash() takes a sequence of tokens, also called a sequence of features.
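To make the idea of a feature sequence concrete, here is a minimal pure-Python sketch of how a simhash fingerprint can be derived from tokens. It is not the library's implementation: the function name simhash_sketch, the use of MD5 as the per-token hash, and the equal weight given to every token are assumptions made for illustration, so its output should not be expected to match the library's.

```python
import hashlib


def simhash_sketch(tokens, bits=64):
    """Return a `bits`-wide integer fingerprint for a sequence of string tokens."""
    mask = (1 << bits) - 1
    counts = [0] * bits
    for token in tokens:
        # Hash each token to an integer (MD5 is an assumption of this sketch).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        # Each token votes +1 for every bit set in its hash and -1 otherwise.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 wherever the votes for that position are positive.
    fingerprint = 0
    for i, votes in enumerate(counts):
        if votes > 0:
            fingerprint |= 1 << i
    return fingerprint


print("%x" % simhash_sketch("I am very happy".split()))
print("%x" % simhash_sketch("I am very sad".split()))
```

Because every token only adds votes per bit position, the order of the tokens does not affect the result, and texts that share most of their tokens tend to agree on most bits.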

(2) Computing the distance between two simhash values

>>> hash1 = Simhash('I am very happy'.split())
>>> hash2 = Simhash('I am very sad'.split())
>>> print(hash1.distance(hash2))
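The distance between two simhash values is the Hamming distance, that is, the number of bit positions in which the two fingerprints differ. A minimal sketch on plain integers follows; the helper name hamming_distance is hypothetical and is not part of the library.

```python
def hamming_distance(x, y):
    """Count the bit positions at which the integers x and y differ."""
    # XOR leaves a 1 exactly where the two values disagree.
    return bin(x ^ y).count("1")


print(hamming_distance(0b1011, 0b0010))  # 1011 ^ 0010 = 1001, so this prints 2
```

The smaller the distance, the more similar the two texts are taken to be.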

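(3) Building an index

simhash is used for deduplication. Comparing the simhash values of every pair of documents does not scale to large datasets, because the number of comparisons grows quadratically. A dedicated data structure exists for this; see: http://www.cnblogs.com/maybe2030/p/5203186.html#_label4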
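One common way to build such a structure relies on the pigeonhole principle: if two fingerprints differ in at most k bits and each is cut into k + 1 blocks, at least one block must be identical in both. The sketch below indexes fingerprints by block and then verifies candidates by Hamming distance. The class BlockIndex and its methods are hypothetical names for illustration, and this is not claimed to be how the library's SimhashIndex is written.

```python
from collections import defaultdict


class BlockIndex:
    """Hypothetical near-duplicate index over `bits`-wide integer fingerprints."""

    def __init__(self, k=3, bits=64):
        self.k = k
        self.bits = bits
        # (block number, block value) -> set of (doc_id, fingerprint)
        self.buckets = defaultdict(set)

    def _keys(self, fp):
        # Cut the fingerprint into k + 1 contiguous blocks of bits.
        n = self.k + 1
        for i in range(n):
            lo = i * self.bits // n
            hi = (i + 1) * self.bits // n
            yield i, (fp >> lo) & ((1 << (hi - lo)) - 1)

    def add(self, doc_id, fp):
        for key in self._keys(fp):
            self.buckets[key].add((doc_id, fp))

    def near_dups(self, fp):
        found = set()
        for key in self._keys(fp):
            # Only entries sharing at least one block are candidates.
            for doc_id, other in self.buckets.get(key, ()):
                if bin(fp ^ other).count("1") <= self.k:
                    found.add(doc_id)
        return sorted(found)


index = BlockIndex(k=3)
index.add("a", 0x0123456789ABCDEF)
index.add("b", 0x0123456789ABCDEE)  # differs from "a" in 1 bit
index.add("c", 0xFEDCBA9876543210)  # differs from "a" in all 64 bits
print(index.near_dups(0x0123456789ABCDEF))  # prints ['a', 'b']
```

A query only inspects the buckets that share a block with it, so the number of distance computations depends on bucket sizes rather than on the total number of stored fingerprints.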

from simhash import Simhash, SimhashIndex
# Build the index
data = {
    '1': 'How are you I Am fine . blar blar blar blar blar Thanks .'.lower().split(),
    '2': 'How are you i am fine .'.lower().split(),
    '3': 'This is simhash test .'.lower().split(),
}
objs = [(doc_id, Simhash(tokens)) for doc_id, tokens in data.items()]
index = SimhashIndex(objs, k=10)  # k is the tolerance; the larger k is, the more similar texts are retrieved
# Query
s1 = Simhash('How are you . blar blar blar blar blar Thanks'.lower().split())
print(index.get_near_dups(s1))
# Add a new entry to the index
index.add('4', s1)
