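Introduction:
pyahocorasick is a Python module that implements two data structures: a trie and an Aho-Corasick automaton.
A trie is a dictionary indexed by strings; looking up an entry takes time proportional to the length of the string.
An Aho-Corasick (AC) automaton finds every occurrence of all strings from a given set in a single pass over the input. In essence it is KMP implemented on top of a trie, which makes multi-pattern matching possible.
(Recommended reading: http://blog.csdn.net/niushuai666/article/details/7002823; http://www.cnblogs.com/kuangbin/p/3164106.html)
Author: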
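To make the "KMP on a trie" idea concrete, here is a minimal pure-Python sketch. It is not how pyahocorasick is implemented (the module is a C extension); the function names `build_automaton` and `find_all` are made up for this illustration, and the code favours readability over speed.

```python
from collections import deque


def build_automaton(words):
    """Build a trie plus failure links (illustrative sketch only)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> fallback state, like the KMP failure function
    out = [[]]    # out[state] -> words recognised when this state is reached

    # Step 1: insert every word into the trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Step 2: breadth-first pass that fills in the failure links.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (end_index, word) for every match, scanning text once."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # On a mismatch, follow failure links instead of restarting.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i, word))
    return matches


automaton = build_automaton("he her hers she".split())
print(find_all("_hershe_", automaton))
```

The text is scanned once from left to right; a mismatch never moves the input position backwards, it only moves the automaton to a shorter suffix that is still a prefix of some word.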
Wojciech Muła, wojciech_mula@poczta.onet.pl
Official page:
https://pypi.python.org/pypi/pyahocorasick/
Installing with pip failed with an error and the package could not be installed, for example:
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools"
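Because no prebuilt wheel (whl) file was available, the only option was to install the corresponding build tools; a default installation followed by a restart is enough.
Here is a Baidu Netdisk link:
Link: https://pan.baidu.com/s/13Fv3dPYQNq6u_ErO9nqCAw
Extraction code: nhll
The following two examples show how to use it in detail: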
import ahocorasick

A = ahocorasick.Automaton()

# Add words to the trie
for index, word in enumerate("he her hers she".split()):
    A.add_word(word, (index, word))

# Usage: add_word(word, [value]) => bool
# Depending on the store argument passed to the Automaton constructor:
# 1. STORE_LENGTH: value must not be passed; len(word) is stored
# 2. STORE_INTS: value is optional but must be an int; defaults to len(automaton)
# 3. STORE_ANY: value is required and may be of any type

# Test whether a word is in the trie
if "he" in A:
    print(True)
else:
    print(False)

print(A.get("he"))                   # (0, 'he')
print(A.get("cat", "<not exists>"))  # '<not exists>'
try:
    A.get("dog")                     # no default given, so this raises KeyError
except KeyError:
    print("KeyError")

# Convert the trie into an Aho-Corasick automaton
A.make_automaton()

# Find all matches; each item is (end_index, value)
for item in A.iter("_hershe_"):
    print(item)
# (2, (0, 'he'))
# (3, (1, 'her'))
# (4, (2, 'hers'))
# (6, (3, 'she'))
# (6, (0, 'he'))
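Note that the first element of each item listed for `A.iter("_hershe_")` is the index of the last character of the match, not the first. A small sketch of turning those items into start/end slices follows; `to_spans` is a hypothetical helper written for this post, not part of pyahocorasick, and it assumes the stored value is the `(index, word)` tuple used in the example above.

```python
def to_spans(matches):
    """Convert (end_index, (key_index, word)) items to (start, end, word).

    The returned end is exclusive, so text[start:end] == word.
    """
    spans = []
    for end_index, (_key_index, word) in matches:
        start_index = end_index - len(word) + 1
        spans.append((start_index, end_index + 1, word))
    return spans


# Items copied from the output listed in the example above.
matches = [
    (2, (0, "he")),
    (3, (1, "her")),
    (4, (2, "hers")),
    (6, (3, "she")),
    (6, (0, "he")),
]
text = "_hershe_"
for start, end, word in to_spans(matches):
    print(start, end, text[start:end])
```

Storing the word itself (or at least its length) as the value is what makes this conversion possible, since the automaton only reports where a match ends.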
import ahocorasick

A = ahocorasick.Automaton()

# Add words
for index, word in enumerate("cat catastropha rat rate bat".split()):
    A.add_word(word, (index, word))

# Prefix search
print(list(A.keys("cat")))
# ["cat", "catastropha"]

print(list(A.keys("?at", "?", ahocorasick.MATCH_EXACT_LENGTH)))
# ['bat', 'cat', 'rat']

print(list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX)))
# ["bat", "cat", "rat", "rate"]

print(list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX)))
# ['rate']

# Usage of keys:
# keys([prefix, [wildcard, [how]]]) => yield strings
# If prefix (a string) is given, only words sharing this prefix are yielded.
# If wildcard (a single character) is given, prefix is treated as a simple
# pattern with the selected wildcard. The optional parameter how controls
# which strings are matched:
#   MATCH_EXACT_LENGTH [default]: only strings with the same length as the
#     pattern are yielded; in other words, the pattern is matched literally.
#   MATCH_AT_LEAST_PREFIX: strings whose length is greater than or equal to
#     the pattern's length are yielded.
#   MATCH_AT_MOST_PREFIX: strings whose length is less than or equal to the
#     pattern's length are yielded.
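To spell out what the default mode, MATCH_EXACT_LENGTH, means according to the description above, here is a plain-Python sketch over an ordinary list of words. `match_exact_length` is a hypothetical helper for illustration only; it does not use pyahocorasick, it covers only the exact-length mode, and it sorts its result for readability (an assumption of this sketch, not a statement about the order the library yields keys in).

```python
def match_exact_length(words, pattern, wildcard="?"):
    """Return the words that match pattern character by character.

    A word matches when it has the same length as the pattern and every
    non-wildcard character of the pattern equals the character at the
    same position in the word.
    """
    result = []
    for word in words:
        if len(word) != len(pattern):
            continue
        if all(p == wildcard or p == c for p, c in zip(pattern, word)):
            result.append(word)
    return sorted(result)


words = "cat catastropha rat rate bat".split()
print(match_exact_length(words, "?at"))   # ['bat', 'cat', 'rat']
print(match_exact_length(words, "?at?"))  # ['rate']
```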