ahocorasick: from installation to usage


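Introduction:

pyahocorasick is a Python module that implements two data structures: a trie and an Aho-Corasick automaton.

A trie is a dictionary indexed by strings; looking up an item takes time proportional to the length of the key string.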
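To make the "time proportional to the key length" claim concrete, here is a minimal pure-Python sketch of a trie. It is not how pyahocorasick is implemented (that is a C extension); the names TrieNode and Trie are made up for illustration, and only add_word and get mirror method names used later in this post.

```python
class TrieNode:
    """One node of the trie: child links plus an optional stored value."""

    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a stored word ends at this node
        self.value = None     # value attached to that word


class Trie:
    """Illustrative string-indexed dictionary (hypothetical helper class)."""

    def __init__(self):
        self.root = TrieNode()

    def add_word(self, word, value):
        node = self.root
        for ch in word:  # one step per character
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
        node.value = value

    def get(self, word, default=None):
        node = self.root
        for ch in word:  # at most len(word) steps, regardless of trie size
            node = node.children.get(ch)
            if node is None:
                return default
        return node.value if node.is_word else default


trie = Trie()
for index, word in enumerate("he her hers she".split()):
    trie.add_word(word, (index, word))

print(trie.get("her"))  # (1, 'her')
print(trie.get("hex"))  # None
```

The lookup loop runs once per character of the key, so its cost depends on the key length and not on how many words are stored.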

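An Aho-Corasick (AC) automaton finds all occurrences of every string in a given set in a single pass over the input. In effect it applies the KMP idea to a trie, which makes it suitable for multi-pattern matching.
(Recommended reading: http://blog.csdn.net/niushuai666/article/details/7002823 and http://www.cnblogs.com/kuangbin/p/3164106.html)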
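The "KMP on a trie" idea can be sketched in plain Python as follows. This is a textbook-style illustration, not the library's code; build_automaton and find_all are hypothetical names, and the automaton is stored as three parallel lists (goto, fail, output) for brevity.

```python
from collections import deque


def build_automaton(words):
    """Build goto/fail/output tables for a set of words."""
    goto = [{}]    # goto[state][char] -> next state (the trie edges)
    fail = [0]     # fail[state] -> state of the longest proper suffix in the trie
    output = [[]]  # output[state] -> words that end when this state is reached

    # Step 1: insert every word into the trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Step 2: breadth-first pass that sets the failure links (the KMP part).
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Words ending at the suffix state also end here.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, words):
    """Return (end_index, word) for every occurrence, in one pass over text."""
    goto, fail, output = build_automaton(words)
    state = 0
    matches = []
    for index, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # fall back instead of rescanning the text
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((index, word))
    return matches


print(find_all("_hershe_", ["he", "her", "hers", "she"]))
# [(2, 'he'), (3, 'her'), (4, 'hers'), (6, 'she'), (6, 'he')]
```

Each character of the text is read once; on a mismatch the automaton follows failure links rather than restarting, which is what lets it report all patterns in a single scan.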

Author

Wojciech Muła, wojciech_mula@poczta.onet.pl
Official page:
https://pypi.python.org/pypi/pyahocorasick/

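Installing with pip can fail with an error such as:

error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools"

Because no ready-made whl file was available, the only option is to install the required build tools; accept the default installation options and restart.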
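After installing the build tools and re-running pip, a quick way to confirm that the module is now visible to the interpreter is a check like the one below. It uses only the standard library; is_installed is a hypothetical helper name, and the pip command in the message is the usual one for this package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("ahocorasick"):
    print("pyahocorasick is available")
else:
    print("pyahocorasick not found; try: pip install pyahocorasick")
```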

Here is a Baidu Netdisk link:

Link: https://pan.baidu.com/s/13Fv3dPYQNq6u_ErO9nqCAw
Extraction code: nhll

Its usage is best understood by working through the following two examples:

import ahocorasick

A = ahocorasick.Automaton()

# Add words to the trie
for index, word in enumerate("he her hers she".split()):
    A.add_word(word, (index, word))
# Signature: add_word(word, [value]) => bool
# How value is handled depends on the store argument of the Automaton constructor:
# 1. store=STORE_LENGTH: value must not be passed; len(word) is stored
# 2. store=STORE_INTS: value is optional but must be an int; it defaults to len(automaton)
# 3. store=STORE_ANY: value is required and can be of any type

# Test whether a word is in the trie
print("he" in A)
# True
print(A.get("he"))
# (0, 'he')
print(A.get("cat", "<not exists>"))
# <not exists>
try:
    A.get("dog")
except KeyError:
    # No default was given, so a missing key raises KeyError
    print("KeyError")

# Convert the trie into an Aho-Corasick automaton
A.make_automaton()

# Find all matches; each item is (end_index, value)
for item in A.iter("_hershe_"):
    print(item)
# (2, (0, 'he'))
# (3, (1, 'her'))
# (4, (2, 'hers'))
# (6, (3, 'she'))
# (6, (0, 'he'))
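The first number in each result of the listing above is the index of the last character of the match, not the first. A small helper can turn that into a slice of the text. This is a sketch that does not need pyahocorasick: match_span is a hypothetical name, and the items list is copied by hand from the output above.

```python
def match_span(end_index, word):
    """Return (start, end) such that text[start:end] == word."""
    start = end_index - len(word) + 1
    return start, end_index + 1


text = "_hershe_"
# (end_index, (insert_order, word)) items as printed by the example above
items = [
    (2, (0, "he")),
    (3, (1, "her")),
    (4, (2, "hers")),
    (6, (3, "she")),
    (6, (0, "he")),
]

for end_index, (_, word) in items:
    start, end = match_span(end_index, word)
    print(word, start, end, text[start:end])
```

Storing the word itself (or its length) as the value in add_word is what makes this conversion possible, since iter only reports where a match ends.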
# Second example: prefix and wildcard queries with keys()
import ahocorasick

A = ahocorasick.Automaton()

# Add words
for index, word in enumerate("cat catastropha rat rate bat".split()):
    A.add_word(word, (index, word))

# Keys sharing a prefix
print(list(A.keys("cat")))
## ["cat", "catastropha"]

print(list(A.keys("?at", "?", ahocorasick.MATCH_EXACT_LENGTH)))
## ['bat', 'cat', 'rat']

print(list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX)))
## ["bat", "cat", "rat", "rate"]

print(list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX)))
## ['rate']

## How keys works
## keys([prefix, [wildcard, [how]]]) => yield strings
## If prefix (a string) is given, then only words sharing this prefix are yielded.
## If wildcard (single character) is given, then prefix is treated as a simple pattern with selected wildcard. Optional parameter how controls which strings are matched:
## MATCH_EXACT_LENGTH [default]: Only strings with the same length as a pattern’s length are yielded. In other words, literally match a pattern.
## MATCH_AT_LEAST_PREFIX: Strings that have length greater or equal to a pattern’s length are yielded.
## MATCH_AT_MOST_PREFIX: Strings that have length less or equal to a pattern’s length are yielded.
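The three matching modes can be emulated in a few lines of plain Python, following a literal reading of the descriptions above. This is only an illustration of the definitions, not the library's implementation: keys_matching and the three constants are hypothetical stand-ins, and results are returned in sorted order for readability.

```python
# Hypothetical stand-ins for the ahocorasick.MATCH_* constants
EXACT_LENGTH, AT_LEAST_PREFIX, AT_MOST_PREFIX = range(3)


def keys_matching(words, pattern, wildcard, how=EXACT_LENGTH):
    """Return the words that fit pattern, where wildcard matches any one character."""

    def fits(word, n):
        # Compare the first n characters of pattern and word.
        return all(p == wildcard or p == c for p, c in zip(pattern[:n], word[:n]))

    result = []
    for word in sorted(words):
        if how == EXACT_LENGTH:
            ok = len(word) == len(pattern) and fits(word, len(pattern))
        elif how == AT_LEAST_PREFIX:
            ok = len(word) >= len(pattern) and fits(word, len(pattern))
        else:  # AT_MOST_PREFIX
            ok = len(word) <= len(pattern) and fits(word, len(word))
        if ok:
            result.append(word)
    return result


words = "cat catastropha rat rate bat".split()
print(keys_matching(words, "?at", "?", EXACT_LENGTH))
# ['bat', 'cat', 'rat']
print(keys_matching(words, "?at?", "?", AT_MOST_PREFIX))
# ['bat', 'cat', 'rat', 'rate']
print(keys_matching(words, "?at?", "?", AT_LEAST_PREFIX))
```

Under this literal reading, AT_LEAST_PREFIX with "?at?" also accepts "catastropha", because its first four characters fit the pattern. The listing above reports only 'rate' for that call, so it is worth checking the behaviour of the installed version rather than relying on this sketch.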

 

