FuzzyWuzzy: A Simple, Easy-to-Use Library for Fuzzy String Matching
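Introduction to FuzzyWuzzy
FuzzyWuzzy is a simple, easy-to-use toolkit for fuzzy string matching. It uses the Levenshtein distance to measure the difference between two sequences.
The Levenshtein distance, also called the edit distance, is the minimum number of edit operations needed to turn one string into another. The permitted operations are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are.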
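To make the definition concrete, here is a minimal pure-Python sketch of the edit distance using the classic dynamic-programming approach. The function name levenshtein is chosen here for illustration only; this is not FuzzyWuzzy's own code.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))                  # 3
print(levenshtein("this is a test", "this is a test!"))  # 1
```

Only two rows of the table are kept at any time, so memory use grows with the length of the second string rather than with the product of both lengths.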
Project page: https://github.com/seatgeek/fuzzywuzzy
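Requirements
- Python 2.7 or higher
- difflib (part of the Python standard library)
- python-Levenshtein (optional; gives a 4-10x speedup in string matching, but may produce different results in certain cases)
- pycodestyle (only needed to run the tests)
- hypothesis (only needed to run the tests)
- pytest (only needed to run the tests)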
Installation
Using pip via PyPI
pip install fuzzywuzzy
or, to install python-Levenshtein as well:
pip install fuzzywuzzy[speedup]
Using pip via GitHub
pip install git+git://github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy
Alternatively, add the following line to your requirements.txt file (then run pip install -r requirements.txt):
git+ssh://git@github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy
Manual installation using Git
git clone git://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
cd fuzzywuzzy
python setup.py install
Usage
Full Match (Simple Ratio)
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.ratio("this is a test", "this is a test!"))
Output:
C:\Pycham\anaconda\lib\site-packages\fuzzywuzzy\fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
97
The message above is a warning, not an error. It means python-Levenshtein is not installed, so FuzzyWuzzy falls back to the slower pure-Python SequenceMatcher. Installing python-Levenshtein removes the warning.
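As the warning suggests, without python-Levenshtein the score comes from the standard library's difflib.SequenceMatcher. The sketch below reproduces the 97 using only difflib, assuming the score is the matcher's ratio scaled to 0-100 and rounded. The name simple_ratio is illustrative and not part of FuzzyWuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Similarity of two strings as an integer from 0 to 100."""
    # ratio() is 2 * M / T, where M is the number of matching characters
    # and T is the total length of both strings.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# 14 matching characters out of 14 + 15 = 29 gives 2 * 14 / 29, about 0.966.
print(simple_ratio("this is a test", "this is a test!"))  # 97
```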
Partial Match (Partial Ratio)
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.partial_ratio("this is a test", "this is a test!"))
Output:
100
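The idea behind a partial ratio is to compare the shorter string against the best-aligned window of the same length in the longer string. The following is a rough sketch of that idea built on difflib. The function name partial_ratio_sketch and the exact windowing are assumptions made for illustration, and the library's real implementation may differ in details.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1, s2):
    """Best 0-100 similarity between the shorter string and any aligned
    window of the same length taken from the longer string."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0  # assumption: an empty string matches nothing

    best = 0.0
    # Each matching block suggests where the shorter string might line up.
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    for a_start, b_start, _size in blocks:
        start = max(b_start - a_start, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# The shorter string appears verbatim at the start of the longer one.
print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```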
Order-Insensitive Match (Token Sort Ratio)
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
print(fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
Output:
91
100
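fuzz._process_and_sort(s, force_ascii, full_process=True)
Sorts the tokens of string s. force_ascii is True or False; True forces the string to ASCII. If full_process is True, s is first converted to lowercase and every character other than letters and digits is dropped (although I found that the "-" character was not removed); the remaining text is split on whitespace and the tokens are sorted. If full_process is False, the tokens of s are sorted without this cleanup.
fuzz._token_sort(s1, s2, partial=True, force_ascii=True, full_process=True)
Returns the similarity of strings s1 and s2. Both strings are first passed through fuzz._process_and_sort(). When partial is True, the results are then compared with fuzz.partial_ratio(); when partial is False, they are compared with fuzz.ratio().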
So, given the internal helper
fuzz._token_sort(s1, s2, partial=True, force_ascii=True, full_process=True)
calling it with partial=True is equivalent to:
fuzz.partial_token_sort_ratio(s1, s2, force_ascii=True, full_process=True)
and calling it with partial=False is equivalent to:
fuzz.token_sort_ratio(s1, s2, force_ascii=True, full_process=True)
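The clean-sort-compare pipeline described above can be sketched with the standard library alone. This is a simplified illustration and not the library's code: the helper names are made up, and the cleanup handles ASCII letters and digits only.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """0-100 similarity based on difflib's SequenceMatcher."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def process_and_sort_sketch(s):
    """Lowercase s, replace non-alphanumeric runs with spaces, sort the tokens."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", s.lower())  # ASCII-only simplification
    return " ".join(sorted(cleaned.split()))


def token_sort_ratio_sketch(s1, s2):
    """Compare two strings after sorting their tokens, so word order is ignored."""
    return simple_ratio(process_and_sort_sketch(s1), process_and_sort_sketch(s2))


# Both inputs normalise to "a bear fuzzy was wuzzy", so the score is 100.
print(process_and_sort_sketch("wuzzy fuzzy was a bear"))
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
```

Swapping simple_ratio for a partial comparison in the last step would give the partial variant of the same pipeline.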
Deduplicated Subset Match (Token Set Ratio)
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))
print(fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))
Output:
84
100
So, given the internal helper
fuzz._token_set(s1, s2, partial=True, force_ascii=True, full_process=True)
calling it with partial=False is equivalent to fuzz.token_set_ratio():
fuzz.token_set_ratio(s1, s2, force_ascii=True, full_process=True)
and calling it with partial=True is equivalent to fuzz.partial_token_set_ratio():
fuzz.partial_token_set_ratio(s1, s2, force_ascii=True, full_process=True)
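To see why the duplicated "fuzzy" stops mattering, here is a simplified sketch of a token-set comparison: each string becomes a set of tokens, and the sorted intersection is compared with the intersection plus each side's leftover tokens. The function names are illustrative and the cleanup step is reduced to lowercasing, so treat this as an approximation of the idea and not the library's implementation.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """0-100 similarity based on difflib's SequenceMatcher."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_set_ratio_sketch(s1, s2):
    """Score two strings by their token sets, ignoring order and duplicates."""
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())

    common = " ".join(sorted(tokens1 & tokens2))  # shared tokens
    only1 = " ".join(sorted(tokens1 - tokens2))   # tokens unique to s1
    only2 = " ".join(sorted(tokens2 - tokens1))   # tokens unique to s2

    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()

    # Take the best of the three pairwise comparisons.
    return max(
        simple_ratio(common, combined1),
        simple_ratio(common, combined2),
        simple_ratio(combined1, combined2),
    )


# Both strings reduce to the same token set {a, bear, fuzzy, was}.
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```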
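Process
The process module returns the best-matching strings from a list of choices, together with their similarity scores.
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
You can pass additional arguments to extractOne to make it use a specific scorer. A typical use case is matching file paths.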
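The role of a pluggable scorer can be illustrated with a small standard-library sketch. In FuzzyWuzzy itself the scoring function is passed through the scorer keyword (for example scorer=fuzz.token_sort_ratio); everything below, including the names extract_sketch, extract_one_sketch and token_sort_scorer, is a hypothetical stand-in, and its scores will not match the library's default scorer.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """0-100 similarity based on difflib's SequenceMatcher."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_sort_scorer(s1, s2):
    """Order-insensitive scorer: sort the tokens before comparing."""
    return simple_ratio(" ".join(sorted(s1.split())), " ".join(sorted(s2.split())))


def extract_sketch(query, choices, scorer=simple_ratio, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    q = query.lower()
    scored = [(choice, scorer(q, choice.lower())) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def extract_one_sketch(query, choices, scorer=simple_ratio):
    """Return the single best (choice, score) pair, or None if there are no choices."""
    best = extract_sketch(query, choices, scorer=scorer, limit=1)
    return best[0] if best else None


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(extract_sketch("new york jets", choices, limit=2))
# With the order-insensitive scorer, a shuffled query still scores 100.
print(extract_one_sketch("jets york new", choices, scorer=token_sort_scorer))
```

Passing the scorer as a plain function keeps the ranking logic separate from the similarity measure, which is the same design that lets extractOne switch matching modes.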
Known Ports
FuzzyWuzzy has been ported to other languages. The ports we know of are:
- Java: xpresso's fuzzywuzzy implementation
- Java: fuzzywuzzy (java port)
- Rust: fuzzyrusty (Rust port)
- JavaScript: fuzzball.js (JavaScript port)
- C++: Tmplt/fuzzywuzzy
- C#: fuzzysharp (.Net port)
- Go: go-fuzzywuzz (Go port)