FuzzyWuzzy: A Simple, Easy-to-Use Fuzzy String Matching Library

Reposted 2019-06-24 20:06 · Python

Contents

About FuzzyWuzzy

Installation

Usage

Known Ports

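About FuzzyWuzzy

FuzzyWuzzy is a simple, easy-to-use toolkit for fuzzy string matching. It uses the Levenshtein distance algorithm to calculate the differences between two sequences.

Levenshtein distance, also known as edit distance, is the minimum number of edit operations needed to transform one string into another. The permitted operations are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are.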
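The definition above can be sketched as a small dynamic-programming routine. This is an illustrative implementation of edit distance only, not FuzzyWuzzy's own code; the function name levenshtein is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # prev[j] is the distance between the previous prefix of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i chars of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if they are equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))                  # 3
print(levenshtein("this is a test", "this is a test!"))  # 1
```

Only two rows of the table are kept at a time, so memory use grows with the length of b rather than with the product of both lengths.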

Project page: https://github.com/seatgeek/fuzzywuzzy

Requirements

Python 2.7 or higher
difflib
python-Levenshtein (optional; provides a 4-10x speedup in string matching, though it may produce different results in certain cases)

Supported testing tools

pycodestyle
hypothesis
pytest

Installation

Install from PyPI with pip:

    pip install fuzzywuzzy

Or, to install python-Levenshtein as well:

    pip install fuzzywuzzy[speedup]

Install from GitHub with pip:

    pip install git+git://github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy

Or add the following line to your requirements.txt file (then run pip install -r requirements.txt):

    git+ssh://git@github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy

Install manually with Git:

    git clone git://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
    cd fuzzywuzzy
    python setup.py install

Usage

Full Match (Simple Ratio)

fuzz.ratio() is sensitive to character position:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.ratio("this is a test", "this is a test!"))

Output:

C:\Pycham\anaconda\lib\site-packages\fuzzywuzzy\fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
97

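1. The warning suggests installing the python-Levenshtein library.

2. Installing python-Levenshtein then failed with another error: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools"

3. The message asks for Microsoft Visual C++ Build Tools. The first option is to install them, but installing a compiler just to install one library seems excessive. The second option is to go to https://www.lfd.uci.edu/~gohlke/pythonlibs/, find the python-Levenshtein build that matches your setup, and download it. In the file name, the number after cp is the Python version, and the part after amd indicates whether the machine is 32-bit or 64-bit.

4. Install the downloaded file with pip.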
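To see where the score of 97 comes from, here is a minimal sketch that reproduces it with the standard library's difflib.SequenceMatcher, the pure-Python fallback named in the warning above. The helper name simple_ratio is made up for this example, and the real fuzz.ratio() does some extra input checking that is left out here.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity on a 0-100 scale: 2 * matched characters / total characters."""
    matcher = SequenceMatcher(None, s1, s2)
    return int(round(100 * matcher.ratio()))


# 14 characters match, so the raw ratio is 2 * 14 / (14 + 15), roughly 0.966.
print(simple_ratio("this is a test", "this is a test!"))  # 97
```

A single extra character at the end is enough to pull the score below 100, which is why this mode suits strings that are expected to line up from start to finish.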

Partial Match (Partial Ratio)

fuzz.partial_ratio() is also sensitive to character position:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.partial_ratio("this is a test", "this is a test!"))

Output:

100
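The idea behind a partial ratio can be sketched as follows: compare the shorter string against every window of the same length in the longer string and keep the best score. This brute-force version is an assumption-laden illustration using difflib; the name partial_ratio_sketch is hypothetical, and the library chooses its candidate windows differently, so scores can differ on harder inputs.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best score of the shorter string against any equally long window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    best = 0.0
    # Slide a window of len(shorter) across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# "this is a test" appears verbatim inside "this is a test!", so one window matches exactly.
print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```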

Order-Insensitive Match (Token Sort Ratio)

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
print(fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))

Output:

91
100

fuzz._process_and_sort(s, force_ascii, full_process=True)

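Sorts the tokens of string s. force_ascii is True or False; True means the string is converted to ASCII. If full_process is True, s is converted to lowercase, characters other than letters and digits are removed (the original author found that the - character was not removed), the remaining string is split on spaces, and the tokens are sorted. If full_process is False, s is sorted as-is, without this preprocessing.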

fuzz._token_sort(s1, s2, partial=True, force_ascii=True, full_process=True)

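Returns the similarity of strings s1 and s2. Both are first processed by fuzz._process_and_sort(). When partial is True, the results are then passed to fuzz.partial_ratio(); when partial is False, they are passed to fuzz.ratio().

So, for: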

fuzz._token_sort(s1, s2, partial=True, force_ascii=True, full_process=True)

When partial is True, this is equivalent to:

fuzz.partial_token_sort_ratio(s1, s2, force_ascii=True, full_process=True)

When partial is False, this is equivalent to:

fuzz.token_sort_ratio(s1, s2, force_ascii=True, full_process=True)
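The process-then-sort-then-compare pipeline described above can be sketched with the standard library. The names process_and_sort and token_sort_ratio_sketch are invented for this example, and the cleanup step is a simplified ASCII-only assumption rather than the library's actual preprocessing.

```python
import re
from difflib import SequenceMatcher


def process_and_sort(s: str) -> str:
    """Lowercase, replace non-alphanumeric runs with spaces, then sort the tokens."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", s.lower())  # ASCII-only simplification
    return " ".join(sorted(cleaned.split()))


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Plain ratio computed on the sorted-token form of both strings."""
    a, b = process_and_sort(s1), process_and_sort(s2)
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Both strings sort to "a bear fuzzy was wuzzy", so word order no longer matters.
print(process_and_sort("wuzzy fuzzy was a bear"))
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```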

Deduplicated Subset Match (Token Set Ratio)

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))
print(fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))

Output:

84
100

So, for:

fuzz._token_set(s1, s2, partial=True, force_ascii=True, full_process=True)

When partial is False, this is the fuzz.token_set_ratio() function:

fuzz.token_set_ratio(s1, s2, force_ascii=True, full_process=True)

When partial is True, this is the fuzz.partial_token_set_ratio() function:

fuzz.partial_token_set_ratio(s1, s2, force_ascii=True, full_process=True)
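A rough sketch of the token-set idea: turn each string into a set of tokens (which drops duplicates), split the tokens into the shared part and the two remainders, and take the best pairwise score among the rebuilt strings. The name token_set_ratio_sketch is hypothetical, and the tokenizing here is a bare lowercase-and-split, which is simpler than what the library does.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Compare the shared tokens against each string's shared-plus-unique tokens."""
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


# The repeated "fuzzy" collapses in the set, so both strings reduce to the same tokens.
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```

Because the shared tokens are compared against each combined string, one string being a token subset of the other also scores 100.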

Process

Returns the fuzzy-matched strings together with their similarity scores.

    >>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
    >>> process.extract("new york jets", choices, limit=2)
        [('New York Jets', 100), ('New York Giants', 78)]
    >>> process.extractOne("cowboys", choices)
        ("Dallas Cowboys", 90)

You can pass additional parameters to extractOne to make it use a specific matching mode. A typical use case is matching file paths.
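How a pluggable matching mode changes the result can be sketched without the library. Everything below is a stand-in built on difflib: extract_one, ratio, and token_sort_ratio are hypothetical helpers written for this example, and the scores they produce are not the ones process.extractOne would return with its default scorer.

```python
import re
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Case-insensitive plain ratio on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def token_sort_ratio(a: str, b: str) -> int:
    """Plain ratio after lowercasing, stripping punctuation, and sorting tokens."""
    def norm(s: str) -> str:
        return " ".join(sorted(re.sub(r"[^a-z0-9]+", " ", s.lower()).split()))
    return int(round(100 * SequenceMatcher(None, norm(a), norm(b)).ratio()))


def extract_one(query, choices, scorer=ratio):
    """Return the (choice, score) pair that scores highest under the given scorer."""
    return max(((c, scorer(query, c)) for c in choices), key=lambda pair: pair[1])


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(extract_one("cowboys", choices))
# Switching the scorer makes the lookup ignore word order.
print(extract_one("jets york new", choices, scorer=token_sort_ratio))
```

Passing the scoring function as an argument keeps the search loop unchanged while the notion of similarity is swapped out, which is the same pattern as choosing a matching mode in extractOne.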

Known Ports

FuzzyWuzzy has been ported to other languages. The ports we know of are:

Java: xpresso's fuzzywuzzy implementation
Java: fuzzywuzzy (java port)
Rust: fuzzyrusty (Rust port)
JavaScript: fuzzball.js (JavaScript port)
C++: Tmplt/fuzzywuzzy
C#: fuzzysharp (.Net port)
Go: go-fuzzywuzzy (Go port)

References

https://www.jianshu.com/p/ed22a82b45d1

https://blog.csdn.net/sunyao_123/article/details/76942809
