String similarity in R: the stringdist package

Reposted article. 2022-01-19 23:47. Category: data mining and algorithms

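String similarity can be computed with the adist function in the utils package, the stringDist function in the MKmisc package, or comparison functions such as jarowinkler in the RecordLinkage package. This article covers the stringdist and stringdistmatrix functions of the stringdist package.
The author of the stringdist package is Mark van der Loo.
stringdist computes the distance between corresponding elements of a and b; when one argument has fewer elements than the other, the shorter one is recycled. stringdistmatrix returns a distance matrix whose rows correspond to the elements of a and whose columns correspond to the elements of b.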
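
The difference between the element-wise form and the matrix form can be sketched in base R. This is only an illustration: it uses utils::adist (plain Levenshtein distance) as a stand-in for the stringdist package, and the variable names are made up for the example.

```r
# Base-R sketch (utils::adist) of the two calling conventions.
a <- c("foo", "bar", "boo")
b <- c("baz", "buz")

# Matrix form: one row per element of a, one column per element of b.
dist_matrix <- adist(a, b)

# Element-wise form: pair a[i] with b[i], recycling the shorter vector.
n <- max(length(a), length(b))
a_rec <- rep_len(a, n)
b_rec <- rep_len(b, n)
dist_pairs <- mapply(function(x, y) adist(x, y)[1, 1], a_rec, b_rec,
                     USE.NAMES = FALSE)

print(dist_matrix)  # 3 x 2 matrix
print(dist_pairs)   # vector of length 3
```

The matrix form compares every element of a with every element of b, while the element-wise form returns one value per recycled pair.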

stringdist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, nthread = getOption("sd_num_thread"))

stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, useNames = c("none", "strings", "names"), ncores = 1, cluster = NULL, nthread = getOption("sd_num_thread"))
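Parameters:
a, b: the target objects, character vectors
method: distance method; defaults to "osa" and can be set to "jaccard", "hamming", "jw" (Jaro-Winkler) and so on
useBytes: compare byte by byte rather than character by character
weight: weights for deletion, insertion, substitution and transposition (d, i, s, t); each must be positive and no greater than 1
maxDist: upper limit on the distance
q: q-gram size, used when method is 'qgram', 'jaccard' or 'cosine'; must be nonnegative
p: penalty factor for the Jaro-Winkler distance, between 0 and 0.25; the default of 0 gives the plain Jaro distance
nthread: maximum number of threads
useNames: label the rows and columns of the output with the input strings or with the names of the inputs
ncores: number of cores
cluster: an optional custom cluster object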

Examples:

> stringdistmatrix(c("foo","bar","boo"),c("baz","buz"))
     [,1] [,2]
[1,]    3    3
[2,]    1    2
[3,]    2    2

> # string distance matching is case sensitive:
> stringdist("ABC","abc")
[1] 3
>
> # so you may want to normalize a bit:
> stringdist(tolower("ABC"),"abc")
[1] 0
>
> # stringdist recycles the shortest argument:
> stringdist(c('a','b','c'),c('a','c'))
Warning message: longer object length is not a multiple of shorter object length
[1] 0 1 1
>
> # different edit operations may be weighted; e.g. weighted transposition (the fourth weight):
> stringdist('ab','ba',weight=c(1,1,1,0.5))
[1] 0.5
>
> # Non-unit weights for insertion and deletion make the distance metric asymmetric
> stringdist('ca','abc')
[1] 3
> stringdist('abc','ca')
[1] 3
> stringdist('ca','abc',weight=c(0.5,1,1,1))
[1] 2
> stringdist('abc','ca',weight=c(0.5,1,1,1))
[1] 2.5

> # q-grams are based on the difference between occurrences of q consecutive characters
> # in string a and string b.
> # Since each character abc occurs in 'abc' and 'cba', the q=1 distance equals 0:
> stringdist('abc','cba',method='qgram',q=1)
[1] 0
>
> # since the first string consists of 'ab','bc' and the second
> # of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
> stringdist('abc','cba',method='qgram',q=2)
[1] 4
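
The q-gram distance in the two examples above can be reproduced by hand. The following base-R sketch is an assumption about how the computation works (sum of absolute differences between q-gram counts), not the package's actual implementation; qgrams_of and qgram_distance are hypothetical helper names.

```r
# Split a single string into its q-grams (substrings of q consecutive characters).
qgrams_of <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

# q-gram distance: sum of absolute differences between the q-gram counts.
qgram_distance <- function(a, b, q = 1) {
  ga <- qgrams_of(a, q)
  gb <- qgrams_of(b, q)
  keys <- union(ga, gb)
  # Tabulate both strings over the same set of q-grams so the counts line up.
  ca <- table(factor(ga, levels = keys))
  cb <- table(factor(gb, levels = keys))
  sum(abs(ca - cb))
}

print(qgram_distance("abc", "cba", q = 1))  # same characters, different order
print(qgram_distance("abc", "cba", q = 2))  # ab, bc versus cb, ba
```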

> stringdist('MARTHA','MATHRA',method='jw')
[1] 0.08333333
> # Note that stringdist gives a _distance_ where wikipedia gives the corresponding
> # _similarity measure_. To get the wikipedia result:
> 1 - stringdist('MARTHA','MATHRA',method='jw')
[1] 0.9166667
>
> # The corresponding Jaro-Winkler distance can be computed by setting p=0.1
> stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.06666667
> # or, as a similarity measure
> 1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.9333333
>
> # This gives distance 1 since Euler and Gauss translate to different soundex codes.
> stringdist('Euler','Gauss',method='soundex')
[1] 1
> # Euler and Ellery translate to the same code and have distance 0
> stringdist('Euler','Ellery',method='soundex')
[1] 0
>
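
The two 'jw' results for MARTHA and MATHRA above are related by the usual Winkler prefix adjustment. The sketch below assumes the standard definition (distance scaled by 1 - l * p, where l is the common prefix length capped at 4); it takes the Jaro distance from the transcript rather than recomputing it, and common_prefix_length is a hypothetical helper.

```r
# Length of the common prefix of two strings, capped at max_len characters.
common_prefix_length <- function(a, b, max_len = 4) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb), max_len)
  len <- 0
  for (i in seq_len(n)) {
    if (ca[i] != cb[i]) break
    len <- len + 1
  }
  len
}

jaro_dist <- 1 / 12   # Jaro distance for MARTHA/MATHRA (0.08333333 in the transcript)
p <- 0.1              # Winkler penalty factor
l <- common_prefix_length("MARTHA", "MATHRA")  # shared prefix "MA"

# Assumed Winkler adjustment: shrink the distance for strings sharing a prefix.
jw_dist <- jaro_dist * (1 - l * p)
print(jw_dist)
```

With p = 0 the factor is 1, which agrees with the plain Jaro distance being returned by default.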
————————————————

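The adist function computes the Levenshtein edit distance between two strings. It can be converted into a similarity measure, for example 1 - (Levenshtein edit distance / length of the longer string).

The levenshteinSim function in the RecordLinkage package also does this directly, and may be faster than adist.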

> library(RecordLinkage)
> levenshteinSim("apple", "apple")
[1] 1
> levenshteinSim("apple", "aaple")
[1] 0.8
> levenshteinSim("apple", "appled")
[1] 0.8333333
> levenshteinSim("appl", "apple")
[1] 0.8
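
The same conversion can be written with base R's adist, with no extra package. This is a minimal sketch for single strings; lev_similarity is a hypothetical name, and the case of two empty strings (division by zero) is not handled.

```r
# Levenshtein similarity for two single strings, using utils::adist.
# Assumes at least one of the strings is non-empty.
lev_similarity <- function(a, b) {
  d <- adist(a, b)[1, 1]            # Levenshtein edit distance
  1 - d / max(nchar(a), nchar(b))   # normalise by the longer string
}

print(lev_similarity("apple", "apple"))
print(lev_similarity("apple", "aaple"))
print(lev_similarity("apple", "appled"))
print(lev_similarity("appl", "apple"))
```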

ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is slower than either. Using the rbenchmark package:

> benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
                                          test replications elapsed relative
1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1
  user.self sys.self user.child sys.child
1     3.583    0.452          0         0
> benchmark(adist("applesauce", "aaplesauce"), replications=100000)
                                test replications elapsed relative user.self
1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707
  sys.self user.child sys.child
1    0.461          0         0
> benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
                                         test replications elapsed relative
1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1
  user.self sys.self user.child sys.child
1      6.49    0.743          0         0

This overhead comes purely from the code of levenshteinSim, which is just a wrapper around levenshteinDist:

> levenshteinSim
function (str1, str2) 
{
    return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1), nchar(str2))))
}

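FYI: if you are always comparing two single strings rather than vectors, you can create a new version that uses max instead of pmax and cut the run time by roughly 25%: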

mylevsim = function (str1, str2) 
{
    return(1 - (levenshteinDist(str1, str2)/max(nchar(str1), nchar(str2))))
}
> benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
                                   test replications elapsed relative user.self
1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987
  sys.self user.child sys.child
1    0.627          0         0
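
Why the max version is only safe for single strings can be seen in a small base-R sketch (the vectors x and y are made up for illustration): pmax works pair by pair, while max collapses everything into one number.

```r
x <- c("apple", "fig")
y <- c("appled", "figs")

# pmax: longer length within each pair, one value per pair.
per_pair <- pmax(nchar(x), nchar(y))

# max: a single value across all strings, so every pair would be
# normalised by the same length if x and y were vectors.
overall <- max(nchar(x), nchar(y))

print(per_pair)
print(overall)
```

For length-one inputs the two give the same result, which is why the substitution is valid in that case.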

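Long story short, there is little difference in performance between adist and levenshteinDist, although the former is preferable if you do not want to add a package dependency. How you convert the distance into a similarity measure does have some effect on performance.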
