An Interesting Requirement: Measuring How Well Chinese Names Match

Reposted article. Originally published 2014-07-13 15:01. Tags: algorithms / programming languages / Chinese matching

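Introduction

I have recently been leading an internet project. Internet work always feels fresh, which is probably why so many fellow programmers are drawn to it. This project has already had me build a big-data application on HBase, and a recent requirements meeting produced another small requirement that turned out to be rather interesting. Requirements like these rarely came up in the dull internal enterprise applications I used to build, which mostly just encode business processes.

I can't disclose the actual requirement, but in short: how do we decide whether two company names refer to the same company? At heart this is string equality in Java, and the simplest tool is equals. In practice, however, the company names are typed in by hand by customers, so small discrepancies are inevitable and equals is not good enough. Take "博客園科技發展有限公司" and "博客園科技發展（北京）有限公司": these are clearly the same company, yet equals returns false.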
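To make the problem concrete, here is a minimal sketch of the exact comparison. The class and method names (EqualsDemo, sameByEquals) are made up for illustration; only String.equals comes from the text.

```java
final class EqualsDemo {
    /** Exact comparison: any extra or missing character makes this false. */
    static boolean sameByEquals(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String withoutCity = "博客園科技發展有限公司";
        String withCity = "博客園科技發展（北京）有限公司";
        // The names differ only by the bracketed city, yet equals rejects them.
        System.out.println(sameByEquals(withoutCity, withCity)); // prints false
    }
}
```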

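Proposing a solution

In the meeting I proposed a simple solution: segment both names into words, compute a match degree between the two, and treat the names as equal once the match degree reaches a certain value. However clever the algorithm, its accuracy can never reach 100%, and I stressed this to the business colleagues repeatedly so that they would not come after me when a match turns out wrong. Fortunately they only want a hint, not a strict criterion, so the system only needs to make a preliminary judgment. The accuracy requirement is therefore not especially high; the higher the better.

I joined this project as PM and have been doubling as SM all along, so the technical decision was naturally mine to make. With no discussion and no objections, the solution was settled right there in the meeting.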

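A little research

I have a lot on my plate lately, so there is no time to look into this during work hours; the weekend had to do. I don't actually know much about word segmentation. It is closely tied to algorithms, and my algorithm background is, let's say, nothing to brag about. Still, whoever leads a project has to charge ahead. If the time or space cost turns out too high, or the accuracy too low, there will still be time to find an expert to optimize it later.

The approach is fairly clear. There are two things to do: first "segmentation", then "matching".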

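Word segmentation

Segmentation is the easier step. Plenty of ready-made libraries exist, and it is just a matter of picking one. Lucene was the first that came to mind, since I am something of an Apache fanboy, so without further ado I downloaded Lucene and started exploring.

Below is an example I found online, slightly modified.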

package com.creditease.borrow.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App {

    public static void main(String[] args) throws IOException {
        String text = "博客園科技發展（北京）有限公司";
        SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);
        TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text);
        // The attribute object is reused; it holds the current token's text.
        CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
        // reset() has to be called before the first incrementToken().
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.print(charTermAttribute.toString() + "  ");
        }
        tokenStream.end();
        tokenStream.close();
        smartChineseAnalyzer.close();
    }
}

The output is as follows.

博  客  園  科技  發展  北京  有限公司

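This segmentation is just about acceptable. Ideally "博客" would stay together as one word, but it seems Lucene's dictionary does not contain it. No matter; this does not affect the plan, so let's move on to matching.

Matching

With segmentation in place, what remains is to match two arrays of strings and obtain a match degree. A match degree needs a denominator and a numerator. We can take the length of one array as the denominator and, as the numerator, the number of tokens from the other array that find a match. (The program below actually counts the tokens that do not match, so what it returns is a mismatch rate.) To keep the program simple, we rely on the de-duplicating behavior of a Set for the calculation, as in the following program.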

package com.creditease.borrow.lucene;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App {

    private static SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);

    public static void main(String[] args) throws IOException {
        String text1 = "博客園科技發展（北京）有限公司";
        String text2 = "博客園科技發展有限公司";
        System.out.println(oneWayMatch(text1, text2));
    }

    public static double oneWayMatch(String text1, String text2) {
        try {
            // Collect the distinct tokens of text1; their count is the denominator.
            Set<String> set = new HashSet<String>(10);
            TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text1);
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int denominator = set.size();
            tokenStream.end();
            tokenStream.close();

            // Add the tokens of text2 to the same set; the growth in size is
            // the number of text2 tokens that text1 does not contain.
            tokenStream = smartChineseAnalyzer.tokenStream("field", text2);
            charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int numerator = set.size() - denominator;
            double unmatchRate = ((double) numerator) / denominator;
            tokenStream.end();
            tokenStream.close();
            return unmatchRate;
        } catch (IOException e) {
            // On an I/O error, treat the two names as completely unmatched.
            return 1D;
        }
    }
}

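The output is 0, which means a match degree of 100%. That is clearly wrong: the two strings are obviously not identical, so a match degree of 100% shows that the algorithm still has a problem.

Looking closer, the problem is that the numerator counts the tokens of text2 that have no match, and every token of text2 is contained in text1. The numerator therefore ends up as 0, and so does the mismatch rate (unmatchRate). To make up for this flaw, after some thought I decided on a "two-way" (twoWay) approach; that is also why the method above is named "one-way" match (oneWay).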
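The asymmetry described above can be reproduced without Lucene. The following is a minimal sketch, not the program from this post: the class name OneWaySketch and the helper oneWayMismatch are hypothetical, and the token lists are hard-coded on the assumption that the analyzer splits both names the way the segmentation output earlier suggests.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OneWaySketch {
    /**
     * Number of distinct tokens in {@code other} that are absent from
     * {@code base}, divided by the number of distinct tokens in {@code base}.
     */
    static double oneWayMismatch(List<String> base, List<String> other) {
        Set<String> baseSet = new HashSet<>(base);
        Set<String> onlyInOther = new HashSet<>(other);
        onlyInOther.removeAll(baseSet); // keep tokens that base does not contain
        return (double) onlyInOther.size() / baseSet.size();
    }

    public static void main(String[] args) {
        // Assumed segmentations; a real run would obtain these from the analyzer.
        List<String> longName = Arrays.asList("博", "客", "園", "科技", "發展", "北京", "有限公司");
        List<String> shortName = Arrays.asList("博", "客", "園", "科技", "發展", "有限公司");

        // Every token of shortName also occurs in longName: mismatch is 0.
        System.out.println(oneWayMismatch(longName, shortName));
        // Swapping the arguments exposes the extra token "北京": 1 of 6, about 0.1667.
        System.out.println(oneWayMismatch(shortName, longName));
    }
}
```

Only the second call notices the difference, which is why a single direction is not enough.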

Two-way matching simply swaps the two arguments and matches once more, so the program changes only slightly, as follows.

package com.creditease.borrow.lucene;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App {

    private static SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);

    public static void main(String[] args) throws IOException {
        String text1 = "博客園科技發展（北京）有限公司";
        String text2 = "博客園科技發展有限公司";
        System.out.println(twoWayMatch(text1, text2));
    }

    // Sum of the mismatch rates measured in both directions.
    public static double twoWayMatch(String text1, String text2) {
        return (oneWayMatch(text1, text2) + oneWayMatch(text2, text1));
    }

    public static double oneWayMatch(String text1, String text2) {
        try {
            Set<String> set = new HashSet<String>(10);
            TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text1);
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int denominator = set.size();
            tokenStream.end();
            tokenStream.close();

            tokenStream = smartChineseAnalyzer.tokenStream("field", text2);
            charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int numerator = set.size() - denominator;
            double unmatchRate = ((double) numerator) / denominator;
            tokenStream.end();
            tokenStream.close();
            return unmatchRate;
        } catch (IOException e) {
            return 1D;
        }
    }
}

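This program outputs 0.1666666666..., roughly 0.17, so the match degree is 83%. This value is clearly closer to reality. As you can see, two-way matching builds on one-way matching: adding the results for the two orders gives the mismatch rate.

The matter could have ended here, but a bigger difficulty remains: what threshold should the match degree be set to? So far I have no good answer. Going by the small example above, 80% would do.

But the next example shows that this number is quite wrong, or rather that the matching algorithm still has a problem.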

public static void main(String[] args) throws IOException {
    String text1 = "博客園科技發展（北京）有限公司";
    String text2 = "博客園有限公司";
    System.out.println(twoWayMatch(text1, text2));
}

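The result is 0.75, so the match degree is only about 25%. Yet these two company names obviously match closely, because the three most important characters match.

On closer analysis, the problem becomes one of weights. Each word should carry a different weight, which should make the matching somewhat more accurate. Let's improve the program a little. (From here on, a trailing double slash marks the main places where the program changed.)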

package com.creditease.borrow.lucene;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App {

    private static SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);

    private static List<String> smallWeightWords = Arrays.asList("公司", "有限公司", "科技", "發展", "股份");

    private static double smallWeight = 0.3D;

    public static void main(String[] args) throws IOException {
        String text1 = "博客園科技發展（北京）有限公司";
        String text2 = "博客園有限公司";
        System.out.println(twoWayMatch(text1, text2));
    }

    public static double twoWayMatch(String text1, String text2) {
        return (oneWayMatch(text1, text2) + oneWayMatch(text2, text1));
    }

    public static double oneWayMatch(String text1, String text2) {
        try {
            Set<String> set = new HashSet<String>(10);
            TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text1);
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int denominator = set.size();
            tokenStream.end();
            tokenStream.close();

            tokenStream = smartChineseAnalyzer.tokenStream("field", text2);
            charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            int smallWeightWordsCount = 0;//
            while (tokenStream.incrementToken()) {
                String word = charTermAttribute.toString();//
                int tempSize = set.size();//
                set.add(word);//
                if (tempSize + 1 == set.size() && smallWeightWords.contains(word)) {//
                    smallWeightWordsCount++;//
                }//
            }
            int numerator = set.size() - denominator;
            double unmatchRate = (smallWeightWordsCount * smallWeight + numerator - ((double) smallWeightWordsCount)) / denominator;//
            tokenStream.end();
            tokenStream.close();
            return unmatchRate;
        } catch (IOException e) {
            return 1D;
        }
    }
}

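The program now outputs 0.4, a match degree of 60%. That is still somewhat unsatisfying. A closer look at the program shows that the mismatch rate is computed crosswise: the number of unmatched tokens from one array is divided by the size of the other array, which can produce "extreme" values.

The program needs adjusting so that each array is measured against itself, which avoids that situation, as follows.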

package com.creditease.borrow.lucene;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class App {

    private static SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer(Version.LUCENE_47);

    private static List<String> smallWeightWords = Arrays.asList("公司", "有限公司", "科技", "發展", "股份");

    private static double smallWeight = 0.3D;

    public static void main(String[] args) throws IOException {
        String text1 = "博客園科技發展（北京）有限公司";
        String text2 = "博客園有限公司";
        System.out.println(twoWayMatch(text1, text2));
    }

    public static double twoWayMatch(String text1, String text2) {
        return (oneWayMatch(text1, text2) + oneWayMatch(text2, text1));
    }

    public static double oneWayMatch(String text1, String text2) {
        try {
            Set<String> set = new HashSet<String>(10);
            TokenStream tokenStream = smartChineseAnalyzer.tokenStream("field", text1);
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                set.add(charTermAttribute.toString());
            }
            int originalCount = set.size();//
            tokenStream.end();
            tokenStream.close();

            tokenStream = smartChineseAnalyzer.tokenStream("field", text2);
            charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            int smallWeightWordsCount = 0;
            int denominator = 0;//
            while (tokenStream.incrementToken()) {
                denominator++;//
                String word = charTermAttribute.toString();
                int tempSize = set.size();
                set.add(word);
                if (tempSize + 1 == set.size() && smallWeightWords.contains(word)) {
                    smallWeightWordsCount++;
                }
            }
            int numerator = set.size() - originalCount;
            double unmatchRate = (smallWeightWordsCount * smallWeight + numerator - ((double) smallWeightWordsCount)) / denominator;//
            tokenStream.end();
            tokenStream.close();
            return unmatchRate;
        } catch (IOException e) {
            return 1D;
        }
    }
}

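The program outputs 0.2285714285714286, a match degree of about 77%, which is a fairly reasonable figure. The main change this time is the denominator: it is now the size of the array that the unmatched tokens come from.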
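As a cross-check of the arithmetic behind that figure, here is a minimal Lucene-free sketch of the same weighted calculation. It is an illustration rather than the program above: the class name WeightedMatchSketch is hypothetical, and the two token lists are assumptions about how the analyzer segments the names. The low-weight words and the value 0.3 are taken from the post.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WeightedMatchSketch {
    static final Set<String> LOW_WEIGHT_WORDS =
            new HashSet<>(Arrays.asList("公司", "有限公司", "科技", "發展", "股份"));
    static final double LOW_WEIGHT = 0.3;

    /** Weighted count of tokens in other that base lacks, divided by other's own token count. */
    static double oneWayMismatch(List<String> base, List<String> other) {
        Set<String> seen = new HashSet<>(base);
        double weightedMisses = 0.0;
        for (String token : other) {
            // add() returns true only the first time a token missing from base shows up.
            if (seen.add(token)) {
                weightedMisses += LOW_WEIGHT_WORDS.contains(token) ? LOW_WEIGHT : 1.0;
            }
        }
        return weightedMisses / other.size();
    }

    static double twoWayMismatch(List<String> a, List<String> b) {
        return oneWayMismatch(a, b) + oneWayMismatch(b, a);
    }

    public static void main(String[] args) {
        // Assumed segmentations of the two names used in the post.
        List<String> longName = Arrays.asList("博", "客", "園", "科技", "發展", "北京", "有限公司");
        List<String> shortName = Arrays.asList("博", "客", "園", "有限公司");
        // Misses: 科技 (0.3) + 發展 (0.3) + 北京 (1.0) = 1.6 over 7 tokens, roughly 0.2286.
        System.out.println(twoWayMismatch(longName, shortName));
    }
}
```

The direction from the short name to the long name contributes nothing here, so the whole result comes from the three tokens that only the long name has.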

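What remains is simple: make everything that might change configurable, for example by reading it from a database. The following items should be configurable.

1. The low-weight words, i.e. smallWeightWords.

2. The low-weight value, i.e. smallWeight.

3. The minimum match degree, i.e. the match degree at or above which we consider the two names to be the same company.

I won't go into how to make these configurable; a real web project has countless ways to do it, the most common being to store them in a database. That said, the first item is better suited to a database, while the last two are better kept in a configuration file. Wherever they live, these settings must support dynamic refresh so the matching rules can be adjusted while the application is running.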
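One possible shape for such refreshable settings is sketched below, purely as an illustration. Everything named here is an assumption: the classes MatchConfig and MatchConfigHolder, the property keys (match.lowWeightWords, match.lowWeight, match.minMatchDegree), and the placeholder default of 0.75 for the minimum match degree. Only the value 0.3 and the three configurable items come from the post. The settings are read from a java.util.Properties object, which could be filled from a file or from database rows.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/** Immutable snapshot of the three tunable settings (all property keys are hypothetical). */
final class MatchConfig {
    final Set<String> lowWeightWords;
    final double lowWeight;
    final double minMatchDegree;

    private MatchConfig(Set<String> lowWeightWords, double lowWeight, double minMatchDegree) {
        this.lowWeightWords = Collections.unmodifiableSet(lowWeightWords);
        this.lowWeight = lowWeight;
        this.minMatchDegree = minMatchDegree;
    }

    static MatchConfig fromProperties(Properties props) {
        Set<String> words = new HashSet<>();
        for (String word : props.getProperty("match.lowWeightWords", "").split(",")) {
            if (!word.trim().isEmpty()) {
                words.add(word.trim());
            }
        }
        double lowWeight = Double.parseDouble(props.getProperty("match.lowWeight", "0.3"));
        // 0.75 is a placeholder default, not a value recommended by the post.
        double minMatchDegree = Double.parseDouble(props.getProperty("match.minMatchDegree", "0.75"));
        return new MatchConfig(words, lowWeight, minMatchDegree);
    }

    /** The program reports a mismatch rate, so the match degree is 1 minus that rate. */
    boolean isSameCompany(double mismatchRate) {
        return 1.0 - mismatchRate >= minMatchDegree;
    }
}

/** Holds the current snapshot; refresh() swaps it atomically while the application runs. */
final class MatchConfigHolder {
    private static final AtomicReference<MatchConfig> CURRENT =
            new AtomicReference<>(MatchConfig.fromProperties(new Properties()));

    static MatchConfig current() {
        return CURRENT.get();
    }

    static void refresh(Properties props) {
        CURRENT.set(MatchConfig.fromProperties(props));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("match.lowWeightWords", "公司,有限公司,科技,發展,股份");
        props.setProperty("match.lowWeight", "0.3");
        props.setProperty("match.minMatchDegree", "0.75");
        refresh(props);
        System.out.println(current().isSameCompany(0.2285714285714286)); // prints true
        System.out.println(current().isSameCompany(0.4));                // prints false
    }
}
```

Keeping the snapshot immutable means a refresh never leaves a matcher reading half-updated settings; callers just ask the holder for the current snapshot each time they compare two names.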

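Summary

My algorithm is not necessarily the best; in fact it is certainly not the best. But sometimes working through a problem step by step, and watching the answer move closer to your own judgment, is a pleasure in itself, isn't it?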
