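I previously wrote about annotating Chinese characters with pinyin and also posted about pinyin matching, but that approach could not handle polyphonic characters (characters with more than one reading).
For example:
Chinese: 不能說的秘密
Pinyin: bu|fou nai|neng shuo|shui|yue de|di bi|mi mi
Inputs such as bunengshuodebimi, bunengshuodemimi, bnsdmm, bnengyuedebimi, and buneshdimmi should all match.
Many of these characters are polyphonic, and working out the exact reading of each one in context is hard, so here we simply match against every reading.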
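Approach:
Match on the first letter of each character's pinyin.
The first letters of the characters are: b|f n s|y d b|m m
Suppose the pattern string is nengshuodemimi (call it target).
Find target[0] among the first letters, then match the following letters of target against the first letters of the characters that come after it.
Once the span of characters with matching first letters is known, scan the pinyin of that span to see whether it contains the pattern; if it does, return true.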
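The first step can be sketched in isolation. The snippet below is only an illustration, not part of the original code: the class name `InitialsSketch` and both method names are made up here, and it assumes the notation used in the example above (characters separated by a space, readings separated by `|`).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class InitialsSketch
{
    // For each character, collect the distinct first letters of its readings.
    // Assumed input format: characters separated by ' ', readings by '|'.
    public static List<string> FirstLetters(string pinyin)
    {
        var result = new List<string>();
        string[] characters = pinyin.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string readings in characters)
        {
            var initials = new List<char>();
            foreach (string reading in readings.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries))
            {
                // Keep each first letter once, in order of first appearance.
                if (!initials.Contains(reading[0]))
                {
                    initials.Add(reading[0]);
                }
            }
            result.Add(string.Join("|", initials));
        }
        return result;
    }

    // Indexes of the characters that have a reading starting with `first`.
    // These are the positions where a match for the pattern could begin.
    public static List<int> CandidateStarts(List<string> initials, char first)
    {
        var starts = new List<int>();
        for (int i = 0; i < initials.Count; i++)
        {
            if (initials[i].IndexOf(first) >= 0)
            {
                starts.Add(i);
            }
        }
        return starts;
    }
}
```

For the title above, `FirstLetters("bu|fou nai|neng shuo|shui|yue de|di bi|mi mi")` gives b|f, n, s|y, d, b|m, m, and `CandidateStarts` with `'n'` (the first letter of nengshuodemimi) gives the single index 1, i.e. the character 能.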
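That explanation is not very clear, so let's look at the code.
First, a class that looks up the pinyin of Chinese characters; the pinyin data file is linked at the end.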
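    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    public static class PinyinHelper
    {
        // Maps a character's code point to its pinyin reading(s).
        private static Dictionary<int, string> pinyindictionary = new Dictionary<int, string>();

        // Open question: how should the members of a static class be initialized?
        // (A static constructor is one option; as written, Init() must be called once before use.)
        public static void Init()
        {
            using (StreamReader reader = new StreamReader("Data/pinyindata.txt", Encoding.UTF8))
            {
                string readline;
                while ((readline = reader.ReadLine()) != null)
                {
                    // Each line holds a hexadecimal code point and its pinyin, separated by a space.
                    string[] lines = readline.Split(' ');
                    if (lines.Length < 2)
                    {
                        continue; // skip blank or malformed lines
                    }
                    // The indexer (rather than Add) keeps a repeated Init() call from throwing.
                    pinyindictionary[Convert.ToInt32(lines[0], 16)] = lines[1];
                }
            }
        }

        // Throws KeyNotFoundException for characters missing from the data file.
        private static string GetPinyin(char ch)
        {
            return pinyindictionary[ch];
        }

        // Returns the pinyin of every character, separated by spaces.
        public static string GetPinyin(string hanzis)
        {
            if (string.IsNullOrEmpty(hanzis))
            {
                return string.Empty;
            }
            StringBuilder builder = new StringBuilder();
            for (int i = 0; i < hanzis.Length - 1; i++)
            {
                builder.Append(GetPinyin(hanzis[i]));
                builder.Append(' ');
            }
            builder.Append(GetPinyin(hanzis[hanzis.Length - 1]));
            return builder.ToString();
        }

        // Is the character a Chinese character?
        // The bounds are inclusive so that 0x4E00 (一) and 0x9FA5 are accepted.
        private static bool IsCharChinese(char c)
        {
            return 0x4e00 <= c && c <= 0x9fa5;
        }
    }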
Next is the matching algorithm.
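    // True if any reading of this character (readings separated by ',') starts with ch.
    private bool IsFirstPinyinContains(string pinyin, char ch)
    {
        string[] pinyins = pinyin.Split(',');
        foreach (string py in pinyins)
        {
            if (py.Length > 0 && py[0] == ch)
            {
                return true;
            }
        }
        return false;
    }

    // True if target is a subsequence of the letters in pinyins[start..end).
    private bool IsPinyinContainTarget(string[] pinyins, string target, int start, int end)
    {
        int k = 0;
        for (int i = start; i < end; i++)
        {
            foreach (char ch in pinyins[i])
            {
                if (ch == target[k])
                {
                    k++;
                    if (k == target.Length)
                    {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    private bool PinyinMatch(string[] pinyins, string target)
    {
        if (string.IsNullOrEmpty(target))
        {
            return false; // nothing to match; also avoids indexing target[0]
        }
        int start, end;
        for (int i = 0; i < pinyins.Length; i++)
        {
            // Find a character that can begin the match. Every reading is checked,
            // not just the first one, so that polyphonic characters are handled.
            if (IsFirstPinyinContains(pinyins[i], target[0]))
            {
                start = i;
                int j = i + 1;
                // Extend the span while later letters of target match the first
                // letters of the following characters.
                for (int k = 1; k < target.Length; k++)
                {
                    if (j < pinyins.Length && IsFirstPinyinContains(pinyins[j], target[k]))
                    {
                        j++;
                    }
                }
                end = j;
                // Check whether the pinyin from start to end contains target.
                if (IsPinyinContainTarget(pinyins, target, start, end))
                {
                    return true;
                }
            }
        }
        return false;
    }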
The algorithm has some shortcomings. Does anyone have suggestions?
Pinyin data file: http://files.cnblogs.com/bomo/pinyindata.zip
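One concrete shortcoming: the subsequence check above can pick letters from different readings of the same character, so a pattern such as bo would match 不 (b from bu, o from fou). A possible stricter variant, sketched below, uses backtracking instead: each character must consume a non-empty prefix of one of its readings, and the characters must be consecutive. This is a different technique from the code above, not the original algorithm. The class name `PrefixPinyinMatcher` is hypothetical, and the sketch assumes each array entry holds the `,`-separated readings of one character, as the code above does.

```csharp
using System;

// Hypothetical alternative matcher, for illustration only.
public static class PrefixPinyinMatcher
{
    // pinyins[i] holds the ','-separated readings of the i-th character (assumed format).
    public static bool Match(string[] pinyins, string target)
    {
        if (string.IsNullOrEmpty(target))
        {
            return false;
        }
        // The match may begin at any character of the title.
        for (int start = 0; start < pinyins.Length; start++)
        {
            if (MatchFrom(pinyins, start, target, 0))
            {
                return true;
            }
        }
        return false;
    }

    // True if target[pos..] can be consumed by the characters index, index + 1, ...
    private static bool MatchFrom(string[] pinyins, int index, string target, int pos)
    {
        if (pos == target.Length)
        {
            return true; // whole pattern consumed
        }
        if (index == pinyins.Length)
        {
            return false; // ran out of characters
        }
        foreach (string reading in pinyins[index].Split(','))
        {
            // Try every non-empty prefix of this reading, shortest first.
            for (int len = 1; len <= reading.Length && pos + len <= target.Length; len++)
            {
                if (reading[len - 1] != target[pos + len - 1])
                {
                    break; // longer prefixes cannot match either
                }
                if (MatchFrom(pinyins, index + 1, target, pos + len))
                {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With the readings of 不能說的秘密 given as `{ "bu,fou", "nai,neng", "shuo,shui,yue", "de,di", "bi,mi", "mi" }`, this sketch accepts bunengshuodebimi, bnsdmm, bnengyuedebimi, buneshdimmi, and nengshuodemimi, and rejects bo. The backtracking is exponential in the worst case, which should be acceptable for short titles but is worth keeping in mind for long strings.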