[NLP] Forward Maximum Matching, Reverse Maximum Matching, and Bidirectional Maximum Matching: Code Implementation

Reposted article. 2019-12-09 20:24, FYK

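from collections import defaultdict
'''
Maximum Matching (MM) algorithms:
    forward maximum matching,
    reverse maximum matching,
    bidirectional maximum matching.

Some basic principles in designing a word segmentation algorithm:
1. The coarser the granularity, the better. For segmentation that feeds semantic analysis,
    longer words carry more precise meaning. For example, "公安局長" (police chief) can be split as
    "公安 局長", "公安局 長", or kept whole as "公安局長". All are acceptable, but for semantic
    analysis "公安局長" is best (provided the dictionary contains that word).

2. The fewer non-dictionary words and the fewer single-character dictionary words, the better.
    A "non-dictionary word" here is a single character that is not in the dictionary, while a
    "single-character dictionary word" is a character that can stand alone as a word,
    such as "的", "了", "和", "你", "我", "他".
    For example, "技術和服務" (technology and services) can be split as "技術 和服 務" or "技術 和 服務".
    "務" cannot stand alone as a word (it is not in the dictionary), whereas "和" can
    (the dictionary must contain it), so "技術 和服 務" has 1 non-dictionary word and
    "技術 和 服務" has 0. The latter is therefore chosen.

3. The fewer words overall, the better. For the same number of characters, fewer words means
    fewer semantic units, so each unit carries more weight and accuracy tends to be higher.
'''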
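# Load the dictionary (each line of the file is treated as one word)
def load_dict(path):
    word_dict = set()  # a set removes duplicates and gives fast membership tests
    with open(path, 'r', encoding='UTF-8') as f:
        for line in f:
            word = line.strip()  # strip() removes leading/trailing whitespace, including the newline
            if word:  # skip blank lines so the empty string never enters the dictionary
                word_dict.add(word)
    return word_dict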
# Forward maximum matching
# Returns list[tuple(word, len)]: tuple[0] is the word and tuple[1] is its length;
# a length of 0 marks a non-dictionary (out-of-vocabulary) character
def MM(sentence, max_len, word_dict):
    if not sentence:  # covers both None and the empty string
        raise Exception("sentence can not be empty or None")  # nothing below runs once this is raised
    result = []
    i = 0
    while i < len(sentence):
        end = min(i + max_len, len(sentence))  # never look past the end of the sentence
        temp = sentence[i]
        index = i
        for j in range(i + 1, end + 1):
            if sentence[i:j] in word_dict:  # later (longer) matches overwrite earlier ones
                temp = sentence[i:j]
                index = j
        if index == i:  # no dictionary word starts at i: emit the single character as OOV
            result.append((temp, 0))
            i += 1
        else:
            result.append((temp, len(temp)))
            i = index
    return result

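# Reverse maximum matching
# Returns list[tuple(word, len)]: tuple[0] is the word and tuple[1] is its length;
# a length of 0 marks a non-dictionary (out-of-vocabulary) character
def RMM(sentence, max_len, word_dict):
    if not sentence:  # covers both None and the empty string
        raise Exception("sentence can not be empty or None")
    result = []
    i = len(sentence) - 1
    while i >= 0:
        start = max(i - max_len + 1, 0)  # leftmost start of a candidate ending at i
        temp = sentence[i]
        index = i
        for j in range(start, i + 1):  # longest candidate first
            if sentence[j:i+1] in word_dict:
                temp = sentence[j:i+1]
                index = j - 1
                break  # the first hit is the longest match ending at i
        if index == i:  # no dictionary word ends at i: emit the single character as OOV
            result.append((temp, 0))
            i -= 1
        else:
            result.append((temp, len(temp)))
            i = index
    result.reverse()  # words were collected right to left
    return result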

# Bidirectional maximum matching
# Runs both directions and keeps the higher-scoring result, following the principles above:
# more characters in multi-character words, fewer non-dictionary and single-character words
def BMM(sentence, max_len, word_dict):
    if not sentence:  # covers both None and the empty string
        raise Exception("sentence can not be empty or None")
    forward_result = MM(sentence, max_len, word_dict)
    backward_result = RMM(sentence, max_len, word_dict)
    def count_result(result):
        counter = defaultdict(int)
        for r in result:
            if r[1] == 0:
                counter['OOV'] += 1  # non-dictionary character
            elif r[1] == 1:
                counter['single'] += 1  # single-character dictionary word
            else:
                counter['multi'] += r[1]  # characters covered by multi-character words
        return counter['multi'] - counter['OOV'] - counter['single']
    forward_count = count_result(forward_result)
    backward_count = count_result(backward_result)
    if forward_count > backward_count:
        return forward_result
    else:  # ties go to the reverse result
        return backward_result



path = './data/dict.txt'
max_len = 3  # longest candidate tried; longer dictionary words (e.g. the 7-character 中華人民共和國) cannot match unless this is raised
word_dict = load_dict(path)
result = MM('我們在野生動物園玩。', max_len, word_dict)
print(result)
result = RMM('我們在野生動物園玩。', max_len, word_dict)
print(result)
result = BMM('我們在野生動物園玩', max_len, word_dict)
print(result)
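
The forward/reverse difference described in principle 2 of the docstring can be reproduced without a dictionary file. The following is a minimal sketch, not the listing above: the function names `forward_mm`, `backward_mm`, and `score` and the four-word dictionary are made up for illustration, and the functions return plain word lists rather than (word, len) tuples.

```python
def forward_mm(sentence, max_len, words):
    """Left to right: at each position take the longest dictionary word."""
    out, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in words:
                out.append(piece)
                i += size
                break
    return out


def backward_mm(sentence, max_len, words):
    """Right to left: at each position take the longest dictionary word."""
    out, end = [], len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = sentence[end - size:end]
            if size == 1 or piece in words:
                out.append(piece)
                end -= size
                break
    out.reverse()  # tokens were collected from the end of the sentence
    return out


def score(tokens, words):
    """Characters in multi-character words, minus one per single-character
    or out-of-dictionary token (the same idea as count_result in BMM)."""
    total = 0
    for tok in tokens:
        if tok not in words or len(tok) == 1:
            total -= 1
        else:
            total += len(tok)
    return total


# Hypothetical toy dictionary, chosen to match the docstring example.
words = {"技術", "和", "和服", "服務"}
sentence = "技術和服務"

forward = forward_mm(sentence, 2, words)
backward = backward_mm(sentence, 2, words)
print(forward, score(forward, words))
print(backward, score(backward, words))
```

With this toy dictionary the forward pass leaves "務" as a non-dictionary character, while the reverse pass does not. Under this scoring both segmentations get the same score, so the choice comes down to the tie-break, which in BMM favors the reverse result.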

Any downloaded dictionary will do, as long as the path is correct (path='./data/dict.txt').
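
As a rough illustration of what such a file might look like, the sketch below writes a tiny, made-up dictionary to a temporary file, loads it, and derives max_len from the longest entry instead of hard-coding it. It assumes one word per line, which is how load_dict reads the file. The helper name `load_words` and the file contents are hypothetical.

```python
import os
import tempfile


def load_words(path):
    """Read one word per line into a set, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Write a small, made-up dictionary to a temporary file for the demo.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dict.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("我們\n在\n野生\n動物園\n野生動物園\n玩\n\n")
    words = load_words(path)

# Use the longest dictionary entry as the matching window,
# falling back to 1 if the dictionary is empty.
max_len = max((len(w) for w in words), default=1)
print(sorted(words), max_len)
```

Deriving max_len this way guarantees that every dictionary word can be matched. A smaller fixed value shortens the inner loop but silently excludes longer words.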

Let's learn NLP together, just practicing for fun!
