Chinese Word Segmentation Algorithms in Natural Language Processing

This article is a repost. 2018-09-13 13:21 · NLP

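Chinese word segmentation algorithms generally fall into three categories:

1. Dictionary-based segmentation

Forward maximum matching (FMM)
Backward maximum matching (BMM)
Bidirectional maximum matching (BM)

2. Segmentation based on statistical models: segmentation with an N-gram language model

3. Segmentation based on sequence labeling

HMM-based
CRF-based
End-to-end segmentation based on deep learning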

The rest of this article covers the three dictionary-based algorithms.

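I. Forward maximum matching

Concept: scan the text from left to right and greedily cut off the longest word that starts at the current position. The method requires a dictionary. The rationale is that the coarser the granularity of a word (that is, the longer the word), the more precisely it conveys meaning.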

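Steps:

1. Starting at the beginning of the string, take a fragment of the maximum word length. If the remaining sequence is shorter than the maximum word length, take the whole sequence.
2. Check whether the fragment is in the dictionary. If it is, emit it as a word. If it is not, drop one character from the right end and check the shorter fragment, repeating until only a single character remains, which is emitted as a word on its own.
3. The sequence is now whatever remains after the word cut off in step 2. Repeat from step 1 until the sequence is empty.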
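The steps above can be traced on a small example. The following is a minimal sketch, not a reference implementation: the function name `fmm_trace`, the toy dictionary, and the maximum word length of 3 are all assumptions chosen for illustration. Besides the resulting words, it records every candidate fragment that was checked, so the shrinking window of step 2 is visible.

```python
def fmm_trace(sentence, dictionary, max_len):
    """Forward maximum matching that also records every candidate tried."""
    tokens, tried = [], []
    pos = 0
    while pos < len(sentence):
        # Step 1: take a window of at most max_len characters.
        size = min(max_len, len(sentence) - pos)
        # Step 2: shrink the window from the right until it is a
        # dictionary word or only a single character remains.
        while size > 1 and sentence[pos:pos + size] not in dictionary:
            tried.append(sentence[pos:pos + size])
            size -= 1
        tried.append(sentence[pos:pos + size])
        tokens.append(sentence[pos:pos + size])
        # Step 3: continue with the rest of the sequence.
        pos += size
    return tokens, tried


# Toy dictionary, chosen for illustration only (hypothetical data).
toy_dict = {"研究", "研究生", "生命", "起源"}
tokens, tried = fmm_trace("研究生命的起源", toy_dict, max_len=3)
print("/".join(tokens))  # 研究生/命/的/起源
print(tried)
```

In this run the first window "研究生" matches immediately, so the algorithm never considers the shorter word "研究"; the next window "命的起" has to shrink twice before the single character "命" is emitted.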

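# Chinese word segmentation with forward maximum matching (FMM)
words_dict = []  # holds the loaded dictionary

def init():
    '''
    Read the dictionary file (one word per line)
    and load it into words_dict.
    :return: None
    '''

    with open("dic/dic.txt","r",encoding="utf8") as dict_input:
        for word in dict_input:
            words_dict.append(word.strip())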

# Cutting routine for forward maximum matching
def cut_words(raw_sentence,words_dict):
    # Length of the longest word in the dictionary
    max_length = max(len(word) for word in words_dict)
    # Strip surrounding whitespace so it is not emitted as words
    sentence = raw_sentence.strip()
    # Number of characters still to be segmented
    word_length = len(sentence)
    # Words cut so far
    cut_word_list = []
    while word_length > 0:
        # Window size: the smaller of the longest dictionary word and the remaining text
        max_cut_length = min(max_length,word_length)
        subSentence = sentence[0:max_cut_length]
        while max_cut_length > 0:
            if subSentence in words_dict:  # the fragment is in the dictionary, so it is the longest match here
                cut_word_list.append(subSentence)
                break
            elif max_cut_length == 1:  # a single character is emitted as a word on its own
                cut_word_list.append(subSentence)
                break
            else:  # not in the dictionary and longer than one character: shorten the match by 1
                max_cut_length = max_cut_length -1
                subSentence = subSentence[0:max_cut_length]  # drop the rightmost character
        sentence = sentence[max_cut_length:]  # remove the matched word and continue with the rest
        word_length = word_length - max_cut_length
    # words = "/".join(cut_word_list)
    return cut_word_list

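def main():
    """
    Interactive loop: read a line, segment it, print the result.
    An empty line ends the loop.
    :return: None
    """
    init()
    while True:
        print("Enter the text to segment:")
        input_str = input()
        if not input_str:
            break
        result = cut_words(input_str,words_dict)
        print("Segmentation result:",result)

if __name__ == '__main__':
    main()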

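II. Backward maximum matching

BMM works like FMM, except that the text is scanned from right to left.

However, because each of them commits greedily to the first match in its scan direction, both BMM and FMM are weak at resolving ambiguous segmentations.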
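The ambiguity problem is easiest to see when the two scan directions disagree on the same input. The sketch below is a compact illustration under stated assumptions: the helper names `fmm` and `bmm`, the toy dictionary, and the maximum word length of 3 are hypothetical and chosen only so that the example is self-contained.

```python
def fmm(sentence, dictionary, max_len):
    """Forward maximum matching: scan left to right."""
    tokens = []
    while sentence:
        size = min(max_len, len(sentence))
        # Shrink the prefix until it is a word or a single character.
        while size > 1 and sentence[:size] not in dictionary:
            size -= 1
        tokens.append(sentence[:size])
        sentence = sentence[size:]
    return tokens


def bmm(sentence, dictionary, max_len):
    """Backward maximum matching: scan right to left."""
    tokens = []
    while sentence:
        size = min(max_len, len(sentence))
        # Shrink the suffix until it is a word or a single character.
        while size > 1 and sentence[-size:] not in dictionary:
            size -= 1
        tokens.append(sentence[-size:])
        sentence = sentence[:-size]
    tokens.reverse()  # tokens were collected from the end of the sentence
    return tokens


# Toy dictionary, chosen for illustration only (hypothetical data).
toy_dict = {"研究", "研究生", "生命", "起源"}
text = "研究生命的起源"
forward = fmm(text, toy_dict, 3)
backward = bmm(text, toy_dict, 3)
print("/".join(forward))   # 研究生/命/的/起源
print("/".join(backward))  # 研究/生命/的/起源
```

Here the forward scan grabs "研究生" (graduate student) and leaves "命" stranded, while the backward scan happens to recover "研究/生命" (study / life). Neither direction is guaranteed to be right in general; the example only shows that a greedy match in one direction cannot look ahead to resolve an overlap.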

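# Chinese word segmentation with backward maximum matching (BMM)
words_dict = []  # holds the loaded dictionary

def init():
    """
    Read the dictionary file (one word per line)
    and load it into words_dict.
    :return: None
    """
    with open("dict/dic.txt","r",encoding="utf8") as dic_input:
        for word in dic_input:
            words_dict.append(word.strip())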

# Cutting routine for backward maximum matching
def cut_words(raw_sentence,words_dict):
    # Length of the longest word in the dictionary
    max_length = max(len(word) for word in words_dict)
    sentence = raw_sentence.strip()
    # Number of characters still to be segmented
    words_length = len(sentence)
    # Words cut so far
    cut_word_list = []
    # Keep cutting while there is text left
    while words_length > 0:
        # Window size: the smaller of the longest dictionary word and the remaining text
        max_cut_length = min(max_length, words_length)
        subSentence = sentence[-max_cut_length:]  # take the last max_cut_length characters
        while max_cut_length > 0:
            if subSentence in words_dict:
                cut_word_list.append(subSentence)
                break
            elif max_cut_length == 1:  # a single character is emitted as a word on its own
                cut_word_list.append(subSentence)
                break
            else:
                max_cut_length = max_cut_length -1
                subSentence = subSentence[-max_cut_length:]  # drop the leftmost character
        sentence = sentence[0:-max_cut_length]  # remove the matched word from the end
        words_length = words_length - max_cut_length
    cut_word_list.reverse()  # words were collected from right to left, so restore reading order
    # words = "/".join(cut_word_list)
    return  cut_word_list

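def main():
    """
    Interactive loop: read a line, segment it, print the result.
    An empty line ends the loop.
    :return: None
    """
    init()
    while True:
        print("Enter the text to segment:")
        input_str = input()
        if not input_str:
            break
        result = cut_words(input_str,words_dict)
        print("Segmentation result:",result)

if __name__ == '__main__':
    main()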

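III. Bidirectional maximum matching

BM runs both FMM and BMM on the same input, compares the two results, and uses heuristic rules to pick the more plausible segmentation.

Heuristic rules:

If the forward and backward results contain different numbers of words, take the one with fewer words.
If they contain the same number of words:

If the two results are identical, there is no ambiguity and either one can be returned.
If the two results differ, return the one with fewer single-character words.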
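The rules above can be written as a small function over two already-computed word lists, which makes each case easy to check in isolation. This is a sketch under assumptions: the name `choose_segmentation` is hypothetical, the sample word lists are made up for illustration, and when both results have the same number of single-character words the backward result is returned, which is one possible tie-break that the rules themselves do not specify.

```python
def choose_segmentation(fmm_tokens, bmm_tokens):
    """Pick one of two segmentations using the heuristic rules."""
    # Rule 1: different word counts -> the result with fewer words wins.
    if len(fmm_tokens) != len(bmm_tokens):
        return min(fmm_tokens, bmm_tokens, key=len)
    # Rule 2a: identical results -> no ambiguity, return either one.
    if fmm_tokens == bmm_tokens:
        return fmm_tokens
    # Rule 2b: same count but different words -> fewer single-character words wins.
    fmm_single = sum(1 for token in fmm_tokens if len(token) == 1)
    bmm_single = sum(1 for token in bmm_tokens if len(token) == 1)
    # Assumption: on a tie the backward result is returned.
    return fmm_tokens if fmm_single < bmm_single else bmm_tokens


# Hypothetical inputs: two segmentations of the same sentence.
forward = ["研究生", "命", "的", "起源"]   # two single-character words
backward = ["研究", "生命", "的", "起源"]  # one single-character word
chosen = choose_segmentation(forward, backward)
print("/".join(chosen))  # 研究/生命/的/起源
```

Both lists have four words, so rule 1 does not apply; they differ, so the single-character count decides, and the backward result wins with one single-character word against two.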

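# Assumption: the two listings above are saved as BMM.py and FMM.py next to this file
import BMM,FMM
# Chinese word segmentation with bidirectional maximum matching (BM)
words_dict = []  # holds the loaded dictionary

def init():
    """
    Read the dictionary file (one word per line)
    and load it into words_dict.
    :return: None
    """
    with open("dict/dic.txt","r",encoding="utf8") as dic_input:
        for word in dic_input:
            words_dict.append(word.strip())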

# Cutting routine for bidirectional maximum matching
def cut_words(raw_sentence,words_dict):
    bmm_word_list = BMM.cut_words(raw_sentence,words_dict)
    fmm_word_list = FMM.cut_words(raw_sentence,words_dict)
    bmm_word_list_size = len(bmm_word_list)
    fmm_word_list_size = len(fmm_word_list)
    if bmm_word_list_size != fmm_word_list_size:
        # Different word counts: return the result with fewer words
        if bmm_word_list_size < fmm_word_list_size:
            return bmm_word_list
        else:
            return fmm_word_list
    else:
        # Same word count: the results are the same only if they match word by word, in order
        isSame = fmm_word_list == bmm_word_list
        # Count the single-character words in each result
        FSingle = sum(1 for word in fmm_word_list if len(word) == 1)
        BSingle = sum(1 for word in bmm_word_list if len(word) == 1)
        if isSame:
            return fmm_word_list
        elif BSingle > FSingle:  # the forward result has fewer single-character words
            return fmm_word_list
        else:
            return bmm_word_list


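def main():
    """
    Interactive loop: read a line, segment it, print the result.
    An empty line ends the loop.
    :return: None
    """
    init()
    while True:
        print("Enter the text to segment:")
        input_str = input()
        if not input_str:
            break
        result = cut_words(input_str,words_dict)
        print("Segmentation result:",result)

if __name__ == '__main__':
    main()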
