Chinese Text Sentence Segmentation

Reposted article. 2019-10-15 19:05. Category: Natural Language Processing / Python Learning

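Sentence segmentation can be as simple or as complicated as you want it to be. Most natural language processing tasks are not strict about it, and splitting on sentence-final punctuation is usually enough. Some industries that work specifically on text-related projects may have higher requirements, and getting segmentation 100% correct means taking much of the language's own grammar into account. What follows is a middle-of-the-road implementation. Take a passage from 《背影》 as an example: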

我心里暗笑他的迂；他們只認得錢，托他們只是白托!而且我這樣大年紀的人，難道還不能料理自己么？唉，我現在想想，那時真是太聰明了!
我說道：“爸爸，你走吧。”他往車外看了看說：“我買幾個橘子去。你就在此地，不要走動。”我看那邊月台的柵欄外有幾個賣東西的等着顧客。走到那邊月台，須穿過鐵道，須跳下去又爬上去。
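
As a baseline, splitting on sentence-final punctuation alone can be sketched as below. This is a minimal sketch for illustration and not part of the original article; the name `naive_split` is hypothetical, and only full-width marks are assumed as delimiters. It keeps each mark attached to its sentence, and it shows why dialogue needs extra handling: the closing quotation mark lands at the start of the following piece.

```python
import re

def naive_split(paragraph):
    """Split on full-width sentence-final punctuation, keeping the marks.

    Hypothetical baseline for illustration only.
    """
    # A capturing group makes re.split return the delimiters as well:
    # [text, mark, text, mark, ..., text]
    parts = re.split(r"(……|[。？！])", paragraph)
    parts.append("")  # pad so the trailing text has a (possibly empty) partner
    pieces = [text + mark for text, mark in zip(parts[0::2], parts[1::2])]
    return [p.strip() for p in pieces if p.strip()]

sample = "我說道：“爸爸，你走吧。”他往車外看了看說：“我買幾個橘子去。”"
for piece in naive_split(sample):
    print(piece)
```

On this sample the baseline returns three pieces, and the second one begins with the stray ” that belongs to the first speaker's line.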

Python implementation:

import re

def __merge_symmetry(sentences, symmetry=('“', '”')):
    '''Merge fragments enclosed by a pair of symmetric symbols, such as double quotation marks.'''
    effective_ = []
    merged = True  # False while an opening symbol is still waiting for its close
    for sentence in sentences:
        has_open = symmetry[0] in sentence
        has_close = symmetry[1] in sentence
        if has_open and not has_close:
            # A quotation opens here without closing: start a new entry.
            merged = False
            effective_.append(sentence)
        elif has_close and not merged:
            # The pending quotation closes here: attach to the open entry.
            merged = True
            effective_[-1] += sentence
        elif not has_open and not has_close and not merged:
            # Still inside the pending quotation.
            effective_[-1] += sentence
        else:
            effective_.append(sentence)

    return [i.strip() for i in effective_ if len(i.strip()) > 0]

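def to_sentences(paragraph):
    """Split a paragraph into sentences."""
    # The capturing group makes re.split keep each delimiter in the result.
    sentences = re.split(r"(？|。|！|……)", paragraph)
    sentences.append("")  # pad so the last text piece has a delimiter partner
    # Re-attach each delimiter to the text that precedes it.
    sentences = ["".join(i) for i in zip(sentences[0::2], sentences[1::2])]
    sentences = [i.strip() for i in sentences if len(i.strip()) > 0]

    # A closing quotation mark left at the start of a piece belongs to the previous one.
    for j in range(1, len(sentences)):
        if sentences[j][0] == '”':
            sentences[j-1] = sentences[j-1] + '”'
            sentences[j] = sentences[j][1:]

    return __merge_symmetry(sentences)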

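The main considerations are that each sentence keeps its terminal punctuation after splitting, and that quoted speech stays intact when characters are in dialogue. Note that the pattern lists only the full-width marks ？, 。, ！ and ……, so the half-width ! in the sample does not end a sentence: 白托! stays inside the first sentence, and the break after 聰明了! comes from the line break in the input. Segmentation result: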

我心里暗笑他的迂；他們只認得錢，托他們只是白托!而且我這樣大年紀的人，難道還不能料理自己么？
唉，我現在想想，那時真是太聰明了!
我說道：“爸爸，你走吧。”
他往車外看了看說：“我買幾個橘子去。你就在此地，不要走動。”
我看那邊月台的柵欄外有幾個賣東西的等着顧客。
走到那邊月台，須穿過鐵道，須跳下去又爬上去。
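
The two requirements above (keep the terminal punctuation, keep quoted speech whole) can also be sketched as a single left-to-right scan that tracks whether it is currently inside a quotation. This is an alternative sketch under stated assumptions, not the article's method: the name `split_keep_quotes`, its parameters, and the choice to treat … as a sentence ender are all hypothetical.

```python
def split_keep_quotes(paragraph, enders="。？！…", quotes=("“", "”")):
    """Split into sentences in one pass, never breaking inside a quotation.

    Hypothetical alternative sketch; not the article's implementation.
    """
    open_q, close_q = quotes
    ender_set = set(enders)
    sentences, current = [], []
    depth = 0  # number of quotations currently open

    def flush():
        # Emit the accumulated characters as one sentence, if non-empty.
        text = "".join(current).strip()
        if text:
            sentences.append(text)
        current.clear()

    for i, ch in enumerate(paragraph):
        current.append(ch)
        nxt = paragraph[i + 1] if i + 1 < len(paragraph) else ""
        if ch == open_q:
            depth += 1
        elif ch == close_q:
            depth = max(depth - 1, 0)
            # A closing quote right after a terminal mark ends the sentence.
            if depth == 0 and i > 0 and paragraph[i - 1] in ender_set:
                flush()
        elif ch in ender_set and depth == 0 and nxt not in ender_set:
            # Outside quotations, split after the last mark of a run ("……").
            flush()
    flush()  # keep any trailing text that has no terminal mark
    return sentences

for s in split_keep_quotes("我說道：“爸爸，你走吧。”他往車外看了看說：“我買幾個橘子去。你就在此地，不要走動。”"):
    print(s)
```

One limitation of this sketch: an opening quotation mark that is never closed keeps `depth` above zero, so the rest of the text is returned as a single sentence.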
