Training a Part-of-Speech Tagging Model with spaCy

This article is a repost. 2021-04-27 17:59. Tags: spaCy / Python / NLP / machine learning / natural language processing

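Part-of-speech (POS) tagging is the process of labeling each word in an input text with its part of speech. A word's tag helps predict the part of speech of the word that follows, and tagging lays the groundwork for tasks such as syntactic parsing and information extraction. Common algorithms for POS tagging include hidden Markov models (HMMs) and deep learning models (RNNs, LSTMs, and so on). Chinese poses particular difficulties: the language has little morphological inflection, so word forms give no direct cue for the part of speech; many common words belong to more than one category; and annotation conventions differ from one researcher to another.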
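To make the multi-category problem concrete, here is a minimal sketch of a context-free baseline that gives every word the tag it carries most often in a tagged sample. The sample data, the tag names, and the helpers `build_lexicon`, `most_frequent_tag` and `ambiguous_words` are all made up for illustration; they are not part of spaCy or of the original article.

```python
from collections import Counter, defaultdict

# Hypothetical hand-made sample of (word, tag) pairs. "研究" occurs both as
# a verb and as a noun, which is the multi-category problem described above.
TAGGED_WORDS = [
    ("我", "PRON"), ("研究", "VERB"), ("語言", "NOUN"),
    ("研究", "NOUN"), ("結果", "NOUN"),
    ("研究", "NOUN"), ("方法", "NOUN"),
]


def build_lexicon(tagged_words):
    """Count how often each word carries each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return counts


def most_frequent_tag(counts, word, default="NOUN"):
    """Context-free baseline: return the tag seen most often for `word`."""
    if word not in counts:
        return default  # unseen word: fall back to an assumed default tag
    return counts[word].most_common(1)[0][0]


def ambiguous_words(counts):
    """Return the words that were observed with more than one tag."""
    return sorted(word for word, tags in counts.items() if len(tags) > 1)


counts = build_lexicon(TAGGED_WORDS)
print(ambiguous_words(counts))
print(most_frequent_tag(counts, "研究"))
```

Because this baseline ignores context, it labels every occurrence of 研究 as a noun, including the one used as a verb. Taggers based on HMMs or neural networks reduce this kind of error by taking neighbouring words into account.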
This article shows how to perform POS tagging in Python and how to train a Chinese POS tagging model with spaCy, in three steps:

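1. Tagging the training text with parts of speech

First, take the given training data.

Process it with the spaCy nlp pipeline: initialize a tag list and a text string, join the segmented words with "/", and store each word's POS tag in the tag list. The code is as follows: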

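import spacy


def train_data(train_path):
    # Tag the raw training text with the pretrained Chinese pipeline
    nlp = spacy.load('zh_core_web_sm')
    with open(train_path, "r", encoding="utf8") as f:
        # Drop trailing newlines and skip empty lines
        train_list = [line.strip() for line in f if line.strip()]

    result = []      # list of (text, {'tags': [...]}) training pairs
    train_dict = {}  # maps each short tag to its full coarse POS name
    for line in train_list:
        doc = nlp(line)
        label = []
        text = ""
        for token in doc:
            # Join the segmented words with "/"
            text += token.text + "/"
            # The tag is the first letter of the coarse POS. Note that this
            # merges POS names sharing an initial (e.g. NOUN and NUM -> "N"),
            # and train_dict keeps only the last POS seen for each letter.
            label.append(token.pos_[0])
            train_dict[token.pos_[0]] = {"pos": token.pos_}
        # text[:-1] removes the trailing "/"
        result.append((text[:-1], {'tags': label}))
    return result, train_dict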

This produces roughly the following result:
(Figure: training dataset)
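One consequence of using only the first letter of each POS name as the label is that several parts of speech end up sharing a tag. The sketch below checks this against the 17 Universal POS names that spaCy reports in token.pos_ (spaCy additionally uses SPACE for whitespace tokens). The helpers `group_by_initial` and `collisions` are hypothetical names introduced for this illustration.

```python
from collections import defaultdict

# The 17 Universal POS tag names, as reported by spaCy in token.pos_
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]


def group_by_initial(tags):
    """Group tag names by their first letter, as pos_[0] does."""
    groups = defaultdict(list)
    for tag in tags:
        groups[tag[0]].append(tag)
    return dict(groups)


def collisions(tags):
    """Return only the initials that are shared by more than one tag."""
    return {k: v for k, v in group_by_initial(tags).items() if len(v) > 1}


clashes = collisions(UPOS_TAGS)
for letter, names in sorted(clashes.items()):
    print(letter, names)
```

If this merging is not intended, one option is to use the full token.pos_ string as the label instead of its first letter; whether that suits the task depends on how fine-grained the tag set needs to be.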

2. Training the model with spaCy

Next, train the model:

import random
from pathlib import Path

import plac
import spacy
from spacy.training import Example


@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(lang='zh', output_dir=None, n_iter=25):
    # result and train_dict are the values returned by train_data() in step 1
    nlp = spacy.blank(lang)  # create a blank pipeline ('zh' means Chinese)
    tagger = nlp.add_pipe('tagger')
    # Add the tags. This needs to be done before you start training.
    for tag, values in train_dict.items():
        print("tag:", tag)
        print("values:", values)
        tagger.add_label(tag)

    # Initialize the model weights (spaCy v3 name for begin_training)
    optimizer = nlp.initialize()
    for i in range(n_iter):
        random.shuffle(result)  # shuffle the training pairs each iteration
        losses = {}
        for text, annotations in result:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print(losses)

When run, the script prints the accumulated losses dictionary once per iteration.
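The training loop updates the model with one example at a time. A common alternative is to group the shuffled pairs into small batches and make one update call per batch. The sketch below shows only the batching step in plain Python; `minibatches` is a hypothetical helper written for this illustration (spaCy ships its own `spacy.util.minibatch`), and the `pairs` list is a made-up stand-in for the (text, annotations) pairs built in step 1.

```python
import random


def minibatches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical stand-in for the (text, annotations) pairs from step 1
pairs = [(f"text{i}", {"tags": ["N"]}) for i in range(7)]

random.seed(0)         # fixed seed so the shuffle is repeatable
random.shuffle(pairs)  # shuffle once per iteration, as in the loop above
batches = list(minibatches(pairs, size=3))
print([len(batch) for batch in batches])  # 7 items in batches of 3 -> [3, 3, 1]
```

In the training loop, each batch would be converted to a list of Example objects and passed to a single nlp.update call, since nlp.update accepts a list of examples.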

3. Validating the model on the test set

Finally, process the test data in the same way:
(Figure: test dataset)
The code is as follows:

    # (continuation of main())
    test_path = r"E:\1\Study\大三下\自然語言處理\第五章作業\test.txt"
    with open(test_path, "r", encoding="utf8") as f:
        test_list = [line.strip() for line in f if line.strip()]

    test_text = ""
    for z in test_list:
        txt = nlp(z)
        # Rebuild the "/"-joined form; after the loop this holds the last line
        test_text = "/".join(word.text for word in txt)
        print('test_data:', [(t.text, t.tag_, t.pos_) for t in txt])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])


if __name__ == '__main__':
    plac.call(main)

The validation results are as follows:
(Figure: test dataset)
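The validation step prints the predicted tags for inspection. If reference tags for the test sentences are available (an assumption; the article does not show any), the predictions can also be scored. The following is a minimal sketch of token-level accuracy; `tag_accuracy` and the two sample tag lists are hypothetical and only illustrate the calculation.

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Return the fraction of positions where the predicted tag matches."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        return 0.0  # convention chosen here for empty input
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Made-up tags for a four-token sentence
gold = ["N", "V", "N", "P"]
pred = ["N", "V", "A", "P"]
print(tag_accuracy(gold, pred))  # 3 of 4 tags match -> 0.75
```

In practice the predicted list would be collected as [t.tag_ for t in doc] from the trained pipeline, and both sequences must come from the same tokenization for a position-by-position comparison to be meaningful.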
