HanLP Study Notes 1: Part-of-Speech Tagging (Corpus Construction)

Reposted article, 2020-01-21 10:24. Category: Natural Language Processing

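Preface: these notes follow Chapter 7, "Part-of-Speech Tagging", of Introduction to Natural Language Processing (自然語言處理入門) by He Han (何晗).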

1. Concepts:

Word segmentation corpus, part-of-speech (POS) tagging corpus, tag set
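To make the three concepts concrete, here is a minimal sketch that assumes a POS-tagged corpus stores each sentence as space-separated word/tag tokens. The sample sentence, its tags, and the helper name `parse_tagged_line` are illustrative assumptions, not taken from the corpus files used later. Dropping the tags gives what a segmentation corpus records, and collecting the distinct tags gives the tag set.

```python
from collections import Counter


def parse_tagged_line(line):
    """Split a line of space-separated word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains "/" keeps its text intact.
        word, _, tag = token.rpartition("/")
        if not word or not tag:
            continue  # skip tokens that are not in word/tag form
        pairs.append((word, tag))
    return pairs


# Hypothetical sample sentence written in word/tag form for illustration.
tagged_line = "他/r 的/u 希望/n 是/v 希望/v 上學/v"

pairs = parse_tagged_line(tagged_line)
words = [word for word, _ in pairs]           # what a segmentation corpus records
tag_set = sorted({tag for _, tag in pairs})   # the tag set used by this sample
tag_counts = Counter(tag for _, tag in pairs)

print(words)
print(tag_set)
```

Note that the same word ("希望") appears once as a noun and once as a verb, which is exactly the ambiguity a POS tagger has to resolve from context.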

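2. Pipeline:

In engineering practice, a segmenter is usually trained on a large word segmentation corpus and then flexibly combined with a POS tagging model trained on a small POS tagging corpus, forming a heterogeneous, pipeline-style lexical analyzer.

In other words, the segmenter and the POS tagging model are trained separately, and the segmenter's output is then fed to the POS tagging model for tagging.

The material used to train the segmenter differs from the material used to train the POS tagging model.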
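The idea of a heterogeneous pipeline can be sketched in plain Python. This is a toy illustration only: it swaps in forward maximum matching for the segmenter and a most-frequent-tag lookup for the tagger instead of the perceptron models HanLP uses, and both corpora, all function names, and the default tag are assumptions made up for the example.

```python
from collections import Counter, defaultdict


def train_segmenter(segmented_sentences):
    """Collect a vocabulary from a segmentation corpus (lists of words)."""
    return {word for sentence in segmented_sentences for word in sentence}


def segment(text, vocab):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    max_len = max((len(w) for w in vocab), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the vocabulary matches.
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


def train_tagger(tagged_sentences):
    """Remember the most frequent tag of each word in a POS-tagged corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default_tag="n"):
    """Tag each word; unseen words get default_tag (an arbitrary choice here)."""
    return [(word, model.get(word, default_tag)) for word in words]


# Two different (hypothetical) corpora: a larger one with word boundaries only,
# and a smaller one that also carries tags.
seg_corpus = [["他", "的", "希望"], ["希望", "上學"], ["是"]]
pos_corpus = [[("他", "r"), ("的", "u"), ("希望", "n"), ("是", "v")]]

vocab = train_segmenter(seg_corpus)
pos_model = train_tagger(pos_corpus)

words = segment("他的希望是上學", vocab)  # step a: segmentation
result = tag(words, pos_model)            # step b: POS tagging
print(result)
```

Because the two corpora differ, the segmenter knows the word "上學" while the tagger has never seen it and falls back to the default tag. A real tagger handles such words with context features rather than a fixed default, but the separation of the two training sources is the same.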

3. Code walkthrough:

POS tagging workflow

a. Word segmentation

b. POS tagging

from pyhanlp import *
import zipfile
import os

from pyhanlp.static import download, remove_file, HANLP_DATA_PATH


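def test_data_path():
    """
    Return the test data path, located at $root/data/test; the root directory is set in the configuration file.
    :return: path of the test data directory
    """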
    data_path = os.path.join(HANLP_DATA_PATH, 'test')
    if not os.path.isdir(data_path):
        os.makedirs(data_path, exist_ok=True)  # also creates missing parent directories
    return data_path


def ensure_data(data_name, data_url):
    root_path = test_data_path()
    dest_path = os.path.join(root_path, data_name)
    if os.path.exists(dest_path):
        return dest_path
    if data_url.endswith('.zip'):
        dest_path += '.zip'
    download(data_url, dest_path)
    if data_url.endswith('.zip'):
        with zipfile.ZipFile(dest_path, "r") as archive:
            archive.extractall(root_path)
        remove_file(dest_path)
        dest_path = dest_path[:-len('.zip')]
    return dest_path


PKU98 = ensure_data("pku98", "http://file.hankcs.com/corpus/pku98.zip")
PKU199801 = os.path.join(PKU98, '199801.txt')
PKU199801_TRAIN = os.path.join(PKU98, '199801-train.txt')
PKU199801_TEST = os.path.join(PKU98, '199801-test.txt')
POS_MODEL = os.path.join('C:\\Users\\Administrator\\Desktop\\cx', 'pos.bin')  # path where the trained POS model is saved
NER_MODEL = os.path.join('C:\\Users\\Administrator\\Desktop\\cx', 'ner.bin')
ZHUXIAN = os.path.join(ensure_data("zhuxian", "http://file.hankcs.com/corpus/zhuxian.zip"), "train.txt")
POSTrainer = JClass('com.hankcs.hanlp.model.perceptron.POSTrainer')
PerceptronSegmenter = JClass('com.hankcs.hanlp.model.perceptron.PerceptronSegmenter')
AbstractLexicalAnalyzer = JClass('com.hankcs.hanlp.tokenizer.lexical.AbstractLexicalAnalyzer')
PerceptronPOSTagger = JClass('com.hankcs.hanlp.model.perceptron.PerceptronPOSTagger')

def train_perceptron_pos(corpus):
    trainer = POSTrainer()
    trainer.train(corpus, POS_MODEL)  # train the tagger and save the model to POS_MODEL

    tagger = PerceptronPOSTagger(POS_MODEL)  # load the saved model file
    print(', '.join(tagger.tag("他", "的", "希望", "是", "希望", "上學")))  # predict
    analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), tagger)  # build the lexical analyzer
    print(analyzer.analyze("李狗蛋的希望是希望上學"))  # segmentation + POS tagging
    return tagger


posTagger = train_perceptron_pos(ZHUXIAN)  # train
analyzer = AbstractLexicalAnalyzer(PerceptronSegmenter(), posTagger)  # wrap segmenter and tagger together
print(analyzer.analyze("陸雪琪的天琊神劍不做絲毫退避，直沖而上，瞬間，這兩道奇光異寶撞到了一起。"))  # segmentation + tagging
