A Look at Open-Source Chinese Word Segmenters (6): Stanford CoreNLP

Reposted article, 2018-02-07 21:24. Category: Natural Language Processing

CoreNLP is an open-source Java NLP toolkit from Stanford University. It provides features such as part-of-speech (POS) tagging, named entity recognition (NER), and sentiment analysis.

1. Introduction

CoreNLP's Chinese word segmentation is based on a CRF model:

\[P_w(y|x) = \frac{\exp \left( \sum_i w_i f_i(x,y) \right)}{Z_w(x)} \]

Here \(Z_w(x)\) is the normalization factor, \(w\) is the vector of model parameters, and \(f_i(x,y)\) are the feature functions.
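
To make the formula concrete, here is a minimal sketch that normalizes a handful of candidate scores into probabilities. The class name, method name, and scores are made up for illustration; a real CRF computes \(Z_w(x)\) with dynamic programming rather than by enumerating every labeling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CrfNormalizationSketch {

    /**
     * Turns raw scores sum_i w_i * f_i(x, y), one per candidate labeling y,
     * into probabilities P_w(y|x) = exp(score(y)) / Z_w(x).
     */
    static Map<String, Double> probabilities(Map<String, Double> scores) {
        // Shift by the maximum score so that exp() cannot overflow;
        // the shift cancels between numerator and denominator.
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores.values()) {
            max = Math.max(max, s);
        }
        double z = 0.0; // Z_w(x), up to the common factor exp(-max)
        for (double s : scores.values()) {
            z += Math.exp(s - max);
        }
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            probs.put(e.getKey(), Math.exp(e.getValue() - max) / z);
        }
        return probs;
    }

    public static void main(String[] args) {
        // Hypothetical scores for the four possible labelings of a two-character input.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("00", 1.5);
        scores.put("01", 2.0);
        scores.put("10", -0.5);
        scores.put("11", 0.0);
        System.out.println(probabilities(scores));
    }
}
```

The printed values sum to 1, and the labeling with the highest score gets the highest probability, which is why decoding only needs to compare scores.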

2. Breakdown

The source analysis below is based on version 3.7.0; for a segmentation example, see the SegDemo class.

Model

There are two main model files. The first is the dictionary file dict-chris6.ser.gz:

// dict-chris6.ser.gz deserializes to a dictionary stored as an array of 7 Sets
// Word counts per Set: 0+7323+125336+142252+82139+26907+39243 (423200 in total)
ChineseDictionary::loadDictionary(String serializePath) {
    Set<String>[] dict = new HashSet[MAX_LEXICON_LENGTH + 1];
    for (int i = 0; i <= MAX_LEXICON_LENGTH; i++) {
        dict[i] = Generics.newHashSet();
    }
    dict = IOUtils.readObjectFromURLOrClasspathOrFileSystem(serializePath);
    return dict;
}

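The array index is the word length: set 0 contains no words, set 1 contains words of length 1, and so on up to set 6, which contains entries of length 6. The entries in set 6 are partial words (fragments), for example “《雙峰》（電”, “８０年國家領”, and “１８２４年英”.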
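
The length-indexed layout can be sketched as follows. This is not CoreNLP's ChineseDictionary; the class and method names are hypothetical, and MAX_LEXICON_LENGTH = 6 is inferred from the 7-element array described above.

```java
import java.util.HashSet;
import java.util.Set;

final class LengthIndexedDictionary {

    // Assumed value: an array of 7 sets implies a maximum indexed length of 6.
    static final int MAX_LEXICON_LENGTH = 6;

    private final Set<String>[] sets;

    @SuppressWarnings("unchecked")
    LengthIndexedDictionary() {
        sets = new HashSet[MAX_LEXICON_LENGTH + 1];
        for (int i = 0; i <= MAX_LEXICON_LENGTH; i++) {
            sets[i] = new HashSet<>();
        }
    }

    /** Stores the word in the set for its length; words that do not fit are skipped. */
    void add(String word) {
        int len = word.length(); // UTF-16 code units, fine for BMP characters
        if (len >= 1 && len <= MAX_LEXICON_LENGTH) {
            sets[len].add(word);
        }
    }

    /** Looks only in the set that matches the candidate's length. */
    boolean contains(String word) {
        int len = word.length();
        return len >= 1 && len <= MAX_LEXICON_LENGTH && sets[len].contains(word);
    }

    /** Number of entries with exactly the given length. */
    int size(int length) {
        return sets[length].size();
    }

    public static void main(String[] args) {
        LengthIndexedDictionary dict = new LengthIndexedDictionary();
        dict.add("ab");
        dict.add("abc");
        System.out.println(dict.contains("ab") + " " + dict.size(0) + " " + dict.size(2));
    }
}
```

Indexing by length means a lookup for a candidate of length n touches only set n, which suits feature extraction that asks whether a span of a fixed length is a known word.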

The second is the trained CRF model file ctb.gz:

CRFClassifier::loadClassifier(ObjectInputStream ois, Properties props) {
    Object o = ois.readObject();
    if (o instanceof List) {
        labelIndices = (List<Index<CRFLabel>>) o; // label indices
    }
    classIndex = (Index<String>) ois.readObject(); // sequence-labeling labels
    featureIndex = (Index<String>) ois.readObject(); // features
    flags = (SeqClassifierFlags) ois.readObject(); // model configuration

    Object featureFactory = ois.readObject(); // feature template, used to generate features
    // ... (an earlier if-branch is elided in this excerpt)
    else if (featureFactory instanceof FeatureFactory) {
        featureFactories = Generics.newArrayList();
        featureFactories.add((FeatureFactory<IN>) featureFactory);
    }

    windowSize = ois.readInt(); // window size is 2
    weights = (double[][]) ois.readObject(); // weights for each (feature, label) pair

    Set<String> lcWords = (Set<String>) ois.readObject(); // this Set is empty
    // ... (an earlier if-branch is elided in this excerpt)
    else {
        knownLCWords = new MaxSizeConcurrentHashSet<>(lcWords);
    }

    reinit();
}

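Other segmenters typically use four labels (B, M, E, S). CoreNLP's Chinese segmenter uses only two: "1" means the current character joins the previous character in the same word, and "0" means the current character starts a new word, in other words the previous character ends the preceding word.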

class CRFClassifier {
    classIndex: class edu.stanford.nlp.util.HashIndex
      ["1","0"]
}

// The annotation class that holds the Chinese segmentation label
public static class AnswerAnnotation implements CoreAnnotation<String>{}
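
Turning a two-label sequence back into words can be sketched as below. The sketch follows the label convention exactly as stated in the text above ("0" starts a new word, "1" continues the current one); the class name is hypothetical and Latin letters stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.List;

final class LabelDecoder {

    /**
     * Splits chars into words given one label per character.
     * Convention assumed from the text above: '0' starts a new word,
     * '1' attaches the character to the previous one.
     */
    static List<String> decode(String chars, String labels) {
        if (chars.length() != labels.length()) {
            throw new IllegalArgumentException("need exactly one label per character");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            // The first character always opens a word, whatever its label.
            if (i > 0 && labels.charAt(i) == '0') {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(chars.charAt(i));
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Four characters; only the last one continues the previous word.
        System.out.println(decode("ABCD", "0001"));
    }
}
```

With only two labels there are no invalid label sequences to repair, unlike B/M/E/S tagging, where a sequence such as "B S E" has to be handled specially.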

Features

CoreNLP's features look like this (examples):

class CRFClassifier {
    // features
    featureIndex: class edu.stanford.nlp.util.HashIndex
        size = 3408491
        0=的膀cc2|C
        1=身也pc|C
        44=LSSLp2spscsc2s|C
        45=科背p2p|C
        46=迪。cc2|C
        ...
        =球-行pc2|CnC
        =音非cc2|CpC

    // weights
    weights: double[3408491][2]
        [[2.2114868426005005E-5, -2.2114868091546352E-5]...]
}

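Feature suffixes come in only three kinds, C, CpC, and CnC, one for each of the three broad feature classes. All of them are generated by the feature template: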

// List of feature templates
featureFactories: ArrayList<FeatureFactory>
    0 = Gale2007ChineseSegmenterFeatureFactory

// The concrete feature template
Gale2007ChineseSegmenterFeatureFactory::getCliqueFeatures() {
    if (clique == cliqueC) {
        addAllInterningAndSuffixing(features, featuresC(cInfo, loc), "C");
    } else if (clique == cliqueCpC) {
        addAllInterningAndSuffixing(features, featuresCpC(cInfo, loc), "CpC");
        addAllInterningAndSuffixing(features, featuresCnC(cInfo, loc - 1), "CnC");
    }
}

The feature template uses only two cliques, cliqueC and cliqueCpC: cliqueC is implemented by featuresC(), and cliqueCpC by featuresCpC() together with featuresCnC().

Gale2007ChineseSegmenterFeatureFactory::featuresC() {
    if (flags.useWord1) {
        // Unigram features
        features.add(charc +"::c"); // c[0]
        features.add(charc2+"::c2"); // c[1]
        features.add(charp +"::p"); // c[-1]
        features.add(charp2 +"::p2"); // c[-2]

        // Bigram features
        features.add(charc +charc2  +"::cn"); // c[0]c[1]
        features.add(charc +charc3  +"::cn2"); // c[0]c[2]
        features.add(charp +charc  +"::pc"); // c[-1]c[0]
        features.add(charp +charc2  +"::pn"); // c[-1]c[1]
        features.add(charp2 +charp  +"::p2p"); // c[-2]c[-1]
        features.add(charp2 +charc  +"::p2c"); // c[-2]c[0]
        features.add(charc2 +charc  +"::n2c"); // c[1]c[0]
    }

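    // Dictionary features for the three characters c[-1], c[0], c[1], based on LBeginAnnotation, LMiddleAnnotation, and LEndAnnotation
    // The resulting features end in one of 9 suffixes: "-lb", "-lm", "-le", "-plb", "-plm", "-ple", "-c2lb", "-c2lm", "-c2le"
    // Observed values: flags.dictionary is null, flags.serializedDictionary is ".../models/segmenter/chinese/dict-chris6.ser.gz"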
    if (flags.dictionary != null || flags.serializedDictionary != null) {
        dictionaryFeaturesC(CoreAnnotations.LBeginAnnotation.class,
                CoreAnnotations.LMiddleAnnotation.class,
                CoreAnnotations.LEndAnnotation.class,
                "", features, p, c, c2);
    }

    // Features c[-2]c[-1] and c[-2]
    if (flags.useFeaturesC4gram || flags.useFeaturesC5gram || flags.useFeaturesC6gram) {
        features.add(charp2 + charp + "p2p");
        features.add(charp2 + "p2");
    }

    // Unicode type 3-gram feature
    if (flags.useUnicodeType || flags.useUnicodeType4gram || flags.useUnicodeType5gram) {
        features.add(uTypep + "-" + uTypec + "-" + uTypec2 + "-uType3");
    }

    // Unicode type 4-gram feature
    if (flags.useUnicodeType4gram || flags.useUnicodeType5gram) {
        features.add(uTypep2 + "-" + uTypep + "-" + uTypec + "-" + uTypec2 + "-uType4");
    }

    // Unicode block feature
    if (flags.useUnicodeBlock) {
        features.add(p.getString(CoreAnnotations.UBlockAnnotation.class) + "-"
                + c.getString(CoreAnnotations.UBlockAnnotation.class) + "-"
                + c2.getString(CoreAnnotations.UBlockAnnotation.class)
                + "-uBlock");
    }

    // Shape features
    if (flags.useShapeStrings) {
        if (flags.useShapeStrings1) {
            features.add(p.getString(CoreAnnotations.ShapeAnnotation.class) + "ps");
            features.add(c.getString(CoreAnnotations.ShapeAnnotation.class) + "cs");
            features.add(c2.getString(CoreAnnotations.ShapeAnnotation.class) + "c2s");
        }
        if (flags.useShapeStrings3) {
            features.add(p.getString(CoreAnnotations.ShapeAnnotation.class)
                    + c.getString(CoreAnnotations.ShapeAnnotation.class)
                    + c2.getString(CoreAnnotations.ShapeAnnotation.class)
                    + "pscsc2s");
        }
        if (flags.useShapeStrings4) {
            features.add(p2.getString(CoreAnnotations.ShapeAnnotation.class)
                    + p.getString(CoreAnnotations.ShapeAnnotation.class)
                    + c.getString(CoreAnnotations.ShapeAnnotation.class)
                    + c2.getString(CoreAnnotations.ShapeAnnotation.class)
                    + "p2spscsc2s");
        }
        if (flags.useShapeStrings5) {
            features.add(p2.getString(CoreAnnotations.ShapeAnnotation.class)
                    + p.getString(CoreAnnotations.ShapeAnnotation.class)
                    + c.getString(CoreAnnotations.ShapeAnnotation.class)
                    + c2.getString(CoreAnnotations.ShapeAnnotation.class)
                    + c3.getString(CoreAnnotations.ShapeAnnotation.class)
                    + "p2spscsc2sc3s");
        }
    }
}

Gale2007ChineseSegmenterFeatureFactory::featuresCpC() {}

Gale2007ChineseSegmenterFeatureFactory::featuresCnC() {}
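
The character unigram and bigram templates from the useWord1 branch of featuresC() can be sketched in isolation. This is a simplified stand-in, not the CoreNLP implementation: it works on a plain String instead of CoreLabel objects, the '#' padding for out-of-range positions is a hypothetical choice, and Latin letters stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.List;

final class CharFeatureSketch {

    // Hypothetical padding symbol for positions outside the sentence.
    private static final char PAD = '#';

    private static String at(String s, int i) {
        return String.valueOf((i >= 0 && i < s.length()) ? s.charAt(i) : PAD);
    }

    /** Unigram and bigram features around position loc, each suffixed with "|C". */
    static List<String> featuresC(String sentence, int loc) {
        String c = at(sentence, loc);       // c[0]
        String c2 = at(sentence, loc + 1);  // c[1]
        String c3 = at(sentence, loc + 2);  // c[2]
        String p = at(sentence, loc - 1);   // c[-1]
        String p2 = at(sentence, loc - 2);  // c[-2]

        List<String> raw = new ArrayList<>();
        // Unigram features
        raw.add(c + "::c");
        raw.add(c2 + "::c2");
        raw.add(p + "::p");
        raw.add(p2 + "::p2");
        // Bigram features
        raw.add(c + c2 + "::cn");
        raw.add(c + c3 + "::cn2");
        raw.add(p + c + "::pc");
        raw.add(p + c2 + "::pn");
        raw.add(p2 + p + "::p2p");
        raw.add(p2 + c + "::p2c");
        raw.add(c2 + c + "::n2c");

        // Tag every feature with its clique suffix.
        List<String> out = new ArrayList<>(raw.size());
        for (String f : raw) {
            out.add(f + "|C");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(featuresC("abcde", 2));
    }
}
```

Each string produced this way would be looked up in featureIndex to find its row in the weights matrix; strings that were never seen in training simply have no weight.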

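The three feature classes end in "|C" (32 feature types), "|CpC" (37), and "|CnC" (9), for 78 feature types in total. In my view CoreNLP's feature set is overly complex, and most of the features contribute little. From here on, CoreNLP's pipeline is the same as in other segmenters: sum the feature weights for each label, run Viterbi decoding to find the most probable path, and parse the label sequence into the segmentation result.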
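
The decoding step (per-label weighted sums followed by Viterbi) can be sketched as follows. This is a generic first-order Viterbi, not CoreNLP's inference code: the class name is hypothetical, node[t][y] stands for the summed weights of the features firing at position t for label y, and the fixed trans matrix is a simplification of the position-dependent CpC/CnC clique scores.

```java
import java.util.Arrays;

final class ViterbiSketch {

    /**
     * Returns the label sequence with the highest total score.
     * node[t][y]  : score of label y at position t
     * trans[a][b] : score of moving from label a to label b
     */
    static int[] decode(double[][] node, double[][] trans) {
        int n = node.length;
        if (n == 0) {
            return new int[0];
        }
        int k = node[0].length;
        double[][] best = new double[n][k]; // best score of any path ending in (t, y)
        int[][] back = new int[n][k];       // predecessor label on that best path
        System.arraycopy(node[0], 0, best[0], 0, k);

        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                int arg = 0;
                double max = Double.NEGATIVE_INFINITY;
                for (int prev = 0; prev < k; prev++) {
                    double s = best[t - 1][prev] + trans[prev][y];
                    if (s > max) {
                        max = s;
                        arg = prev;
                    }
                }
                best[t][y] = max + node[t][y];
                back[t][y] = arg;
            }
        }

        // Pick the best final label, then follow the back-pointers.
        int last = 0;
        for (int y = 1; y < k; y++) {
            if (best[n - 1][y] > best[n - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] node = {{2.0, 0.0}, {0.0, 1.0}, {0.0, 3.0}};
        double[][] trans = {{0.0, 0.0}, {0.0, 0.0}};
        System.out.println(Arrays.toString(decode(node, trans)));
    }
}
```

Because \(Z_w(x)\) is the same for every labeling of a given input, the highest-scoring path is also the most probable one, so decoding never has to compute the normalizer.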

CoreNLP's segmentation is very slow and its accuracy is mediocre. Results on the PKU and MSR test sets:

Test set	Segmenter	Precision	Recall	F1
PKU	thulac4j	0.948	0.936	0.942
PKU	CoreNLP	0.901	0.894	0.897
MSR	thulac4j	0.866	0.896	0.881
MSR	CoreNLP	0.822	0.859	0.840
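
For readers who want to reproduce this kind of table, the usual word-level precision, recall, and F1 can be sketched as below. The article does not say which scoring script produced the numbers above, so this is the standard definition rather than the author's method: a predicted word counts as correct only if its start and end offsets both match a gold word. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SegmentationScore {

    /** Encodes each word as a (start, end) character span packed into one long. */
    static Set<Long> spans(List<String> words) {
        Set<Long> out = new HashSet<>();
        int start = 0;
        for (String w : words) {
            int end = start + w.length();
            out.add(((long) start << 32) | end);
            start = end;
        }
        return out;
    }

    /** Returns {precision, recall, F1} for one sentence. */
    static double[] score(List<String> gold, List<String> predicted) {
        Set<Long> g = spans(gold);
        Set<Long> p = spans(predicted);
        int correct = 0;
        for (Long span : p) {
            if (g.contains(span)) {
                correct++;
            }
        }
        double precision = p.isEmpty() ? 0.0 : (double) correct / p.size();
        double recall = g.isEmpty() ? 0.0 : (double) correct / g.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("ab", "c", "de");
        List<String> pred = Arrays.asList("ab", "cd", "e");
        System.out.println(Arrays.toString(score(gold, pred)));
    }
}
```

A corpus-level score would sum the correct, predicted, and gold counts over all sentences before dividing, rather than averaging per-sentence values.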

