Dependency Parsing

Dependency analysis of sentence constituents comes in two main kinds: syntactic and semantic.

Syntactic dependency parsing

Semantic dependency parsing

There are two families of dependency parsing methods: transition-based and graph-based.
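To make "transition-based" concrete, here is a minimal sketch of the arc-standard transition system, one common choice for such parsers. It only applies a transition sequence that is already given; a real parser would predict each transition with a classifier. The class and method names (`ArcStandardSketch`, `applyTransitions`) are made up for this illustration and do not come from any library.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class ArcStandardSketch {
    enum Transition { SHIFT, LEFT_ARC, RIGHT_ARC }

    // Applies arc-standard transitions to a sentence of n tokens numbered 1..n.
    // Index 0 is the artificial ROOT. Returns heads[i] = head of token i;
    // heads[0] stays -1 because ROOT has no head.
    static int[] applyTransitions(int n, List<Transition> transitions) {
        int[] heads = new int[n + 1];
        Arrays.fill(heads, -1);
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);          // the stack starts with ROOT
        int next = 1;           // front of the buffer
        for (Transition t : transitions) {
            if (t == Transition.SHIFT) {
                if (next > n) throw new IllegalStateException("buffer is empty");
                stack.push(next++);
                continue;
            }
            if (stack.size() < 2) throw new IllegalStateException("need two stack items for an arc");
            int top = stack.pop();
            if (t == Transition.LEFT_ARC) {
                // The second item becomes a dependent of the top item and is removed.
                int second = stack.pop();
                if (second == 0) throw new IllegalStateException("ROOT cannot be a dependent");
                heads[second] = top;
                stack.push(top);
            } else {
                // RIGHT_ARC: the top item becomes a dependent of the second item and is removed.
                heads[top] = stack.peek();
            }
        }
        return heads;
    }

    public static void main(String[] args) {
        // "She eats fish": She <- eats, fish <- eats, eats <- ROOT
        int[] heads = applyTransitions(3, Arrays.asList(
                Transition.SHIFT, Transition.SHIFT, Transition.LEFT_ARC,
                Transition.SHIFT, Transition.RIGHT_ARC, Transition.RIGHT_ARC));
        System.out.println(Arrays.toString(heads)); // [-1, 2, 0, 2]
    }
}
```

A graph-based parser, by contrast, scores candidate head-dependent arcs over the whole sentence and searches for the best-scoring tree rather than building it step by step.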

Training the Stanford NLP dependency parser

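The Stanford NLP toolkit (Stanford CoreNLP) provides dependency (syntactic) parsing and also lets you train your own dependency parsing model. The Stanford dependency parser (as of Stanford NLP 3.9.2) uses the neural network method of Chen & Manning (2014). Training takes two key inputs: a word embedding file, which is not strictly required but is best supplied when available, and the dependency parsing training data.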

The command format for training a Chinese dependency parser is:

java -cp <path to the Stanford NLP jars> edu.stanford.nlp.parser.nndep.DependencyParser -tlp edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack -trainFile <training data file> -embedFile <embedding file> -model <output model file name>

The official documentation says to supply the training file in CoNLL-X format. The original text reads:

Training your own parser

You can train a new dependency parser using your own data in the CoNLL-X data format. (Many dependency treebanks are provided in this format by default; even if not, conversion is often trivial.)

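According to the format description I found, a CoNLL-X file has only 7 columns: ID, FORM, LEMMA, POS, FEAT, HEAD, DEPREL; the CoNLL format described there has 12 columns, and CoNLL-U has 10. (Note that the CoNLL-X shared task specification itself defines 10 columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL, so the 7-column layout is a reduced variant.) The CoNLL-X format link given by the Stanford dependency parser documentation (as of 2019-02), http://ilk.uvt.nl/conll/#dataformat, is dead. Descriptions of the CoNLL-X and CoNLL formats can be found at https://stackoverflow.com/questions/27416164/what-is-conll-data-format , and the CoNLL-U format is described at https://universaldependencies.org/docs/format.html .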

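However, when the data was supplied in the 7-column CoNLL-X format described above, training quickly terminated with a NullPointerException (NPE). Reading the source (version 3.9.2) shows roughly what happens. Each line of the training file is split on "\t". If the line has at least 10 columns, column 2 is taken as the word, column 4 or 5 as the POS tag, column 7 as the head index, and column 8 as the dependency type. Lines with fewer than 10 columns (blank lines included) are skipped and treated as sentence separators. With 7-column input, the trainer therefore parses no dependency trees at all. At the end, the trainer writes out the labels seen in the training data (a List variable, knownLabels), and the NPE is thrown while iterating over knownLabels because one of its elements is null. That null comes from the rootLabel string never being assigned: since no dependency tree was parsed, no ROOT was found, so rootLabel keeps its initial value of null and is added to knownLabels, which triggers the exception.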
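The column check described above can be reproduced outside the trainer as a quick pre-flight test on a training file. The sketch below mirrors only the two conditions from the loader (at least 10 tab-separated columns, and an integer in the HEAD position); the class and method names are hypothetical, not part of Stanford CoreNLP.

```java
import java.util.Arrays;
import java.util.List;

final class LoaderColumnCheck {
    // Counts the lines that would be accepted as token lines under the rule
    // described above: at least 10 tab-separated columns and an integer HEAD
    // in the 7th column. Everything else is dropped or treated as a separator.
    static int acceptedTokenLines(List<String> lines) {
        int accepted = 0;
        for (String line : lines) {
            String[] splits = line.split("\t");
            if (splits.length < 10) {
                continue; // too few columns: never becomes a token
            }
            try {
                Integer.parseInt(splits[6]);
                accepted++;
            } catch (NumberFormatException e) {
                // non-numeric HEAD: the line is skipped
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> sevenColumns = Arrays.asList(
                "1\tShe\t_\tPRP\t_\t2\tnsubj",
                "2\truns\t_\tVBZ\t_\t0\troot",
                "");
        List<String> tenColumns = Arrays.asList(
                "1\tShe\t_\tPRP\tPRP\t_\t2\tnsubj\t_\t_",
                "2\truns\t_\tVBZ\tVBZ\t_\t0\troot\t_\t_",
                "");
        System.out.println(acceptedTokenLines(sevenColumns)); // 0
        System.out.println(acceptedTokenLines(tenColumns));   // 2
    }
}
```

A result of 0 for a non-empty file means the trainer would see no trees at all, which is the situation that leads to the NPE.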

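From this reading of the source, the tool in practice needs 10-column training data laid out as in CoNLL-U rather than the 7-column CoNLL-X layout, both in the number of columns and in what each column means. In addition, although training does not use the last two CoNLL-U columns, they must not be empty strings: Java's String.split("\t") discards trailing empty strings, so the row looks as if those two columns were missing, and the training data again fails to parse. Finally, token IDs (the ID column) within a sentence should start from 1, not 0, and a HEAD value that refers to ROOT is 0, not -1.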
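The trailing-empty-string behaviour of `String.split` is easy to check in isolation. The sketch below compares a row whose last two columns are empty with one that uses `_` placeholders; the class and method names are illustrative only.

```java
final class TrailingTabSplitDemo {
    // Column count as the loader computes it: split("\t") uses limit 0,
    // which removes trailing empty strings from the result.
    static int columnsSeenByLoader(String line) {
        return line.split("\t").length;
    }

    // Column count when trailing empty strings are kept (negative limit).
    static int columnsKeepingEmpties(String line) {
        return line.split("\t", -1).length;
    }

    public static void main(String[] args) {
        String emptyTail = "1\tShe\t_\tPRP\tPRP\t_\t2\tnsubj\t\t";
        String underscoreTail = "1\tShe\t_\tPRP\tPRP\t_\t2\tnsubj\t_\t_";
        System.out.println(columnsSeenByLoader(emptyTail));      // 8
        System.out.println(columnsKeepingEmpties(emptyTail));    // 10
        System.out.println(columnsSeenByLoader(underscoreTail)); // 10
    }
}
```

Because the first row yields only 8 columns, it fails the 10-column check even though the file visibly contains ten tab-separated fields.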

In summary, there are 3 points to watch:

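The training data must use the 10-column CoNLL-U layout (not the 7-column CoNLL-X layout)
Tokens in a sentence are numbered from 1, and the word attached to ROOT has 0 in its HEAD column
The last two columns of the training file play no part in training, but they must not be empty strings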
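If the data already exists in the 7-column layout (ID, FORM, LEMMA, POS, FEAT, HEAD, DEPREL), one way to satisfy these points is to pad each row to 10 columns. The sketch below is a hypothetical helper, not part of any toolkit. It assumes the single POS value can be copied into both POS columns, and it fills empty fields and the last two columns with `_`. It does not renumber tokens or rewrite HEAD values.

```java
final class ConllRowPadder {
    // Converts one 7-column row (ID, FORM, LEMMA, POS, FEAT, HEAD, DEPREL)
    // into a 10-column row: ID, FORM, LEMMA, POS, POS, FEAT, HEAD, DEPREL, _, _.
    static String toTenColumns(String sevenColumnRow) {
        // Limit -1 keeps trailing empty fields so the count is reliable.
        String[] f = sevenColumnRow.split("\t", -1);
        if (f.length != 7) {
            throw new IllegalArgumentException("expected 7 columns, got " + f.length);
        }
        for (int i = 0; i < f.length; i++) {
            if (f[i].isEmpty()) {
                f[i] = "_"; // an empty field could be dropped by split("\t")
            }
        }
        return String.join("\t",
                f[0], f[1], f[2], f[3], f[3], f[4], f[5], f[6], "_", "_");
    }

    public static void main(String[] args) {
        String padded = toTenColumns("1\tShe\t\tPRP\t\t2\tnsubj");
        System.out.println(padded.split("\t").length); // 10
    }
}
```

Blank separator lines between sentences should be passed through unchanged rather than fed to this helper.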

Appendix: the part of the source that splits file lines and parses dependency trees:

//file: edu.stanford.nlp.parser.nndep.DependencyParser.java(stanford-corenlp:3.9.2 )
public static void loadConllFile(String inFile, List<CoreMap> sents, List<DependencyTree> trees, boolean unlabeled, boolean cPOS)
  {
    CoreLabelTokenFactory tf = new CoreLabelTokenFactory(false);

    try (BufferedReader reader = IOUtils.readerFromString(inFile)) {

      List<CoreLabel> sentenceTokens = new ArrayList<>();
      DependencyTree tree = new DependencyTree();

      for (String line : IOUtils.getLineIterable(reader, false)) {
        String[] splits = line.split("\t");
        if (splits.length < 10) {
          if (sentenceTokens.size() > 0) {
            trees.add(tree);
            CoreMap sentence = new CoreLabel();
            sentence.set(CoreAnnotations.TokensAnnotation.class, sentenceTokens);
            sents.add(sentence);
            tree = new DependencyTree();
            sentenceTokens = new ArrayList<>();
          }
        } else {
          String word = splits[1],
                  pos = cPOS ? splits[3] : splits[4],
                  depType = splits[7];

          int head = -1;
          try {
            head = Integer.parseInt(splits[6]);
          } catch (NumberFormatException e) {
            continue;
          }

          CoreLabel token = tf.makeToken(word, 0, 0);
          token.setTag(pos);
          token.set(CoreAnnotations.CoNLLDepParentIndexAnnotation.class, head);
          token.set(CoreAnnotations.CoNLLDepTypeAnnotation.class, depType);
          sentenceTokens.add(token);

          if (!unlabeled)
            tree.add(head, depType);
          else
            tree.add(head, Config.UNKNOWN);
        }
      }
    } catch (IOException e) {
      throw new RuntimeIOException(e);
    }
  }

Training the Stanford NLP dependency parser can sometimes be very slow.

Training the HIT (Harbin Institute of Technology) LTP dependency parser

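LTP is written in C++ and ships no prebuilt binary for training, so you need to download the source and build it.
Download the LTP source and build it following the guide. The resulting binary tools/train/nndepparser, run with the learn subcommand, trains the dependency parser. Required parameters: --embedding <FILE>, the embedding file; --reference <FILE>, the training data file; --model <FILE>, the model output path. Although an embedding file must be supplied, its content may be empty. Command:
tools/train/nndepparser learn --model ltp.nndep.model --embedding embed.txt --reference train-data.conllu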

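As of their latest releases at the time of writing, the dependency parsers in both Stanford NLP (3.9.2) and LTP (3.4.0) use the method proposed by Chen & Manning (2014).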

HanLP dependency parsing

As of HanLP 1.7.2, the latest version at the time of writing, its dependency parser is a port of the LTP code.
