TokensRegex in Stanford CoreNLP

Reposted article; 2016-08-05 15:59; category: tools and frameworks

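I have recently been working on natural language understanding for music and reading content, so I evaluated and tried out Stanford CoreNLP. These are my notes.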

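Features

Stanford CoreNLP is a suite of natural language analysis tools, including:

POS (part-of-speech tagger) - tags each word with its part of speech
NER (named entity recognizer) - recognizes named entities
Parser - analyzes the grammatical structure of a sentence, e.g. identifying phrases and the subject, verb and object
Coreference Resolution - finds the words in a text that refer to the same entity, e.g. I/my and Nader/he each referring to the same person
Sentiment Analysis - classifies the sentiment of a text
Bootstrapped pattern learning - as far as I can tell, this learns patterns with little or no supervision, e.g. to extract entity names
Open IE (Information Extraction) - extracts structured relation tuples from plain text, e.g. "Barack Obama was born in Hawaii" => (Barack Obama; was born in; Hawaii)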

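Requirements

Voice-interaction applications (voice assistants, or smart speakers such as the Echo) usually receive colloquial natural language, for example 我想聽一個段子 ("I want to hear a joke") or 給我來個牛郎織女的故事 ("tell me the story of the Cowherd and the Weaver Girl"). To return accurate results we need to extract the useful topic words: 段子 / 牛郎織女 / 故事. After looking around, I decided to try CoreNLP's TokensRegex, which is regular expressions over token sequences. It puts several tools at your disposal: regular expressions, word segmentation, part-of-speech tags and entity types. You can also define your own entity types, e.g. declare 牛郎織女 to be an entity of type READ.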

Pattern syntax

Rule format

{
  // ruleType is "text", "tokens", "composite", or "filter"
  ruleType: "tokens", // "tokens" applies a regex over the segmented token sequence; "text" applies a regex over the raw text string; I have not figured out composite/filter yet
  
  // pattern to be matched  
  pattern: ( ( [ { ner:PERSON } ]) /was/ /born/ /on/ ([ { ner:DATE } ]) ),

  // value associated with the expression for which the pattern was matched
  // matched expressions are returned with "DATE_OF_BIRTH" as the value
  // (as part of the MatchedExpression class)
  result: "DATE_OF_BIRTH"
}

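Besides the fields above there are others such as action, name, stage, active and priority; see the references at the end of this post.

When ruleType is tokens, the basic element of the pattern is a token. The whole pattern is wrapped in (), a single token is written as [<expression>], and a single expression is written as {tag:xx;ner:xx}.

When ruleType is text, the pattern is an ordinary regular expression whose basic element is a character, and the whole pattern is enclosed in //.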
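
The difference between the two rule types can be seen without CoreNLP at all. The sketch below is plain java.util.regex, not the TokensRegex API. The class name and the hand-written token list are made up for illustration, and it assumes that a /regex/ element in a tokens pattern has to match a token's text in full, which is how I understand TokensRegex to behave.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: contrasts character-level and token-level matching.
class TokenVsTextMatch {

    // Text-level matching: the regex may match anywhere in the raw string.
    static boolean textFind(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Token-level matching: at least one token must match the regex in full.
    static boolean anyTokenMatches(String regex, List<String> tokens) {
        Pattern p = Pattern.compile(regex);
        return tokens.stream().anyMatch(t -> p.matcher(t).matches());
    }

    public static void main(String[] args) {
        // Hypothetical segmentation, written by hand rather than produced by a segmenter.
        List<String> tokens = Arrays.asList("聽", "個", "故事會");
        System.out.println(textFind("故事", "聽個故事會"));   // the substring is found
        System.out.println(anyTokenMatches("故事", tokens));  // no single token is exactly 故事
    }
}
```

A text rule would fire on 故事會 because it contains 故事, while a tokens rule would not if the segmenter keeps 故事會 as one token. This is why the quality of segmentation matters for tokens rules.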

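Example

CoreNLP can extract matches with a single regular expression or with several. This post covers loading rules from a file to catch the text we care about and pull the topic words out of it.

Dependencies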

<dependency>
     <groupId>edu.stanford.nlp</groupId>
     <artifactId>stanford-corenlp</artifactId>
     <version>3.4.1</version>
</dependency>
<dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.4.1</version>
      <classifier>models</classifier>
</dependency>
<!-- Chinese language support -->
<dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.6.0</version>
      <classifier>models-chinese</classifier>
</dependency>

Property configuration: CoreNLP-chinese.properties (see the configuration shipped in stanford-corenlp-models-chinese for reference)

annotators = segment, ssplit, pos, ner, regexner, parse
# custom entity mapping file used by regexner
regexner.mapping = regexner.txt

customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator

segment.model = edu/stanford/nlp/models/segmenter/chinese/pku.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# tokens that mark the end of a sentence
ssplit.boundaryTokenRegex = [.]|[!?]+|[。]|[！？]+

pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = false
ner.useSUTime = false

parse.model = edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz

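In CoreNLP, one pass of processing over a text is called a pipeline, and each annotator is one stage of it: segment does word segmentation, ssplit splits a passage into sentences, pos tags parts of speech, ner recognizes named entities, regexner labels entity types using user-defined regular expressions, and parse analyzes sentence structure. The remaining lines set the properties of each annotator.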
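
One detail of .properties files is worth checking when annotating a configuration like this. java.util.Properties only treats a line as a comment when it starts with # or !, so anything written after a value on the same line (a trailing // note, for example) becomes part of the value. The sketch below shows this with the standard library alone. It assumes CoreNLP reads the file through java.util.Properties, and the class and method names are my own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesCommentCheck {

    // Parses properties text and returns the value stored under the given key.
    static String valueOf(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // A trailing note on the same line is kept as part of the value.
        System.out.println(valueOf("regexner.mapping = regexner.txt// mapping file", "regexner.mapping"));
        // A comment on its own line, starting with #, is ignored.
        System.out.println(valueOf("# mapping file\nregexner.mapping = regexner.txt", "regexner.mapping"));
    }
}
```

If a model or mapping file cannot be found at startup, a stray note after the path is one of the first things to rule out.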

Custom rule files

regexner.txt (labels 牛郎織女 with the entity type READ)

牛郎織女	READ
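
The two columns in regexner.txt are separated by a tab, and tabs are easily turned into spaces when a file is copied from a web page. A small pre-flight check like the one below can catch that before the pipeline starts. It is plain Java with invented names, not part of CoreNLP, and it only looks at the first two columns (the pattern and the entity type).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RegexNerMappingCheck {

    // Parses mapping text into pattern -> entity type, rejecting lines without a tab.
    static Map<String, String> parse(String mappingText) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String line : mappingText.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] columns = line.split("\t");
            if (columns.length < 2) {
                throw new IllegalArgumentException("Not tab-separated: " + line);
            }
            mapping.put(columns[0], columns[1].trim());
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(parse("牛郎織女\tREAD"));
    }
}
```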

rule.txt (TokensRegex rules)

$TYPE="/笑話|故事|段子|口技|謎語|寓言|評書|相聲|小品|唐詩|古詩|宋詞|繞口令|小說/ | /腦筋/ /急轉彎/"
// single-type rule
{
	ruleType: "tokens",
	pattern: ((?$type $TYPE)),
	result: Format("%s;%s;%s", "", $$type.text.replace(" ",""), "")
}

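(?$type xx) declares a named group called $type. The rule extracts that group and returns a result assembled in the form xx;xx;xx. In this rule the first and third slots are left empty, so only the middle slot carries the matched type.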
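
As a rough analogue of what this rule does, here is the same idea at the character level in plain java.util.regex: capture a named group, strip spaces from the captured text, and format it into the three-slot string. It is not the TokensRegex engine, the class name is invented, and only a few of the $TYPE alternatives are included to keep it short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TypeRuleSketch {

    // A shortened stand-in for $TYPE; the optional space mimics a two-token match.
    private static final Pattern TYPE =
            Pattern.compile("(?<type>笑話|故事|段子|腦筋 ?急轉彎)");

    // Returns ";<type>;" for the first match, or null when nothing matches.
    static String extract(String text) {
        Matcher m = TYPE.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Mirrors Format("%s;%s;%s", "", $$type.text.replace(" ",""), "")
        return String.format("%s;%s;%s", "", m.group("type").replace(" ", ""), "");
    }

    public static void main(String[] args) {
        System.out.println(extract("聽個故事"));
        System.out.println(extract("來個腦筋 急轉彎"));
    }
}
```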

Code

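import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Load the TokensRegex rules
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFile(TokenSequencePattern.getNewEnv(), "rule.txt");
// Create the pipeline from the properties file
StanfordCoreNLP coreNLP = new StanfordCoreNLP("CoreNLP-chinese.properties");
// Process the text ("play me a story")
Annotation annotation = coreNLP.process("聽個故事");
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
CoreMap sentence = sentences.get(0); // analysis result for the first sentence
// Run the TokensRegex rules over the sentence
List<MatchedExpression> matchedExpressions = extractor.extractExpressions(sentence);
for (MatchedExpression match : matchedExpressions) {
    System.out.println("Matched expression: " + match.getText() + " with value " + match.getValue());
}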
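
With the rule file shown earlier, the value attached to a match is a string of the form xx;xx;xx in which some slots may be empty. If the slots are needed separately, note that String.split drops trailing empty strings unless it is given a negative limit. A small sketch on a plain string follows; the helper name is mine, and getting the string out of the MatchedExpression value is left out.

```java
import java.util.Arrays;

class ResultFields {

    // Splits "a;b;c" into exactly its slots, keeping empty ones.
    static String[] fields(String value) {
        return value.split(";", -1); // limit -1 keeps trailing empty strings
    }

    public static void main(String[] args) {
        String value = ";故事;";
        System.out.println(Arrays.toString(fields(value)));
        System.out.println(value.split(";").length); // the default split loses the last slot
    }
}
```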

To inspect the analysis results, such as the segmentation, POS tags and entity labels, you can use the function below

    // Requires imports of java.util.List, edu.stanford.nlp.ling.CoreLabel,
    // edu.stanford.nlp.ling.CoreAnnotations and edu.stanford.nlp.util.CoreMap.
    private void debug(CoreMap sentence) {
        // Take the list of CoreLabel tokens out of the CoreMap and print them one by one
        List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
        System.out.println("Token" + "\t " + "POS" + "\t " + "NER");
        System.out.println("-----------------------------");
        for (CoreLabel token : tokens) {
            String word = token.getString(CoreAnnotations.TextAnnotation.class);
            String pos = token.getString(CoreAnnotations.PartOfSpeechAnnotation.class);
            String ner = token.getString(CoreAnnotations.NamedEntityTagAnnotation.class);
            System.out.println(word + "\t " + pos + "\t " + ner);
        }
    }

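It is quite powerful: with segmentation, POS tags, entity types and regular expressions all available in one place, there are more ways to attack a problem when one comes up.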

References

TokensRegex: http://nlp.stanford.edu/software/tokensregex.shtml

SequenceMatchRules: http://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html

Regexner: http://nlp.stanford.edu/software/regexner.html
