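I recently needed to study NLP for work, but the official documentation is sprawling and it is easy to lose sight of the key points, so I decided to write up my own learning roadmap.
NLP roadmap
A good blog: Han Xiaoyang
Stanford NLP open course
Statistical Learning Methods
A good blog
Link: https://pan.baidu.com/s/1myVT-yMzqzJIcl50mGs2JA
Extraction code: tw6r
Reference documents:
Following a video tutorial by an Indian presenter, I ran a small demo.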
step 1: Create a Maven project in IDEA and add the required dependencies. The latest version at the time of writing is 3.9.2.
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.2</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.2</version>
    <classifier>models</classifier>
</dependency>
<!-- add Chinese language support -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.2</version>
    <classifier>models-chinese</classifier>
</dependency>
step 2: Use the NLP package
package com.ghc.corhort.query.utils;

import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.*;

/**
 * Minimal Stanford CoreNLP pipeline demo: annotates one sentence and prints
 * each token's POS tag and NER label, followed by the parse tree.
 *
 * @author Frank Li
 * @date Created in 2019/8/7 13:39
 */
public class Demo {
    public static void main(String[] args) {
        // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // read some text in the text variable
        String text = "I like eat apple!";

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println("word:" + word + "-->pos:" + pos + "-->ne:" + ne);
            }

            // this is the parse tree of the current sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println(String.format("tree:\n%s", tree.toString()));

            // this is the Stanford dependency graph of the current sentence (retrieved but not printed here)
            SemanticGraph dependencies = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
        }

        // This is the coreference link graph (retrieved but not printed here)
        // Each chain stores a set of mentions that link to each other,
        // along with a method for getting the most representative mention
        // Both sentence and token offsets start at 1!
        Map<Integer, CorefChain> graph =
                document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
    }
}
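One comment in the demo describes a CoreMap as a Map that uses class objects as keys and has values with custom types. The sketch below is a toy stand-in for that idea in plain Java, with no CoreNLP dependency. The names TypedMapSketch, Key, TextKey and TokenIndexKey are made up for illustration and are not the library's classes; the sketch only shows why calls like token.get(SomeAnnotation.class) can return a correctly typed value without a cast at the call site.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for the "map keyed by class objects" idea; not CoreNLP code. */
class TypedMapSketch {

    /** A key type declares, through V, the type of the value stored under it. */
    interface Key<V> { }

    // Hypothetical keys, loosely named after the annotations used in the demo.
    static final class TextKey implements Key<String> { }
    static final class TokenIndexKey implements Key<Integer> { }

    private final Map<Class<?>, Object> values = new HashMap<>();

    /** Stores a value; the compiler checks that it matches the key's declared type. */
    <V> void set(Class<? extends Key<V>> key, V value) {
        values.put(key, value);
    }

    /** Returns the value stored under the key, or null if nothing was stored. */
    @SuppressWarnings("unchecked")
    <V> V get(Class<? extends Key<V>> key) {
        return (V) values.get(key);
    }

    public static void main(String[] args) {
        TypedMapSketch token = new TypedMapSketch();
        token.set(TextKey.class, "apple");
        token.set(TokenIndexKey.class, Integer.valueOf(4));

        // No casts needed: the key class determines the static type of the result.
        String word = token.get(TextKey.class);
        Integer index = token.get(TokenIndexKey.class);
        System.out.println(word + " @ " + index); // prints "apple @ 4"
    }
}
```

Trying to store a value of the wrong type, for example token.set(TextKey.class, Integer.valueOf(1)), is rejected at compile time, which is the main benefit of using the key class to carry the value type.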
Output

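A shallow look at how it works
TokensRegex in Stanford CoreNLP
I have recently been working on natural language understanding for music and reading content, so I evaluated Stanford CoreNLP and am recording my notes here.
Features
Stanford CoreNLP is a suite of natural language analysis tools that includes:
POS (part-of-speech tagger) - labels the part of speech of each word
NER (named entity recognizer) - recognizes named entities
Parser - analyzes the grammatical structure of a sentence, for example identifying phrases and the subject, predicate and object
Coreference Resolution - finds the words in a text that refer to the same entity; for example, I/my and Nader/he can each refer to the same person
Sentiment Analysis - classifies the sentiment of a sentence
Bootstrapped pattern learning - roughly, it learns extraction patterns from unlabeled text starting from a few seed examples, for example to extract entity names
Open IE (Information Extraction) - extracts structured relation tuples from plain text, for example "Barack Obama was born in Hawaii" => (Barack Obama; was born in; Hawaii)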
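To make the shape of an Open IE result concrete, here is a toy sketch in plain Java. It is not how Open IE works internally: it uses a single hand-written regular expression for one relation phrase, and the names TripleSketch, Triple and extract are made up for illustration. It only shows what a (subject; relation; object) tuple looks like as data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration of a relation tuple; not the CoreNLP Open IE implementation. */
class TripleSketch {

    /** Hypothetical holder for one (subject; relation; object) tuple. */
    static final class Triple {
        final String subject;
        final String relation;
        final String object;

        Triple(String subject, String relation, String object) {
            this.subject = subject;
            this.relation = relation;
            this.object = object;
        }

        @Override
        public String toString() {
            return "(" + subject + "; " + relation + "; " + object + ")";
        }
    }

    // One hard-coded relation phrase; a real system is not limited to a fixed list like this.
    private static final Pattern BORN_IN = Pattern.compile("(.+?) (was born in) (.+?)\\.?");

    /** Returns the tuple for sentences of the form "X was born in Y", or null otherwise. */
    static Triple extract(String sentence) {
        Matcher m = BORN_IN.matcher(sentence.trim());
        return m.matches() ? new Triple(m.group(1), m.group(2), m.group(3)) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("Barack Obama was born in Hawaii"));
    }
}
```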
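Requirements
Voice-interaction applications (such as voice assistants or smart speakers like the Echo) usually receive colloquial natural language, for example 我想聽一個段子 ("I want to hear a joke") or 給我來個牛郎織女的故事 ("give me the story of the Cowherd and the Weaver Girl"). To return a precise result, the useful topic words have to be extracted: 段子 (joke), 牛郎織女 (the Cowherd and the Weaver Girl), 故事 (story). After looking around, I decided to try CoreNLP's TokensRegex, which provides regular expressions over token sequences. The tools it makes available are regular expressions, word segmentation, part-of-speech tags and entity categories, and you can also define your own entity categories, for example marking 牛郎織女 as an entity of category READ.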
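The idea of matching over token sequences rather than characters can be sketched in plain Java. This is not the TokensRegex API or its pattern syntax; TokenPatternSketch, Token, ner, posPrefix and findFirst are hypothetical names. The tokens are an English stand-in for the example above, and their POS and NER tags (including the custom READ category) are assigned by hand here, whereas in practice they would come from the pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

/** Toy matcher over token sequences; not the TokensRegex implementation. */
class TokenPatternSketch {

    /** Hypothetical token carrying the word, its POS tag and its entity category. */
    static final class Token {
        final String word;
        final String pos;
        final String ner;

        Token(String word, String pos, String ner) {
            this.word = word;
            this.pos = pos;
            this.ner = ner;
        }
    }

    /** Matches a token whose entity category equals the given tag. */
    static Predicate<Token> ner(String tag) {
        return t -> tag.equals(t.ner);
    }

    /** Matches a token whose POS tag starts with the given prefix (e.g. "NN" for nouns). */
    static Predicate<Token> posPrefix(String prefix) {
        return t -> t.pos.startsWith(prefix);
    }

    /** Returns the words of the first contiguous run matching the pattern in order, or an empty list. */
    static List<String> findFirst(List<Token> tokens, List<Predicate<Token>> pattern) {
        for (int start = 0; start + pattern.size() <= tokens.size(); start++) {
            List<String> words = new ArrayList<>();
            for (int i = 0; i < pattern.size(); i++) {
                Token t = tokens.get(start + i);
                if (!pattern.get(i).test(t)) {
                    break;
                }
                words.add(t.word);
            }
            if (words.size() == pattern.size()) {
                return words;
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Hand-tagged stand-in for "give me the story of the Cowherd and the Weaver Girl".
        List<Token> tokens = Arrays.asList(
                new Token("tell", "VB", "O"),
                new Token("me", "PRP", "O"),
                new Token("a", "DT", "O"),
                new Token("Cowherd and Weaver Girl", "NNP", "READ"),
                new Token("story", "NN", "O"));

        // Pattern: a READ entity immediately followed by a noun.
        List<Predicate<Token>> pattern = Arrays.asList(ner("READ"), posPrefix("NN"));
        System.out.println(findFirst(tokens, pattern));
    }
}
```

Each pattern element tests a whole token (its text, tag or entity category), so a custom category such as READ can be used in a pattern in the same way as a built-in tag.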


Next, I will be working on nlp2sql.

