Source download: the CoreNLP official website.
The currently released CoreNLP version 3.5.0 supports only Java 1.8 and above, so you may need to add a JDK 1.8 configuration to Eclipse. The steps are:
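- First, download Java 1.8 from the Oracle website (download page: Java download) and install it.
- Open Eclipse and choose Window -> Preferences -> Java -> Installed JREs to configure it:
Click "Add" on the right side of the dialog, select "Standard VM" (the standard virtual machine), then click "Next";
In the "JRE home" row, click "Directory…" on the right and locate your Java installation path, for example "C:\Program Files\Java\jdk1.8".
Eclipse now supports JDK 1.8.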
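To confirm that the project really runs on the newly configured JRE, a small check like the following can help. It is a minimal sketch: the class name `JavaVersionCheck` and the helper `majorVersion` are made up for illustration, and the only platform call it relies on is `System.getProperty("java.version")`.

```java
// Minimal sketch: report which Java version the project is running on.
// JavaVersionCheck and majorVersion are hypothetical names, not part of CoreNLP.
class JavaVersionCheck {

    // Parses strings such as "1.8.0_25" (major version 8) or "11.0.2" (major version 11).
    static int majorVersion(String version) {
        String[] parts = version.split("[._\\-+]");
        int first = Integer.parseInt(parts[0]);
        // Before Java 9 the scheme was "1.x", so the second field is the major version.
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        int major = majorVersion(version);
        System.out.println("java.version = " + version);
        System.out.println(major >= 8
                ? "OK: Java 1.8 or newer"
                : "Too old for CoreNLP 3.5.0, check Installed JREs");
    }
}
```

Run it as a Java application inside the same Eclipse project; if it reports an older version, the project is probably still bound to a different JRE.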
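1. Create a new Java project, making sure to select 1.8 as the compiler version
2. Extract the source downloaded from the official website into the project and import the required jars
For example, import stanford-corenlp-3.5.0.jar, stanford-corenlp-3.5.0-javadoc.jar, stanford-corenlp-3.5.0-models.jar, stanford-corenlp-3.5.0-sources.jar, xom.jar, and so on (the javadoc and sources jars are only for browsing documentation and source in the IDE; they are not needed to run the code).
To import the jars: right-click the project -> Properties -> Java Build Path -> Libraries, click "Add JARs", and select the jars from the path.
3. Create a TestCoreNLP class with the following code: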
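If you want to verify that the jars actually ended up on the build path, one option is to try locating a class from each of them. The sketch below assumes `edu.stanford.nlp.pipeline.StanfordCoreNLP` as a representative class of the main jar and `nu.xom.Document` as one from xom.jar; `ClasspathCheck` and `isOnClasspath` are hypothetical names. It checks classes only, so it says nothing about the models jar, which holds model files.

```java
// Minimal sketch: check whether classes from the imported jars can be located.
// ClasspathCheck and isOnClasspath are hypothetical names used for illustration.
class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "edu.stanford.nlp.pipeline.StanfordCoreNLP", // expected in stanford-corenlp-3.5.0.jar
            "nu.xom.Document"                            // expected in xom.jar
        };
        for (String name : required) {
            System.out.println((isOnClasspath(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

A "MISSING" line usually means the corresponding jar was not added under Java Build Path -> Libraries.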
```java
package Test;

import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class TestCoreNLP {
    public static void main(String[] args) {
        // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // read some text in the text variable
        String text = "Add your text here:Beijing sings Lenovo";

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        System.out.println("word\tpos\tlemma\tner");
        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(NamedEntityTagAnnotation.class);
                // this is the lemma of the token
                String lemma = token.get(LemmaAnnotation.class);

                System.out.println(word + "\t" + pos + "\t" + lemma + "\t" + ne);
            }
            // this is the parse tree of the current sentence
            Tree tree = sentence.get(TreeAnnotation.class);

            // this is the Stanford dependency graph of the current sentence
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
        }
        // This is the coreference link graph
        // Each chain stores a set of mentions that link to each other,
        // along with a method for getting the most representative mention
        // Both sentence and token offsets start at 1!
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
    }
}
```
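PS: The idea of this code is to hand the text string to Stanford CoreNLP, whose components (annotators) process it in the order "tokenize (tokenization), ssplit (sentence splitting), pos (part-of-speech tagging), lemma (lemmatization), ner (named entity recognition), parse (syntactic parsing), dcoref (coreference resolution)".
After processing, `List<CoreMap> sentences = document.get(SentencesAnnotation.class);` holds the per-sentence analysis results; iterate over it to read them.
Here we simply print each token's word, part-of-speech tag, lemma, and named-entity label. For other uses (such as sentiment, parse, and relation), see the official website.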
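The order matters because later annotators build on the output of earlier ones. The sketch below illustrates the idea with a hand-written prerequisite table; the table is my own simplified assumption for these seven annotators, not something read from the library, and `AnnotatorOrderCheck` and `firstMisplaced` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal sketch: check that every annotator appears after the ones it builds on.
// The REQUIRES table is an assumption written by hand for illustration only.
class AnnotatorOrderCheck {

    static final Map<String, List<String>> REQUIRES = new HashMap<>();
    static {
        REQUIRES.put("tokenize", Collections.<String>emptyList());
        REQUIRES.put("ssplit", Arrays.asList("tokenize"));
        REQUIRES.put("pos", Arrays.asList("tokenize", "ssplit"));
        REQUIRES.put("lemma", Arrays.asList("tokenize", "ssplit", "pos"));
        REQUIRES.put("ner", Arrays.asList("tokenize", "ssplit", "pos", "lemma"));
        REQUIRES.put("parse", Arrays.asList("tokenize", "ssplit"));
        REQUIRES.put("dcoref", Arrays.asList("tokenize", "ssplit", "pos", "lemma", "ner", "parse"));
    }

    // Returns the first annotator that is unknown to the table or listed before
    // one of its prerequisites, or null if the whole list is in a valid order.
    static String firstMisplaced(String annotators) {
        Set<String> seen = new HashSet<>();
        for (String name : annotators.split("[,\\s]+")) {
            if (name.isEmpty()) {
                continue; // skip the empty piece produced by leading separators
            }
            List<String> needed = REQUIRES.get(name);
            if (needed == null || !seen.containsAll(needed)) {
                return name;
            }
            seen.add(name);
        }
        return null;
    }

    public static void main(String[] args) {
        String order = "tokenize, ssplit, pos, lemma, ner, parse, dcoref";
        String problem = firstMisplaced(order);
        System.out.println(problem == null ? "order looks valid" : "problem at: " + problem);
    }
}
```

With this table, a list such as "tokenize, pos, ssplit" is flagged at `pos`, because sentence splitting has not run yet at that point.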
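The `get(SomeAnnotation.class)` calls work because, as the code comment says, a CoreMap uses class objects as keys and each key determines the type of its value. The following is a minimal sketch of that general pattern in plain Java, not CoreNLP's actual implementation; `TypedMapSketch`, `Key`, `TextKey`, `TokensKey`, and `TypedMap` are all hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a map keyed by class objects, where each key fixes its value type.
// This illustrates the general pattern only; it is not CoreNLP code.
class TypedMapSketch {

    // Marker interface: a key class declares the type V of the value it maps to.
    interface Key<V> {}

    static final class TextKey implements Key<String> {}
    static final class TokensKey implements Key<List<String>> {}

    static final class TypedMap {
        private final Map<Class<?>, Object> values = new HashMap<>();

        <V> void set(Class<? extends Key<V>> key, V value) {
            values.put(key, value);
        }

        // The cast is safe as long as values are only stored through set().
        @SuppressWarnings("unchecked")
        <V> V get(Class<? extends Key<V>> key) {
            return (V) values.get(key);
        }
    }

    public static void main(String[] args) {
        TypedMap sentence = new TypedMap();
        sentence.set(TextKey.class, "Beijing sings Lenovo");
        sentence.set(TokensKey.class, Arrays.asList("Beijing", "sings", "Lenovo"));

        // No casts at the call site: the key class determines the result type.
        String text = sentence.get(TextKey.class);
        List<String> tokens = sentence.get(TokensKey.class);
        System.out.println(text + " -> " + tokens);
    }
}
```

The benefit of this design is that the compiler rejects a mismatched assignment such as storing a list under `TextKey.class`, which is why the tutorial code can read a `String` or a `List` without explicit casts.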
4. Run result: