Stanford CoreNLP

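Stanford CoreNLP provides a set of natural language processing tools. Given raw English text as input, these tools output the base forms of words and their part-of-speech tags, determine whether words are names of companies, people, and so on, normalize dates, times, and numeric quantities, produce the syntactic parse tree and word dependencies of each sentence, and indicate which noun phrases refer to the same entity. Stanford CoreNLP is an integrated framework, which makes it easy to apply a subset of the tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code.

Stanford CoreNLP integrates the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, and the sentiment analysis tools, and provides model files for analyzing English. The goal of the project is to let people obtain complete linguistic annotations of natural language text quickly and painlessly. It is designed to be flexible and extensible. With a single option you can choose which tools are enabled and which are disabled.

The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License. It requires Java 1.6+.

Download link: http://nlp.stanford.edu/software/stanford-corenlp-full-2014-01-04.zip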

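Usage

Parsing a file and saving the output as XML

Before using Stanford CoreNLP, you usually create a configuration file (a Java properties file). At a minimum, this file should contain the "annotators" property, which holds a comma-separated list of the annotators to run.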

For example: annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref

This enables tokenization, sentence splitting (required by most annotators), POS tagging, lemmatization, NER, syntactic parsing, and coreference resolution.
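As a minimal sketch of what such a configuration file contains, the snippet below parses properties-file text with the standard java.util.Properties class and lists the annotator names in order. The class name AnnotatorsConfig and its method are hypothetical helpers for illustration only; this is not how CoreNLP itself reads its configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of CoreNLP.
class AnnotatorsConfig {
    // Parse properties-file text and return the annotator names in order.
    static List<String> annotators(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> names = new ArrayList<>();
        // The "annotators" value is a comma-separated list; spaces are optional.
        for (String name : props.getProperty("annotators", "").split(",")) {
            if (!name.trim().isEmpty()) {
                names.add(name.trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref\n";
        System.out.println(annotators(text));
    }
}
```

The order of the names matters in practice, since later annotators (for example pos) rely on the output of earlier ones (tokenize, ssplit).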

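However, if you only want to set one or two properties, you can specify them on the command line instead.

To process a single file with Stanford CoreNLP, use a command line of the following form: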

java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props <YOUR CONFIGURATION FILE> ] -file <YOUR INPUT FILE>

For example, to process the sample file input.txt, you can run the following command from the distribution directory:

java -cp stanford-corenlp-3.3.1.jar:stanford-corenlp-3.3.1-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-0.23.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt

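Notes:

  • Stanford CoreNLP requires Java version 1.6 or higher.
  • -Xmx3g sets the maximum amount of memory (the Java heap size). On a 64-bit machine, Stanford CoreNLP typically needs 3GB of memory to run.
  • The commands above work on OS X and Linux. On Windows, the colons (:) separating the jar files must be changed to semicolons (;). If you are not in the distribution directory, you also need to include the directory path of each jar.
  • The -annotators argument is optional. If you omit it, the code uses the annotators listed in the properties file.
  • Processing a small file this way is inefficient, because it takes a few minutes to load the models before processing begins. You should batch your processing.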
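The platform difference in the classpath separator can be handled in code rather than by hand. The sketch below assembles the example command as a list of arguments; the class CoreNlpCommand and its methods are hypothetical helpers, and the jar names are the ones from the example above. It only builds and prints the command, it does not launch CoreNLP.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class CoreNlpCommand {
    // Join jar paths with the given separator (":" on OS X/Linux, ";" on Windows).
    static String classpath(List<String> jars, String separator) {
        return String.join(separator, jars);
    }

    // Build the argument list for processing one input file.
    static List<String> command(List<String> jars, String separator,
                                String annotators, String inputFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath(jars, separator));
        cmd.add("-Xmx3g");
        cmd.add("edu.stanford.nlp.pipeline.StanfordCoreNLP");
        cmd.add("-annotators");
        cmd.add(annotators);
        cmd.add("-file");
        cmd.add(inputFile);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList(
            "stanford-corenlp-3.3.1.jar", "stanford-corenlp-3.3.1-models.jar",
            "xom.jar", "joda-time.jar", "jollyday.jar", "ejml-0.23.jar");
        // File.pathSeparator is the separator of the platform running this code.
        List<String> cmd = command(jars, File.pathSeparator,
            "tokenize,ssplit,pos,lemma,ner,parse,dcoref", "input.txt");
        System.out.println(String.join(" ", cmd));
    }
}
```

A list of arguments like this could be passed to java.lang.ProcessBuilder to start the process from the distribution directory.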

To process a list of files, use the following command line:

java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props <YOUR CONFIGURATION FILE> ] -filelist <YOUR LIST OF FILES>

Here the -filelist argument points to a file whose contents list the files to process.
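A file list of this kind can be generated programmatically. The sketch below assumes the list holds one input path per line, which the text above does not spell out; FileListWriter and the input file names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class FileListWriter {
    // Write the input paths to the target file, one path per line (assumed format).
    static Path writeFileList(Path target, List<String> inputFiles) throws IOException {
        return Files.write(target, inputFiles, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path list = Files.createTempFile("filelist", ".txt");
        writeFileList(list, Arrays.asList("doc1.txt", "doc2.txt", "doc3.txt"));
        // The printed path is what would be passed after -filelist.
        System.out.println(list);
    }
}
```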

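The -props argument is optional. By default, the code looks for StanfordCoreNLP.properties on your classpath and uses the default settings shipped with the distribution.

By default, output files are written to the current directory. You can specify a different output directory with the -outputDirectory argument. Output file names are the same as the input file names, with the extension given by -outputExtension appended (.xml by default). Existing output files are overwritten by default; pass -noClobber to avoid this. Pass -replaceExtension to replace the input file's extension with the output extension instead of appending to it.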
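The naming rule can be illustrated with a small sketch. This is not CoreNLP's own code: OutputNames is a hypothetical helper that mimics the described behaviour for a plain file name (no directory part), and how CoreNLP treats names without an extension is an assumption here.

```java
// Hypothetical helper, not part of CoreNLP.
class OutputNames {
    // Compute the output file name for a plain input file name.
    // replaceExtension=false appends the extension; true swaps the last extension.
    static String outputName(String inputName, String outputExtension,
                             boolean replaceExtension) {
        String base = inputName;
        if (replaceExtension) {
            int dot = inputName.lastIndexOf('.');
            if (dot > 0) {
                base = inputName.substring(0, dot);
            }
            // Assumption: a name without an extension just gets the extension appended.
        }
        return base + outputExtension;
    }

    public static void main(String[] args) {
        System.out.println(outputName("input.txt", ".xml", false)); // input.txt.xml
        System.out.println(outputName("input.txt", ".xml", true));  // input.xml
    }
}
```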

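Stanford CoreNLP can also remove XML tags from a document. For example, if it is run with the following annotators property

annotators = tokenize, cleanxml, ssplit, pos, lemma, ner, parse, dcoref

on the following text,

<xml>Stanford University is located in California. It is a great university.</xml>

it produces the results for the text with the tags removed.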

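Using the Stanford CoreNLP API

The backbone of CoreNLP consists of two classes: Annotation and Annotator. Annotations are the data structures that hold the results of annotators; they are basically maps from keys to pieces of annotation. Annotators are a lot like functions, except that they operate on Annotations. They perform tasks such as parsing and named entity recognition. Annotations and Annotators are tied together by AnnotationPipelines, which create sequences of Annotators.

The table below lists the currently supported Annotators and the Annotations they generate.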
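The relationship between the three concepts can be sketched with toy stand-ins. The classes below (ToyAnnotation, ToyAnnotator, ToyPipeline, ToyPipelineDemo) are hypothetical and are NOT the real CoreNLP classes; in particular they use plain string keys and a whitespace tokenizer as a simplification. They only show the shape of the design: a map-like annotation that a sequence of annotators fills in step by step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in: a map from keys to pieces of annotation.
class ToyAnnotation {
    private final Map<String, Object> values = new HashMap<>();
    void set(String key, Object value) { values.put(key, value); }
    Object get(String key) { return values.get(key); }
}

// Toy stand-in: something like a function that operates on an annotation.
interface ToyAnnotator {
    void annotate(ToyAnnotation annotation);
}

// Toy stand-in: runs a sequence of annotators in order.
class ToyPipeline implements ToyAnnotator {
    private final List<ToyAnnotator> annotators = new ArrayList<>();

    ToyPipeline add(ToyAnnotator annotator) {
        annotators.add(annotator);
        return this;
    }

    @Override
    public void annotate(ToyAnnotation annotation) {
        for (ToyAnnotator annotator : annotators) {
            annotator.annotate(annotation);
        }
    }
}

class ToyPipelineDemo {
    static ToyAnnotation run(String text) {
        // Naive whitespace tokenizer; reads "text", writes "tokens".
        ToyAnnotator tokenize = ann ->
            ann.set("tokens", Arrays.asList(((String) ann.get("text")).split("\\s+")));
        // Depends on the output of the previous annotator.
        ToyAnnotator count = ann ->
            ann.set("tokenCount", ((List<?>) ann.get("tokens")).size());

        ToyAnnotation annotation = new ToyAnnotation();
        annotation.set("text", text);
        new ToyPipeline().add(tokenize).add(count).annotate(annotation);
        return annotation;
    }

    public static void main(String[] args) {
        ToyAnnotation result = run("Stanford University is located in California.");
        System.out.println(result.get("tokens"));
        System.out.println(result.get("tokenCount"));
    }
}
```

The second annotator depends on what the first one stored, which is why the order of the annotators list matters.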

Property name | Annotator class name | Generated Annotation | Description
--- | --- | --- | ---
tokenize | PTBTokenizerAnnotator | TokensAnnotation (list of tokens), and CharacterOffsetBeginAnnotation, CharacterOffsetEndAnnotation, TextAnnotation (for each token) | Tokenizes the text. This component started as a PTB-style tokenizer, but was extended since then to handle noisy and web text. The tokenizer saves the character offsets of each token in the input text, as CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation.
cleanxml | CleanXmlAnnotator | XmlContextAnnotation | Remove xml tokens from the document
ssplit | WordToSentenceAnnotator | SentencesAnnotation | Splits a sequence of tokens into sentences.
pos | POSTaggerAnnotator | PartOfSpeechAnnotation | Labels tokens with their POS tag. For more details see this page.
lemma | MorphaAnnotator | LemmaAnnotation | Generates the word lemmas for all tokens in the corpus.
ner | NERClassifierCombiner | NamedEntityTagAnnotation and NormalizedNamedEntityTagAnnotation | Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC) and numerical entities (DATE, TIME, MONEY, NUMBER). Named entities are recognized using a combination of three CRF sequence taggers trained on various corpora, such as ACE and MUC. Numerical entities are recognized using a rule-based system. Numerical entities that require normalization, e.g., dates, are normalized to NormalizedNamedEntityTagAnnotation. For more details on the CRF tagger see this page.
regexner | RegexNERAnnotator | NamedEntityTagAnnotation | Implements a simple, rule-based NER over token sequences using Java regular expressions. The goal of this Annotator is to provide a simple framework to incorporate NE labels that are not annotated in traditional NL corpora. For example, the default list of regular expressions that we distribute in the models file recognizes ideologies (IDEOLOGY), nationalities (NATIONALITY), religions (RELIGION), and titles (TITLE). Here is a simple example of how to use RegexNER. For more complex applications, you might consider TokensRegex.
sentiment | SentimentAnnotator | SentimentCoreAnnotations.AnnotatedTree | Implements Socher et al's sentiment model. Attaches a binarized tree of the sentence to the sentence level CoreMap. The nodes of the tree then contain the annotations from RNNCoreAnnotations indicating the predicted class and scores for that subtree. See the sentiment page for more information about this project.
truecase | TrueCaseAnnotator | TrueCaseAnnotation and TrueCaseTextAnnotation | Recognizes the true case of tokens in text where this information was lost, e.g., all upper case text. This is implemented with a discriminative model implemented using a CRF sequence tagger. The true case label, e.g., INIT_UPPER is saved in TrueCaseAnnotation. The token text adjusted to match its true case is saved as TrueCaseTextAnnotation.
parse | ParserAnnotator | TreeAnnotation, BasicDependenciesAnnotation, CollapsedDependenciesAnnotation, CollapsedCCProcessedDependenciesAnnotation | Provides full syntactic analysis, using both the constituent and the dependency representations. The constituent-based output is saved in TreeAnnotation. We generate three dependency-based outputs, as follows: basic, uncollapsed dependencies, saved in BasicDependenciesAnnotation; collapsed dependencies saved in CollapsedDependenciesAnnotation; and collapsed dependencies with processed coordinations, in CollapsedCCProcessedDependenciesAnnotation. Most users of our parser will prefer the latter representation. For more details on the parser, please see this page. For more details about the dependencies, please refer to this page.
dcoref | DeterministicCorefAnnotator | CorefChainAnnotation | Implements both pronominal and nominal coreference resolution. The entire coreference graph (with head words of mentions as nodes) is saved in CorefChainAnnotation. For more details on the underlying coreference resolution algorithm, see this page.

