A Summary of Python Natural Language Processing Tools

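The Stanford University word segmenter requires JDK 1.8 or later. Download stanford-segmenter-2014-10-26 from the link above and extract it; the extracted contents are shown in the figure below.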

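Go into the data directory. It contains two gz archives, ctb.gz and pku.gz. CTB is the model trained on the Chinese Treebank from the University of Pennsylvania; PKU is the model trained on data provided by Peking University. You can also train your own model; a training example is available at http://nlp.stanford.edu/software/trainSegmenter-20080521.tar.gz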
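As a small illustration of choosing between the two models, the sketch below maps a model name to its file under the data directory. The enum `SegmenterModel` and its method `pathIn` are hypothetical names introduced here and are not part of the Stanford distribution; only the file names ctb.gz and pku.gz come from the text above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of the Stanford distribution): maps a
// model name to the archive shipped in the segmenter's data directory.
enum SegmenterModel {
    CTB("ctb.gz"),  // trained on the Penn Chinese Treebank
    PKU("pku.gz");  // trained on the Peking University data

    private final String fileName;

    SegmenterModel(String fileName) {
        this.fileName = fileName;
    }

    // Resolve the model file relative to the given data directory.
    Path pathIn(String dataDir) {
        return Paths.get(dataDir).resolve(fileName);
    }
}
```

With this sketch, `SegmenterModel.PKU.pathIn("data")` resolves to the pku.gz file inside data, so switching models becomes a one-word change.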

2. Introduction to NER

Stanford NER is implemented in Java and can recognize PERSON, ORGANIZATION, and LOCATION entities. Research published using this software should cite the following paper:

It can be downloaded from: http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf

Two archives can be downloaded from the NER page: stanford-ner-2014-10-26 and stanford-ner-2012-11-11-chinese.

Extracting the two archives gives the following:

Out of the box, NER handles English. Processing Chinese requires additional steps.

3. Using the Segmenter and NER

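In Eclipse, create a new Java Project and copy the data directory into the project root. Then copy everything extracted from stanford-ner-2012-11-11-chinese into a classifiers folder, add stanford-segmenter-3.5.0 to the classpath, copy the classifiers folder into the project root, and add stanford-ner-3.5.0.jar and stanford-ner.jar to the classpath. Finally, download stanford-corenlp-full-2014-10-31 from http://nlp.stanford.edu/software/corenlp.shtml and add the extracted stanford-corenlp-3.5.0 to the classpath as well. The resulting project structure in Eclipse is as follows: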
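Since a misplaced model file is an easy mistake in this setup, a quick pre-flight check can help. The sketch below is only an illustration: the class `ProjectLayoutCheck` and its method `missingFiles` are hypothetical, and the list of required paths is simply the set of relative paths that the demo code in this post loads.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check: reports which model files used by the
// demos in this post are missing under the project root.
final class ProjectLayoutCheck {
    // Relative paths taken from the demo code in this post.
    static final String[] REQUIRED = {
        "data/ctb.gz",
        "data/dict-chris6.ser.gz",
        "classifiers/chinese.misc.distsim.crf.ser.gz"
    };

    // Returns the required relative paths that are not regular files
    // under projectRoot; an empty list means the layout looks complete.
    static List<String> missingFiles(Path projectRoot) {
        List<String> missing = new ArrayList<>();
        for (String relative : REQUIRED) {
            if (!Files.isRegularFile(projectRoot.resolve(relative))) {
                missing.add(relative);
            }
        }
        return missing;
    }
}
```

Calling `ProjectLayoutCheck.missingFiles(Paths.get("."))` from the project root before loading the classifiers would list any file that still needs to be copied. It does not check the classpath jars.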

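Chinese NER: the documentation is clear on this point. The output of Chinese word segmentation must be used as the input to NER; only then can named entities be recognized.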
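The hand-off between the two stages amounts to joining the segmented words with single spaces. A minimal sketch of that glue step, assuming the segmenter has already produced a list of words (the class `NerInput` and its method `fromTokens` are hypothetical names, not part of any Stanford API):

```java
import java.util.List;

// Hypothetical glue code: turns a list of segmented words into the
// space-separated string that is then passed to the NER classifier.
final class NerInput {
    static String fromTokens(List<String> tokens) {
        // String.join inserts the separator only between elements,
        // so the result has no leading or trailing space.
        return String.join(" ", tokens);
    }
}
```

For the tokens 我, 去, 吃飯, ，, 告訴, 李強, 一聲, 。 this produces the string "我 去 吃飯 ， 告訴 李強 一聲 。", which has the same shape as the pre-segmented sample sentence used in the NER demo.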

To make testing easier, this demo also uses junit-4.10.jar. Here is the code:

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

/**
 * <p>
 * ClassName ExtractDemo
 * </p>
 * <p>
 * Description: loads the NER module
 * </p>
 */
public class ExtractDemo {
    // Shared across instances so the model is loaded only once.
    private static AbstractSequenceClassifier<CoreLabel> ner;

    public ExtractDemo() {
        InitNer();
    }

    public void InitNer() {
        // Chinese NER model, relative to the project root.
        String serializedClassifier = "classifiers/chinese.misc.distsim.crf.ser.gz";
        if (ner == null) {
            ner = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
        }
    }

    // Expects text that is already segmented (words separated by spaces).
    public String doNer(String sent) {
        return ner.classifyWithInlineXML(sent);
    }

    public static void main(String[] args) {
        // The sample sentence is pre-segmented by hand.
        String str = "我 去 吃飯 ， 告訴 李強 一聲 。";
        ExtractDemo extractDemo = new ExtractDemo();
        System.out.println(extractDemo.doNer(str));
        System.out.println("Complete!");
    }
}
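`classifyWithInlineXML` returns the input text with each recognized entity wrapped in a tag named after its label. To work with the entities programmatically, the tagged string can be scanned with a regular expression. The sketch below is one way to do that; the class `InlineXmlEntities` is hypothetical, the pattern assumes labels consist of uppercase letters and underscores, and the sample strings in the usage note are made up for illustration rather than taken from real model output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls (label, text) pairs out of a string that
// uses inline XML tags such as <PERSON>...</PERSON>.
final class InlineXmlEntities {
    // Group 1 is the label, group 2 the entity text; the backreference
    // \1 requires the closing tag to match the opening tag.
    private static final Pattern TAGGED =
            Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    // Returns one {label, text} pair per tagged span, in order of appearance.
    static List<String[]> extract(String taggedText) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = TAGGED.matcher(taggedText);
        while (m.find()) {
            entities.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return entities;
    }
}
```

Given an assumed tagged string such as "告訴 <PERSON>李強</PERSON> 一聲", `extract` returns a single pair with label PERSON and text 李強. Which labels actually appear depends on the model that was loaded.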

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import org.apache.commons.io.FileUtils;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

/**
 * <p>
 * Description: Chinese word segmentation with Stanford CoreNLP
 * </p>
 */
public class ZH_SegDemo {
    public static CRFClassifier<CoreLabel> segmenter;

    static {
        // Set up the initialization parameters.
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        segmenter = new CRFClassifier<CoreLabel>(props);
        // Load the CTB model from the data directory.
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        segmenter.flags.setProperties(props);
    }

    public static String doSegment(String sent) {
        // segmentString returns a List<String>. The original cast of
        // toArray() to String[] fails with a ClassCastException at runtime,
        // so iterate over the list directly.
        List<String> words = segmenter.segmentString(sent);
        StringBuilder buf = new StringBuilder();
        for (String s : words) {
            buf.append(s).append(" ");
        }
        System.out.println("segmented res: " + buf.toString());
        return buf.toString();
    }

    public static void main(String[] args) {
        try {
            // Read the input article as UTF-8, matching inputEncoding above.
            String readFileToString = FileUtils.readFileToString(
                    new File("澳門141人食物中毒與進食“問題生蚝”有關.txt"), "UTF-8");
            String doSegment = doSegment(readFileToString);
            System.out.println(doSegment);

            // Feed the segmented text to the NER module.
            ExtractDemo extractDemo = new ExtractDemo();
            System.out.println(extractDemo.doNer(doSegment));

            System.out.println("Complete!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Note that a JDK 1.8+ environment is required. The final output is as follows: