Trying Out the Stanford Word Segmenter


Download the Stanford Word Segmenter from its official download page.

After downloading, the folder looks like this:

[Figure: contents of the downloaded folder]

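Next, open Eclipse and create a new project. Copy the source file SegDemo.java into it and add all the jar files to the build path (right-click the project, then Properties, Java Build Path, Add External JARs).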

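Import the data folder into the project and update the path in the source code so that it points to that folder (in the demo this is the basedir constant, which defaults to data).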
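A wrong data path is an easy mistake at this step, so it can help to check the folder before loading the model. The following is only a minimal sketch and not part of the Stanford distribution: the class name DataDirCheck and the method missingFiles are hypothetical, and the two file names are simply the ones the demo source loads from basedir.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Stanford distribution.
final class DataDirCheck {

  // The two files the demo source loads from its base directory.
  static final String[] REQUIRED = {"dict-chris6.ser.gz", "ctb.gz"};

  // Returns the names of required files that are absent from dataDir.
  static List<String> missingFiles(Path dataDir) {
    List<String> missing = new ArrayList<>();
    for (String name : REQUIRED) {
      if (!Files.isRegularFile(dataDir.resolve(name))) {
        missing.add(name);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Same lookup as the demo: system property "SegDemo", defaulting to "data".
    Path dataDir = Paths.get(System.getProperty("SegDemo", "data"));
    List<String> missing = missingFiles(dataDir);
    if (missing.isEmpty()) {
      System.out.println("All model files found in " + dataDir.toAbsolutePath());
    } else {
      System.out.println("Missing from " + dataDir.toAbsolutePath() + ": " + missing);
    }
  }
}
```

A relative path such as data is resolved against the working directory of the run, which in Eclipse is usually the project root, so printing the absolute path makes it easier to see where the program is actually looking.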

Then edit SegDemo.java and test it:

package test;
import java.io.*;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;


/** This is a very simple demo of calling the Chinese Word Segmenter
 *  programmatically.  It assumes an input file in UTF8.
 *  <p/>
 *  <code>
 *  Usage: java -mx1g -cp seg.jar SegDemo fileName
 *  </code>
 *  This will run correctly in the distribution home directory.  To
 *  run in general, the properties for where to find dictionaries or
 *  normalizations have to be set.
 *
 *  @author Christopher Manning
 */

public class SegDemo {

  // Directory holding the model and dictionary files; override with -DSegDemo=<path>
  private static final String basedir = System.getProperty("SegDemo", "data");

  public static void main(String[] args) throws Exception {
    // Print Chinese text as UTF-8 regardless of the platform default encoding
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", basedir);
    // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
    // props.setProperty("normTableEncoding", "UTF-8");
    // below is needed because CTBSegDocumentIteratorFactory accesses it
    props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
    if (args.length > 0) {
      props.setProperty("testFile", args[0]);
    }
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    // Load the segmentation model from the data directory
    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
    segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);
    // Segment every file passed on the command line
    for (String filename : args) {
      segmenter.classifyAndWriteAnswers(filename);
    }

    // Segment a single sentence and print the resulting tokens
    String sample = "我住在美國。";
    List<String> segmented = segmenter.segmentString(sample);
    System.out.println(segmented);
  }

}

Output: [我, 住在, 美國, 。]
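The brackets and commas appear because the demo prints a List<String> directly. If space-separated text is preferred, the tokens can be joined first. This is a minimal sketch under stated assumptions: JoinTokens and joinTokens are hypothetical names, and the token list is hard-coded to the output above so that the example runs without the segmenter jars.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the demo.
final class JoinTokens {

  // Joins segmented tokens into one space-separated string.
  static String joinTokens(List<String> tokens) {
    return String.join(" ", tokens);
  }

  public static void main(String[] args) {
    // Hard-coded stand-in for the list returned by segmenter.segmentString(sample).
    List<String> segmented = Arrays.asList("我", "住在", "美國", "。");
    System.out.println(joinTokens(segmented)); // prints: 我 住在 美國 。
  }
}
```

In the real demo, the same call would take the segmented variable in place of the hard-coded list.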

From here, feel free to experiment.

