Building on the word segmenter from the previous post, we can use a syntactic parser to study Chinese sentences in more depth. Below I briefly walk through how to get it running.
First, import the parser into an existing workspace by creating a Java project.
Then adjust the code accordingly. There are two main changes:
1: new LexicalizedParser("grammar/chinesePCFG.ser.gz");
2: String[] sent = { "我", "是", "一名", "好", "學生", "。" };
Because we are parsing Chinese, the grammar package loaded is the Chinese one, and the input sentence must already be word-segmented. The full code is as follows:
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.io.StringReader;

import edu.stanford.nlp.objectbank.TokenizerFactory;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {

  public static void main(String[] args) {
    // Change 1: load the Chinese PCFG grammar instead of the English one
    LexicalizedParser lp = new LexicalizedParser("grammar/chinesePCFG.ser.gz");
    if (args.length > 0) {
      demoDP(lp, args[0]);
    } else {
      demoAPI(lp);
    }
  }

  public static void demoDP(LexicalizedParser lp, String filename) {
    // This option shows loading and sentence-segmenting and tokenizing
    // a file using DocumentPreprocessor
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
      Tree parse = lp.apply(sentence);
      parse.pennPrint();
      System.out.println();

      GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
      Collection tdl = gs.typedDependenciesCCprocessed(true);
      System.out.println(tdl);
      System.out.println();
    }
  }

  public static void demoAPI(LexicalizedParser lp) {
    // This option shows parsing a list of correctly tokenized words;
    // Change 2: the Chinese sentence, already word-segmented
    String[] sent = { "我", "是", "一名", "好", "學生", "。" };
    List<CoreLabel> rawWords = new ArrayList<CoreLabel>();
    for (String word : sent) {
      CoreLabel l = new CoreLabel();
      l.setWord(word);
      rawWords.add(l);
    }
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();

    // This option shows loading and using an explicit tokenizer
    // String sent2 = "今天是個晴朗的天氣。";
    TokenizerFactory<CoreLabel> tokenizerFactory =
        PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
    // List<CoreLabel> rawWords2 =
    //     tokenizerFactory.getTokenizer(new StringReader(sent2)).tokenize();
    // parse = lp.apply(rawWords2);

    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();

    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

  private ParserDemo() {} // static methods only
}
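In the demo above the segmented words are hard-coded into `sent`. In practice, a word segmenter typically emits one space-delimited line per sentence, which you then split into the word array the parser expects. A minimal sketch of that glue step (the class and method names here are my own for illustration, not part of the Stanford API):

```java
public class SegmentedInput {
    // Split a segmenter's space-delimited output line into the
    // word array that ParserDemo's demoAPI method expects.
    public static String[] splitSegmented(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Example: the output of a Chinese word segmenter for our sentence
        String segmented = "我 是 一名 好 學生 。";
        String[] sent = splitSegmented(segmented);
        for (String word : sent) {
            System.out.println(word);
        }
    }
}
```

Each element of `sent` can then be wrapped in a `CoreLabel` exactly as in `demoAPI` above.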
Finally, once it compiles, the results are displayed as shown below. The grammar used here is trained on the LDC Penn Chinese Treebank (with its part-of-speech and phrase annotations).