Building on the word segmenter covered earlier, we can use a syntactic parser to analyze Chinese sentences in more depth. Below is a brief walkthrough of setting it up and running it.
First, import the code into an existing workspace by creating a Java project.
Then adjust the code accordingly. Two changes are needed:
1. new LexicalizedParser("grammar/chinesePCFG.ser.gz");
2. String[] sent = { "我", "是", "一名", "好", "学生", "。" };
Because we are parsing Chinese, the model we load is the Chinese grammar, and the input sentence must already be word-segmented. The full code is as follows:
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.io.StringReader;

import edu.stanford.nlp.objectbank.TokenizerFactory;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {

  public static void main(String[] args) {
    LexicalizedParser lp = new LexicalizedParser("grammar/chinesePCFG.ser.gz");
    if (args.length > 0) {
      demoDP(lp, args[0]);
    } else {
      demoAPI(lp);
    }
  }

  public static void demoDP(LexicalizedParser lp, String filename) {
    // This option shows loading, sentence-segmenting and tokenizing
    // a file using DocumentPreprocessor
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
      Tree parse = lp.apply(sentence);
      parse.pennPrint();
      System.out.println();

      GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
      Collection tdl = gs.typedDependenciesCCprocessed(true);
      System.out.println(tdl);
      System.out.println();
    }
  }

  public static void demoAPI(LexicalizedParser lp) {
    // This option shows parsing a list of correctly tokenized words
    String[] sent = { "我", "是", "一名", "好", "学生", "。" };
    List<CoreLabel> rawWords = new ArrayList<CoreLabel>();
    for (String word : sent) {
      CoreLabel l = new CoreLabel();
      l.setWord(word);
      rawWords.add(l);
    }
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();

    // This option shows loading and using an explicit tokenizer
    // String sent2 = "今天是个晴朗的天气。";
    TokenizerFactory<CoreLabel> tokenizerFactory =
        PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
    // List<CoreLabel> rawWords2 =
    //     tokenizerFactory.getTokenizer(new StringReader(sent2)).tokenize();
    // parse = lp.apply(rawWords2);

    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();

    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

  private ParserDemo() {} // static methods only
}
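A point worth emphasizing from change 2: the parser's API path used above expects a list of already-segmented word tokens, not a raw Chinese string (Chinese text has no spaces, so the parser cannot split words itself). The stdlib-only sketch below, with a hypothetical class name `SegDemo`, just illustrates the relationship between the token array and the original sentence; it does not use any Stanford classes:

```java
public class SegDemo {
    public static void main(String[] args) {
        // Each Chinese word must already be a separate token,
        // exactly as in the demoAPI method above
        String[] sent = { "我", "是", "一名", "好", "学生", "。" };

        // Concatenating the tokens (no separator) recovers the raw sentence
        StringBuilder sb = new StringBuilder();
        for (String w : sent) {
            sb.append(w);
        }
        System.out.println(sb.toString()); // 我是一名好学生。
        System.out.println(sent.length);   // 6 (tokens the parser will see)
    }
}
```

If you feed the unsegmented string in as a single token, the parse will be meaningless, which is why a segmenter (such as the one from the earlier post) must run first.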
Once the code compiles and runs, the output appears as shown below. The grammar used here is trained on the LDC data (the Penn Chinese Treebank from the Linguistic Data Consortium, with its part-of-speech and phrase-structure annotations).