Lucene offers several analyzers for tokenizing text when building an index; here we use StandardAnalyzer.
package cn.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class test1 {
    public static final String[] china_stop = {"着", "的", "之", "式"};

    public static void main(String[] args) throws IOException {
        // Copy the custom stop words into a CharArraySet (ignoreCase = true)
        CharArraySet cnstop = new CharArraySet(china_stop.length, true);
        for (String value : china_stop) {
            cnstop.add(value);
        }
        // Also add StandardAnalyzer's default English stop words
        cnstop.addAll(StandardAnalyzer.STOP_WORDS_SET);
        System.out.println(cnstop);

        Analyzer analyzer = new StandardAnalyzer(cnstop);
        TokenStream stream = analyzer.tokenStream("", "中秋be之夜,享受着月华的孤独,享受着爆炸式的思维跃迁");

        // Get the term-text attribute so we can read each token's text
        CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + cta + "]");
        }
        System.out.println();
        analyzer.close();
    }
}
The output is as follows.
First, the complete stop-word set is printed; you can see the newly added stop words are included:
[着, but, be, 的, with, such, then, for, 之, 式, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, of, by, to, these]
Then the tokenization result; the four stop words "着", "的", "之", "式" have been filtered out:
[中][秋][夜][享][受][月][华][孤][独][享][受][爆][炸][思][维][跃][迁]
From the output above you can see how StandardAnalyzer tokenizes Chinese text: it splits it into single-character tokens and drops any tokens found in the stop-word set.
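To tie this back to index creation mentioned at the start, the same analyzer is simply passed to IndexWriterConfig when the index is built. The following is a minimal sketch, assuming a Lucene 5.x-style API; the class name test2 and the index path "/tmp/lucene-index" are hypothetical examples, not part of the original code.

package cn.lucene;

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class test2 {
    public static void main(String[] args) throws Exception {
        // Use the default stop words here; a custom CharArraySet could be passed instead
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);

        // Hypothetical on-disk index location
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        // TextField values are tokenized by the analyzer configured above
        doc.add(new TextField("content", "中秋之夜,享受着月华的孤独", Field.Store.YES));
        writer.addDocument(doc);

        writer.close();
        dir.close();
    }
}

Every TextField added to the index is run through the analyzer's token stream exactly as in the standalone example, so the stop words you configure there also determine which terms end up in the index.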