Lucene supports many analyzers for tokenizing text when building an index; here we use StandardAnalyzer.
package cn.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class test1 {

    public static final String[] china_stop = {"着", "的", "之", "式"};

    public static void main(String[] args) throws IOException {
        // Copy the array into a CharArraySet (true = ignore case)
        CharArraySet cnstop = new CharArraySet(china_stop.length, true);
        for (String value : china_stop) {
            cnstop.add(value);
        }
        // Also merge in StandardAnalyzer's default stop words
        cnstop.addAll(StandardAnalyzer.STOP_WORDS_SET);
        System.out.println(cnstop);

        Analyzer analyzer = new StandardAnalyzer(cnstop);
        TokenStream stream = analyzer.tokenStream("", "中秋be之夜,享受着月華的孤獨,享受着爆炸式的思維躍遷");

        // Obtain the attribute that exposes each token's text
        CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + cta + "]");
        }
        System.out.println();
        // Complete the TokenStream contract: end() then close()
        stream.end();
        stream.close();
        analyzer.close();
    }
}
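Incidentally, the element-by-element loop is not required: CharArraySet can also be built from a whole collection in one call. A sketch of the alternative construction, assuming the same Lucene 5.x-era API as above (only the two set-building lines of test1 change; everything else stays the same):

import java.util.Arrays;

// Build the set directly from the array (true = ignore case),
// then merge in StandardAnalyzer's defaults exactly as before.
CharArraySet cnstop = new CharArraySet(Arrays.asList(china_stop), true);
cnstop.addAll(StandardAnalyzer.STOP_WORDS_SET);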
Running test1 produces the following output.
First the complete stop-word set is printed; you can see the newly added stop words are in it:
[着, but, be, 的, with, such, then, for, 之, 式, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, of, by, to, these]
Then the tokenization result: the four stop words "着", "的", "之", "式" have been filtered out:
[中][秋][夜][享][受][月][華][孤][獨][享][受][爆][炸][思][維][躍][遷]
From the output above you should now see how StandardAnalyzer tokenizes: it splits Chinese text into single-character tokens and drops anything in the stop-word set (note that the English "be" in the input was removed as well).
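To tie this back to index creation from the opening line: the configured analyzer is handed to IndexWriter via an IndexWriterConfig, and every TextField is run through it at index time. A minimal sketch, assuming the same Lucene 5.x-era API; the class name test2, the index path "index", and the field name "content" are illustrative only, and the custom stop-word analyzer from test1 could be passed in place of the default one:

package cn.lucene;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class test2 {
    public static void main(String[] args) throws IOException {
        // Default stop words here; a CharArraySet as in test1 could be passed instead
        Analyzer analyzer = new StandardAnalyzer();

        // "index" is a placeholder path for the on-disk index directory
        Directory dir = FSDirectory.open(Paths.get("index"));
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter writer = new IndexWriter(dir, config);

        // TextField contents are tokenized by the analyzer at index time
        Document doc = new Document();
        doc.add(new TextField("content", "中秋之夜,享受着月華的孤獨", Field.Store.YES));
        writer.addDocument(doc);

        writer.close();
        dir.close();
    }
}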
