Step 4: Examining StandardAnalyzer's tokenization and adding stop words


Lucene offers many analyzers for building an index; here we use StandardAnalyzer.

package cn.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class test1 {
	public static final String[] china_stop = {"着", "的", "之", "式"};

	public static void main(String[] args) throws IOException {
		// Copy the array into a CharArraySet (initial size hint, ignoreCase = true)
		CharArraySet cnstop = new CharArraySet(china_stop.length, true);
		for (String value : china_stop) {
			cnstop.add(value);
		}
		// Also merge in StandardAnalyzer's default stop-word set
		cnstop.addAll(StandardAnalyzer.STOP_WORDS_SET);
		System.out.println(cnstop);

		Analyzer analyzer = new StandardAnalyzer(cnstop);
		TokenStream stream = analyzer.tokenStream("", "中秋be之夜,享受着月華的孤獨,享受着爆炸式的思維躍遷");
		// Read each token's text through the CharTermAttribute
		CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
		stream.reset();
		while (stream.incrementToken()) {
			System.out.print("[" + cta + "]");
		}
		stream.end();
		stream.close();
		System.out.println();
		analyzer.close();
	}
}

The output is as follows.

Printing the complete stop-word set shows that the new stop words have been added:

[着, but, be, 的, with, such, then, for, 之, 式, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, of, by, to, these]

Tokenization result: the four words "着", "的", "之" and "式" have been filtered out as stop words:
[中][秋][夜][享][受][月][華][孤][獨][享][受][爆][炸][思][維][躍][遷]

From the output above you can see how StandardAnalyzer tokenizes text: Chinese is split into individual characters, and the configured stop words (both the custom ones and the default English set) are removed.
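Once the tokenization looks right, the same analyzer can be handed to an IndexWriter so the stop words are also dropped at indexing time. Below is a minimal sketch, assuming a local index directory named "index" and Lucene 5.x/6.x-style APIs (FSDirectory.open(Path), IndexWriterConfig(Analyzer)); the class name, field name and sample text are illustrative only.

package cn.lucene;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class test2 {
	public static void main(String[] args) throws IOException {
		// Hypothetical on-disk index directory; adjust the path for your environment
		Directory dir = FSDirectory.open(Paths.get("index"));

		// In practice, pass the same CharArraySet built above; the default set is used here for brevity
		Analyzer analyzer = new StandardAnalyzer();
		IndexWriterConfig config = new IndexWriterConfig(analyzer);
		IndexWriter writer = new IndexWriter(dir, config);

		Document doc = new Document();
		// A TextField is tokenized by the analyzer, so stop words never reach the index
		doc.add(new TextField("content", "中秋be之夜,享受着月華的孤獨", Field.Store.YES));
		writer.addDocument(doc);

		writer.close();
		dir.close();
	}
}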

