Lucene supports many analyzers for tokenizing text when building an index; here we use StandardAnalyzer.
package cn.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class test1 {

    public static final String[] china_stop = {"着", "的", "之", "式"};

    public static void main(String[] args) throws IOException {
        // Copy the array into a CharArraySet (true = ignore case)
        CharArraySet cnstop = new CharArraySet(china_stop.length, true);
        for (String value : china_stop) {
            cnstop.add(value);
        }
        // Also merge in StandardAnalyzer's default stop words
        cnstop.addAll(StandardAnalyzer.STOP_WORDS_SET);
        System.out.println(cnstop);

        Analyzer analyzer = new StandardAnalyzer(cnstop);
        TokenStream stream = analyzer.tokenStream("", "中秋be之夜,享受着月華的孤獨,享受着爆炸式的思維躍遷");

        // Obtain the attribute that exposes each token's text
        CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + cta + "]");
        }
        System.out.println();
        // Complete the TokenStream contract: end() then close()
        stream.end();
        stream.close();
        analyzer.close();
    }
}
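Incidentally, the element-by-element loop is not required: CharArraySet can also be built from a whole collection in one call. A sketch of the alternative construction, assuming the same Lucene 5.x-era API as above (only the two set-building lines of test1 change; everything else stays the same):

import java.util.Arrays;

// Build the set directly from the array (true = ignore case),
// then merge in StandardAnalyzer's defaults exactly as before.
CharArraySet cnstop = new CharArraySet(Arrays.asList(china_stop), true);
cnstop.addAll(StandardAnalyzer.STOP_WORDS_SET);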
Running test1 produces the following output.
First the complete stop-word set is printed; you can see the newly added stop words are in it:
[着, but, be, 的, with, such, then, for, 之, 式, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, of, by, to, these]
Then the tokenization result: the four stop words "着", "的", "之", "式" have been filtered out:
[中][秋][夜][享][受][月][華][孤][獨][享][受][爆][炸][思][維][躍][遷]
From the output above you should now see how StandardAnalyzer tokenizes: it splits Chinese text into single-character tokens and drops anything in the stop-word set (note that the English "be" in the input was removed as well).
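To tie this back to index creation from the opening line: the configured analyzer is handed to IndexWriter via an IndexWriterConfig, and every TextField is run through it at index time. A minimal sketch, assuming the same Lucene 5.x-era API; the class name test2, the index path "index", and the field name "content" are illustrative only, and the custom stop-word analyzer from test1 could be passed in place of the default one:

package cn.lucene;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class test2 {
    public static void main(String[] args) throws IOException {
        // Default stop words here; a CharArraySet as in test1 could be passed instead
        Analyzer analyzer = new StandardAnalyzer();

        // "index" is a placeholder path for the on-disk index directory
        Directory dir = FSDirectory.open(Paths.get("index"));
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter writer = new IndexWriter(dir, config);

        // TextField contents are tokenized by the analyzer at index time
        Document doc = new Document();
        doc.add(new TextField("content", "中秋之夜,享受着月華的孤獨", Field.Store.YES));
        writer.addDocument(doc);

        writer.close();
        dir.close();
    }
}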
