Lucene offers several analyzers for tokenizing text when building an index; here we use StandardAnalyzer.
package cn.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class test1 {
    public static final String[] china_stop = {"着", "的", "之", "式"};

    public static void main(String[] args) throws IOException {
        // Copy the custom stop words into a CharArraySet (ignoreCase = true)
        CharArraySet cnstop = new CharArraySet(china_stop.length, true);
        for (String value : china_stop) {
            cnstop.add(value);
        }
        // Also add StandardAnalyzer's default English stop words
        cnstop.addAll(StandardAnalyzer.STOP_WORDS_SET);
        System.out.println(cnstop);

        Analyzer analyzer = new StandardAnalyzer(cnstop);
        TokenStream stream = analyzer.tokenStream("", "中秋be之夜,享受着月华的孤独,享受着爆炸式的思维跃迁");

        // Get the term-text attribute so we can read each token's text
        CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + cta + "]");
        }
        System.out.println();
        analyzer.close();
    }
}
The output is as follows.
First, the complete stop-word set is printed; you can see the newly added stop words are included:
[着, but, be, 的, with, such, then, for, 之, 式, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, of, by, to, these]
Then the tokenization result; the four stop words "着", "的", "之", "式" have been filtered out:
[中][秋][夜][享][受][月][华][孤][独][享][受][爆][炸][思][维][跃][迁]
From the output above you can see how StandardAnalyzer tokenizes Chinese text: it splits it into single-character tokens and drops any tokens found in the stop-word set.
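To tie this back to index creation mentioned at the start, the same analyzer is simply passed to IndexWriterConfig when the index is built. The following is a minimal sketch, assuming a Lucene 5.x-style API; the class name test2 and the index path "/tmp/lucene-index" are hypothetical examples, not part of the original code.

package cn.lucene;

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class test2 {
    public static void main(String[] args) throws Exception {
        // Use the default stop words here; a custom CharArraySet could be passed instead
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);

        // Hypothetical on-disk index location
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        // TextField values are tokenized by the analyzer configured above
        doc.add(new TextField("content", "中秋之夜,享受着月华的孤独", Field.Store.YES));
        writer.addDocument(doc);

        writer.close();
        dir.close();
    }
}

Every TextField added to the index is run through the analyzer's token stream exactly as in the standalone example, so the stop words you configure there also determine which terms end up in the index.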