7. Elasticsearch Indexing and Tokenization

Reposted article. Originally published 2019-06-15 14:20.

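Forward index
Inverted index
Components of the index module
Indexing process
Stop words
Chinese tokenizers
Common Chinese analyzers
Integrating the IK Chinese analysis plugin
Customizing the IK dictionary
Hot-updating the IK dictionary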

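Forward index

Looks up keywords by document (document → keywords).

Inverted index

Looks up documents by keyword (keyword → documents).

An inverted index stores, for every keyword, the documents in which that keyword appears.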
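
The difference between the two structures can be sketched in a few lines of Python. This is a toy illustration only: the sample documents are made up, and whitespace splitting stands in for a real analyzer.

```python
from collections import defaultdict

# Toy corpus: document ID -> text (hypothetical sample data).
docs = {
    1: "elasticsearch is a search engine",
    2: "lucene is a search library",
    3: "elasticsearch is built on lucene",
}

# Forward index: document -> list of terms (whitespace split for brevity).
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: term -> sorted IDs of the documents containing that term.
postings = defaultdict(set)
for doc_id, terms in forward_index.items():
    for term in terms:
        postings[term].add(doc_id)
inverted_index = {term: sorted(ids) for term, ids in postings.items()}

print(forward_index[2])          # terms of document 2
print(inverted_index["lucene"])  # documents containing "lucene"
```

Answering "which documents contain lucene?" is a single dictionary lookup in the inverted index, whereas the forward index would require scanning every document.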

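Components of the index module

Index analysis module (Analyzer; configured by registering analyzers, it extracts keywords from documents)
    Tokenizer
    Token filter

Index building module (Indexer)

During indexing, documents that have been analyzed are added to the index. Lucene exposes only a very simple API for this and handles the rest internally.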



Documents ---> preprocessing (character filter) ---> documents ---> Tokenizer
---> tokens ---> Token Filter ---> tokens ---> Index
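
The three stages of this flow can be mimicked with plain Python functions. This is a minimal sketch, not Lucene's implementation: the function names are hypothetical, the character filter merely strips HTML-like tags, and the token filter only lowercases.

```python
import re

def char_filter(text):
    """Character filter: replace HTML-like tags with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)

def token_filter(tokens):
    """Token filter: normalize every token to lowercase."""
    return [t.lower() for t in tokens]

def analyze(text):
    """Run the stages in order: character filter -> tokenizer -> token filter."""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<p>Hello Lucene World</p>"))
```

The tokens returned by analyze are what would then be handed to the indexer.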

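Indexing process

Stop words

Some words occur very frequently in text but contribute almost nothing to the information the text carries. Such words are called stop words.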

 
http://www.ranks.nl/stopwords/chinese-stopwords
http://www.ranks.nl/stopwords

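Usage:
    After text is tokenized, stop words are usually filtered out and are not indexed.
    At search time, if the user's query contains stop words, the search system filters them out as well.
Advantages:
    Excluding stop words speeds up index building and reduces the size of the index files.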
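
As a rough sketch of the usage described above, the same filter is applied both when indexing and when querying. The stop list here is a tiny hypothetical one; real lists such as those linked above are much longer.

```python
# Hypothetical, deliberately tiny stop list for illustration.
STOP_WORDS = {"the", "a", "is", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]

# Index time: stop words never reach the index.
indexed_terms = remove_stop_words("the quick fox is a friend of the dog".split())

# Query time: the same filter is applied to the user's query.
query_terms = remove_stop_words("the fox".split())

print(indexed_terms)
print(query_terms)
```

Applying the same list on both sides matters: a query term that was removed at index time could never match anyway.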

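Chinese tokenizers

Single-character (unigram) tokenization
Bigram tokenization
Dictionary-based tokenization
    Candidate words are built by some algorithm and matched against a prebuilt dictionary; whatever matches is split out as a word. Dictionary-based tokenization is generally regarded as the most suitable approach for Chinese.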
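
The three approaches can be compared on the sentence used later in this article. The dictionary-based variant below uses forward maximum matching, which is just one common strategy chosen for illustration and is not necessarily what IK does internally; the dictionary contents are hypothetical.

```python
def unigram(text):
    """Single-character tokenization: every character is a token."""
    return list(text)

def bigram(text):
    """Bigram tokenization: every pair of adjacent characters is a token."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

def forward_max_match(text, dictionary):
    """Dictionary tokenization: at each position take the longest dictionary
    word; fall back to a single character when nothing matches."""
    max_len = max(map(len, dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

sentence = "我们是中国人"
words = {"我们", "中国", "中国人", "国人"}  # hypothetical dictionary

print(unigram(sentence))
print(bigram(sentence))
print(forward_max_match(sentence, words))
```

Only the dictionary-based result keeps 我们 and 中国人 intact as words, which is why search quality depends so heavily on the dictionary.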

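Common Chinese analyzers

StandardAnalyzer    single-character tokenization
CJKAnalyzer         bigram tokenization
IKAnalyzer          dictionary-based tokenization

Testing how the default analyzer handles Chinese (single-character tokenization):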
curl -H "Content-Type: application/json" -XGET 'http://192.168.81.131:9200/_analyze?pretty=true' -d'{"text":"我们是中国人"}'

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "们",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "中",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "国",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "人",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    }
  ]
}

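Integrating the IK Chinese analysis plugin

Maven binary download:
http://mirror.bit.edu.cn/apache/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz

Or install Maven with yum:
yum install maven

Plugin source repository:
https://github.com/medcl/elasticsearch-analysis-ik.git

Optional Aliyun mirror for Maven (goes in Maven's settings.xml):
<mirrors>
  <mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
  </mirror>
</mirrors>

1. Download the IK plugin for ES
    https://github.com/medcl/elasticsearch-analysis-ik
    Build the downloaded source with Maven (install Maven beforehand)
    mvn clean package -DskipTests
2. Copy target/releases/elasticsearch-analysis-ik-*.zip from the build output and unzip it into your Elasticsearch plugin directory:
    ES_HOME/plugins/ik

3. Restart the ES service
4. Test the tokenization results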

curl -H "Content-Type: application/json" -XGET 'http://192.168.81.131:9200/_analyze?pretty=true' -d'{"analyzer":"ik_max_word","text":"我们是中国人"}'
curl -H "Content-Type: application/json" -XGET 'http://192.168.81.131:9200/_analyze?pretty=true' -d'{"analyzer":"ik_smart","text":"我们是中国人"}'

 [INFO ][o.e.p.PluginsService     ] [pz_l3tP] loaded plugin [analysis-ik]

Error message
Plugin [analysis-ik] was built for Elasticsearch version 6.5.0 but version 6.7.1 is running
Fix: edit the version number in the plugin's plugin-descriptor.properties file so that it matches the running Elasticsearch version.
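
The edit can also be scripted. The sketch below assumes the version is held in the elasticsearch.version property of the descriptor; the helper name and the sample descriptor text are made up for illustration. Note that this only gets past the version check. A plugin compiled against a different release may still misbehave, so building or downloading the matching plugin version is the safer route.

```python
import re

def set_es_version(descriptor_text, running_version):
    """Return the descriptor text with elasticsearch.version set to running_version."""
    return re.sub(
        r"^(elasticsearch\.version\s*=\s*).*$",
        lambda m: m.group(1) + running_version,
        descriptor_text,
        flags=re.MULTILINE,
    )

# Hypothetical descriptor contents; a real file has more properties.
sample = "name=analysis-ik\nversion=6.5.0\nelasticsearch.version=6.5.0\n"
patched = set_es_version(sample, "6.7.1")
print(patched)
```

Only the elasticsearch.version line is rewritten; the plugin's own version line is left untouched.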

[root@localhost plugins]# curl -H "Content-Type: application/json" -XGET 'http://192.168.81.131:9200/_analyze?pretty=true' -d'{"analyzer":"ik_max_word","text":"我们是中国人"}'
{
  "tokens" : [
    {
      "token" : "我们",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
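
To work with such a response programmatically, the tokens can be pulled out of the JSON. The sketch below uses a compacted copy of the output above as sample input; both helper names are hypothetical. It also lists the overlapping tokens, which is the visible effect of ik_max_word emitting every word it can find.

```python
import json

def tokens_from_analyze(response_body):
    """Extract (token, start_offset, end_offset) tuples from an _analyze response."""
    data = json.loads(response_body)
    return [(t["token"], t["start_offset"], t["end_offset"]) for t in data["tokens"]]

def overlapping_pairs(tokens):
    """Return pairs of tokens whose character ranges overlap."""
    pairs = []
    for i, (a, a_start, a_end) in enumerate(tokens):
        for b, b_start, b_end in tokens[i + 1:]:
            if a_start < b_end and b_start < a_end:
                pairs.append((a, b))
    return pairs

# Compacted copy of the ik_max_word response shown above.
sample = """{"tokens": [
  {"token": "我们", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
  {"token": "是", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 1},
  {"token": "中国人", "start_offset": 3, "end_offset": 6, "type": "CN_WORD", "position": 2},
  {"token": "中国", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 3},
  {"token": "国人", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 4}
]}"""

tokens = tokens_from_analyze(sample)
print([t[0] for t in tokens])
print(overlapping_pairs(tokens))
```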

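Customizing the IK dictionary

First, create a test.dic file under config/custom in the IK plugin directory (create the custom directory yourself) and add words to it, one word per line.

Then edit the IK configuration file /usr/local/elasticsearch-6.6.1/plugins/ik/config/IKAnalyzer.cfg.xml:
    <entry key="ext_dict">custom/test.dic</entry>

Restart the ES service
Test the result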

[root@localhost config]# cat /usr/local/elasticsearch-6.6.1/plugins/ik/config/IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionary here -->
    <entry key="ext_dict"></entry>
    <!-- Configure your own extension stop-word dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
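
A dictionary file in the one-word-per-line format can be generated with a small script. This is a sketch under a few assumptions: the helper name and the sample words are made up, the file is written to a temporary directory rather than the real plugin path, and UTF-8 is used because that is the encoding the IK documentation asks for.

```python
import tempfile
from pathlib import Path

def write_dictionary(path, words):
    """Write unique, non-empty words to path, one per line, encoded as UTF-8."""
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # creates custom/ if missing
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique

with tempfile.TemporaryDirectory() as tmp:
    dic = Path(tmp) / "custom" / "test.dic"
    written = write_dictionary(dic, ["老司机", " 蓝瘦香菇 ", "", "老司机"])
    content = dic.read_text(encoding="utf-8")

print(written)
```

In a real deployment the target would be config/custom/test.dic inside the IK plugin directory, matching the ext_dict entry.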

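Hot-updating the IK dictionary

Avoids restarting ES every time the dictionary is updated.

I. Deploy an HTTP service (install Tomcat)
    1. Change to apache-tomcat-7.0.67/webapps/ROOT
    2. Create the hot-word file
        vim hot.dic
        ****
    3. Make sure the file is reachable
    bin/startup.sh   # start Tomcat; nginx works just as well
    http://192.168.81.131:8080/hot.dic
II. Edit the IK plugin configuration file
    1. vim config/IKAnalyzer.cfg.xml
        Add the following entry:
        <entry key="remote_ext_dict">http://192.168.81.131:8080/hot.dic</entry>
        Distribute the modified configuration to the other ES nodes
    2. Restart ES; the log shows the hot-word dictionary being loaded
    3. Test adding hot words dynamically
        Compare the output before and after adding a hot word: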
　　　　curl -H "Content-Type: application/json" -XGET 'http://192.168.81.131:9200/_analyze?pretty=true' -d'{"analyzer":"ik_smart","text":"老司机"}'
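
According to the IK plugin's README, the remote dictionary URL should return Last-Modified and ETag response headers, and the plugin re-fetches the dictionary when either one changes. The change check can be sketched as below. This mirrors the described behavior only; it is not the plugin's actual code, and the function name and header values are hypothetical.

```python
def dictionary_changed(previous_headers, current_headers):
    """Return True if Last-Modified or ETag differs between two header snapshots."""
    return (
        previous_headers.get("Last-Modified") != current_headers.get("Last-Modified")
        or previous_headers.get("ETag") != current_headers.get("ETag")
    )

# Hypothetical header snapshots from two successive polls of hot.dic.
first_poll = {"Last-Modified": "Sat, 15 Jun 2019 06:00:00 GMT", "ETag": '"a1"'}
same_file = dict(first_poll)
edited_file = {"Last-Modified": "Sat, 15 Jun 2019 06:20:00 GMT", "ETag": '"b2"'}

print(dictionary_changed(first_poll, same_file))
print(dictionary_changed(first_poll, edited_file))
```

If the HTTP server in front of hot.dic does not send these headers, edits to the file may go unnoticed, so it is worth checking the response headers when a new hot word fails to appear.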
