Elasticsearch analyzers

Reposted article, 2018-08-30 14:27

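1. What is an analyzer

An analyzer splits text into terms and normalizes them, which improves recall.

Given a sentence, the analyzer breaks it into individual words and normalizes each word (for example, converting tenses and singular/plural forms to a common form).

Recall: normalization increases the number of results a search is able to match.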

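character filter: preprocesses the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and replacing characters, such as & --> and (I&you --> I and you)

tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me

token filter: lowercase, stop word, synonym. For example: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little

The analyzer matters because it applies all of this processing to a piece of text, and only the final processed result is used to build the inverted index.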
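The three stages above can be sketched in a few lines of Python. This is a toy model for illustration only, not how Elasticsearch implements analysis: the function names, the regular expressions, and the tiny stop-word, stem, and synonym tables are all assumptions made up for this example, and a real stemmer or synonym filter is far more involved.

```python
import re

def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def ampersand_to_and(text):
    """Character filter: rewrite '&' as the word 'and'."""
    return text.replace("&", " and ")

def tokenize(text):
    """Tokenizer: keep runs of ASCII letters and digits, drop everything else."""
    return re.findall(r"[A-Za-z0-9]+", text)

# Hand-written lookup tables standing in for real token filters.
STOP_WORDS = {"a", "an", "the"}
STEMS = {"dogs": "dog", "liked": "like"}
SYNONYMS = {"mother": "mom", "small": "little"}

def token_filters(tokens):
    """Token filters: lowercase, drop stop words, then apply stems and synonyms."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        token = STEMS.get(token, token)
        token = SYNONYMS.get(token, token)
        result.append(token)
    return result

def analyze(text):
    """Run the character filters, then the tokenizer, then the token filters."""
    for char_filter in (strip_html, ampersand_to_and):
        text = char_filter(text)
    return token_filters(tokenize(text))

print(analyze("<span>Tom</span> liked the small dogs & a mother"))
# ['tom', 'like', 'little', 'dog', 'and', 'mom']
```

The order matters: character filters see the raw string, the tokenizer sees the cleaned string, and token filters only ever see individual tokens.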

2. Built-in analyzers

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)

simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

language analyzer (an analyzer for a specific language, for example english): set, shape, semi, transpar, call, set_tran, 5
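The whitespace and simple results above are easy to reproduce by hand, which helps show what each analyzer is doing. The sketch below is a rough Python approximation, assuming ASCII input only; the function names are invented for this example. The standard and english analyzers are not imitated here because they depend on Unicode segmentation rules and stemming.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()

def simple_analyze(text):
    """Split wherever a non-letter appears, then lowercase each piece."""
    return [piece.lower() for piece in re.findall(r"[A-Za-z]+", text)]

print(whitespace_analyze(TEXT))
print(simple_analyze(TEXT))
```

Note how the simple approach throws away the digit 5 and splits set_trans at the underscore, while the whitespace approach keeps set_trans(5) as a single token.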

3. Testing an analyzer

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

GET /_analyze
{
  "analyzer": "english",
  "text": "Text to analyze"
}
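The same _analyze requests can be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names and the base URL are assumptions for illustration, and it presumes an unsecured node you can reach over plain HTTP. It uses POST, which the _analyze endpoint accepts alongside GET, because some HTTP clients handle a GET with a body poorly.

```python
import json
import urllib.request

def build_analyze_request(base_url, analyzer, text, index=None):
    """Build (but do not send) a request for the _analyze API."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def token_strings(response):
    """Extract the token text from a parsed _analyze response."""
    return [entry["token"] for entry in response["tokens"]]

def analyze(base_url, analyzer, text, index=None):
    """Send the request and return the tokens; needs a reachable cluster."""
    request = build_analyze_request(base_url, analyzer, text, index)
    with urllib.request.urlopen(request) as reply:
        return token_strings(json.load(reply))
```

With a node listening on a hypothetical http://localhost:9200, calling analyze("http://localhost:9200", "english", "Text to analyze") would return the list of tokens produced by the english analyzer.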

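1. The default analyzer

standard

standard tokenizer: splits text on word boundaries

standard token filter: does nothing

lowercase token filter: converts all letters to lowercase

stop token filter (disabled by default): removes stop words such as a, the, it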

2. Changing analyzer settings

Enable the English stop-word token filter

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": { // a name of your own choosing
          "type": "standard",
          "stopwords": "_english_" // enable removal of English stop words
        }
      }
    }
  }
}

Test

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
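To see why the two test requests differ, it can help to imitate the stop-word step locally. The Python sketch below is an approximation under stated assumptions: the tokenizer is a crude regular-expression stand-in for the standard tokenizer, the function names are invented here, and the word list is intended to mirror Lucene's default English stop set, which is what "_english_" refers to. Verify the list against your own version before relying on it.

```python
import re

# Intended to mirror Lucene's default English stop-word set (assumption: check your version).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def standard_like(text):
    """Crude stand-in for the standard analyzer: split on non-word characters, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def with_stopwords(text, stopwords=ENGLISH_STOP_WORDS):
    """Same tokens as standard_like, minus any token found in the stop-word set."""
    return [token for token in standard_like(text) if token not in stopwords]

print(standard_like("a dog is in the house"))   # stop words kept
print(with_stopwords("a dog is in the house"))  # stop words removed
```

In this sketch only dog and house survive the stop-word pass, which is the kind of difference to look for between the standard and es_std responses.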

3. Building a custom analyzer

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { // define named character filters
        "&_to_and": {
          "type": "mapping", // mapping character filter
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": { // define a named stop-word filter
          "type": "stop", // stop-word token filter
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": { // name of the custom analyzer
          "type": "custom", // custom analyzer
          "char_filter": ["html_strip", "&_to_and"], // html_strip (built in) removes HTML; &_to_and converts & to and
          "tokenizer": "standard", // the default standard tokenizer
          "filter": ["lowercase", "my_stopwords"] // lowercase is the built-in lowercasing filter; my_stopwords is the custom stop-word filter defined above
        }
      }
    }
  }
}

// Analyze text with the custom analyzer

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}

// Use the custom analyzer for a field

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
