Introduction to Analyzers and How to Use Them
What is an analyzer?
An analyzer is a tool that takes a piece of user-input text and, following a set of rules, breaks it into individual terms.
Commonly used built-in analyzers
standard analyzer、simple analyzer、whitespace analyzer、stop analyzer、language analyzer、pattern analyzer
standard analyzer
The standard analyzer is the default; it is used whenever no analyzer is specified.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "standard",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 },
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 },
    { "token": "3", "start_offset": 9, "end_offset": 10, "type": "<NUM>", "position": 2 },
    { "token": "points", "start_offset": 11, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "<ALPHANUM>", "position": 4 },
    { "token": "is", "start_offset": 26, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 },
    { "token": "curry", "start_offset": 29, "end_offset": 34, "type": "<ALPHANUM>", "position": 6 }
  ]
}
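The standard analyzer's behavior on this sentence can be approximated in Python. This is only a rough sketch: the real analyzer uses Unicode text segmentation plus a lowercase filter, not a regex, but for plain ASCII input the result is the same.

```python
import re

def standard_like(text):
    """Rough approximation of the standard analyzer: lowercase the
    text and treat each maximal run of letters/digits as a term."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(standard_like("The best 3-points shooter is Curry!"))
# → ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```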
simple analyzer
The simple analyzer splits text into terms at any character that is not a letter, and lowercases every term.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "simple",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1 },
    { "token": "points", "start_offset": 11, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 },
    { "token": "is", "start_offset": 26, "end_offset": 28, "type": "word", "position": 4 },
    { "token": "curry", "start_offset": 29, "end_offset": 34, "type": "word", "position": 5 }
  ]
}
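The same tokenization can be sketched in Python (an approximation, not the Lucene implementation). Note how the digit "3" disappears, because only runs of letters become terms:

```python
import re

def simple_like(text):
    """Approximation of the simple analyzer: terms are maximal runs
    of letters; everything is lowercased and digits are dropped."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_like("The best 3-points shooter is Curry!"))
# → ['the', 'best', 'points', 'shooter', 'is', 'curry']
```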
whitespace analyzer
The whitespace analyzer splits text into terms wherever it encounters whitespace characters.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "whitespace",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "The", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1 },
    { "token": "3-points", "start_offset": 9, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 },
    { "token": "is", "start_offset": 26, "end_offset": 28, "type": "word", "position": 4 },
    { "token": "Curry!", "start_offset": 29, "end_offset": 35, "type": "word", "position": 5 }
  ]
}
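In Python terms this is essentially a plain whitespace split: case and punctuation are kept, so "Curry!" survives with its exclamation mark. A minimal sketch:

```python
def whitespace_like(text):
    """Approximation of the whitespace analyzer: split on whitespace
    only; no lowercasing, punctuation stays attached to the term."""
    return text.split()

print(whitespace_like("The best 3-points shooter is Curry!"))
# → ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```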
stop analyzer
The stop analyzer behaves like the simple analyzer, except that it also removes stop words; by default it uses the English stop word list.
stopwords: a predefined list of stop words, such as the, a, an, this, of, at, and so on.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "stop",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1 },
    { "token": "points", "start_offset": 11, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 },
    { "token": "curry", "start_offset": 29, "end_offset": 34, "type": "word", "position": 5 }
  ]
}
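Conceptually this is the simple analyzer followed by a stop word filter. A sketch, using only an illustrative subset of the English stop word list rather than the full one Lucene ships:

```python
import re

# Illustrative subset of the English stop word list (not the full set).
STOPWORDS = {"the", "a", "an", "this", "of", "at", "is"}

def stop_like(text):
    """Approximation of the stop analyzer: simple-analyzer
    tokenization, then drop any term found in the stop word set."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]

print(stop_like("The best 3-points shooter is Curry!"))
# → ['best', 'points', 'shooter', 'curry']
```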
language analyzer
Language-specific analyzers, e.g. english for English text. Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "english",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 },
    { "token": "3", "start_offset": 9, "end_offset": 10, "type": "<NUM>", "position": 2 },
    { "token": "point", "start_offset": 11, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "<ALPHANUM>", "position": 4 },
    { "token": "curri", "start_offset": 29, "end_offset": 34, "type": "<ALPHANUM>", "position": 6 }
  ]
}
Note that the english analyzer also removes English stop words and stems the remaining terms, which is why "points" becomes "point" and "Curry" becomes "curri".
pattern analyzer
The pattern analyzer splits text into terms with a regular expression; the default pattern is \W+ (one or more non-word characters).
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "pattern",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1 },
    { "token": "3", "start_offset": 9, "end_offset": 10, "type": "word", "position": 2 },
    { "token": "points", "start_offset": 11, "end_offset": 17, "type": "word", "position": 3 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 4 },
    { "token": "is", "start_offset": 26, "end_offset": 28, "type": "word", "position": 5 },
    { "token": "curry", "start_offset": 29, "end_offset": 34, "type": "word", "position": 6 }
  ]
}
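The default pattern-analyzer behavior (split on \W+, lowercase on) can be sketched with Python's re.split. An approximation, assuming the defaults described above:

```python
import re

def pattern_like(text, pattern=r"\W+"):
    """Approximation of the pattern analyzer with default settings:
    lowercase, split on the regex, and discard empty strings."""
    return [t for t in re.split(pattern, text.lower()) if t]

print(pattern_like("The best 3-points shooter is Curry!"))
# → ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```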
Usage example
PUT localhost:9200/my_index
Request body:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "type": "whitespace" }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "team_name": { "type": "text" },
      "position": { "type": "text" },
      "play_year": { "type": "long" },
      "jerse_no": { "type": "keyword" },
      "title": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
Response:
{ "acknowledged": true, "shards_acknowledged": true, "index": "my_index" }

PUT localhost:9200/my_index/_doc/1
Request body:
{
  "name": "庫里",
  "team_name": "勇士",
  "position": "控球后衛",
  "play_year": 10,
  "jerse_no": "30",
  "title": "The best 3-points shooter is Curry!"
}
Response:
{
  "_index": "my_index",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "_seq_no": 0,
  "_primary_term": 1
}

POST localhost:9200/my_index/_search
Request body:
{
  "query": {
    "match": { "title": "Curry!" }
  }
}
Response:
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "name": "庫里",
          "team_name": "勇士",
          "position": "控球后衛",
          "play_year": 10,
          "jerse_no": "30",
          "title": "The best 3-points shooter is Curry!"
        }
      }
    ]
  }
}
Because the title field uses my_analyzer (a whitespace analyzer), "Curry!" is indexed as a single term with its punctuation intact, so the query term "Curry!" matches the document.
Using common Chinese analyzers
If you analyze Chinese text with the default standard analyzer, the text is split into individual characters:
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "standard",
  "text": "火箭明年總冠軍"
}
Response:
{
  "tokens": [
    { "token": "火", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "箭", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "明", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "年", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "總", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "冠", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "軍", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }
  ]
}
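For CJK input, this one-token-per-ideograph behavior amounts to nothing more than splitting the string into its characters, which is why it is a poor fit for Chinese search. A sketch:

```python
def standard_cjk_like(text):
    """For pure CJK input the standard analyzer emits one token per
    ideograph — equivalent to listing the string's characters."""
    return list(text)

print(standard_cjk_like("火箭明年總冠軍"))
# → ['火', '箭', '明', '年', '總', '冠', '軍']
```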
Common Chinese analyzers
smartcn analyzer: a simple analyzer for Chinese, or mixed Chinese and English, text
IK analyzer: a smarter and more flexible Chinese analyzer
smartcn
Installation: from the Elasticsearch bin directory, run:
- Linux: sh elasticsearch-plugin install analysis-smartcn
- Windows: .\elasticsearch-plugin install analysis-smartcn
Restart Elasticsearch after the plugin is installed.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "smartcn",
  "text": "火箭明年總冠軍"
}
Response:
{
  "tokens": [
    { "token": "火箭", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "明年", "start_offset": 2, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "總", "start_offset": 4, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "冠軍", "start_offset": 5, "end_offset": 7, "type": "word", "position": 3 }
  ]
}
IK analyzer
Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
Installation: unzip the release into the plugins directory.
Restart Elasticsearch after the plugin is installed.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "ik_max_word",
  "text": "火箭明年總冠軍"
}
Response:
{
  "tokens": [
    { "token": "火箭", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "明年", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "總冠軍", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 2 },
    { "token": "冠軍", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 3 }
  ]
}
Note that ik_max_word produces the most fine-grained segmentation, so overlapping terms such as 總冠軍 and 冠軍 can both appear in the output.