Part 1: Overview
1. Analyzer
An analyzer is a tool that splits a piece of user-input text into individual terms according to a set of rules.
2. Built-in analyzers
standard analyzer
simple analyzer
whitespace analyzer
stop analyzer
language analyzer
pattern analyzer
Part 2: Analyzer Tests
1. standard analyzer
The standard analyzer is the default; it is used whenever no analyzer is specified.
POST /_analyze
{
"analyzer": "standard",
"text": "The best 3-points shooter is Curry!"
}
Result:
Note the token, start_offset, end_offset, type, and position fields in the output. The '-' is gone.
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "3",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "points",
"start_offset" : 11,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "curry",
"start_offset" : 29,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 6
}
]
}
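For plain English text like this, the standard analyzer's behavior can be roughly approximated in Python. This is only a sketch: the real standard tokenizer implements the Unicode UAX #29 word-boundary rules, not a simple regex split.

```python
import re

def standard_like(text):
    """Rough approximation of the standard analyzer on plain English
    text: split on runs of non-alphanumeric characters, lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

print(standard_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

The hyphen in "3-points" acts as a separator, which is why "3" and "points" come out as separate tokens, matching the response above.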
2. simple analyzer
The simple analyzer splits text into terms whenever it encounters a character that is not a letter, and lowercases every term.
POST /_analyze
{
"analyzer": "simple",
"text": "The best 3-points shooter is Curry!"
}
Result:
Both '3' and '-' are gone.
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "points",
"start_offset" : 11,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "word",
"position" : 3
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 4
},
{
"token" : "curry",
"start_offset" : 29,
"end_offset" : 34,
"type" : "word",
"position" : 5
}
]
}
Note: every character that is not a letter is discarded.
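A Python sketch of the same rule (an approximation, not the real Lucene implementation): splitting on anything that is not a letter also makes digits act as separators, which is why "3" disappears here but not with the standard analyzer.

```python
import re

def simple_like(text):
    # Split wherever the character is not a letter, then lowercase.
    # Digits count as separators too, so "3" is dropped entirely.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

print(simple_like("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```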
3. whitespace analyzer
The whitespace analyzer splits text into terms whenever it encounters a whitespace character.
POST /_analyze
{
"analyzer": "whitespace",
"text": "The best 3-points shooter is Curry!"
}
Result:
{
"tokens" : [
{
"token" : "The",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "3-points",
"start_offset" : 9,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "word",
"position" : 3
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 4
},
{
"token" : "Curry!",
"start_offset" : 29,
"end_offset" : 35,
"type" : "word",
"position" : 5
}
]
}
Note: it splits on whitespace only; it neither lowercases the terms nor strips punctuation.
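This one is simple enough that Python's built-in `str.split` models it closely (still a sketch, since the real tokenizer handles all Unicode whitespace):

```python
def whitespace_like(text):
    # Split on whitespace only; case and punctuation are preserved,
    # so "3-points" and "Curry!" survive as single tokens.
    return text.split()

print(whitespace_like("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```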
4. stop analyzer
The stop analyzer is similar to the simple analyzer, the only difference being that it also removes stop words; by default it uses the English stop-word list.
stopwords: a predefined list of stop words, such as the, a, an, this, of, at, and so on.
POST /_analyze
{
"analyzer": "stop",
"text": "The best 3-points shooter is Curry!"
}
Result:
{
"tokens" : [
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "points",
"start_offset" : 11,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "word",
"position" : 3
},
{
"token" : "curry",
"start_offset" : 29,
"end_offset" : 34,
"type" : "word",
"position" : 5
}
]
}
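The stop analyzer can be sketched as the simple analyzer followed by a stop-word filter. The stop-word set below is only a small subset of the real default English list:

```python
import re

# A few of the default English stop words; the real list is longer.
STOPWORDS = {"the", "a", "an", "this", "of", "at", "is", "and", "but"}

def stop_like(text):
    # Same split as the simple analyzer, then drop stop words.
    terms = [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]
    return [t for t in terms if t not in STOPWORDS]

print(stop_like("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```

Note that in the real response above the surviving tokens keep their original position values (1, 2, 3, 5); the gaps mark where stop words were removed.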
5. language analyzer
Language-specific analyzers, for example english for English text.
Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
POST /_analyze
{
"analyzer": "english",
"text": "The best 3-points shooter is Curry!"
}
Result (note that the english analyzer removes English stop words and stems the remaining terms):
{
"tokens" : [
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "3",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "point",
"start_offset" : 11,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "curri",
"start_offset" : 29,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 6
}
]
}
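The english analyzer stems terms with the Porter stemmer, which is why "points" becomes "point" and "Curry" becomes "curri". The toy sketch below implements only the two stemming rules visible in this particular output; it is nowhere near the real Porter algorithm:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "of", "at"}

def toy_stem(term):
    # Two toy rules mimicking what the Porter stemmer does to this
    # sentence: strip a trailing plural "s", and rewrite a trailing
    # "y" as "i" ("points" -> "point", "curry" -> "curri").
    if term.endswith("s") and len(term) > 3:
        term = term[:-1]
    if term.endswith("y"):
        term = term[:-1] + "i"
    return term

def english_like(text):
    terms = [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    return [toy_stem(t) for t in terms if t not in STOPWORDS]

print(english_like("The best 3-points shooter is Curry!"))
# ['best', '3', 'point', 'shooter', 'curri']
```

Stemming happens at index time and at search time with the same analyzer, so a query for "points" still finds documents containing "point".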
6. pattern analyzer
Splits text into terms using a regular expression; the default pattern is \W+ (a run of non-word characters).
POST /_analyze
{
"analyzer": "pattern",
"text": "The best 3-points shooter is Curry!"
}
Result:
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "3",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 2
},
{
"token" : "points",
"start_offset" : 11,
"end_offset" : 17,
"type" : "word",
"position" : 3
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "word",
"position" : 4
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 5
},
{
"token" : "curry",
"start_offset" : 29,
"end_offset" : 34,
"type" : "word",
"position" : 6
}
]
}
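With the default \W+ pattern the output matches the standard analyzer's on this sentence. A Python sketch of the same idea (the real analyzer uses Java regex syntax, which differs from Python's in some corner cases):

```python
import re

def pattern_like(text, pattern=r"\W+"):
    # Split on matches of the given regex (default \W+, i.e. runs of
    # non-word characters) and lowercase the resulting terms.
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']

# A custom pattern changes the splitting rule, e.g. split on commas:
print(pattern_like("a,b,c", pattern=","))
# ['a', 'b', 'c']
```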
Part 3: Using an Analyzer in Practice
1. Create an index
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "whitespace"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text"
},
"team_name": {
"type": "text"
},
"position": {
"type": "text"
},
"play_year": {
"type": "long"
},
"jerse_no": {
"type": "keyword"
},
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
2. Run a test search
GET /my_index/_search
{
"query": {
"match": {
"title": "Curry!"
}
}
}
Result:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"name" : "Curry",
"team_name" : "Warriors",
"position" : "point guard",
"play_year" : 10,
"jerse_no" : "30",
"title" : "The best 3-points shooter is Curry!"
}
}
]
}
}
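The hit follows from the custom whitespace analyzer on the title field: both the indexed text and the match query's text are tokenized on whitespace, so the exact token "Curry!" exists on both sides. A minimal sketch of that term-overlap logic (a drastic simplification of Lucene's inverted index, reusing the example's data):

```python
def whitespace_tokens(text):
    # The whitespace analyzer: split on whitespace, keep case
    # and punctuation.
    return set(text.split())

doc_title = "The best 3-points shooter is Curry!"

# A match query analyzes the query text with the field's analyzer and
# looks for any overlap with the indexed tokens.
def matches(query):
    return bool(whitespace_tokens(doc_title) & whitespace_tokens(query))

print(matches("Curry!"))  # True: the token "Curry!" is in the index
print(matches("Curry"))   # False: "Curry" != "Curry!" under whitespace
```

This also shows the trade-off of the whitespace analyzer here: searching for "Curry" without the exclamation mark would return no hits, because index-time and search-time tokens must match exactly.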
