# Analyzers

In Elasticsearch, every analyzer, whether built-in or custom, is made up of three parts: character filters, a tokenizer, and token filters.

Analyzer workflow:

Input text => Character filters (applied in order if there are several) => Tokenizer => Token filters (applied in order if there are several) => Output tokens

## Character Filters

Character filters preprocess the raw text, for example stripping HTML tags or mapping "&" to "and".

Note: when an analyzer has several character filters, they are applied in order.

## Tokenizer

The tokenizer splits the string into a stream of tokens, for example splitting English text on whitespace.

## Token Filters

Token filters post-process the tokens emitted by the tokenizer, for example lowercasing, removing stopwords, singular/plural normalization, or synonym expansion.

Note: when an analyzer has several token filters, they are applied in order.

## The _analyze API

The _analyze API verifies what an analyzer produces and can explain each step of the analysis.

```
# text:        the text to analyze
# explain:     explain the analysis process
# char_filter: character filters
# tokenizer:   tokenizer
# filter:      token filters
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p><em>No <b>dreams</b>, why bother <b>Beijing</b> !</em></p>",
  "explain": true
}
```

# Defining Multiple Custom Analyzers

## Create an index with multiple custom analyzers

Here several analyzers are defined on a single index. The `analysis` block declares two custom character filters, two custom tokenizers, two custom token filters, and two analyzers assembled from them. (Request bodies must be valid JSON, so the settings carry no inline comments.)

```
PUT my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "char_filter": {
        "my_charfilter1": {
          "type": "mapping",
          "mappings": ["& => and"]
        },
        "my_charfilter2": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      },
      "tokenizer": {
        "my_tokenizer1": {
          "pattern": "\\s+",
          "type": "pattern"
        },
        "my_tokenizer2": {
          "pattern": "_",
          "type": "pattern"
        }
      },
      "filter": {
        "my_tokenfilter1": {
          "type": "stop",
          "stopwords": ["the", "a", "an"]
        },
        "my_tokenfilter2": {
          "type": "stop",
          "stopwords": ["info", "debug"]
        }
      },
      "analyzer": {
        "my_analyzer1": {
          "char_filter": ["html_strip", "my_charfilter1", "my_charfilter2"],
          "tokenizer": "my_tokenizer1",
          "filter": ["lowercase", "my_tokenfilter1"]
        },
        "my_analyzer2": {
          "char_filter": ["html_strip"],
          "tokenizer": "my_tokenizer2",
          "filter": ["my_tokenfilter2"]
        }
      }
    }
  }
}
```

## Verify the analyzers on my_index

### Verify my_analyzer1

Add `"explain": true` to the request body to see each step of the analysis.

```
GET /my_index/_analyze
{
  "text": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "analyzer": "my_analyzer1"
}
```

Response:

```
{
  "tokens": [
    { "token": "tom",    "start_offset": 3,  "end_offset": 6,  "type": "word", "position": 0 },
    { "token": "and",    "start_offset": 12, "end_offset": 13, "type": "word", "position": 1 },
    { "token": "jerry",  "start_offset": 17, "end_offset": 26, "type": "word", "position": 2 },
    { "token": "in",     "start_offset": 27, "end_offset": 29, "type": "word", "position": 3 },
    { "token": "room",   "start_offset": 34, "end_offset": 38, "type": "word", "position": 5 },
    { "token": "number", "start_offset": 39, "end_offset": 45, "type": "word", "position": 6 },
    { "token": "1_1_1",  "start_offset": 46, "end_offset": 51, "type": "word", "position": 7 }
  ]
}
```

### Verify my_analyzer2

```
GET /my_index/_analyze
{
  "text": "<b>debug_192.168.113.1_971213863506812928</b>",
  "analyzer": "my_analyzer2"
}
```

Response:

```
{
  "tokens": [
    { "token": "192.168.113.1",      "start_offset": 9,  "end_offset": 22, "type": "word", "position": 1 },
    { "token": "971213863506812928", "start_offset": 23, "end_offset": 45, "type": "word", "position": 2 }
  ]
}
```

## Add a mapping that assigns a different analyzer to each field

```
PUT my_index/_mapping/my_type
{
  "properties": {
    "my_field1": {
      "type": "text",
      "analyzer": "my_analyzer1",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    },
    "my_field2": {
      "type": "text",
      "analyzer": "my_analyzer2",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    }
  }
}
```

## Index a document

```
PUT my_index/my_type/1
{
  "my_field1": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "my_field2": "<b>debug_192.168.113.1_971213863506812928</b>"
}
```

## Full-Text Search with the Match Query

At query time, Elasticsearch analyzes the query string with the analyzer configured for the target field, then searches with the resulting terms.

```
# find documents whose my_field2 contains the IP 192.168.113.1
GET my_index/_search
{
  "query": {
    "match": {
      "my_field2": "192.168.113.1"
    }
  }
}
```

Response:

```
{
  "took": 22,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "my_field1": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
          "my_field2": "<b>debug_192.168.113.1_971213863506812928</b>"
        }
      }
    ]
  }
}
```
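The char filter => tokenizer => token filter pipeline of `my_analyzer1` can be sketched in plain Python. This is a simplified stand-in, not the actual Lucene implementation: the `html_strip` regex and the function names are illustrative assumptions, but the stages and their ordering mirror the analyzer defined above.

```python
import re

def html_strip(text):
    # crude stand-in for the html_strip character filter
    return re.sub(r"<[^>]+>", "", text)

def mapping_filter(text):
    # my_charfilter1: "& => and"
    return text.replace("&", "and")

def pattern_replace(text):
    # my_charfilter2: turn digit-dash-digit into digit_digit
    return re.sub(r"(\d+)-(?=\d)", r"\1_", text)

def pattern_tokenizer(text):
    # my_tokenizer1: split on runs of whitespace
    return [t for t in re.split(r"\s+", text) if t]

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    # my_tokenfilter1: drop configured stopwords
    return [t for t in tokens if t not in {"the", "a", "an"}]

def my_analyzer1(text):
    # character filters run first, in declaration order
    for char_filter in (html_strip, mapping_filter, pattern_replace):
        text = char_filter(text)
    # then the single tokenizer
    tokens = pattern_tokenizer(text)
    # then token filters, in declaration order
    for token_filter in (lowercase, stop_filter):
        tokens = token_filter(tokens)
    return tokens

print(my_analyzer1("<b>Tom </b> & <b>jerry</b> in the room number 1-1-1"))
# ['tom', 'and', 'jerry', 'in', 'room', 'number', '1_1_1']
```

The output matches the token stream the _analyze API returned for `my_analyzer1` above (offsets and positions omitted).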
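The match query hits because the query string is analyzed with the same analyzer as the indexed field, so query terms and indexed terms line up. A minimal Python sketch of that idea, using a simplified stand-in for `my_analyzer2` (the regex and variable names are illustrative assumptions):

```python
import re

STOPWORDS = {"info", "debug"}  # my_tokenfilter2

def my_analyzer2(text):
    text = re.sub(r"<[^>]+>", "", text)               # html_strip stand-in
    tokens = [t for t in text.split("_") if t]        # my_tokenizer2: split on "_"
    return [t for t in tokens if t not in STOPWORDS]  # stop filter

# index time: the field value becomes terms in the inverted index
indexed_terms = set(my_analyzer2("<b>debug_192.168.113.1_971213863506812928</b>"))

# query time: the match query analyzes the query with the field's analyzer
query_terms = my_analyzer2("192.168.113.1")

# the document matches when a query term appears among its indexed terms
print(any(t in indexed_terms for t in query_terms))  # True
```

Because "debug" is removed by the stop filter at index time, a match query for "debug" against `my_field2` would find nothing, while the IP matches exactly.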