Elasticsearch: analyzer

This article is a repost. Dated 2019-12-23 16:47. Category: ELK Stack

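In this article we take a closer look at analyzers. An analyzer breaks an input character stream into tokens. This generally happens on two occasions:

At indexing time, that is, when a document is being indexed
At search time, when the text being searched for is analyzed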

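What is analysis?

Analysis is the process Elasticsearch performs on the body of a document before the document is added to the inverted index. Before adding a document to the index, Elasticsearch goes through several steps for every analyzed field:

Character filtering: transforms the characters using character filters
Breaking text into tokens: splits the text into a set of one or more tokens
Token filtering: transforms each token using token filters
Token indexing: stores those tokens in the index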

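We will discuss each step in more detail next, but first let's look at the whole process as summarized in a figure. Figure 5.1 shows "share your experience with NoSql & big data technologies" analyzed into the tokens share, your, experience, with, nosql, big, data, tools, and technologies (tools is not in the input sentence; it appears to come from a token filter in the figure, such as a synonym filter).

What the figure shows is a custom analyzer made up of a character filter, the standard tokenizer, and token filters. It concisely describes the basic components of an analyzer and what each of them does.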

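Whenever a document is ingested, it goes through the following steps before it is finally written to the Elasticsearch index:

The middle part of that flow is called the analyzer. It consists of three parts: char filters, a tokenizer, and token filters. Their roles are as follows:

Char filter: a character filter performs clean-up tasks, such as stripping HTML tags or, as in the example above, converting "&" into the string "and"
Tokenizer: the next step splits the text into terms called tokens. This is done by the tokenizer. The split can follow any rule (for example, whitespace). For more details on tokenizers, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
Token filter: once the tokens have been created, they are passed to the token filters, which normalize them. A token filter can change tokens, remove terms, or add terms.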
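
The three stages can be modelled in a few lines of plain Python. This is only a sketch of the data flow, not how Elasticsearch implements it: the function names, the whitespace tokenizer and the one-word stop list are assumptions chosen for illustration.

```python
# Illustrative stand-ins for the three analyzer stages. These are NOT
# Elasticsearch's implementations, only a model of the data flow.

STOP_WORDS = {"and"}  # assumed stop list, chosen only for this example


def char_filter(text):
    """Character filter: rewrite the raw text, here turning '&' into 'and'."""
    return text.replace("&", "and")


def tokenizer(text):
    """Tokenizer: split the filtered text into tokens on whitespace."""
    return text.split()


def token_filters(tokens):
    """Token filters: lowercase every token, then drop stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text):
    """Run the stages in order: char filter -> tokenizer -> token filters."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("share your experience with NoSql & big data technologies"))
# -> ['share', 'your', 'experience', 'with', 'nosql', 'big', 'data', 'technologies']
```

Note that "&" first becomes "and" in the char filter and is then removed again by the stop-word filter, which is why it leaves no token behind.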

Elasticsearch already ships with a rich set of out-of-the-box analyzers. We can also create our own, combining existing char filters, tokenizers and token filters into a new analyzer, and we can assign a separate analyzer to each field of a document. If you are interested in analyzers, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html.

The standard analyzer is Elasticsearch's default analyzer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html):

No char filter
Uses the standard tokenizer
Lowercases the tokens and can optionally remove stop words. By default the stop word list is _none_, so no stop words are filtered.

In summary, an analyzer is made up of the following parts:

Zero or more character filters
Exactly one tokenizer
Zero or more token filters

Analyze API

GET /_analyze
POST /_analyze
GET /<index>/_analyze
POST /<index>/_analyze

Use the _analyze API to test how an analyzer breaks up a string, for example:

    GET /_analyze
    {
      "analyzer": "standard",
      "text": "Quick Brown Foxes!"
    }

Response:

    {
      "tokens" : [
        {
          "token" : "quick",
          "start_offset" : 0,
          "end_offset" : 5,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "brown",
          "start_offset" : 6,
          "end_offset" : 11,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "foxes",
          "start_offset" : 12,
          "end_offset" : 17,
          "type" : "<ALPHANUM>",
          "position" : 2
        }
      ]
    }

Here we used the standard analyzer, which breaks the string into three tokens and reports the offsets and position of each one.
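
To make the fields of that response concrete, here is a rough Python approximation that produces the same token, start_offset, end_offset and position values for this input. It assumes plain ASCII text and uses a simple `\w+` regular expression; the real standard tokenizer follows Unicode text segmentation rules, so treat this as a sketch only. The function name is made up for this example.

```python
import re


def standard_like_analyze(text):
    """Approximate the standard analyzer for plain ASCII text:
    split on runs of word characters, lowercase, record offsets and position."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),   # lowercase token filter
            "start_offset": match.start(),    # index of the first character
            "end_offset": match.end(),        # index just past the last character
            "position": position,             # ordinal of the token in the stream
        })
    return tokens


for token in standard_like_analyze("Quick Brown Foxes!"):
    print(token)
```

The offsets always refer to the original text, which is what allows features such as highlighting to point back at the right characters.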

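Multi-field text fields

To improve search, we can analyze the same string with several different analyzers, each in its own way. We can also use existing analyzers to set up a custom one. For example, we define the following mapping: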

    PUT multifield
    {
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "standard", 
            "fields": {
              "english": {
                "type": "text",
                "analyzer": "english"
              }
            }
          }
        }
      }
    }

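Here we define an index called multifield. The content field itself uses the standard analyzer, while its sub-field english (addressed as content.english) uses the english analyzer, which removes stop words and applies stemming. Let's first index a document into multifield: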

    PUT multifield/_doc/1
    {
      "content": "We are excited to introduce the world to X-Pack"
    }

We can then search it as follows:

    GET /multifield/_search
    {
      "query": {
        "match": {
          "content": "the"
        }
      }
    }

The search returns:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 0.2876821,
        "hits" : [
          {
            "_index" : "multifield",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 0.2876821,
            "_source" : {
              "content" : "We are excited to introduce the world to X-Pack"
            }
          }
        ]
      }
    }

The document is found. But if we search like this instead:

    GET /multifield/_search
    {
      "query": {
        "match": {
          "content.english": "the"
        }
      }
    }

we get no hits at all, because the english analyzer treats "the" as a stop word and drops it.
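
The behaviour of the two searches can be reproduced with a toy inverted index in Python. Everything here is a simplification under stated assumptions: the stop list is a small subset of the English stop words, stemming is left out entirely, and the helper names (`standard_like`, `english_like`, `index_document`, `match`) are invented for this sketch.

```python
import re
from collections import defaultdict

# Small assumed subset of the English stop words; stemming is omitted.
ENGLISH_STOP_WORDS = {"the", "to", "are"}


def standard_like(text):
    """Rough stand-in for the standard analyzer: word characters, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def english_like(text):
    """Rough stand-in for the english analyzer: standard plus stop-word removal."""
    return [token for token in standard_like(text) if token not in ENGLISH_STOP_WORDS]


# One analyzer per (sub-)field, mirroring the multifield mapping.
FIELD_ANALYZERS = {"content": standard_like, "content.english": english_like}


def index_document(index, doc_id, text):
    """Analyze the same text once per field and record term -> document ids."""
    for field, analyzer in FIELD_ANALYZERS.items():
        for term in analyzer(text):
            index[field][term].add(doc_id)


def match(index, field, query):
    """Analyze the query with the field's analyzer and collect matching ids."""
    hits = set()
    for term in FIELD_ANALYZERS[field](query):
        hits |= index[field].get(term, set())
    return hits


index = defaultdict(lambda: defaultdict(set))
index_document(index, "1", "We are excited to introduce the world to X-Pack")

print(match(index, "content", "the"))          # -> {'1'}
print(match(index, "content.english", "the"))  # -> set()
```

In the second query the stop word is removed from the query itself, so there is nothing left to look up in the index.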

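How to define a custom analyzer

Here we mainly use existing components to build a custom analyzer. Developing your own analysis plugin is outside the scope of this article.

Suppose we have the following sentence: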

    GET _analyze
    {
      "text": "I am so excited to go to the x-school",
      "analyzer": "standard"
    }

The response ends with these two tokens:

        {
          "token" : "x",
          "start_offset" : 29,
          "end_offset" : 30,
          "type" : "<ALPHANUM>",
          "position" : 8
        },
        {
          "token" : "school",
          "start_offset" : 31,
          "end_offset" : 37,
          "type" : "<ALPHANUM>",
          "position" : 9
        }

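Here x-school is split into two tokens: x and school. What if we want x-school to be treated as a single token? We can achieve this with dedicated settings and a mapping. For example, for an index called blogs: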

    PUT blogs
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "xschool_filter": {
              "type": "mapping",
              "mappings": [
                "X-School => XSchool"
              ]
            }
          },
          "analyzer": {
            "my_content_analyzer": {
              "type": "custom",
              "char_filter": [
                "xschool_filter"
              ],
              "tokenizer": "standard",
              "filter": [
                "lowercase"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "my_content_analyzer"
          }
        }
      }    
    }

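Note the analysis section under settings. We define a char_filter called xschool_filter, which converts "X-School" into "XSchool" (the mapping is matched as written, so the capitalization matters). Then, using xschool_filter, we define an analyzer called my_content_analyzer. Its type is custom, and we specify its char_filter, tokenizer and filter. We can now use my_content_analyzer to analyze our text. In the mappings we can see: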

      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "my_content_analyzer"
          }
        }
      }

Here we use the my_content_analyzer analyzer that we just defined under analysis. We can test whether it works as follows:

    POST blogs/_analyze
    {
      "text": "I am so excited to go to the X-School",
      "analyzer": "my_content_analyzer"
    }

We get the following result:

    {
      "tokens" : [
        {
          "token" : "i",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "am",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "so",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "excited",
          "start_offset" : 8,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "to",
          "start_offset" : 16,
          "end_offset" : 18,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "go",
          "start_offset" : 19,
          "end_offset" : 21,
          "type" : "<ALPHANUM>",
          "position" : 5
        },
        {
          "token" : "to",
          "start_offset" : 22,
          "end_offset" : 24,
          "type" : "<ALPHANUM>",
          "position" : 6
        },
        {
          "token" : "the",
          "start_offset" : 25,
          "end_offset" : 28,
          "type" : "<ALPHANUM>",
          "position" : 7
        },
        {
          "token" : "xschool",
          "start_offset" : 29,
          "end_offset" : 37,
          "type" : "<ALPHANUM>",
          "position" : 8
        }
      ]
    }

Here we can see the single token "xschool".
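
The effect of the mapping char filter can be sketched in Python as a plain string replacement that runs before tokenization. This is an illustration under assumptions, not the Elasticsearch implementation: `CHAR_MAPPINGS`, `mapping_char_filter` and `analyze` are names made up for this example, and the tokenizer is a simple `\w+` regular expression.

```python
import re

# Same rule as in the index settings: "X-School => XSchool".
CHAR_MAPPINGS = {"X-School": "XSchool"}


def mapping_char_filter(text):
    """Replace each mapped source string; the match is case-sensitive."""
    for source, target in CHAR_MAPPINGS.items():
        text = text.replace(source, target)
    return text


def analyze(text, use_char_filter=True):
    """Char filter (optional) -> word tokenizer -> lowercase filter."""
    if use_char_filter:
        text = mapping_char_filter(text)
    return [token.lower() for token in re.findall(r"\w+", text)]


sentence = "I am so excited to go to the X-School"
print(analyze(sentence, use_char_filter=False))  # ends with 'x', 'school'
print(analyze(sentence))                         # ends with 'xschool'
print(analyze(sentence.lower()))                 # lowercase input is not mapped
```

Because the replacement happens before the tokenizer sees the text, the hyphen is gone by the time the text is split. The last call shows why the test sentence uses "X-School" with capitals: a lowercase "x-school" does not match the rule and is still split in two.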

The response above still contains tokens such as "the" and "to". If we want to get rid of them, we can use the following settings:

    DELETE blogs
     
    PUT blogs
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "xschool_filter": {
              "type": "mapping",
              "mappings": [
                "X-School => XSchool"
              ]
            }
          },
          "analyzer": {
            "my_content_analyzer": {
              "type": "custom",
              "char_filter": [
                "xschool_filter"
              ],
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "my_stop"
              ]
            }
          },
          "filter": {
            "my_stop": {
              "type": "stop",
              "stopwords": ["so", "to", "the"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "my_content_analyzer"
          }
        }
      }
    }

Here we have added a token filter called my_stop:

          "filter": {
            "my_stop": {
              "type": "stop",
              "stopwords": ["so", "to", "the"]
            }
          }

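We also added my_stop to the filter list of our custom analyzer, so that so, to and the are removed as stop words. Let's run the analysis again: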

    POST blogs/_analyze
    {
      "text": "I am so excited to go to the X-School",
      "analyzer": "my_content_analyzer"
    }

The result is:

    {
      "tokens" : [
        {
          "token" : "i",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "am",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "excited",
          "start_offset" : 8,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "go",
          "start_offset" : 19,
          "end_offset" : 21,
          "type" : "<ALPHANUM>",
          "position" : 5
        },
        {
          "token" : "xschool",
          "start_offset" : 29,
          "end_offset" : 37,
          "type" : "<ALPHANUM>",
          "position" : 8
        }
      ]
    }

We can see that so, the and to have all been filtered out.
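
One detail worth noticing in that response is that the remaining tokens keep their original positions (0, 1, 3, 5, 8), so the removed stop words leave gaps. A small Python sketch of this behaviour, again with made-up names and a simple regular-expression tokenizer standing in for the real one:

```python
import re

STOP_WORDS = {"so", "to", "the"}  # same list as the my_stop filter


def analyze_with_positions(text):
    """Return (token, position) pairs; removed stop words still use up a position."""
    text = text.replace("X-School", "XSchool")  # stands in for xschool_filter
    tokens = []
    for position, word in enumerate(re.findall(r"\w+", text)):
        term = word.lower()
        if term in STOP_WORDS:
            continue  # token is dropped, but the position counter has advanced
        tokens.append((term, position))
    return tokens


print(analyze_with_positions("I am so excited to go to the X-School"))
# -> [('i', 0), ('am', 1), ('excited', 3), ('go', 5), ('xschool', 8)]
```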

Filter order matters too

Let's try the following example:

    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": [
        "lowercase",
        "stop"
      ],
      "text": "To Be Or Not To Be"
    }

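Here the lowercase filter runs first and turns every token into lowercase, and the stop filter runs second. The result is [], that is, no tokens at all.

Conversely, if we use the following order: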

    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": [
        "stop",
        "lowercase"
      ],
      "text": "To Be Or Not To Be"
    }

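Here the stop filter runs first. Because the words start with capital letters, they are not recognized as stop words and nothing is removed. The lowercase filter runs afterwards, so the result is the tokens to, be, or, not, to, be.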
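
The same experiment is easy to replay in Python by treating each token filter as a function and applying the functions in list order. This is a sketch with assumed names; the stop list contains only the four words needed here, which are a subset of the default English stop words.

```python
STOP_WORDS = {"to", "be", "or", "not"}  # subset of the default English stop list


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def stop(tokens):
    """Token filter: drop tokens that exactly match a (lowercase) stop word."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, filters):
    """Whitespace tokenizer followed by the given filters, applied in order."""
    tokens = text.split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("To Be Or Not To Be", [lowercase, stop]))  # -> []
print(analyze("To Be Or Not To Be", [stop, lowercase]))
# -> ['to', 'be', 'or', 'not', 'to', 'be']
```

The stop filter compares tokens exactly as they arrive, so it only does its job if an earlier filter has already normalized the case.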

search_analyzer

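As you may have noticed, every document that is stored in Elasticsearch goes through a process called indexing. During indexing, the text is analyzed and broken into individual tokens, which are then stored in the index. But whenever we search for a string, that string also has to be analyzed at search time, which again produces tokens. Those tokens, of course, are not stored.

For example: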

    GET /chinese/_search
    {
      "query": {
        "match": {
          "content": "Happy a birthday"
        }
      }
    }

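By default, this search analyzes "Happy a birthday" with the same analyzer that was used at index time. If that analyzer contains a stop filter, it will most likely remove the word "a", leaving only "happy" and "birthday", so "a" takes no part in the search.

In practice we can also specify a dedicated search_analyzer for searching, as shown below. If the blogs index from the previous step still exists, delete it first with DELETE blogs.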

    PUT blogs
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "xschool_filter": {
              "type": "mapping",
              "mappings": [
                "X-School => XSchool"
              ]
            }
          },
          "analyzer": {
            "my_content_analyzer": {
              "type": "custom",
              "char_filter": [
                "xschool_filter"
              ],
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "my_stop"
              ]
            }
          },
          "filter": {
            "my_stop": {
              "type": "stop",
              "stopwords": ["so", "to", "the"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "my_content_analyzer",
            "search_analyzer": "standard"
          }
        }
      }
    }

As we can see above, we have defined two different analyzers: when a document is indexed we use the my_content_analyzer analyzer, and when we search we use the standard analyzer.
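
When the two analyzers differ, it is worth checking that the terms produced at search time can actually match the terms stored at index time. The Python sketch below models the two analyzers of this mapping with simplified stand-ins (the function names and the regular-expression tokenizer are assumptions) and shows which query terms would find a match. On a real cluster the same check can be done by running _analyze with each analyzer.

```python
import re

STOP_WORDS = {"so", "to", "the"}  # same list as the my_stop filter


def index_analyzer(text):
    """Simplified model of my_content_analyzer: mapping, lowercase, stop words."""
    text = text.replace("X-School", "XSchool")
    terms = [token.lower() for token in re.findall(r"\w+", text)]
    return [term for term in terms if term not in STOP_WORDS]


def search_analyzer(text):
    """Simplified model of the standard analyzer: word characters, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Terms stored for the document at index time.
indexed_terms = set(index_analyzer("I am so excited to go to the X-School"))


def matching_terms(query):
    """Query terms (as produced at search time) that exist among the indexed terms."""
    return [term for term in search_analyzer(query) if term in indexed_terms]


print(sorted(indexed_terms))       # -> ['am', 'excited', 'go', 'i', 'xschool']
print(matching_terms("excited"))   # -> ['excited']
print(matching_terms("X-School"))  # -> []
```

In this simplified model the query "X-School" becomes x and school at search time, neither of which matches the stored term xschool. That is the kind of mismatch to watch out for when choosing a search_analyzer.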
