elasticsearch - The Analysis Process


Preface

By now we know how to create indices and search data.
So it's time to explore what goes on behind the scenes: what exactly happens once data reaches elasticsearch?

The analysis process

After a document is sent to elasticsearch, and before it is added to the inverted index, elasticsearch runs it through a series of processing steps:

  • Character filtering: transform the character stream with character filters.
  • Tokenization: split the text into one or more tokens.
  • Token filtering: transform each token with token filters.
  • Token indexing: finally, store the tokens in the Lucene inverted index.

The overall flow is: character filters first, then the tokenizer, then token filters, and finally the resulting tokens are written to the index.

Next we'll briefly introduce elasticsearch's analyzers, tokenizers and token filters. They are simple to configure and very flexible; by combining them in different ways we can get exactly the tokens we want.

Indeed, however complex the analysis process may be, its goal is always to produce more useful tokens.
Let's look at each component of the analysis process in turn.

Analyzers

In elasticsearch, an analyzer is made up of:

  • optional character filters
  • exactly one tokenizer
  • zero or more token filters
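
To make this concrete, here is a minimal sketch of a custom analyzer that wires all three stages together (the index name analyzer_demo and the particular filter choices are just examples for illustration):

PUT analyzer_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

POST analyzer_demo/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<b>Caffè</b> LATTE"
}

Here the html_strip character filter removes the <b> tags, the standard tokenizer splits the remaining text, and the lowercase and asciifolding token filters turn the tokens into caffe and latte.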

Next we'll briefly walk through each of the built-in analyzers. Before we start, a note for the demos: if you installed the ik analysis plugin following the earlier tutorial, please temporarily move it out of the plugins directory so it doesn't interfere with the results.

The standard analyzer (standard)

The standard analyzer (standard) is elasticsearch's default analyzer. It combines defaults that are reasonable for most European languages: the standard tokenizer, the standard token filter, the lowercase token filter and the stop token filter (whose stop word list is empty by default, which is why words such as "to" and "a" still appear in the output below).

POST _analyze
{
  "analyzer": "standard",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The tokenization result:

{
  "tokens" : [
    {
      "token" : "to",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "that",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "莎",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "士",
      "start_offset" : 46,
      "end_offset" : 47,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "比",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "亞",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    }
  ]
}

The simple analyzer (simple)

The simple analyzer (simple) uses only the lowercase tokenizer: it splits the text at non-letter characters and lowercases each token. It does not work well for Asian languages, which are not separated by whitespace or letter boundaries, so it is generally used for European languages.

POST _analyze
{
  "analyzer": "simple",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The tokenization result:

{
  "tokens" : [
    {
      "token" : "to",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "that",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亞",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}

The whitespace analyzer (whitespace)

The whitespace analyzer (whitespace) simply splits the text into tokens on whitespace. Lazy, but sometimes that's all you need!

POST _analyze
{
  "analyzer": "whitespace",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The tokenization result:

{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be,",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "————",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "莎士比亞",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 11
    }
  ]
}

The stop analyzer (stop)

The stop analyzer (stop) behaves much like the simple analyzer, except that it additionally removes stop words from the token stream.

POST _analyze
{
  "analyzer": "stop",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result is much shorter:

{
  "tokens" : [
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亞",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}
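
The stop word list is configurable. As a minimal sketch (the index name stop_test and the word list are just examples), we could build a stop analyzer with our own stop words:

PUT stop_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["to", "be", "or", "not"]
        }
      }
    }
  }
}

POST stop_test/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "To be or not to be, That is a question"
}

With this word list only to, be, or and not are dropped, so that, is, a and question survive.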

The keyword analyzer (keyword)

The keyword analyzer (keyword) treats the entire field as a single token. Unless there is a real need, we don't normally use it in our mappings.

POST _analyze
{
  "analyzer": "keyword",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "To be or not to be,  That is a question ———— 莎士比亞",
      "start_offset" : 0,
      "end_offset" : 49,
      "type" : "word",
      "position" : 0
    }
  ]
}

As promised, the whole string comes back as a single token.

The pattern analyzer (pattern)

The pattern analyzer (pattern) lets us specify a pattern on which to split the text into tokens. In practice, though, it is usually better to build a custom analyzer that combines the pattern tokenizer with whichever token filters you need.

POST _analyze
{
  "analyzer": "pattern",
  "explain": false, 
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "to",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "that",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    }
  ]
}

Let's define a custom pattern analyzer, for example one whose regular expression splits email addresses.

PUT pattern_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer":{
          "type":"pattern",
          "pattern":"\\W|_",
          "lowercase":true
        }
      }
    }
  }
}

In the example above we configure the custom analyzer while creating the index.

Note that in a JSON string the backslashes in the regular expression must be escaped.

Now let's run the custom analyzer:

POST pattern_test/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}

The result:

{
  "tokens" : [
    {
      "token" : "john",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "smith",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "foo",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "bar",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "com",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "word",
      "position" : 4
    }
  ]
}
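
To actually use such an analyzer for search, it has to be attached to a field in the mappings. A minimal sketch (the index and field names are just examples; depending on your elasticsearch version the mappings may additionally need a document type level):

PUT email_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "my_email_analyzer"
      }
    }
  }
}

Documents indexed into the email field are then analyzed with my_email_analyzer at index time and, by default, at query time as well.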

Language and multilingual analyzers: chinese

elasticsearch ships with a good, simple, out-of-the-box set of language analyzers for many of the world's major languages: Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and Thai.

We can select one of these language-specific analyzers simply by name, but the name must be lowercase! If the language you need is not in the list above, or the built-in support is too basic (as it is for Chinese), you will likely need the corresponding analysis plugin.

POST _analyze
{
  "analyzer": "chinese",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "莎",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "士",
      "start_offset" : 46,
      "end_offset" : 47,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "比",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "亞",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    }
  ]
}

Other languages work the same way:

POST _analyze
{
  "analyzer": "french",
  "text":"Je suis ton père"
}
POST _analyze
{
  "analyzer": "german",
  "text":"Ich bin dein vater"
}

The snowball analyzer (snowball)

The snowball analyzer (snowball) uses the standard tokenizer and token filter (just like the standard analyzer) plus the lowercase token filter and the stop token filter; on top of that it stems the text with the Snowball stemmer.

POST _analyze
{
  "analyzer": "snowball",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "莎",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "士",
      "start_offset" : 46,
      "end_offset" : 47,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "比",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "亞",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    }
  ]
}

Character filters

Character filters are defined under the char_filter setting and operate on the raw character stream. There aren't many kinds; elasticsearch provides only three:

  • HTML strip character filter (HTML Strip Char Filter)
  • Mapping character filter (Mapping Char Filter)
  • Pattern replace character filter (Pattern Replace Char Filter)

Let's see how each of them works.

The HTML strip character filter (html_strip)

The HTML strip character filter (html_strip) removes HTML elements from the text and decodes HTML entities such as &apos;.

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text":"<p>I&apos;m so <b>happy</b>!</p>"
}

The result:

{
  "tokens" : [
    {
      "token" : """

I'm so happy!

""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}
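
The html_strip character filter can also be told to leave certain tags alone via its escaped_tags parameter. A minimal sketch (the index name html_test is just an example):

PUT html_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_html_strip"]
        }
      },
      "char_filter": {
        "my_html_strip": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

POST html_test/_analyze
{
  "analyzer": "my_html_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

With this configuration the <p> tags are still stripped, but the <b> tags are kept in the output.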

The mapping character filter (mapping)

The mapping character filter (mapping) takes a map of keys and values; whenever it encounters a string equal to one of the keys, it replaces it with the associated value.

PUT pattern_test4
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["my_char_filter"]
        }
      },
      "char_filter":{
          "my_char_filter":{
            "type":"mapping",
            "mappings":["蒼井空 => 666","武藤蘭 => 888"]
          }
        }
    }
  }
}

In the example above we define a custom analyzer that uses the keyword tokenizer together with a custom character filter which replaces 蒼井空 with 666 and 武藤蘭 with 888.

POST pattern_test4/_analyze
{
  "analyzer": "my_analyzer",
  "text": "蒼井空熱愛武藤蘭,可惜后來蒼井空結婚了"
}

The result:

{
  "tokens" : [
    {
      "token" : "666熱愛888,可惜后來666結婚了",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}

The pattern replace character filter (pattern_replace)

The pattern replace character filter (pattern_replace) uses a regular expression to match and replace characters in the string. Be careful with sloppy regular expressions, though: a badly written pattern can slow analysis down considerably!

PUT pattern_test5
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

In the example above we define a custom rule: the pattern (\\d+)-(?=\\d) matches a group of digits followed by a hyphen (with another digit after it), and the replacement $1_ swaps the hyphen for an underscore.

POST pattern_test5/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}

The result:

{
  "tokens" : [
    {
      "token" : "My",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "credit",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "card",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "123_456_789",
      "start_offset" : 18,
      "end_offset" : 29,
      "type" : "<NUM>",
      "position" : 4
    }
  ]
}

That's a rough tour of how elasticsearch analyzes data. You may have noticed that we rarely demonstrated Chinese text: the built-in analyzers simply do not handle Chinese very well. A later post will therefore introduce a heavyweight plugin for this, elasticsearch analysis ik (usually just called the ik analyzer).

Tokenizers

Since elasticsearch ships with analyzers, it also ships with the tokenizers they are built from. A tokenizer, as the name suggests, breaks a text string into small chunks, and those chunks are called tokens.

The standard tokenizer (standard)

The standard tokenizer (standard) is a grammar-based tokenizer that works well for most European languages. It also handles Unicode text, with a default maximum token length of 255 characters (longer tokens are split), and it removes punctuation such as commas and full stops.

POST _analyze
{
  "tokenizer": "standard",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "莎",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "士",
      "start_offset" : 46,
      "end_offset" : 47,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "比",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "亞",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    }
  ]
}
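
The maximum token length is configurable. A minimal sketch (the index name standard_len_test and the value 5 are just examples):

PUT standard_len_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_std_analyzer": {
          "tokenizer": "my_std_tokenizer"
        }
      },
      "tokenizer": {
        "my_std_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST standard_len_test/_analyze
{
  "analyzer": "my_std_analyzer",
  "text": "That is a question"
}

With max_token_length set to 5, tokens longer than 5 characters are split at 5-character intervals, so question comes out as quest and ion.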

The keyword tokenizer (keyword)

The keyword tokenizer (keyword) is a trivial tokenizer that emits the entire text as a single token and hands it to the token filters. It is a good choice when you only want to apply token filters without actually splitting the text (see the sketch after the result below).

POST _analyze
{
  "tokenizer": "keyword",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "To be or not to be,  That is a question ———— 莎士比亞",
      "start_offset" : 0,
      "end_offset" : 49,
      "type" : "word",
      "position" : 0
    }
  ]
}
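
A typical use of the keyword tokenizer is to normalize a whole field without splitting it, for example by combining it with the lowercase token filter:

POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "To Be Or Not To Be"
}

This produces the single token to be or not to be.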

The letter tokenizer (letter)

The letter tokenizer (letter) splits the text into tokens at non-letter characters.

POST _analyze
{
  "tokenizer": "letter",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亞",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}

The lowercase tokenizer (lowercase)

The lowercase tokenizer (lowercase) combines the behavior of the letter tokenizer with that of the lowercase token filter (which, as you would expect, lowercases every token). The main reason it exists as a separate tokenizer is performance: doing both operations in a single pass is faster.

POST _analyze
{
  "tokenizer": "lowercase",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "to",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "that",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亞",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}

The whitespace tokenizer (whitespace)

The whitespace tokenizer (whitespace) splits tokens on whitespace: spaces, tabs, newlines and so on. Note, however, that it does not remove any punctuation.

POST _analyze
{
  "tokenizer": "whitespace",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be,",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "————",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "莎士比亞",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 11
    }
  ]
}

The pattern tokenizer (pattern)

The pattern tokenizer (pattern) lets you specify an arbitrary pattern on which to split the text into tokens.

POST _analyze
{
  "tokenizer": "pattern",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

Now let's hand-build a tokenizer that splits on commas.

PUT pattern_test2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer":{
          "type":"pattern",
          "pattern":","
        }
      }
    }
  }
}

In the example above, the custom analyzer my_analyzer under settings refers to a custom pattern tokenizer named my_tokenizer; alongside the analyzer we define that tokenizer and give it its settings, in this case splitting on commas.

POST pattern_test2/_analyze
{
  "tokenizer": "my_tokenizer",
  "text":"To be or not to be,  That is a question ———— 莎士比亞"
}

The result:

{
  "tokens" : [
    {
      "token" : "To be or not to be",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "  That is a question ———— 莎士比亞",
      "start_offset" : 19,
      "end_offset" : 49,
      "type" : "word",
      "position" : 1
    }
  ]
}

As the result shows, the text is split into two parts at the comma.

The UAX URL email tokenizer (uax_url_email)

For plain English words the standard tokenizer is an excellent choice, but a lot of text nowadays ends with a URL or an email address. Suppose we have a piece of text like this (kept in the original Chinese, since it is exactly what we feed the tokenizer below):

作者:張開
來源:未知 
原文:https://www.cnblogs.com/Neeo/articles/10402742.html
郵箱:xxxxxxx@xx.com
版權聲明:本文為博主原創文章,轉載請附上博文鏈接!

Let's first see what the standard tokenizer does with it:

POST _analyze
{
  "tokenizer": "standard",
  "text":"作者:張開來源:未知原文:https://www.cnblogs.com/Neeo/articles/10402742.html郵箱:xxxxxxx@xx.com版權聲明:本文為博主原創文章,轉載請附上博文鏈接!"
}

The result is long:

{
  "tokens" : [
    {
      "token" : "作",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "者",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "張",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "開",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "來",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "源",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "未",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "知",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "原",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "文",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "https",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "www.cnblogs.com",
      "start_offset" : 21,
      "end_offset" : 36,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "Neeo",
      "start_offset" : 37,
      "end_offset" : 41,
      "type" : "<ALPHANUM>",
      "position" : 12
    },
    {
      "token" : "articles",
      "start_offset" : 42,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 13
    },
    {
      "token" : "10402742",
      "start_offset" : 51,
      "end_offset" : 59,
      "type" : "<NUM>",
      "position" : 14
    },
    {
      "token" : "html",
      "start_offset" : 60,
      "end_offset" : 64,
      "type" : "<ALPHANUM>",
      "position" : 15
    },
    {
      "token" : "郵",
      "start_offset" : 64,
      "end_offset" : 65,
      "type" : "<IDEOGRAPHIC>",
      "position" : 16
    },
    {
      "token" : "箱",
      "start_offset" : 65,
      "end_offset" : 66,
      "type" : "<IDEOGRAPHIC>",
      "position" : 17
    },
    {
      "token" : "xxxxxxx",
      "start_offset" : 67,
      "end_offset" : 74,
      "type" : "<ALPHANUM>",
      "position" : 18
    },
    {
      "token" : "xx.com",
      "start_offset" : 75,
      "end_offset" : 81,
      "type" : "<ALPHANUM>",
      "position" : 19
    },
    {
      "token" : "版",
      "start_offset" : 81,
      "end_offset" : 82,
      "type" : "<IDEOGRAPHIC>",
      "position" : 20
    },
    {
      "token" : "權",
      "start_offset" : 82,
      "end_offset" : 83,
      "type" : "<IDEOGRAPHIC>",
      "position" : 21
    },
    {
      "token" : "聲",
      "start_offset" : 83,
      "end_offset" : 84,
      "type" : "<IDEOGRAPHIC>",
      "position" : 22
    },
    {
      "token" : "明",
      "start_offset" : 84,
      "end_offset" : 85,
      "type" : "<IDEOGRAPHIC>",
      "position" : 23
    },
    {
      "token" : "本",
      "start_offset" : 86,
      "end_offset" : 87,
      "type" : "<IDEOGRAPHIC>",
      "position" : 24
    },
    {
      "token" : "文",
      "start_offset" : 87,
      "end_offset" : 88,
      "type" : "<IDEOGRAPHIC>",
      "position" : 25
    },
    {
      "token" : "為",
      "start_offset" : 88,
      "end_offset" : 89,
      "type" : "<IDEOGRAPHIC>",
      "position" : 26
    },
    {
      "token" : "博",
      "start_offset" : 89,
      "end_offset" : 90,
      "type" : "<IDEOGRAPHIC>",
      "position" : 27
    },
    {
      "token" : "主",
      "start_offset" : 90,
      "end_offset" : 91,
      "type" : "<IDEOGRAPHIC>",
      "position" : 28
    },
    {
      "token" : "原",
      "start_offset" : 91,
      "end_offset" : 92,
      "type" : "<IDEOGRAPHIC>",
      "position" : 29
    },
    {
      "token" : "創",
      "start_offset" : 92,
      "end_offset" : 93,
      "type" : "<IDEOGRAPHIC>",
      "position" : 30
    },
    {
      "token" : "文",
      "start_offset" : 93,
      "end_offset" : 94,
      "type" : "<IDEOGRAPHIC>",
      "position" : 31
    },
    {
      "token" : "章",
      "start_offset" : 94,
      "end_offset" : 95,
      "type" : "<IDEOGRAPHIC>",
      "position" : 32
    },
    {
      "token" : "轉",
      "start_offset" : 96,
      "end_offset" : 97,
      "type" : "<IDEOGRAPHIC>",
      "position" : 33
    },
    {
      "token" : "載",
      "start_offset" : 97,
      "end_offset" : 98,
      "type" : "<IDEOGRAPHIC>",
      "position" : 34
    },
    {
      "token" : "請",
      "start_offset" : 98,
      "end_offset" : 99,
      "type" : "<IDEOGRAPHIC>",
      "position" : 35
    },
    {
      "token" : "附",
      "start_offset" : 99,
      "end_offset" : 100,
      "type" : "<IDEOGRAPHIC>",
      "position" : 36
    },
    {
      "token" : "上",
      "start_offset" : 100,
      "end_offset" : 101,
      "type" : "<IDEOGRAPHIC>",
      "position" : 37
    },
    {
      "token" : "博",
      "start_offset" : 101,
      "end_offset" : 102,
      "type" : "<IDEOGRAPHIC>",
      "position" : 38
    },
    {
      "token" : "文",
      "start_offset" : 102,
      "end_offset" : 103,
      "type" : "<IDEOGRAPHIC>",
      "position" : 39
    },
    {
      "token" : "鏈",
      "start_offset" : 103,
      "end_offset" : 104,
      "type" : "<IDEOGRAPHIC>",
      "position" : 40
    },
    {
      "token" : "接",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "<IDEOGRAPHIC>",
      "position" : 41
    }
  ]
}

In any case, this is not what we want: our email address and URL get chopped to pieces. For this kind of text we should use the UAX URL email tokenizer (uax_url_email), which keeps email addresses and URLs as single tokens.

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text":"作者:張開來源:未知原文:https://www.cnblogs.com/Neeo/articles/10402742.html郵箱:xxxxxxx@xx.com版權聲明:本文為博主原創文章,轉載請附上博文鏈接!"
}

The result:

{
  "tokens" : [
    {
      "token" : "作",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "者",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "張",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "開",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "來",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "源",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "未",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "知",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "原",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "文",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "https://www.cnblogs.com/Neeo/articles/10402742.html",
      "start_offset" : 13,
      "end_offset" : 64,
      "type" : "<URL>",
      "position" : 10
    },
    {
      "token" : "郵",
      "start_offset" : 64,
      "end_offset" : 65,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "箱",
      "start_offset" : 65,
      "end_offset" : 66,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "xxxxxxx@xx.com",
      "start_offset" : 67,
      "end_offset" : 81,
      "type" : "<EMAIL>",
      "position" : 13
    },
    {
      "token" : "版",
      "start_offset" : 81,
      "end_offset" : 82,
      "type" : "<IDEOGRAPHIC>",
      "position" : 14
    },
    {
      "token" : "權",
      "start_offset" : 82,
      "end_offset" : 83,
      "type" : "<IDEOGRAPHIC>",
      "position" : 15
    },
    {
      "token" : "聲",
      "start_offset" : 83,
      "end_offset" : 84,
      "type" : "<IDEOGRAPHIC>",
      "position" : 16
    },
    {
      "token" : "明",
      "start_offset" : 84,
      "end_offset" : 85,
      "type" : "<IDEOGRAPHIC>",
      "position" : 17
    },
    {
      "token" : "本",
      "start_offset" : 86,
      "end_offset" : 87,
      "type" : "<IDEOGRAPHIC>",
      "position" : 18
    },
    {
      "token" : "文",
      "start_offset" : 87,
      "end_offset" : 88,
      "type" : "<IDEOGRAPHIC>",
      "position" : 19
    },
    {
      "token" : "為",
      "start_offset" : 88,
      "end_offset" : 89,
      "type" : "<IDEOGRAPHIC>",
      "position" : 20
    },
    {
      "token" : "博",
      "start_offset" : 89,
      "end_offset" : 90,
      "type" : "<IDEOGRAPHIC>",
      "position" : 21
    },
    {
      "token" : "主",
      "start_offset" : 90,
      "end_offset" : 91,
      "type" : "<IDEOGRAPHIC>",
      "position" : 22
    },
    {
      "token" : "原",
      "start_offset" : 91,
      "end_offset" : 92,
      "type" : "<IDEOGRAPHIC>",
      "position" : 23
    },
    {
      "token" : "創",
      "start_offset" : 92,
      "end_offset" : 93,
      "type" : "<IDEOGRAPHIC>",
      "position" : 24
    },
    {
      "token" : "文",
      "start_offset" : 93,
      "end_offset" : 94,
      "type" : "<IDEOGRAPHIC>",
      "position" : 25
    },
    {
      "token" : "章",
      "start_offset" : 94,
      "end_offset" : 95,
      "type" : "<IDEOGRAPHIC>",
      "position" : 26
    },
    {
      "token" : "轉",
      "start_offset" : 96,
      "end_offset" : 97,
      "type" : "<IDEOGRAPHIC>",
      "position" : 27
    },
    {
      "token" : "載",
      "start_offset" : 97,
      "end_offset" : 98,
      "type" : "<IDEOGRAPHIC>",
      "position" : 28
    },
    {
      "token" : "請",
      "start_offset" : 98,
      "end_offset" : 99,
      "type" : "<IDEOGRAPHIC>",
      "position" : 29
    },
    {
      "token" : "附",
      "start_offset" : 99,
      "end_offset" : 100,
      "type" : "<IDEOGRAPHIC>",
      "position" : 30
    },
    {
      "token" : "上",
      "start_offset" : 100,
      "end_offset" : 101,
      "type" : "<IDEOGRAPHIC>",
      "position" : 31
    },
    {
      "token" : "博",
      "start_offset" : 101,
      "end_offset" : 102,
      "type" : "<IDEOGRAPHIC>",
      "position" : 32
    },
    {
      "token" : "文",
      "start_offset" : 102,
      "end_offset" : 103,
      "type" : "<IDEOGRAPHIC>",
      "position" : 33
    },
    {
      "token" : "鏈",
      "start_offset" : 103,
      "end_offset" : 104,
      "type" : "<IDEOGRAPHIC>",
      "position" : 34
    },
    {
      "token" : "接",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "<IDEOGRAPHIC>",
      "position" : 35
    }
  ]
}

The path hierarchy tokenizer (path_hierarchy)

The path hierarchy tokenizer (path_hierarchy) lets you index filesystem paths in a particular way, so that at search time files sharing the same path prefix are returned together.

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text":"/usr/local/python/python2.7"
}

The result:

{
  "tokens" : [
    {
      "token" : "/usr",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/python",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/python/python2.7",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "word",
      "position" : 0
    }
  ]
}
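
The delimiter is configurable (along with options such as replacement and reverse). A minimal sketch (the index name path_test and the hyphen delimiter are just examples):

PUT path_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_path_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  }
}

POST path_test/_analyze
{
  "tokenizer": "my_path_tokenizer",
  "text": "one-two-three"
}

This should yield the tokens one, one-two and one-two-three.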

Token filters

elasticsearch ships with a huge number of built-in token filters (a frankly absurd number, though happily you will rarely need most of them!).

Common token filters

Here are just a few of the common token filters (token filter):

  • The standard token filter (Standard Token Filter) was deprecated in 6.5.0 and will be removed in the next major version. Even before that it did very little; in much older versions of Lucene it stripped a trailing s from words and removed unnecessary periods, but nowadays even those small jobs are handled by other tokenizers and token filters.
  • The ASCII folding token filter (ASCII Folding Token Filter) converts alphabetic, numeric and symbolic Unicode characters that are not in the first 127 ASCII characters (the Basic Latin Unicode block) into their ASCII equivalents, if such equivalents exist (see the quick example after this list).
  • The flatten graph token filter (Flatten Graph Token Filter) accepts an arbitrary graph token stream, such as the one produced by the synonym graph token filter, and flattens it into a single linear chain of tokens suitable for indexing. This is a lossy process, since separate side paths are squashed on top of one another, but it is necessary if you want to use a graph token stream at index time, because a Lucene index currently cannot represent a graph. For this reason it is best to apply graph analyzers only at search time, where the full graph structure is preserved and proximity queries match correctly. This feature is experimental in Lucene.
  • The length token filter (Length Token Filter) removes tokens that are too long or too short. It is configurable; we can set the limits in the index settings.
  • The lowercase token filter (Lowercase Token Filter) normalizes tokens to lowercase; via its language parameter it also supports Greek, Irish and Turkish lowercasing.
  • The uppercase token filter (Uppercase Token Filter) normalizes tokens to uppercase.

The remaining token filters are not listed one by one here; see the official documentation (the links at the end of this post) for the full set.
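
As a quick check of the ASCII folding filter mentioned above, we can call it directly through the _analyze API:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["asciifolding"],
  "text": "açaí à la carte"
}

The accented characters are folded to their ASCII equivalents, giving the tokens acai, a, la and carte.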

Custom token filters

Next let's define a couple of token filters of our own, starting with a length token filter.

PUT pattern_test3
{
  "settings": {
    "analysis": {
      "filter": {
        "my_test_length":{
          "type":"length",
          "max":8,
          "min":2
        }
      }
    }
  }
}

In the example above we define a length filter that removes tokens longer than 8 or shorter than 2 characters.
Note that the max parameter is the maximum token length and defaults to Integer.MAX_VALUE, i.e. 2147483647 (\(2^{31}-1\)), while min is the minimum length and defaults to 0.

POST pattern_test3/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_test_length"],
  "text":"a Small word and a longerword"
}

The result:

{
  "tokens" : [
    {
      "token" : "Small",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "word",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "and",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

A custom lowercase token filter

Now let's define a custom lowercase token filter for Greek text:

PUT lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}

The text to analyze:

POST lowercase_example/_analyze
{
  "tokenizer": "standard",
  "filter": ["greek_lowercase"],
  "text":"Ένα φίλτρο διακριτικού τύπου πεζά s ομαλοποιεί το κείμενο διακριτικού σε χαμηλότερη θήκη"
}

The result:

{
  "tokens" : [
    {
      "token" : "ενα",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "φιλτρο",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "διακριτικου",
      "start_offset" : 11,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "τυπου",
      "start_offset" : 23,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "πεζα",
      "start_offset" : 29,
      "end_offset" : 33,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "s",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "ομαλοποιει",
      "start_offset" : 36,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "το",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "κειμενο",
      "start_offset" : 50,
      "end_offset" : 57,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "διακριτικου",
      "start_offset" : 58,
      "end_offset" : 69,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "σε",
      "start_offset" : 70,
      "end_offset" : 72,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "χαμηλοτερη",
      "start_offset" : 73,
      "end_offset" : 83,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "θηκη",
      "start_offset" : 84,
      "end_offset" : 88,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}

Multiple token filters

We can also chain several token filters. For example, we can apply a length filter and the lowercase filter at the same time.

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["length","lowercase"],
  "text":"a Small word and a longerword"
}

In the example above we pass the filters as a list; they are applied in order. (Note that the built-in length filter used here keeps its default limits, so it does not actually remove anything; to really filter by length you would reference a configured filter such as my_test_length from the earlier example.)
The result:

{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "small",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "word",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "and",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "longerword",
      "start_offset" : 19,
      "end_offset" : 29,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

Corrections are most welcome. That's all!

see also: [elasticsearch analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) | [elasticsearch tokenizers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html) | [Standard Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-tokenfilter.html) | [ASCII Folding Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html) | [Flatten Graph Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-flatten-graph-tokenfilter.html) | [Length Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html) | [Lowercase Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenfilter.html) | [Uppercase Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-uppercase-tokenfilter.html) | [NGram Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenfilter.html) | [Edge NGram Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html#analysis-edgengram-tokenfilter) | [Porter Stem Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-porterstem-tokenfilter.html) | [Shingle Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html) | [Stop Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html) | [Word Delimiter Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html) | [Word Delimiter Graph Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html) | [Multiplexer Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-multiplexer-tokenfilter.html) | [Conditional Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-condition-tokenfilter.html) | [Predicate Token Filter Script](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-predicatefilter-tokenfilter.html) | [Stemmer Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html) | [Stemmer Override Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-override-tokenfilter.html) | [Keyword Marker Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html) | [Keyword Repeat Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-repeat-tokenfilter.html) | [KStem Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-kstem-tokenfilter.html) | [Snowball Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-tokenfilter.html) | [Phonetic Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-phonetic-tokenfilter.html) | [Synonym Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html) | [Parsing synonym files](https://www.elastic.co/guide/en/elasticsearch/reference/current/_parsing_synonym_files.html) | [Synonym Graph Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html) | [Parsing synonym files](https://www.elastic.co/guide/en/elasticsearch/reference/current/_parsing_synonym_files_2.html) | [Compound Word Token Filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-compound-word-tokenfilter.html) | [Reverse Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html) | [Elision Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-elision-tokenfilter.html) | [Truncate Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-truncate-tokenfilter.html) | [Unique Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html) | [Pattern Capture Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-capture-tokenfilter.html) | [Pattern Replace Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern_replace-tokenfilter.html) | [Trim Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-trim-tokenfilter.html) | [Limit Token Count Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-limit-token-count-tokenfilter.html) | [Hunspell Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hunspell-tokenfilter.html) | [Common Grams Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-common-grams-tokenfilter.html) | [Normalization Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalization-tokenfilter.html) | [CJK Width Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-cjk-width-tokenfilter.html) | [CJK Bigram Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-cjk-bigram-tokenfilter.html) | [Delimited Payload Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-delimited-payload-tokenfilter.html) | [Keep Words Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keep-words-tokenfilter.html) | [Keep Types Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keep-types-tokenfilter.html) | [Exclude mode settings example](https://www.elastic.co/guide/en/elasticsearch/reference/current/_exclude_mode_settings_example.html) | [Classic Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-classic-tokenfilter.html) | [Apostrophe Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-apostrophe-tokenfilter.html) | [Decimal Digit Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-decimal-digit-tokenfilter.html) | [Fingerprint Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-tokenfilter.html) | [Minhash Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-minhash-tokenfilter.html) | [Remove Duplicates Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-remove-duplicates-tokenfilter.html) | [Character Filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html) | [HTML Strip Char Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html) | [Mapping Char Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html) | [Pattern Replace Char Filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html)

