Preface
Elasticsearch ships with built-in analyzers, and every analyzer contains a tokenizer. A tokenizer, as the name suggests, breaks a text string down into small chunks, and these chunks are called tokens.
Standard tokenizer: standard
The standard tokenizer (standard) is a grammar-based tokenizer that works well for most European languages. It also handles Unicode text segmentation, with a default maximum token length of 255 characters, and it strips out punctuation such as commas and periods.
POST _analyze
{
  "tokenizer": "standard",
  "text": "To be or not to be, That is a question ———— 莎士比亞"
}
The result:
{
"tokens" : [
{
"token" : "To",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "be",
"start_offset" : 3,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "or",
"start_offset" : 6,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "not",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "to",
"start_offset" : 13,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "be",
"start_offset" : 16,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "That",
"start_offset" : 21,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "a",
"start_offset" : 29,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "莎",
"start_offset" : 45,
"end_offset" : 46,
"type" : "<IDEOGRAPHIC>",
"position" : 10
},
{
"token" : "士",
"start_offset" : 46,
"end_offset" : 47,
"type" : "<IDEOGRAPHIC>",
"position" : 11
},
{
"token" : "比",
"start_offset" : 47,
"end_offset" : 48,
"type" : "<IDEOGRAPHIC>",
"position" : 12
},
{
"token" : "亞",
"start_offset" : 48,
"end_offset" : 49,
"type" : "<IDEOGRAPHIC>",
"position" : 13
}
]
}
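The maximum token length can be tuned. Below is a minimal sketch using max_token_length, a documented parameter of the standard tokenizer; the index name standard_test and the tokenizer name my_tokenizer are placeholders:
PUT standard_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST standard_test/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "To be or not to be, That is a question"
}
With the limit set to 5, any token longer than five characters (such as question) is split at five-character intervals.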
Keyword tokenizer: keyword
The keyword tokenizer (keyword) is a trivial tokenizer: it emits the entire input text as a single token and passes it on to the token filters. It is a good choice when you only want to apply token filters without actually splitting the text.
POST _analyze
{
  "tokenizer": "keyword",
  "text": "To be or not to be, That is a question ———— 莎士比亞"
}
The result:
{
"tokens" : [
{
"token" : "To be or not to be, That is a question ———— 莎士比亞",
"start_offset" : 0,
"end_offset" : 49,
"type" : "word",
"position" : 0
}
]
}
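Since the keyword tokenizer is mostly useful in combination with token filters, here is a quick sketch that keeps the whole text as a single token but lowercases it (the _analyze API accepts a filter list alongside the tokenizer):
POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "To Be Or Not To Be"
}
This should return the single token to be or not to be.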
Letter tokenizer: letter
The letter tokenizer (letter) splits text into tokens whenever it encounters a character that is not a letter.
POST _analyze
{
  "tokenizer": "letter",
  "text": "To be or not to be, That is a question ———— 莎士比亞"
}
The result:
{
"tokens" : [
{
"token" : "To",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "be",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 1
},
{
"token" : "or",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "not",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
},
{
"token" : "to",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 4
},
{
"token" : "be",
"start_offset" : 16,
"end_offset" : 18,
"type" : "word",
"position" : 5
},
{
"token" : "That",
"start_offset" : 21,
"end_offset" : 25,
"type" : "word",
"position" : 6
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 7
},
{
"token" : "a",
"start_offset" : 29,
"end_offset" : 30,
"type" : "word",
"position" : 8
},
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "word",
"position" : 9
},
{
"token" : "莎士比亞",
"start_offset" : 45,
"end_offset" : 49,
"type" : "word",
"position" : 10
}
]
}
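Note that an apostrophe is not a letter either, so contractions get split too. A quick sketch:
POST _analyze
{
  "tokenizer": "letter",
  "text": "You're right"
}
This should yield the tokens You, re, and right.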
Lowercase tokenizer: lowercase
The lowercase tokenizer (lowercase) combines the behavior of the regular letter tokenizer and the lowercase token filter (which, as you would guess, converts every token to lowercase). The main reason this exists as a single tokenizer is performance: doing both operations in one pass is faster.
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "To be or not to be, That is a question ———— 莎士比亞"
}
The result:
{
"tokens" : [
{
"token" : "to",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "be",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 1
},
{
"token" : "or",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "not",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
},
{
"token" : "to",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 4
},
{
"token" : "be",
"start_offset" : 16,
"end_offset" : 18,
"type" : "word",
"position" : 5
},
{
"token" : "that",
"start_offset" : 21,
"end_offset" : 25,
"type" : "word",
"position" : 6
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 7
},
{
"token" : "a",
"start_offset" : 29,
"end_offset" : 30,
"type" : "word",
"position" : 8
},
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "word",
"position" : 9
},
{
"token" : "莎士比亞",
"start_offset" : 45,
"end_offset" : 49,
"type" : "word",
"position" : 10
}
]
}
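To see the equivalence described above, the following sketch chains the letter tokenizer with the lowercase token filter; it should produce the same token stream as the lowercase tokenizer, just in two steps:
POST _analyze
{
  "tokenizer": "letter",
  "filter": ["lowercase"],
  "text": "To be or not to be, That is a question ———— 莎士比亞"
}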
Whitespace tokenizer: whitespace
The whitespace tokenizer (whitespace) splits tokens on whitespace only: spaces, tabs, line breaks, and so on. Note, however, that it does not remove any punctuation, which is why the token be, below keeps its trailing comma.
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "To be or not to be, That is a question ———— 莎士比亞"
}
The result:
{
"tokens" : [
{
"token" : "To",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "be",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 1
},
{
"token" : "or",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "not",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
},
{
"token" : "to",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 4
},
{
"token" : "be,",
"start_offset" : 16,
"end_offset" : 19,
"type" : "word",
"position" : 5
},
{
"token" : "That",
"start_offset" : 21,
"end_offset" : 25,
"type" : "word",
"position" : 6
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 7
},
{
"token" : "a",
"start_offset" : 29,
"end_offset" : 30,
"type" : "word",
"position" : 8
},
{
"token" : "question",
"start_offset" : 31,
"end_offset" : 39,
"type" : "word",
"position" : 9
},
{
"token" : "————",
"start_offset" : 40,
"end_offset" : 44,
"type" : "word",
"position" : 10
},
{
"token" : "莎士比亞",
"start_offset" : 45,
"end_offset" : 49,
"type" : "word",
"position" : 11
}
]
}
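Like the standard tokenizer, the whitespace tokenizer accepts a max_token_length parameter (default 255); tokens longer than the limit are split at max_token_length intervals. A minimal sketch, assuming a placeholder index named whitespace_test:
PUT whitespace_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "whitespace",
          "max_token_length": 5
        }
      }
    }
  }
}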
Pattern tokenizer: pattern
The pattern tokenizer (pattern) splits text into tokens using an arbitrary regular expression that you specify; the default pattern is \W+, which splits on any run of non-word characters.
POST _analyze
{
  "tokenizer": "pattern",
  "text": "To be or not to be, That is a question ———— 莎士比亞"
}
Now let's hand-build a custom tokenizer that splits on commas.
PUT pattern_test2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
In the example above, the custom analyzer my_analyzer under settings references a custom pattern tokenizer named my_tokenizer; at the same level as the analyzer section we define that tokenizer and set its properties, here a pattern that splits on commas.
POST pattern_test2/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "To be or not to be, That is a question ———— 莎士比亞"
}
The result:
{
"tokens" : [
{
"token" : "To be or not to be",
"start_offset" : 0,
"end_offset" : 18,
"type" : "word",
"position" : 0
},
{
"token" : " That is a question ———— 莎士比亞",
"start_offset" : 19,
"end_offset" : 49,
"type" : "word",
"position" : 1
}
]
}
As the result shows, the text was split into two parts at the comma.
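Besides pattern, the pattern tokenizer also accepts flags (Java regular-expression flags) and group. Setting group makes it emit the matched capture group as the token instead of splitting on the match. A sketch of that documented option, assuming a placeholder index named pattern_test3, that extracts the values between double quotes:
PUT pattern_test3
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"(.*?)\"",
          "group": 1
        }
      }
    }
  }
}

POST pattern_test3/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "\"value1\", \"value2\""
}
This should yield the tokens value1 and value2.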
UAX URL email tokenizer: uax_url_email
The standard tokenizer is an excellent choice when dealing with individual English words, but many documents nowadays end with a URL or an email address. Say we have the following text:
作者:張開
來源:未知
原文:https://www.cnblogs.com/Neeo/articles/10402742.html
郵箱:xxxxxxx@xx.com
版權聲明:本文為博主原創文章,轉載請附上博文鏈接!
Let's first run it through the standard tokenizer:
POST _analyze
{
  "tokenizer": "standard",
  "text": "作者:張開來源:未知原文:https://www.cnblogs.com/Neeo/articles/10402742.html郵箱:xxxxxxx@xx.com版權聲明:本文為博主原創文章,轉載請附上博文鏈接!"
}
The result is long:
{
"tokens" : [
{
"token" : "作",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "者",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "張",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "開",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "來",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "源",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "未",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 6
},
{
"token" : "知",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 7
},
{
"token" : "原",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 8
},
{
"token" : "文",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<IDEOGRAPHIC>",
"position" : 9
},
{
"token" : "https",
"start_offset" : 13,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "www.cnblogs.com",
"start_offset" : 21,
"end_offset" : 36,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "Neeo",
"start_offset" : 37,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "articles",
"start_offset" : 42,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 13
},
{
"token" : "10402742",
"start_offset" : 51,
"end_offset" : 59,
"type" : "<NUM>",
"position" : 14
},
{
"token" : "html",
"start_offset" : 60,
"end_offset" : 64,
"type" : "<ALPHANUM>",
"position" : 15
},
{
"token" : "郵",
"start_offset" : 64,
"end_offset" : 65,
"type" : "<IDEOGRAPHIC>",
"position" : 16
},
{
"token" : "箱",
"start_offset" : 65,
"end_offset" : 66,
"type" : "<IDEOGRAPHIC>",
"position" : 17
},
{
"token" : "xxxxxxx",
"start_offset" : 67,
"end_offset" : 74,
"type" : "<ALPHANUM>",
"position" : 18
},
{
"token" : "xx.com",
"start_offset" : 75,
"end_offset" : 81,
"type" : "<ALPHANUM>",
"position" : 19
},
{
"token" : "版",
"start_offset" : 81,
"end_offset" : 82,
"type" : "<IDEOGRAPHIC>",
"position" : 20
},
{
"token" : "權",
"start_offset" : 82,
"end_offset" : 83,
"type" : "<IDEOGRAPHIC>",
"position" : 21
},
{
"token" : "聲",
"start_offset" : 83,
"end_offset" : 84,
"type" : "<IDEOGRAPHIC>",
"position" : 22
},
{
"token" : "明",
"start_offset" : 84,
"end_offset" : 85,
"type" : "<IDEOGRAPHIC>",
"position" : 23
},
{
"token" : "本",
"start_offset" : 86,
"end_offset" : 87,
"type" : "<IDEOGRAPHIC>",
"position" : 24
},
{
"token" : "文",
"start_offset" : 87,
"end_offset" : 88,
"type" : "<IDEOGRAPHIC>",
"position" : 25
},
{
"token" : "為",
"start_offset" : 88,
"end_offset" : 89,
"type" : "<IDEOGRAPHIC>",
"position" : 26
},
{
"token" : "博",
"start_offset" : 89,
"end_offset" : 90,
"type" : "<IDEOGRAPHIC>",
"position" : 27
},
{
"token" : "主",
"start_offset" : 90,
"end_offset" : 91,
"type" : "<IDEOGRAPHIC>",
"position" : 28
},
{
"token" : "原",
"start_offset" : 91,
"end_offset" : 92,
"type" : "<IDEOGRAPHIC>",
"position" : 29
},
{
"token" : "創",
"start_offset" : 92,
"end_offset" : 93,
"type" : "<IDEOGRAPHIC>",
"position" : 30
},
{
"token" : "文",
"start_offset" : 93,
"end_offset" : 94,
"type" : "<IDEOGRAPHIC>",
"position" : 31
},
{
"token" : "章",
"start_offset" : 94,
"end_offset" : 95,
"type" : "<IDEOGRAPHIC>",
"position" : 32
},
{
"token" : "轉",
"start_offset" : 96,
"end_offset" : 97,
"type" : "<IDEOGRAPHIC>",
"position" : 33
},
{
"token" : "載",
"start_offset" : 97,
"end_offset" : 98,
"type" : "<IDEOGRAPHIC>",
"position" : 34
},
{
"token" : "請",
"start_offset" : 98,
"end_offset" : 99,
"type" : "<IDEOGRAPHIC>",
"position" : 35
},
{
"token" : "附",
"start_offset" : 99,
"end_offset" : 100,
"type" : "<IDEOGRAPHIC>",
"position" : 36
},
{
"token" : "上",
"start_offset" : 100,
"end_offset" : 101,
"type" : "<IDEOGRAPHIC>",
"position" : 37
},
{
"token" : "博",
"start_offset" : 101,
"end_offset" : 102,
"type" : "<IDEOGRAPHIC>",
"position" : 38
},
{
"token" : "文",
"start_offset" : 102,
"end_offset" : 103,
"type" : "<IDEOGRAPHIC>",
"position" : 39
},
{
"token" : "鏈",
"start_offset" : 103,
"end_offset" : 104,
"type" : "<IDEOGRAPHIC>",
"position" : 40
},
{
"token" : "接",
"start_offset" : 104,
"end_offset" : 105,
"type" : "<IDEOGRAPHIC>",
"position" : 41
}
]
}
Clearly this is not what we want: our email address and URL have been chopped to pieces! For cases like this we should use the UAX URL email tokenizer (uax_url_email), which preserves both URLs and email addresses as single tokens.
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "作者:張開來源:未知原文:https://www.cnblogs.com/Neeo/articles/10402742.html郵箱:xxxxxxx@xx.com版權聲明:本文為博主原創文章,轉載請附上博文鏈接!"
}
The result:
{
"tokens" : [
{
"token" : "作",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "者",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "張",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "開",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "來",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "源",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "未",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 6
},
{
"token" : "知",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 7
},
{
"token" : "原",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 8
},
{
"token" : "文",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<IDEOGRAPHIC>",
"position" : 9
},
{
"token" : "https://www.cnblogs.com/Neeo/articles/10402742.html",
"start_offset" : 13,
"end_offset" : 64,
"type" : "<URL>",
"position" : 10
},
{
"token" : "郵",
"start_offset" : 64,
"end_offset" : 65,
"type" : "<IDEOGRAPHIC>",
"position" : 11
},
{
"token" : "箱",
"start_offset" : 65,
"end_offset" : 66,
"type" : "<IDEOGRAPHIC>",
"position" : 12
},
{
"token" : "xxxxxxx@xx.com",
"start_offset" : 67,
"end_offset" : 81,
"type" : "<EMAIL>",
"position" : 13
},
{
"token" : "版",
"start_offset" : 81,
"end_offset" : 82,
"type" : "<IDEOGRAPHIC>",
"position" : 14
},
{
"token" : "權",
"start_offset" : 82,
"end_offset" : 83,
"type" : "<IDEOGRAPHIC>",
"position" : 15
},
{
"token" : "聲",
"start_offset" : 83,
"end_offset" : 84,
"type" : "<IDEOGRAPHIC>",
"position" : 16
},
{
"token" : "明",
"start_offset" : 84,
"end_offset" : 85,
"type" : "<IDEOGRAPHIC>",
"position" : 17
},
{
"token" : "本",
"start_offset" : 86,
"end_offset" : 87,
"type" : "<IDEOGRAPHIC>",
"position" : 18
},
{
"token" : "文",
"start_offset" : 87,
"end_offset" : 88,
"type" : "<IDEOGRAPHIC>",
"position" : 19
},
{
"token" : "為",
"start_offset" : 88,
"end_offset" : 89,
"type" : "<IDEOGRAPHIC>",
"position" : 20
},
{
"token" : "博",
"start_offset" : 89,
"end_offset" : 90,
"type" : "<IDEOGRAPHIC>",
"position" : 21
},
{
"token" : "主",
"start_offset" : 90,
"end_offset" : 91,
"type" : "<IDEOGRAPHIC>",
"position" : 22
},
{
"token" : "原",
"start_offset" : 91,
"end_offset" : 92,
"type" : "<IDEOGRAPHIC>",
"position" : 23
},
{
"token" : "創",
"start_offset" : 92,
"end_offset" : 93,
"type" : "<IDEOGRAPHIC>",
"position" : 24
},
{
"token" : "文",
"start_offset" : 93,
"end_offset" : 94,
"type" : "<IDEOGRAPHIC>",
"position" : 25
},
{
"token" : "章",
"start_offset" : 94,
"end_offset" : 95,
"type" : "<IDEOGRAPHIC>",
"position" : 26
},
{
"token" : "轉",
"start_offset" : 96,
"end_offset" : 97,
"type" : "<IDEOGRAPHIC>",
"position" : 27
},
{
"token" : "載",
"start_offset" : 97,
"end_offset" : 98,
"type" : "<IDEOGRAPHIC>",
"position" : 28
},
{
"token" : "請",
"start_offset" : 98,
"end_offset" : 99,
"type" : "<IDEOGRAPHIC>",
"position" : 29
},
{
"token" : "附",
"start_offset" : 99,
"end_offset" : 100,
"type" : "<IDEOGRAPHIC>",
"position" : 30
},
{
"token" : "上",
"start_offset" : 100,
"end_offset" : 101,
"type" : "<IDEOGRAPHIC>",
"position" : 31
},
{
"token" : "博",
"start_offset" : 101,
"end_offset" : 102,
"type" : "<IDEOGRAPHIC>",
"position" : 32
},
{
"token" : "文",
"start_offset" : 102,
"end_offset" : 103,
"type" : "<IDEOGRAPHIC>",
"position" : 33
},
{
"token" : "鏈",
"start_offset" : 103,
"end_offset" : 104,
"type" : "<IDEOGRAPHIC>",
"position" : 34
},
{
"token" : "接",
"start_offset" : 104,
"end_offset" : 105,
"type" : "<IDEOGRAPHIC>",
"position" : 35
}
]
}
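The uax_url_email tokenizer also accepts a max_token_length parameter (default 255); tokens exceeding the limit are split at max_token_length intervals, so even very long links get broken up. A minimal sketch, assuming a placeholder index named uax_test:
PUT uax_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": 5
        }
      }
    }
  }
}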
Path hierarchy tokenizer: path_hierarchy
The path hierarchy tokenizer (path_hierarchy) lets you index filesystem paths in such a way that, at search time, files sharing the same path prefix are returned together.
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/python/python2.7"
}
The result:
{
"tokens" : [
{
"token" : "/usr",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "/usr/local",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 0
},
{
"token" : "/usr/local/python",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 0
},
{
"token" : "/usr/local/python/python2.7",
"start_offset" : 0,
"end_offset" : 27,
"type" : "word",
"position" : 0
}
]
}
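The path_hierarchy tokenizer is configurable: delimiter sets the separator character (default /), replacement substitutes a different separator in the emitted tokens, and reverse set to true emits suffixes instead of prefixes. A minimal sketch, assuming a placeholder index named path_test:
PUT path_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/"
        }
      }
    }
  }
}

POST path_test/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "one-two-three"
}
This should yield the tokens one, one/two, and one/two/three.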
See also: [elasticsearch tokenizers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html). Corrections welcome; that's all.