Introduction to Analyzers and How to Use Them
What is an analyzer?
An analyzer is a tool that takes a piece of user-input text and, following a set of rules, breaks it into individual terms.
Commonly used built-in analyzers
standard analyzer、simple analyzer、whitespace analyzer、stop analyzer、language analyzer、pattern analyzer
standard analyzer
The standard analyzer is the default; it is used whenever no analyzer is specified.
POST localhost:9200/_analyze
Parameters:
{
"analyzer":"standard",
"text":"The best 3-points shooter is Curry!"
}
Response:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "3",
"start_offset": 9,
"end_offset": 10,
"type": "<NUM>",
"position": 2
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 6
}
]
}
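For this English sentence, the standard analyzer's output can be approximated with a short Python sketch. This is only an illustration of the behavior shown above, not the actual Lucene implementation (which handles full Unicode word segmentation):

```python
import re

def standard_like(text):
    # Lowercase, then extract runs of word characters:
    # "3-points" splits into "3" and "points", and "!" is dropped.
    return re.findall(r"\w+", text.lower())

print(standard_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```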
simple analyzer
The simple analyzer splits text into terms on any character that is not a letter, and lowercases all terms.
POST localhost:9200/_analyze
Parameters:
{
"analyzer":"simple",
"text":"The best 3-points shooter is Curry!"
}
Response:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "word",
"position": 4
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "word",
"position": 5
}
]
}
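The simple analyzer's behavior can likewise be sketched in Python. Note how the digit "3" acts as a separator and disappears entirely, matching the response above. Again, this is a rough illustration, not the real implementation:

```python
import re

def simple_like(text):
    # Split on any run of non-letter characters and lowercase;
    # "3" is not a letter, so it vanishes and "points" survives alone.
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

print(simple_like("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```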
whitespace analyzer
The whitespace analyzer splits text into terms whenever it encounters a whitespace character; terms keep their original case and punctuation.
POST localhost:9200/_analyze
Parameters:
{
"analyzer":"whitespace",
"text":"The best 3-points shooter is Curry!"
}
Response:
{
"tokens": [
{
"token": "The",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "3-points",
"start_offset": 9,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "word",
"position": 4
},
{
"token": "Curry!",
"start_offset": 29,
"end_offset": 35,
"type": "word",
"position": 5
}
]
}
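The whitespace analyzer is the easiest to mimic, since it only splits on whitespace. A minimal sketch of the behavior shown above:

```python
def whitespace_like(text):
    # Split only on whitespace; case and punctuation are preserved,
    # so "Curry!" and "3-points" stay as single tokens.
    return text.split()

print(whitespace_like("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```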
stop analyzer
The stop analyzer works like the simple analyzer, except that it also removes stop words; by default it uses the English stop-word list.
stopwords: a predefined list of stop words, e.g. (the, a, an, this, of, at), etc.
POST localhost:9200/_analyze
Parameters:
{
"analyzer":"stop",
"text":"The best 3-points shooter is Curry!"
}
Response:
{
"tokens": [
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "word",
"position": 5
}
]
}
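The stop analyzer can be sketched as the simple analyzer followed by a stop-word filter. The stop-word set below is a tiny subset for illustration only, not the full English list Elasticsearch uses:

```python
import re

STOPWORDS = {"the", "a", "an", "this", "of", "at", "is"}  # tiny illustrative subset

def stop_like(text):
    # Tokenize like the simple analyzer, then drop stop words;
    # "the" and "is" are removed, matching the response above.
    terms = [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]
    return [t for t in terms if t not in STOPWORDS]

print(stop_like("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```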
language analyzer
Language-specific analyzers (for example, english for English). A language analyzer removes that language's stop words and stems terms (note "points" → "point" and "curry" → "curri" in the response below). Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
POST localhost:9200/_analyze
Parameters:
{
"analyzer":"english",
"text":"The best 3-points shooter is Curry!"
}
Response:
{
"tokens": [
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "3",
"start_offset": 9,
"end_offset": 10,
"type": "<NUM>",
"position": 2
},
{
"token": "point",
"start_offset": 11,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "curri",
"start_offset": 29,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 6
}
]
}
pattern analyzer
The pattern analyzer splits text into terms using a regular expression; the default pattern is \W+ (one or more non-word characters).
POST localhost:9200/_analyze
Parameters:
{
"analyzer":"pattern",
"text":"The best 3-points shooter is Curry!"
}
Response:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "3",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 2
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "word",
"position": 3
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 4
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "word",
"position": 5
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "word",
"position": 6
}
]
}
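With the default \W+ pattern, the pattern analyzer's output for this sentence happens to match the standard analyzer's. A sketch of splitting on the pattern and lowercasing:

```python
import re

def pattern_like(text, pattern=r"\W+"):
    # Split on the pattern (default \W+, runs of non-word characters)
    # and lowercase the resulting terms.
    return [t for t in re.split(pattern, text.lower()) if t]

print(pattern_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```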
Usage example
PUT localhost:9200/my_index
Parameters:
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"whitespace"
}
}
}
},
"mappings":{
"properties":{
"name":{
"type":"text"
},
"team_name":{
"type":"text"
},
"position":{
"type":"text"
},
"play_year":{
"type":"long"
},
"jerse_no":{
"type":"keyword"
},
"title":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
}
Response:
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "my_index"
}
PUT localhost:9200/my_index/_doc/1
Parameters:
{
"name":"庫⾥里里",
"team_name":"勇⼠士",
"position":"控球后衛",
"play_year":10,
"jerse_no":"30",
"title":"The best 3-points shooter is Curry!"
}
Response:
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 1
}
POST localhost:9200/my_index/_search
Parameters:
{
"query":{
"match":{
"title":"Curry!"
}
}
}
Response:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "庫⾥里里",
"team_name": "勇⼠士",
"position": "控球后衛",
"play_year": 10,
"jerse_no": "30",
"title": "The best 3-points shooter is Curry!"
}
}
]
}
}
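This query hits because both the stored title and the query string are analyzed by the field's whitespace analyzer, so the exact token "Curry!" appears on both sides; a query for plain "Curry" would not match, since the stored token includes the exclamation mark. A hypothetical sketch of that term-overlap check (not the real Lucene scoring logic):

```python
def analyze(text):
    # Whitespace analyzer: split on whitespace only,
    # keeping case and punctuation intact.
    return text.split()

doc_terms = set(analyze("The best 3-points shooter is Curry!"))

# A match query hits when at least one analyzed query term
# appears among the field's indexed terms.
print(bool(doc_terms & set(analyze("Curry!"))))  # matches: token "Curry!" is indexed
print(bool(doc_terms & set(analyze("Curry"))))   # no match: "Curry" != "Curry!"
```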
Common Chinese analyzers and how to use them
With the default standard analyzer, Chinese text is split into individual characters, one token per character.
POST localhost:9200/_analyze
Parameters:
{
"analyzer": "standard",
"text": "火箭明年總冠軍"
}
Response:
{
"tokens": [
{
"token": "火",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "箭",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "明",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "年",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "總",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "冠",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "軍",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
}
]
}
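For a string made up purely of CJK ideographs, this one-token-per-character behavior can be sketched trivially in Python (a rough illustration only; the real analyzer handles mixed scripts and punctuation too):

```python
def standard_like_cjk(text):
    # The standard analyzer emits each CJK ideograph as its own token.
    return list(text)

print(standard_like_cjk("火箭明年總冠軍"))
# ['火', '箭', '明', '年', '總', '冠', '軍']
```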
Common Chinese analyzers
smartCN analyzer: a simple analyzer for Chinese and mixed Chinese-English text
IK analyzer: a smarter, more flexible Chinese analyzer
smartCN
Installation: go to the bin directory and run
- Linux: sh elasticsearch-plugin install analysis-smartcn
- Windows: .\elasticsearch-plugin install analysis-smartcn

Restart Elasticsearch after installation.
POST localhost:9200/_analyze
Parameters:
{
"analyzer": "smartcn",
"text": "火箭明年總冠軍"
}
Response:
{
"tokens": [
{
"token": "火箭",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "明年",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "總",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2
},
{
"token": "冠軍",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 3
}
]
}
IK analyzer
Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
Install: unzip into the plugins directory.
Restart Elasticsearch after installation.
POST localhost:9200/_analyze
Parameters:
{
"analyzer": "ik_max_word",
"text": "火箭明年總冠軍"
}
Response:
{
"tokens": [
{
"token": "火箭",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "明年",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "總冠軍",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
},
{
"token": "冠軍",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 3
}
]
}
