Part 1: Overview
1. Analyzer
An analyzer is a tool that splits a piece of user-input text into individual terms according to a set of rules.
2. Built-in analyzers
standard analyzer
simple analyzer
whitespace analyzer
stop analyzer
language analyzer
pattern analyzer
Part 2: Analyzer Tests
1. standard analyzer
The standard analyzer is the default; it is used whenever no analyzer is specified.
POST /_analyze
{
"analyzer": "standard",
"text": "The best 3-points shooter is Curry!"
}
Result:
Note the token, start_offset, end_offset, type, and position fields in the output. The '-' is gone.
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "3",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "points",
"start_offset" : 11,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "curry",
"start_offset" : 29,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 6
}
]
}
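For plain English text like this, the standard analyzer's behavior can be roughly approximated in Python. This is only a sketch: the real standard tokenizer implements the Unicode UAX #29 word-boundary rules, not a simple regex split.

```python
import re

def standard_like(text):
    """Rough approximation of the standard analyzer on plain English
    text: split on runs of non-alphanumeric characters, lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

print(standard_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

The hyphen in "3-points" acts as a separator, which is why "3" and "points" come out as separate tokens, matching the response above.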
2. simple analyzer
The simple analyzer splits text into terms whenever it encounters a character that is not a letter, and lowercases every term.
POST /_analyze
{
"analyzer": "simple",
"text": "The best 3-points shooter is Curry!"
}
Result:
Both '3' and '-' are gone.
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "points",
"start_offset" : 11,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "word",
"position" : 3
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 4
},
{
"token" : "curry",
"start_offset" : 29,
"end_offset" : 34,
"type" : "word",
"position" : 5
}
]
}
Note: every character that is not a letter is discarded.
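A Python sketch of the same rule (an approximation, not the real Lucene implementation): splitting on anything that is not a letter also makes digits act as separators, which is why "3" disappears here but not with the standard analyzer.

```python
import re

def simple_like(text):
    # Split wherever the character is not a letter, then lowercase.
    # Digits count as separators too, so "3" is dropped entirely.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

print(simple_like("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```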
3. whitespace analyzer
The whitespace analyzer splits text into terms whenever it encounters a whitespace character.
POST /_analyze
{
"analyzer": "whitespace",
"text": "The best 3-points shooter is Curry!"
}
Result:
{
"tokens" : [
{
"token" : "The",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "3-points",
"start_offset" : 9,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "word",
"position" : 3
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 4
},
{
"token" : "Curry!",
"start_offset" : 29,
"end_offset" : 35,
"type" : "word",
"position" : 5
}
]
}
Note: it splits on whitespace only; it neither lowercases the terms nor strips punctuation.
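This one is simple enough that Python's built-in `str.split` models it closely (still a sketch, since the real tokenizer handles all Unicode whitespace):

```python
def whitespace_like(text):
    # Split on whitespace only; case and punctuation are preserved,
    # so "3-points" and "Curry!" survive as single tokens.
    return text.split()

print(whitespace_like("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```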
4. stop analyzer
The stop analyzer is similar to the simple analyzer, the only difference being that it also removes stop words; by default it uses the English stop-word list.
stopwords: a predefined list of stop words, such as the, a, an, this, of, at, and so on.
POST /_analyze
{
"analyzer": "stop",
"text": "The best 3-points shooter is Curry!"
}
Result:
{
"tokens" : [
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "points",
"start_offset" : 11,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "word",
"position" : 3
},
{
"token" : "curry",
"start_offset" : 29,
"end_offset" : 34,
"type" : "word",
"position" : 5
}
]
}
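The stop analyzer can be sketched as the simple analyzer followed by a stop-word filter. The stop-word set below is only a small subset of the real default English list:

```python
import re

# A few of the default English stop words; the real list is longer.
STOPWORDS = {"the", "a", "an", "this", "of", "at", "is", "and", "but"}

def stop_like(text):
    # Same split as the simple analyzer, then drop stop words.
    terms = [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]
    return [t for t in terms if t not in STOPWORDS]

print(stop_like("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```

Note that in the real response above the surviving tokens keep their original position values (1, 2, 3, 5); the gaps mark where stop words were removed.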
5. language analyzer
Language-specific analyzers, for example english for English text.
Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
POST /_analyze
{
"analyzer": "english",
"text": "The best 3-points shooter is Curry!"
}
Result (note that the english analyzer removes English stop words and stems the remaining terms):
{
"tokens" : [
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "3",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "point",
"start_offset" : 11,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "curri",
"start_offset" : 29,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 6
}
]
}
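The english analyzer stems terms with the Porter stemmer, which is why "points" becomes "point" and "Curry" becomes "curri". The toy sketch below implements only the two stemming rules visible in this particular output; it is nowhere near the real Porter algorithm:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "of", "at"}

def toy_stem(term):
    # Two toy rules mimicking what the Porter stemmer does to this
    # sentence: strip a trailing plural "s", and rewrite a trailing
    # "y" as "i" ("points" -> "point", "curry" -> "curri").
    if term.endswith("s") and len(term) > 3:
        term = term[:-1]
    if term.endswith("y"):
        term = term[:-1] + "i"
    return term

def english_like(text):
    terms = [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    return [toy_stem(t) for t in terms if t not in STOPWORDS]

print(english_like("The best 3-points shooter is Curry!"))
# ['best', '3', 'point', 'shooter', 'curri']
```

Stemming happens at index time and at search time with the same analyzer, so a query for "points" still finds documents containing "point".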
6. pattern analyzer
Splits text into terms using a regular expression; the default pattern is \W+ (a run of non-word characters).
POST /_analyze
{
"analyzer": "pattern",
"text": "The best 3-points shooter is Curry!"
}
Result:
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "3",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 2
},
{
"token" : "points",
"start_offset" : 11,
"end_offset" : 17,
"type" : "word",
"position" : 3
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "word",
"position" : 4
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 5
},
{
"token" : "curry",
"start_offset" : 29,
"end_offset" : 34,
"type" : "word",
"position" : 6
}
]
}
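With the default \W+ pattern the output matches the standard analyzer's on this sentence. A Python sketch of the same idea (the real analyzer uses Java regex syntax, which differs from Python's in some corner cases):

```python
import re

def pattern_like(text, pattern=r"\W+"):
    # Split on matches of the given regex (default \W+, i.e. runs of
    # non-word characters) and lowercase the resulting terms.
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']

# A custom pattern changes the splitting rule, e.g. split on commas:
print(pattern_like("a,b,c", pattern=","))
# ['a', 'b', 'c']
```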
Part 3: Using an Analyzer in Practice
1. Create an index
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "whitespace"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text"
},
"team_name": {
"type": "text"
},
"position": {
"type": "text"
},
"play_year": {
"type": "long"
},
"jerse_no": {
"type": "keyword"
},
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
2. Run a test search
GET /my_index/_search
{
"query": {
"match": {
"title": "Curry!"
}
}
}
Result:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"name" : "Curry",
"team_name" : "Warriors",
"position" : "point guard",
"play_year" : 10,
"jerse_no" : "30",
"title" : "The best 3-points shooter is Curry!"
}
}
]
}
}
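The hit follows from the custom whitespace analyzer on the title field: both the indexed text and the match query's text are tokenized on whitespace, so the exact token "Curry!" exists on both sides. A minimal sketch of that term-overlap logic (a drastic simplification of Lucene's inverted index, reusing the example's data):

```python
def whitespace_tokens(text):
    # The whitespace analyzer: split on whitespace, keep case
    # and punctuation.
    return set(text.split())

doc_title = "The best 3-points shooter is Curry!"

# A match query analyzes the query text with the field's analyzer and
# looks for any overlap with the indexed tokens.
def matches(query):
    return bool(whitespace_tokens(doc_title) & whitespace_tokens(query))

print(matches("Curry!"))  # True: the token "Curry!" is in the index
print(matches("Curry"))   # False: "Curry" != "Curry!" under whitespace
```

This also shows the trade-off of the whitespace analyzer here: searching for "Curry" without the exclamation mark would return no hits, because index-time and search-time tokens must match exactly.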
