Introduction to Analyzers and How to Use Them
What is an analyzer?
An analyzer is a tool that takes a piece of user-input text and, following a set of rules, breaks it into individual terms.
Commonly used built-in analyzers
standard analyzer、simple analyzer、whitespace analyzer、stop analyzer、language analyzer、pattern analyzer
standard analyzer
The standard analyzer is the default; it is used whenever no analyzer is specified.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "standard",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 },
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 },
    { "token": "3", "start_offset": 9, "end_offset": 10, "type": "<NUM>", "position": 2 },
    { "token": "points", "start_offset": 11, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "<ALPHANUM>", "position": 4 },
    { "token": "is", "start_offset": 26, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 },
    { "token": "curry", "start_offset": 29, "end_offset": 34, "type": "<ALPHANUM>", "position": 6 }
  ]
}
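The standard analyzer's behavior on this sentence can be approximated in Python. This is only a rough sketch: the real analyzer uses Unicode text segmentation plus a lowercase filter, not a regex, but for plain ASCII input the result is the same.

```python
import re

def standard_like(text):
    """Rough approximation of the standard analyzer: lowercase the
    text and treat each maximal run of letters/digits as a term."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(standard_like("The best 3-points shooter is Curry!"))
# → ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```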
simple analyzer
The simple analyzer splits text into terms at any character that is not a letter, and lowercases every term.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "simple",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1 },
    { "token": "points", "start_offset": 11, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 },
    { "token": "is", "start_offset": 26, "end_offset": 28, "type": "word", "position": 4 },
    { "token": "curry", "start_offset": 29, "end_offset": 34, "type": "word", "position": 5 }
  ]
}
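The same tokenization can be sketched in Python (an approximation, not the Lucene implementation). Note how the digit "3" disappears, because only runs of letters become terms:

```python
import re

def simple_like(text):
    """Approximation of the simple analyzer: terms are maximal runs
    of letters; everything is lowercased and digits are dropped."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_like("The best 3-points shooter is Curry!"))
# → ['the', 'best', 'points', 'shooter', 'is', 'curry']
```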
whitespace analyzer
The whitespace analyzer splits text into terms wherever it encounters whitespace characters.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "whitespace",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "The", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1 },
    { "token": "3-points", "start_offset": 9, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 },
    { "token": "is", "start_offset": 26, "end_offset": 28, "type": "word", "position": 4 },
    { "token": "Curry!", "start_offset": 29, "end_offset": 35, "type": "word", "position": 5 }
  ]
}
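In Python terms this is essentially a plain whitespace split: case and punctuation are kept, so "Curry!" survives with its exclamation mark. A minimal sketch:

```python
def whitespace_like(text):
    """Approximation of the whitespace analyzer: split on whitespace
    only; no lowercasing, punctuation stays attached to the term."""
    return text.split()

print(whitespace_like("The best 3-points shooter is Curry!"))
# → ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```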
stop analyzer
The stop analyzer behaves like the simple analyzer, except that it also removes stop words; by default it uses the English stop word list.
stopwords: a predefined list of stop words, such as the, a, an, this, of, at, and so on.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "stop",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1 },
    { "token": "points", "start_offset": 11, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 },
    { "token": "curry", "start_offset": 29, "end_offset": 34, "type": "word", "position": 5 }
  ]
}
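Conceptually this is the simple analyzer followed by a stop word filter. A sketch, using only an illustrative subset of the English stop word list rather than the full one Lucene ships:

```python
import re

# Illustrative subset of the English stop word list (not the full set).
STOPWORDS = {"the", "a", "an", "this", "of", "at", "is"}

def stop_like(text):
    """Approximation of the stop analyzer: simple-analyzer
    tokenization, then drop any term found in the stop word set."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]

print(stop_like("The best 3-points shooter is Curry!"))
# → ['best', 'points', 'shooter', 'curry']
```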
language analyzer
Language-specific analyzers, e.g. english for English text. Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "english",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 },
    { "token": "3", "start_offset": 9, "end_offset": 10, "type": "<NUM>", "position": 2 },
    { "token": "point", "start_offset": 11, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "<ALPHANUM>", "position": 4 },
    { "token": "curri", "start_offset": 29, "end_offset": 34, "type": "<ALPHANUM>", "position": 6 }
  ]
}
Note that the english analyzer also removes English stop words and stems the remaining terms, which is why "points" becomes "point" and "Curry" becomes "curri".
pattern analyzer
The pattern analyzer splits text into terms with a regular expression; the default pattern is \W+ (one or more non-word characters).
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "pattern",
  "text": "The best 3-points shooter is Curry!"
}
Response:
{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "best", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1 },
    { "token": "3", "start_offset": 9, "end_offset": 10, "type": "word", "position": 2 },
    { "token": "points", "start_offset": 11, "end_offset": 17, "type": "word", "position": 3 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 4 },
    { "token": "is", "start_offset": 26, "end_offset": 28, "type": "word", "position": 5 },
    { "token": "curry", "start_offset": 29, "end_offset": 34, "type": "word", "position": 6 }
  ]
}
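The default pattern-analyzer behavior (split on \W+, lowercase on) can be sketched with Python's re.split. An approximation, assuming the defaults described above:

```python
import re

def pattern_like(text, pattern=r"\W+"):
    """Approximation of the pattern analyzer with default settings:
    lowercase, split on the regex, and discard empty strings."""
    return [t for t in re.split(pattern, text.lower()) if t]

print(pattern_like("The best 3-points shooter is Curry!"))
# → ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```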
Usage example
PUT localhost:9200/my_index
Request body:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "type": "whitespace" }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "team_name": { "type": "text" },
      "position": { "type": "text" },
      "play_year": { "type": "long" },
      "jerse_no": { "type": "keyword" },
      "title": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
Response:
{ "acknowledged": true, "shards_acknowledged": true, "index": "my_index" }

PUT localhost:9200/my_index/_doc/1
Request body:
{
  "name": "庫里",
  "team_name": "勇士",
  "position": "控球后衛",
  "play_year": 10,
  "jerse_no": "30",
  "title": "The best 3-points shooter is Curry!"
}
Response:
{
  "_index": "my_index",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "_seq_no": 0,
  "_primary_term": 1
}

POST localhost:9200/my_index/_search
Request body:
{
  "query": {
    "match": { "title": "Curry!" }
  }
}
Response:
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "name": "庫里",
          "team_name": "勇士",
          "position": "控球后衛",
          "play_year": 10,
          "jerse_no": "30",
          "title": "The best 3-points shooter is Curry!"
        }
      }
    ]
  }
}
Because the title field uses my_analyzer (a whitespace analyzer), "Curry!" is indexed as a single term with its punctuation intact, so the query term "Curry!" matches the document.
Using common Chinese analyzers
If you analyze Chinese text with the default standard analyzer, the text is split into individual characters:
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "standard",
  "text": "火箭明年總冠軍"
}
Response:
{
  "tokens": [
    { "token": "火", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "箭", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "明", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "年", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "總", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "冠", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "軍", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }
  ]
}
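For CJK input, this one-token-per-ideograph behavior amounts to nothing more than splitting the string into its characters, which is why it is a poor fit for Chinese search. A sketch:

```python
def standard_cjk_like(text):
    """For pure CJK input the standard analyzer emits one token per
    ideograph — equivalent to listing the string's characters."""
    return list(text)

print(standard_cjk_like("火箭明年總冠軍"))
# → ['火', '箭', '明', '年', '總', '冠', '軍']
```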
Common Chinese analyzers
smartcn analyzer: a simple analyzer for Chinese, or mixed Chinese and English, text
IK analyzer: a smarter and more flexible Chinese analyzer
smartcn
Installation: from the Elasticsearch bin directory, run:
- Linux: sh elasticsearch-plugin install analysis-smartcn
- Windows: .\elasticsearch-plugin install analysis-smartcn
Restart Elasticsearch after the plugin is installed.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "smartcn",
  "text": "火箭明年總冠軍"
}
Response:
{
  "tokens": [
    { "token": "火箭", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "明年", "start_offset": 2, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "總", "start_offset": 4, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "冠軍", "start_offset": 5, "end_offset": 7, "type": "word", "position": 3 }
  ]
}
IK analyzer
Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
Installation: unzip the release into the plugins directory.
Restart Elasticsearch after the plugin is installed.
POST localhost:9200/_analyze
Request body:
{
  "analyzer": "ik_max_word",
  "text": "火箭明年總冠軍"
}
Response:
{
  "tokens": [
    { "token": "火箭", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "明年", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "總冠軍", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 2 },
    { "token": "冠軍", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 3 }
  ]
}
Note that ik_max_word produces the most fine-grained segmentation, so overlapping terms such as 總冠軍 and 冠軍 can both appear in the output.