一、Analyzers
1、Purpose: ① tokenization (splitting text into terms)
② normalization (improves recall, i.e. the proportion of matching documents a search can actually find; see the example below)
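A quick way to see both effects at once, a minimal sketch against any running cluster (the sample text is made up):
GET _analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown-Foxes"
}
The text is both segmented and normalized to lowercase: the, quick, brown, foxes.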
2、Analyzer components
① character filter: pre-processing before tokenization (strips useless characters, tags, etc., and rewrites text, e.g. & => and, 《Elasticsearch》 => Elasticsearch)
A、HTML Strip Character Filter: html_strip
escaped_tags: the HTML tags to preserve
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
Test the analyzer:
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "liuyucheng <a><b>edu</b></a>"
}
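With the keyword tokenizer the whole input stays one token; html_strip removes the <b> tag but keeps <a> because of escaped_tags, so the output term should be roughly: liuyucheng <a>edu</a>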
B、Mapping Character Filter: type mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
            "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
          ]
        }
      }
    }
  }
}
Test the analyzer:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}
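The mapping filter rewrites the Arabic-Indic digits before tokenization, so the single keyword token should come out as: My license plate is 25015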
C、Pattern Replace Character Filter: regex replacement, type pattern_replace
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
Test the analyzer:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
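The lookahead regex turns each digit-dash-digit boundary into an underscore, so the number survives tokenization as one term: My, credit, card, is, 123_456_789. Without the replacement, the standard tokenizer would split it into 123, 456, 789.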
② tokenizer: splits the character stream into individual tokens
③ token filter: tense/stemming normalization, case folding, synonym expansion, filler-word removal, etc.
e.g. has => have, him => he, apples => apple, the/oh/a => dropped (a synonym-filter sketch follows below)
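Since none of the lettered examples below cover synonyms, here is a minimal sketch of a synonym token filter (the index name my_synonym_index and the synonym pairs are made up for illustration):
PUT /my_synonym_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["usa, united states", "big, large"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}
GET /my_synonym_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "USA is big"
}
With this equivalent-synonym syntax, usa also emits united states (and vice versa), so a search for either form matches.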
A、Case folding: lowercase token filter
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE Quick FoX JUMPs"
}

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": ["lowercase"],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
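The first request lowercases every token: the, quick, fox, jumps. The second wraps lowercase in a condition token filter so that only tokens shorter than 5 characters are folded, yielding: the, QUICK, BROWN, fox.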
B、Stop words: stopwords
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}
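With the predefined _english_ stop list, is, in, and the are dropped, leaving: teacher, ma, restroom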
C、Tokenizer: standard
GET /my_index/_analyze
{
  "text": "江山如此多嬌,小姐姐哪里可以撩",
  "analyzer": "standard"
}
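Note: standard has no dictionary for Chinese, so it falls back to emitting one token per CJK character (江, 山, 如, 此, ...), which is rarely useful for search. That is the motivation for the Chinese analyzers in section 二.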
D、Custom analyzer: set type to custom to tell Elasticsearch we are defining a custom analyzer. Compare this with how a built-in analyzer is configured: there, type is set to the name of the built-in analyzer, such as standard or simple.
PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": ["& => and", "| => or"]
        }
      },
      "filter": {
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is", "in", "at", "the", "a", "for"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "test_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase", "test_stopwords"]
        }
      }
    }
  }
}

GET /test_analysis/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}
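Pipeline walkthrough: the char filters run first (& => and, | => or), the standard tokenizer splits the text, then lowercase and the custom stop list fold case and drop terms like is and the. Note that the punctuation tokenizer is defined in settings but never referenced by my_analyzer, so it has no effect here.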
E、Specify the analyzer when creating a mapping (referencing the my_analyzer defined above)
PUT /test_analysis/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
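Note: the my_type mapping type implies a pre-7.x cluster; from ES 7.x on, mapping types are removed and the request becomes PUT /test_analysis/_mapping with the same body. Unless a separate search_analyzer is set, this analyzer is applied at both index time and search time.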
二、Chinese analyzers
(1) Chinese analyzers:
① IK analyzer: the ES installation path must not contain Chinese characters or spaces
1) Download: https://github.com/medcl/elasticsearch-analysis-ik
2) Create the plugin folder: cd your-es-root/plugins/ && mkdir ik
3) Unzip the plugin into your-es-root/plugins/ik
4) Restart ES (see the one-line plugin install below as an alternative)
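Alternatively, recent IK releases can be installed with the plugin CLI; a sketch, where the version in the URL is an example and must match your ES version exactly:
your-es-root/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip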
② Two analyzers (compared in the sketch below)
1) ik_max_word: fine-grained, exhaustively emits every dictionary word it can match
2) ik_smart: coarse-grained, emits the most likely non-overlapping segmentation
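To see the difference, run the same text through both (assuming the IK plugin is installed; the sentence is the stock example from the IK README):
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
ik_max_word emits overlapping terms such as 中华人民共和国, 中华人民, 中华, 人民共和国, 人民, 共和国, 国歌, while ik_smart emits only 中华人民共和国 and 国歌.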
③ IK file layout
1) IKAnalyzer.cfg.xml: IK configuration file
2) Main dictionary: main.dic
3) English stop words: stopword.dic; these terms never enter the inverted index
4) Special dictionaries:
- quantifier.dic: measure words, units, etc.
- suffix.dic: suffixes
- surname.dic: Chinese family names
- preposition.dic: function/filler words
5) Custom dictionaries: e.g. current slang such as 857, emmm..., 渣女, 舔屏, 996 (wired up in the config sketch below)
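A sketch of IKAnalyzer.cfg.xml with a local extension dictionary; the path custom/my_ext.dic is an assumed example, relative to the IK config directory, one word per line, UTF-8 encoded:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionaries, semicolon-separated (assumed example path) -->
    <entry key="ext_dict">custom/my_ext.dic</entry>
    <!-- local extension stop-word dictionaries -->
    <entry key="ext_stopwords"></entry>
</properties>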
6) Hot updates:
- modify the IK analyzer source code, or
- use IK's native hot-update mechanism: deploy a web server exposing an HTTP endpoint, and use the Last-Modified and ETag response headers to signal when the word list has changed (see the config sketch below)
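A sketch of the remote-dictionary entries in the same IKAnalyzer.cfg.xml (the URLs are placeholders; IK polls each endpoint and reloads only when Last-Modified or ETag changes, and the response must list one word per line in UTF-8):
<!-- remote extension dictionary (placeholder URL) -->
<entry key="remote_ext_dict">http://example.com/hot_words.dic</entry>
<!-- remote extension stop-word dictionary (placeholder URL) -->
<entry key="remote_ext_stopwords">http://example.com/hot_stopwords.dic</entry>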