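Many companies now use Elasticsearch (ES) for search, and several business units at our company use it as well. I mainly work on merchant search, providing foundational support to those business units. Last week I reworked the call center's search. After adding a few fields and running a full sync, I found that searching by pinyin initials no longer returned any results. The cause turned out to be a changed dictionary URL, which broke tokenization.
Here are my notes on the ES analysis plugins and the setup process: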
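1. Download and install the IK analysis plugin: https://github.com/medcl/elasticsearch-analysis-ik
Edit IKAnalyzer.cfg.xml so that it loads your own remote extension dictionary. My dictionary URL stopped working after a data center migration, but the problem went unnoticed for a long time because most merchant data had not been updated: the analyzed index entries for a document are only rebuilt when that document is updated!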
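A stale dictionary URL like the one described above can be caught early with a small health check. The following is only a minimal sketch in Python, not part of the IK plugin: it assumes the config file uses the `remote_ext_dict` entry key from the IK README, that multiple locations are separated by semicolons, and that the dictionary server answers HEAD requests. The function names and the config path are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import List


def remote_dict_urls(cfg_xml: str) -> List[str]:
    """Return the URLs listed under the remote_ext_dict entry of IKAnalyzer.cfg.xml."""
    root = ET.fromstring(cfg_xml)
    urls = []
    for entry in root.iter("entry"):
        if entry.get("key") == "remote_ext_dict" and entry.text:
            # Assumption: several locations may be separated by semicolons.
            urls.extend(u.strip() for u in entry.text.split(";") if u.strip())
    return urls


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Send a HEAD request and report whether the server answered with HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    # Hypothetical path; adjust to wherever the plugin config lives.
    with open("config/IKAnalyzer.cfg.xml", encoding="utf-8") as handle:
        for url in remote_dict_urls(handle.read()):
            print(url, "OK" if is_reachable(url) else "UNREACHABLE")
```

Running something like this from a scheduled job would flag a dead dictionary URL even when no documents are being reindexed.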
2. Download and install the pinyin plugin: https://github.com/medcl/elasticsearch-analysis-pinyin
Create the index
curl -XPUT http://127.0.0.1:9200/demo/ -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "ik_smart_pinyin": {
            "tokenizer": "ik_smart",
            "filter": ["my_pinyin", "lowercase", "word_delimiter"]
          },
          "ik_max_word_pinyin": {
            "tokenizer": "ik_max_word",
            "filter": ["my_pinyin", "lowercase", "word_delimiter"]
          }
        },
        "tokenizer": {
          "ik_smart": { "type": "ik_smart", "use_smart": "true" },
          "ik_max_word": { "type": "ik_max_word", "use_smart": "false" }
        },
        "filter": {
          "my_pinyin": { "type": "pinyin", "first_letter": "all" }
        }
      }
    }
  }
}'
Test the analyzer
curl -XGET 'http://127.0.0.1:9200/demo/_analyze?analyzer=ik_smart_pinyin&text=望湘園'
{
  "tokens": [
    { "token": "wang", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 },
    { "token": "xiang", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 1 },
    { "token": "yuan", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 2 },
    { "token": "wxy", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 3 }
  ]
}
The token "wxy" is the first-letter (pinyin initials) token.
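Since the original failure was that the first-letter token silently stopped being produced, the same analyze call can be turned into a small regression check. This is a rough sketch in Python under a few assumptions: the index is named `demo` as in the example above, the cluster accepts `analyzer` and `text` as query parameters (as in the ES version used here; as far as I know, newer releases expect them in a JSON request body), and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def analyze_url(base_url, index, analyzer, text):
    """Build the _analyze URL, percent-encoding the query so non-ASCII text survives."""
    query = urllib.parse.urlencode({"analyzer": analyzer, "text": text})
    return f"{base_url.rstrip('/')}/{index}/_analyze?{query}"


def token_texts(analyze_response):
    """Extract the token strings from a parsed _analyze response."""
    return [item["token"] for item in analyze_response.get("tokens", [])]


def has_first_letter_token(analyze_response, expected):
    """Report whether the expected first-letter token (e.g. "wxy") was produced."""
    return expected in token_texts(analyze_response)


def check(base_url, index, analyzer, text, expected):
    """Call _analyze on a live cluster and check for the first-letter token."""
    url = analyze_url(base_url, index, analyzer, text)
    with urllib.request.urlopen(url, timeout=5) as response:
        return has_first_letter_token(json.load(response), expected)
```

Against a running cluster, `check("http://127.0.0.1:9200", "demo", "ik_smart_pinyin", "望湘園", "wxy")` should return a boolean indicating whether the initials token is still being generated.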