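1. Overview
Elasticsearch is used as a search engine, and it needs analyzers (tokenizers) configured to optimize indexing. Common choices include ik_max_word, which splits text at the finest granularity; ik_smart, which splits text at the coarsest granularity; and ansj.
ik download: https://github.com/medcl/elasticsearch-analysis-ik/releases
ansj download: https://github.com/NLPchina/elasticsearch-analysis-ansj
Be sure to install the plugin version that matches your Elasticsearch version. After unzipping, move the installation archive to another directory or delete it: the plugins directory may contain only the analyzer directories. Otherwise Elasticsearch fails to start with an error. This matters most in a Docker environment, where an unmapped plugins directory leaves no option but to recreate the container.
This article uses version 5.2 as an example to show how to install the ansj analyzer and set it as the default analyzer. Note that from 5.x onward the default analyzer can no longer be set in elasticsearch.yml; it can only be set through the API.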
Since elasticsearch 5.x index level settings can NOT be set on the nodes configuration like the elasticsearch.yaml, in system properties or command line arguments. In order to upgrade all indices the settings must be updated via the /${index}/_settings API. Unless all settings are dynamic all indices must be closed in order to apply the upgrade. Indices created in the future should use index templates to set default values.
2. Installation
# ./bin/elasticsearch-plugin install https://github.com/NLPchina/elasticsearch-analysis-ansj/releases/download/v5.2.2/elasticsearch-analysis-ansj-5.2.2.0-release.zip
Then unzip the archive:
# unzip elasticsearch-analysis-ansj-5.2.2.0-release.zip && rm -rf elasticsearch-analysis-ansj-5.2.2.0-release.zip
# mv elasticsearch elasticsearch-analysis-ansj
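The rule from the overview (only plugin directories may remain under plugins) can be checked before restarting. The following is a minimal bash sketch, not part of Elasticsearch: the helper name list_stray_plugin_files is hypothetical, and it assumes a find that supports -mindepth and -maxdepth, as GNU and BSD find do.

```shell
# Hypothetical helper: print every entry directly under the given plugins
# directory that is not a directory, such as a leftover .zip archive.
list_stray_plugin_files() {
  find "$1" -mindepth 1 -maxdepth 1 ! -type d
}
```

For example, `list_stray_plugin_files /usr/share/elasticsearch/plugins` should print nothing once the archive has been removed. The path is the one used in the configuration later in this article and may differ on your system.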
Restart the service so that the analyzer is loaded.
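After the restart, one way to confirm that the plugin was picked up is to query the _cat/plugins API. This is a rough bash sketch under assumptions: ES_URL and the helper name ansj_plugin_loaded are made up for illustration, and matching on the substring "ansj" is a guess at how the plugin reports its name.

```shell
# Assumed base URL of the node; adjust as needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical helper: succeeds if any line returned by _cat/plugins
# contains "ansj" (case-insensitive).
ansj_plugin_loaded() {
  curl -fsS "$ES_URL/_cat/plugins" | grep -qi 'ansj'
}
```

It could be used as `ansj_plugin_loaded && echo "ansj plugin is loaded"`.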
Set it as the default analyzer:
# curl -XPUT 'http://localhost:9200/_all/_settings?preserve_existing=true' -d '{
"index.analysis.analyzer.default.type" : "index_ansj",
"index.analysis.analyzer.default_search.type" : "query_ansj"
}'
The following error appears:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Can't update non dynamic settings [[index.analysis.analyzer.default_search.type]] for open indices
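These settings are not dynamic. The indices are open, so they must be closed first, then updated, and then reopened. Applying the settings through the API this way does not require restarting Elasticsearch. It is best not to restart a production cluster, because loading the indices takes a long time and can trigger errors.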
# curl -XPOST 'localhost:9200/_all/_close'
# curl -XPUT 'http://localhost:9200/_all/_settings?preserve_existing=true' -d '{
"index.analysis.analyzer.default.type" : "index_ansj",
"index.analysis.analyzer.default_search.type" : "query_ansj"
}'
# curl -XPOST 'localhost:9200/_all/_open'
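The three steps above can be wrapped in a single bash function so that the indices are reopened even when the update fails. This is only a sketch under assumptions: ES_URL and the function name set_default_ansj_analyzer are hypothetical, curl's -f flag is used so that HTTP errors produce a non-zero exit status, and the Content-Type header is included because 6.x and later require it (it should be harmless on 5.x).

```shell
# Assumed base URL of the cluster; adjust as needed.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical helper: close all indices, apply the analyzer settings,
# and reopen the indices even if the update itself fails.
set_default_ansj_analyzer() {
  local body='{
  "index.analysis.analyzer.default.type": "index_ansj",
  "index.analysis.analyzer.default_search.type": "query_ansj"
}'
  local rc=0

  # Step 1: close every index; give up if this fails.
  curl -fsS -XPOST "$ES_URL/_all/_close" || return 1

  # Step 2: apply the non-dynamic settings and remember any failure.
  curl -fsS -XPUT -H 'Content-Type: application/json' \
    "$ES_URL/_all/_settings?preserve_existing=true" -d "$body" || rc=$?

  # Step 3: always try to reopen, so a failed update does not leave
  # the indices closed.
  curl -fsS -XPOST "$ES_URL/_all/_open" || return 1

  return "$rc"
}
```

Calling `set_default_ansj_analyzer` then returns a non-zero status if any step failed, which makes it easier to use from a deployment script.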
Running the PUT command on 6.x and later:
From 6.x onward, every request that modifies or writes data to ES must include -H'Content-Type: application/json'. Reference:
https://www.elastic.co/blog/strict-content-type-checking-for-elasticsearch-rest-requests
#curl -XPUT -H'Content-Type: application/json' 'http://localhost:9200/_all/_settings?preserve_existing=true' -d '{
"index.analysis.analyzer.default.type" : "index_ansj",
"index.analysis.analyzer.default_search.type" : "query_ansj"
}'
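A script that has to work on both 5.x and 6.x clusters could decide from the version string whether to add the header. The bash sketch below is illustrative only: the helper name needs_content_type, the variable es_version, and its example value are assumptions, and the version is expected in the usual major.minor.patch form.

```shell
# Hypothetical helper: succeeds when the given Elasticsearch version
# (for example "6.2.4") is 6.x or later.
needs_content_type() {
  local major="${1%%.*}"   # keep only the part before the first dot
  [ "$major" -ge 6 ]
}

# Build the optional header arguments for curl as a bash array.
es_version="5.2.2"   # example value; replace with your cluster's version
header_args=()
if needs_content_type "$es_version"; then
  header_args=(-H 'Content-Type: application/json')
fi
# A later call could then use: curl -XPUT "${header_args[@]}" ... -d '...'
```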
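## Setting stop words (stopwords):
Stop words are filtered out at search time. For example, if "老" is configured as a stop word, a search for "老師" only searches for "師".
This article gives an example that removes whitespace:
Modify the configuration file of elasticsearch-analysis-ansj: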
# cat ansj.cfg.yml
# Global configuration, method 1
ansj:
  # Default parameters
  isNameRecognition: true # enable person-name recognition
  isNumRecognition: true # enable number recognition
  isQuantifierRecognition: true # whether to merge numbers with quantifiers
  isRealName: false # whether to keep the original form of words; leaving this false is recommended
  # User-defined dictionary
  dic: file://usr/share/elasticsearch/plugins/elasticsearch-analysis-ansj/default.dic # can also be written as file://default.dic; if dic is not configured, this dictionary is loaded by default
  # Load over HTTP
  #dic_d1: http://xxx/xx.dic
  # Load from a file inside a jar
  #dic_d2: jar://org.ansj.dic.DicReader|/dic2.dic
  # Load from a database
  #dic_d3: jdbc:mysql://xxxx:3306/ttt?useUnicode=true&characterEncoding=utf-8&zeroDateTimeBehavior=convertToNull|username|password|select name as name,nature,freq from dic where type=1
  # Load from a custom class; YourClas extends PathToStream
  #dic_d3: class://xxx.xxx.YourClas|ohterparam
  # Stop-word (filter) dictionary
  #stop: http, file, jar, class, and jdbc are all supported
  #stop_key1: ...
  stop: file://usr/share/elasticsearch/plugins/elasticsearch-analysis-ansj/stop.dic
  # Ambiguity dictionary
  #ambiguity: http, file, jar, class, and jdbc are all supported
  #ambiguity_key1: ...
  # Synonym dictionary
  #synonyms: http, file, jar, class, and jdbc are all supported
  #synonyms_key1: ...
# Global configuration, method 2: configure through a file, which takes precedence over the ES configuration itself
#ansj_config: ansj_library.properties # http, file, jar, class, and jdbc are all supported; see ansj_library.properties for the format
# Custom analyzer configuration
index:
  analysis:
    tokenizer :
      my_dic :
        type : dic_ansj
        dic: dic
        stop: stop
        ambiguity: ambiguity
        synonyms: synonyms
        isNameRecognition: true
        isNumRecognition: true
        isQuantifierRecognition: true
        isRealName: false
    analyzer:
      my_dic:
        type: custom
        tokenizer: my_dic
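Add the line stop: file://usr/share/elasticsearch/plugins/elasticsearch-analysis-ansj/stop.dic, then create the stop.dic file at that location, encoded as UTF-8.
Filtering whitespace requires a regular expression. Because editors may convert tab characters to spaces, the syntax is written here as \s+[tab]regex, where [tab] stands for one press of the Tab key. In other words, inside stop.dic the pattern \s+ and the word regex must be separated by a single tab character for the whitespace filter to take effect.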
# cat stop.dic
\s+ regex
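Since a tab is easy to lose when typing or pasting, one option is to generate the file with printf, which writes a real tab regardless of editor settings. This is a small sketch, not the plugin author's method. It writes stop.dic into the current directory, so run it from the directory configured in the stop: setting or move the file there afterwards.

```shell
# Write the rule with a literal tab between the pattern and "regex".
# In printf's format string, \\ yields one backslash and \t yields a tab.
printf '\\s+\tregex\n' > stop.dic

# Dump the file character by character; the tab is shown as \t.
od -c stop.dic
```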
References:
https://github.com/NLPchina/elasticsearch-analysis-ansj
http://pathbox.github.io/work/2017/09/13/elasticsearch-5.5.2-install-and-config.html
https://stackoverflow.com/questions/19758335/error-when-trying-to-update-the-settings/24414375
https://www.elastic.co/blog/strict-content-type-checking-for-elasticsearch-rest-requests