一、先擺需求:
1、中文搜索、英文搜索、中英混搜 如:“南京東路”,“cafe 南京東路店”
2、全拼搜索、首字母搜索、中文+全拼、中文+首字母混搜 如:“nanjingdonglu”,“njdl”,“南京donglu”,“南京dl”,“nang南東路”,“njd路”等等組合
3、簡繁搜索、特殊符號過濾搜索 如:“龍馬”可通過“龍馬”搜索,再比如 L.G.F可以通過lgf搜索,café可能通過cafe搜索
4、排序優先級為: 以關鍵字開頭>包含關鍵字
二、生產效果圖:






三、實現
1、索引設計
使用multi_field為搜索字段建立不同類型的索引,有全拼索引、首字母簡寫索引、Ngram索引以及IK索引,從各個角度分別擊破,然后通過char-filter進行特殊符號與簡繁轉換。
curl -XPUT localhost:9200/search_words_index -d '{
"settings" : {
"refresh_interval" : "5s",
"number_of_shards" : 1,
"number_of_replicas" : 1,
"analysis" : {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 50
},
"pinyin_simple_filter":{
"type" : "pinyin",
"keep_first_letter":true,
"keep_separate_first_letter" : false,
"keep_full_pinyin" : false,
"keep_original" : false,
"limit_first_letter_length" : 50,
"lowercase" : true
},
"pinyin_full_filter":{
"type" : "pinyin",
"keep_first_letter":false,
"keep_separate_first_letter" : false,
"keep_full_pinyin" : true,
"none_chinese_pinyin_tokenize":true,
"keep_original" : false,
"limit_first_letter_length" : 50,
"lowercase" : true
},
"t2s_convert":{
"type": "stconvert",
"delimiter": ",",
"convert_type": "t2s"
}
},
"char_filter" : {
"charconvert" : {
"type" : "mapping",
"mappings_path":"char_filter_text.txt"
}
},
"tokenizer":{
"ik_smart":{
"type":"ik",
"use_smart":true
}
},
"analyzer": {
"ngramIndexAnalyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["edge_ngram_filter","lowercase"],
"char_filter" : ["charconvert"]
},
"ngramSearchAnalyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter":["lowercase"],
"char_filter" : ["charconvert"]
},
"ikIndexAnalyzer": {
"type": "custom",
"tokenizer": "ik",
"char_filter" : ["charconvert"]
},
"ikSearchAnalyzer": {
"type": "custom",
"tokenizer": "ik",
"char_filter" : ["charconvert"]
},
"pinyiSimpleIndexAnalyzer":{
"tokenizer" : "keyword",
"filter": ["pinyin_simple_filter","edge_ngram_filter","lowercase"]
},
"pinyiSimpleSearchAnalyzer":{
"tokenizer" : "keyword",
"filter": ["pinyin_simple_filter","lowercase"]
},
"pinyiFullIndexAnalyzer":{
"tokenizer" : "keyword",
"filter": ["pinyin_full_filter","lowercase"]
},
"pinyiFullSearchAnalyzer":{
"tokenizer" : "keyword",
"filter": ["pinyin_full_filter","lowercase"]
}
}
}
},
"mappings": {
"search_words_type": {
"properties": {
"words": {
"type": "multi_field",
"fields":{
"words": {
"type": "string",
"index": "analyzed",
"indexAnalyzer" : "ngramIndexAnalyzer"
},
"SPY": {
"type": "string",
"index": "analyzed",
"indexAnalyzer" : "pinyiSimpleIndexAnalyzer"
},
"FPY": {
"type": "string",
"index": "analyzed",
"indexAnalyzer" : "pinyiFullIndexAnalyzer"
},
"IKS": {
"type": "string",
"index": "analyzed",
"indexAnalyzer" : "ikIndexAnalyzer"
}
}
}
}
}
}
}'
拼音插件的使用請參考:https://github.com/medcl/elasticsearch-analysis-pinyin
2、搜索構建
以下是搜索實現代碼(非完整代碼,只摘錄核心部分,主要是思路):
/** * 純中文搜索 * @return */ public List<Map> chineseSearch(String key,Integer cityId) throws Exception{ DisMaxQueryBuilder disMaxQueryBuilder=QueryBuilders.disMaxQuery(); //以關鍵字開頭(優先級最高) MatchQueryBuilder q1=QueryBuilders.matchQuery("words",key).analyzer("ngramSearchAnalyzer").boost(5); //完整包含經過分析過的關鍵字 // boolean whitespace=key.contains(" "); // int slop=whitespace?50:5; QueryBuilder q2=QueryBuilders.matchQuery("words.IKS", key).analyzer("ikSearchAnalyzer").minimumShouldMatch("100%"); disMaxQueryBuilder.add(q1); disMaxQueryBuilder.add(q2); SearchQuery searchQuery=builderQuery(cityId,disMaxQueryBuilder); return elasticsearchTemplate.queryForList(searchQuery,Map.class); } /** * 混合搜索 * @return */ public List<Map> chineseWithEnglishOrPinyinSearch(String key,Integer cityId) throws Exception{ DisMaxQueryBuilder disMaxQueryBuilder=QueryBuilders.disMaxQuery(); //是否有中文開頭,有則返回中文前綴 String startChineseString=commonSearchService.getStartChineseString(key); /** * 源值搜索,不做拼音轉換 * 權重* 1.5 */ QueryBuilder normSearchBuilder=QueryBuilders.matchQuery("words",key).analyzer("ngramSearchAnalyzer").boost(5f); /** * 拼音簡寫搜索 * 1、分析key,轉換為簡寫 case: 南京東路==>njdl,南京dl==>njdl,njdl==>njdl * 2、搜索匹配,必須完整匹配簡寫詞干 * 3、如果有中文前綴,則排序優先 * 權重*1 */ String analysisKey=commonSearchService.anaysisKeyAndGetMaxWords(SearchIndex.INDEX_NAME_SEARCHWORDSSTATISTICS,key,"pinyiSimpleSearchAnalyzer"); QueryBuilder pingYinSampleQueryBuilder=QueryBuilders.termQuery("words.SPY", analysisKey); /** * 拼音簡寫包含匹配,如 njdl可以查出 "城市公牛 南京東路店",雖然非南京東路開頭 * 權重*0.8 */ QueryBuilder pingYinSampleContainQueryBuilder=null; if(analysisKey.length()>1){ pingYinSampleContainQueryBuilder=QueryBuilders.wildcardQuery("words.SPY", "*"+analysisKey+"*").boost(0.8f); } /** * 拼音全拼搜索 * 1、分析key,獲取拼音詞干 case : 南京東路==>[nan,jing,dong,lu],南京donglu==>[nan,jing,dong,lu] * 2、搜索查詢,必須匹配所有拼音詞,如南京東路,則nan,jing,dong,lu四個詞干必須完全匹配 * 3、如果有中文前綴,則排序優先 * 權重*1 */ QueryBuilder pingYinFullQueryBuilder=null; if(key.length()>1){ pingYinFullQueryBuilder=QueryBuilders.matchPhraseQuery("words.FPY", key).analyzer("pinyiFullSearchAnalyzer"); } /** * 完整包含關鍵字查詢(優先級最低,只有以上四種方式查詢無結果時才考慮) * 權重*0.8 */ QueryBuilder containSearchBuilder=QueryBuilders.matchQuery("words.IKS", key).analyzer("ikSearchAnalyzer").minimumShouldMatch("100%"); disMaxQueryBuilder .add(normSearchBuilder) .add(pingYinSampleQueryBuilder) .add(containSearchBuilder); //以下兩個對性能有一定的影響,故作此判定,單個字符不執行此類搜索 if(pingYinFullQueryBuilder!=null){ disMaxQueryBuilder.add(pingYinFullQueryBuilder); } if(pingYinSampleContainQueryBuilder!=null){ disMaxQueryBuilder.add(pingYinSampleContainQueryBuilder); } QueryBuilder queryBuilder=disMaxQueryBuilder; //關鍵如果有中文,則必須包含在內容中 if(StringUtils.isNotBlank(startChineseString)){ queryBuilder= QueryBuilders.filteredQuery(disMaxQueryBuilder, FilterBuilders.queryFilter(QueryBuilders.queryStringQuery("*"+startChineseString+"*").field("words").analyzer("ngramSearchAnalyzer"))); queryBuilder=QueryBuilders.functionScoreQuery(queryBuilder) .add(FilterBuilders.queryFilter(QueryBuilders.matchQuery("words",startChineseString).analyzer("ngramSearchAnalyzer")), ScoreFunctionBuilders.weightFactorFunction(1.5f)); } SearchQuery searchQuery=builderQuery(cityId,queryBuilder); return elasticsearchTemplate.queryForList(searchQuery,Map.class); }
注:以上JAVA示例代碼皆以spring-data-elasticsearch框架為基礎。
拼音插件安裝:
1、下載拼音插件,官網地址:https://github.com/medcl/elasticsearch-analysis-pinyin 我下載的版本是:elasticsearch-analysis-pinyin-1.3.3。
把下載的 elasticsearch-analysis-pinyin-1.3.3.jar與nlp-lang-1.7.jar放於plugins目錄下。
2、修改elasticsearch配置文件,在最后一行之下加入(里面包括IK配置,如果未安裝IK可省略IK的配置):
index:
analysis:
analyzer:
ik:
alias: [news_analyzer_ik,ik_analyzer]
type: org.elasticsearch.index.analysis.IkAnalyzerProvider
ik_max_word:
type: ik
use_smart: false
ik_smart:
type: ik
use_smart: true
pinyin:
tokenizer: pinyin_tokenizer
filter: [standard,nGram]
tokenizer:
pinyin_tokenizer:
type: pinyin
first_letter: "prefix"
padding_char: ""
3、定制特殊符號及簡繁轉換文本:char_filter_text.txt,由於文件有點長,以下是部分內容,參考格式即可。
à=>a
á=>a
â=>a
ä=>a
À=>a
Â=>a
Ä=>a
è=>e
é=>e
ê=>e
ë=>e
È=>e
É=>e
Ê=>e
Ë=>e
î=>i
ï=>i
Î=>i
Ï=>i
ô=>o
ö=>o
Ô=>o
Ö=>o
ù=>u
û=>u
ü=>u
Ù=>u
Û=>u
Ü=>u
ç=>c
œ=>c
&=>
^=>
.=>
·=>
-=>
'=>
’=>
‘=>
/=>
醯壺=>醯壺
屢顧爾僕=>屢顧爾仆
見=>見
往裡=>往里
置言成範=>置言成范
捲動=>卷動
規=>規
齣電視=>出電視
覎=>覎
後堂=>后堂
4、重啟elasticsearch,重建索引,看是否生效。
