An Elasticsearch Search Optimization
1. Environment
ES 6.3.2; index user_v1, 5 primary shards, one replica per shard. Each shard is around 11 GB (checked with GET _cat/shards/user_v1).
There are 340 million documents in total, and the primary shards add up to 57 GB.
Segment info: curl -X GET "221.228.105.140:9200/_cat/segments/user_v1?v" >> user_v1_segment
The user_v1 index has 404 segments in total:
cat user_v1_segment | wc -l
404
Massage the data a bit and draw a histogram with Python:
sed -i '1d' user_v1_segment # drop the header line
awk -F ' ' '{print $7}' user_v1_segment >> docs_count # keep only the docs.count column
import matplotlib.pyplot as plt

# read the docs.count column extracted above
with open('docs_count') as f:
    docNums = list(map(int, f.read().splitlines()))

plt.hist(docNums, bins=40, facecolor='blue', edgecolor='black')
plt.xlabel('docs per segment')
plt.ylabel('number of segments')
plt.show()
This gives a rough view of how many documents each segment holds. The x-axis is the document count and the y-axis is the number of segments. Most segments contain only a small number of documents (under \(0.5\times10^{7}\)).
Change refresh_interval to 30s (the default is 1s); this reduces the rate at which new segments are created to some extent (a settings sketch follows the parameter list below). Then force merge the 404 segments down to 200:
POST /user_v1/_forcemerge?only_expunge_deletes=false&max_num_segments=200&flush=true
But on checking again there were still 312 segments. This likely has to do with the merge configuration; if you are interested, look into the meaning of these two parameters during a force merge:
- merge.policy.max_merge_at_once_explicit
- merge.scheduler.max_merge_count
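For reference, the refresh_interval change is an ordinary dynamic settings update; a minimal sketch with the 30s value used here:
PUT /user_v1/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}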
Run a profile analysis:
1. Collector time is too long; some shards take as long as 7.9s. For background on profile analysis, see: profile-api
2. With the HanLP tokenizer plugin, the terms produced by the analyzer include, surprisingly, a "whitespace term", and matching this term takes up to 800ms!
Let's look at the cause:
POST /_analyze
{
"analyzer": "hanlp_standard",
"text":"人生 如夢"
}
The tokenization result contains the whitespace:
{
"tokens": [
{
"token": "人生",
"start_offset": 0,
"end_offset": 2,
"type": "n",
"position": 0
},
{
"token": " ",
"start_offset": 0,
"end_offset": 1,
"type": "w",
"position": 1
},
{
"token": "如",
"start_offset": 0,
"end_offset": 1,
"type": "v",
"position": 2
},
{
"token": "夢",
"start_offset": 0,
"end_offset": 1,
"type": "n",
"position": 3
}
]
}
So is the whitespace actually stored once a real document has been analyzed?
First, define an index with term_vector enabled (see: store term-vector):
PUT user
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"profile": {
"properties": {
"nick": {
"type": "text",
"analyzer": "hanlp_standard",
"term_vector": "yes",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
Then PUT a document in:
PUT user/profile/1
{
"nick":"人生 如夢"
}
Check the term vectors (see: docs-termvectors):
GET /user/profile/1/_termvectors
{
"fields" : ["nick"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
The stored terms do include the whitespace:
{
"_index": "user",
"_type": "profile",
"_id": "1",
"_version": 1,
"found": true,
"took": 2,
"term_vectors": {
"nick": {
"field_statistics": {
"sum_doc_freq": 4,
"doc_count": 1,
"sum_ttf": 4
},
"terms": {
" ": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"人生": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"如": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"夢": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
}
}
}
}
}
Then run the profiled query analysis again:
GET user/profile/_search?human=true
{
"profile":true,
"query": {
"match": {
"nick": "人生 如夢"
}
}
}
The profile actually contains a query against the whitespace term!!! (note the space after "nick:")
"type": "TermQuery",
"description": "nick: ",
"time": "58.2micros",
"time_in_nanos": 58244,
The profile result is as follows:
"profile": {
"shards": [
{
"id": "[7MyDkEDrRj2RPHCPoaWveQ][user][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:人生 nick: nick:如 nick:夢",
"time": "642.9micros",
"time_in_nanos": 642931,
"breakdown": {
"score": 13370,
"build_scorer_count": 2,
"match_count": 0,
"create_weight": 390646,
"next_doc": 18462,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 220447,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:人生",
"time": "206.6micros",
"time_in_nanos": 206624,
"breakdown": {
"score": 942,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 167545,
"next_doc": 1493,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 36637,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick: ",
"time": "58.2micros",
"time_in_nanos": 58244,
"breakdown": {
"score": 918,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 46130,
"next_doc": 964,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 10225,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:如",
"time": "51.3micros",
"time_in_nanos": 51334,
"breakdown": {
"score": 888,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 43779,
"next_doc": 1103,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 5557,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:夢",
"time": "59.1micros",
"time_in_nanos": 59108,
"breakdown": {
"score": 3473,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 49739,
"next_doc": 900,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 4989,
"advance": 0,
"advance_count": 0
}
}
]
}
],
"rewrite_time": 182090,
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "25.9micros",
"time_in_nanos": 25906,
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "19micros",
"time_in_nanos": 19075
}
]
}
]
}
],
"aggregations": []
}
]
}
In the real production environment, the whitespace-term query takes 480ms, while a query for a normal word ("微信") takes only 18ms. Below is the profile analysis result on shard [user_v1][3]:
"profile": {
"shards": [
{
"id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:微信 nick: nick:黃色",
"time": "888.6ms",
"time_in_nanos": 888636963,
"breakdown": {
"score": 513864260,
"build_scorer_count": 50,
"match_count": 0,
"create_weight": 93345,
"next_doc": 364649642,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 5063173,
"score_count": 4670398,
"build_scorer": 296094,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:微信",
"time": "18.4ms",
"time_in_nanos": 18480019,
"breakdown": {
"score": 656810,
"build_scorer_count": 62,
"match_count": 0,
"create_weight": 23633,
"next_doc": 17712339,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 7085,
"score_count": 5705,
"build_scorer": 74384,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick: ",
"time": "480.5ms",
"time_in_nanos": 480508016,
"breakdown": {
"score": 278358058,
"build_scorer_count": 72,
"match_count": 0,
"create_weight": 6041,
"next_doc": 192388910,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 5056541,
"score_count": 4665006,
"build_scorer": 33387,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:黃色",
"time": "3.8ms",
"time_in_nanos": 3872679,
"breakdown": {
"score": 136812,
"build_scorer_count": 50,
"match_count": 0,
"create_weight": 5423,
"next_doc": 3700537,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 923,
"score_count": 755,
"build_scorer": 28178,
"advance": 0,
"advance_count": 0
}
}
]
}
],
"rewrite_time": 583986593,
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "730.3ms",
"time_in_nanos": 730399762,
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "533.2ms",
"time_in_nanos": 533238387
}
]
}
]
}
],
"aggregations": []
},
I use HanLP segmentation via the elasticsearch-analysis-hanlp plugin, while the same text segmented with ik_max_word does not show this problem, so this should be a bug in the plugin. I filed an issue on GitHub; follow it there if you are interested. It looks like I will have to study the source code of Elasticsearch's whole analyze flow and of plugin loading :😦 In the meantime, whitespace tokens can also be filtered out at analysis time; a sketch follows.
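As a stopgap, whitespace-only tokens can be dropped with built-in token filters: trim strips the surrounding whitespace (turning " " into an empty token), and a length filter with min 1 then removes the empty token. A minimal sketch, assuming the plugin also registers hanlp_standard as a tokenizer (verify this against your plugin version; user_v2 is a hypothetical index name):
PUT user_v2
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "hanlp_no_blank": {
          "type": "custom",
          "tokenizer": "hanlp_standard",
          "filter": ["trim", "drop_empty"]
        }
      }
    }
  }
}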
That is the query performance problem a single whitespace term can cause. The profile analysis also showed that Collector time with SSDs is about 10x faster than with spinning disks.
Shard [user_v1][0] has a Collector time as long as 7.6s, and the machine holding that shard uses a spinning disk, whereas shard [user_v1][3] above sits on an SSD and its Collector time is only 730.3ms, a roughly 10x gap between SSD and spinning disk. (A quick way to map shards to machines is shown after the profile output.) Here is the profile analysis for shard [user_v1][0]:
{
"id": "[wx0dqdubRkiqJJ-juAqH4A][user_v1][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:微信 nick: nick:黃色",
"time": "726.1ms",
"time_in_nanos": 726190295,
"breakdown": {
"score": 339421458,
"build_scorer_count": 48,
"match_count": 0,
"create_weight": 65012,
"next_doc": 376526603,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 4935754,
"score_count": 4665766,
"build_scorer": 575653,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:微信",
"time": "63.2ms",
"time_in_nanos": 63220487,
"breakdown": {
"score": 649184,
"build_scorer_count": 61,
"match_count": 0,
"create_weight": 32572,
"next_doc": 62398621,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 6759,
"score_count": 5857,
"build_scorer": 127432,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick: ",
"time": "1m",
"time_in_nanos": 60373841264,
"breakdown": {
"score": 60184752245,
"build_scorer_count": 69,
"match_count": 0,
"create_weight": 5888,
"next_doc": 179443959,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 4929373,
"score_count": 4660228,
"build_scorer": 49501,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:黃色",
"time": "528.1ms",
"time_in_nanos": 528107489,
"breakdown": {
"score": 141744,
"build_scorer_count": 43,
"match_count": 0,
"create_weight": 4717,
"next_doc": 527942227,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 967,
"score_count": 780,
"build_scorer": 17010,
"advance": 0,
"advance_count": 0
}
}
]
}
],
"rewrite_time": 993826311,
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "7.8s",
"time_in_nanos": 7811511525,
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "7.6s",
"time_in_nanos": 7616467158
}
]
}
]
}
],
"aggregations": []
},
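Incidentally, to map shards to the nodes (and hence the disk types) they live on, the cat shards API helps; for example:
GET _cat/shards/user_v1?v&h=index,shard,prirep,state,node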
Conclusion
Query performance is tied not only to segment count and Collector time, but also to the index mapping and the query type (match, filter, term, ...). The Profile API can be used to analyze query performance problems, and there are also load-testing tools such as esrally.
For Chinese in particular, pay attention to which tokens a query string is analyzed into, and hence which terms are actually queried. This can be tested via term vectors, though term vectors are generally not enabled in production. In short, the Chinese word segmentation algorithm affects what a search will match.
As for ranking, first use the explain API to analyze the score of each term; then consider Elasticsearch's function score to adjust by particular fields (field_value_factor), or even a machine-learned ranking model (learning to rank). A minimal function_score sketch follows.
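In this sketch the popularity field and its parameters are hypothetical, purely to illustrate field_value_factor:
GET user_v1/profile/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "nick": "微信" } },
      "field_value_factor": {
        "field": "popularity",
        "modifier": "log1p",
        "factor": 1.2,
        "missing": 0
      },
      "boost_mode": "sum"
    }
  }
}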
Some thoughts on improving Elasticsearch query efficiency:
- Sufficient filesystem cache (on-heap vs. off-heap memory) and sensible data placement (hot/cold separation)
- Sound index design (multi-fields, analyzers, number of index shards) and segment count (refresh_interval setting)
- Suitable query constructs (term, match, filter), with tuning via search parameters (terminate_after for early termination, timeout to bound response time); see the sketch after this list
- Profile analysis
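For instance, a minimal sketch combining the two search parameters mentioned above (the values are illustrative only):
GET user_v1/profile/_search
{
  "timeout": "500ms",
  "terminate_after": 10000,
  "query": {
    "match": { "nick": "微信" }
  }
}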