An Elasticsearch Search Optimization
1. Environment
ES 6.3.2; index user_v1, 5 primary shards, one replica per shard. Each shard is around 11 GB (checked with GET _cat/shards/user_v1).
There are 340 million documents in total, and the primary shards add up to 57 GB.
Segment info: curl -X GET "221.228.105.140:9200/_cat/segments/user_v1?v" >> user_v1_segment
The user_v1 index has 404 segments in total:
cat user_v1_segment | wc -l
404
Massage the data a bit and draw a histogram with Python:
sed -i '1d' user_v1_segment # drop the header line
awk -F ' ' '{print $7}' user_v1_segment >> docs_count # keep only the docs.count column
import matplotlib.pyplot as plt

# read the docs.count column extracted above
with open('docs_count') as f:
    docNums = list(map(int, f.read().splitlines()))

plt.hist(docNums, bins=40, facecolor='blue', edgecolor='black')
plt.xlabel('docs per segment')
plt.ylabel('number of segments')
plt.show()
This gives a rough view of how many documents each segment holds. The x-axis is the document count and the y-axis is the number of segments. Most segments contain only a small number of documents (under \(0.5\times10^{7}\)).
Change refresh_interval to 30s (the default is 1s); this reduces the rate at which new segments are created to some extent (a settings sketch follows the parameter list below). Then force merge the 404 segments down to 200:
POST /user_v1/_forcemerge?only_expunge_deletes=false&max_num_segments=200&flush=true
But on checking again there were still 312 segments. This likely has to do with the merge configuration; if you are interested, look into the meaning of these two parameters during a force merge:
- merge.policy.max_merge_at_once_explicit
- merge.scheduler.max_merge_count
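For reference, the refresh_interval change is an ordinary dynamic settings update; a minimal sketch with the 30s value used here:
PUT /user_v1/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}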
Run a profile analysis:
1. Collector time is too long; some shards take as long as 7.9s. For background on profile analysis, see: profile-api
2. With the HanLP tokenizer plugin, the terms produced by the analyzer include, surprisingly, a "whitespace term", and matching this term takes up to 800ms!
Let's look at the cause:
POST /_analyze
{
"analyzer": "hanlp_standard",
"text":"人生 如夢"
}
The tokenization result contains the whitespace:
{
"tokens": [
{
"token": "人生",
"start_offset": 0,
"end_offset": 2,
"type": "n",
"position": 0
},
{
"token": " ",
"start_offset": 0,
"end_offset": 1,
"type": "w",
"position": 1
},
{
"token": "如",
"start_offset": 0,
"end_offset": 1,
"type": "v",
"position": 2
},
{
"token": "夢",
"start_offset": 0,
"end_offset": 1,
"type": "n",
"position": 3
}
]
}
So is the whitespace actually stored once a real document has been analyzed?
First, define an index with term_vector enabled (see: store term-vector):
PUT user
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"profile": {
"properties": {
"nick": {
"type": "text",
"analyzer": "hanlp_standard",
"term_vector": "yes",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
Then PUT a document in:
PUT user/profile/1
{
"nick":"人生 如夢"
}
Check the term vectors (see: docs-termvectors):
GET /user/profile/1/_termvectors
{
"fields" : ["nick"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
The stored terms do include the whitespace:
{
"_index": "user",
"_type": "profile",
"_id": "1",
"_version": 1,
"found": true,
"took": 2,
"term_vectors": {
"nick": {
"field_statistics": {
"sum_doc_freq": 4,
"doc_count": 1,
"sum_ttf": 4
},
"terms": {
" ": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"人生": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"如": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"夢": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
}
}
}
}
}
Then run the profiled query analysis again:
GET user/profile/_search?human=true
{
"profile":true,
"query": {
"match": {
"nick": "人生 如夢"
}
}
}
The profile actually contains a query against the whitespace term!!! (note the space after "nick:")
"type": "TermQuery",
"description": "nick: ",
"time": "58.2micros",
"time_in_nanos": 58244,
The profile result is as follows:
"profile": {
"shards": [
{
"id": "[7MyDkEDrRj2RPHCPoaWveQ][user][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:人生 nick: nick:如 nick:夢",
"time": "642.9micros",
"time_in_nanos": 642931,
"breakdown": {
"score": 13370,
"build_scorer_count": 2,
"match_count": 0,
"create_weight": 390646,
"next_doc": 18462,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 220447,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:人生",
"time": "206.6micros",
"time_in_nanos": 206624,
"breakdown": {
"score": 942,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 167545,
"next_doc": 1493,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 36637,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick: ",
"time": "58.2micros",
"time_in_nanos": 58244,
"breakdown": {
"score": 918,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 46130,
"next_doc": 964,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 10225,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:如",
"time": "51.3micros",
"time_in_nanos": 51334,
"breakdown": {
"score": 888,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 43779,
"next_doc": 1103,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 5557,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:夢",
"time": "59.1micros",
"time_in_nanos": 59108,
"breakdown": {
"score": 3473,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 49739,
"next_doc": 900,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 4989,
"advance": 0,
"advance_count": 0
}
}
]
}
],
"rewrite_time": 182090,
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "25.9micros",
"time_in_nanos": 25906,
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "19micros",
"time_in_nanos": 19075
}
]
}
]
}
],
"aggregations": []
}
]
}
In the real production environment, the whitespace-term query takes 480ms, while a query for a normal word ("微信") takes only 18ms. Below is the profile analysis result on shard [user_v1][3]:
"profile": {
"shards": [
{
"id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:微信 nick: nick:黃色",
"time": "888.6ms",
"time_in_nanos": 888636963,
"breakdown": {
"score": 513864260,
"build_scorer_count": 50,
"match_count": 0,
"create_weight": 93345,
"next_doc": 364649642,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 5063173,
"score_count": 4670398,
"build_scorer": 296094,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:微信",
"time": "18.4ms",
"time_in_nanos": 18480019,
"breakdown": {
"score": 656810,
"build_scorer_count": 62,
"match_count": 0,
"create_weight": 23633,
"next_doc": 17712339,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 7085,
"score_count": 5705,
"build_scorer": 74384,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick: ",
"time": "480.5ms",
"time_in_nanos": 480508016,
"breakdown": {
"score": 278358058,
"build_scorer_count": 72,
"match_count": 0,
"create_weight": 6041,
"next_doc": 192388910,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 5056541,
"score_count": 4665006,
"build_scorer": 33387,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:黃色",
"time": "3.8ms",
"time_in_nanos": 3872679,
"breakdown": {
"score": 136812,
"build_scorer_count": 50,
"match_count": 0,
"create_weight": 5423,
"next_doc": 3700537,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 923,
"score_count": 755,
"build_scorer": 28178,
"advance": 0,
"advance_count": 0
}
}
]
}
],
"rewrite_time": 583986593,
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "730.3ms",
"time_in_nanos": 730399762,
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "533.2ms",
"time_in_nanos": 533238387
}
]
}
]
}
],
"aggregations": []
},
I use HanLP segmentation via the elasticsearch-analysis-hanlp plugin, while the same text segmented with ik_max_word does not show this problem, so this should be a bug in the plugin. I filed an issue on GitHub; follow it there if you are interested. It looks like I will have to study the source code of Elasticsearch's whole analyze flow and of plugin loading :😦 In the meantime, whitespace tokens can also be filtered out at analysis time; a sketch follows.
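As a stopgap, whitespace-only tokens can be dropped with built-in token filters: trim strips the surrounding whitespace (turning " " into an empty token), and a length filter with min 1 then removes the empty token. A minimal sketch, assuming the plugin also registers hanlp_standard as a tokenizer (verify this against your plugin version; user_v2 is a hypothetical index name):
PUT user_v2
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "hanlp_no_blank": {
          "type": "custom",
          "tokenizer": "hanlp_standard",
          "filter": ["trim", "drop_empty"]
        }
      }
    }
  }
}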
That is the query performance problem a single whitespace term can cause. The profile analysis also showed that Collector time with SSDs is about 10x faster than with spinning disks.
Shard [user_v1][0] has a Collector time as long as 7.6s, and the machine holding that shard uses a spinning disk, whereas shard [user_v1][3] above sits on an SSD and its Collector time is only 730.3ms, a roughly 10x gap between SSD and spinning disk. (A quick way to map shards to machines is shown after the profile output.) Here is the profile analysis for shard [user_v1][0]:
{
"id": "[wx0dqdubRkiqJJ-juAqH4A][user_v1][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:微信 nick: nick:黃色",
"time": "726.1ms",
"time_in_nanos": 726190295,
"breakdown": {
"score": 339421458,
"build_scorer_count": 48,
"match_count": 0,
"create_weight": 65012,
"next_doc": 376526603,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 4935754,
"score_count": 4665766,
"build_scorer": 575653,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:微信",
"time": "63.2ms",
"time_in_nanos": 63220487,
"breakdown": {
"score": 649184,
"build_scorer_count": 61,
"match_count": 0,
"create_weight": 32572,
"next_doc": 62398621,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 6759,
"score_count": 5857,
"build_scorer": 127432,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick: ",
"time": "1m",
"time_in_nanos": 60373841264,
"breakdown": {
"score": 60184752245,
"build_scorer_count": 69,
"match_count": 0,
"create_weight": 5888,
"next_doc": 179443959,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 4929373,
"score_count": 4660228,
"build_scorer": 49501,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:黃色",
"time": "528.1ms",
"time_in_nanos": 528107489,
"breakdown": {
"score": 141744,
"build_scorer_count": 43,
"match_count": 0,
"create_weight": 4717,
"next_doc": 527942227,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 967,
"score_count": 780,
"build_scorer": 17010,
"advance": 0,
"advance_count": 0
}
}
]
}
],
"rewrite_time": 993826311,
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "7.8s",
"time_in_nanos": 7811511525,
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "7.6s",
"time_in_nanos": 7616467158
}
]
}
]
}
],
"aggregations": []
},
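Incidentally, to map shards to the nodes (and hence the disk types) they live on, the cat shards API helps; for example:
GET _cat/shards/user_v1?v&h=index,shard,prirep,state,node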
Conclusion
Query performance is tied not only to segment count and Collector time, but also to the index mapping and the query type (match, filter, term, ...). The Profile API can be used to analyze query performance problems, and there are also load-testing tools such as esrally.
For Chinese in particular, pay attention to which tokens a query string is analyzed into, and hence which terms are actually queried. This can be tested via term vectors, though term vectors are generally not enabled in production. In short, the Chinese word segmentation algorithm affects what a search will match.
As for ranking, first use the explain API to analyze the score of each term; then consider Elasticsearch's function score to adjust by particular fields (field_value_factor), or even a machine-learned ranking model (learning to rank). A minimal function_score sketch follows.
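In this sketch the popularity field and its parameters are hypothetical, purely to illustrate field_value_factor:
GET user_v1/profile/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "nick": "微信" } },
      "field_value_factor": {
        "field": "popularity",
        "modifier": "log1p",
        "factor": 1.2,
        "missing": 0
      },
      "boost_mode": "sum"
    }
  }
}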
Some thoughts on improving Elasticsearch query efficiency:
- Sufficient filesystem cache (on-heap vs. off-heap memory) and sensible data placement (hot/cold separation)
- Sound index design (multi-fields, analyzers, number of index shards) and segment count (refresh_interval setting)
- Suitable query constructs (term, match, filter), with tuning via search parameters (terminate_after for early termination, timeout to bound response time); see the sketch after this list
- Profile analysis
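For instance, a minimal sketch combining the two search parameters mentioned above (the values are illustrative only):
GET user_v1/profile/_search
{
  "timeout": "500ms",
  "terminate_after": 10000,
  "query": {
    "match": { "nick": "微信" }
  }
}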