Resolving unassigned shards after an Elasticsearch restart

Reposted article, 2018-09-30 22:43. Tag: elasticsearch

Environment

Elasticsearch 6.3.2, three-node cluster
Ubuntu 16.04
An index named user, configured with 3 primary shards and 2 replicas per primary shard
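
The shard arithmetic behind this configuration can be sketched in a few lines of Python. `number_of_shards` and `number_of_replicas` are the standard index settings; the helper name `total_shard_copies` is made up for illustration and is not part of any Elasticsearch client.

```python
# Index settings matching the environment described above:
# 3 primary shards, each with 2 replicas.
user_index_settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2,
    }
}


def total_shard_copies(settings: dict) -> int:
    """Return the number of primaries plus all replicas (hypothetical helper)."""
    s = settings["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


print(total_shard_copies(user_index_settings))  # 3 * (1 + 2) = 9
```

With three nodes, that is nine shard copies in total, which matches the nine rows that `_cat/shards` reports for this index.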

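Under normal conditions, the shards are distributed as follows:

The three shards of the user index are spread evenly across the machines, so the cluster can tolerate the loss of one machine without losing any data.

Because of a failure (an analyzer plugin had been modified but did not load correctly), the node-151 node went down. After fixing the problem, I started the node normally with ./bin/elasticsearch -d, but the cluster was left with three unassigned shards. I had expected these shards to be assigned automatically once node-151 was back up, but they never were.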

Solution

First, GET user/_recovery?active_only=true showed that the cluster was not performing any replica recovery.

Running GET _cluster/allocation/explain?pretty returned:
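
As a rough sketch, the same check can be done programmatically on the parsed JSON response. This assumes the usual shape of the recovery API response (a top-level key per index, each with a `shards` list). The helper name `has_active_recovery` and the sample response are made up for illustration.

```python
def has_active_recovery(recovery_response: dict) -> bool:
    """Return True if any index in the response lists at least one shard recovery."""
    return any(info.get("shards") for info in recovery_response.values())


# With active_only=true and nothing recovering, the response carries no shard entries.
idle_response = {}

# Hypothetical, heavily trimmed response for a recovery in progress.
busy_response = {"user": {"shards": [{"id": 0, "type": "PEER", "stage": "INDEX"}]}}

print(has_active_recovery(idle_response))
print(has_active_recovery(busy_response))
```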

"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-09-29T08:02:03.794Z], failed_attempts[5], delayed=false, details[failed shard on node [mKkj4112T7aLeC2oNouOrg]: failed to update mapping for index, failure MapperParsingException[Failed to parse mapping [profile]: analyzer [hanlp_standard] not found for field [details]]; nested: MapperParsingException[analyzer [hanlp_standard] not found for field [details]]; ]

So the analyzer plugin error was the cause. Looking more closely at the returned information, one line stands out:

allocation_status: "no_attempt"

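The cause: automatic allocation of the shard had already reached the maximum of 5 retries and still failed, which left the shard's allocation status at no_attempt. Running POST /_cluster/reroute?retry_failed=true in Kibana Dev Tools resolves this. The maximum number of retries is controlled by the index.allocation.max_retries setting.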
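
Outside Kibana, the same retry call can be issued from a script. The following is a minimal sketch using only the Python standard library. The address `http://localhost:9200` and the helper name `build_retry_request` are assumptions for illustration; the request is only built here, not sent.

```python
import urllib.request


def build_retry_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) POST /_cluster/reroute?retry_failed=true."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    return urllib.request.Request(url, method="POST")


# "http://localhost:9200" is an assumed address; replace it with a real node.
request = build_retry_request("http://localhost:9200")
print(request.get_method(), request.full_url)

# To actually send it (requires a reachable cluster):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Retrying only helps once the underlying failure has been fixed. In this case that meant the analyzer plugin had to load correctly first.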

The cluster will attempt to allocate a shard a maximum of index.allocation.max_retries times in a row (defaults to 5), before giving up and leaving the shard unallocated.

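After a reroute command has been processed, Elasticsearch rebalances the shards as usual. Rebalancing is controlled by the cluster.routing.rebalance.enable setting, whose default value is all.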

It is important to note that after processing any reroute commands Elasticsearch will perform rebalancing as normal (respecting the values of settings such as cluster.routing.rebalance.enable) in order to remain in a balanced state.

After a while, running GET /_cat/shards?index=user shows that all shards of the user index are assigned normally again.

user 1 p STARTED 13610428 2.6gb node-248
user 1 r STARTED 13610428 2.5gb node-151
user 1 r STARTED 13610428 2.8gb node-140
user 2 p STARTED 13606674 2.8gb node-248
user 2 r STARTED 13606674 2.7gb node-151
user 2 r STARTED 13606684 3.8gb node-140
user 0 p STARTED 13603429 2.6gb node-248
user 0 r STARTED 13603429 2.6gb node-151
user 0 r STARTED 13603429 2.7gb node-140

Columns, from left to right: index name; shard number; whether the shard is a primary (p) or a replica (r); shard state; number of documents on the shard; store size on disk; node name.
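
For scripted checks, the output can be split into fields. The sketch below assumes exactly the seven-column layout shown above; a real cluster may print additional columns (for example the node IP), and the names `ShardRow` and `parse_cat_shards` are made up for illustration.

```python
from typing import List, NamedTuple


class ShardRow(NamedTuple):
    index: str
    shard: int
    prirep: str  # "p" = primary, "r" = replica
    state: str
    docs: int
    store: str
    node: str


def parse_cat_shards(text: str) -> List[ShardRow]:
    """Parse seven-column `_cat/shards` rows; other rows are skipped in this sketch."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 7:
            continue
        index, shard, prirep, state, docs, store, node = parts
        rows.append(ShardRow(index, int(shard), prirep, state, int(docs), store, node))
    return rows


# Three rows copied from the output above (shard 1 of the user index).
sample = """\
user 1 p STARTED 13610428 2.6gb node-248
user 1 r STARTED 13610428 2.5gb node-151
user 1 r STARTED 13610428 2.8gb node-140
"""

rows = parse_cat_shards(sample)
all_started = all(row.state == "STARTED" for row in rows)
nodes = sorted(row.node for row in rows)
print(all_started, nodes)
```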

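Summary

Normally Elasticsearch assigns unassigned shards automatically. When some shards stay unassigned for a long time, first check whether the index was configured with more primary shards and replicas than the machines in the cluster can hold. The other cause is the one described in this article: after a failure, automatic shard allocation reached the maximum number of retries, in which case running reroute with retry_failed=true is enough.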
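
The first cause can be checked with simple arithmetic, because Elasticsearch does not place two copies of the same shard on the same node. A minimal sketch follows; the helper name `unassignable_copies_per_shard` is hypothetical.

```python
def unassignable_copies_per_shard(number_of_replicas: int, data_nodes: int) -> int:
    """Copies of each shard that cannot be placed, since copies never share a node."""
    copies = 1 + number_of_replicas  # one primary plus its replicas
    return max(0, copies - data_nodes)


# The setup in this article: 2 replicas on 3 nodes, every copy fits.
print(unassignable_copies_per_shard(2, 3))

# The same index on only 2 nodes: one replica per shard stays unassigned.
print(unassignable_copies_per_shard(2, 2))
```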

References

/_cat/shards command: https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html

2018.9.30
Original: https://www.cnblogs.com/hapjin/p/9726469.html
