The problem described here draws on the article 《es實戰-分片分配失敗解決方案》, which gives a thorough set of solutions for shard allocation failures.
After a restart of the development-environment Elasticsearch cluster, both the primary and replica shards of some indices failed to allocate, leaving the cluster in red status. The logs revealed the first class of error: an exception thrown by the analyzer caused all five recovery retries to fail, so the shard could not be loaded:
[2021-11-05T17:20:33,188][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [node-2] explaining the allocation for
[ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[dev_srms_service_process][0], node[null], [P],
recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED],
at[2021-11-04T10:43:44.702Z], failed_attempts[5], failed_nodes[[teRXfA-SRPSZ0Gt7bogtYA, AvVMttYeQ4eaVorPLofABw]], delayed=false,
details[failed shard on node [teRXfA-SRPSZ0Gt7bogtYA]: failed recovery, failure RecoveryFailedException[[dev_srms_service_process][0]:
Recovery failed on {node-2}{teRXfA-SRPSZ0Gt7bogtYA}{FaumuozZQL-4x5dfkgD5UA}{172.16.2.68}{172.16.2.68:9301}{dilm}{ml.machine_memory=33564663808,
xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway];
nested: EngineException[failed to recover from translog]; nested: RuntimeException[調用分詞器異常]; ], allocation_status[deciders_no]]]
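Before reading logs node by node, it helps to first list which shards are unassigned. A minimal sketch using the `_cat/shards` API (the `localhost:9200` endpoint is an assumption; the request is only built here, and the commented `urlopen` call would send it against a live cluster):

```python
import urllib.request

ES = "http://localhost:9200"  # hypothetical endpoint, adjust to your cluster

# _cat/shards lists every shard with its state (STARTED, UNASSIGNED, ...);
# the h= parameter selects columns, and unassigned.reason shows why a
# shard is not allocated.
req = urllib.request.Request(
    f"{ES}/_cat/shards?v=true&h=index,shard,prirep,state,unassigned.reason"
)
# with urllib.request.urlopen(req) as resp:   # requires a running cluster
#     print(resp.read().decode())
print(req.get_method(), req.full_url)
```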
Running the allocation explain API, `GET /_cluster/allocation/explain`, revealed the second class of error; this one was most likely caused by unstable node state during startup, which made shard loading fail:
{
  "index" : "zhugeio_person_search_user_v2",
  "shard" : 2,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-11-05T02:23:36.603Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "4cesg3C0RIeODfnXvSRlNw",
      "node_name" : "node-3",
      "transport_address" : "172.16.2.69:9301",
      "node_attributes" : {
        "ml.machine_memory" : "33564680192",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    },
    {
      "node_id" : "AvVMttYeQ4eaVorPLofABw",
      "node_name" : "node-1",
      "transport_address" : "172.16.2.69:9300",
      "node_attributes" : {
        "ml.machine_memory" : "33564680192",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    },
    {
      "node_id" : "teRXfA-SRPSZ0Gt7bogtYA",
      "node_name" : "node-2",
      "transport_address" : "172.16.2.69:9302",
      "node_attributes" : {
        "ml.machine_memory" : "33564680192",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    }
  ]
}
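The JSON above is the response of `GET /_cluster/allocation/explain` with no request body, which explains an arbitrary unassigned shard. To explain one specific shard, a body can name the index, the shard number, and whether it is the primary. A sketch (the endpoint is an assumption; the request is only built here, not sent):

```python
import json
import urllib.request

ES = "http://localhost:9200"  # hypothetical endpoint

# Ask specifically about replica shard 2 of the index from the output above.
body = {"index": "zhugeio_person_search_user_v2", "shard": 2, "primary": False}
req = urllib.request.Request(
    f"{ES}/_cluster/allocation/explain",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="GET",  # Elasticsearch accepts GET or POST with a body here
)
print(req.get_method(), req.full_url)
```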
Finally, retrying allocation of the failed shards resolved the problem: `POST /_cluster/reroute?retry_failed=true`
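A sketch of the same reroute call from code (the endpoint is an assumption; `retry_failed=true` resets the counter of failed allocation attempts so the allocator tries those shards again):

```python
import urllib.request

ES = "http://localhost:9200"  # hypothetical endpoint

# POST with an empty body; the retry_failed=true query parameter tells the
# allocator to retry shards that already used up their allocation attempts.
req = urllib.request.Request(
    f"{ES}/_cluster/reroute?retry_failed=true",
    data=b"",
    method="POST",
)
print(req.get_method(), req.full_url)
```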
If a replica shard stays stuck in INITIALIZING and the index never returns to green, try setting the index's replica count to 0, wait for the cluster to turn green, and then restore the original replica count; the index should go from yellow to green quickly.
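That replica trick is two `_settings` updates on the index. A sketch (the index name is reused from the output above; the endpoint and the original replica count of 1 are assumptions):

```python
import json
import urllib.request

ES = "http://localhost:9200"             # hypothetical endpoint
INDEX = "zhugeio_person_search_user_v2"  # the index stuck in yellow

def settings_request(replicas: int) -> urllib.request.Request:
    """Build a PUT /<index>/_settings request changing the replica count."""
    body = {"index": {"number_of_replicas": replicas}}
    return urllib.request.Request(
        f"{ES}/{INDEX}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

drop = settings_request(0)     # step 1: drop replicas, wait for green
restore = settings_request(1)  # step 2: restore the original replica count
print(drop.get_method(), drop.full_url)
```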
Reference articles: 《恢復狀態為INITIALIZING的分片》 (recovering shards stuck in INITIALIZING), 《ES使用Lucene修復錯誤的分片》 (using Lucene to repair corrupted shards)
Summary:
- By default, allocation of a shard is attempted at most 5 times (the index setting `index.allocation.max_retries`).
- When nodes are still unstable right after startup, it is easy for the retries to exceed this limit of 5, leaving the shard permanently unassigned.
- Once all nodes in the cluster are up and stable, and as long as the shard data is not corrupted, retrying the failed allocations resolves the problem.
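The 5-attempt limit in the first bullet is the per-index setting `index.allocation.max_retries`; besides `reroute?retry_failed=true`, raising this setting also lets the allocator keep trying. A sketch (the index name `my-index` and the endpoint are placeholders):

```python
import json
import urllib.request

ES = "http://localhost:9200"  # hypothetical endpoint
INDEX = "my-index"            # placeholder index name

# Raise the allocation retry limit from the default of 5 to 10.
body = {"index.allocation.max_retries": 10}
req = urllib.request.Request(
    f"{ES}/{INDEX}/_settings",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(req.get_method(), req.full_url)
```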