Learning Elasticsearch: Handling Common Cluster Problems (Practical Tips)

Reposted article. 2019-07-07 22:58. Category: Elasticsearch / Learning Elasticsearch

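1. Handling cluster health problems

When the cluster is in the yellow or red state, the overall procedure is as follows:

(1) First, check the cluster status

curl -XGET 'localhost:9200/_cluster/health?pretty'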

{
    "cluster_name": "elasticsearch",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 278,
    "active_shards": 278,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 278,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 50
}

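The key metric here is unassigned_shards: shards that exist in the cluster state but cannot actually be found in the cluster. The usual source of unassigned shards is unassigned replicas. For example, an index with 5 primary shards and 1 replica on a single-node cluster has 5 unassigned replica shards. If the cluster is red, unassigned shards will also persist for a long time (because primary shards are missing). The other metrics are explained below: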

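initializing_shards:  the number of shards that have just been created. For example, when you create your first index, its shards are briefly in the initializing state. This is normally a transient event, and shards should not stay in initializing for long. You may also see initializing shards right after a node restarts: as shards are loaded from disk, they start out as initializing.

number_of_nodes and number_of_data_nodes:  self-explanatory, the total number of nodes and the number of nodes that hold data.

active_primary_shards:  the number of primary shards in the cluster, summed across all indices.

active_shards:  the number of all shards across all indices, replica shards included.

relocating_shards:  the number of shards currently moving from one node to another. This is usually 0, but it rises when Elasticsearch decides the cluster is unbalanced, for example after a node is added or taken offline.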
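The replica arithmetic behind these numbers can be checked by hand. The sketch below is only an illustration: the function names are hypothetical, and active_percent assumes there are no initializing or relocating shards, as in the output above.

```shell
# total_shards PRIMARIES REPLICAS
# Shard copies an index needs: every primary plus its replicas.
total_shards() {
    echo $(( $1 * (1 + $2) ))
}

# unassigned_on_single_node PRIMARIES REPLICAS
# On a one-node cluster a replica cannot share a node with its primary,
# so every replica copy stays unassigned.
unassigned_on_single_node() {
    echo $(( $1 * $2 ))
}

# active_percent ACTIVE UNASSIGNED
# Integer share of active shards (assumes 0 initializing/relocating shards).
active_percent() {
    echo $(( 100 * $1 / ($1 + $2) ))
}
```

For the example index with 5 primaries and 1 replica, total_shards 5 1 gives 10 and unassigned_on_single_node 5 1 gives 5. With 278 active and 278 unassigned shards, active_percent 278 278 gives 50, which matches active_shards_percent_as_number in the health output.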

(2) Find the problem indices

curl -XGET 'localhost:9200/_cluster/health?level=indices'

{
    "cluster_name": "elasticsearch",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 278,
    "active_shards": 278,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 278,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 50,
    "indices": {
        "gaczrk": {
            "status": "yellow",
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "active_primary_shards": 5,
            "active_shards": 5,
            "relocating_shards": 0,
            "initializing_shards": 0,
            "unassigned_shards": 5
        },
        "special-sms-extractor_zhuanche_20200204": {
            "status": "yellow",
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "active_primary_shards": 5,
            "active_shards": 5,
            "relocating_shards": 0,
            "initializing_shards": 0,
            "unassigned_shards": 5
        },
        "specialhtl201905": {
            "status": "yellow",
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "active_primary_shards": 1,
            "active_shards": 1,
            "relocating_shards": 0,
            "initializing_shards": 0,
            "unassigned_shards": 1
        },
        "v2": {
            "status": "red",
            "number_of_shards": 10,
            "number_of_replicas": 1,
            "active_primary_shards": 0,
            "active_shards": 0,
            "relocating_shards": 0,
            "initializing_shards": 0,
            "unassigned_shards": 20
        },
        "sms20181009": {
            "status": "yellow",
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "active_primary_shards": 5,
            "active_shards": 5,
            "relocating_shards": 0,
            "initializing_shards": 0,
            "unassigned_shards": 5
        },
......

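The level=indices parameter makes the cluster-health API add a list of indices to the cluster information, with details for each index (status, number of shards, number of unassigned shards, and so on). With index-level output it is immediately clear which index has the problem: the v2 index. We can also see that this index once had 10 primary shards and 1 replica, and that all 20 shards are now missing. Presumably these 20 shards lived on the two nodes that have disappeared from the cluster. Elasticsearch normally allocates shards automatically, so first check whether that feature is enabled: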

curl -XGET 'localhost:9200/_cluster/settings?pretty'

{
    "persistent": {},
    "transient": {
        "cluster": {
            "routing": {
                "allocation": {
                    "enable": "all"
                }
            }
        }
    }
}

A value of all for cluster.routing.allocation.enable means automatic shard allocation is switched on.
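If the setting turns out to be something other than all (for example none), allocation can be switched back on through the same cluster settings API. The following is a minimal sketch, not a definitive procedure: the helper names allocation_payload and set_allocation are hypothetical, and the host is assumed to be reachable over plain HTTP.

```shell
# allocation_payload MODE
# Prints a transient cluster-settings body.
# MODE is one of: all, primaries, new_primaries, none.
allocation_payload() {
    printf '{"transient":{"cluster.routing.allocation.enable":"%s"}}\n' "$1"
}

# set_allocation HOST MODE   (hypothetical helper; HOST such as localhost:9200)
# Sends the body above to the cluster settings API.
set_allocation() {
    curl -s -XPUT "http://$1/_cluster/settings" \
        -H 'Content-Type: application/json' \
        -d "$(allocation_payload "$2")"
}
```

Calling set_allocation localhost:9200 all would re-enable allocation for all shard types. A transient setting is lost after a full cluster restart; use "persistent" instead of "transient" in the body if the value should survive one.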

The level parameter accepts further options, for example shards:

curl -XGET 'localhost:9200/_cluster/health?level=shards'

{
    "cluster_name": "elasticsearch",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 278,
    "active_shards": 278,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 278,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 50,
    "indices": {
        "gaczrk": {
            "status": "yellow",
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "active_primary_shards": 5,
            "active_shards": 5,
            "relocating_shards": 0,
            "initializing_shards": 0,
            "unassigned_shards": 5,
            "shards": {
                "0": {
                    "status": "yellow",
                    "primary_active": true,
                    "active_shards": 1,
                    "relocating_shards": 0,
                    "initializing_shards": 0,
                    "unassigned_shards": 1
                },
                "1": {
                    "status": "yellow",
                    "primary_active": true,
                    "active_shards": 1,
                    "relocating_shards": 0,
                    "initializing_shards": 0,
                    "unassigned_shards": 1
                },
                "2": {
                    "status": "yellow",
                    "primary_active": true,
                    "active_shards": 1,
                    "relocating_shards": 0,
                    "initializing_shards": 0,
                    "unassigned_shards": 1
                },
                "3": {
                    "status": "yellow",
                    "primary_active": true,
                    "active_shards": 1,
                    "relocating_shards": 0,
                    "initializing_shards": 0,
                    "unassigned_shards": 1
                },
                "4": {
                    "status": "yellow",
                    "primary_active": true,
                    "active_shards": 1,
                    "relocating_shards": 0,
                    "initializing_shards": 0,
                    "unassigned_shards": 1
                }
            }
        },
......

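The shards option produces far more detailed output, listing the status and location of every shard in every index. This output is occasionally useful, but its sheer verbosity makes it hard to work with.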
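When the JSON output is too long to scan by eye, a compact alternative is the _cat/indices API, which can print one "health index" pair per line. The filter below is a sketch under that assumption; non_green_indices is a hypothetical name, and it expects exactly two columns on stdin.

```shell
# non_green_indices
# Reads "health index" lines on stdin (the shape produced by
# _cat/indices?h=health,index) and prints only the indices that are
# not green, red ones first.
non_green_indices() {
    awk '$1 == "red"    { red[++r] = $2 }
         $1 == "yellow" { yellow[++y] = $2 }
         END {
             for (i = 1; i <= r; i++) print "red", red[i]
             for (i = 1; i <= y; i++) print "yellow", yellow[i]
         }'
}
```

A possible invocation is: curl -s 'localhost:9200/_cat/indices?h=health,index' | non_green_indices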

(3) Manually allocate unassigned shards

List the unassigned shards and the reason each one is unassigned:

curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'

index                                   shard prirep state      unassigned.reason 
gaczrk                                  4     p      STARTED                      
gaczrk                                  4     r      UNASSIGNED CLUSTER_RECOVERED 
gaczrk                                  2     p      STARTED                      
gaczrk                                  2     r      UNASSIGNED CLUSTER_RECOVERED 
gaczrk                                  1     p      STARTED

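Unassigned reasons:

INDEX_CREATED:  unassigned as a result of an API call that created the index.
CLUSTER_RECOVERED:  unassigned as a result of a full cluster recovery.
INDEX_REOPENED:  unassigned as a result of opening a closed index.
DANGLING_INDEX_IMPORTED:  unassigned as a result of importing a dangling index.
NEW_INDEX_RESTORED:  unassigned as a result of restoring into a new index.
EXISTING_INDEX_RESTORED:  unassigned as a result of restoring into a closed index.
REPLICA_ADDED:  unassigned as a result of explicitly adding a replica shard.
ALLOCATION_FAILED:  unassigned as a result of a failed shard allocation.
NODE_LEFT:  unassigned because the node hosting the shard left the cluster.
REINITIALIZED:  unassigned because the shard moved from started back to initializing (for example, with shadow replicas).
REROUTE_CANCELLED:  unassigned as a result of an explicit cancel reroute command.
REALLOCATED_REPLICA:  a better location for the replica was found, so the existing replica allocation was cancelled and the shard became unassigned.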
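With many shards it helps to see how the unassigned ones are distributed over these reasons before deciding what to do. The sketch below tallies them from the _cat/shards output used above; unassigned_by_reason is a hypothetical name, and it assumes the five columns index, shard, prirep, state, unassigned.reason in that order.

```shell
# unassigned_by_reason
# Reads "index shard prirep state unassigned.reason" lines on stdin and
# prints "<reason> <count>" for the UNASSIGNED rows, sorted by reason.
# The header row printed by ?v is skipped because its 4th column is "state".
unassigned_by_reason() {
    awk '$4 == "UNASSIGNED" { count[$5]++ }
         END { for (reason in count) print reason, count[reason] }' | sort
}
```

A possible invocation is: curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | unassigned_by_reason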

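Then allocate the shard manually. In the request below, index is the index name, shard is the shard number, and node is the ID of another node. Be careful with allow_primary: true, because it lets Elasticsearch allocate an empty primary, which loses the shard's data if no intact copy is used instead.

curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
    "commands": [{
        "allocate": {
            "index": "gaczrk",
            "shard": 4,
            "node": "<ID of another node>",
            "allow_primary": true
        }
    }]
}'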

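If there are many unassigned shards, the following script can assign them automatically:

#!/bin/bash
# Assign every unassigned shard to one of the nodes below, round-robin.
# Replace node1..node3 with the names of your own data nodes.
array=( node1 node2 node3 )
node_counter=0
length=${#array[@]}
# Split the _cat/shards output on newlines only, so each $line is one shard.
IFS=$'\n'
for line in $(curl -s 'http://127.0.0.1:9200/_cat/shards' | grep -F UNASSIGNED); do
    # Column 1 is the index name, column 2 the shard number.
    INDEX=$(echo "$line" | awk '{print $1}')
    SHARD=$(echo "$line" | awk '{print $2}')
    NODE=${array[$node_counter]}
    echo "$NODE"
    curl -XPOST 'http://127.0.0.1:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
        "commands": [
        {
            "allocate": {
                "index": "'"$INDEX"'",
                "shard": '"$SHARD"',
                "node": "'"$NODE"'",
                "allow_primary": true
            }
        }
        ]
    }'
    # Move to the next node and wrap back to index 0 after the last one.
    node_counter=$(( (node_counter + 1) % length ))
done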
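The allocate command used in this section belongs to older Elasticsearch releases. From Elasticsearch 5.0 onward it was replaced by allocate_replica, allocate_stale_primary and allocate_empty_primary. The sketch below builds the request body for the replica case only; reroute_replica_payload is a hypothetical helper name, and the index and node names must not contain characters that need JSON escaping.

```shell
# reroute_replica_payload INDEX SHARD NODE
# Prints a _cluster/reroute body that asks Elasticsearch (5.0+) to
# allocate an unassigned replica of INDEX/SHARD on NODE.
reroute_replica_payload() {
    printf '{"commands":[{"allocate_replica":{"index":"%s","shard":%s,"node":"%s"}}]}\n' \
        "$1" "$2" "$3"
}

# Example (not executed here):
#   curl -XPOST 'localhost:9200/_cluster/reroute' \
#        -H 'Content-Type: application/json' \
#        -d "$(reroute_replica_payload gaczrk 4 node1)"
```

Unassigned primaries need allocate_stale_primary or allocate_empty_primary instead, both of which require "accept_data_loss": true, so they are deliberately left out of this sketch.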

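(4) Quickly reallocate shards

If the output above shows that all primary shards are healthy and only replica shards are affected, there is a quick way to recover: drop the replica shards and let Elasticsearch rebuild them on its own. First set the number of replicas of the problem index to 0:

curl -XPUT 'localhost:9200/<problem-index-name>/_settings?pretty' -H 'Content-Type: application/json' -d '{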
    "index" : {
        "number_of_replicas" : 0
    }
}'

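Then watch the cluster status, and finally restore the index's replica count with the following command:

curl -XPUT 'localhost:9200/<problem-index-name>/_settings?pretty' -H 'Content-Type: application/json' -d '{
    "index" : {
        "number_of_replicas" : 1
    }
}'

Once the replica shards have been allocated automatically, the cluster returns to green.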
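The drop-and-restore sequence can be wrapped in a small script. This is a sketch under assumptions rather than a tested procedure: the helper names are hypothetical, the host is assumed to speak plain HTTP, and the 60s timeout is an arbitrary choice. The wait_for_status and timeout parameters of the cluster-health API make the call block until the cluster is green or the timeout expires.

```shell
# replicas_payload N
# Prints an index-settings body that sets number_of_replicas to N.
replicas_payload() {
    printf '{"index":{"number_of_replicas":%s}}\n' "$1"
}

# wait_for_green_url HOST TIMEOUT
# Prints a cluster-health URL that blocks until green or TIMEOUT (e.g. 60s).
wait_for_green_url() {
    printf 'http://%s/_cluster/health?wait_for_status=green&timeout=%s\n' "$1" "$2"
}

# reset_replicas HOST INDEX N   (hypothetical helper)
# Drops the replicas of INDEX, waits, restores N replicas, waits again.
reset_replicas() {
    curl -s -XPUT "http://$1/$2/_settings" \
        -H 'Content-Type: application/json' -d "$(replicas_payload 0)"
    curl -s "$(wait_for_green_url "$1" 60s)"
    curl -s -XPUT "http://$1/$2/_settings" \
        -H 'Content-Type: application/json' -d "$(replicas_payload "$3")"
    curl -s "$(wait_for_green_url "$1" 60s)"
}
```

If the health response contains "timed_out": true, the cluster did not reach green within the timeout and the unassigned shards need a closer look.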

(5) Shards stuck in the INITIALIZING state

curl -XGET 'localhost:9200/_cat/shards/7a_cool?v&pretty'

7a_cool 5  r STARTED      4583018 759.4mb 10.2.4.21 pt01-pte-10-2-4-21
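7a_cool 17 r INITIALIZING                 10.2.4.22 pt01-pte-10-2-4-22  <== abnormal shard

Solution:

1) First stop the Elasticsearch service on the host that holds the abnormal shard:

Log in to host pt01-pte-10-2-4-22 and run /etc/init.d/elasticsearch stop

If the shard moves to another host automatically and its state recovers, the cluster is healthy again. If the shard is still initializing, the problem persists: run the manual allocation command described above. If that does not help either, set the replica count of the problem index to 0 and let the cluster rebalance its shards on its own. Once the rebalancing finishes, the cluster status becomes green.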
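To find the shards that are stuck without scanning the whole listing, the _cat/shards output can be filtered on its state column. The sketch below uses a hypothetical function name and assumes the default column order shown above, with the node name last and containing no spaces.

```shell
# initializing_shards
# Reads default _cat/shards output on stdin and prints
# "index shard prirep node" for every shard in the INITIALIZING state.
# The docs and store columns are empty for such shards, so the node
# name is taken from the last field rather than a fixed position.
initializing_shards() {
    awk '$4 == "INITIALIZING" { print $1, $2, $3, $NF }'
}
```

A possible invocation is: curl -s 'localhost:9200/_cat/shards/7a_cool' | initializing_shards. For the listing above it prints 7a_cool 17 r pt01-pte-10-2-4-22. A shard that shows up on repeated runs over several minutes is a candidate for the steps described in this section.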
