Option 1
Find the indices whose status is red
curl -X GET "http://172.xxx.xxx.174:9288/_cat/indices?v="
red open index 5 1 3058268 97588 2.6gb 1.3gb
An index in red status cannot serve requests. It means at least one primary shard has not been allocated to any node.
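On a cluster with many indices the red rows are easy to miss. Here is a minimal sketch of a filter, assuming the default `_cat/indices` column order seen above (health first, index name third). The function name `red_indices` is made up for illustration.

```shell
# Print the name of every index whose health column is "red".
# Reads _cat/indices output on stdin; assumes the default column order
# (health, status, index, ...) as in the sample output above.
red_indices() {
  awk '$1 == "red" { print $3 }'
}

# Usage against the cluster from this post (not executed here):
#   curl -s "http://172.xxx.xxx.174:9288/_cat/indices?v=" | red_indices
```

The header row printed by `?v` is skipped naturally, because its first field is not `red`.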
Find the UNASSIGNED shards
_cat/shards shows how the shards are allocated across the nodes:
curl -X GET "http://172.xxx.xxx.174:9288/_cat/shards"
index shard prirep state      docs   store   ip              node
index 1     p      STARTED    764505 338.6mb 172.xxx.xxx.174 Calypso
index 1     r      STARTED    764505 338.6mb 172.xxx.xxx.89  Savage Steel
index 2     p      STARTED    763750 336.6mb 172.xxx.xxx.174 Calypso
index 2     r      STARTED    763750 336.6mb 172.xxx.xxx.88  Temugin
index 3     p      STARTED    764537 340.2mb 172.xxx.xxx.89  Savage Steel
index 3     r      STARTED    764537 340.2mb 172.xxx.xxx.88  Temugin
index 4     p      STARTED    765476 339.3mb 172.xxx.xxx.89  Savage Steel
index 4     r      STARTED    765476 339.3mb 172.xxx.xxx.88  Temugin
index 0     p      UNASSIGNED
index 0     r      UNASSIGNED
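The index named index has primary shard 0 and replica shard 0 in the UNASSIGNED state, meaning neither copy has been allocated to a machine. Because the primary shard is unallocated, the index status is red.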
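Picking the UNASSIGNED rows out by eye works for one index but not for dozens. Here is a small sketch, assuming the `_cat/shards` column order shown above (state is the fourth field). The name `unassigned_shards` is hypothetical.

```shell
# Print "index shard prirep" for every shard copy in UNASSIGNED state.
# Reads _cat/shards output on stdin; state is the fourth column.
unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3 }'
}

# Usage (not executed here):
#   curl -s "http://172.xxx.xxx.174:9288/_cat/shards" | unassigned_shards
```

A row ending in `p` is a primary and is the one that turns the index red. A row ending in `r` is a replica.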
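The ip column shows three machines, with addresses ending in 174, 89, and 88. The index has 10 shard rows in total, which corresponds to the Elasticsearch settings index.number_of_shards: 5 and index.number_of_replicas: 1 (5 primaries plus 5 replicas). Ten shards can be spread across three machines as 3, 3, and 4. Machines 88 and 89 each already hold 3 shards while 174 holds only 2, so the unassigned primary shard can go to machine 174.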
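The per-machine count above can also be derived from the `_cat/shards` output instead of by hand. This is a rough sketch under the same column-order assumption (ip is the seventh field, which still holds when node names contain spaces). `shards_per_ip` is an illustrative name.

```shell
# Count STARTED shards per IP, least loaded machine first.
# Reads _cat/shards output on stdin.
# Columns: index shard prirep state docs store ip node (ip is field 7).
# Note: a machine that holds no shards at all does not appear in the output.
shards_per_ip() {
  awk '$4 == "STARTED" { count[$7]++ }
       END { for (ip in count) print count[ip], ip }' | sort -n
}

# Usage (not executed here):
#   curl -s "http://172.xxx.xxx.174:9288/_cat/shards" | shards_per_ip
```

For the listing in this post the first line would be the machine ending in 174, with 2 shards.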
Find the node id
Find the id of the node running on machine 174. It is needed later when reallocating the primary shard.
curl -X GET "http://172.xxx.xxx.174:9288/_nodes/process?v="
{
  "cluster_name": "es2.3.2-titan-cl",
  "nodes": {
    "Leivp0laTYSqvMVm49SulQ": {
      "name": "Calypso",
      "transport_address": "172.xxx.xxx.174:9388",
      "host": "172.xxx.xxx.174",
      "ip": "172.xxx.xxx.174",
      "version": "2.3.2",
      "build": "b9e4a6a",
      "http_address": "172.xxx.xxx.174:9288",
      "process": {
        "refresh_interval_in_millis": 1000,
        "id": 32130,
        "mlockall": false
      }
    },
    "EafIS3ByRrm4g-14KmY_wg": {
      "name": "Savage Steel",
      "transport_address": "172.xxx.xxx.89:9388",
      "host": "172.xxx.xxx.89",
      "ip": "172.xxx.xxx.89",
      "version": "2.3.2",
      "build": "b9e4a6a",
      "http_address": "172.xxx.xxx.89:9288",
      "process": {
        "refresh_interval_in_millis": 1000,
        "id": 7560,
        "mlockall": false
      }
    },
    "tojQ9EiXS0m6ZP16N7Ug3A": {
      "name": "Temugin",
      "transport_address": "172.xxx.xxx.88:9388",
      "host": "172.xxx.xxx.88",
      "ip": "172.xxx.xxx.88",
      "version": "2.3.2",
      "build": "b9e4a6a",
      "http_address": "172.xxx.xxx.88:9288",
      "process": {
        "refresh_interval_in_millis": 1000,
        "id": 47701,
        "mlockall": false
      }
    }
  }
}
The id of the node on machine 174 is Leivp0laTYSqvMVm49SulQ.
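For simplicity you could also place the primary shard directly on the master node. However, concentrating too many shards on one machine hurts performance and raises the chance of data loss if that machine goes down, so it is better to allocate according to the current shard distribution across machines.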
curl -X GET "http://172.xxx.xxx.174:9288/_cat/master?v="
id                     host           ip             node
EafIS3ByRrm4g-14KmY_wg 172.xxx.xxx.89 172.xxx.xxx.89 Savage Steel
Allocate the UNASSIGNED shard to a node
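Only a primary shard in the UNASSIGNED state can be reallocated this way. If you try to allocate a primary shard that is not UNASSIGNED, for example shard 1, you get the following error.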
curl -X POST -d '{
  "commands": [{
    "allocate": {
      "index": "index",
      "shard": 1,
      "node": "EafIS3ByRrm4g-14KmY_wg",
      "allow_primary": true
    }
  }]
}' "http://172.xxx.xxx.174:9288/_cluster/reroute"
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[Savage Steel][172.xxx.xxx.89:9388][cluster:admin/reroute]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "[allocate] failed to find [index][1] on the list of unassigned shards"
  },
  "status": 400
}
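To avoid sending a reroute request that is bound to fail like this, the shard state can be checked first. This is a minimal sketch that reads `_cat/shards` output. The function name `is_unassigned` and its argument order are assumptions made for illustration.

```shell
# Succeed (exit status 0) only if the given shard copy is listed as UNASSIGNED.
# Reads _cat/shards output on stdin.
#   $1 = index name, $2 = shard number, $3 = "p" (primary) or "r" (replica)
is_unassigned() {
  awk -v idx="$1" -v shard="$2" -v kind="$3" '
    $1 == idx && $2 == shard && $3 == kind && $4 == "UNASSIGNED" { found = 1 }
    END { exit !found }'
}

# Usage (not executed here):
#   if curl -s "http://172.xxx.xxx.174:9288/_cat/shards" | is_unassigned index 0 p; then
#     echo "primary shard 0 is unassigned and can be allocated"
#   fi
```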
Now allocate shard 0 of index to a node. Be careful with the allow_primary parameter of _cluster/reroute: it can cause data loss. See the official documentation for this API for the details.
curl -X POST -d '{
  "commands": [{
    "allocate": {
      "index": "index",
      "shard": 0,
      "node": "Leivp0laTYSqvMVm49SulQ",
      "allow_primary": true
    }
  }]
}' "http://172.xxx.xxx.174:9288/_cluster/reroute"
{
  "acknowledged": true,
  .........
  "index": {
    "shards": {
      "0": [
        {
          "state": "INITIALIZING",
          "primary": true,
          "node": "Leivp0laTYSqvMVm49SulQ",
          "relocating_node": null,
          "shard": 0,
          "index": "index",
          "version": 1,
          "allocation_id": {
            "id": "wk5q0CryQpmworGFalfWQQ"
          },
          "unassigned_info": {
            "reason": "INDEX_CREATED",
            "at": "2017-03-23T12:27:33.405Z",
            "details": "force allocation from previous reason INDEX_REOPENED, null"
          }
        },
        {
          "state": "UNASSIGNED",
          "primary": false,
          "node": null,
          "relocating_node": null,
          "shard": 0,
          "index": "index",
          "version": 1,
          "unassigned_info": {
            "reason": "INDEX_REOPENED",
            "at": "2017-03-23T11:56:25.568Z"
          }
        }
      ]
    }
  }
  .............
}
Only the key part of the output is shown here. The primary shard is now in the INITIALIZING state. Check the index status again:
curl -X GET "http://172.xxx.xxx.174:9288/_cat/indices?v="
green open index 5 1 3058268 97588 2.6gb 1.3gb
The index status is now green and the index is back in normal use.
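The request body used for the allocation above is easy to mistype by hand. This small sketch builds it from three arguments, using the Elasticsearch 2.x `allocate` command exactly as it appears in this post. Newer Elasticsearch versions changed the reroute allocation commands, so check the documentation for your version. The helper name `reroute_body` is hypothetical, and it does no JSON escaping of its arguments.

```shell
# Build the JSON body for an "allocate" reroute command (2.x syntax, as in this post).
# allow_primary is hard-coded to true, which can cause data loss: use with care.
# Arguments are inserted verbatim, so they must not contain quotes or backslashes.
#   $1 = index name, $2 = shard number, $3 = node id
reroute_body() {
  printf '{"commands":[{"allocate":{"index":"%s","shard":%s,"node":"%s","allow_primary":true}}]}\n' \
    "$1" "$2" "$3"
}

# Usage (not executed here):
#   curl -X POST -d "$(reroute_body index 0 Leivp0laTYSqvMVm49SulQ)" \
#     "http://172.xxx.xxx.174:9288/_cluster/reroute"
```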
The steps above are based on the article ELASTICSEARCH幾個問題的解決.
Option 2
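A cluster most likely turns red because one of its machines went down before part of the data had finished syncing. In that case, bring the failed machine back up and let it finish syncing with the existing cluster, and the cluster recovers.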
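Alternatively, add an empty machine to the existing cluster. The indices rebalance automatically, and as long as the cluster has not lost any data, this also brings it back to normal.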
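Either way, recovery takes a while, so it helps to poll the cluster status rather than guess when it is done. The following is a rough sketch, not a definitive implementation. `ES_URL`, `fetch_status`, and `wait_for_green` are names invented here, and the status is read from `_cat/health` with only the status column selected.

```shell
# Base URL of any node in the cluster (assumption: override ES_URL for your setup).
ES_URL="${ES_URL:-http://172.xxx.xxx.174:9288}"

# Print the current cluster status (green, yellow, or red) with whitespace stripped.
fetch_status() {
  curl -s "$ES_URL/_cat/health?h=status" | tr -d '[:space:]'
}

# Poll until the cluster is green or the attempts run out.
#   $1 = maximum number of attempts (default 30)
#   $2 = seconds to sleep between attempts (default 10)
# Returns 0 once green is seen, 1 otherwise.
wait_for_green() {
  attempts="${1:-30}"
  delay="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(fetch_status)" = "green" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (not executed here):
#   wait_for_green 60 5 && echo "cluster is green"
```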
Reposting is welcome, but please include a link to this article. Thank you.
2017.3.24 12:15