Recovering MongoDB Cluster Nodes from the RECOVERING State


Today the customer service team reported that queries were returning no data. After logging on to the servers and checking the cluster status, I found that the replica set members on two machines were abnormal, stuck in the RECOVERING state.
Note: the cluster consists of 3 nodes and 3 replica sets (shards).
Replica set 2 (shard2) on the cluster host:
mongo --port 27002
shard2:RECOVERING> rs.status()
{
    "set" : "shard2",
    "date" : ISODate("2019-04-26T07:42:03.684Z"),
    "myState" : 3,
    "term" : NumberLong(29),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : {
            "ts" : Timestamp(0, 0),
            "t" : NumberLong(-1)
        },
        "appliedOpTime" : {
            "ts" : Timestamp(1549939736, 11),
            "t" : NumberLong(24)
        },
        "durableOpTime" : {
            "ts" : Timestamp(1549939736, 11),
            "t" : NumberLong(24)
        }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "192.168.1.87:27002",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 5331065,
            "lastHeartbeat" : ISODate("2019-04-26T07:42:02.009Z"),
            "lastHeartbeatRecv" : ISODate("2019-04-26T07:42:02.992Z"),
            "pingMs" : NumberLong(0),
            "configVersion" : 1
        },
        {
            "_id" : 1,
            "name" : "192.168.1.110:27002",
            "health" : 1,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 5331067,
            "optime" : {
                "ts" : Timestamp(1549939736, 11),
                "t" : NumberLong(24)
            },
            "optimeDate" : ISODate("2019-02-12T02:48:56Z"),
            "maintenanceMode" : 1,
            "infoMessage" : "could not find member to sync from",
            "configVersion" : 1,
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "192.168.1.150:27002",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 16994,
            "optime" : {
                "ts" : Timestamp(1556264522, 26),
                "t" : NumberLong(29)
            },
            "optimeDurable" : {
                "ts" : Timestamp(1556264522, 26),
                "t" : NumberLong(29)
            },
            "optimeDate" : ISODate("2019-04-26T07:42:02Z"),
            "optimeDurableDate" : ISODate("2019-04-26T07:42:02Z"),
            "lastHeartbeat" : ISODate("2019-04-26T07:42:02.009Z"),
            "lastHeartbeatRecv" : ISODate("2019-04-26T07:42:02.555Z"),
            "pingMs" : NumberLong(0),
            "electionTime" : Timestamp(1556247538, 1),
            "electionDate" : ISODate("2019-04-26T02:58:58Z"),
            "configVersion" : 1
        }
    ],
    "ok" : 1
}
shard2:RECOVERING>
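In the shard2 output above, the RECOVERING member's optimeDate (2019-02-12) is more than two months behind the primary's (2019-04-26), which points to the member having fallen outside the primary's oplog window. That can be confirmed on the primary, for example with a quick check against the shard2 primary, 192.168.1.150:27002:

mongo --host 192.168.1.150 --port 27002
shard2:PRIMARY> rs.printReplicationInfo()   // prints the oplog size and its first/last event times

If the stale member's optimeDate is older than the primary's "oplog first event time", normal replication cannot catch it up and a full resync is needed.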
Replica set 3 (shard3) on the cluster host:
mongo --port 27003
shard3:RECOVERING> rs.status()
{
    "set" : "shard3",
    "date" : ISODate("2019-04-26T06:59:28.662Z"),
    "myState" : 3,
    "term" : NumberLong(20),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : {
            "ts" : Timestamp(0, 0),
            "t" : NumberLong(-1)
        },
        "appliedOpTime" : {
            "ts" : Timestamp(1554746640, 479),
            "t" : NumberLong(19)
        },
        "durableOpTime" : {
            "ts" : Timestamp(1554746640, 479),
            "t" : NumberLong(19)
        }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "192.168.1.87:27003",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 14402,
            "optime" : {
                "ts" : Timestamp(1556261964, 17),
                "t" : NumberLong(20)
            },
            "optimeDurable" : {
                "ts" : Timestamp(1556261964, 17),
                "t" : NumberLong(20)
            },
            "optimeDate" : ISODate("2019-04-26T06:59:24Z"),
            "optimeDurableDate" : ISODate("2019-04-26T06:59:24Z"),
            "lastHeartbeat" : ISODate("2019-04-26T06:59:24.816Z"),
            "lastHeartbeatRecv" : ISODate("2019-04-26T06:59:27.933Z"),
            "pingMs" : NumberLong(0),
            "electionTime" : Timestamp(1555689632, 1),
            "electionDate" : ISODate("2019-04-19T16:00:32Z"),
            "configVersion" : 1
        },
        {
            "_id" : 1,
            "name" : "192.168.1.110:27003",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 14402,
            "lastHeartbeat" : ISODate("2019-04-26T06:59:24.816Z"),
            "lastHeartbeatRecv" : ISODate("2019-04-26T06:59:26.163Z"),
            "pingMs" : NumberLong(0),
            "configVersion" : 1
        },
        {
            "_id" : 2,
            "name" : "192.168.1.150:27003",
            "health" : 1,
            "state" : 3,
            "stateStr" : "RECOVERING",
            "uptime" : 14442,
            "optime" : {
                "ts" : Timestamp(1554746640, 479),
                "t" : NumberLong(19)
            },
            "optimeDate" : ISODate("2019-04-08T18:04:00Z"),
            "maintenanceMode" : 1,
            "infoMessage" : "could not find member to sync from",
            "configVersion" : 1,
            "self" : true
        }
    ],
    "ok" : 1
}
shard3:RECOVERING> 
It turned out that replica set 2 had been out of sync for several months (a serious operational oversight) and would not catch up automatically, so I turned to the official documentation, which recommends two approaches:
Resync a Member of a Replica Set
A replica set member becomes “stale” when its replication process falls so far behind that the primary overwrites oplog entries the member has not yet replicated. The member cannot catch up and becomes “stale.” When this occurs, you must completely resynchronize the member by removing its data and performing an initial sync.
This tutorial addresses both resyncing a stale member and creating a new member using seed data from another member, both of which can be used to restore a replica set member. When syncing a member, choose a time when the system has the bandwidth to move a large amount of data. Schedule the synchronization during a time of low usage or during a maintenance window.
MongoDB provides two options for performing an initial sync:
  • Restart the mongod with an empty data directory and let MongoDB’s normal initial syncing feature restore the data. This is the more simple option but may take longer to replace the data.
  • Restart the machine with a copy of a recent data directory from another member in the replica set. This procedure can replace the data more quickly but requires more manual steps.
Link: https://docs.mongodb.com/manual/tutorial/resync-replica-set-member/
 
The first approach is an automatic initial sync;
the second is to copy a snapshot of the data files from another member (e.g. the primary).
 
The first suits cases where the data volume is not too large; the second involves more manual steps.
Given the situation here (SSD disks, 100 GB+ of data), I went with the first approach: back up the original data directory and create a fresh, empty one.
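For completeness, here is a rough sketch of what the second approach (seeding from a data file copy) would have looked like; I did not use it here. It assumes the same directory layout as below and that the copy is taken from a member that is shut down, or from a filesystem snapshot, so that the files are consistent:

# after shutting the target member down and moving its data directory aside
# (as in the procedure below), copy the data files from a healthy member
# instead of starting empty; the rsync source and paths are assumptions
rsync -av 192.168.1.150:/opt/mongodb/shard2/data/ /opt/mongodb/shard2/data/

# start the member; it then only needs to replay the oplog since the snapshot
mongod -f /opt/mongodb/conf/shard2.conf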
 
Procedure:
1. mongo --port 27002
shard2:RECOVERING> use admin
switched to db admin
shard2:RECOVERING>
shard2:RECOVERING>
shard2:RECOVERING> db.shutdownServer()
This shuts down the shard2 mongod process.
2. mv /opt/mongodb/shard2/data /opt/mongodb/shard2/data_bak
 
3. mongod -f /opt/mongodb/conf/shard2.conf
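If the dbPath configured in shard2.conf is the directory that was just moved aside, an empty one probably has to be recreated before starting mongod (mkdir /opt/mongodb/shard2/data); this is an assumption about the config layout. To watch the initial sync, tail the mongod log and poll the member state until it reaches SECONDARY; the log path below is likewise an assumption, use whatever systemLog.path in shard2.conf points to:

# follow the initial-sync progress in the mongod log (path assumed)
tail -f /opt/mongodb/shard2/log/mongod.log | grep 'clone progress'

# in another session, print each member's state; the restarted member should go
# STARTUP2 -> RECOVERING -> SECONDARY once the sync finishes
mongo --port 27002 --eval 'rs.status().members.forEach(function(m){ print(m.name, m.stateStr) })'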
 
Checking the data recovery progress in the mongod log:
2019-04-26T15:45:53.234+0800 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Connecting to 192.168.1.87:21000
2019-04-26T15:45:53.235+0800 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to 192.168.1.87:21000, took 1ms (1 connections now open to 192.168.1.87:21000)
2019-04-26T15:46:23.233+0800 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Ending idle connection to host 192.168.1.110:21000 because the pool meets constraints; 1 connections to that host remain open
2019-04-26T15:46:49.910+0800 I -        [repl writer worker 9]   happygo_audit.log collection clone progress: 4045916/45630311 8% (documents copied)
2019-04-26T15:48:03.645+0800 I -        [repl writer worker 1]   happygo_audit.log collection clone progress: 7909796/45630311 17% (documents copied)
2019-04-26T15:48:23.238+0800 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Connecting to 192.168.1.150:21000
2019-04-26T15:48:23.240+0800 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to 192.168.1.150:21000, took 2ms (1 connections now open to 192.168.1.150:21000)
2019-04-26T15:49:18.553+0800 I -        [repl writer worker 7]   happygo_audit.log collection clone progress: 10708792/45630311 23% (documents copied)
2019-04-26T15:50:43.352+0800 I -        [repl writer worker 11]   happygo_audit.log collection clone progress: 13973466/45630311 30% (documents copied)
2019-04-26T15:52:01.789+0800 I -        [repl writer worker 4]   happygo_audit.log collection clone progress: 17330452/45630311 37% (documents copied)
2019-04-26T15:53:38.471+0800 I -        [repl writer worker 9]   happygo_audit.log collection clone progress: 21542508/45630311 47% (documents copied)
2019-04-26T15:55:09.464+0800 I -        [repl writer worker 8]   happygo_audit.log collection clone progress: 24541323/45630311 53% (documents copied)
2019-04-26T15:56:20.805+0800 I -        [repl writer worker 8]   happygo_audit.log collection clone progress: 27658757/45630311 60% (documents copied)
2019-04-26T15:57:42.897+0800 I -        [repl writer worker 10]   happygo_audit.log collection clone progress: 30314479/45630311 66% (documents copied)
2019-04-26T15:58:45.747+0800 I -        [repl writer worker 1]   happygo_audit.log collection clone progress: 31706123/45630311 69% (documents copied)
2019-04-26T15:59:33.188+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to 192.168.1.150:27002
2019-04-26T15:59:33.238+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.1.150:27002, took 50ms (3 connections now open to 192.168.1.150:27002)
2019-04-26T15:59:57.513+0800 I -        [repl writer worker 6]   happygo_audit.log collection clone progress: 33270641/45630311 72% (documents copied)
2019-04-26T16:00:33.238+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.1.150:27002 because the pool meets constraints; 2 connections to that host remain open
2019-04-26T16:01:04.776+0800 I -        [repl writer worker 15]   happygo_audit.log collection clone progress: 34367175/45630311 75% (documents copied)
2019-04-26T16:02:14.698+0800 I -        [repl writer worker 11]   happygo_audit.log collection clone progress: 36152828/45630311 79% (documents copied)
2019-04-26T16:03:32.032+0800 I -        [repl writer worker 14]   happygo_audit.log collection clone progress: 37237049/45630311 81% (documents copied)
2019-04-26T16:03:58.960+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to 192.168.1.150:27002
2019-04-26T16:03:58.964+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.1.150:27002, took 4ms (3 connections now open to 192.168.1.150:27002)
2019-04-26T16:04:38.612+0800 I -        [repl writer worker 8]   happygo_audit.log collection clone progress: 38609400/45630311 84% (documents copied)
2019-04-26T16:04:58.964+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.1.150:27002 because the pool meets constraints; 2 connections to that host remain open
2019-04-26T16:05:43.622+0800 I -        [repl writer worker 12]   happygo_audit.log collection clone progress: 39937334/45630311 87% (documents copied)
2019-04-26T16:06:50.405+0800 I -        [repl writer worker 8]   happygo_audit.log collection clone progress: 41161992/45630311 90% (documents copied)
2019-04-26T16:07:59.129+0800 I -        [repl writer worker 14]   happygo_audit.log collection clone progress: 42424943/45630311 92% (documents copied)
2019-04-26T16:09:15.617+0800 I -        [repl writer worker 1]   happygo_audit.log collection clone progress: 43787849/45630311 95% (documents copied)
2019-04-26T16:10:18.407+0800 I -        [repl writer worker 4]   happygo_audit.log collection clone progress: 45069709/45630311 98% (documents copied)
2019-04-26T16:10:57.001+0800 I -        [InitialSyncInserters-happygo_audit.log0]   Index: (2/3) BTree Bottom Up Progress: 18204600/45634357 39%
2019-04-26T16:11:07.001+0800 I -        [InitialSyncInserters-happygo_audit.log0]   Index: (2/3) BTree Bottom Up Progress: 39578600/45634357 86%
2019-04-26T16:11:09.753+0800 I INDEX    [InitialSyncInserters-happygo_audit.log0]      done building bottom layer, going to commit
2019-04-26T16:11:24.001+0800 I -        [InitialSyncInserters-happygo_audit.log0]   Index: (2/3) BTree Bottom Up Progress: 20851400/45634357 45%
2019-04-26T16:11:34.001+0800 I -        [InitialSyncInserters-happygo_audit.log0]   Index: (2/3) BTree Bottom Up Progress: 42016400/45634357 92%
2019-04-26T16:11:35.696+0800 I INDEX    [InitialSyncInserters-happygo_audit.log0]      done building bottom layer, going to commit
2019-04-26T16:11:47.240+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.1.150:27002 because the pool meets constraints; 1 connections to that host remain open
2019-04-26T16:11:49.001+0800 I -        [InitialSyncInserters-happygo_audit.log0]   Index: (2/3) BTree Bottom Up Progress: 23722200/45634357 51%
2019-04-26T16:11:57.521+0800 I INDEX    [InitialSyncInserters-happygo_audit.log0]      done building bottom layer, going to commit
2019-04-26T16:12:13.001+0800 I -        [InitialSyncInserters-happygo_audit.log0]   Index: (2/3) BTree Bottom Up Progress: 21606700/45634357 47%
2019-04-26T16:12:23.001+0800 I -        [InitialSyncInserters-happygo_audit.log0]   Index: (2/3) BTree Bottom Up Progress: 45531900/45634357 99%
2019-04-26T16:12:23.042+0800 I INDEX    [InitialSyncInserters-happygo_audit.log0]      done building bottom layer, going to commit
2019-04-26T16:12:48.001+0800 I -        [InitialSyncInserters-happygo_audit.log0]   Index: (2/3) BTree Bottom Up Progress: 17939000/45634357 39%
2019-04-26T16:12:58.001+0800 I -        [InitialSyncInserters-happygo_audit.log0]   Index: (2/3) BTree Bottom Up Progress: 35990200/45634357 78%
2019-04-26T16:13:02.865+0800 I INDEX    [InitialSyncInserters-happygo_audit.log0]      done building bottom layer, going to commit
2019-04-26T16:13:03.777+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to 192.168.1.150:27002
2019-04-26T16:13:03.779+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.1.150:27002, took 2ms (2 connections now open to 192.168.1.150:27002)
2019-04-26T16:13:03.780+0800 I REPL     [replication-1] CollectionCloner::start called, on ns:happygo_carStatusLog.CarStatusLog
2019-04-26T16:13:03.793+0800 I INDEX    [InitialSyncInserters-happygo_carStatusLog.CarStatusLog0] build index on: happygo_carStatusLog.CarStatusLog properties: { v: 2, key: { carNumber: "hashed" }, name: "carNumber_hashed", ns: "happygo_carStatusLog.CarStatusLog" }
At this point the resync of replica set 2 (shard2) is complete.
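Once the initial sync finishes, the member should report SECONDARY instead of RECOVERING. A quick check (hypothetical session; the prompt reflects the new state):

mongo --port 27002
shard2:SECONDARY> rs.status().myState              // 2 = SECONDARY (it was 3, i.e. RECOVERING, before)
shard2:SECONDARY> rs.printSlaveReplicationInfo()   // replication lag of each secondary behind the primary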
Repeat the same procedure once for replica set 3 (shard3).
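The commands for shard3 mirror the shard2 ones, just on port 27003; the config and data paths below follow the same naming pattern and are assumptions:

mongo --port 27003
shard3:RECOVERING> use admin
shard3:RECOVERING> db.shutdownServer()

mv /opt/mongodb/shard3/data /opt/mongodb/shard3/data_bak
mongod -f /opt/mongodb/conf/shard3.conf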
 
 
 
 
 

