#### 1. Problems the cluster has to solve
- 1. When a master node goes down, the slots it owned have no node serving them and the whole cluster enters the fail state and becomes unavailable. What should be done?
- 2. How do we decide that a master node is really down?
- 3. How do we elect, from all the replicas of that master, a suitable node to become the new master?
#### 2. Cluster replication
- 1. Replication works the same way as master-replica replication between standalone nodes
- 2. The replica also runs in cluster mode, so it is configured the same way as a master node
- 3. Use cluster meet to add the node to the cluster
- 4. On the node that is about to become a replica, run cluster replicate <node-id>; this makes it a replica of the node identified by node-id (see the sketch below)
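A minimal redis-cli sketch of the steps above; the hosts, ports and `<master-node-id>` are placeholders rather than values from the cluster shown later:
```
# The new node must already be running in cluster mode
# (redis.conf: cluster-enabled yes, cluster-config-file nodes.conf)

# 1. Introduce the new node to any node that is already in the cluster
redis-cli -h <new-node-host> -p 6380 cluster meet <existing-node-host> 6379

# 2. Look up the node-id of the master it should replicate
redis-cli -h <new-node-host> -p 6380 cluster nodes

# 3. Make the new node a replica of that master
redis-cli -h <new-node-host> -p 6380 cluster replicate <master-node-id>
```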
#### 3. Failure detection
- 1. Every node in the cluster sends PING messages to the other nodes; if the matching PONG does not arrive within the configured time, the sender marks that node as suspected down (PFAIL, see the sketch after this list)
- 2. The PING messages carry information about the current cluster and its nodes, so the same mechanism both checks that nodes are alive and keeps the cluster view consistent, although with some delay
- 3. Suspected down is not the same as really down; a node is only marked as really down (FAIL) when the following condition holds
  - more than half of the masters that hold slots consider the node suspected down
- 4. When a node learns through these messages that more than half of the cluster considers a node suspected down, it broadcasts a message marking that node as down
- 5. When the other nodes receive that broadcast, they update their own view of the cluster and mark the node as down
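The "configured time" above is controlled by cluster-node-timeout; a small sketch of where it lives and how the resulting state can be observed (the timeout value shown is only an example):
```
# redis.conf: a node that does not answer PING within this many milliseconds
# is marked as suspected down (PFAIL) by its peers
cluster-node-timeout 15000

# Nodes flagged as fail? (suspected) or fail show up in the cluster view
redis-cli -h 127.0.0.1 -p 6379 cluster nodes | grep fail

# When slots lose their owner, cluster_state switches from ok to fail
redis-cli -h 127.0.0.1 -p 6379 cluster info | grep cluster_state
```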
#### 4. Failover
- 1. A new master is elected from among all the replicas of the failed master
- 2. The elected node executes slaveof no one, changing its own state from slave to master
- 3. It revokes the slot assignments of the failed master and assigns those slots to itself
- 4. The new master broadcasts a PONG message to the cluster, telling every node that it has become a master and has taken over the old master
- 5. The new master starts accepting and processing command requests for the slots it now owns (a manual counterpart is sketched below)
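For reference, the same promotion can also be triggered by hand with CLUSTER FAILOVER on a healthy replica; this is the controlled counterpart of the automatic failover described above (host and port here refer to the replica used in the walkthrough below):
```
# Run on the replica that should take over: it bumps the config epoch,
# claims its master's slots and broadcasts the change to the cluster
redis-cli -h 172.16.10.154 -p 6379 cluster failover

# FORCE skips the handshake with the old master when it is unreachable,
# but still requires the votes described in the next section
redis-cli -h 172.16.10.154 -p 6379 cluster failover force
```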
#### 5. How the new master is elected from among the replicas
- 1. The config epoch is incremented by one for every election, starting from 0
- 2. Every node that is eligible to vote may cast only one vote per election, first come first served; to be eligible a node must satisfy both conditions below
  - Condition 1: it is a master node
  - Condition 2: it has been assigned slots
- 3. When a replica notices that its master is down, it broadcasts a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message to the cluster, asking the nodes with voting rights to vote for it
- 4. If a receiving node is eligible and has not yet voted for another replica, it replies with a CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK message, meaning it has cast its vote for that replica
- 5. The replica that started the election counts the votes it has received; if they exceed half, it becomes the master and carries out the failover
- 6. If no replica reaches more than half, the config epoch is incremented and a new election is started (the epochs can be inspected as shown below)
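The epochs involved can be checked at any time with CLUSTER INFO; in the walkthrough below the winning replica logs "Starting a failover election for epoch 13", which is what cluster_current_epoch reflects. Output trimmed to the relevant fields, values illustrative:
```
127.0.0.1:6379> cluster info
cluster_state:ok
cluster_current_epoch:13
cluster_my_epoch:13
...
```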
#### 6. A cluster failover walkthrough, shown with captured output
- 1. Current cluster state
```
127.0.0.1:6379> cluster nodes
9555a765592418cd9e975ace7df053d202bcc876 172.16.10.154:6379@16379 slave d0beb418f4682c1d93ab133df804626252fbc265 0 1542932936823 10 connected
0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 172.16.10.143:6379@16379 master - 0 1542932937000 2 connected 5803-10922
3e52dcc1694d213e47d0b7efbd1cb85fa69a8dba 172.16.10.142:6381@16381 slave 1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 0 1542932934000 11 connected
afac608e5f37e399fe11bc5125a6f5a6548deef4 172.16.10.153:6379@16379 slave fea781a76cf0dc84b047bdee3f82112be362d483 0 1542932937825 12 connected
d0beb418f4682c1d93ab133df804626252fbc265 172.16.10.142:6379@16379 myself,master - 0 1542932934000 9 connected 461-5460
1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 172.16.10.142:6380@16380 master - 0 1542932938827 11 connected 0-340 5461-5801 10923-11456
f7c1cf0ba03202dd9b6c520a038225d6dabbea5d 172.16.10.155:6379@16379 slave 0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 0 1542932935000 6 connected
fea781a76cf0dc84b047bdee3f82112be362d483 172.16.10.144:6379@16379 master - 0 1542932936000 12 connected 341-460 5802 11457-16383
```
- 2. Kill the node `d0beb418f4682c1d93ab133df804626252fbc265 172.16.10.142:6379@16379 myself,master - 0 1542932934000 9 connected 461-5460`
```
# Log of the corresponding slave, 172.16.10.154:6379
1058:S 23 Nov 08:34:51.104 # Connection with master lost.
1058:S 23 Nov 08:34:51.104 * Caching the disconnected master state.
1058:S 23 Nov 08:34:51.240 * Connecting to MASTER 172.16.10.142:6379
1058:S 23 Nov 08:34:51.240 * MASTER <-> SLAVE sync started
1058:S 23 Nov 08:34:51.240 # Error condition on socket for SYNC: Connection refused
1058:S 23 Nov 08:35:08.287 * Connecting to MASTER 172.16.10.142:6379
1058:S 23 Nov 08:35:08.287 * MASTER <-> SLAVE sync started
1058:S 23 Nov 08:35:08.287 # Error condition on socket for SYNC: Connection refused
1058:S 23 Nov 08:35:09.173 * FAIL message received from fea781a76cf0dc84b047bdee3f82112be362d483 about d0beb418f4682c1d93ab133df804626252fbc265
1058:S 23 Nov 08:35:09.174 # Cluster state changed: fail
1058:S 23 Nov 08:35:09.189 # Start of election delayed for 830 milliseconds (rank #0, offset 1912008).
1058:S 23 Nov 08:35:09.289 * Connecting to MASTER 172.16.10.142:6379
1058:S 23 Nov 08:35:09.289 * MASTER <-> SLAVE sync started
1058:S 23 Nov 08:35:09.289 # Error condition on socket for SYNC: Connection refused
1058:S 23 Nov 08:35:10.091 # Starting a failover election for epoch 13.
1058:S 23 Nov 08:35:10.093 # Failover election won: I'm the new master.
1058:S 23 Nov 08:35:10.094 # configEpoch set to 13 after successful failover
1058:M 23 Nov 08:35:10.094 # Setting secondary replication ID to 1dff33da4b26e7a3b350cc5eb3d1d908d6a40915, valid up to offset: 1912009. New replication ID is b328006d8d93ecdfb0d98d55cc0a01b897b17d95
1058:M 23 Nov 08:35:10.094 * Discarding previously cached master state.
1058:M 23 Nov 08:35:10.094 # Cluster state changed: ok
```
About 20 seconds later the slave is promoted to master.
- 3. Logs on the other nodes
```
14818:M 23 Nov 08:35:06.313 * FAIL message received from fea781a76cf0dc84b047bdee3f82112be362d483 about d0beb418f4682c1d93ab133df804626252fbc265
14818:M 23 Nov 08:35:06.313 # Cluster state changed: fail
14818:M 23 Nov 08:35:07.232 # Failover auth granted to 9555a765592418cd9e975ace7df053d202bcc876 for epoch 13
14818:M 23 Nov 08:35:07.234 # Cluster state changed: ok
```
- 4. Cluster state after the failover
```
172.16.10.153:6379> cluster nodes
fea781a76cf0dc84b047bdee3f82112be362d483 172.16.10.144:6379@16379 master - 0 1542933587000 12 connected 341-460 5802 11457-16383
afac608e5f37e399fe11bc5125a6f5a6548deef4 172.16.10.153:6379@16379 myself,slave fea781a76cf0dc84b047bdee3f82112be362d483 0 1542933586000 4 connected
3e52dcc1694d213e47d0b7efbd1cb85fa69a8dba 172.16.10.142:6381@16381 slave 1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 0 1542933587455 11 connected
1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 172.16.10.142:6380@16380 master - 0 1542933588456 11 connected 0-340 5461-5801 10923-11456
d0beb418f4682c1d93ab133df804626252fbc265 172.16.10.142:6379@16379 master,fail - 1542933288229 1542933287000 9 disconnected
0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 172.16.10.143:6379@16379 master - 0 1542933588000 2 connected 5803-10922
9555a765592418cd9e975ace7df053d202bcc876 172.16.10.154:6379@16379 master - 0 1542933590459 13 connected 461-5460
f7c1cf0ba03202dd9b6c520a038225d6dabbea5d 172.16.10.155:6379@16379 slave 0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 0 1542933589458 6 connected
```
The original slave 172.16.10.154:6379 has been promoted to master; the original master 172.16.10.142:6379 is now in the fail state.
- 5. Restart the original master
```
# Log of the original master: it turns into a slave and does a full resync from the new master
24809:M 23 Nov 08:45:08.078 # Configuration change detected. Reconfiguring myself as a replica of 9555a765592418cd9e975ace7df053d202bcc876
24809:S 23 Nov 08:45:08.078 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
24809:S 23 Nov 08:45:08.078 # Cluster state changed: ok
24809:S 23 Nov 08:45:09.079 * Connecting to MASTER 172.16.10.154:6379
24809:S 23 Nov 08:45:09.079 * MASTER <-> SLAVE sync started
24809:S 23 Nov 08:45:09.079 * Non blocking connect for SYNC fired the event.
24809:S 23 Nov 08:45:09.080 * Master replied to PING, replication can continue...
24809:S 23 Nov 08:45:09.080 * Trying a partial resynchronization (request 1aa4d36b7bf6dcf3e87f41c5768436b8e26ccbb5:1).
24809:S 23 Nov 08:45:09.082 * Full resync from master: b328006d8d93ecdfb0d98d55cc0a01b897b17d95:1912008
24809:S 23 Nov 08:45:09.082 * Discarding previously cached master state.
24809:S 23 Nov 08:45:09.100 * MASTER <-> SLAVE sync: receiving 178 bytes from master
24809:S 23 Nov 08:45:09.100 * MASTER <-> SLAVE sync: Flushing old data
24809:S 23 Nov 08:45:09.100 * MASTER <-> SLAVE sync: Loading DB in memory
24809:S 23 Nov 08:45:09.100 * MASTER <-> SLAVE sync: Finished with success
24809:S 23 Nov 08:45:09.100 * Background append only file rewriting started by pid 24813
24809:S 23 Nov 08:45:09.122 * AOF rewrite child asks to stop sending diffs.
24813:C 23 Nov 08:45:09.123 * Parent agreed to stop sending diffs. Finalizing AOF...
24813:C 23 Nov 08:45:09.123 * Concatenating 0.00 MB of AOF diff received from parent.
24813:C 23 Nov 08:45:09.123 * SYNC append only file rewrite performed
24813:C 23 Nov 08:45:09.123 * AOF rewrite: 0 MB of memory used by copy-on-write
24809:S 23 Nov 08:45:09.179 * Background AOF rewrite terminated with success
24809:S 23 Nov 08:45:09.179 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
24809:S 23 Nov 08:45:09.179 * Background AOF rewrite finished successfully
# Log of the new master
1058:M 23 Nov 08:45:10.982 * Clear FAIL state for node d0beb418f4682c1d93ab133df804626252fbc265: master without slots is reachable again.
1058:M 23 Nov 08:45:11.964 * Slave 172.16.10.142:6379 asks for synchronization
1058:M 23 Nov 08:45:11.965 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '1aa4d36b7bf6dcf3e87f41c5768436b8e26ccbb5', my replication IDs are 'b328006d8d93ecdfb0d98d55cc0a01b897b17d95' and '1dff33da4b26e7a3b350cc5eb3d1d908d6a40915')
1058:M 23 Nov 08:45:11.965 * Starting BGSAVE for SYNC with target: disk
1058:M 23 Nov 08:45:11.966 * Background saving started by pid 4063
4063:C 23 Nov 08:45:11.968 * DB saved on disk
4063:C 23 Nov 08:45:11.969 * RDB: 6 MB of memory used by copy-on-write
1058:M 23 Nov 08:45:11.983 * Background saving terminated with success
1058:M 23 Nov 08:45:11.983 * Synchronization with slave 172.16.10.142:6379 succeeded
# Logs on the other nodes
14818:M 23 Nov 08:45:08.135 * Clear FAIL state for node d0beb418f4682c1d93ab133df804626252fbc265: master without slots is reachable again.
```
- 6. Final cluster state
```
3e52dcc1694d213e47d0b7efbd1cb85fa69a8dba 172.16.10.142:6381@16381 slave 1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 0 1542934118000 11 connected
0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 172.16.10.143:6379@16379 master - 0 1542934119000 2 connected 5803-10922
9555a765592418cd9e975ace7df053d202bcc876 172.16.10.154:6379@16379 master - 0 1542934118417 13 connected 461-5460
fea781a76cf0dc84b047bdee3f82112be362d483 172.16.10.144:6379@16379 master - 0 1542934120421 12 connected 341-460 5802 11457-16383
afac608e5f37e399fe11bc5125a6f5a6548deef4 172.16.10.153:6379@16379 slave fea781a76cf0dc84b047bdee3f82112be362d483 0 1542934120000 12 connected
f7c1cf0ba03202dd9b6c520a038225d6dabbea5d 172.16.10.155:6379@16379 slave 0640b032745e6a0c5aae3fe1d5ad839a4f1fed7d 0 1542934116413 6 connected
1f974aaa1ca841259b2f4c6ce6302f66b0aa2e27 172.16.10.142:6380@16380 master - 0 1542934118000 11 connected 0-340 5461-5801 10923-11456
d0beb418f4682c1d93ab133df804626252fbc265 172.16.10.142:6379@16379 myself,slave 9555a765592418cd9e975ace7df053d202bcc876 0 1542934117000 9 connected
```