Understanding Redis High Availability in Depth: Sentinel

Reposted article, 2018-10-09 16:55. Category: Redis

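Redis Sentinel is Redis's high-availability solution. It was officially introduced in Redis 2.8.

In the earlier master-slave replication setup, if the master failed, someone had to manually promote a slave to master, repoint the other slaves at the new master, and update the master address used by the application. The whole process required human intervention.

Below, we use the logs to look at Sentinel's failover process in detail.

Sentinel's failover process

The cluster topology is as follows.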

Role        IP         Port   runID
Master      127.0.0.1  6379
Slave-1     127.0.0.1  6380
Slave-2     127.0.0.1  6381
Sentinel-1  127.0.0.1  26379  d4424b8684977767be4f5abd1e364153fbb0adbd
Sentinel-2  127.0.0.1  26380  18311edfbfb7bf89fe4b67d08ef432053db62fff
Sentinel-3  127.0.0.1  26381  3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8

Kill the master process with kill -9.

1. The slaves react first.

They immediately print the following.

28244:S 08 Oct 16:03:34.184 # Connection with master lost.
28244:S 08 Oct 16:03:34.184 * Caching the disconnected master state.
28244:S 08 Oct 16:03:34.548 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:34.548 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:03:34.548 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:03:35.556 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:35.556 * MASTER <-> SLAVE sync started
...

2. The Sentinel logs show output only about 30s later, which follows from the setting "sentinel down-after-milliseconds mymaster 30000".
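As a rough check of that 30s figure, the timestamps from the logs can simply be subtracted. This is only an illustrative sketch: the helper name seconds_between is made up here, and it assumes the slave's "Connection with master lost" line marks roughly when the master died.

```python
from datetime import datetime

# Redis log timestamps look like "08 Oct 16:03:34.184" (no year in the log line).
FMT = "%d %b %H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two Redis log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


connection_lost = "08 Oct 16:03:34.184"  # slave: "Connection with master lost."
sdown_logged = "08 Oct 16:04:04.277"     # Sentinel-1: "+sdown master mymaster ..."
down_after_ms = 30000                    # sentinel down-after-milliseconds mymaster 30000

elapsed = seconds_between(connection_lost, sdown_logged)
print(f"elapsed: {elapsed:.3f}s, threshold: {down_after_ms / 1000:.0f}s")
```

The gap comes out slightly above the configured threshold, which is what one would expect if the Sentinel only raises +sdown once down-after-milliseconds has fully elapsed without a valid reply.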

Below are the log outputs of each Sentinel node and of the slaves, in turn.

Sentinel-1

28087:X 08 Oct 16:04:04.277 # +sdown master mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:04.379 # +new-epoch 1
28087:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28087:X 08 Oct 16:04:05.388 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28087:X 08 Oct 16:04:05.388 # Next failover delay: I will not start a failover before Mon Oct  8 16:10:04 2018
28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:35.656 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

Sentinel-2

28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.554 # -odown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

Sentinel-3

28234:X 08 Oct 16:04:04.288 # +sdown master mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:04.378 # +new-epoch 1
28234:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28234:X 08 Oct 16:04:04.385 # +odown master mymaster 127.0.0.1 6379 #quorum 2/2
28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct  8 16:10:04 2018
28234:X 08 Oct 16:04:05.630 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:05.630 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:35.709 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

slave2 (127.0.0.1:6380)

28244:S 08 Oct 16:04:04.762 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:04.762 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue...
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

slave3 (127.0.0.1:6381)

28253:S 08 Oct 16:04:03.655 * MASTER <-> SLAVE sync started
28253:S 08 Oct 16:04:03.655 # Error condition on socket for SYNC: Connection refused
28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.
28253:M 08 Oct 16:04:05.770 * Slave 127.0.0.1:6380 asks for synchronization
28253:M 08 Oct 16:04:05.770 * Partial resynchronization request from 127.0.0.1:6380 accepted. Sending 156 bytes of backlog starting from offset 24302.

From the logs above, we can see the following.

Every Sentinel node marked 127.0.0.1 6379 as subjectively down (Subjectively Down, abbreviated sdown).

28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379

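Once the quorum setting is met, Sentinel-2 marks the master as objectively down (Objectively Down, abbreviated odown). Comparing the logs of the other two Sentinels shows that Sentinel-2 was the first to reach this verdict. A Sentinel leader election follows. Generally, whichever Sentinel completes the objective-down determination first becomes the leader, and only the Sentinel leader may perform the failover.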

28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
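To compare the Sentinels' timelines without reading the logs by eye, the log lines can be parsed mechanically. The sketch below is illustrative: the regular expression and the parse_line helper are assumptions based only on the log lines shown in this article, not an official log format specification.

```python
import re
from datetime import datetime

# pid:role, timestamp, level marker, message (layout inferred from the logs above)
LINE_RE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[XSMC]) "
    r"(?P<ts>\d{2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[.\-*#]) (?P<msg>.*)$"
)


def parse_line(line):
    """Split one log line into fields; return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "role": m["role"],
        "ts": datetime.strptime(m["ts"], "%d %b %H:%M:%S.%f"),
        "event": m["msg"].split(" ", 1)[0],  # e.g. "+odown"
        "msg": m["msg"],
    }


# The +odown line of each Sentinel, copied from the logs above.
odown_lines = {
    "Sentinel-1": "28087:X 08 Oct 16:04:05.388 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2",
    "Sentinel-2": "28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2",
    "Sentinel-3": "28234:X 08 Oct 16:04:04.385 # +odown master mymaster 127.0.0.1 6379 #quorum 2/2",
}

first = min(odown_lines, key=lambda name: parse_line(odown_lines[name])["ts"])
print(first)  # the Sentinel with the earliest +odown timestamp
```

On these three lines the earliest +odown belongs to Sentinel-2, the same Sentinel that goes on to log +try-failover and +elected-leader.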

Look for a suitable slave to become the master

28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379

+failover-state-select-slave <instance details> -- New failover state is select-slave: we are trying to find a suitable slave for promotion.

127.0.0.1 6381 is chosen as the new master

28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

+selected-slave <instance details> -- We found the specified good slave to promote.

Order node 6381 to run slaveof no one so that it becomes the master

28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

+failover-state-send-slaveof-noone <instance details> -- We are trying to reconfigure the promoted slave as master, waiting for it to switch.

Wait for node 6381 to be promoted to master

28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

Confirm that node 6381 has been promoted to master

28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

Now look at slave3's log output for the period from 16:04:04.528 to 16:04:05.543. It enabled MASTER mode and rewrote its configuration file.

28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.

The failover enters the slave reconfiguration phase

28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379

Order node 6380 to replicate from the new master

28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379

+slave-reconf-sent <instance details> -- The leader sentinel sent the SLAVEOF command to this instance in order to reconfigure it for the new slave.

slave2's log output at this moment lines up closely. It performed a partial (incremental) resynchronization.

28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue...
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

At the same moment the Sentinels also produce log output; take sentinel1 as an example. The log shows that it updates its configuration at this point.

28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

+switch-master <master name> <oldip> <oldport> <newip> <newport> -- The master new IP and address is the specified one after a configuration change. This is the message most external users are interested in.

The synchronization is not yet complete.

28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379

+slave-reconf-inprog <instance details> -- The slave being reconfigured showed to be a slave of the new master ip:port pair, but the synchronization process is not yet complete.

Master-slave synchronization is complete.

28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379

+slave-reconf-done <instance details> -- The slave is now synchronized with the new master.

The failover is complete.

28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379

After the failover succeeds, the master switch message is published

28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381

The slaves are attached to the new master. Note that the original master is registered as a slave of the new master.

28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

+slave <instance details> -- A new slave was detected and attached.

After another 30s, the original master is marked subjectively down.

28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

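Putting it together, Sentinel's failover flow is as follows

1. Every second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat check to confirm that they are currently reachable. When a node gives no valid reply for longer than down-after-milliseconds, the Sentinel marks that node as subjectively down.

2. If the node marked subjectively down is the master, the Sentinel uses the sentinel is-master-down-by-addr command to ask the other Sentinel nodes for their verdict on the master. When at least <quorum> Sentinels agree that it is down, the Sentinel marks the master as objectively down. If a slave or a Sentinel node is marked subjectively down, no failover follows.

3. The Sentinels elect a leader, which carries out the subsequent failover. The election algorithm is based on Raft.

4. The Sentinel leader starts the failover.

5. It selects a suitable slave as the new master.

6. The Sentinel leader runs slaveof no one on the slave selected in the previous step to make it the master.

7. It sends commands to the remaining slaves to make them slaves of the new master; how this proceeds depends on the parallel-syncs parameter.

8. It demotes the original master to a slave and keeps it under Sentinel's management, so that once it recovers it replicates from the new master.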
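Steps 1 and 2 above can be sketched as two small predicates. This is a simplified illustration, not Sentinel's actual code: the MasterView class and the is_odown function are hypothetical names, and the count includes the deciding Sentinel itself, which matches the "#quorum 2/2" and "#quorum 3/2" figures in the logs.

```python
from dataclasses import dataclass


@dataclass
class MasterView:
    """One Sentinel's local view of a monitored master (hypothetical structure)."""

    last_valid_reply_ms: int    # time of the last valid PING reply
    down_after_ms: int = 30000  # sentinel down-after-milliseconds

    def is_sdown(self, now_ms: int) -> bool:
        # Subjectively down: no valid reply for longer than down-after-milliseconds.
        return now_ms - self.last_valid_reply_ms > self.down_after_ms


def is_odown(own_sdown: bool, others_agreeing: int, quorum: int) -> bool:
    """Objectively down: this Sentinel sees sdown and, counting itself,
    at least `quorum` Sentinels agree."""
    if not own_sdown:
        return False
    return 1 + others_agreeing >= quorum


view = MasterView(last_valid_reply_ms=0)
print(view.is_sdown(now_ms=29000))  # still within the 30s window
print(view.is_sdown(now_ms=30001))  # window exceeded
print(is_odown(view.is_sdown(30001), others_agreeing=1, quorum=2))
```

Only a master reaches the odown check; for slaves and other Sentinels the process stops at sdown, as step 2 notes.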

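Sentinel leader election process.

The Sentinel leader election is based on the Raft protocol.

1. Every online Sentinel node is eligible to become the leader. Once it has confirmed that the master is objectively down (in the logs above, Sentinel-2 logs +odown immediately before +try-failover), it sends the sentinel is-master-down-by-addr command to the other Sentinel nodes, asking them to make it the leader.

2. A Sentinel node that receives the command grants the request if it has not already granted another Sentinel's sentinel is-master-down-by-addr request in the current epoch; otherwise it refuses.

3. If the requesting Sentinel finds that its vote count is greater than or equal to max(quorum, num(sentinels)/2+1), it becomes the leader.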
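The vote threshold and the "first request wins" rule can be modelled in a few lines. The sketch below is a simplification under stated assumptions: the Voter class is hypothetical, the runIDs are shortened from the topology table, and real Sentinels exchange these votes over the network with retries and timeouts.

```python
def votes_needed(quorum: int, num_sentinels: int) -> int:
    """Votes a candidate needs: max(quorum, num(sentinels)/2 + 1)."""
    return max(quorum, num_sentinels // 2 + 1)


class Voter:
    """A Sentinel acting as a voter (hypothetical, simplified)."""

    def __init__(self, runid: str):
        self.runid = runid
        self.votes = {}  # epoch -> runid this Sentinel voted for

    def request_vote(self, epoch: int, candidate: str) -> str:
        # The first request seen in an epoch gets the vote;
        # later requests are answered with the earlier choice.
        return self.votes.setdefault(epoch, candidate)


sentinels = [Voter("d4424b86"), Voter("18311edf"), Voter("3e9eb1aa")]
candidate = "18311edf"  # Sentinel-2 in the logs
epoch = 1

granted = sum(1 for s in sentinels if s.request_vote(epoch, candidate) == candidate)
is_leader = granted >= votes_needed(quorum=2, num_sentinels=len(sentinels))
print(granted, is_leader)

# A competing request in the same epoch arrives too late and is not granted.
late_reply = sentinels[0].request_vote(epoch, "d4424b86")
print(late_reply == "d4424b86")
```

With three Sentinels and a quorum of 2 the threshold is 2 votes, so the three votes Sentinel-2 collected in the logs are more than enough.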

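New master selection process.

1. Discard all slaves that are already down or disconnected.

2. Discard slaves that have not replied to the leader Sentinel's INFO command within the last 5 seconds.

3. Discard all slaves whose link to the failed master has been down for more than down-after-milliseconds*10 milliseconds.

4. Among the remaining slaves, pick the one with the highest priority (that is, the lowest slave-priority value).

5. If priorities tie, pick the one with the largest replication offset.

6. If offsets also tie, pick the one with the smallest runid.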
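The six rules above amount to a filter followed by a three-level sort. The sketch below is illustrative only: the Slave dataclass and its field names are hypothetical, the sample values are borrowed from the sentinel slaves mymaster output shown later in this article, and excluding slaves with priority 0 is an added assumption based on how slave-priority is documented.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    """Hypothetical summary of what a Sentinel knows about one slave."""

    name: str
    runid: str
    priority: int                 # slave-priority; lower value is preferred
    repl_offset: int              # slave-repl-offset
    sdown: bool = False
    disconnected: bool = False
    info_age_ms: int = 0          # time since the last INFO reply
    master_link_down_ms: int = 0  # master-link-down-time


def select_slave(slaves, down_after_ms=30000):
    """Apply rules 1-3 as filters, then rules 4-6 as a sort key."""
    candidates = [
        s for s in slaves
        if not s.sdown and not s.disconnected             # rule 1
        and s.info_age_ms <= 5000                         # rule 2
        and s.master_link_down_ms <= down_after_ms * 10   # rule 3
        and s.priority != 0                               # assumption: 0 means never promote
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: (s.priority, -s.repl_offset, s.runid))


slaves = [
    Slave("127.0.0.1:6380", "983b87fd070c7f052b26f5135bbb30fdeb170a54", 100, 70375),
    Slave("127.0.0.1:6381", "b88059cce9104dd4e0366afd6ad07a163dae8b15", 100, 71040),
]
chosen = select_slave(slaves)
print(chosen.name)
```

With equal priorities the decision falls to the replication offset, so the slave that has received more of the master's data is preferred.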

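Three periodic monitoring tasks

1. Every 10 seconds, each Sentinel node sends the info command to the master and the slaves to obtain the latest topology. This serves the following purposes:

1> Running info against the master returns the slaves' details, which is why Sentinel nodes do not need the slaves to be configured explicitly.
2> A newly added slave is picked up promptly.
3> After a node becomes unreachable or a failover occurs, the topology is refreshed through the info command.

2. Every 2 seconds, each Sentinel node publishes to the __sentinel__:hello channel of the Redis data nodes its verdict on the master together with its own details. Each Sentinel node also subscribes to this channel to learn about the other Sentinel nodes and their verdicts on the master. This serves the following purposes:

1> Discovering new Sentinel nodes: by subscribing to the master's __sentinel__:hello channel, a Sentinel learns about the others. If one is newly added, its details are saved and a connection to it is created.
2> Exchanging the master's state among Sentinel nodes, as the basis for the later objective-down decision and leader election.

3. Every second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat check to confirm that they are currently reachable. This task is the key input for failure detection.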

Sentinel parameters

# bind 127.0.0.1 192.168.1.1
# protected-mode no
port 26379
# sentinel announce-ip <ip>
# sentinel announce-port <port>
dir /tmp
sentinel monitor mymaster 127.0.0.1 6379 2
# sentinel auth-pass <master-name> <password>
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
# sentinel notification-script mymaster /var/redis/notify.sh
# sentinel client-reconfig-script mymaster /var/redis/reconfig.sh
sentinel deny-scripts-reconfig yes

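Where:

dir: sets Sentinel's working directory.

sentinel monitor mymaster 127.0.0.1 6379 2: the trailing 2 is the quorum, meaning that at least two Sentinel nodes must consider the master subjectively down before it can be marked objectively down. The usual recommendation is half the number of Sentinel nodes plus 1. The quorum also affects the leader election: to elect a Sentinel leader, at least max(quorum, num(sentinels) / 2 + 1) Sentinel nodes must take part in the vote.

sentinel down-after-milliseconds mymaster 30000: each Sentinel node periodically sends ping commands to determine whether the Redis nodes and the other Sentinel nodes are reachable. If no valid reply arrives from the master within the specified time, it is marked subjectively down. Note that this parameter applies not only to the master but also to the slaves under that master and to the other Sentinels. The default is 30s.

sentinel parallel-syncs mymaster 1: the number of slaves that may be repointed at the new master simultaneously during a failover. Replication does not block the master, but if this value is set high, many slaves syncing from the new master at once increase its network and disk I/O load.

sentinel failover-timeout mymaster 180000: defines the failover timeout. The default is 180000 milliseconds, i.e. 3 min. Note that this is not the total failover time; it applies to several failover scenarios.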

# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
#   already tried against the same master by a given Sentinel, is two
#   times the failover timeout.
#
# - The time needed for a slave replicating to a wrong master according
#   to a Sentinel current configuration, to be forced to replicate
#   with the right master, is exactly the failover timeout (counting since
#   the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
#   did not produced any configuration change (SLAVEOF NO ONE yet not
#   acknowledged by the promoted slave).
#
# - The maximum time a failover in progress waits for all the slaves to be
#   reconfigured as slaves of the new master. However even after this time
#   the slaves will be reconfigured by the Sentinels anyway, but not with
#   the exact parallel-syncs progression as specified.

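First scenario: if Redis Sentinel fails to fail over a master, the earliest time at which it may start the next failover for that master is 2 times failover-timeout later. This can be seen in the Sentinel log (28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018).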
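The time in that log line can be reproduced with simple arithmetic. This is a sketch under one assumption: the delay is counted from the second in which the Sentinel logged its vote (16:04:04), which is inferred from the logs above rather than taken from the Sentinel source.

```python
from datetime import datetime, timedelta

failover_timeout_ms = 180000  # sentinel failover-timeout mymaster 180000

# Sentinel-3 logged +vote-for-leader at 16:04:04.385; sub-second part dropped.
vote_time = datetime(2018, 10, 8, 16, 4, 4)

earliest_retry = vote_time + timedelta(milliseconds=2 * failover_timeout_ms)
print(earliest_retry.isoformat())
```

Two times 180000 ms is 6 minutes, which lands on 16:10:04, the time quoted in the "Next failover delay" message.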

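sentinel notification-script: defines a notification script. When Sentinel raises a WARNING-level event, it calls this script with two arguments: the event type and the event description.

sentinel client-reconfig-script: when the master changes, the script defined by this parameter is called with the following arguments: <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>

Scripts must follow certain rules.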

# SCRIPTS EXECUTION
#
# sentinel notification-script and sentinel reconfig-script are used in order
# to configure scripts that are called to notify the system administrator
# or to reconfigure clients after a failover. The scripts are executed
# with the following rules for error handling:
#
# If script exits with "1" the execution is retried later (up to a maximum
# number of times currently set to 10).
#
# If script exits with "2" (or an higher value) the script execution is
# not retried.
#
# If script terminates because it receives a signal the behavior is the same
# as exit code 1.
#
# A script has a maximum running time of 60 seconds. After this limit is
# reached the script is terminated with a SIGKILL and the execution retried.
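A client-reconfig-script written in Python might be organised along the following lines. This is a hedged sketch: the handle function, the constant names, and the print statement standing in for a real action are all hypothetical, and the sample role and state values are assumptions for illustration. Only the seven-argument order and the exit-code rules come from the text above.

```python
import sys

# Argument order passed by Sentinel to client-reconfig-script.
ARG_NAMES = ["master_name", "role", "state", "from_ip", "from_port", "to_ip", "to_port"]

EXIT_OK = 0
EXIT_RETRY = 1     # Sentinel retries later (up to the maximum number of times)
EXIT_NO_RETRY = 2  # Sentinel does not retry


def handle(argv):
    """Process one invocation and return the exit code to report."""
    if len(argv) != len(ARG_NAMES):
        return EXIT_NO_RETRY  # malformed call; retrying would not help
    args = dict(zip(ARG_NAMES, argv))
    try:
        # Placeholder action: a real script might update a proxy or a config store.
        print(
            f"{args['master_name']}: master moved from "
            f"{args['from_ip']}:{args['from_port']} to {args['to_ip']}:{args['to_port']}"
        )
    except OSError:
        return EXIT_RETRY  # transient failure; ask Sentinel to try again
    return EXIT_OK


# Demo call with sample values; an installed script would instead end with
# sys.exit(handle(sys.argv[1:])).
code = handle(["mymaster", "leader", "failover", "127.0.0.1", "6379", "127.0.0.1", "6381"])
print(code)
```

Whatever the script does, it should finish well within the 60-second limit, since a script killed with SIGKILL is treated like exit code 1 and run again.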

sentinel deny-scripts-reconfig: disallows using SENTINEL SET to set notification-script and client-reconfig-script.

Common Sentinel operations

PING This command simply returns PONG.
SENTINEL masters Show a list of monitored masters and their state.
SENTINEL master <master name> Show the state and info of the specified master.
SENTINEL slaves <master name> Show a list of slaves for this master, and their state.
SENTINEL sentinels <master name> Show a list of sentinel instances for this master, and their state.
SENTINEL get-master-addr-by-name <master name> Return the ip and port number of the master with that name. If a failover is in progress or terminated successfully for this master it returns the address and port of the promoted slave.
SENTINEL reset <pattern> This command will reset all the masters with matching name. The pattern argument is a glob-style pattern. The reset process clears any previous state in a master (including a failover in progress), and removes every slave and sentinel already discovered and associated with the master.
SENTINEL failover <master name> Force a failover as if the master was not reachable, and without asking for agreement to other Sentinels (however a new version of the configuration will be published so that the other Sentinels will update their configurations).
SENTINEL ckquorum <master name> Check if the current Sentinel configuration is able to reach the quorum needed to failover a master, and the majority needed to authorize the failover. This command should be used in monitoring systems to check if a Sentinel deployment is ok.
SENTINEL flushconfig Force Sentinel to rewrite its configuration on disk, including the current Sentinel state. Normally Sentinel rewrites the configuration every time something changes in its state (in the context of the subset of the state which is persisted on disk across restart). However sometimes it is possible that the configuration file is lost because of operation errors, disk failures, package upgrade scripts or configuration managers. In those cases a way to force Sentinel to rewrite the configuration file is handy. This command works even if the previous configuration file is completely missing.
SENTINEL MONITOR <name> <ip> <port> <quorum> This command tells the Sentinel to start monitoring a new master with the specified name, ip, port, and quorum. It is identical to the sentinel monitor configuration directive in sentinel.conf configuration file
SENTINEL REMOVE <name> is used in order to remove the specified master: the master will no longer be monitored, and will totally be removed from the internal state of the Sentinel, so it will no longer listed by SENTINEL masters and so forth.
SENTINEL SET <name> <option> <value> The SET command is very similar to the CONFIG SET command of Redis, and is used in order to change configuration parameters of a specific master. Multiple option / value pairs can be specified (or none at all). All the configuration parameters that can be configured via sentinel.conf are also configurable using the SET command.

sentinel masters

Shows the state of the monitored masters

127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    7) "runid"
    8) "6ab2be5db3a37c10f2473c8fb9daed147a32df3e"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "639"
   19) "last-ping-reply"
   20) "639"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "2075"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "759682"
   29) "config-epoch"
   30) "0"
   31) "num-slaves"
   32) "2"
   33) "num-other-sentinels"
   34) "2"
   35) "quorum"
   36) "2"
   37) "failover-timeout"
   38) "180000"
   39) "parallel-syncs"
   40) "1"
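The reply is a flat list of alternating field names and values. A client might fold it into a dictionary before using it; the sketch below is illustrative, the helper name pairs_to_dict is made up here, and only a few of the fields above are reproduced.

```python
def pairs_to_dict(flat):
    """Turn ["k1", "v1", "k2", "v2", ...] into {"k1": "v1", "k2": "v2", ...}."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of elements")
    return dict(zip(flat[0::2], flat[1::2]))


# A subset of the fields from the sentinel masters reply above.
reply = [
    "name", "mymaster",
    "ip", "127.0.0.1",
    "port", "6379",
    "flags", "master",
    "num-slaves", "2",
    "num-other-sentinels", "2",
    "quorum", "2",
]

master = pairs_to_dict(reply)
total_sentinels = int(master["num-other-sentinels"]) + 1  # the others plus this one
print(master["name"], master["ip"], master["port"], total_sentinels)
```

All values arrive as strings, so numeric fields such as quorum or num-slaves need an explicit int() conversion.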


The state of a single master can also be viewed on its own

sentinel master mymaster

sentinel slaves mymaster

Shows the state of a master's slaves

127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "127.0.0.1:6380"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6380"
    7) "runid"
    8) "983b87fd070c7f052b26f5135bbb30fdeb170a54"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "178"
   19) "last-ping-reply"
   20) "178"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "6160"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "489019"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "127.0.0.1"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "70375"
2)  1) "name"
    2) "127.0.0.1:6381"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6381"
    7) "runid"
    8) "b88059cce9104dd4e0366afd6ad07a163dae8b15"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "178"
   19) "last-ping-reply"
   20) "178"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "2918"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "489019"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "127.0.0.1"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "71040"


sentinel sentinels mymaster

Shows the state of the other Sentinels

127.0.0.1:26379> sentinel sentinels mymaster
1)  1) "name"
    2) "738ccbddaa0d4379d89a147613d9aecfec765bcb"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "26381"
    7) "runid"
    8) "738ccbddaa0d4379d89a147613d9aecfec765bcb"
    9) "flags"
   10) "sentinel"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "475"
   19) "last-ping-reply"
   20) "475"
   21) "down-after-milliseconds"
   22) "30000"
   23) "last-hello-message"
   24) "79"
   25) "voted-leader"
   26) "?"
   27) "voted-leader-epoch"
   28) "0"
2)  1) "name"
    2) "7251bb129ca373ad0d8c7baf3b6577ae2593079f"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "26380"
    7) "runid"
    8) "7251bb129ca373ad0d8c7baf3b6577ae2593079f"
    9) "flags"
   10) "sentinel"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "475"
   19) "last-ping-reply"
   20) "475"
   21) "down-after-milliseconds"
   22) "30000"
   23) "last-hello-message"
   24) "985"
   25) "voted-leader"
   26) "?"
   27) "voted-leader-epoch"
   28) "0"


sentinel get-master-addr-by-name <master name>

Returns the IP address and port of the master named <master name>. If a failover is in progress, the reply shows the new master.

127.0.0.1:26379> sentinel get-master-addr-by-name mymaster
1) "127.0.0.1"
2) "6379"
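For readers curious what this exchange looks like on the wire, here is a minimal sketch using only the standard library. The function names are hypothetical, the reply parser handles only the two-element reply shown above, and a single recv call is a simplification; a production client would normally rely on a Redis client library with Sentinel support.

```python
import socket


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def parse_addr_reply(raw: bytes):
    """Parse a two-element array of bulk strings into (ip, port)."""
    lines = raw.split(b"\r\n")
    if lines[0] != b"*2":
        raise ValueError(f"unexpected reply: {raw!r}")
    return lines[2].decode(), int(lines[4])


def get_master_addr(host: str, port: int, master_name: str, timeout: float = 1.0):
    """Ask one Sentinel for the current master address (needs a running Sentinel)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_command("SENTINEL", "get-master-addr-by-name", master_name))
        return parse_addr_reply(sock.recv(4096))


# Offline demonstration with the reply shown above, no server required.
sample_reply = b"*2\r\n$9\r\n127.0.0.1\r\n$4\r\n6379\r\n"
print(parse_addr_reply(sample_reply))
```

An application would typically call something like get_master_addr("127.0.0.1", 26379, "mymaster") at startup and again after a connection error, so that it follows the master across failovers.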

sentinel reset <pattern>

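Resets the configuration of the masters whose names match <pattern> (glob style).

If a slave goes down, it remains under Sentinel's management, so once it recovers it rejoins the previous replication topology even if its configuration file does not specify slaveof. Likewise, if the master goes down, after a restart it joins the previous replication topology as a slave by default.

Often, however, the goal is precisely to remove the old master or a slave, and this is where sentinel reset is useful. It resets the configuration based on the current master's state (they'll refresh the list of slaves within the next 10 seconds, only adding the ones listed as correctly replicating from the current master INFO output). The key point is that slaves in an abnormal state are dropped from the current configuration. A dropped node then does not rejoin the previous replication topology automatically after it recovers (note that the slaveof setting must also be removed from its configuration file at that point).

Note that the command only affects the current Sentinel node. To remove a node, run reset on all Sentinel nodes.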

sentinel failover <master name>

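Forces a failover of the master named <master name>. Unlike a regular failover, it does not require a Sentinel leader election; the current Sentinel node carries out the failover directly.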

sentinel ckquorum <master name>

Checks whether the number of currently reachable Sentinel nodes meets <quorum> and the majority needed to authorize a failover

127.0.0.1:26379> sentinel ckquorum mymaster
OK 3 usable Sentinels. Quorum and failover authorization can be reached

sentinel flushconfig

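Forces the Sentinel node's configuration to be written to disk. Sentinel itself performs this operation frequently; developers and operators need the command only when an external cause (for example disk damage) has corrupted or lost the configuration file.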

sentinel remove <master name>

Stops the current Sentinel node from monitoring the master named <master name>.

[root@slowtech redis-4.0.11]# grep -Ev "^#|^$" sentinel_26379.conf 
port 26379
dir "/tmp"
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 127.0.0.1 6381 2
sentinel config-epoch mymaster 12
sentinel leader-epoch mymaster 0
sentinel known-slave mymaster 127.0.0.1 6380
sentinel known-slave mymaster 127.0.0.1 6379
sentinel known-sentinel mymaster 127.0.0.1 26381 738ccbddaa0d4379d89a147613d9aecfec765bcb
sentinel known-sentinel mymaster 127.0.0.1 26380 7251bb129ca373ad0d8c7baf3b6577ae2593079f
sentinel current-epoch 12

[root@slowtech redis-4.0.11]# redis-cli -p 26379
127.0.0.1:26379> sentinel remove mymaster
OK
127.0.0.1:26379> quit

[root@slowtech redis-4.0.11]# grep -Ev "^#|^$" sentinel_26379.conf 
port 26379
dir "/tmp"
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel current-epoch 12


sentinel set <name> <option> <value>

Parameter                Usage
quorum                   sentinel set mymaster quorum 3
down-after-milliseconds  sentinel set mymaster down-after-milliseconds 30000
failover-timeout         sentinel set mymaster failover-timeout 18000
parallel-syncs           sentinel set mymaster parallel-syncs 3
notification-script      sentinel set mymaster notification-script /tmp/a.sh
client-reconfig-script   sentinel set mymaster client-reconfig-script /tmp/b.sh
auth-pass                sentinel set mymaster auth-pass masterpassword

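Notes:

1. The sentinel set command only affects the current Sentinel node.

2. When sentinel set succeeds, it immediately flushes the configuration file. This differs from regular Redis data nodes, where config rewrite must be run after a change to persist it to the configuration file.

3. Keep the configuration of all Sentinel nodes as consistent as possible.

4. Sentinel does not support the config command. To view parameter settings, use the SENTINEL MASTER command.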
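Because sentinel set acts on one Sentinel at a time, keeping the nodes consistent means repeating the command against each of them. The sketch below only builds the redis-cli command lines for the three Sentinel ports used in this article; the helper name sentinel_set_commands is made up, and the sketch deliberately does not execute anything.

```python
import shlex

SENTINEL_PORTS = [26379, 26380, 26381]  # the three Sentinels in this article


def sentinel_set_commands(master, option, value, ports=SENTINEL_PORTS):
    """Build one redis-cli command line per Sentinel for the same setting."""
    return [
        shlex.join(
            ["redis-cli", "-p", str(port), "sentinel", "set", master, option, str(value)]
        )
        for port in ports
    ]


for line in sentinel_set_commands("mymaster", "quorum", 3):
    print(line)
```

The same pattern applies to sentinel reset, which likewise has to be run on every Sentinel node to take effect across the deployment.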

References:

1. 《Redis開發與運維》 (Redis Development and Operations)

2. 《Redis設計與實現》 (Redis Design and Implementation)

3. 《Redis 4.X Cookbook》

4. Official documentation
