A Study of Master-Slave Failover in Redis Cluster


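Contents

1. Introduction

2. The slave starts an election

3. The master responds to the election

4. Election example

5. How hash slots propagate

6. Failover record 1

6.1. Parameters

6.2. Timeline

6.3. Log of another master

6.4. Log of another master

6.5. Slave log

7. Failover record 2

7.1. Parameters

7.2. Timeline

7.3. Log of another master

7.4. Log of another master

7.5. Slave log

8. Code that delays the slave's election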

 

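1. Introduction

The official Redis reference is the cluster specification: https://redis.io/topics/cluster-spec. Note that starting with Redis 5.0, "slave" has been renamed to "replica", and the configuration options, parts of the documentation and some variables have been renamed accordingly.

Failover in Redis Cluster is decided by election and requires a majority. Only masters can vote, so a failover can take place only while a majority of the masters are alive. The election itself is started by a slave.

Redis uses a concept similar to the term in the Raft algorithm; in Redis it is called an epoch. An epoch is an unsigned 64-bit integer, and a node's epoch starts at 0.

If a node receives an epoch greater than its own, it updates its own epoch to the received value (the network is assumed to be trusted, so there is no Byzantine generals problem).

Every master broadcasts its epoch and the bitmap of the slots it serves in its ping and pong messages. When a slave starts an election, it creates a new epoch by incrementing the current one. The epoch value is persisted in the file nodes.conf, for example (the latest epoch is 27, and the most recent vote was cast in epoch 27):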

vars currentEpoch 27 lastVoteEpoch 27
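The epoch rules above can be sketched in C as follows. This is a minimal illustration, not Redis code: the epoch_state struct and the functions observe_epoch and parse_vars_line are hypothetical names, and the parser only handles a vars line in the exact form shown above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified view of the epoch state a node keeps. */
typedef struct {
    uint64_t current_epoch;   /* highest epoch this node has seen */
    uint64_t last_vote_epoch; /* epoch in which this node last voted */
} epoch_state;

/* Adopt a received epoch only when it is greater than our own. */
void observe_epoch(epoch_state *st, uint64_t received) {
    if (received > st->current_epoch) st->current_epoch = received;
}

/* Parse a "vars currentEpoch N lastVoteEpoch M" line.
 * Returns 1 on success, 0 if the line does not match. */
int parse_vars_line(const char *line, epoch_state *st) {
    unsigned long long cur, last;
    if (sscanf(line, "vars currentEpoch %llu lastVoteEpoch %llu",
               &cur, &last) != 2) {
        return 0;
    }
    st->current_epoch = cur;
    st->last_vote_epoch = last;
    return 1;
}
```

Parsing the example line yields 27 for both fields. A later call to observe_epoch with 26 leaves the state untouched, while a call with 30 raises current_epoch to 30.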

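2. The slave starts an election

A slave starts an election only when its master is in the fail state. It does not do so the instant the master is marked fail, though. It first waits for the delay below, which contains a random part so that several slaves do not start elections at the same time (the delay is always at least 0.5 seconds):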

500 milliseconds + random delay between 0 and 500 milliseconds + SLAVE_RANK * 1000 milliseconds
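The delay formula can be written as a small C function. This is only a sketch: election_delay_ms and random_election_delay_ms are hypothetical names, and rand() stands in for whatever random source the server uses.

```c
#include <stdlib.h>

/* Delay before a slave starts its election, in milliseconds.
 * random_part_ms is expected to be in [0, 500); rank is the SLAVE_RANK. */
long long election_delay_ms(int rank, int random_part_ms) {
    return 500 + random_part_ms + (long long)rank * 1000;
}

/* Same computation with the random part drawn from rand(). */
long long random_election_delay_ms(int rank) {
    return election_delay_ms(rank, rand() % 500);
}
```

For a slave with rank 0 and a random part of 79 ms the result is 579 ms. A slave with rank 1 always waits between 1500 and 1999 ms, so it starts later than any rank 0 slave.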

 

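A slave starts an election only if all of the following hold:

1) its master is in the fail state (not merely pfail);

2) its master was serving at least one slot;

3) the replication link between the slave and its master has been down for no longer than a configurable limit. The limit makes sure that the slave's data is reasonably complete, so in operations a slave must not be left unavailable for long: monitoring should detect a broken slave and bring it back promptly.

Log of a slave that could not fail over automatically because it had been disconnected from its master for too long: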

12961:S 06 Jan 2019 19:00:21.969 # Currently unable to failover: Disconnected from master for longer than allowed. Please check the 'cluster-replica-validity-factor' configuration option.

 

The relevant source code:

/* This function is called if we are a slave node and our master serving
 * a non-zero amount of hash slots is in FAIL state.
 *
 * The goal of this function is:
 * 1) To check if we are able to perform a failover, is our data updated?
 * 2) Try to get elected by masters.
 * 3) Perform the failover informing all the other nodes.
 */
void clusterHandleSlaveFailover(void) {
    mstime_t data_age; // time disconnected from the master, in milliseconds
    mstime_t auth_age = mstime() - server.cluster->failover_auth_time;
    int needed_quorum = (server.cluster->size / 2) + 1;
    int manual_failover = server.cluster->mf_end != 0 &&
                          server.cluster->mf_can_start;
    mstime_t auth_timeout, auth_retry_time;

    auth_timeout = server.cluster_node_timeout*2;
    if (auth_timeout < 2000) auth_timeout = 2000;
    auth_retry_time = auth_timeout*2;

    /* ... */

    /* Set data_age to the number of seconds we are disconnected from
     * the master. */
    if (server.repl_state == REPL_STATE_CONNECTED) {
        data_age = (mstime_t)(server.unixtime - server.master->lastinteraction) * 1000;
    } else {
        data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
    }

    /* Remove the node timeout from the data age as it is fine that we are
     * disconnected from our master at least for the time it was down to be
     * flagged as FAIL, that's the baseline. */
    if (data_age > server.cluster_node_timeout)
        data_age -= server.cluster_node_timeout;

    /* Check if our data is recent enough according to the slave validity
     * factor configured by the user.
     *
     * Check bypassed for manual failovers. */
    if (server.cluster_slave_validity_factor &&
        data_age >
        (((mstime_t)server.repl_ping_slave_period * 1000) +
         (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
    {
        // The slave has been disconnected for too long to fail over automatically
        if (!manual_failover) { // manual failovers are exempt
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
            return;
        }
    }

    /* ... */

    /* Ask for votes if needed. */
    // failover_auth_sent records whether the vote request has already been sent
    if (server.cluster->failover_auth_sent == 0) {
        server.cluster->currentEpoch++;
        server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
        serverLog(LL_WARNING,"Starting a failover election for epoch %llu.",
            (unsigned long long) server.cluster->currentEpoch);

        // Send FAILOVER_AUTH_REQUEST (a request for votes to become master)
        // to all nodes, slaves included; note that only masters reply to it
        clusterRequestFailoverAuth();
        server.cluster->failover_auth_sent = 1;
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                             CLUSTER_TODO_UPDATE_STATE|
                             CLUSTER_TODO_FSYNC_CONFIG);
        return; /* Wait for replies. */
    }

    /* Check if we reached the quorum. */
    if (server.cluster->failover_auth_count >= needed_quorum) {
        /* We have the quorum, we can finally failover the master. */

        serverLog(LL_WARNING,
            "Failover election won: I'm the new master.");

        /* Update my configEpoch to the epoch of the election. */
        if (myself->configEpoch < server.cluster->failover_auth_epoch) {
            myself->configEpoch = server.cluster->failover_auth_epoch;
            serverLog(LL_WARNING,
                "configEpoch set to %llu after successful failover",
                (unsigned long long) myself->configEpoch);
        }

        /* Take responsibility for the cluster slots. */
        clusterFailoverReplaceYourMaster();
    } else {
        clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_VOTES);
    }
}

 

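The code above also shows that the configuration option cluster-slave-validity-factor (called cluster-replica-validity-factor in the log message above, following the Redis 5.0 renaming) determines whether a slave is allowed to switch to master.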
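The data-age check can be isolated into a short C sketch. The function names max_data_age_ms and data_too_old are hypothetical. The arithmetic follows the condition in clusterHandleSlaveFailover, with the server fields passed in as plain parameters.

```c
/* Maximum allowed data age in milliseconds:
 * ping period (seconds) * 1000 + node timeout (ms) * validity factor. */
long long max_data_age_ms(int repl_ping_period_s, long long node_timeout_ms,
                          int validity_factor) {
    return (long long)repl_ping_period_s * 1000 +
           node_timeout_ms * validity_factor;
}

/* Returns 1 if a slave that has been disconnected for disconnected_ms is
 * too stale for an automatic failover, 0 otherwise.
 * A validity factor of 0 disables the check. */
int data_too_old(long long disconnected_ms, int repl_ping_period_s,
                 long long node_timeout_ms, int validity_factor) {
    long long data_age = disconnected_ms;

    /* The time needed to flag the master as FAIL does not count. */
    if (data_age > node_timeout_ms) data_age -= node_timeout_ms;

    if (validity_factor == 0) return 0;
    return data_age > max_data_age_ms(repl_ping_period_s, node_timeout_ms,
                                      validity_factor);
}
```

With cluster-node-timeout 30000, repl-ping-slave-period 1 and a validity factor of 1, the limit is 1*1000 + 30000*1 = 31000 ms. A slave disconnected for 62000 ms has a data age of 32000 ms after the node timeout is subtracted, so it is refused. A slave disconnected for 23000 ms is accepted.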

 

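Before starting an election, the slave increments its epoch (currentEpoch) by one and then asks the other masters to vote for it. It does so by broadcasting a FAILOVER_AUTH_REQUEST packet to every master in the cluster.

After requesting votes, the slave waits for the replies for two times NODE_TIMEOUT, and for at least 2 seconds whatever the value of NODE_TIMEOUT.

A master that grants its vote replies to the slave with FAILOVER_AUTH_ACK, and for a period of NODE_TIMEOUT*2 it will not vote for any other slave of the same master.

If the slave receives a FAILOVER_AUTH_ACK whose epoch is smaller than its own epoch, it discards the reply. Once the slave has received FAILOVER_AUTH_ACK from a majority of the masters, it declares that it has won the election.

If the slave has not won the election within two times NODE_TIMEOUT (at least 2 seconds), it abandons the attempt and starts a new election after four times NODE_TIMEOUT (at least 4 seconds).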
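The timing and majority rules above reduce to a few lines of arithmetic. The sketch below uses hypothetical function names and mirrors the auth_timeout, auth_retry_time and needed_quorum computations from the source excerpt. The cluster size is the number of masters that serve at least one slot.

```c
/* Time the slave waits for votes: max(2 * NODE_TIMEOUT, 2000) ms. */
long long auth_timeout_ms(long long node_timeout_ms) {
    long long t = node_timeout_ms * 2;
    return t < 2000 ? 2000 : t;
}

/* Time after which a new election may be started: twice the timeout. */
long long auth_retry_time_ms(long long node_timeout_ms) {
    return auth_timeout_ms(node_timeout_ms) * 2;
}

/* Number of votes needed to win: a strict majority of the masters. */
int needed_quorum(int masters_with_slots) {
    return masters_with_slots / 2 + 1;
}
```

For a NODE_TIMEOUT of 10000 ms the slave waits 20000 ms for votes and may retry after 40000 ms. For a NODE_TIMEOUT of 500 ms the floor applies, giving 2000 ms and 4000 ms. In a cluster with three masters, two votes are enough.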

 

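The fixed delay of at least 0.5 seconds exists so that the master's fail state can spread through the whole cluster. Otherwise only a few masters might know about it, and masters vote only for slaves whose master is in the fail state: if a slave's master is not marked fail, the other masters will not vote for that slave. Redis propagates the fail state with a gossip protocol (also known as a rumor protocol). The random delay added on top of the fixed delay keeps several slaves from starting elections at the same time.

A slave's SLAVE_RANK is derived from how much of its master's replication stream it has processed. The slave with the most recent replication offset has rank 0, the second has rank 1, and so on, so the slave with the most complete data starts its election first. If a higher-ranked slave (one with a smaller SLAVE_RANK value) fails to be elected, the other slaves start their own elections soon afterwards (after at least 4 seconds).

After winning the election, the slave broadcasts a pong to all nodes in the cluster so that the reconfiguration completes as quickly as possible (it shows up as an update to nodes.conf). Nodes that cannot be reached at that moment will eventually be reconfigured as well.

The other nodes will notice that two masters claim the same slots. In that case the master with the greater epoch wins.

After becoming master, the slave does not start serving immediately but leaves a short time gap.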

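3. The master responds to the election

When a master receives a FAILOVER_AUTH_REQUEST vote request from a slave, it grants its vote only if the following conditions are met:

1) it votes only once per epoch;

2) it refuses every vote request that carries a smaller epoch;

3) it does not vote for an epoch smaller than lastVoteEpoch;

4) it votes only for a slave whose master is in the fail state;

5) if the currentEpoch in the slave's request is smaller than the master's currentEpoch, the master ignores the request. Without this rule the following problem could occur:

① the master's currentEpoch is 5 and its lastVoteEpoch is 1 (this can happen after failed elections: currentEpoch has increased, but because the elections failed, lastVoteEpoch has not changed);

② the slave's currentEpoch is 3;

③ the slave increments it and starts an election with epoch 4; the master replies with epoch 5, but the reply happens to be delayed;

④ the slave starts another election, this time with epoch 5 (the epoch is incremented for every election); the delayed reply arrives just then, and the slave accepts it as valid.

After voting, the master updates its local lastVoteEpoch with the epoch from the request and persists it to nodes.conf. Masters take no part in choosing the best slave. Because the best slave has the best SLAVE_RANK, it is simply able to start its election sooner than the others.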
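The voting conditions can be condensed into one decision function. This is a simplified sketch under stated assumptions, not the Redis implementation: master_state and grant_vote are hypothetical names, the failure state of the requester's master and the time since the last vote for a slave of that master are passed in as parameters, and persisting lastVoteEpoch is left out.

```c
#include <stdint.h>

/* Hypothetical, simplified epoch state of a voting master. */
typedef struct {
    uint64_t current_epoch;
    uint64_t last_vote_epoch;
} master_state;

/* Decide whether to grant a vote for a FAILOVER_AUTH_REQUEST.
 * Returns 1 and records the vote if it is granted, 0 otherwise. */
int grant_vote(master_state *m, uint64_t request_epoch,
               int slave_master_failed,
               long long ms_since_vote_for_same_master,
               long long node_timeout_ms) {
    /* A node always adopts a greater epoch that it observes. */
    if (request_epoch > m->current_epoch) m->current_epoch = request_epoch;

    /* Ignore requests that carry an epoch older than our own. */
    if (request_epoch < m->current_epoch) return 0;

    /* One vote per epoch, and never for an older epoch. */
    if (request_epoch <= m->last_vote_epoch) return 0;

    /* Vote only for slaves whose master is in the fail state. */
    if (!slave_master_failed) return 0;

    /* Do not vote twice for slaves of the same master within
     * NODE_TIMEOUT * 2. */
    if (ms_since_vote_for_same_master < node_timeout_ms * 2) return 0;

    m->last_vote_epoch = request_epoch;
    return 1;
}
```

In the scenario above, a master with currentEpoch 5 and lastVoteEpoch 1 refuses the request for epoch 4, so no delayed reply to it can exist. The later request for epoch 5 is granted once, and a second request for epoch 5 is refused.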

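4. Election example

Suppose a master has three slaves, A, B and C. When this master becomes unreachable:

1) slave A wins the election and becomes master;

2) slave A becomes unavailable because of a network partition;

3) slave B wins the election;

4) slave B becomes unavailable because of a network partition;

5) the partition heals and slave A is available again.

Now B is down and A is available again. At the same moment, slave C starts an election to replace B as master. Because C's master is no longer available, C can be elected master, and its configEpoch is incremented. A cannot become master again, because C has already become master and C's epoch is greater.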
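The outcome of this example follows from a single comparison: when two nodes claim the same slots, the claim with the greater configEpoch wins. The sketch below illustrates that rule. The slot_claim type, the function name and the epoch values in the usage note are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of a node claiming a set of slots. */
typedef struct {
    const char *name;
    uint64_t config_epoch;
} slot_claim;

/* Return the claim that should own the slots. The incoming claim wins
 * only if its configEpoch is strictly greater than the current one. */
const slot_claim *resolve_slot_owner(const slot_claim *current,
                                     const slot_claim *incoming) {
    if (current == NULL) return incoming;
    return incoming->config_epoch > current->config_epoch ? incoming
                                                          : current;
}
```

If A was elected with epoch 28 and C later with epoch 30, every node that compares the two claims settles on C, regardless of the order in which the claims arrive.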

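5. How hash slots propagate

Hash slot information propagates in two ways:

1) Heartbeat messages. Whenever a node sends a ping or pong message, it includes the hash slots it serves (or, for a slave, the slots its master serves);

2) UPDATE messages. Heartbeat packets also carry epoch information, so a receiver that finds the information in a heartbeat stale replies with newer information, which forces the sender to update its hash slot configuration.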

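6. Failover record 1

The test cluster runs on a single physical machine, and cluster-node-timeout is greater than repl-timeout.

6.1. Parameters

cluster-slave-validity-factor: 1

cluster-node-timeout: 30000 (milliseconds)

repl-ping-slave-period: 1 (second)

repl-timeout: 10 (seconds)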

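6.2. Timeline

The failover takes place within roughly 1 second of the master being marked FAIL.

Master A marked the node as fail: 20:12:55.467

Master B marked the node as fail: 20:12:55.467

Master A voted: 20:12:56.164

Master B voted: 20:12:56.164

Slave started the election: 20:12:56.160

Slave scheduled the election: 20:12:55.558 (delayed by 579 milliseconds)

Slave detected the heartbeat timeout with its master: 20:12:32.810 (the failover happened only about 24 seconds after this)

Slave received the fail message about its master from another master: 20:12:55.467

Last time the service was normal before the failover: 20:12:22/279275 (the service failed at about second …)

Service back to normal after the failover: 20:12:59/278149

Service unavailable for: about 37 seconds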

6.3. Log of another master

Master ID: c67dc9e02e25f2e6321df8ac2eb4d99789917783

30613:M 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // received from another master the message that 44eb43e50c101c5f44f48295c42dda878b6cb3e9 has failed

30613:M 04 Jan 2019 20:12:55.467 # Cluster state changed: fail

30613:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30 // voted in the election

30613:M 04 Jan 2019 20:12:56.204 # Cluster state changed: ok

30613:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

6.4. Log of another master

Master ID: bfad383775421b1090eaa7e0b2dcfb3b38455079

30614:M 04 Jan 2019 20:12:55.467 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached). // marked 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failed

30614:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30 // voted in the election

30614:M 04 Jan 2019 20:12:56.709 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

6.5. Slave log

ID of the slave's master: 44eb43e50c101c5f44f48295c42dda878b6cb3e9; the slave's own ID: 0ae8b5400d566907a3d8b425d983ac3b7cbd8412

30651:S 04 Jan 2019 20:12:32.810 # MASTER timeout: no data nor PING received... // master timeout detected; it is noticed 10 seconds after the master fails because repl-timeout is 10

30651:S 04 Jan 2019 20:12:32.810 # Connection with master lost.

30651:S 04 Jan 2019 20:12:32.810 * Caching the disconnected master state.

30651:S 04 Jan 2019 20:12:32.810 * Connecting to MASTER 1.9.16.9:4073

30651:S 04 Jan 2019 20:12:32.810 * MASTER <-> REPLICA sync started

30651:S 04 Jan 2019 20:12:32.810 * Non blocking connect for SYNC fired the event.

 

30651:S 04 Jan 2019 20:12:43.834 # Timeout connecting to the MASTER...

30651:S 04 Jan 2019 20:12:43.834 * Connecting to MASTER 1.9.16.9:4073

30651:S 04 Jan 2019 20:12:43.834 * MASTER <-> REPLICA sync started

30651:S 04 Jan 2019 20:12:43.834 * Non blocking connect for SYNC fired the event.

30651:S 04 Jan 2019 20:12:54.856 # Timeout connecting to the MASTER...

30651:S 04 Jan 2019 20:12:54.856 * Connecting to MASTER 1.9.16.9:4073

30651:S 04 Jan 2019 20:12:54.856 * MASTER <-> REPLICA sync started

30651:S 04 Jan 2019 20:12:54.856 * Non blocking connect for SYNC fired the event.

30651:S 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // received the FAIL message about its own master from another master

30651:S 04 Jan 2019 20:12:55.467 # Cluster state changed: fail

30651:S 04 Jan 2019 20:12:55.558 # Start of election delayed for 579 milliseconds (rank #0, offset 227360). // election scheduled with a delay of 579 milliseconds: 500 milliseconds of fixed delay plus 79 milliseconds of random delay; the rank is 0, so the rank delay is 0 milliseconds

30651:S 04 Jan 2019 20:12:56.160 # Starting a failover election for epoch 30. // election started

30651:S 04 Jan 2019 20:12:56.180 # Failover election won: I'm the new master. // election won

30651:S 04 Jan 2019 20:12:56.180 # configEpoch set to 30 after successful failover

30651:M 04 Jan 2019 20:12:56.180 # Setting secondary replication ID to 154a9c2319403d610808477dcda3d4bede0f374c, valid up to offset: 227361. New replication ID is 927fb64a420236ee46d39389611ab2d8f6530b6a

30651:M 04 Jan 2019 20:12:56.181 * Discarding previously cached master state.

30651:M 04 Jan 2019 20:12:56.181 # Cluster state changed: ok

30651:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9 // ignoring the message from 1.9.16.9:4077, which is not a cluster member

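7. Failover record 2

The test cluster runs on a single physical machine, and cluster-node-timeout is smaller than repl-timeout.

7.1. Parameters

cluster-slave-validity-factor: 1

cluster-node-timeout: 10000 (milliseconds)

repl-ping-slave-period: 1 (second)

repl-timeout: 30 (seconds)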

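7.2. Timeline

The failover takes place within roughly 1 second of the master being marked FAIL.

Master A marked the node as fail: 20:37:10.398

Master B marked the node as fail: 20:37:10.398

Master A voted: 20:37:11.084

Master B voted: 20:37:11.085

Slave started the election: 20:37:11.077

Slave scheduled the election: 20:37:10.475 (delayed by 539 milliseconds)

Slave detected the heartbeat timeout with its master: did not happen, because the slave had already become master before the timeout

Slave received the fail message about its master from another master: 20:37:10.398

Last time the service was normal before the failover: 20:36:55/266889 (the service failed at about second 56)

Service back to normal after the failover: 20:37:12/265802

Service unavailable for: about 17 seconds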

7.3. Log of another master

Master ID: c67dc9e02e25f2e6321df8ac2eb4d99789917783

30613:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).

30613:M 04 Jan 2019 20:37:10.398 # Cluster state changed: fail

30613:M 04 Jan 2019 20:37:11.084 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32

30613:M 04 Jan 2019 20:37:11.124 # Cluster state changed: ok

30613:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

7.4. Log of another master

Master ID: bfad383775421b1090eaa7e0b2dcfb3b38455079

30614:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).

30614:M 04 Jan 2019 20:37:11.085 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32

30614:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

7.5. Slave log

ID of the slave's master: 44eb43e50c101c5f44f48295c42dda878b6cb3e9; the slave's own ID: 0ae8b5400d566907a3d8b425d983ac3b7cbd8412

30651:S 04 Jan 2019 20:37:10.398 * FAIL message received from c67dc9e02e25f2e6321df8ac2eb4d99789917783 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

30651:S 04 Jan 2019 20:37:10.398 # Cluster state changed: fail

30651:S 04 Jan 2019 20:37:10.475 # Start of election delayed for 539 milliseconds (rank #0, offset 228620).

30651:S 04 Jan 2019 20:37:11.077 # Starting a failover election for epoch 32.

30651:S 04 Jan 2019 20:37:11.100 # Failover election won: I'm the new master.

30651:S 04 Jan 2019 20:37:11.100 # configEpoch set to 32 after successful failover

30651:M 04 Jan 2019 20:37:11.100 # Setting secondary replication ID to 0cf19d01597610c7933b7ed67c999a631655eafc, valid up to offset: 228621. New replication ID is 53daa7fa265d982aebd3c18c07ed5f178fc3f70b

30651:M 04 Jan 2019 20:37:11.101 # Connection with master lost.

30651:M 04 Jan 2019 20:37:11.101 * Caching the disconnected master state.

30651:M 04 Jan 2019 20:37:11.101 * Discarding previously cached master state.

30651:M 04 Jan 2019 20:37:11.101 # Cluster state changed: ok

30651:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9

8. Code that delays the slave's election

// Excerpt from Redis-5.0.3
// cluster.c
/* This function is called if we are a slave node and our master serving
 * a non-zero amount of hash slots is in FAIL state.
 *
 * The goal of this function is:
 * 1) To check if we are able to perform a failover, is our data updated?
 * 2) Try to get elected by masters.
 * 3) Perform the failover informing all the other nodes.
 */
void clusterHandleSlaveFailover(void) {
    /* ... */

    /* Check if our data is recent enough according to the slave validity
     * factor configured by the user.
     *
     * Check bypassed for manual failovers. */
    if (server.cluster_slave_validity_factor &&
        data_age >
        (((mstime_t)server.repl_ping_slave_period * 1000) +
         (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
    {
        if (!manual_failover) {
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
            return;
        }
    }

    /* If the previous failover attempt timedout and the retry time has
     * elapsed, we can setup a new one. */
    if (auth_age > auth_retry_time) {
        server.cluster->failover_auth_time = mstime() +
            500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
            random() % 500; /* Random delay between 0 and 500 milliseconds. */
        server.cluster->failover_auth_count = 0;
        server.cluster->failover_auth_sent = 0;
        server.cluster->failover_auth_rank = clusterGetSlaveRank();
        /* We add another delay that is proportional to the slave rank.
         * Specifically 1 second * rank. This way slaves that have a probably
         * less updated replication offset, are penalized. */
        server.cluster->failover_auth_time +=
            server.cluster->failover_auth_rank * 1000;
        /* However if this is a manual failover, no delay is needed. */
        if (server.cluster->mf_end) {
            server.cluster->failover_auth_time = mstime();
            server.cluster->failover_auth_rank = 0;
        }
        serverLog(LL_WARNING,
            "Start of election delayed for %lld milliseconds "
            "(rank #%d, offset %lld).",
            server.cluster->failover_auth_time - mstime(),
            server.cluster->failover_auth_rank,
            replicationGetSlaveOffset());
        /* Now that we have a scheduled election, broadcast our offset
         * to all the other slaves so that they'll update their offsets
         * if our offset is better. */
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
        return;
    }

    /* ... */
}

 

