[ MongoDB ] - Replica Set Election Strategy [repost]


First, the nodes in a replica set fall into three types:
1. primary: handles reads and writes from clients.
2. secondary: a hot standby; it applies the operation log (oplog) read from the primary to stay consistent with it, and by default serves no reads or writes.
  There are two kinds of secondary:
  1) normal secondary: stays in sync with the primary at all times;
  2) delayed secondary: syncs with the primary after a configured delay, as protection against accidental operations.
3. arbiter: serves no reads or writes; it acts purely as a tie-breaker, voting on a new primary from the remaining nodes when the primary goes down.
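For reference, here is a minimal configuration sketch covering all three roles in one set (the host names, ports, and the one-hour delay are illustrative, not from the original setup); in shells of this era a delayed secondary is declared with priority: 0 plus slaveDelay:

rs.initiate({
    _id : "myset",
    members : [
        {_id : 0, host : "node1:27017"},                                   // normal secondary, eligible to become primary
        {_id : 1, host : "node2:27017", priority : 0, slaveDelay : 3600},  // delayed secondary: stays 1 hour behind, never elected
        {_id : 2, host : "node3:27017", arbiterOnly : true}                // arbiter: votes only, holds no data
    ]
})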
    If the primary goes down, a failover is needed. The cluster's election strategy is as follows:
When the primary goes down, the remaining nodes elect a new primary, and arbiters also vote, which avoids deadlock (without an arbiter, in a two-node replica set the primary steps down to secondary when its peer fails, leaving the whole set unusable). The node chosen is the one with the highest priority and the freshest data.
    The primary uses heartbeats to track how many nodes in the cluster are visible to it. If it cannot see a strict majority (more than half) of the set, the active primary automatically steps down to secondary. This prevents the deadlock described above, and also covers the case where a network partition has cut the primary off from the rest of the cluster.
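That visibility check can be approximated from the shell. A quick sketch that counts how many members the current node sees as healthy:

// count members this node currently sees as healthy (health: 1)
var stat = rs.status();
var visible = stat.members.filter(function (m) { return m.health === 1; }).length;
// the primary steps down if visible is not a strict majority of stat.members.length
print("visible: " + visible + " / " + stat.members.length);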
An example from the official documentation.
Initial state:
server-a: secondary oplog: ()
server-b: secondary oplog: ()
server-c: secondary oplog: ()
The primary accepts writes:
server-a: primary oplog: (a1,a2,a3,a4,a5)
server-b: secondary oplog: ()
server-c: secondary oplog: ()
The secondaries apply the data:
server-a: primary oplog: (a1,a2,a3,a4,a5)
server-b: secondary oplog: (a1)
server-c: secondary oplog: (a1,a2,a3)
The primary, server-a, goes down:
server-b: secondary oplog: (a1)
server-c: secondary oplog: (a1,a2,a3)
...
server-b: secondary oplog: (a1)
server-c: primary oplog: (a1,a2,a3) // c has the freshest data, so it is elected primary
...
server-b: secondary oplog: (a1,a2,a3)
server-c: primary oplog: (a1,a2,a3,c4)
...
server-a recovers (comes back up):
...
server-a: recovering oplog: (a1,a2,a3,a4,a5) -- performing data recovery
server-b: secondary oplog: (a1,a2,a3)
server-c: primary oplog: (a1,a2,a3,c4)
…server-a applies the data from server-c; at this point operations a4 and a5 are lost
server-a: recovering oplog: (a1,a2,a3,c4)
server-b: secondary oplog: (a1,a2,a3,c4)
server-c: primary oplog: (a1,a2,a3,c4)
The new primary, server-c, accepts writes:
server-a: secondary oplog: (a1,a2,a3,c4)
server-b: secondary oplog: (a1,a2,a3,c4)
server-c: primary oplog: (a1,a2,a3,c4,c5,c6,c7,c8)
server-a: secondary oplog: (a1,a2,a3,c4,c5,c6,c7,c8)
server-b: secondary oplog: (a1,a2,a3,c4,c5,c6,c7,c8)
server-c: primary oplog: (a1,a2,a3,c4,c5,c6,c7,c8)
The process above shows that server-c becomes the primary and the other nodes apply its log; operations a4 and a5 are lost.
    Once a new primary is elected, its data set is assumed to be the most current in the cluster; conflicting operations on the other nodes (including the former primary, even after it has recovered) are rolled back. To complete the rollback, all nodes resynchronize after connecting to the new primary. The process is:
each node scans its own oplog for operations the new primary never applied, then requests the current copy of every document affected by those operations from the primary, and syncs that data.
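A client can avoid losing writes the way a4 and a5 were lost by waiting for a majority acknowledgement before treating a write as durable. A sketch using the getLastError command available in this shell version (the collection name foo is illustrative):

db.foo.insert({x : 1});
// block until the write has replicated to a majority of the set, or give up after 5s;
// a write acknowledged by a majority will survive a failover without being rolled back
printjson(db.runCommand({getLastError : 1, w : "majority", wtimeout : 5000}));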
 
The election strategy for a replica set, quoting the official documentation:
We use a consensus protocol to pick a primary. Exact details will be spared here but that basic process is:
1. get maxLocalOpOrdinal from each server.
2. if a majority of servers are not up (from this server's POV), remain in Secondary mode and stop.
3. if the last op time seems very old, stop and await human intervention.
4. else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.
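Rendered as shell-JavaScript pseudocode, the four steps look roughly like this (electPrimary, canSee, and tooOld are hypothetical stand-ins for internal server logic, not a real API):

// illustrative pseudocode of the election steps above -- not a real API
function electPrimary(servers, self) {
    // step 1: consider only the servers this node can reach
    var candidates = servers.filter(function (s) { return canSee(self, s); });
    // step 2: without a strict majority visible, remain secondary
    if (candidates.length * 2 <= servers.length) return null;
    // sort by maxLocalOpOrdinal, freshest first
    candidates.sort(function (a, b) { return b.maxLocalOpOrdinal - a.maxLocalOpOrdinal; });
    // step 3: if even the freshest op time is very old, await human intervention
    if (tooOld(candidates[0].maxLocalOpOrdinal)) throw "await human intervention";
    // step 4: the reachable server with the highest maxLocalOpOrdinal wins
    return candidates[0];
}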
 
Regarding step 2: when a majority of the servers in the cluster are down, the remaining nodes stay in secondary mode and stop serving.
My experiment confirms this: in a 4-node replica set, when two secondary nodes went down, the primary stepped down to secondary. The whole cluster was effectively dead, since secondaries serve neither reads nor writes.
Shut down two secondary nodes in the cluster, rac4:27019 and rac3:27017:
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:27019
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27019/test
SECONDARY> 
SECONDARY> use admin
switched to db admin
SECONDARY> db.shutdownServer();
Wed Nov  2 11:02:29 DBClientCursor::init call() failed
Wed Nov  2 11:02:29 query failed : admin.$cmd { shutdown: 1.0 } to: 127.0.0.1:27019
server should be down...
 
[mongodb@rac3 bin]$ ./mongo  10.250.7.241:27017
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27017/test
SECONDARY> 
SECONDARY> use admin
switched to db admin
SECONDARY> db.shutdownServer();
Tue Nov  1 22:02:46 DBClientCursor::init call() failed
Tue Nov  1 22:02:46 query failed : admin.$cmd { shutdown: 1.0 } to: 127.0.0.1:27017
server should be down...
Tue Nov  1 22:02:46 trying reconnect to 127.0.0.1:27017
Tue Nov  1 22:02:46 reconnect 127.0.0.1:27017 failed couldn't connect to server 127.0.0.1:27017
Tue Nov  1 22:02:46 Error: error doing query: unknown shell/collection.js:150
After exiting the shell that was connected to the primary and reconnecting, the prompt has changed from PRIMARY to SECONDARY. Check the replica set's status:
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:27020       
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27020/test
SECONDARY> 
SECONDARY> rs.status();
{
        "set" : "myset",
        "date" : ISODate("2011-11-01T13:56:05Z"),
        "myState" : 2,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "10.250.7.220:27018",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 101,
                        "optime" : {
                                "t" : 1320154033000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2011-11-01T13:27:13Z"),
                        "lastHeartbeat" : ISODate("2011-11-01T13:56:04Z"),
                        "pingMs" : 0
                },
                {
                        "_id" : 1,
                        "name" : "10.250.7.220:27019",
                       "health" : 0,  --已經關閉
                        "state" : 8,
                        "stateStr" : "(not reachable/healthy)",
                        "uptime" : 0,
                        "optime" : {
                                "t" : 1320154033000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2011-11-01T13:27:13Z"),
                        "lastHeartbeat" : ISODate("2011-11-01T13:53:50Z"),
                        "pingMs" : 0,
                        "errmsg" : "socket exception"
                },
                {
                        "_id" : 2,
                        "name" : "10.250.7.220:27020",
                        "health" : 1,
                        "state" : 2,
                       "stateStr" : "SECONDARY", ---由主庫變為從庫
                        "optime" : {
                                "t" : 1320154033000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2011-11-01T13:27:13Z"),
                        "self" : true
                },
                {
                        "_id" : 3,
                        "name" : "10.250.7.241:27017",
                        "health" : 0,
                        "state" : 8,
                        "stateStr" : "(not reachable/healthy)",
                        "uptime" : 0,
                        "optime" : {
                                "t" : 1320154033000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2011-11-01T13:27:13Z"),
                        "lastHeartbeat" : ISODate("2011-11-01T13:53:54Z"),
                        "pingMs" : 0,
                        "errmsg" : "socket exception"
                }
        ],
        "ok" : 1
}
SECONDARY> exit
bye
 
Continuing from the previous article, more on the replica set election mechanism.
Create a two-node replica set, one primary and one secondary. If the secondary goes down, the primary turns into a secondary, and the cluster is left with no primary at all! Why does this happen?
[mongodb@rac4 bin]$ mongo 127.0.0.1:27018 init1node.js 
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27018/test
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:27019
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27019/test
RECOVERING> 
SECONDARY> 
SECONDARY> use admin
switched to db admin
SECONDARY> db.shutdownServer() 
Sun Nov  6 20:16:11 DBClientCursor::init call() failed
Sun Nov  6 20:16:11 query failed : admin.$cmd { shutdown: 1.0 } to: 127.0.0.1:27019
server should be down...
Sun Nov  6 20:16:11 trying reconnect to 127.0.0.1:27019
Sun Nov  6 20:16:11 reconnect 127.0.0.1:27019 failed couldn't connect to server 127.0.0.1:27019
Sun Nov  6 20:16:11 Error: error doing query: unknown shell/collection.js:150
After the secondary goes down, the primary changes from PRIMARY to SECONDARY:
[mongodb@rac4 bin]$ mongo 127.0.0.1:27018 
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27018/test
PRIMARY> 
PRIMARY> 
PRIMARY> 
SECONDARY> 
The log shows what happens on the primary after the secondary goes down:
Sun Nov  6 20:16:13 [rsHealthPoll] replSet info 10.250.7.220:27019 is down (or slow to respond): DBClientBase::findN: transport error: 10.250.7.220:27019 query: { replSetHeartbeat: "myset", v: 1, pv: 1, checkEmpty: false, from: "10.250.7.220:27018" }
Sun Nov  6 20:16:13 [rsHealthPoll] replSet member 10.250.7.220:27019 is now in state DOWN
Sun Nov  6 20:16:13 [conn7] end connection 10.250.7.220:13217
Sun Nov  6 20:16:37 [rsMgr] can't see a majority of the set, relinquishing primary
Sun Nov  6 20:16:37 [rsMgr] replSet relinquishing primary state
Sun Nov  6 20:16:37 [rsMgr] replSet SECONDARY
This is a consequence of MongoDB's primary-election strategy. If the failure were not the secondary crashing but a network partition, each of the two nodes would elect itself primary, since the only node each can reach is itself. Once the network recovered, complex consistency conflicts would have to be resolved, and the longer the partition lasts, the more complex they get. So the strategy MongoDB chose is: a node that can only see itself in the cluster does not elect itself primary.
The correct approach is therefore to add at least two more nodes, or to add an arbiter. Adding an arbiter is the best and most convenient option: an arbiter only takes part in elections and carries almost no load, so it can run on any idle machine. This not only avoids the unelectable-primary situation above, it also makes elections faster: with three data nodes, when one goes down the other two may each vote for themselves, and it can take a long time for the election to settle. In practice, the cluster elects its primary on two criteria: priority and data freshness.
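Priority can be changed at runtime with rs.reconfig(); a minimal sketch (which member index to bump is illustrative):

// from the primary's shell: prefer member 1 in future elections
var cfg = rs.conf();
cfg.members[1].priority = 2;   // the default priority is 1
rs.reconfig(cfg);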
From the official documentation:
Example: if B and C are candidates in an election, B having a higher priority but C being the most up to date:
1. C will be elected primary.
2. Once B catches up, a re-election should be triggered and B (the higher priority node) should win the election between B and C.
3. Alternatively, suppose that, once B is within 12 seconds of synced to C, C goes down.
B will be elected primary.
When C comes back up, those 12 seconds of unsynced writes will be written to a file in the rollback directory of your data directory (rollback is created when needed).
You can manually apply the rolled-back data, see Replica Sets - Rollbacks.
Rebuild the replica set, this time adding an arbiter:
[mongodb@rac4 bin]$ cat init2node.js 
rs.initiate({
    _id : "myset",
    members : [
        {_id : 0, host : "10.250.7.220:28018"},
        {_id : 1, host : "10.250.7.220:28019"},
        {_id : 2, host : "10.250.7.220:28020", arbiterOnly: true}
    ]
})
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:28018 init2node.js 
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:28018 
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:28018/test
PRIMARY> rs.status()
{
        "set" : "myset",
        "date" : ISODate("2011-11-06T14:16:13Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "10.250.7.220:28018",
                        "health" : 1,
                        "state" : 1,
...
                },
                {
                        "_id" : 1,
                        "name" : "10.250.7.220:28019",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
....
                },
                {
                        "_id" : 2,
                        "name" : "10.250.7.220:28020",
                        "health" : 1,
                        "state" : 7,
                        "stateStr" : "ARBITER",
....
                }
        ],
        "ok" : 1
}
PRIMARY> 
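As a side note, an arbiter can also be added to an already-running set without re-initiating, using rs.addArb() from the primary's shell:

// host:port taken from the config above
rs.addArb("10.250.7.220:28020")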
Test again, this time checking whether the primary gets demoted to secondary.
For a multi-node set like the one in the previous article, e.g. four data nodes (one primary plus secondaries) and one arbiter, taking two nodes down no longer renders the whole cluster unavailable the way losing 1/2 of the machines did above; but when 3 of the 4 data nodes are down, the whole cluster becomes unavailable!
The "majority of" in the log message does not name a concrete number; in my experiments the cluster became unavailable once more than 1/2 of the members were unreachable (see the sketch after the log line below):
Sun Nov  6 19:34:16 [rsMgr] can't see a majority of the set, relinquishing primary 
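A quick sketch of that threshold (strictly more than half of all voting members):

// majority threshold for a set with n voting members
function majority(n) { return Math.floor(n / 2) + 1; }
print(majority(5));   // 3 -- a 5-member set (4 data nodes + 1 arbiter) survives 2 failures
print(majority(4));   // 3 -- a 4-member set also needs 3 visible, so it tolerates only 1 failure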
 
References:
http://www.mongodb.org/display/DOCS/Replica+Sets+-+Priority
http://blog.nosqlfan.com/html/2523.html
 
 