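1. Background
MGR (MySQL Group Replication) is a good thing, because it fundamentally solves the data inconsistency problem. Beyond that, it comes from a reputable source (Oracle's MySQL team), so we can reasonably expect good quality and ongoing maintenance.
During my evaluation, however, I found a serious bug (https://bugs.mysql.com/bug.php?id=92690): under network latency, packet loss, and data corruption, the data on different nodes becomes severely inconsistent. Such network conditions are fairly likely in cross-region deployments, so this problem has to be solved. I have been waiting for a fix from the official team (the bug was reported on November 5, 2018, three months before this article was written), but no bug fix has been released.
2. Community analysis of the bug
a) Same GTID, different content
According to the bug submitter's analysis, transactions with the same GTID do not have the same content; they even operate on different tables.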
gr01 > show binlog events in 'binlog.000223' from 11590 limit 11;
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| Log_name      | Pos   | Event_type  | Server_id | End_log_pos | Info                                                                   |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| binlog.000223 | 11590 | Gtid        |        10 |       11651 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:295533' |
| binlog.000223 | 11651 | Query       |        10 |       11723 | BEGIN                                                                  |
| binlog.000223 | 11723 | Table_map   |        10 |       11775 | table_id: 125 (db1.sbtest6)                                            |
| binlog.000223 | 11775 | Update_rows |        10 |       12185 | table_id: 125 flags: STMT_END_F                                        |
| binlog.000223 | 12185 | Table_map   |        10 |       12237 | table_id: 116 (db1.sbtest5)                                            |
| binlog.000223 | 12237 | Update_rows |        10 |       12647 | table_id: 116 flags: STMT_END_F                                        |
| binlog.000223 | 12647 | Table_map   |        10 |       12699 | table_id: 118 (db1.sbtest1)                                            |
| binlog.000223 | 12699 | Delete_rows |        10 |       12919 | table_id: 118 flags: STMT_END_F                                        |
| binlog.000223 | 12919 | Table_map   |        10 |       12971 | table_id: 118 (db1.sbtest1)                                            |
| binlog.000223 | 12971 | Write_rows  |        10 |       13191 | table_id: 118 flags: STMT_END_F                                        |
| binlog.000223 | 13191 | Xid         |        10 |       13218 | COMMIT /* xid=6231928 */                                               |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
11 rows in set (0.00 sec)

gr02 > show binlog events in 'binlog.000221' from 9912 limit 11;
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| Log_name      | Pos   | Event_type  | Server_id | End_log_pos | Info                                                                   |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| binlog.000221 |  9912 | Gtid        |        10 |        9973 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:295533' |
| binlog.000221 |  9973 | Query       |        10 |       10037 | BEGIN                                                                  |
| binlog.000221 | 10037 | Table_map   |        10 |       10089 | table_id: 108 (db1.sbtest5)                                            |
| binlog.000221 | 10089 | Update_rows |        10 |       10499 | table_id: 108 flags: STMT_END_F                                        |
| binlog.000221 | 10499 | Table_map   |        10 |       10551 | table_id: 110 (db1.sbtest4)                                            |
| binlog.000221 | 10551 | Update_rows |        10 |       10961 | table_id: 110 flags: STMT_END_F                                        |
| binlog.000221 | 10961 | Table_map   |        10 |       11014 | table_id: 109 (db1.sbtest10)                                           |
| binlog.000221 | 11014 | Delete_rows |        10 |       11234 | table_id: 109 flags: STMT_END_F                                        |
| binlog.000221 | 11234 | Table_map   |        10 |       11287 | table_id: 109 (db1.sbtest10)                                           |
| binlog.000221 | 11287 | Write_rows  |        10 |       11507 | table_id: 109 flags: STMT_END_F                                        |
| binlog.000221 | 11507 | Xid         |        10 |       11534 | COMMIT /* xid=1185380 */                                               |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
11 rows in set (0.00 sec)
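b) Same Paxos message number, different message type
Someone in the community added logging to the MGR source code and found that a Paxos message with a given number carries a real payload on the Primary node (application data, such as binlog information), while on a Secondary node the message with the same number is empty (a noop, which is not delivered to the application). This is how the data became inconsistent.
c) A problem in the prepare phase
Note that prepare here has nothing to do with the database concept; it is the prepare of Paxos. For the related concepts, see this blog post (http://mysqlhighavailability.com/the-king-is-dead-long-live-the-king-our-homegrown-paxos-based-consensus/). In the figure below, Election and prepare mean the same thing.
In brief, whenever a node sends a message, it:
1) first sends a prepare message to determine the value it should propose;
2) based on the result of the previous step, sends an accept message to every participating node;
3) if it receives replies from a majority, sends a learn message; once another node (say node 1) receives the learn message, the value of the message is confirmed (on node 1).
As this shows, sending one message goes through three phases. This not only generates considerable network traffic; worse, it adds significant latency between the moment a message is proposed and the moment it is confirmed.
Many Paxos variants therefore try to optimize this process, for example by skipping the first (prepare) phase while the system is running stably. When the system is unstable, for example when a node finds that it is missing the message with some number, the full three phases are used.
This bug shows up precisely when the network is in bad shape, so it is very likely that the prepare phase is where things go wrong.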
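The phase selection and the majority rule described above can be sketched in a few lines of C. This is only an illustrative model: the names paxos_phase, is_majority and first_phase are hypothetical and are not taken from the MGR source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not part of the MGR source. */
typedef enum { PHASE_PREPARE, PHASE_ACCEPT, PHASE_LEARN } paxos_phase;

/* A value is decided once strictly more than half of the nodes acked it. */
static bool is_majority(unsigned acks, unsigned n_nodes) {
  assert(n_nodes > 0 && acks <= n_nodes);
  return acks > n_nodes / 2;
}

/* Phase a proposer starts with: the stable fast path skips prepare,
   while recovering a missing message runs all three phases. */
static paxos_phase first_phase(bool recovering_missing_msg) {
  return recovering_missing_msg ? PHASE_PREPARE : PHASE_ACCEPT;
}
```

In a three-node group, is_majority(2, 3) holds, which is why a proposal can succeed even when one node's reply is lost.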
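3. Supplementary concepts
a) Node number
Every node has a number. For example, in an MGR cluster with three nodes, the numbers are 0, 1 and 2. The size of the number has nothing to do with whether the node is the Primary.
b) Message number
A message number has two parts: the first is a 64-bit unsigned integer that generally increases, and the second is a node number.
For example, (10064,0) is a message number.
c) Ballot number
A ballot number also has two parts: the first is a 32-bit signed integer, and the second is again a node number.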
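As a rough picture of these two identifiers, the following C sketch declares one struct for each. The type and field names (node_id, msg_no, ballot_no) are assumptions made for illustration; the real MGR types carry additional fields.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t node_id; /* hypothetical alias for a node number */

/* Message number: a generally increasing 64-bit unsigned counter plus a node number. */
typedef struct {
  uint64_t msgno;
  node_id node;
} msg_no;

/* Ballot number: a 32-bit signed counter plus a node number. */
typedef struct {
  int32_t cnt;
  node_id node;
} ballot_no;

/* Compile-time checks that the counters have the widths described in the text. */
static_assert(sizeof(((msg_no *)0)->msgno) == 8, "message counter is 64 bits");
static_assert(sizeof(((ballot_no *)0)->cnt) == 4, "ballot counter is 32 bits");
```

Because the ballot counter is signed, a value such as -1 is representable, which matters for the fix discussed at the end of this article.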
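4. My analysis
a) Method of analysis
The methods I was familiar with and good at, especially for debugging, fall short in a distributed system. When I was developing an intelligent SQL optimizer, for example, gdb helped me understand many details of the optimizer. With a distributed system, however, attaching gdb can change the behavior of the system itself. I therefore debug mainly with logs and use gdb only as a supplement.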
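A log-first approach amounts to adding small trace statements at the points of interest. The helper below is a hypothetical sketch (format_dispatch_trace is not an MGR function) that produces lines in the same shape as the dispatch_op traces shown in this article.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats one trace line into buf and returns the
   number of characters that the full line needs (as snprintf does). */
static int format_dispatch_trace(char *buf, size_t len, uint64_t msgno,
                                 uint32_t node, const char *op,
                                 uint32_t from, uint32_t to) {
  assert(buf != NULL && op != NULL);
  return snprintf(buf, len,
                  "in dispatch_op msg_no %" PRIu64 ", %" PRIu32
                  ", paxos op %s from : %" PRIu32 " to : %" PRIu32,
                  msgno, node, op, from, to);
}
```

Writing the line to a buffer first keeps the formatting separate from the output channel, so the same helper can feed stdout or a log file.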
b) Paxos message analysis
1) Correct result, normal network
Node number: 0, Primary node
in push_msg_2p msg_no 1683, 0
in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 2 to : 0
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 1 to : 0
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
Node number: 2, Secondary node
in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 2
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 2
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
Node number: 1, Secondary node
in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 1
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 1
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
Analysis:
A) Node 0 sends the message numbered (1683,0)
in push_msg_2p msg_no 1683, 0
B) Nodes 0, 1 and 2 all receive accept_op
in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 2
in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 1
C) Nodes 0, 1 and 2 all reply with ack_accept_op
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 2 to : 0
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 1 to : 0
Note that node 0 starts sending tiny_learn_op as soon as it has received two ack_accept_op replies (a majority), and node 0 itself receives that message immediately.
D) Nodes 0, 1 and 2 receive tiny_learn_op, which means the message is confirmed on every node.
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 1
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 2
E) Nodes 0, 1 and 2 deliver the received message to the upper-layer application
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
2) Correct result, abnormal network
Node number: 0, Primary node
in push_msg_2p msg_no 44043, 0
in dispatch_op msg_no 44043, 0, paxos op accept_op from : 0 to : 0
in dispatch_op msg_no 44043, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 44043, 0, paxos op ack_accept_op from : 2 to : 0
in dispatch_op msg_no 44043, 0, paxos op tiny_learn_op from : 0 to : 0
msg_no 44043, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
Node number: 2, Secondary node
in dispatch_op msg_no 44043, 0, paxos op accept_op from : 0 to : 2
in dispatch_op msg_no 44043, 0, paxos op tiny_learn_op from : 0 to : 2
msg_no 44043, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
in dispatch_op msg_no 44043, 0, paxos op read_op from : 1 to : 2
Node number: 1, Secondary node
in read_missing_values msg_no 44043, 0
in dispatch_op msg_no 44043, 0, paxos op learn_op from : 2 to : 1
msg_no 44043, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
Analysis:
Compared with case 1), there are several findings:
A) Node 0 receives only two ack_accept_op replies (from nodes 0 and 2); the reply from node 1 is missing. Since these two still form a majority, the proposal succeeds.
B) After node 1 finds that it is missing the message numbered (44043,0), it sends a read_op to node 2 to fetch the missing message
in dispatch_op msg_no 44043, 0, paxos op read_op from : 1 to : 2
C) Node 2 sends the message to node 1
in dispatch_op msg_no 44043, 0, paxos op learn_op from : 2 to : 1
3) Incorrect result, abnormal network
Node number: 0, Primary node
in push_msg_2p msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 0 to : 0
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 2 to : 0
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 0 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op prepare_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 0
msg_no 44044, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
Node number: 2, Secondary node
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 0 to : 2
in read_missing_values msg_no 44044, 0
in read_missing_values msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 2
in read_missing_values msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op prepare_op from : 1 to : 2
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 1 to : 2
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 1 to : 2
msg_no 44044, 0 is no_op
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 0 to : 2
in dispatch_op msg_no 44044, 0, paxos op learn_op from : 0 to : 2
in dispatch_op msg_no 44044, 0, paxos op learn_op from : 0 to : 2
Node number: 1, Secondary node
in read_missing_values msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 1
in read_missing_values msg_no 44044, 0
in read_missing_values msg_no 44044, 0
in push_msg_3p msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op prepare_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_prepare_empty_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_prepare_op from : 2 to : 1
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 2 to : 1
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 1 to : 1
msg_no 44044, 0 is no_op
Analysis:
A) Node 0 sends message (44044, 0); only nodes 0 and 2 reply.
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 2 to : 0
B) Node 0 sends tiny_learn_op, but only node 0 itself receives it. At this point the message numbered (44044, 0) is decided on node 0, while node 2 has not yet received the tiny_learn_op
msg_no 44044, 0 is no_op
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 0 to : 2
Note that node 2 receives the tiny_learn_op only after message (44044, 0) has already been decided as no_op. For node 2, in other words, this tiny_learn_op might as well never have arrived.
C) Nodes 1 and 2 try to read the missing message (44044, 0) by sending read_op, but because of the network problems neither gets a timely reply
in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 0
D) Node 1 tries to propose a noop message for (44044, 0)
in push_msg_3p msg_no 44044, 0
Note the key point: the error is in the next step.
E) Node 1 tries to obtain the value it should propose, and completes the propose process
in dispatch_op msg_no 44044, 0, paxos op prepare_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_prepare_empty_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_prepare_op from : 2 to : 1
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 2 to : 1
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 1 to : 1
msg_no 44044, 0 is no_op
E.1) Node 1 sends prepare_op and receives replies from nodes 1 and 2
E.2) Node 1 has no value (it never received the accept_op from node 0), so it replies with ack_prepare_empty_op
E.3) Node 2 has a value (see A), so it replies with ack_prepare_op
E.4) In principle node 1 should propose the value returned by node 2, but the result shows that it did not.
msg_no 44044, 0 is no_op
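To track this down, I read the source code of MGR's prepare phase carefully. The core code is
int gt_ballot(ballot x, ballot y) {
  return x.cnt > y.cnt || (x.cnt == y.cnt && x.node > y.node);
}
Node 1 uses the value returned by node 2 only when this condition holds.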
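F) Further investigation shows that any noop proposal initiated by a node with a larger number can run into the problem above.
in handle_ack_prepare msg_no 82876, 0, m->proposal.cnt = 0, m->proposal.node = 0, p->proposer.msg->proposal.cnt = 0, p->proposer.msg->proposal.node = 1
Because cnt is always 0 at initialization, the comparison depends entirely on the node number. In this example the returned ballot has node 0 and the proposer's own ballot has node 1; since 0 < 1, the condition fails and the value that came from node 0 cannot be used.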
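The effect of this comparison can be checked with a small standalone program. The sketch below copies the gt_ballot logic from above onto a simplified ballot struct (an assumption; the real type lives in the MGR source) and counts, for a value originally accepted from a given node, how many other nodes would discard it when proposing a noop with cnt still at 0. The helper count_discarding is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the ballot type; field names follow the log above. */
typedef struct {
  int32_t cnt;
  uint32_t node;
} ballot;

static int gt_ballot(ballot x, ballot y) {
  return x.cnt > y.cnt || (x.cnt == y.cnt && x.node > y.node);
}

/* Hypothetical helper: number of noop proposers (other than origin) that
   would NOT adopt a value accepted under ballot (0, origin). */
static int count_discarding(uint32_t origin, uint32_t n_nodes) {
  assert(origin < n_nodes);
  ballot accepted = {0, origin};
  int discarded = 0;
  for (uint32_t p = 0; p < n_nodes; p++) {
    if (p == origin) continue;
    ballot own = {0, p}; /* cnt is 0 at initialization */
    if (!gt_ballot(accepted, own)) discarded++;
  }
  return discarded;
}
```

Under these assumptions, a value accepted from node 0 in a three-node group is discarded by both other nodes, while a value accepted from node 2 is adopted by both, which matches the observation that larger-numbered proposers are the ones affected.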
5. Solution
static void propose_noop(synode_no find, pax_machine *p) {
  /* Prepare to send a noop */
  site_def const *site = find_site_def(find);
  assert(!too_far(find));
  replace_pax_msg(&p->proposer.msg, pax_msg_new(find, site));
  /* set cnt to -1 when propose noop*/
  int cnt = -1;
  node_no nodeno = VOID_NODE_NO;
  if (site) nodeno = get_nodeno(site);
  init_ballot(&p->proposer.msg->proposal, cnt, nodeno);
  /* set cnt to -1 when propose noop*/
  assert(p->proposer.msg);
  create_noop(p->proposer.msg);
  //printf("in propose_noop msg_no %lld, %d\n", p->proposer.msg->synode.msgno, p->proposer.msg->synode.node);
  //fflush(stdout);
  /* DBGOUT(FN; SYCEXP(find);); */
  push_msg_3p(site, p, clone_pax_msg(p->proposer.msg), find, no_op);
}
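The code between the two "set cnt to -1 when propose noop" comments is the part that was added. After the change, the noop proposer's ballot starts with cnt = -1; since -1 < 0, a ballot returned with cnt 0 always compares as greater, so its value is adopted and the problem is solved.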
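To see why lowering cnt helps, the comparison can be replayed with and without the change. The sketch below is a standalone model under stated assumptions: ballot is a simplified stand-in for the real type, and noop_ballot and adopts_accepted_value are hypothetical helpers, not MGR functions.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the ballot type used by gt_ballot. */
typedef struct {
  int32_t cnt;
  uint32_t node;
} ballot;

static int gt_ballot(ballot x, ballot y) {
  return x.cnt > y.cnt || (x.cnt == y.cnt && x.node > y.node);
}

/* Ballot a noop proposer starts with; patched != 0 mirrors the cnt = -1 change. */
static ballot noop_ballot(uint32_t node, int patched) {
  ballot b = {patched ? -1 : 0, node};
  return b;
}

/* Returns 1 when the noop proposer would adopt a value accepted under (0, origin). */
static int adopts_accepted_value(uint32_t origin, uint32_t proposer, int patched) {
  assert(origin != proposer);
  ballot accepted = {0, origin};
  return gt_ballot(accepted, noop_ballot(proposer, patched));
}
```

With patched set to 0, node 1 does not adopt a value that originated at node 0; with patched set to 1 it does, regardless of which node number is larger.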