Analyzing and Fixing a Bug That Causes Data Inconsistency in MGR


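1. Background

MGR (MySQL Group Replication) is a good thing, because it fundamentally solves the data inconsistency problem. Beyond that, it comes from a reputable source (Oracle's MySQL team), so we can reasonably expect good quality and ongoing maintenance.

However, during my evaluation I found a serious bug (https://bugs.mysql.com/bug.php?id=92690): when the network suffers from latency, packet loss, or data corruption, the data on the nodes can become severely inconsistent. Such network conditions are fairly likely in cross-region deployments, so this problem has to be solved. I had been waiting for a fix from the official team (the bug was reported on November 5, 2018, three months before this article was written), but no bug fix has been released.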

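2. Community analysis of the bug

a) Same GTID, different content

According to the bug submitter's analysis, transactions with the same GTID have different content on different nodes; they even operate on different tables.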

gr01 > show binlog events in 'binlog.000223' from 11590 limit 11;
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| Log_name      | Pos   | Event_type  | Server_id | End_log_pos | Info                                                                   |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| binlog.000223 | 11590 | Gtid        |        10 |       11651 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:295533' |
| binlog.000223 | 11651 | Query       |        10 |       11723 | BEGIN                                                                  |
| binlog.000223 | 11723 | Table_map   |        10 |       11775 | table_id: 125 (db1.sbtest6)                                            |
| binlog.000223 | 11775 | Update_rows |        10 |       12185 | table_id: 125 flags: STMT_END_F                                        |
| binlog.000223 | 12185 | Table_map   |        10 |       12237 | table_id: 116 (db1.sbtest5)                                            |
| binlog.000223 | 12237 | Update_rows |        10 |       12647 | table_id: 116 flags: STMT_END_F                                        |
| binlog.000223 | 12647 | Table_map   |        10 |       12699 | table_id: 118 (db1.sbtest1)                                            |
| binlog.000223 | 12699 | Delete_rows |        10 |       12919 | table_id: 118 flags: STMT_END_F                                        |
| binlog.000223 | 12919 | Table_map   |        10 |       12971 | table_id: 118 (db1.sbtest1)                                            |
| binlog.000223 | 12971 | Write_rows  |        10 |       13191 | table_id: 118 flags: STMT_END_F                                        |
| binlog.000223 | 13191 | Xid         |        10 |       13218 | COMMIT /* xid=6231928 */                                               |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
11 rows in set (0.00 sec)

gr02 > show binlog events in 'binlog.000221' from 9912 limit 11;
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| Log_name      | Pos   | Event_type  | Server_id | End_log_pos | Info                                                                   |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| binlog.000221 |  9912 | Gtid        |        10 |        9973 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:295533' |
| binlog.000221 |  9973 | Query       |        10 |       10037 | BEGIN                                                                  |
| binlog.000221 | 10037 | Table_map   |        10 |       10089 | table_id: 108 (db1.sbtest5)                                            |
| binlog.000221 | 10089 | Update_rows |        10 |       10499 | table_id: 108 flags: STMT_END_F                                        |
| binlog.000221 | 10499 | Table_map   |        10 |       10551 | table_id: 110 (db1.sbtest4)                                            |
| binlog.000221 | 10551 | Update_rows |        10 |       10961 | table_id: 110 flags: STMT_END_F                                        |
| binlog.000221 | 10961 | Table_map   |        10 |       11014 | table_id: 109 (db1.sbtest10)                                           |
| binlog.000221 | 11014 | Delete_rows |        10 |       11234 | table_id: 109 flags: STMT_END_F                                        |
| binlog.000221 | 11234 | Table_map   |        10 |       11287 | table_id: 109 (db1.sbtest10)                                           |
| binlog.000221 | 11287 | Write_rows  |        10 |       11507 | table_id: 109 flags: STMT_END_F                                        |
| binlog.000221 | 11507 | Xid         |        10 |       11534 | COMMIT /* xid=1185380 */                                               |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
11 rows in set (0.00 sec)

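b) Same Paxos message number, different message type

Someone in the community added logging to the MGR source code and found that a Paxos message with the same number carries a real payload on the Primary node (application-related, such as binlog data) but is an empty message on a Secondary node (a noop, which is not delivered to the application). Hence the data inconsistency.

c) A problem in the prepare phase

Note that prepare here has nothing to do with the database concept; it is the prepare of Paxos. For background, see this blog post (http://mysqlhighavailability.com/the-king-is-dead-long-live-the-king-our-homegrown-paxos-based-consensus/). In the figure that accompanied this section (not reproduced here), Election and prepare mean the same thing.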

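In brief, when any node sends a message, it:

1) first sends a prepare message to determine the value it should propose;

2) based on the result of the previous step, sends an accept message to every participating node;

3) on receiving replies from a majority, sends a learn message. When another node (say node 1) receives the learn message, the message's value is decided on node 1.

As shown above, sending one message takes three phases. This not only generates considerable network traffic; worse, the message incurs significant latency between being proposed and being decided.

Many Paxos variants therefore try to optimize this process, for example by skipping the first (prepare) phase while the system is stable. When the system is unstable, for example when a node finds that it is missing the message for some number, the full three phases are run.

This bug shows up precisely when the network is in bad shape, so it is quite likely that the prepare phase is where things go wrong.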
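The difference between the full and the shortened path can be sketched as follows. This is only an illustration of the idea described above, not XCom code: the enum, the function name plan_phases, and the stable flag are all hypothetical. In the traces later in this article, the two paths appear as push_msg_3p and push_msg_2p.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical phase names, used only for this illustration. */
typedef enum { PHASE_PREPARE, PHASE_ACCEPT, PHASE_LEARN } phase;

/*
 * Fills `out` with the phases a proposer goes through and returns how many
 * there are. A non-zero `stable` models the optimization of skipping the
 * prepare phase; zero models the full three-phase path.
 */
static size_t plan_phases(int stable, phase out[3]) {
  size_t n = 0;
  if (!stable) {
    out[n++] = PHASE_PREPARE; /* only needed when the value is uncertain */
  }
  out[n++] = PHASE_ACCEPT;
  out[n++] = PHASE_LEARN;
  return n;
}
```

With stable set, the plan has two phases (accept, learn); otherwise it has three, starting with prepare.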

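3. Supplementary concepts

a) Node number

Every node has a number. In an MGR cluster of three nodes, for example, the numbers are 0, 1, and 2. The node number is unrelated to whether the node is the Primary.

b) Message number

A message number has two parts: the first is a 64-bit unsigned integer that generally increases, and the second is a node number.

For example, (10064, 0) is a message number.

c) Ballot number

A ballot number also has two parts: the first is a 32-bit signed integer, and the second is again a node number.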
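The two identifiers can be pictured as small C structs. The type and field names below (msg_id, ballot_id, msgno, cnt) are stand-ins chosen for this sketch and are not claimed to match the XCom headers; only the widths and signedness come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Message number: a 64-bit unsigned counter plus the node number. */
typedef struct {
  uint64_t msgno;
  uint32_t node;
} msg_id;

/* Ballot number: a 32-bit signed counter plus the node number. */
typedef struct {
  int32_t cnt;
  uint32_t node;
} ballot_id;

/* Two message numbers are the same only if both parts match. */
static int msg_id_eq(msg_id a, msg_id b) {
  return a.msgno == b.msgno && a.node == b.node;
}
```

Because the ballot counter is signed, it can hold negative values such as -1, which matters for the fix in section 5.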

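4. The author's analysis

a) Method

The methods I know and am good at, debugging techniques in particular, fall short in a distributed system. When I was developing an intelligent SQL optimizer, for example, gdb helped me understand many of the optimizer's details. With a distributed system, however, attaching gdb can change the system's behavior. I therefore debugged mainly with logs, with gdb as a supplement.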
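As a rough idea of what such logging can look like, here is a minimal sketch of a helper that formats one trace line in the shape used by the dispatch_op traces in this article. The function name format_dispatch_trace and its parameters are hypothetical; the instrumentation actually added to the MGR source is not shown in full here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes one trace line into `buf` and returns snprintf's result.
 * (msgno, node) is the message number, `op` the Paxos operation name,
 * and `from`/`to` the sending and receiving node numbers.
 */
static int format_dispatch_trace(char *buf, size_t len,
                                 unsigned long long msgno, unsigned node,
                                 const char *op, unsigned from, unsigned to) {
  return snprintf(buf, len,
                  "in dispatch_op msg_no %llu, %u, paxos op %s from : %u to : %u",
                  msgno, node, op, from, to);
}
```

Printing such a line to stdout and flushing it immediately keeps the traces of all nodes comparable by message number without stopping any process in a debugger.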

b) Paxos message analysis

1) Correct result, healthy network

Node number: 0, Primary node

in push_msg_2p msg_no 1683, 0
in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 2 to : 0
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 1 to : 0
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op


Node number: 2, Secondary node

in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 2
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 2
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op


Node number: 1, Secondary node

in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 1
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 1
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op

Analysis:

A) Node 0 sends the message numbered (1683, 0)

in push_msg_2p msg_no 1683, 0

B) Nodes 0, 1, and 2 all receive accept_op

in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 2
in dispatch_op msg_no 1683, 0, paxos op accept_op from : 0 to : 1


C) Nodes 0, 1, and 2 all reply with ack_accept_op

in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 2 to : 0
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op ack_accept_op from : 1 to : 0

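Note that node 0 starts sending tiny_learn_op as soon as it has received 2 ack_accept_op replies (a majority); node 0 itself receives that message immediately.

D) Nodes 0, 1, and 2 receive tiny_learn_op, which means the message is decided on every node.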

in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 0
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 1
in dispatch_op msg_no 1683, 0, paxos op tiny_learn_op from : 0 to : 2

E) Nodes 0, 1, and 2 deliver the received message to the upper-layer application

msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
msg_no 1683, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
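The majority rule from step C), where node 0 proceeds after 2 of 3 acknowledgements, can be written as a one-line check. This is a generic sketch with a hypothetical function name, not the quorum code used by XCom.

```c
#include <assert.h>

/* True when `acks` acknowledgements form a strict majority of `nodes`. */
static int is_majority(unsigned acks, unsigned nodes) {
  return acks > nodes / 2;
}
```

For a three-node group, 2 acknowledgements are enough and 1 is not, which is why the proposer does not have to wait for the slowest node.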

2) Correct result, degraded network

Node number: 0, Primary node
in push_msg_2p msg_no 44043, 0
in dispatch_op msg_no 44043, 0, paxos op accept_op from : 0 to : 0
in dispatch_op msg_no 44043, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 44043, 0, paxos op ack_accept_op from : 2 to : 0
in dispatch_op msg_no 44043, 0, paxos op tiny_learn_op from : 0 to : 0
msg_no 44043, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op

Node number: 2, Secondary node
in dispatch_op msg_no 44043, 0, paxos op accept_op from : 0 to : 2
in dispatch_op msg_no 44043, 0, paxos op tiny_learn_op from : 0 to : 2
msg_no 44043, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op
in dispatch_op msg_no 44043, 0, paxos op read_op from : 1 to : 2


Node number: 1, Secondary node
in read_missing_values msg_no 44043, 0
in dispatch_op msg_no 44043, 0, paxos op learn_op from : 2 to : 1
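msg_no 44043, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op

Analysis:

Compared with case 1), there are several findings:

A) Node 0 receives only two ack_accept_op replies (from nodes 0 and 2); the reply from node 1 is missing. Because these two still form a majority, the message succeeds.

B) After node 1 discovers that message (44043, 0) is missing, it sends read_op to node 2 to fetch the missing message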

in dispatch_op msg_no 44043, 0, paxos op read_op from : 1 to : 2

C) Node 2 sends the message to node 1

in dispatch_op msg_no 44043, 0, paxos op learn_op from : 2 to : 1

3) Incorrect result, degraded network

Node number: 0, Primary node
in push_msg_2p msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 0 to : 0
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 2 to : 0
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 0 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op prepare_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 0
msg_no 44044, 0 cargo_type : app_type , msg_type : normal, pax_op : learn_op

Node number: 2, Secondary node
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 0 to : 2
in read_missing_values msg_no 44044, 0
in read_missing_values msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 2
in read_missing_values msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op prepare_op from : 1 to : 2
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 1 to : 2
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 1 to : 2
msg_no 44044, 0 is no_op
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 0 to : 2
in dispatch_op msg_no 44044, 0, paxos op learn_op from : 0 to : 2
in dispatch_op msg_no 44044, 0, paxos op learn_op from : 0 to : 2

Node number: 1, Secondary node
in read_missing_values msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 1
in read_missing_values msg_no 44044, 0
in read_missing_values msg_no 44044, 0
in push_msg_3p msg_no 44044, 0
in dispatch_op msg_no 44044, 0, paxos op prepare_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_prepare_empty_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_prepare_op from : 2 to : 1
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 2 to : 1
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 1 to : 1
msg_no 44044, 0 is no_op

Analysis:

A) Node 0 sends message (44044, 0); only nodes 0 and 2 reply.

in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 0 to : 0
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 2 to : 0

B) Node 0 sends tiny_learn_op, but only node 0 itself receives it in time. At this point message (44044, 0) is decided on node 0, while node 2 has not yet received the tiny_learn_op

msg_no 44044, 0 is no_op
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 0 to : 2

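Note that node 2 receives the tiny_learn_op only after message (44044, 0) has already been decided as no_op. For node 2, in other words, it is as if this tiny_learn_op had never arrived.

C) Nodes 1 and 2 try to read the missing message (44044, 0) by sending read_op, but because of the network problems neither gets a timely reply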

in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 1 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 0
in dispatch_op msg_no 44044, 0, paxos op read_op from : 2 to : 0

D) Node 1 tries to propose a noop for (44044, 0)

in push_msg_3p msg_no 44044, 0

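Note: this is the key point. The error is in the next step.

E) Node 1 tries to obtain the value it should propose, and completes the propose process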

in dispatch_op msg_no 44044, 0, paxos op prepare_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_prepare_empty_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_prepare_op from : 2 to : 1
in dispatch_op msg_no 44044, 0, paxos op accept_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 1 to : 1
in dispatch_op msg_no 44044, 0, paxos op ack_accept_op from : 2 to : 1
in dispatch_op msg_no 44044, 0, paxos op tiny_learn_op from : 1 to : 1
msg_no 44044, 0 is no_op

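E.1) Node 1 sends prepare_op and receives replies from nodes 1 and 2

E.2) Node 1 has no value (it never received the accept_op from node 0), so it replies with ack_prepare_empty_op

E.3) Node 2 has a value (see A), so it replies with ack_prepare_op

E.4) In principle node 1 should propose the value returned by node 2, but the result shows that it did not

msg_no 44044, 0 is no_op

To track this down, I read the source code of MGR's prepare phase carefully. The core code is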

int gt_ballot(ballot x, ballot y) {
  return x.cnt > y.cnt || (x.cnt == y.cnt && x.node > y.node);
}

 

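Node 1 uses the value returned by node 2 only when this condition holds, that is, only when the ballot carried in the reply (m->proposal in the trace below) is greater than node 1's own proposal ballot (p->proposer.msg->proposal).

F) Further investigation shows that any noop proposal initiated by a node with a larger node number can hit the problem above.

in handle_ack_prepare msg_no 82876, 0, m->proposal.cnt = 0, m->proposal.node = 0, p->proposer.msg->proposal.cnt = 0, p->proposer.msg->proposal.node = 1

Because cnt is initialized to 0 everywhere, the ordering depends entirely on the node number. In this example 0 < 1, so node 0's value cannot be used.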
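The comparison can be replayed in isolation. The sketch below copies gt_ballot from the listing above and feeds it the values from the handle_ack_prepare trace; the ballot typedef is a stand-in built from the description in section 3, and adopts_reply_value is a hypothetical helper that only names the decision being made.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the ballot type: signed 32-bit counter plus node number. */
typedef struct {
  int32_t cnt;
  uint32_t node;
} ballot;

/* Same comparison as in the listing above. */
static int gt_ballot(ballot x, ballot y) {
  return x.cnt > y.cnt || (x.cnt == y.cnt && x.node > y.node);
}

/*
 * Hypothetical helper: the proposer takes over the value from an
 * ack_prepare reply only if the reply's ballot beats its own.
 */
static int adopts_reply_value(ballot reply, ballot proposer) {
  return gt_ballot(reply, proposer);
}
```

With the traced values, reply (0, 0) against proposer (0, 1), the check fails and the proposer keeps its noop. A proposer with ballot (0, 0) facing a reply with ballot (0, 1) would pass the check, which is why only proposers with larger node numbers are affected.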

5. The fix

static void propose_noop(synode_no find, pax_machine *p) {
  /* Prepare to send a noop */
  site_def const *site = find_site_def(find);
  assert(!too_far(find));
  replace_pax_msg(&p->proposer.msg, pax_msg_new(find, site));
  /* set cnt to -1 when propose noop */
  int cnt = -1;
  node_no nodeno = VOID_NODE_NO;
  if (site) nodeno = get_nodeno(site);
  init_ballot(&p->proposer.msg->proposal, cnt, nodeno);
  /* set cnt to -1 when propose noop */
  assert(p->proposer.msg);
  create_noop(p->proposer.msg);
  // printf("in propose_noop msg_no %lld, %d\n", p->proposer.msg->synode.msgno, p->proposer.msg->synode.node);
  // fflush(stdout);
  /* DBGOUT(FN; SYCEXP(find);); */
  push_msg_3p(site, p, clone_pax_msg(p->proposer.msg), find, no_op);
}

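The lines between the two "set cnt to -1" comments are the added part (highlighted in red in the original post). After the change, the noop proposer's cnt is -1, and since -1 < 0, any reply whose ballot has cnt 0 now compares greater regardless of node number, so its value is used. That resolves the problem.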
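The effect of starting the noop ballot at -1 can be checked against the same comparison. In this sketch gt_ballot is copied from section 4, while the ballot typedef and the noop_ballot helper are stand-ins introduced for illustration; the real change is the init_ballot call in propose_noop.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the ballot type: signed 32-bit counter plus node number. */
typedef struct {
  int32_t cnt;
  uint32_t node;
} ballot;

/* Same comparison as in section 4. */
static int gt_ballot(ballot x, ballot y) {
  return x.cnt > y.cnt || (x.cnt == y.cnt && x.node > y.node);
}

/* Hypothetical helper: the ballot a noop proposer starts with after the fix. */
static ballot noop_ballot(uint32_t node) {
  ballot b = {-1, node};
  return b;
}
```

For every pair of node numbers in a three-node group, a reply ballot with cnt 0 now compares greater than the noop proposer's ballot, so the node number no longer decides whether an already accepted value is picked up.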

 

