造成的復制延遲原因(從binlog入手分析, 20201101更新 )

本文轉載自查看原文 2020-02-03 20:20 1056 mysql

•大事務造成的延遲，其延遲不會從0開始增加，而是直接從主庫執行了這個事務多長時間開始的，從庫也要一樣的執行時間執行這個事務，而且事務執行時間累加上去，非常合理。

比如主庫執行這個事務花費的20秒，那么延遲就會從20開始，

從 log_event.cc 看到，這是因為 Query Event 中沒有准確的執行時間

log_event.cc

入口:

int Query_log_event::do_apply_event(Relay_log_info const *rli,
const char *query_arg, size_t q_len_arg)

包括: sql_mode , 客戶端字符集,自增環境設置,登陸的db,結果執行時間,ddl和dml執行時間
dml: 數據被寫盤后的時間 thd->set_time(&(common_header->when)); , 一個事務對應一個Query_Event 
ddl: 實際語句執行時間 thd->set_time(&(common_header->when)); 整個語句執行后才寫入執行時間

show slave status 入口:

rpl_slave.cc:

bool show_slave_status_send_data(THD *thd, Master_info *mi,
char* io_gtid_set_buffer,
char* sql_gtid_set_buffer)


if ((mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()) &&
(!strcmp(mi->get_master_log_name(), mi->rli->get_group_master_log_name())))

檢查SQL線程是否在中繼日志的末尾
        檢查應使用兩個條件進行
        condition1：比較日志位置和
        condition2：比較文件名

獲得 seconds_behind_master 時間
long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp) - mi->clock_diff_with_master);

系統時間 (long)(time(0)
主從之間系統時間差值 mi->clock_diff_with_master

1. 重點:
主從服務器的操作系統時間要一致,因為這兩個是可變不穩定因素

dml語句和ddl語句的執行時間 mi->rli->last_master_timestamp

log_event.cc

Query_log_event::Query_log_event(const char* buf, uint event_len,
const Format_description_event
*description_event,
Log_event_type event_type)

dml(從begin開始) 和 ddl 執行時間
exec_time= end_time.tv_sec - thd_arg->start_time.tv_sec;

rpl_slave.cc 
等於各個Event header 的 timestamp , 事務都是 GTID_EVENT 和 XID_EVENT 提交時間
rli->last_master_timestamp= ev->common_header->when.tv_sec + (time_t) ev->exec_time;

(題外話: cnblogs可以隨便給其他網站爬蟲( https://www.colabug.com/2020/0203/6936599/ ) , 我都還沒寫完它就爬 , cnblogs 一點保護不了作者勞動成果

binlog_event.h
class Log_event_header
·timeval：固定4字節，是新紀元時間（1970年1月1日0時0分0秒）以來的秒數。這個時
間是命令發起的時間。如何定義命令發起呢？它是執行計划生成后開始實際執行語句的
時候。源碼可以是在dispatch一command函數的開頭設置的(thd->set-time())。言外
之意對於語法意義、權限橙查、優化器生成執行計划的時間都包括在里面。其次要注意
了這個時間在從庫計算Seconds一Behind一Master的時候是一個重要的依據


·type_code：固定1字節，event事件的編碼。每TEvent有自己的編碼。
·unmasked_server_id:固定4字節。就是生成這個Event服務端的serverid。即便從庫端開啟了
log-slave-updates%，從庫將主庫的Event寫到BinaryLog，這個serverid也是主庫
的serveride如果這個Event再冫欠傳到主庫，那么需要跳過。源碼可以在


Log-event::do-shalLskip函數中找到跳過邏輯如下“

·event Len：固定4字節，整個Event的長度。
·end_log_p：固定4字節，下一個Event的開始位置。
．flags:固定2字節，某些Event包含這個標示，比如Format-descriptionlog_event中
LOG_EVENT_BINLOG_IN_USE_F 標示當前Binary log是當前寫入文件。

class Log_event_footer
·Event footer中的crc：固定4字節，這部分就是整個Event的一個crc校驗碼，用於標示Event的完整性

因為能力和篇幅有限，不可能介紹所有的Event,本系列值介紹一些常用的Evnet，下面是本系列將會介紹到了 Event:
• QUERY_EVENT=2 :在語句模式下記錄實際的語句，在行模式下不記錄任何語句相關的倍息，
但是DDL始終是記錄的語句，本系列值考慮行模式。因此即便是行模式下的 DDL也會記錄為語句。
• FORMAT_DESCRIPTION_EVENT= 15：說明Binary log的版本信息。總包含在每一個 Binary log的開頭。
• XID_EVENT=16:當事務提交的時候記錄這個Event其中攜帶了XID信息。
• TABLE_MAP_EVENT = 19: 包含了tablejd和具體表名的映射。
• WRITE_ROWS_EVENT = 30: INSERT語句生成的Event,包含插入的實際數據。是行模式才有的。
• UPDATE_ROWS_EVENT = 31: UPDATE語句生成的Event,包含數據的前后印象數據。是行模式才有的
• DELETE_ROWS_EVENT = 32: DELETE語句生成的Event,包含實際需要刪除的數據。是行模式才有的。
• GTID_LOG_EVENT = 33: 如果開啟GTID模式的時候生成關於GTID的倍息，並且攜帶 last commit和seq number作為MTS並行回放的依據。
• ANONYMOUS_GTIDJ_OG_EVENT=34:在關閉GTID模式的時候生成，並且攜帶last commit和seq number作為MTS並行回放的依據#。
• PREVIOUS_GIlDS_LOG_EVENT=35：說明前面所有的Binary log包含的GIlD SET, relay log則代表10線程收到的GTID SET。

參考Log_event_type

一、FORMAT DESCRIPTION EVENT
1、FORMAT_DESCRIPTION_EVENT的作用
攜帶的數據都是固定的，
包含了binary log的版本信息、MySQL的版本信息、Event_header的長度、
以及每個Event type的固定部分的長度。

下面倌患將會保存在從庫的內存中：
• 在從庫的內存中保存主庫的倍息，這個變星是Master_info.mi_description_event。
queue_event函數中case binary_log::FORMAT_DESCRIPTION_EVENT部分。

• 將從庫的relay log的FORMAT_DESCRIPTION_EVENT記錄為和主庫相同，
即更新 RelayJogInfo.rlLdescription_event

log_event.cc:
int Format_description_log_event::do_apply_event(Relay_log_info const *rli) 函數中如下片段:

如果從站未請求該事件，即Master發送了當從站要求位置> 4時，
該事件將使rli-> group_master_log_pos前進。 
說Slave要位置1000，Format_desc事件的結尾是96。
然后在復制rli-> group_master_log_pos的開始將是0，
然后是96，然后跳到第一個真正詢問的事件（即> 96）

/* Save the information describing this binlog */
const_cast<Relay_log_info *>(rli)->set_rli_description_event(this);

至少下面幾個地方都會讀取其中的倍息：
1.    每次SQL線程應用Event的時候會獲取Event_header的長度和相應Even個定部分的長度。
2.    I0線程啟動的時候檢測版本，參考函數 get_master_version_and_clock 。
3.    將倌患寫入到 relay log 的開頭的FORMAT_DESCRIPTION_EVENT中。
4. 位置必須在#4

rpl_slave.cc:
static int get_master_version_and_clock(MYSQL* mysql, Master_info* mi)

從庫針對不同版本不同處理 , mysql>=5.0 都能復制
switch (version_number)
{
case 0:
case 1:
case 2:
errmsg = "Master reported unrecognized MySQL version";
err_code= ER_SLAVE_FATAL_ERROR;
sprintf(err_buff, ER(err_code), errmsg);
break;
case 3:
mi->set_mi_description_event(new
Format_description_log_event(1, mysql->server_version));
break;
case 4:
mi->set_mi_description_event(new
Format_description_log_event(3, mysql->server_version));
break;
default:
/*
Master is MySQL >=5.0. Give a default Format_desc event, so that we can
take the early steps (like tests for "is this a 3.23 master") which we
have to take before we receive the real master's Format_desc which will
override this one. Note that the Format_desc we create below is garbage
(it has the format of the *slave*); it's only good to help know if the
master is 3.23, 4.0, etc.
*/
mi->set_mi_description_event(new
Format_description_log_event(4, mysql->server_version));
break;
}
}

將binlog event 寫入 binlog cache

log_event.cc
bool Format_description_log_event::write(IO_CACHE* file)

if (post_header_len_size == Binary_log_event::LOG_EVENT_TYPES)
相同版本主從服務器復制

else if (post_header_len_size > Binary_log_event::LOG_EVENT_TYPES)
在新的Master和舊的Slave之間復制。但不會從內存復制,所以任何內存不足的讀取。

else 在舊版本的主服務器和新版本的從服務器之間復制。
在這種情況下，它可能導致 Master 和 Slave 上發生不同的number_of_events。 當循環relay log時，
來自Master的FORMAT_DESCRIPTION_EVENT用於在Slave上創建FORMAT_DESCRIPTION_EVENT事件，
該事件將在此處寫入。 在這種情況下可能最終會讀取更多字節，
如post_header_len.size()<Binary_log_event::LOG_EVENT_TYPES;。引起內存問題。

binlog_version
server_version
create_timestamp
headerjength
array of post-header

•固定部分，這部分是大小不變的。
1. binlog_version: 2字節 binary log的版本當前為‘4’。
2. server_version: 50字節，MySQL的版本，為字符串形式#
3. create_timestamp: 4字節，MySQL每次啟動的時候的第一個binary log的 FORMAT_DESCRIPTION_EVENT會記錄，其他情況為0,源碼有如下解釋：
4. headerjength: 1 字節，Event header的長度。當前為‘19’。
5. array of post-header:當前版本為39字節#這是一個數組用於保存每個Event類型的固定部分的大小9

生成時機
這個Event作為binary log的第一個Event, _般都是在binary log切換的時候發生比如：
• flush binary 丨ogs命令。
• binary log自動切換。
• 重啟MySQL實例

最后注意下在本Event的Event header中flags如果為 LOG_EVENT_BINLOGJN_USE_F標示說明當前binary log沒有關閉（比如本binary log為當前寫入文件或者異常關閉MySQL實例）^如果異常關閉MySQL實例會檢測這個值決定是否做binary log recovery。

二、PREVIOUS GTIDS LOG EVENT
1、PREVIOUS_GTIDS_LDG_EVENT的作用
這個Event只包含可變部分。通常作為binary log的第二個Event,用於描述前面所有的binary log包含的GTID SET (包括已經刪除的）。初始化GTID模塊的時候也會掃描 binary log中的這個Event*在relay log同樣包含這個Event，主要用於描述I/O線程接收過哪些 GTID,后面能看到MySQL實例初始化的時候可能會掃描relay log中的這個Event來確認 Retrieved_Gtid_Set 。

3、主體格式

整個寫入過程集中在Gtid_set::encode函數中，因為GTID SET中可能出現多個server_uuid並且可能出現'gap'，因此是可變的。在Gtid_set::encode函數中我們也可以清晰的看到它在循環 GTID SET中的每個server_uuid和每一個GTID SET Interval,如下：
• 可變部分，這部分大小可變
• number of sids: 8字節# 小端顯示，本GITD SET有多少個server_uuid
• server_uuid: 16字節。GTID SET中的server_uuid
• n_intervals: 8字節。本server_uuid下GTID SET Interval的個數
• inter_start: 8字節。每個GTID SET Interval起始的gno
• inter_next: 8字節。每個GTID SET Interval結尾的下一個gno

注意：甶於一個GTIDSET可以包含多個server_uuid,因此第2到第5部分可能包含多個。
如果還包含多個GTID SET Interval則第4和第5部分也可能多個。

4、實例解析
下面是一個PREVIOUS_GTIDS_LOG_EVENT (mysqlbinlog —hexdump 輸出>，這種情況是我手動刪除了 auto.cnf手動構造出來的：

(cnblog格式能否優美一下)

• 02 00 00 00 00 00 00 00:包含2個server_uuid U小端顯示就是2個。

• 24 98 54 63 a5 36 11 e8 a3 0c 52 54 00 81 38 e4:第一個server_uuid。
• 01 00 00 ◦0 00 00 00 00: njntervals表示本GTID SET Interval的個數。小端顯示就是1個
• 01 00 00 00 00 00 00 00: inter_start,第一個GTID SET Interva丨的起始的gno為1
• 08 00 00 00 00 00 00 00: inter_next,第一個GTID SET Interval結尾的下一個gno為8.
• 6c ea 48 f6 92 6c 11 e9 b1 cb 52 54 00 81 38 e4:第二個server_uuid
• 01 00 00 00 00 00 00 00: njntervals表示本GT丨D SET Interval的個數。小端顯示就是1個
• 01 00 00 00 00 00 00 00: inter_start，第一個GTID SET Interva丨的起始的gno為 1
• 05 00 00 00 00 00 00 00: inter_next,第一個GTID SET Interval結尾的下一個gno位

我們看到解析出來的：

• 24985463-a536-11e8-a30c-5254008138e4:1-7
• 6cea48f6-926c-11e9-b1cb-5254008138e4:1-4
可以看到它們是一致的，只是inter_next應該減去了 1,因為Event中記錄的是GTID SET Interval結尾的下一個gno*

5、生成時機
生成時機一般也是進行binary log切換的時候作為第二個Event寫入binary log

7、重點 Event

GTID_LOG_EVENT
因為GTID_LOG_EVENT和ANONYMOUS_GTID_LOG_EVENT格式一致，只是攜帶的數據不一樣而已，只解釋GTID_LOG_EVENT即可
一、GTID LOG EVENT
(1) GTID_LOG_EVENT的作用
只需要知道GTID_LOG_EVENT 這個Event主要記錄的部分有下面三個：

• GTID的詳細倌息。
• 邏輯時鍾詳細倍息，即 last commit 和 seq number
• 是否為行模式，比如DDL語句就不是行模式的

顯示開啟事務的情況下GTID_LOG_EVENT和XID_EVENT Event header的 timestamp都是commit命令發起的時間，

當然如果沒有顯示開啟事務那么timestamp還是命令發起的時間。

log_event.cc

固定部分
• flags: 1字節。主要用於表示是否是行模式的，如果是則為0X00。不是則為0X01,注意 DDL都不是行模式的，而是語句模式
• server_uuid: 16字節。server_uuid參數去掉中間的16進制表示。
• gno: 8字節。小端顯示。表示GTID的序號。
• ts type: 1字節•固定為02。
• last commit: 8字節。小端顯示。
• seq number: 8字節。小端顯示。

(4)簡單解析
下面是一個的GTID_LOG_EVENT（mysqlbinlog—hexdump 輸出):

#191218 21:08:49 server id 1 end_log_pos 417 CRC32 0xac5430b1
# Position Timestamp Type Master ID Size Master Pos Flags
# 160 e1 24 fa 5d 21 01 00 00 00 41 00 00 00 a1 01 00 00 00 00
# 173 00 12 cf ee 78 e5 80 11 e6 a7 90 00 ff 05 93 af |....x...........|
# 183 ce 02 00 00 00 00 00 00 00 02 01 00 00 00 00 00 |................|
# 193 00 00 02 00 00 00 00 00 00 00 b1 30 54 ac |...........0T.|
# GTID last_committed=1 sequence_number=2 rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;

mysql> show variables like '%uuid%';
+---------------+--------------------------------------+
| Variable_name | Value |
+---------------+--------------------------------------+
| server_uuid | 12cfee78-e580-11e6-a790-00ff0593afce |
+---------------+--------------------------------------+
1 row in set (0.00 sec)

• 02 00 00 00 00 00 00 00:這就是GTID的序號，小端顯示就是0X02, 10進制的2
• 02: ts type
• 01 00 00 00 00 00 00 00:即last_committed=1 , 小端顯示就是0X01
• 02 00 00 00 00 00 00 00:即sequence_nunnber=2 ,小端顯示就是0X02

• 40 00 00 00 00 00 00 00:這就是GTID的序號，小端顯示就是0X40, 10進制的64
• 02: ts type
• 01 00 00 00 00 00 00 00:即last_committed=1,小端顯示就是0X01
• 02 00 00 00 00 00 00 00:即sequence_nunnber=2 ,小端顯示就是0X02

(5) 生成時機
關於生成時機來講GTID_LOG_EVENT的生成和寫入binary log文件都是 order commit 的flush 階段

(6) ANONYMOUS_GTID_LOG_EVENT
這是匿名GTID Event, 5.7如果不開啟GTID則使用這種格式。它除了不生成GTID相關倍患外和GTID_LOG_EVENT保持一致，即如下部分全部為0:
• server_uuid
• gno

(7) GTID三種模式
•自動生成GTID:主庫一般是這種情況（AUTOMATIC_GROUP)
•指定GTID:從庫或者使用GTID_NEXT—般就是這種情況（GTID_GROUP)
•匿名GTID:當然也就是不開啟GTID了時候了（ANONYMOUS_GROUP)
源碼的注釋

rpl_gtid.h

/**
Specifies that the GTID has not been generated yet; it will be
generated on commit. It will depend on the GTID_MODE: if
GTID_MODE<=OFF_PERMISSIVE, then the transaction will be anonymous;
if GTID_MODE>=ON_PERMISSIVE, then the transaction will be assigned
a new GTID.

This is the default value: thd->variables.gtid_next has this state
when GTID_NEXT="AUTOMATIC".

It is important that AUTOMATIC_GROUP==0 so that the default value
for thd->variables->gtid_next.type is AUTOMATIC_GROUP.
*/
AUTOMATIC_GROUP= 0,
/**
Specifies that the transaction has been assigned a GTID (UUID:NUMBER).

thd->variables.gtid_next has this state when GTID_NEXT="UUID:NUMBER".

This is the state of GTID-transactions replicated to the slave.
*/
GTID_GROUP,
/**
Specifies that the transaction is anonymous, i.e., it does not
have a GTID and will never be assigned one.

thd->variables.gtid_next has this state when GTID_NEXT="ANONYMOUS".

This is the state of any transaction generated on a pre-GTID
server, or on a server with GTID_MODE==OFF.
*/
ANONYMOUS_GROUP,
/**
GTID_NEXT is set to this state after a transaction with
GTID_NEXT=='UUID:NUMBER' is committed.

This is used to protect against a special case of unsafe
non-transactional updates.

Background: Non-transactional updates are allowed as long as they
are sane. Non-transactional updates must be single-statement
transactions; they must not be mixed with transactional updates in
the same statement or in the same transaction. Since
non-transactional updates must be logged separately from
transactional updates, a single mixed statement would generate two
different transactions.

Problematic case: Consider a transaction, Tx1, that updates two
transactional tables on the master, t1 and t2. Then slave (s1) later
replays Tx1. However, t2 is a non-transactional table at s1. As such, s1
will report an error because it cannot split Tx1 into two different
transactions. Had no error been reported, then Tx1 would be split into Tx1
and Tx2, potentially causing severe harm in case some form of fail-over
procedure is later engaged by s1.

To detect this case on the slave and generate an appropriate error
message rather than causing an inconsistency in the GTID state, we
do as follows. When committing a transaction that has
GTID_NEXT==UUID:NUMBER, we set GTID_NEXT to UNDEFINED_GROUP. When
the next part of the transaction is being processed, an error is
generated, because it is not allowed to execute a transaction when
GTID_NEXT==UNDEFINED. In the normal case, the error is not
generated, because there will always be a Gtid_log_event after the
next transaction.
*/
UNDEFINED_GROUP,
/*
GTID_NEXT is set to this state by the slave applier thread when it
reads a Format_description_log_event that does not originate from
this server.

Background: when the slave applier thread reads a relay log that
comes from a pre-GTID master, it must preserve the transactions as
anonymous transactions, even if GTID_MODE>=ON_PERMISSIVE. This
may happen, e.g., if the relay log was received when master and
slave had GTID_MODE=OFF or when master and slave were old, and the
relay log is applied when slave has GTID_MODE>=ON_PERMISSIVE.

So the slave thread should set GTID_NEXT=ANONYMOUS for the next
transaction when it starts to process an old binary log. However,
there is no way for the slave to tell if the binary log is old,
until it sees the first transaction. If the first transaction
begins with a Gtid_log_event, we have the GTID there; if it begins
with query_log_event, row events, etc, then this is an old binary
log. So at the time the binary log begins, we just set
GTID_NEXT=NOT_YET_DETERMINED_GROUP. If it remains
NOT_YET_DETERMINED when the next transaction begins,
gtid_pre_statement_checks will automatically turn it into an
anonymous transaction. If a Gtid_log_event comes across before
the next transaction starts, then the Gtid_log_event will just set
GTID_NEXT='UUID:NUMBER' accordingly.
*/
NOT_YET_DETERMINED_GROUP
};

一、QUERY_EVENT
(1) QUERY_EVENT的作用
QUERY_EVENT不僅會記錄一些語句運行的環境比如SQL_MODE,客戶端字符集、自增環境 , 設置、當前登錄數據庫等，而且會記錄執行時間。但對於行模式的DDL和DML記錄的執行時間會有所不同t如下：
• DML:執行時間記錄的是第一條數據更改后的時間，而不是真正本條DML語句執行的時間（一個DML語句可能修改很多條數據），往往這個時間非常短，不能正確的表示DML 語句執行的時間 , 語句部分記錄的是’BEGIN’
• DDL:執行時間是實際語句的執行時間，語句部分記錄的是實際的語句。
執行時間是Seconds_Behind_Master計算的一個影晌因素，對於一個事務來講只有一個這樣的QUERY_EVENT。

格式部分(單位:bytes):

固定部分，這部分是大小不變的。

• slave_proxy_id:4字節，主庫生成本Event的 thread id,它和show processlist中的id 對應，
• query_exec_time: 4字節，這就是執行時間了。但是對於行模式的DML語句來講這個執行時間並不准確，上面已經描述了原因。而對於DDL來講還是比較准確的。
• db_len: 1字節，用於描述數據庫名的長度0
• error_code: 2字節，執行語句的錯誤碼。
• status_vars_len: 2字節，status variables部分的長度。

可変部分，這部分是可變的。
• status variables:這部分是一個環境參數的設置，其中包含了很多種格式。每種格式都有自己的長度和數據域，因此可以輕松的讀取到各種值,比如SQL_MODE,客戶端字符集、自增環堍、客戶端排序字符集等，但是其過於復雜這里不做實際解析。
• db:當前登錄的database名字，以0x00結尾。主庫來源為thd->db。如果是語句模式從庫做過濾的時候會使用這個名字。下面是邏輯驗證：
• query:這就是具體的語句9對於行模式的DML來講就是‘BEGIN’，而對於DDL來講就是具體的語句。
如果我們打開Query_log_event::do_apply_event函數就會看到這個Event在從庫應用的時候就是設置各種環境比如客戶端字符集、自增環境設置、當前登錄數據庫等，然后執行相應的語句，而對於行模式的DML這里只會執行‘BEGIN、這里注意一個細節，其中包含一句：
這句代碼會設置線程的命令執行時間為Event header中Timestamp的吋間，因此在從庫上執行now()函數時是可以得到正確的結果的。

statement_events.cpp:

從Slave讀取來源:

Query_event::Query_event(const char* buf, unsigned int event_len,
const Format_description_event *description_event,
Log_event_type event_type)

 

......

// 讀取 status variables

start= (Log_event_header::Byte*) (buf + post_header_len);
end= (const Log_event_header::Byte*) (start + status_vars_len);

// 讀取 db

  db= (const char* )end;

一、QUERY-EVENT
(1) QUERY_EVENT的作用
QUERY_EVENT不僅會記錄一些語句運行的環境比如SQL_MODE,客戶端字符集、自增環境設置、當前登錄數據庫等，而且會記錄執行時間。但對於行模式的DDL和DML記錄的執行時間會有所不同t如下：
• DML:執行時間記錄的是第一條數據更改后的時間，而不是真正本條DML語句執行的時間（一個DML語句可能修改很多條數據），往往這個時間非常短，不能正確的表示DML 語句執行的時間 , 語句部分記錄的是’BEGIN’
• DDL:執行時間是實際語句的執行時間，語句部分記錄的是實際的語句。
執行時間是Seconds_Behind_Master計算的一個影晌因素，對於一個事務來講只有一個這樣的QUERY_EVENT。

如果我們打開 QueryLog_event::do_apply_event 函數就會看到這個Event在從庫應用的時候就是設罝各種環境比如客戶端字符集、自增環境設罝、當前登錄數據庫等，然后執行相應的語句，而對於行模式的DML這里只會執行‘BEGIN’。這里注意一個細節，其中包含一句：

log_event.cc

int Load_log_event::do_apply_event(NET* net, Relay_log_info const *rli,
bool use_rli_only_for_errors)

......

if (rpl_filter->db_ok(thd->db().str))
{
thd->set_time(&(common_header->when));

這句代碼會設罝線程的命令執行時間為Event header中Timestamp的時間。因此當我們在從庫上執行now()函數時是可以得到正確的結果的

(4)實例解析
下面是一個行模式的DML的QUERY_EVENT (mysqlbinlog --hexdump 輸出）：

slave_proxy_id 4 bytes
query_exec_time 4 bytes
db_len 1 bytes
error_code 2 bytes
status_vars_len 2 bytes
status_variables 可變
db 可變
query 可變

固定部分，這部分是大小不變的。
• slave_proxy_id:4字節，主庫生成本Event的thread丨d,它和show processlist中的id 對應，
• query_exec_time: 4字節，這就是執行時間了。但是對於行模式的DML語句來講這個執行時間並不准確，上面已經描述了原因。而對於DDL來講還是比較准確的。
• db_len: 1字節，用於描述數據庫名的長度0
• error_code: 2字節，執行語句的錯誤碼。
• status_vars_len: 2字節，status variables部分的長度。

從讀取來源:

statement_events.cpp
Query_event::Query_event(const char* buf, unsigned int event_len,
const Format_description_event *description_event,
Log_event_type event_type)

end= (const Log_event_header::Byte*) (start + status_vars_len);
db= (const char* )end;
// 這里將指針指向 Query_event header中的db-name

log_event.cc
int Query_log_event::do_apply_event(Relay_log_info const *rli,
const char *query_arg, size_t q_len_arg)

// 讀取db-name
set_thd_db(thd, db, db_len);

log_event.cc 調用 rpl_filter.cc , rpl_filter.h

bool
Rpl_filter::db_ok_with_wild_table(const char *db)

• query:這就是具體的語句9對於行模式的DML來講就是‘BEGIN’，而對於DDL來講就是具體的語句。
如果我們打開Query_log_event::do_apply_event函數就會看到這個Event在從庫應用的時候就是設置各種環境比如客戶端字符集、自增環境設置、當前登錄數據庫等，然后執行相應的語句，而對於行模式的DML這里只會執行‘BEGIN、這里注意一個細節，其中包含一句：
這句代碼會設霣線程的命令執行時間為Event header中Timestamp的吋間，因此當我們在從庫上執行now()函數時是可以得到正確的結果的。

(4)實例解析
下面是一個行模式的DML的QUERY_EVENT (mysqlbinlog --hexdump 輸出）：

# at 417
#191218 21:08:49 server id 1 end_log_pos 490 CRC32 0x1b37cdc7
# Position Timestamp Type Master ID Size Master Pos Flags
# 1a1 e1 24 fa 5d 02 01 00 00 00 49 00 00 00 ea 01 00 00 08 00
# 1b4 03 00 00 00 00 00 00 00 05 00 00 1a 00 00 00 00 |................|
# 1c4 00 00 01 00 00 00 00 00 00 00 00 06 03 73 74 64 |.............std|
# 1d4 04 2d 00 2d 00 21 00 77 6f 72 6c 64 00 42 45 47 |.......world.BEG|
# 1e4 49 4e c7 cd 37 1b |IN..7.|
# Query thread_id=3 exec_time=0 error_code=0
SET TIMESTAMP=1576674529/*!*/;
BEGIN

固定部分
• 03 00 00 00： thread id 3.
• 00 00 00 00:執行時間，對於行模式的DML來講通常不准。
• 05:當前登錄庫名的長度。
• 00 00:錯誤碼#
• 1a 00: status variables部分的長度16進制 '1a' 就是十進制26。

可變部分
• status variables:略。
• 77 6f 72 6c 64 00:當前登錄庫名world的ASCII編碼，以0x00結尾
• 42 45 47 49 4e:語句‘BEGIN’的ASCII編碼

中間有部分就是我們的status variables,這部分過於復雜我沒有仔細研究，因此我就沒做解析了

(5) 生成時機
•對於行模式的DML而言，生成時機是事務的第一個DML語句第一行需要修改的數據在 Innodb引擎層修改完成后。通常來講一個事務對應了一個QUERY_EVENT。
•對於DDL而言，生成時機是在整個操作執行完成之后。

二、MAP_EVENT
(1) MAP_EVENT的作用
MAP_EVENT是行模式特有的，它主要作用是做table id和實際訪問表的映射。
同時其中還包含了一些表的定義，如：表所在庫名、表名、字段類型、可變字段長度等。
這里的庫名和 QUERY_EVENT的庫名不一樣，這個庫名來自表的定義，
而QUERY_EVENT的庫名來自於當前登錄的數據庫即：thd->db。
固定部分，這部分是大小不變的。

table_id 6 bytes
Reserved 2 bytes
db len 可變
db name 可變
table len 可變
table name 可變
no of cols 可變
array of col types 可變
metadata len 可變
metadata block 可變
m_null_bits 可變

• table id: 6字節，這個table id和Innodb層的table id不一樣，他分配的時機是第一次打開表定義的時候。它不是固定的，重啟MySQL實例或
者 flush tables 都會導致其改變。

sql_base.cc
TABLE_SHARE *get_table_share(THD *thd, TABLE_LIST *table_list,
const char *key, size_t key_length,
uint db_flags, int *error,
my_hash_value_type hash_value)

assign_new_table_id(share);
share->table_map_id= last_table_id++; // 遞增table id
在LOCK_open的保護下分配一個新的表ID。 這樣做不是創建新的互斥鎖
並將其僅用於序列化對 靜態變量， 在此處分配表ID。我們將其分配給
共享，然后將其插入到 table_def_cache , table_cache_size 中確保沒有表就無法從緩存中讀取它
分配的ID。

20200826更新

如果 table_def_cache 比 參數 table_open_cache 大 , 從庫就會釋放緩存,
重新打開 table 重新生成 table id (sql_base.cc: table_def_init )

導致跟主庫生成 table id 不一樣 ( pfs_autosize.cc: estimate_hints )和 table id急劇增長 ,
最終主從漏掉某些表同步

sql_base.cc: process_found_table_share
......

while (table_def_cache->size() > table_def_size && oldest_unused_share->next)

table_def_cache->erase(to_string(oldest_unused_share->table_cache_key));

......

關注 show global status 的 Open_tables 和 Opened_tables

因此它是可變的。
• Reserved: 2字節，保留以后使用。
可變部分，這部分大小是可變的《
• dblen:表所在的數據庫名長度。
• db name:表所在的數據庫名以0x00結尾。
• table丨en:表名的長度。
• table name:表名以0x00結尾》
• no of cols:表中字段數。
• array of col types:字段的類型數組。
• metadata len: metadata block的長度v
• metadata block:對於可變字段需要記錄字段的長度，但對於像int這種數據類型就不需要了。下面是varchar關於可變長度的輸出，可以看到它占用2個字節：

field.cc
int Field_varstring::do_save_field_metadata(uchar *metadata_ptr)
{
DBUG_ASSERT(field_length <= 65535);
int2store((char*)metadata_ptr, field_length);
return 2;
}

如果感興趣可以査看do_save_field_metadata這個函數。
• m_null_bits:字段是否可以為空的一個位圖。下面是位圖的獲取方式。

現在將數據的大小設置為字段元數據數組的大小
加上1或3個字節（請參閱pack.c：net_store_length）以獲取
字段元數據數組中的元素。

log_event.cc
if (m_field_metadata_size < 251)
m_data_size+= m_field_metadata_size + 1; 
else
m_data_size+= m_field_metadata_size + 3;

memset(m_null_bits, 0, num_null_bytes);
for (unsigned int i= 0 ; i < m_table->s->fields ; ++i)
if (m_table->field[i]->maybe_null())
m_null_bits[(i / 8)]+= 1 << (i % 8);

(4) 實例解析
執行如下語句：

mysql> use test;
Database changed
mysql> create table test(id int(11) not null auto_increment,
-> a int(11) default null,
-> b int(11) default null,
-> primary key(id),
-> key idxa (a)
-> ,
-> key idxb (b)
-> ) engine=innodb auto_increment=10 default charset=utf8mb4;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into test values(100,100,100);
Query OK, 1 row affected (0.01 sec)

# at 634
#200206 23:27:49 server id 1 end_log_pos 683 CRC32 0xf235602e
# Position Timestamp Type Master ID Size Master Pos Flags
# 27a 75 30 3c 5e 13 01 00 00 00 31 00 00 00 ab 02 00 00 00 00
# 28d 65 01 00 00 00 00 01 00 04 74 65 73 74 00 04 74 |e........test..t|
# 29d 65 73 74 00 03 03 03 03 00 06 2e 60 35 f2 |est.........5.|
# Table_map: `test`.`test` mapped to number 357

這個‘Insert’語句的MAP_EVENT為：
• 65 01 00 00 00 00: table id為 16進制的0165,轉換為 10進制為 357
• 01 00:保留。
• 04:表所在的數據庫名長度為4
• 74 65 73 74 00: 'test'的ASCII表示，以0x00結尾
• 04:表名的長度為2。
• 65 73 74 00: 'test'的ASCII表示，以0x00結尾
• 03:表擁有3個字段
• 03 03 03:每個字段的類型都是03,實際上就是int這種類型了。具體可以參考 enum_field_types 這個枚舉類型
• 00: metadata長度為0,沒有可變字段
• 06:轉換為2進制為110,表示第一個字段不可以為空，其他兩個字段可以為空

(5) 生成時機
本Event只會在行模式下生成#生成時機是事務的每條DML語句條改的第一行數據在Innodb引擎層修改完成后，QUERY_EVENT生成之后。
通常來講每個語句的每個表（Table List)都會包含這樣一個MAP_EVENT。
前面我們說過了table_id是可變的，甚至會出現下面的情況。我們來構造這種情況，還是使用上面的 test.test表：

set global autocommit = off;

Session1:
begin;
insert into test.test values(1,2,3);

Session2:
flush tables;

Session1:
insert into test.test values(4,5,6);
commit;

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#200207 21:18:24 server id 1 end_log_pos 123 CRC32 0x0a0e6713 Start: binlog v 4, server v 5.7.19-enterprise-commercial-advanced-log created 200207 21:18:24
# at 123
#200207 21:18:24 server id 1 end_log_pos 194 CRC32 0x7fe14b87 Previous-GTIDs
# 12cfee78-e580-11e6-a790-00ff0593afce:17-21
# at 194
#200207 21:18:49 server id 1 end_log_pos 259 CRC32 0x99027597 GTID last_committed=0 sequence_number=1 rbr_only=no
SET @@SESSION.GTID_NEXT= '12cfee78-e580-11e6-a790-00ff0593afce:22'/*!*/;
# at 259
#200207 21:18:49 server id 1 end_log_pos 338 CRC32 0xe188ca4b Query thread_id=4 exec_time=0 error_code=0
use `test`/*!*/;
SET TIMESTAMP=1581081529/*!*/;
SET @@session.pseudo_thread_id=4/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
flush tables
/*!*/;
# at 338
#200207 21:19:00 server id 1 end_log_pos 403 CRC32 0xa13e5263 GTID last_committed=1 sequence_number=2 rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= '12cfee78-e580-11e6-a790-00ff0593afce:23'/*!*/;
# at 403
#200207 21:18:46 server id 1 end_log_pos 475 CRC32 0x4284ad54 Query thread_id=3 exec_time=0 error_code=0
SET TIMESTAMP=1581081526/*!*/;
BEGIN
/*!*/;
# at 475
#200207 21:18:46 server id 1 end_log_pos 524 CRC32 0xb34f27f8 Table_map: `test`.`test` mapped to number 360
# at 524
#200207 21:18:46 server id 1 end_log_pos 572 CRC32 0x8e20673a Write_rows: table id 360 flags: STMT_END_F
### INSERT INTO `test`.`test`
### SET
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2=2 /* INT meta=0 nullable=1 is_null=0 */
### @3=3 /* INT meta=0 nullable=1 is_null=0 */
# at 572
#200207 21:18:58 server id 1 end_log_pos 621 CRC32 0xa4f321aa Table_map: `test`.`test` mapped to number 361
# at 621
#200207 21:18:58 server id 1 end_log_pos 669 CRC32 0x3108b1d9 Write_rows: table id 361 flags: STMT_END_F
### INSERT INTO `test`.`test`
### SET
### @1=4 /* INT meta=0 nullable=0 is_null=0 */
### @2=5 /* INT meta=0 nullable=1 is_null=0 */
### @3=6 /* INT meta=0 nullable=1 is_null=0 */
# at 669
#200207 21:19:00 server id 1 end_log_pos 700 CRC32 0xc7a499f7 Xid = 21
COMMIT/*!*/;
# at 700
#200207 21:19:14 server id 1 end_log_pos 723 CRC32 0x0357cffb Stop
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

一個事務內相同的表不同的 table_id ,

# at 475
#200207 21:18:46 server id 1 end_log_pos 524 CRC32 0xb34f27f8 Table_map: `test`.`test` mapped to number 360

# at 572
#200207 21:18:58 server id 1 end_log_pos 621 CRC32 0xa4f321aa Table_map: `test`.`test` mapped to number 361

Seconds_Behind_Master作為判斷主從同步的一個重要它的計算方法。
同時討論有哪些需要注意的地方。這里還會說明Seconds_Behind_Master為0並不一定代表沒有延遲的原因。

一、 Seconds_Behind_Master計算的方式
當每次發起‘show slave status’命令的時候都會進行一次Seconds_Behind_Master的計算，
其計算的方式集中在函數show_slave_status_send_data中，下面是一段偽代碼，非常清晰：

rpl_slave.cc
/*
The pseudo code to compute Seconds_Behind_Master:
if (SQL thread is running)
{
if (SQL thread processed all the available relay log)
{ // sql線呈已經 apply 完所有 I/O線程寫入的Event
if (IO thread is running)
print 0;    // 顯示延遲為0
else
print NULL;
}
else
compute Seconds_Behind_Master;
}
else
print NULL;
*/

正式代碼如下,這里有一點需要注慂就是如何判斷SQL線程應用完了所有的Event的，下面這部 分實際的代碼：


bool show_slave_status_send_data(THD *thd, Master_info *mi,
char* io_gtid_set_buffer,
char* sql_gtid_set_buffer)
...
if ((mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()) &&
(!strcmp(mi->get_master_log_name(), mi->rli->get_group_master_log_name())))
{
if (mi->slave_running == MYSQL_SLAVE_RUN_CONNECT)
protocol->store(0LL);
else
protocol->store_null();
}
else
{
long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp)
- mi->clock_diff_with_master);

條件“mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()”
就是通過 10 線程讀取到主庫binary log的位罝和SQL線程應用到的主庫binary log位罝進行比較來進行判斷。
如果主從之間的網絡狀態很糟糕的話，
從庫的SQL線程的應用Event的速度可能比10線程讀取Event的速度更快。
那么就會出現一種情況，雖然SQL線程應用完了所有的Event，
並且 Seconds_Behind_Master也顯示為0,但是並不代表沒有延遲，
這個時候的延遲主要在於10線程讀取Event過慢。
這也是如果Seconds_Behind_Master為0不代表沒有延遲的第一個原因。

二、影晌 Seconds_Behind_Master 計算的因素
一般看到的正常Seconds_Behind_Master是怎么計薄出來的呢？實際上也很簡單源碼如下：
分別解釋一下這三個因素：
(1) (long)(time(0)：取當前從庫服務器的系統時間。

(2) mi->clock_diff_with_master
這個值是從庫服務器的系統時間和主庫服務器的系統時間的差值。
實際上它只在10線程啟動的時候進行一次性計算，
就是如果啟動 I/O 線程后人為的修改從庫服務器的時間，
那么Seconds_Behind_Master的計算必然出現問題。更有可能出現負數，
如果出現負數延遲依然顯示為0,如下：

if (mi->rli->slave_running)
{
/*
Check if SQL thread is at the end of relay log
Checking should be done using two conditions
condition1: compare the log positions and
condition2: compare the file names (to handle rotation case)
*/
if ((mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()) &&
(!strcmp(mi->get_master_log_name(), mi->rli->get_group_master_log_name())))
{
if (mi->slave_running == MYSQL_SLAVE_RUN_CONNECT)
protocol->store(0LL);
else
protocol->store_null();
}
else
{
long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp)
- mi->clock_diff_with_master);
/*
Apparently on some systems time_diff can be <0. Here are possible
reasons related to MySQL:
- the master is itself a slave of another master whose time is ahead.
- somebody used an explicit SET TIMESTAMP on the master.
Possible reason related to granularity-to-second of time functions
(nothing to do with MySQL), which can explain a value of -1:
assume the master's and slave's time are perfectly synchronized, and
that at slave's connection time, when the master's timestamp is read,
it is at the very end of second 1, and (a very short time later) when
the slave's timestamp is read it is at the very beginning of second
2. Then the recorded value for master is 1 and the recorded value for
slave is 2. At SHOW SLAVE STATUS time, assume that the difference
between timestamp of slave and rli->last_master_timestamp is 0
(i.e. they are in the same second), then we get 0-(2-1)=-1 as a result.
This confuses users, so we don't go below 0: hence the max().

last_master_timestamp == 0 (an "impossible" timestamp 1970) is a
special marker to say "consider we have caught up".
*/
protocol->store((longlong)(mi->rli->last_master_timestamp ?
max(0L, time_diff) : 0));
}

// max(0L,time_diff) 如果time_diff為負數取0

因此主從服務器最好保證時鍾同步。
這也是Seconds_Behind_Master顯示0並不代表沒有延遲的第二個原因。

(3) mi->rli->last_master_timestamp
這個值的取值就比較復雜，因為DML和DDL下是不同的。
如果是DML在單SQL線程和MTS下又不一樣。分開分析。

• DML (單SQL線程)
這種情況下前面已經說過了，SQL線程每次執行Event的時候會獲取Event的相關信息如下：

rpl_slave.cc

// event header 的timestamp

static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{
DBUG_ENTER("exec_relay_log_event");

/*
We acquire this mutex since we need it for all operations except
event execution. But we will release it in places where we will
wait for something for example inside of next_event().
*/
mysql_mutex_lock(&rli->data_lock);

/*
UNTIL_SQL_AFTER_GTIDS, UNTIL_MASTER_POS and UNTIL_RELAY_POS require
special handling since we have to check whether the until_condition is
satisfied *before* the SQL threads goes on a wait inside next_event()
for the relay log to grow.
This is required in the following case: We have already applied the last
event in the waiting set, but the relay log ends after this event. Then it
is not enough to check the condition in next_event; we also have to check
it here, before going to sleep. Otherwise, if no updates were coming from
the master, we would sleep forever despite having reached the required
position.
*/
if ((rli->until_condition == Relay_log_info::UNTIL_SQL_AFTER_GTIDS ||
rli->until_condition == Relay_log_info::UNTIL_MASTER_POS ||
rli->until_condition == Relay_log_info::UNTIL_RELAY_POS ||
rli->until_condition == Relay_log_info::UNTIL_SQL_VIEW_ID) &&
rli->is_until_satisfied(thd, NULL))
{
rli->abort_slave= 1;
mysql_mutex_unlock(&rli->data_lock);
DBUG_RETURN(1);
}

Log_event *ev = next_event(rli), **ptr_ev;

DBUG_ASSERT(rli->info_thd==thd);

if (sql_slave_killed(thd,rli))
{
mysql_mutex_unlock(&rli->data_lock);
delete ev;
DBUG_RETURN(1);
}
if (ev)
{
enum enum_slave_apply_event_and_update_pos_retval exec_res;

ptr_ev= &ev;
/*
Even if we don't execute this event, we keep the master timestamp,
so that seconds behind master shows correct delta (there are events
that are not replayed, so we keep falling behind).

If it is an artificial event, or a relay log event (IO thread generated
event) or ev->when is set to 0, or a FD from master, or a heartbeat
event with server_id '0' then we don't update the last_master_timestamp.

In case of parallel execution last_master_timestamp is only updated when
a job is taken out of GAQ. Thus when last_master_timestamp is 0 (which
indicates that GAQ is empty, all slave workers are waiting for events from
the Coordinator), we need to initialize it with a timestamp from the first
event to be executed in parallel.
*/
if ((!rli->is_parallel_exec() || rli->last_master_timestamp == 0) &&
!(ev->is_artificial_event() || ev->is_relay_log_event() ||
(ev->common_header->when.tv_sec == 0) ||
ev->get_type_code() == binary_log::FORMAT_DESCRIPTION_EVENT ||
ev->server_id == 0))
{
rli->last_master_timestamp= ev->common_header->when.tv_sec +
(time_t) ev->exec_time;
DBUG_ASSERT(rli->last_master_timestamp >= 0);
}

// query event 才有的執行時間

實際上因為行格式的binary log (本系只討論行格式），DML中QUERY_EVENT的ev->exec_time基本為0因此可以忽略，
因為QUERY_EVENT中的exec_time只記錄第一條數據更改消耗的時間，且一般看到的是‘BEGIN’，
所以 last_master_timestamp就基本等於各個Event中header的timestamp。
但是對於一個事務而言 GTID_EVENT 和 XID_EVENT 都是提交時刻的時間，
而對於其他Event都是命令發起時刻的時間（顯示“begin commit”開啟事務的情況）。
因此如果一個長時間未提交的事務在SQL線程應用的時候可能觀察到Seconds_Behind_Master 的瞬間跳動。

以一個DML語句為例，假設主庫10分鍾后才提交這個事務，
那么可能延遲計算如下（T1為語句執行時間，T1+10分鍾為事務提交的時間，
T2為從庫系統時間-主和從庫的時間差，
T2每次執查詢Seconds_Behind_Master可變，因為從庫系統時間可變）：
這種情況就能看到突然延遲跳到很大然后一下恢復正常，比如例子中為10分鍾。

• DML (MTS)
實際上rli->last_master_timestamp會取值為檢查點位罝事務XID_EVENT的timestamp,流程大概如下：
因此last_master_timestamp就是檢查點位罝事務的XID_EVENT header中的timestamp。
如果不開啟‘slave_preserve_commit_order’參數，那么就可能出現‘gap’，

例如大事務在從庫apply很慢，小事務已經很快完成了apply，這個大事務形成gap導致主從數據不一致
檢查點只能停留在‘gap’之前的一個事務，可能后面的事務已經執行完成了。
這種情況下延遲並不是那么准確但是誤差也不大，
因為默認slave_checkpoint_period參數設罝為300毫秒。
關於‘gap’后面可以調高參數slave_checkpoint_period參數，
以此來證明MTS延遲的計算是以檢查點為准的。

一、 Seconds__Behind_Masteri十算的方式
當每次發起‘show slave status’命令的時候都會進行一次Seconds_Behind_Master的計算，
其計算的方式集中在函數show_slave_status_send_data中，下面是一段偽代碼，非常清晰：

rpl_slave.cc
/*
The pseudo code to compute Seconds_Behind_Master:
if (SQL thread is running)
{
if (SQL thread processed all the available relay log)
{ // sql線呈已經 apply 完所有 I/O線程寫入的Event
if (IO thread is running)
print 0;    // 顯示延遲為0
else
print NULL;
}
else
compute Seconds_Behind_Master;
}
else
print NULL;
*/

正式代碼如下,這里有一點需要注慂就是如何判斷SQL線程應用完了所有的Event的，下面我們可以看看這部 分實際的代碼：


bool show_slave_status_send_data(THD *thd, Master_info *mi,
char* io_gtid_set_buffer,
char* sql_gtid_set_buffer)
...
if ((mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()) &&
(!strcmp(mi->get_master_log_name(), mi->rli->get_group_master_log_name())))
{
if (mi->slave_running == MYSQL_SLAVE_RUN_CONNECT)
protocol->store(0LL);
else
protocol->store_null();
}
else
{
long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp)
- mi->clock_diff_with_master);

二、影晌 Seconds_Behind_Master 計算的因素
那么我們一般看到的正常Seconds_Behind_Master是怎么計薄出來的呢？實際上也很簡單源碼如下：
我們來分別解釋一下這三個因素：
(1) (long)(time(0)：
取當前從庫服務器的系統時間。

(2) mi->clock_diff_with_master
這個值是從庫服務器的系統時間和主庫服務器的系統時間的差值。
實際上它只在10線程啟動的時候進行一次性計算，
言外之意就是如果我們啟動 I/O 線程后人為的修改從庫服務器的時間，
那么Seconds_Behind_Master的計算必然出現問題。更有可能出現負數，
如果出現負數延遲依然顯示為0,如下：

protocol->store((longlong)(mi->rli->last_master_timestamp ?
max(0L, time_diff) : 0));
}
}
// max(0L,time_diff) 如果time_diff為負數取0

因此主從服務器最好保證時鍾同步。
這也是Seconds_Behind_Master顯示0並不代表沒有延遲的第二個原因。

(3) mi->rli->last_master_timestamp
這個值的取值就比較復雜，因為DML和DDL下是不同的。
如果是DML在單SQL線程和MTS下又不一樣。因此我們分幵討論。

• DML (單SQL線程)
這種情況下前面我們已經說過了，SQL線程每次執行Event的時候會獲取Event的相關信息如下：

rpl_slave.cc
static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{
....
// event header 的timestamp
rli->last_master_timestamp= ev->common_header->when.tv_sec +
(time_t) ev->exec_time;

// query event 才有的執行時間

實際上因為行格式的binary log (本系只討論行格式），DML中QUERY_EVENT的ev->exec_time基本為0因此可以忽略，
因為QUERY_EVENT中的exec_time只記錄第一條數據更改消耗的時間，且我們一般看到的是‘BEGIN’，
所以 last_master_timestamp就基本等於各個Event中header的timestamp。
但是對於一個事務而言我們知道GTID_EVENT和XID_EVENT都是提交時刻的時間，
而對於其他Event都是命令發起時刻的時間（顯示“begin commit”開啟事務的情況）。
因此如果一個長時間未提交的事務在SQL線程應用的時候可能觀察到Seconds_Behind_Master 的瞬間跳動。

我們以一個‘Delete’語句為例，假設我們主庫10分鍾后才提交這個事務，
那么可能延遲計算如下（T1為語句執行時間，T1+10為事務提交的時間，
T2為從庫系統時間-主從時間差，
T2每次執查詢Seconds_Behind_Master可變，因為從庫系統時間可變）：
這種情況我就能看到突然延遲跳到很大然后一下恢復正常，比如例子中為10分鍾。

• DML (MTS)
實際上rli->last_master_timestamp會取值為 checkpoint 位罝事務XID_EVENT的timestamp,流程大概如下：

log_event.cc
Slave_worker *Log_event::get_slave_worker(Relay_log_info *rli)
{

....
1. 工作線程執行到事務XID_EVENT時,取XID_EVENT的timestamp
ptr_group->checkpoint_seqno= rli->checkpoint_seqno;
ptr_group->ts= common_header->when.tv_sec + (time_t) exec_time; // Seconds_behind_master related
...

2. 當進行 checkpoint 的時候設置變量ts , 調用函數 mts_checkpoint_routine

GAQ是什么?

https://dev.mysql.com/doc/dev/mysql-server/8.0.15/classSlave__committed__queue.html#details

bool mts_checkpoint_routine(Relay_log_info *rli, ulonglong period,
bool force, bool need_data_lock)

/*
Update the rli->last_master_timestamp for reporting correct Seconds_behind_master.

If GAQ is empty, set it to zero.
Else, update it with the timestamp of the first job of the Slave_job_queue
which was assigned in the Log_event::get_slave_worker() function.
*/
ts= rli->gaq->empty()
? 0
: reinterpret_cast<Slave_job_group*>(rli->gaq->head_queue())->ts;
rli->reset_notified_checkpoint(cnt, ts, need_data_lock, true);
/* e

3. 修改 last_master_timestamp 值
rpl_rli.cc
void Relay_log_info::reset_notified_checkpoint(ulong shift, time_t new_ts,
bool need_data_lock,
bool update_timestamp)
{

...

if (update_timestamp)
{
if (need_data_lock)
mysql_mutex_lock(&data_lock);
else
mysql_mutex_assert_owner(&data_lock);
last_master_timestamp= new_ts;
if (need_data_lock)
mysql_mutex_unlock(&data_lock);
}
}

因此last_master_timestamp就是檢查點位罝事務的XID_EVENT header中的timestamp。
如果不開啟‘slave_preserve_commit_order’參數，那么就可能出現‘gap’，
檢查點只能停留在‘gap’之前的一個事務，可能后面的事務已經執行完成了。
這種情況下延遲並不是那么准確但是誤差也不大，
因為默認slave_checkpoint_period參數設罝為300毫秒。
關於‘gap’后面可以調高參數slave_checkpoint_period參數，
以此來證明MTS延遲的計算是以檢查點為准的。

• DDL
同樣是上面 DML (單SQL線程) 的公式：

rpl_slave.cc
static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{
....
// event header 的timestamp
rli->last_master_timestamp= ev->common_header->when.tv_sec +
(time_t) ev->exec_time;

// query event 才有的執行時間

實際上DML和DDL計 算延遲的方式區別就在於這里的exec_time是否參與運算。
DDL記錄Binary log中QUERY_EVENT記錄的是實際的語句,exec_time就會記錄整個語句執行完成所消耗的時間

三、 Seconds_Behind_Masteri十算方式總結

四、 MTS中Seconds_Behind_Masteri十算誤差的測試
證明slave_checkpoint_period( 默認 300ms) 在MTS中的Seconds_Behind_Master計算和檢查點有關。
從庫操作如下：

set slave_checkpoint_period = 60000;
stop slave;
start slave;

然后在主庫做手動做幾個事務。然后查看Executed_Gtid_Set如下：

insert * 3;

show master status \G

檢查主庫和從庫的Executed_Gtid_Set 發現這些事務早就已經在從庫做完了

用上面的公式來套的話就是mi->rli->last_master_timestamp甶於MTS沒有做檢查點而一直不變化，
直到60秒后才做了一次檢查點，mi->rli->last_master_timestamp信息才得到了更新。
從庫show slave status,會發現Seconds_Behind_Master—直增長到60, 下面是延遲到了 60的截圖：

下面是變為0的截圖：

但是這種問題一般不需要擔心，
因為默認情況下參數slave_checkpoint_period為300毫秒，檢查點非常頻繁。

slave_checkpoint_period 在代碼里面實際為 mts_checkpoint_period 這個 ,

涉及源碼: rpl_slave.cc: static Log_event* next_event(Relay_log_info* rli) ,

rpl_slave.cc: mts_checkpoint_routine 處理"rli->gaq" 找出恢復表中的低水位標記（lwm）。

GAQ的長度 == slave_checkpoint_group == rli->checkpoint_group ,

slave_checkpoint_period 性能隨着 slave_checkpoint_group 越大 SQL thread 循環檢查 "檢查點" 是否需要執行次數越多 ,

slave_checkpoint_group 設置比較小會導致頻繁 SQL thread 循環檢查 "檢查點"

設置適當一個值需要持續觀察

五、手動修改從庫系統時間，導致Seconds_Behind_Master為0
這里再來測試一下手動修改從庫系統時間的情況。
只需要在主庫做一個大操作，
導致從庫有延遲。已經做了一個大表的DDL,當前從庫的延遲如下，
從圖中可以看到gno為31的DDL操作還沒有執行完成，導致延遲，如下：

然后我們將從庫系統時間改小，如下：

修改系統時間
# date
# date -s 01:00:00

然后再次查看延遲， Seconds_Behind_Master為0,但是gno為31的DDL操作還是沒有執行完成，如下：

從庫數據的查找和參數slave_rows_search_algorithms

對於DML語句來講其數據的更改將被放到對應的Event中。
比如‘Delete’語句會將所有刪除數據的before Image放到 DELETE_ROWS_EVENT 中，
從庫只要讀取這些beforejmage進行數據查找，
然后調用相應的‘Delete’的操作就可以完成數據的刪除了。
下面從庫是如何進行數據查找的。
假定參數binlog_row_image設罝為‘FULL’也就是默認值，
binlog_row_image設置minimal 在不修改非空唯一索引和主鍵時候的字段時候 binlog是不包含非空唯一索引和主鍵的字段

一、沒有Primary Key 和唯一索引有索引的案例
在開始之前我們先假定參數‘slave_rows_search_algorithms’為默認值，即：
• TABLE_SCAN，INDEX_SCAN
因為這個參數會直接影響到對索引的利用方式。
以‘Delete’操作為例，實際上對於索引的選擇‘Update’操作也是一樣的，
因為都是通過before Image去查找數據。我測試的表結構、數據和操作如下：

mysql>
create Table: CREATE TABLE 'aaa' (
'a' int(ll) DEFAULT NULL,
'b' int(ll) DEFAULT NULL,
'c' int(ll) DEFAULT NULL,
KEY 'a、（、a、）
)ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set C0.00 sec)

mysql> delete from aaa where a=1;
Query OK, 3 rows affected (6.24 sec)

對於這樣一個‘Delete’和'Update'語句來講主庫會利用到索引KEY a ,
刪除的三條數據我們實際上只需要一次索引的定位，
然后順序掃描接下來的數據進行刪除就可以了。大概的流程和代碼如下：

在沒有唯一索引和Primary Key循環:
索引定位(申請 latch(空閑塊和讀取IO帶寬優先) , B樹的非葉子級別上使用修改的搜索模式 ,B樹葉子中使用原始的搜索模式 ) -> 回表 -> 刪除/修改該行 -> 訪問下一行 -> 回表 -> 刪除/修改該行

索引定位

btr0cur.cc
void
btr_cur_search_to_nth_level(
/*========================*/
dict_index_t*    index,    /*!< in: index */

這條數據刪除/修改數據的 before_Image 將會記錄到一個DELETE/UPDATE_ROWS_EVENT中。
從庫優先使用主鍵和唯一鍵，評估使用哪個索引，
在Event中的每條數據需要進行索引定位數據，
在非唯一索引情況第一次返回的第一行數據不刪除數據，
繼續掃描下一行

如下代碼：

log_event.cc
int Rows_log_event::do_index_scan_and_update(Relay_log_info const *rli)
...
while (record_compare(m_table, &m_cols))
{
while((error= next_record_scan(false)))
{
/* We just skip records that has already been deleted */
if (error == HA_ERR_RECORD_DELETED)
continue;
DBUG_PRINT("info",("no record matching the given row found"));
goto end;
}
}

在沒有主鍵和唯一鍵時候從庫I/O的消耗代價是比主庫更高的，
因此依舊使用的是索引KEY a ，
流程如下：

索引定位(申請 latch(空閑塊和讀取IO帶寬優先) , B樹的非葉子級別上使用修改的搜索模式 ,B樹葉子中使用原始的搜索模式 ) -> 回表 -> 刪除/修改該行 -> 訪問下一行 -> 回表 -> 刪除/修改該行

在從庫增加一個主鍵，
那么在從庫進行應用的時候流程如下:

循環：
主鍵定位 -> 刪除本行

log_event.cc
int Rows_log_event::do_index_scan_and_update(Relay_log_info const *rli)

...

if (m_key_info->flags & HA_NOSAME || m_key_index == m_table->s->primary_key)
{
/* Unique does not have non nullable part */
if (!(m_key_info->flags & (HA_NULL_PART_KEY)))
goto end; // record found
else
{
/*
Unique has nullable part. We need to check if there is any field in the
BI image that is null and part of UNNI.
*/
bool null_found= FALSE;
for (uint i=0; i < m_key_info->user_defined_key_parts && !null_found; i++)
{
uint fieldnr= m_key_info->key_part[i].fieldnr - 1;
Field **f= m_table->field+fieldnr;
null_found= (*f)->is_null();
}

if (!null_found)
goto end; // record found

/* else fall through to index scan */
}
}

主庫‘Delete/Update’操作和從庫‘Delete/Update’操作主要的區別在於：
• 從庫每條數據都需要索引定位查找數據。
• 從庫在非唯一索引情況下查找第一條數據並不立刻刪除/修改數據，而繼續進行索引定位和查找。

對於主庫來只需要一次數據定位查找即可，
接下來訪問下一條數據就可以了。
刪除/修改操作來講並沒有區別。
合理的使用主鍵和睢一鍵可以將上面提到的兩點影晌延遲性能降低。
在從庫延遲的時候，
沒有合理的使用主鍵和唯一鍵是非常重要的原因。
最后如果表上一個索引都沒有，情況變得更加嚴重，簡單的圖如下：

循環：
全表掃描 -> 刪除本行

每一行數據的更改都需要進行全表掃描，
性能問題非常嚴重。
這種情況使用 slave_rows_search_algorithms 的HASH_SCAN代替TABLE_SCAN 在特定情況下也許可以提高性能，

二、確認查找數據的方式
slave_rows_search_algorithms 參數主要用於確認如何查找數據。
其取值可以是下面幾個組合（來自官方文檔），源碼中體現為一個位圖：

• TABLE_SCAN,INDEX__SCAN (默認值）
• INDEX_SCAN,HASH_SCAN
• TABLE_SCAN,HASH_SCAN
• TABLE_SCANJNDEX_SCAN,HASH_SCAN

在源碼中有如下的說明，當然官方文檔也有類似的說明：

log_event.cc
int
Rows_log_event::row_operations_scan_and_key_setup()
....

根據 ROW_LOOKUP_HASH_SCAN ， ROW_LOOKUP_INDEX_SCAN ，ROW_LOOKUP_TABLE_SCAN 調用

log_event.cc
void
Rows_log_event::decide_row_lookup_algorithm_and_key()
...

/*
Decision table:
- I --> Index scan / search
- T --> Table scan
- Hi --> Hash over index
- Ht --> Hash over the entire table

|--------------+-----------+------+------+------|
| Index\Option | I , T , H | I, T | I, H | T, H |
|--------------+-----------+------+------+------|
| PK / UK | I | I | I | Hi |
| K | Hi | I | Hi | Hi |
| No Index | Ht | T | Ht | Ht |
|--------------+-----------+------+------+------|

*/

源碼中會有三種數據查找的方式，分別是：

• R0W_L00KUP_INDEX_SCAN
對應函數接口：log_event.cc Rows_Log_event::do_index_scan_and_update

• R0W_L00KUP_HASH_SCAN
對應函數接口： log_event.cc Rows_Log_event::do_hash_scan_and_update 包含：
(1) Hi --> Hash over index
(2) Ht --> Hash over the entire table
• R0W_L00KUP_TABLE_SCAN

對應函數接口：

log_event.cc
int Rows_log_event::do_apply_event(Relay_log_info const *rli) 在源碼中如下：

...
if ((m_rows_lookup_algorithm != ROW_LOOKUP_NOT_NEEDED) &&
!is_any_column_signaled_for_table(table, &m_cols))
{
error= HA_ERR_END_OF_FILE;
goto AFTER_MAIN_EXEC_ROW_LOOP;
}
switch (m_rows_lookup_algorithm)
{
case ROW_LOOKUP_HASH_SCAN:
do_apply_row_ptr= &Rows_log_event::do_hash_scan_and_update;
break;

case ROW_LOOKUP_INDEX_SCAN:
do_apply_row_ptr= &Rows_log_event::do_index_scan_and_update;
break;

case ROW_LOOKUP_TABLE_SCAN:
do_apply_row_ptr= &Rows_log_event::do_table_scan_and_update;
break;

case ROW_LOOKUP_NOT_NEEDED:
DBUG_ASSERT(get_general_type_code() == binary_log::WRITE_ROWS_EVENT);

/* No need to scan for rows, just apply it */
do_apply_row_ptr= &Rows_log_event::do_apply_row;
break;

default:
DBUG_ASSERT(0);
error= 1;
goto AFTER_MAIN_EXEC_ROW_LOOP;
break;
}

slave_rows_search_algorithms 參數不能決定使用哪個索引。
下面這個圖就是決定的過程，可以參考函數 log_event.cc : void Rows_log_event::decide_row_lookup_algorithm_and_key()

ROW_LOOKUP_HASH_SCAN方式的數據查找
總的來講這種方式和 R0W_L00KUPJNDEX_SCAN 和 ROW_LOOKUP_TABLE_SCAN 都不同，
它是通過表中的數據和Event中的數據進行比對，
而不是通過Event中的數據和表中的數據進行比對。

將參數‘slave_rows_search_algorithms’設罝為叫 INDEX_SCAN,HASH_SCAN 且表上沒有主鍵和唯一鍵的話，
把數據查找的方式設罝為 ROW_LOOKUP_HASH_SCAN。

在ROW_LOOKUP_HASH_SCAN 兩種數據查找的方式：
• Hi ——> Hash over index
• Ht 一-> Hash over the entire table

判斷 slave_rows_search_algorithms 使用方式
log_event.cc void Rows_log_event::decide_row_lookup_algorithm_and_key()
....
TABLE_OR_INDEX_HASH_SCAN:
....

1. 有索引情況下: 將binlog DELETE/UPDATE Event使用 Hi方式將每一行數據 HASH 存入到 set結構中( log_event.cc , Rows_log_event::do_hash_row )，
然后對set進行索引定位掃描，每行數據查詢對 set結構中的數據進行比對。重復數據越少，索引定位查找越多；重復數據越多，set元素減少，減少索引查找定位開銷。

2. 沒有索引情況下: 將binlog DELETE/UPDATE Event使用Ht 方式全表掃描，
一個Event最大8K，使用Ht會進行每個Event進行一次全表掃描（原來是每行數據進行一次算表掃描）

考慮另外一種情況，如果每條delete/update語句一次只刪除一行數據而不是delete/update—條語句大量的數據，
這種情況每個DELETE/UPDATE_ROWS_EVENT只有一條數據存在，
那么使用 ROW_LOOKUP_HASH_SCAN方式並不會提高性能，
因為這條數據還是需要進行一次全表掃描或者索引定位才能查找到數據，
和默認TABLE_SCAN方式沒什么區別。

整個過程參考如下接口：
• Rows_log_event::do_apply_event: 判斷Hash Scan, Index Scan, Table Scan 入口
• Rows_Log_event::do_hash_scan_and_update: 總接口，調用下面兩個接口。
• Rows_Log_event::do_hash_row: 將數據加入到hash結構，如果有索引還需要維護集合(set)
• Rows_log_event::do_scan_and_update: 查找並且進行刪除操作，會調用 Rows_Log_event::next_record_scan 進行數據查找。
• Rows_Log_event::next_record_scan: 具體的查找方式實現了Hi --> Hash over index 和Ht --> Hash over the entire table的查找方式

還是用Delete的列子，刪除了三條數據，
因此DELETE_ROW_EVENT中包含了三條數據。
假設參數 slave_rows_search_algorithms 設置為 INDEX_SCAN，HASH_SCAN。
表中沒有主鍵和唯一鍵，因此會最終使用 ROW_LOOKUP_HASH_SCAN進行數據查找。
但是因為有一個索引key a,因此會使用到 Hi—> Hash over index。為了更好的描述Hi和Ht兩種方式，
假定另一種情況是表上一個索引都沒有，
將兩種方式放到一個圖中方便大家發現不同點，
如下圖：

總結
主庫沒有Primary Key如果在從庫建立一個Primary Key能降低延遲嗎？
肯定的，因為從庫會根據Event中的行數據進行使用索引的選擇。

• slave_rows_search_algorithms參數設罝了 HASH_SCAN並不一定會提高性能，只有滿足如下兩個條件才會提高性能：
(1) (表中沒有任何索引）或者（有索引且一個事務中的update/delete的數據關鍵字重復值較多）。
(2) —個update/delete語句刪除了大星的數據，
形成了很多個8K左右的 UPDATE_ROW_EVENT/DELETE_ROW_EVENT。
update/delete語句只修改少量的數據（比如每個語句修改一行數據）並不能提高性能。

•從庫索引的利用是自行判斷的，順序為主鍵-> 唯一鍵-> 普通索引。
•如果slave_rows_search_algorithms參數沒有設罝HASH_SCAN，
並且沒有主鍵/睢一鍵那么性能將會急劇下降造成延遲。
如果連索引都沒有那么這個情況更加嚴重，
因為更改的每一行數據都會引發一次全表掃描。

因為 mi->clock__diff_with_master 只會在I0線程啟動的時候進行初始化，

如果手動改小從庫服務器系統的時間，那么公式中的(long)(time(0))將變小，
如果整個計算結果為負數那么Seconds_Behind_Master將顯示為0。
因此我們需要保證主從服務器的系統時間同步才行。

有了上面的分析我們應該清楚的知道在GTID AUTCLPOSITION MODE模式下，
應該通過比較主庫和從庫的Executed_Gtld_Set來確保沒有主從延遲，
因為比較 Seconds_Behind_Master是否為0並不一定靠譜。

從庫的SQL線程（MTS協調線程）和參數 sql_slave_skip_counter

總的說來SQL線程有以下三個主要功能：
1. 讀取relay log中的Event。
2. 應用這些讀取到的Event,將修改作用於從庫。
3. 如果是MTS通常情況下則不會應用Event，SQL線程會蛻變為協調線程，分發Event給工作線程。

一、流程圖

二、重要步驟說明
1、如果是MTS需要啟動工作線程
每次重新啟動SQL線程（MTS協調線程）的時候 slave_parallel_workers 參數才會生效。

2、檢查relay_log_info參數
在GTID AUTO_POSITION MODE模式下和POSITION MODE模式下這個參數設罝非常重要參數 relay_log_info = TABLE ,
啟動成功的關鍵就是slave_relay_log_info信息的正確性

如果沒有設置為TABLE, MySQL就會出現Warning：
Slave SQL for channel If a crash happens this configuration does not guarantee that the relay log info will be consistent, Error_code: 0

3、狀態reading event from the relay log
開始讀取Event的時候會進入狀態：‘reading event from the relay log’。

4、在MTS模式下判定是否需要進行MTS檢查點
每讀取一個Event都會去判斷是否需要進行MTS的檢查點，其條件有兩個：
• MTS的檢查點超過參數 slave_checkpoint_period 的設置
• GAQ隊列已滿

源碼中如下：
是否超過檢査點周期

rpl_slave.cc

處理rli-> gaq以找出存儲在集中恢復表中的 lwm 位置。
bool mts_checkpoint_routine(Relay_log_info *rli, ulonglong period,
bool force, bool need_data_lock)

set_timespec_nsec(&curr_clock, 0);
ulonglong diff= diff_timespec(&curr_clock, &rli->last_clock);
if (!force && diff < period)
{
/*
We do not need to execute the checkpoint now because
the time elapsed is not enough.
*/
DBUG_RETURN(FALSE);
}

rpl_slave.cc
從中繼日志中讀取下一個事件。應該從從SQL線程中調用。
static Log_event* next_event(Relay_log_info* rli)

是否已經GAQ隊列已經滿了
//如果達到了 GAQ的大小設貿為force強制checkpoint
bool force= (rli->checkpoint_seqno > (rli->checkpoint_group - 1));

5、判斷是否需要切換和清理relay log
事務是連續性,即使relay log 和binary log 當前滿了也不能切換.
當recovery_relay_log = 0時候，可以打破事務連續性這個硬性條件
例如在GTID AUTO_POSITION MODE模式同時從庫異常重啟情況下relay log會產生partial transaction，
並且啟動后恢復時候會造成額外回滾操作。
在POSITION MODE模式下繼續發送事務余下部分的Event到當前relay log

源碼注釋如下，‘in a group’就代表在一個事務中：

/* 
If we have reached the limit of the relay space and we
are going to sleep, waiting for more events:

1. If outside a group, SQL thread asks the IO thread 
to force a rotation so that the SQL thread purges 
logs next time it processes an event (thus space is
freed).

2. If in a group, SQL thread asks the IO thread to 
ignore the limit and queues yet one more event 
so that the SQL thread finishes the group and 
is are able to rotate and purge sometime soon.
*/


rpl_slave.cc
relay_log_purge=ON, 非當前relay log讀取完了的Event會進入清理流程。

if (relay_log_purge)    //做relay log的清理
{

if (rli->relay_log.purge_first_log
(rli,
rli->get_group_relay_log_pos() == rli->get_event_relay_log_pos()
&& !strcmp(rli->get_group_relay_log_name(),rli->get_event_relay_log_name())))
//做relay log的清理
{
errmsg = "Error purging processed logs";
goto err;
}
DBUG_PRINT("info", ("next_event group master %s %lu group relay %s %lu event %s %lu\n",
rli->get_group_master_log_name(),
(ulong) rli->get_group_master_log_pos(),
rli->get_group_relay_log_name(),
(ulong) rli->get_group_relay_log_pos(),
rli->get_event_relay_log_name(),
(ulong) rli->get_event_relay_log_pos()));
}
else
{
// 錯誤時候，鎖定relay log
if (rli->relay_log.find_next_log(&rli->linfo, !hot_log))
{
errmsg = "error switching to the next log";
goto err;
}
rli->set_event_relay_log_pos(BIN_LOG_HEADER_SIZE);
rli->set_event_relay_log_name(rli->linfo.log_file_name);

rli->flush_info();
}

如果是讀取完的是當前relay log,則不能進行清理，
會等待10線程的喚醒，
如果是MTS等待喚醒期間還需要進行MTS的檢查點。

6、獲取last_master_timestamp
在單SQL線程下計算Seconds_Behind_Master其中最重要的因素就來自這里。
但是對於MTS而言，last_master_timestamp來自檢查點指向事務的XID_Event的時間，
因此MTS計算Seconds_Behind_Master的方法是不同。源碼如下：

rpl_slave.cc
static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{
......
if ((!rli->is_parallel_exec() || rli->last_master_timestamp == 0) &&    //event header 的timestamp
!(ev->is_artificial_event() || ev->is_relay_log_event() ||
(ev->common_header->when.tv_sec == 0) ||    //query event才有的執行時間
ev->get_type_code() == binary_log::FORMAT_DESCRIPTION_EVENT ||
ev->server_id == 0))
{
rli->last_master_timestamp= ev->common_header->when.tv_sec +
(time_t) ev->exec_time;
//但是對於MTS來講應該注意是最后一個XID EVENT的時間不是這里設置的，
//在 mts_checkpoint_routine 里面
DBUG_ASSERT(rli->last_master_timestamp >= 0);
}

......

注意到其中一個重要條件就是非MTS環境下，即！rli->is_parallel_exec()條件需要滿足。

7、進行 partial transaction的恢復
下面兩種情況可能出現這種情況:
• GTID AUTCLPOSITION MODE模式下如果 I/O 線程出現重連，DUMP線程會根據GTID SET進行重新定位，重發部分已經發送過的Event。
• GTID AUTCLPOSITION MODE模式下如果從庫異常重啟，並且recovery_relay_log=0 的情況下，
DUMP線程會根據GTID進行重新定位，重發部分已經發送過的Event。

這兩種情況下甶於一個事務可能包含部分重暨的Event,就涉及到回滾操作。
對於MTS來講是由協調線程進行回滾，也就是這里。如果是非MTS環境下則在GTID Event應用的時候進行回滾。

8、sql_slave_skip_counter參數
注意在GITD環境下不能使用sql_slave_skip_counter跳過Event
sql_slave_skip_counter參數的基本計數單位是Event,
但是如果最后一個Event正處於事務中的時候，
整個事務也會被跳過。
源碼：

log_event.cc
Log_event::enum_skip_reason
Log_event::do_shall_skip(Relay_log_info *rli)
{
DBUG_PRINT("info", ("ev->server_id=%lu, ::server_id=%lu,"
" rli->replicate_same_server_id=%d,"
" rli->slave_skip_counter=%d",
(ulong) server_id, (ulong) ::server_id,
rli->replicate_same_server_id,
rli->slave_skip_counter));
if ((server_id == ::server_id && !rli->replicate_same_server_id) ||
(rli->slave_skip_counter == 1 && rli->is_in_group()))
return EVENT_SKIP_IGNORE;
else if (rli->slave_skip_counter > 0)
return EVENT_SKIP_COUNT;
else
return EVENT_SKIP_NOT;
}

9、MTS進行分發，單SQL線程進行應用Event處理：
• MTS進行Event分發給工作線程進行Event的apply。
• 單SQL線程進行Event的應用。

三、各個Event大概做了什么
最終從庫就開始應用Event 了，
開啟GTID的情況下一個簡單的‘Delete’語句生成的Event為例，Event包含：

Event:
1. GTID_EVENT
2. QUERY_EVENT
3. MAP_EVENT
4. DELETE_EVENT
5. XID_EVENT

1. GTID_EVENT
• 接口 log_event.cc int Gtid_log_event::do_apply_event(Relay_log_info const *rli)
• 單SQL線程下回滾partial transaction
• 設置事務的GTID 參考 set_gtid_next 函數。

2. QUERY_EVENT
參接□ int Query_log_event::do_apply_event(Relay_log_info const *rli)
• 設置線程的執行環境，比如SQL_MODE,客戶端字符集、自增環境、客戶端排序字符集、當前登錄的數據庫名等。
• 執行相應的語句。對於ROW模式的DML語句是‘BEGIN’，DDL而言就是執行實際的語句了。‘BEGIN’語句只是在MySQL層做了一個標記，如下：

lib_sql.cc
void *create_embedded_thd(int client_flag)
......
if (thd->variables.max_join_size == HA_POS_ERROR)
thd->variables.option_bits |= OPTION_BIG_SELECTS;

語句進入提交流程后會判斷這個標記如下：

sql_class.h
inline bool in_multi_stmt_transaction_mode() const
{
return variables.option_bits & (OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN);
}

如果設罝了‘BEGIN’標記則不會進入order commit流程。

3. MAP_EVENT
• 接口 log_event.cc int Table_map_log_event::do_apply_event(Relay_log_info const *rli)
• 設置表的相關的屬性，比如數據庫名、表名、字段數量、字段類型、可變字段的長度等。
• 設置table id。

4. DELETE_EVENT
• 接口 log_event.cc int Rows_log_event::do_apply_event(Relay_log_info const *rli)
• 檢查本事務是否GTID，是則需要跳過。這個判斷過程位於函數 rpl_gtid_execution.cc static inline bool is_already_logged_transaction(const THD *thd)
• 打開表獲取MDL鎖。
• 設置table id和表的映射
• 開啟讀寫事務。
• Innodb行鎖。
• 對每行數據進行刪除/修改操作。

可以看到對於從庫來講同樣需要開啟讀寫事務上鎖等操作，
如果從庫有相應的DML操作，
SQL線程（MTS工作線程）也可能因為不能獲得鎖而堵塞。

5.XID_EVENT
• 接口log_event.cc int Xid_apply_log_event::do_apply_event(Relay_log_info const *rli) (單SQL線程）或者 int Xid_apply_log_event::do_apply_event_worker(Slave_worker *w) (MTS)
• 更新內存位點信息。
• 單SQL線程更新slave_relay_log_info表中的相關信息，MTS更新slave_worker_info表中的相關信息。 rpl_info_table.cc int Rpl_info_table::do_flush_info(const bool force)為相應的更新接口。
• 進行提交操作。

binlog_group_commit_sync_delay 造成延遲

binlog.cc

int MYSQL_BIN_LOG::ordered_commit(THD *thd, bool all, bool skip_commit)

....

// 在sync_binlog= 1 才有這個延遲

if (!flush_error && (sync_counter + 1 >= get_sync_period()))
stage_manager.wait_count_or_timeout(opt_binlog_group_commit_sync_no_delay_count,
opt_binlog_group_commit_sync_delay,
Stage_manager::SYNC_STAGE);

 ...

void Stage_manager::wait_count_or_timeout(ulong count, long usec, StageID stage)
{
long to_wait=
DBUG_EVALUATE_IF("bgc_set_infinite_delay", LONG_MAX, usec);
/*
For testing purposes while waiting for inifinity
to arrive, we keep checking the queue size at regular,
small intervals. Otherwise, waiting 0.1 * infinite
is too long.
*/
long delta=
DBUG_EVALUATE_IF("bgc_set_infinite_delay", 100000,
max<long>(1, (to_wait * 0.1)));　　　　　　// binlog_group_commit_sync_delay*0.1 和 1之間 , 1/1000000 秒 , 這參數隨着增大下面循環消耗越多

while (to_wait > 0 && (count == 0 || static_cast<ulong>(m_queue[stage].get_size()) < count)) 　　 // binlog_group_commit_sync_delay時間消耗完
{
#ifndef DBUG_OFF
if (current_thd)
DEBUG_SYNC(current_thd, "bgc_wait_count_or_timeout");
#endif
my_sleep(delta);
to_wait -= delta;
}
}

總結出event 的offset 關系:

在沒有mysql lib情況下 offset + 1:
event_time: EVENT_TIMESTAMP = 4
event_type: EVENT_TYPE_OFFSET = 5 - EVENT_TIMESTAMP

// 向前走
EVENT_LEN_OFFSET = 13 - EVENT_TYPE_OFFSET = 5 , SEEK_CUR

event_next: LOG_POS_OFFSET = 17 - EVENT_LEN_OFFSET = 13

event_size = event_next - event_pos = ( fseek(fds,event_next,SEEK_SET); ftell(fds); )

Insert Event: 30
Update Event: 31
Delete Event: 32
Map Event: 19
Gtid Event: 33

event_size > 44: 偏移 FLAGS_OFFSET = 19 - LOG_POS_OFFSET = 17 , GTID_FLAG = 26

relay_log_space_limit 設置導致從庫 I/O thread 等待復制延遲 , 從庫的I / O線程在等待relay log 空間時中止

rpl_slave.cc: static Log_event* next_event(Relay_log_info* rli)

如果 rli->log_space_limit >= rli->log_space_total
　　進入rli->relay_log.wait_for_update_relay_log(thd, NULL);
　　　　進入Linux 線程互斥鎖: pthread_cond_wait 和 pthread_cond_timedwait

如果 rli->log_space_limit< rli->log_space_total
　　從庫 的I/O線程 unlock
        mysql_mutex_unlock(&rli->log_space_lock);
        mysql_cond_broadcast(&rli->log_space_cond);

Slave_SQL_Running_State狀態分析(參考八爺文章: " MySQL里Wating for Slave workers to free pending events到底在等什么 " ):

“Waiting for dependent transaction to commit”

由於協調線程判定本事務由於last commit大於current_lwm因此不能並行回放 ( 通過clock_leq 進行比較 )，協調線程處於等待，大事務會加劇這種情況。

rpl_mts_submode.cc:
bool Mts_submode_logical_clock::
wait_for_last_committed_trx(Relay_log_info* rli,
                            longlong last_committed_arg,
                            longlong lwm_estimate_arg)
{......

if ((!rli->info_thd->killed && !is_error) &&
      !clock_leq(last_committed_arg, get_lwm_timestamp(rli, true)))
  {
    PSI_stage_info old_stage;
    struct timespec ts[2];
    set_timespec_nsec(&ts[0], 0);

    DBUG_ASSERT(rli->gaq->len >= 2); // there's someone to wait

    thd->ENTER_COND(&rli->logical_clock_cond, &rli->mts_gaq_LOCK,
                    &stage_worker_waiting_for_commit_parent, &old_stage);
    do
    {
      mysql_cond_wait(&rli->logical_clock_cond, &rli->mts_gaq_LOCK);
    }
　　    while ((!rli->info_thd->killed && !is_error) &&
           !clock_leq(last_committed_arg, estimate_lwm_timestamp()));
    my_atomic_store64(&min_waited_timestamp, SEQ_UNINIT);  // reset waiting flag
    mysql_mutex_unlock(&rli->mts_gaq_LOCK);
    thd->EXIT_COND(&old_stage);
    set_timespec_nsec(&ts[1], 0);
    my_atomic_add64(&rli->mts_total_wait_overlap, diff_timespec(&ts[1], &ts[0]));

"Waiting for slave workers to process their queues"

由於沒有空閑的工作線程，協調線程會等待。這種情況說明理論上的並行度是理想的，但是可能是參數slave_parallel_workers設置不夠。當然設置工作線程的個數應該和服務器的配置和負載相結合考慮。


有兩個函數觸發這個狀態,如下:
1.判斷有沒有空閑的工作線程
Slave_worker *
Mts_submode_logical_clock::get_least_occupied_worker(Relay_log_info *rli,
                                                     Slave_worker_array *ws,
                                                     Log_event * ev)
{
......
else
  {
    worker= get_free_worker(rli);

    DBUG_ASSERT(ev->get_type_code() != binary_log::USER_VAR_EVENT ||
                rli->curr_group_seen_begin || rli->curr_group_seen_gtid);

    if (worker == NULL)
    {
      struct timespec ts[2];

      set_timespec_nsec(&ts[0], 0);
      // Update thd info as waiting for workers to finish.
      thd->enter_stage(&stage_slave_waiting_for_workers_to_process_queue,
                       old_stage,
                       __func__, __FILE__, __LINE__);


2. 等待從worker完成待辦任務

int
Mts_submode_logical_clock::
   wait_for_workers_to_finish(Relay_log_info *rli,
                              MY_ATTRIBUTE((unused)) Slave_worker * ignore)
{
  PSI_stage_info *old_stage= 0;
  THD *thd= rli->info_thd;
  DBUG_ENTER("Mts_submode_logical_clock::wait_for_workers_to_finish");
  DBUG_PRINT("info",("delegated %d, jobs_done %d", delegated_jobs,
                          jobs_done));

  thd->enter_stage(&stage_slave_waiting_for_workers_to_process_queue,
                   old_stage,
                    __func__, __FILE__, __LINE__);
......

"Waiting for Slave Worker queue"

由於工作線程的任務隊列已滿，協調線程會等待。這種情況前面說過是由於一個事務包含了過多的Event並且工作線程應用Event的速度趕不上協調線程分配Event的速度，
導致了積壓並且超過了16384個Event,超過后觸發這個狀態。


rpl_rli_pdb.cc ( en_queue函數判斷任務隊列是否溢出 )
bool append_item_to_jobs(slave_job_item *job_item,
                         Slave_worker *worker, Relay_log_info *rli)
{
......
while (worker->running_status == Slave_worker::RUNNING && !thd->killed &&
         (ret= en_queue(&worker->jobs, job_item)) == -1)
  {
    thd->ENTER_COND(&worker->jobs_cond, &worker->jobs_lock,
                    &stage_slave_waiting_worker_queue, &old_stage);



rpl_slave.cc
int slave_start_workers(Relay_log_info *rli, ulong n, bool *mts_inited)
{
...........
  rli->mts_slave_worker_queue_len_max= mts_slave_worker_queue_len_max;


rpl_rli_pdb.cc
int Slave_worker::init_worker(Relay_log_info * rli, ulong i)
{
...........
  jobs.entry= jobs.size= c_rli->mts_slave_worker_queue_len_max;

"Waiting for Slave Workers to free pending events"

1. 如果涉及到big event，那么要求worker線程當前沒有積壓的event正在執行
2. 如果不是big event，那么需要判斷當前worker線程積壓的event大小加上本次進入worker線程對列的event大小之和不能超過參數 slave_pending_jobs_size_max 的大小


rpl_rli_pdb.cc

bool append_item_to_jobs(slave_job_item *job_item,
                         Slave_worker *worker, Relay_log_info *rli)
{
..........

  while ( (!big_event && new_pend_size > rli->mts_pending_jobs_size_max)
          || (big_event && rli->mts_pending_jobs_size != 0 ))
  {
    rli->mts_wq_oversize= TRUE;
    rli->wq_size_waits_cnt++; // waiting due to the total size
    thd->ENTER_COND(&rli->pending_jobs_cond, &rli->pending_jobs_lock,
                    &stage_slave_waiting_worker_to_free_events, &old_stage);
    mysql_cond_wait(&rli->pending_jobs_cond, &rli->pending_jobs_lock);
    mysql_mutex_unlock(&rli->pending_jobs_lock);
    thd->EXIT_COND(&old_stage);

總結MySQL服務器層面影響主從復制情況:

1. 從庫的系統時間被修改比當前時間更大(推薦配置時間服務器)

2. 在MTS設置不合理 rpl_slave.cc: slave_checkpoint_period 參數 ( 設置在調用檢查點操作之前允許通過的最大時間（以毫秒為單位），以更新多線程Slave狀態 )

在MTS模式下判定是否需要進行MTS檢查點
每讀取一個Event都會去判斷是否需要進行MTS的檢查點，其條件有兩個：
• MTS的檢查點超過參數 slave_checkpoint_period
• GAQ( group assigned queue )隊列已滿 slave_checkpoint_group ( 隊列越長, 檢測 " 檢查點" 越多 )

3. 主庫大事務的sql語句性能: 延遲不會從0開始而是直接從主庫執行了當前事務多長時間開始。例如主庫執行這個事務花費的20秒，那么延遲就會從20開始 , 導致 Query Event中沒有准確執行時間 . ( 從這里參考 (3) mi->rli->last_master_timestamp )

4. 大表DDL: 延遲會從0開始，因為Query Event記錄了准確的執行時間。

5. slave_rows_search_algorithms 參數 , 表沒有主鍵或者唯一鍵，就不要設置INDEX_SCAN,HASH_SCAN

6. rpl_slave.cc: slave_checkpoint_period ( 默認 300ms) 在MTS中的Seconds_Behind_Master計算和檢查點有關。( 參考 • DML (MTS) )

7. 根據線程模式 , binlog事務模式, recovery_relay_log , sync_master_info , sync_relay_log , sync_relay_log_info 進行一個性能判斷,如下:

sync_master_info: 如果sync_master_info的值大於 0，則Slave服務器將在每個sync_master_info事件后更新其主信息存儲庫表。如果是 0，則永遠不會更新表。

sync_relay_log: 值 1 是最安全的選擇，因為在發生崩潰時，最多會從Relay Log中丟失一個事件。但是，這也是最慢的選擇（除非磁盤具有電池支持的緩存，這使得同步速度非常快）。

sync_relay_log_info: 如果relay_log_info_repository設置為 TABLE，並且該表的存儲引擎是事務性的，則該表在每個事務后將更新表。（在這種情況下， sync_relay_log_info 設置實際上被忽略。master_info_repository = FILE 和 relay_log_info_repository = FILE 不是事務性的, 調用 fdatasync() 將磁盤緩存寫入到文件, my_sync.cc: int my_sync(File fd, myf my_flags) , 推薦用 TABLE.

8. 表沒有主鍵或者唯一鍵造成的延遲。這種情況不要以為設置slave_rows_search_algorithms參數為 INDEX_SCAN,HASH_SCAN , HASH_SCAN在有索引且一個事務中的 update/delete的數據關鍵字重復值多才能提升性能

9. 長時間不提交的事務可能造成延遲瞬間增加，因為GTID_EVENT和XID_EVENT是提交時間,而其他Event是命令發起的時間。

10. binlog.cc: binlog_group_commit_sync_delay 的高值可能會影響多線程從庫中的 slave worker 性能,但必須考慮主服務器的吞吐量和從服務器的吞吐量。還可以減少對具有binlog的任何服務器（主服務器或從服務器）上的binlog的 fsync() 調用次數。必須配合 slave_parallel_type=LOGICAL_CLOCK

11. 在舊版本MySQL的主服務器和新版本MySQL的從服務器之間復制。

12. 啟動 I/O 線程后人為的修改從庫服務器的時間

13. 關閉 slave_preserve_commit_order 參數出現 gap ,檢查點只能停留在 gap 之前的一個事務,可能后面的事務已經執行完成了。開啟 slave_preserve_commit_order和開啟log_slave_updates 防止在MTS復制模式下防止從庫形成gap,導致主從數據不一致問題，交驗和修正數據也是要導致延遲。

rpl_slave.cc
extern "C" void *handle_slave_sql(void *arg)

......

if (opt_slave_preserve_commit_order && rli->opt_slave_parallel_workers > 0 &&
      opt_bin_log && opt_log_slave_updates)
    commit_order_mngr= new Commit_order_manager(rli->opt_slave_parallel_workers);

  rli->set_commit_order_manager(commit_order_mngr);

14. binlog_row_image設置minimal 在不修改非空唯一索引和主鍵時候的字段時候 binlog是不包含非空唯一索引和主鍵的字段

15. recovery_relay_log = 0 打破 relay log 和binary log的事務是連續性，recovery_relay_log = 1 relay log 和binary log當前滿了也不能切換

16. relay_log_space_limit 限制不當頻繁比較 rpl_slave.cc: static Log_event* next_event(Relay_log_info* rli)下 rli->log_space_limit 和 rli->log_space_total 比較導致 I/O thread 頻繁被等待

17. 關注 table_open_cache 和 table_definition_cache 參數, show global status 的 Open_tables 和 Opened_tables ,如果 table_open_cache 和 table_definition_cache 參數太小導致漏復制表數據

18. 使用臨時 binlog 緩存但超過“ binlog_cache_size ”值並使用臨時文件存儲事務中的語句的事務數。導致二進制日志事務緩存寫入磁盤的非事務性語句數在“ Binlog_stmt_cache_disk_use ”狀態變量中單獨跟蹤。 binlog cache efficiency = ( 1 - ( Binlog_cache_disk_use / Binlog_cache_use )) * 100% 結果值應該非常接近100％。如果它明顯較少，您需要調整（增加） binlog_cache_size 。

SELECT (1 - (select VARIABLE_VALUE
from information_schema.GLOBAL_STATUS
where VARIABLE_NAME = 'Binlog_cache_disk_use') / (select VARIABLE_VALUE
from information_schema.GLOBAL_STATUS
where VARIABLE_NAME = 'Binlog_cache_use' )) * 100;

......

100. 所以我最終遷移到PXC 或者 MGR 再或者遷移到 PostgreSQL物理流復制延遲時間也是十分低（主從模式）解決上述主從復制延遲問題

待續....

binlog_group_commit_sync_delay 的高值可能會影響多線程從庫中的 slave worker 性能。

if (rli->log_space_limit && rli->log_space_limit < rli->log_space_total)

------------恢復內容開始------------

比如主庫執行這個事務花費的20秒，那么延遲就會從20開始，

從 log_event.cc 看到，這是因為 Query Event 中沒有准確的執行時間

log_event.cc

入口:

int Query_log_event::do_apply_event(Relay_log_info const *rli,
const char *query_arg, size_t q_len_arg)

包括: sql_mode , 客戶端字符集,自增環境設置,登陸的db,結果執行時間,ddl和dml執行時間
dml: 數據被寫盤后的時間 thd->set_time(&(common_header->when)); , 一個事務對應一個Query_Event 
ddl: 實際語句執行時間 thd->set_time(&(common_header->when)); 整個語句執行后才寫入執行時間

show slave status 入口:

rpl_slave.cc:

bool show_slave_status_send_data(THD *thd, Master_info *mi,
char* io_gtid_set_buffer,
char* sql_gtid_set_buffer)


if ((mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()) &&
(!strcmp(mi->get_master_log_name(), mi->rli->get_group_master_log_name())))

檢查SQL線程是否在中繼日志的末尾
        檢查應使用兩個條件進行
        condition1：比較日志位置和
        condition2：比較文件名

獲得 seconds_behind_master 時間
long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp) - mi->clock_diff_with_master);

系統時間 (long)(time(0)
主從之間系統時間差值 mi->clock_diff_with_master

1. 重點:
主從服務器的操作系統時間要一致,因為這兩個是可變不穩定因素

dml語句和ddl語句的執行時間 mi->rli->last_master_timestamp

log_event.cc

Query_log_event::Query_log_event(const char* buf, uint event_len,
const Format_description_event
*description_event,
Log_event_type event_type)

dml(從begin開始) 和 ddl 執行時間
exec_time= end_time.tv_sec - thd_arg->start_time.tv_sec;

rpl_slave.cc 
等於各個Event header 的 timestamp , 事務都是 GTID_EVENT 和 XID_EVENT 提交時間
rli->last_master_timestamp= ev->common_header->when.tv_sec + (time_t) ev->exec_time;

(題外話: cnblogs可以隨便給其他網站爬蟲( https://www.colabug.com/2020/0203/6936599/ ) , 我都還沒寫完它就爬 , cnblogs 一點保護不了作者勞動成果

binlog_event.h
class Log_event_header
·timeval：固定4字節，是新紀元時間（1970年1月1日0時0分0秒）以來的秒數。這個時
間是命令發起的時間。如何定義命令發起呢？它是執行計划生成后開始實際執行語句的
時候。源碼可以是在dispatch一command函數的開頭設置的(thd->set-time())。言外
之意對於語法意義、權限橙查、優化器生成執行計划的時間都包括在里面。其次要注意
了這個時間在從庫計算Seconds一Behind一Master的時候是一個重要的依據


·type_code：固定1字節，event事件的編碼。每TEvent有自己的編碼。
·unmasked_server_id:固定4字節。就是生成這個Event服務端的serverid。即便從庫端開啟了
log-slave-updates%，從庫將主庫的Event寫到BinaryLog，這個serverid也是主庫
的serveride如果這個Event再冫欠傳到主庫，那么需要跳過。源碼可以在


Log-event::do-shalLskip函數中找到跳過邏輯如下“

·event Len：固定4字節，整個Event的長度。
·end_log_p：固定4字節，下一個Event的開始位置。
．flags:固定2字節，某些Event包含這個標示，比如Format-descriptionlog_event中
LOG_EVENT_BINLOG_IN_USE_F 標示當前Binary log是當前寫入文件。

class Log_event_footer
·Event footer中的crc：固定4字節，這部分就是整個Event的一個crc校驗碼，用於標示Event的完整性

參考Log_event_type

• 將從庫的relay log的FORMAT_DESCRIPTION_EVENT記錄為和主庫相同，
即更新 RelayJogInfo.rlLdescription_event

log_event.cc:
int Format_description_log_event::do_apply_event(Relay_log_info const *rli) 函數中如下片段:

如果從站未請求該事件，即Master發送了當從站要求位置> 4時，
該事件將使rli-> group_master_log_pos前進。 
說Slave要位置1000，Format_desc事件的結尾是96。
然后在復制rli-> group_master_log_pos的開始將是0，
然后是96，然后跳到第一個真正詢問的事件（即> 96）

/* Save the information describing this binlog */
const_cast<Relay_log_info *>(rli)->set_rli_description_event(this);

至少下面幾個地方都會讀取其中的倍息：
1.    每次SQL線程應用Event的時候會獲取Event_header的長度和相應Even個定部分的長度。
2.    I0線程啟動的時候檢測版本，參考函數 get_master_version_and_clock 。
3.    將倌患寫入到 relay log 的開頭的FORMAT_DESCRIPTION_EVENT中。
4. 位置必須在#4

rpl_slave.cc:
static int get_master_version_and_clock(MYSQL* mysql, Master_info* mi)

從庫針對不同版本不同處理 , mysql>=5.0 都能復制
switch (version_number)
{
case 0:
case 1:
case 2:
errmsg = "Master reported unrecognized MySQL version";
err_code= ER_SLAVE_FATAL_ERROR;
sprintf(err_buff, ER(err_code), errmsg);
break;
case 3:
mi->set_mi_description_event(new
Format_description_log_event(1, mysql->server_version));
break;
case 4:
mi->set_mi_description_event(new
Format_description_log_event(3, mysql->server_version));
break;
default:
/*
Master is MySQL >=5.0. Give a default Format_desc event, so that we can
take the early steps (like tests for "is this a 3.23 master") which we
have to take before we receive the real master's Format_desc which will
override this one. Note that the Format_desc we create below is garbage
(it has the format of the *slave*); it's only good to help know if the
master is 3.23, 4.0, etc.
*/
mi->set_mi_description_event(new
Format_description_log_event(4, mysql->server_version));
break;
}
}

將binlog event 寫入 binlog cache

log_event.cc
bool Format_description_log_event::write(IO_CACHE* file)

if (post_header_len_size == Binary_log_event::LOG_EVENT_TYPES)
相同版本主從服務器復制

else if (post_header_len_size > Binary_log_event::LOG_EVENT_TYPES)
在新的Master和舊的Slave之間復制。但不會從內存復制,所以任何內存不足的讀取。

else 在舊版本的主服務器和新版本的從服務器之間復制。
在這種情況下，它可能導致 Master 和 Slave 上發生不同的number_of_events。 當循環relay log時，
來自Master的FORMAT_DESCRIPTION_EVENT用於在Slave上創建FORMAT_DESCRIPTION_EVENT事件，
該事件將在此處寫入。 在這種情況下可能最終會讀取更多字節，
如post_header_len.size()<Binary_log_event::LOG_EVENT_TYPES;。引起內存問題。

binlog_version
server_version
create_timestamp
headerjength
array of post-header

生成時機
這個Event作為binary log的第一個Event, _般都是在binary log切換的時候發生比如：
• flush binary 丨ogs命令。
• binary log自動切換。
• 重啟MySQL實例

3、主體格式

注意：甶於一個GTIDSET可以包含多個server_uuid,因此第2到第5部分可能包含多個。
如果還包含多個GTID SET Interval則第4和第5部分也可能多個。

4、實例解析
下面是一個PREVIOUS_GTIDS_LOG_EVENT (mysqlbinlog —hexdump 輸出>，這種情況是我手動刪除了 auto.cnf手動構造出來的：

(cnblog格式能否優美一下)

• 02 00 00 00 00 00 00 00:包含2個server_uuid U小端顯示就是2個。

我們看到解析出來的：

5、生成時機
生成時機一般也是進行binary log切換的時候作為第二個Event寫入binary log

7、重點 Event

• GTID的詳細倌息。
• 邏輯時鍾詳細倍息，即 last commit 和 seq number
• 是否為行模式，比如DDL語句就不是行模式的

顯示開啟事務的情況下GTID_LOG_EVENT和XID_EVENT Event header的 timestamp都是commit命令發起的時間，

當然如果沒有顯示開啟事務那么timestamp還是命令發起的時間。

log_event.cc

(4)簡單解析
下面是一個的GTID_LOG_EVENT（mysqlbinlog—hexdump 輸出):

(5) 生成時機
關於生成時機來講GTID_LOG_EVENT的生成和寫入binary log文件都是 order commit 的flush 階段

rpl_gtid.h

/**
Specifies that the GTID has not been generated yet; it will be
generated on commit. It will depend on the GTID_MODE: if
GTID_MODE<=OFF_PERMISSIVE, then the transaction will be anonymous;
if GTID_MODE>=ON_PERMISSIVE, then the transaction will be assigned
a new GTID.

This is the default value: thd->variables.gtid_next has this state
when GTID_NEXT="AUTOMATIC".

It is important that AUTOMATIC_GROUP==0 so that the default value
for thd->variables->gtid_next.type is AUTOMATIC_GROUP.
*/
AUTOMATIC_GROUP= 0,
/**
Specifies that the transaction has been assigned a GTID (UUID:NUMBER).

thd->variables.gtid_next has this state when GTID_NEXT="UUID:NUMBER".

This is the state of GTID-transactions replicated to the slave.
*/
GTID_GROUP,
/**
Specifies that the transaction is anonymous, i.e., it does not
have a GTID and will never be assigned one.

thd->variables.gtid_next has this state when GTID_NEXT="ANONYMOUS".

This is the state of any transaction generated on a pre-GTID
server, or on a server with GTID_MODE==OFF.
*/
ANONYMOUS_GROUP,
/**
GTID_NEXT is set to this state after a transaction with
GTID_NEXT=='UUID:NUMBER' is committed.

This is used to protect against a special case of unsafe
non-transactional updates.

Background: Non-transactional updates are allowed as long as they
are sane. Non-transactional updates must be single-statement
transactions; they must not be mixed with transactional updates in
the same statement or in the same transaction. Since
non-transactional updates must be logged separately from
transactional updates, a single mixed statement would generate two
different transactions.

Problematic case: Consider a transaction, Tx1, that updates two
transactional tables on the master, t1 and t2. Then slave (s1) later
replays Tx1. However, t2 is a non-transactional table at s1. As such, s1
will report an error because it cannot split Tx1 into two different
transactions. Had no error been reported, then Tx1 would be split into Tx1
and Tx2, potentially causing severe harm in case some form of fail-over
procedure is later engaged by s1.

To detect this case on the slave and generate an appropriate error
message rather than causing an inconsistency in the GTID state, we
do as follows. When committing a transaction that has
GTID_NEXT==UUID:NUMBER, we set GTID_NEXT to UNDEFINED_GROUP. When
the next part of the transaction is being processed, an error is
generated, because it is not allowed to execute a transaction when
GTID_NEXT==UNDEFINED. In the normal case, the error is not
generated, because there will always be a Gtid_log_event after the
next transaction.
*/
UNDEFINED_GROUP,
/*
GTID_NEXT is set to this state by the slave applier thread when it
reads a Format_description_log_event that does not originate from
this server.

Background: when the slave applier thread reads a relay log that
comes from a pre-GTID master, it must preserve the transactions as
anonymous transactions, even if GTID_MODE>=ON_PERMISSIVE. This
may happen, e.g., if the relay log was received when master and
slave had GTID_MODE=OFF or when master and slave were old, and the
relay log is applied when slave has GTID_MODE>=ON_PERMISSIVE.

So the slave thread should set GTID_NEXT=ANONYMOUS for the next
transaction when it starts to process an old binary log. However,
there is no way for the slave to tell if the binary log is old,
until it sees the first transaction. If the first transaction
begins with a Gtid_log_event, we have the GTID there; if it begins
with query_log_event, row events, etc, then this is an old binary
log. So at the time the binary log begins, we just set
GTID_NEXT=NOT_YET_DETERMINED_GROUP. If it remains
NOT_YET_DETERMINED when the next transaction begins,
gtid_pre_statement_checks will automatically turn it into an
anonymous transaction. If a Gtid_log_event comes across before
the next transaction starts, then the Gtid_log_event will just set
GTID_NEXT='UUID:NUMBER' accordingly.
*/
NOT_YET_DETERMINED_GROUP
};

格式部分(單位:bytes):

固定部分，這部分是大小不變的。

statement_events.cpp:

從Slave讀取來源:

Query_event::Query_event(const char* buf, unsigned int event_len,
const Format_description_event *description_event,
Log_event_type event_type)

 

......

// 讀取 status variables

start= (Log_event_header::Byte*) (buf + post_header_len);
end= (const Log_event_header::Byte*) (start + status_vars_len);

// 讀取 db

  db= (const char* )end;

log_event.cc

int Load_log_event::do_apply_event(NET* net, Relay_log_info const *rli,
bool use_rli_only_for_errors)

......

if (rpl_filter->db_ok(thd->db().str))
{
thd->set_time(&(common_header->when));

這句代碼會設罝線程的命令執行時間為Event header中Timestamp的時間。因此當我們在從庫上執行now()函數時是可以得到正確的結果的

(4)實例解析
下面是一個行模式的DML的QUERY_EVENT (mysqlbinlog --hexdump 輸出）：

slave_proxy_id 4 bytes
query_exec_time 4 bytes
db_len 1 bytes
error_code 2 bytes
status_vars_len 2 bytes
status_variables 可變
db 可變
query 可變

從讀取來源:

statement_events.cpp
Query_event::Query_event(const char* buf, unsigned int event_len,
const Format_description_event *description_event,
Log_event_type event_type)

end= (const Log_event_header::Byte*) (start + status_vars_len);
db= (const char* )end;
// 這里將指針指向 Query_event header中的db-name

log_event.cc
int Query_log_event::do_apply_event(Relay_log_info const *rli,
const char *query_arg, size_t q_len_arg)

// 讀取db-name
set_thd_db(thd, db, db_len);

log_event.cc 調用 rpl_filter.cc , rpl_filter.h

bool
Rpl_filter::db_ok_with_wild_table(const char *db)

(4)實例解析
下面是一個行模式的DML的QUERY_EVENT (mysqlbinlog --hexdump 輸出）：

可變部分
• status variables:略。
• 77 6f 72 6c 64 00:當前登錄庫名world的ASCII編碼，以0x00結尾
• 42 45 47 49 4e:語句‘BEGIN’的ASCII編碼

中間有部分就是我們的status variables,這部分過於復雜我沒有仔細研究，因此我就沒做解析了

sql_base.cc
TABLE_SHARE *get_table_share(THD *thd, TABLE_LIST *table_list,
const char *key, size_t key_length,
uint db_flags, int *error,
my_hash_value_type hash_value)

assign_new_table_id(share);
share->table_map_id= last_table_id++; // 遞增table id
在LOCK_open的保護下分配一個新的表ID。 這樣做不是創建新的互斥鎖
並將其僅用於序列化對 靜態變量， 在此處分配表ID。我們將其分配給
共享，然后將其插入到 table_def_cache , table_cache_size 中確保沒有表就無法從緩存中讀取它
分配的ID。

20200826更新

如果 table_def_cache 比 參數 table_open_cache 大 , 從庫就會釋放緩存,
重新打開 table 重新生成 table id (sql_base.cc: table_def_init )

導致跟主庫生成 table id 不一樣 ( pfs_autosize.cc: estimate_hints )和 table id急劇增長 ,
最終主從漏掉某些表同步

sql_base.cc: process_found_table_share
......

while (table_def_cache->size() > table_def_size && oldest_unused_share->next)

table_def_cache->erase(to_string(oldest_unused_share->table_cache_key));

......

關注 show global status 的 Open_tables 和 Opened_tables

field.cc
int Field_varstring::do_save_field_metadata(uchar *metadata_ptr)
{
DBUG_ASSERT(field_length <= 65535);
int2store((char*)metadata_ptr, field_length);
return 2;
}

如果感興趣可以査看do_save_field_metadata這個函數。
• m_null_bits:字段是否可以為空的一個位圖。下面是位圖的獲取方式。

現在將數據的大小設置為字段元數據數組的大小
加上1或3個字節（請參閱pack.c：net_store_length）以獲取
字段元數據數組中的元素。

log_event.cc
if (m_field_metadata_size < 251)
m_data_size+= m_field_metadata_size + 1; 
else
m_data_size+= m_field_metadata_size + 3;

memset(m_null_bits, 0, num_null_bytes);
for (unsigned int i= 0 ; i < m_table->s->fields ; ++i)
if (m_table->field[i]->maybe_null())
m_null_bits[(i / 8)]+= 1 << (i % 8);

(4) 實例解析
執行如下語句：

mysql> insert into test values(100,100,100);
Query OK, 1 row affected (0.01 sec)

set global autocommit = off;

Session1:
begin;
insert into test.test values(1,2,3);

Session2:
flush tables;

Session1:
insert into test.test values(4,5,6);
commit;

一個事務內相同的表不同的 table_id ,

# at 475
#200207 21:18:46 server id 1 end_log_pos 524 CRC32 0xb34f27f8 Table_map: `test`.`test` mapped to number 360

# at 572
#200207 21:18:58 server id 1 end_log_pos 621 CRC32 0xa4f321aa Table_map: `test`.`test` mapped to number 361

rpl_slave.cc
/*
The pseudo code to compute Seconds_Behind_Master:
if (SQL thread is running)
{
if (SQL thread processed all the available relay log)
{ // sql線呈已經 apply 完所有 I/O線程寫入的Event
if (IO thread is running)
print 0;    // 顯示延遲為0
else
print NULL;
}
else
compute Seconds_Behind_Master;
}
else
print NULL;
*/

正式代碼如下,這里有一點需要注慂就是如何判斷SQL線程應用完了所有的Event的，下面這部 分實際的代碼：


bool show_slave_status_send_data(THD *thd, Master_info *mi,
char* io_gtid_set_buffer,
char* sql_gtid_set_buffer)
...
if ((mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()) &&
(!strcmp(mi->get_master_log_name(), mi->rli->get_group_master_log_name())))
{
if (mi->slave_running == MYSQL_SLAVE_RUN_CONNECT)
protocol->store(0LL);
else
protocol->store_null();
}
else
{
long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp)
- mi->clock_diff_with_master);

// max(0L,time_diff) 如果time_diff為負數取0

因此主從服務器最好保證時鍾同步。
這也是Seconds_Behind_Master顯示0並不代表沒有延遲的第二個原因。

(3) mi->rli->last_master_timestamp
這個值的取值就比較復雜，因為DML和DDL下是不同的。
如果是DML在單SQL線程和MTS下又不一樣。分開分析。

• DML (單SQL線程)
這種情況下前面已經說過了，SQL線程每次執行Event的時候會獲取Event的相關信息如下：

rpl_slave.cc

// event header 的timestamp

static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{
DBUG_ENTER("exec_relay_log_event");

/*
We acquire this mutex since we need it for all operations except
event execution. But we will release it in places where we will
wait for something for example inside of next_event().
*/
mysql_mutex_lock(&rli->data_lock);

/*
UNTIL_SQL_AFTER_GTIDS, UNTIL_MASTER_POS and UNTIL_RELAY_POS require
special handling since we have to check whether the until_condition is
satisfied *before* the SQL threads goes on a wait inside next_event()
for the relay log to grow.
This is required in the following case: We have already applied the last
event in the waiting set, but the relay log ends after this event. Then it
is not enough to check the condition in next_event; we also have to check
it here, before going to sleep. Otherwise, if no updates were coming from
the master, we would sleep forever despite having reached the required
position.
*/
if ((rli->until_condition == Relay_log_info::UNTIL_SQL_AFTER_GTIDS ||
rli->until_condition == Relay_log_info::UNTIL_MASTER_POS ||
rli->until_condition == Relay_log_info::UNTIL_RELAY_POS ||
rli->until_condition == Relay_log_info::UNTIL_SQL_VIEW_ID) &&
rli->is_until_satisfied(thd, NULL))
{
rli->abort_slave= 1;
mysql_mutex_unlock(&rli->data_lock);
DBUG_RETURN(1);
}

Log_event *ev = next_event(rli), **ptr_ev;

DBUG_ASSERT(rli->info_thd==thd);

if (sql_slave_killed(thd,rli))
{
mysql_mutex_unlock(&rli->data_lock);
delete ev;
DBUG_RETURN(1);
}
if (ev)
{
enum enum_slave_apply_event_and_update_pos_retval exec_res;

ptr_ev= &ev;
/*
Even if we don't execute this event, we keep the master timestamp,
so that seconds behind master shows correct delta (there are events
that are not replayed, so we keep falling behind).

If it is an artificial event, or a relay log event (IO thread generated
event) or ev->when is set to 0, or a FD from master, or a heartbeat
event with server_id '0' then we don't update the last_master_timestamp.

In case of parallel execution last_master_timestamp is only updated when
a job is taken out of GAQ. Thus when last_master_timestamp is 0 (which
indicates that GAQ is empty, all slave workers are waiting for events from
the Coordinator), we need to initialize it with a timestamp from the first
event to be executed in parallel.
*/
if ((!rli->is_parallel_exec() || rli->last_master_timestamp == 0) &&
!(ev->is_artificial_event() || ev->is_relay_log_event() ||
(ev->common_header->when.tv_sec == 0) ||
ev->get_type_code() == binary_log::FORMAT_DESCRIPTION_EVENT ||
ev->server_id == 0))
{
rli->last_master_timestamp= ev->common_header->when.tv_sec +
(time_t) ev->exec_time;
DBUG_ASSERT(rli->last_master_timestamp >= 0);
}

// query event 才有的執行時間

rpl_slave.cc
/*
The pseudo code to compute Seconds_Behind_Master:
if (SQL thread is running)
{
if (SQL thread processed all the available relay log)
{ // sql線呈已經 apply 完所有 I/O線程寫入的Event
if (IO thread is running)
print 0;    // 顯示延遲為0
else
print NULL;
}
else
compute Seconds_Behind_Master;
}
else
print NULL;
*/

正式代碼如下,這里有一點需要注慂就是如何判斷SQL線程應用完了所有的Event的，下面我們可以看看這部 分實際的代碼：


bool show_slave_status_send_data(THD *thd, Master_info *mi,
char* io_gtid_set_buffer,
char* sql_gtid_set_buffer)
...
if ((mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()) &&
(!strcmp(mi->get_master_log_name(), mi->rli->get_group_master_log_name())))
{
if (mi->slave_running == MYSQL_SLAVE_RUN_CONNECT)
protocol->store(0LL);
else
protocol->store_null();
}
else
{
long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp)
- mi->clock_diff_with_master);

protocol->store((longlong)(mi->rli->last_master_timestamp ?
max(0L, time_diff) : 0));
}
}
// max(0L,time_diff) 如果time_diff為負數取0

因此主從服務器最好保證時鍾同步。
這也是Seconds_Behind_Master顯示0並不代表沒有延遲的第二個原因。

(3) mi->rli->last_master_timestamp
這個值的取值就比較復雜，因為DML和DDL下是不同的。
如果是DML在單SQL線程和MTS下又不一樣。因此我們分幵討論。

• DML (單SQL線程)
這種情況下前面我們已經說過了，SQL線程每次執行Event的時候會獲取Event的相關信息如下：

rpl_slave.cc
static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{
....
// event header 的timestamp
rli->last_master_timestamp= ev->common_header->when.tv_sec +
(time_t) ev->exec_time;

// query event 才有的執行時間

• DML (MTS)
實際上rli->last_master_timestamp會取值為 checkpoint 位罝事務XID_EVENT的timestamp,流程大概如下：

log_event.cc
Slave_worker *Log_event::get_slave_worker(Relay_log_info *rli)
{

....
1. 工作線程執行到事務XID_EVENT時,取XID_EVENT的timestamp
ptr_group->checkpoint_seqno= rli->checkpoint_seqno;
ptr_group->ts= common_header->when.tv_sec + (time_t) exec_time; // Seconds_behind_master related
...

2. 當進行 checkpoint 的時候設置變量ts , 調用函數 mts_checkpoint_routine

GAQ是什么?

https://dev.mysql.com/doc/dev/mysql-server/8.0.15/classSlave__committed__queue.html#details

bool mts_checkpoint_routine(Relay_log_info *rli, ulonglong period,
bool force, bool need_data_lock)

/*
Update the rli->last_master_timestamp for reporting correct Seconds_behind_master.

If GAQ is empty, set it to zero.
Else, update it with the timestamp of the first job of the Slave_job_queue
which was assigned in the Log_event::get_slave_worker() function.
*/
ts= rli->gaq->empty()
? 0
: reinterpret_cast<Slave_job_group*>(rli->gaq->head_queue())->ts;
rli->reset_notified_checkpoint(cnt, ts, need_data_lock, true);
/* e

3. 修改 last_master_timestamp 值
rpl_rli.cc
void Relay_log_info::reset_notified_checkpoint(ulong shift, time_t new_ts,
bool need_data_lock,
bool update_timestamp)
{

...

if (update_timestamp)
{
if (need_data_lock)
mysql_mutex_lock(&data_lock);
else
mysql_mutex_assert_owner(&data_lock);
last_master_timestamp= new_ts;
if (need_data_lock)
mysql_mutex_unlock(&data_lock);
}
}

• DDL
同樣是上面 DML (單SQL線程) 的公式：

rpl_slave.cc
static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{
....
// event header 的timestamp
rli->last_master_timestamp= ev->common_header->when.tv_sec +
(time_t) ev->exec_time;

// query event 才有的執行時間

實際上DML和DDL計 算延遲的方式區別就在於這里的exec_time是否參與運算。
DDL記錄Binary log中QUERY_EVENT記錄的是實際的語句,exec_time就會記錄整個語句執行完成所消耗的時間

三、 Seconds_Behind_Masteri十算方式總結

四、 MTS中Seconds_Behind_Masteri十算誤差的測試
證明slave_checkpoint_period( 默認 300ms) 在MTS中的Seconds_Behind_Master計算和檢查點有關。
從庫操作如下：

set slave_checkpoint_period = 60000;
stop slave;
start slave;

然后在主庫做手動做幾個事務。然后查看Executed_Gtid_Set如下：

insert * 3;

show master status \G

檢查主庫和從庫的Executed_Gtid_Set 發現這些事務早就已經在從庫做完了

下面是變為0的截圖：

但是這種問題一般不需要擔心，
因為默認情況下參數slave_checkpoint_period為300毫秒，檢查點非常頻繁。

slave_checkpoint_period 在代碼里面實際為 mts_checkpoint_period 這個 ,

涉及源碼: rpl_slave.cc: static Log_event* next_event(Relay_log_info* rli) ,

rpl_slave.cc: mts_checkpoint_routine 處理"rli->gaq" 找出恢復表中的低水位標記（lwm）。

GAQ的長度 == slave_checkpoint_group == rli->checkpoint_group ,

slave_checkpoint_period 性能隨着 slave_checkpoint_group 越大 SQL thread 循環檢查 "檢查點" 是否需要執行次數越多 ,

slave_checkpoint_group 設置比較小會導致頻繁 SQL thread 循環檢查 "檢查點"

設置適當一個值需要持續觀察

然后我們將從庫系統時間改小，如下：

修改系統時間
# date
# date -s 01:00:00

然后再次查看延遲， Seconds_Behind_Master為0,但是gno為31的DDL操作還是沒有執行完成，如下：

從庫數據的查找和參數slave_rows_search_algorithms

mysql> delete from aaa where a=1;
Query OK, 3 rows affected (6.24 sec)

索引定位

btr0cur.cc
void
btr_cur_search_to_nth_level(
/*========================*/
dict_index_t*    index,    /*!< in: index */

如下代碼：

log_event.cc
int Rows_log_event::do_index_scan_and_update(Relay_log_info const *rli)
...
while (record_compare(m_table, &m_cols))
{
while((error= next_record_scan(false)))
{
/* We just skip records that has already been deleted */
if (error == HA_ERR_RECORD_DELETED)
continue;
DBUG_PRINT("info",("no record matching the given row found"));
goto end;
}
}

在沒有主鍵和唯一鍵時候從庫I/O的消耗代價是比主庫更高的，
因此依舊使用的是索引KEY a ，
流程如下：

在從庫增加一個主鍵，
那么在從庫進行應用的時候流程如下:

循環：
主鍵定位 -> 刪除本行

log_event.cc
int Rows_log_event::do_index_scan_and_update(Relay_log_info const *rli)

...

if (m_key_info->flags & HA_NOSAME || m_key_index == m_table->s->primary_key)
{
/* Unique does not have non nullable part */
if (!(m_key_info->flags & (HA_NULL_PART_KEY)))
goto end; // record found
else
{
/*
Unique has nullable part. We need to check if there is any field in the
BI image that is null and part of UNNI.
*/
bool null_found= FALSE;
for (uint i=0; i < m_key_info->user_defined_key_parts && !null_found; i++)
{
uint fieldnr= m_key_info->key_part[i].fieldnr - 1;
Field **f= m_table->field+fieldnr;
null_found= (*f)->is_null();
}

if (!null_found)
goto end; // record found

/* else fall through to index scan */
}
}

循環：
全表掃描 -> 刪除本行

• TABLE_SCAN,INDEX__SCAN (默認值）
• INDEX_SCAN,HASH_SCAN
• TABLE_SCAN,HASH_SCAN
• TABLE_SCANJNDEX_SCAN,HASH_SCAN

在源碼中有如下的說明，當然官方文檔也有類似的說明：

log_event.cc
int
Rows_log_event::row_operations_scan_and_key_setup()
....

根據 ROW_LOOKUP_HASH_SCAN ， ROW_LOOKUP_INDEX_SCAN ，ROW_LOOKUP_TABLE_SCAN 調用

log_event.cc
void
Rows_log_event::decide_row_lookup_algorithm_and_key()
...

/*
Decision table:
- I --> Index scan / search
- T --> Table scan
- Hi --> Hash over index
- Ht --> Hash over the entire table

|--------------+-----------+------+------+------|
| Index\Option | I , T , H | I, T | I, H | T, H |
|--------------+-----------+------+------+------|
| PK / UK | I | I | I | Hi |
| K | Hi | I | Hi | Hi |
| No Index | Ht | T | Ht | Ht |
|--------------+-----------+------+------+------|

*/

源碼中會有三種數據查找的方式，分別是：

• R0W_L00KUP_INDEX_SCAN
對應函數接口：log_event.cc Rows_Log_event::do_index_scan_and_update

對應函數接口：

log_event.cc
int Rows_log_event::do_apply_event(Relay_log_info const *rli) 在源碼中如下：

...
if ((m_rows_lookup_algorithm != ROW_LOOKUP_NOT_NEEDED) &&
!is_any_column_signaled_for_table(table, &m_cols))
{
error= HA_ERR_END_OF_FILE;
goto AFTER_MAIN_EXEC_ROW_LOOP;
}
switch (m_rows_lookup_algorithm)
{
case ROW_LOOKUP_HASH_SCAN:
do_apply_row_ptr= &Rows_log_event::do_hash_scan_and_update;
break;

case ROW_LOOKUP_INDEX_SCAN:
do_apply_row_ptr= &Rows_log_event::do_index_scan_and_update;
break;

case ROW_LOOKUP_TABLE_SCAN:
do_apply_row_ptr= &Rows_log_event::do_table_scan_and_update;
break;

case ROW_LOOKUP_NOT_NEEDED:
DBUG_ASSERT(get_general_type_code() == binary_log::WRITE_ROWS_EVENT);

/* No need to scan for rows, just apply it */
do_apply_row_ptr= &Rows_log_event::do_apply_row;
break;

default:
DBUG_ASSERT(0);
error= 1;
goto AFTER_MAIN_EXEC_ROW_LOOP;
break;
}

將參數‘slave_rows_search_algorithms’設罝為叫 INDEX_SCAN,HASH_SCAN 且表上沒有主鍵和唯一鍵的話，
把數據查找的方式設罝為 ROW_LOOKUP_HASH_SCAN。

總結
主庫沒有Primary Key如果在從庫建立一個Primary Key能降低延遲嗎？
肯定的，因為從庫會根據Event中的行數據進行使用索引的選擇。

因為 mi->clock__diff_with_master 只會在I0線程啟動的時候進行初始化，

從庫的SQL線程（MTS協調線程）和參數 sql_slave_skip_counter

一、流程圖

二、重要步驟說明
1、如果是MTS需要啟動工作線程
每次重新啟動SQL線程（MTS協調線程）的時候 slave_parallel_workers 參數才會生效。

如果沒有設置為TABLE, MySQL就會出現Warning：
Slave SQL for channel If a crash happens this configuration does not guarantee that the relay log info will be consistent, Error_code: 0

3、狀態reading event from the relay log
開始讀取Event的時候會進入狀態：‘reading event from the relay log’。

源碼中如下：
是否超過檢査點周期

rpl_slave.cc

處理rli-> gaq以找出存儲在集中恢復表中的 lwm 位置。
bool mts_checkpoint_routine(Relay_log_info *rli, ulonglong period,
bool force, bool need_data_lock)

set_timespec_nsec(&curr_clock, 0);
ulonglong diff= diff_timespec(&curr_clock, &rli->last_clock);
if (!force && diff < period)
{
/*
We do not need to execute the checkpoint now because
the time elapsed is not enough.
*/
DBUG_RETURN(FALSE);
}

rpl_slave.cc
從中繼日志中讀取下一個事件。應該從從SQL線程中調用。
static Log_event* next_event(Relay_log_info* rli)

是否已經GAQ隊列已經滿了
//如果達到了 GAQ的大小設貿為force強制checkpoint
bool force= (rli->checkpoint_seqno > (rli->checkpoint_group - 1));

源碼注釋如下，‘in a group’就代表在一個事務中：

/* 
If we have reached the limit of the relay space and we
are going to sleep, waiting for more events:

1. If outside a group, SQL thread asks the IO thread 
to force a rotation so that the SQL thread purges 
logs next time it processes an event (thus space is
freed).

2. If in a group, SQL thread asks the IO thread to 
ignore the limit and queues yet one more event 
so that the SQL thread finishes the group and 
is are able to rotate and purge sometime soon.
*/


rpl_slave.cc
relay_log_purge=ON, 非當前relay log讀取完了的Event會進入清理流程。

if (relay_log_purge)    //做relay log的清理
{

if (rli->relay_log.purge_first_log
(rli,
rli->get_group_relay_log_pos() == rli->get_event_relay_log_pos()
&& !strcmp(rli->get_group_relay_log_name(),rli->get_event_relay_log_name())))
//做relay log的清理
{
errmsg = "Error purging processed logs";
goto err;
}
DBUG_PRINT("info", ("next_event group master %s %lu group relay %s %lu event %s %lu\n",
rli->get_group_master_log_name(),
(ulong) rli->get_group_master_log_pos(),
rli->get_group_relay_log_name(),
(ulong) rli->get_group_relay_log_pos(),
rli->get_event_relay_log_name(),
(ulong) rli->get_event_relay_log_pos()));
}
else
{
// 錯誤時候，鎖定relay log
if (rli->relay_log.find_next_log(&rli->linfo, !hot_log))
{
errmsg = "error switching to the next log";
goto err;
}
rli->set_event_relay_log_pos(BIN_LOG_HEADER_SIZE);
rli->set_event_relay_log_name(rli->linfo.log_file_name);

rli->flush_info();
}

如果是讀取完的是當前relay log,則不能進行清理，
會等待10線程的喚醒，
如果是MTS等待喚醒期間還需要進行MTS的檢查點。

rpl_slave.cc
static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
{
......
if ((!rli->is_parallel_exec() || rli->last_master_timestamp == 0) &&    //event header 的timestamp
!(ev->is_artificial_event() || ev->is_relay_log_event() ||
(ev->common_header->when.tv_sec == 0) ||    //query event才有的執行時間
ev->get_type_code() == binary_log::FORMAT_DESCRIPTION_EVENT ||
ev->server_id == 0))
{
rli->last_master_timestamp= ev->common_header->when.tv_sec +
(time_t) ev->exec_time;
//但是對於MTS來講應該注意是最后一個XID EVENT的時間不是這里設置的，
//在 mts_checkpoint_routine 里面
DBUG_ASSERT(rli->last_master_timestamp >= 0);
}

......

注意到其中一個重要條件就是非MTS環境下，即！rli->is_parallel_exec()條件需要滿足。

log_event.cc
Log_event::enum_skip_reason
Log_event::do_shall_skip(Relay_log_info *rli)
{
DBUG_PRINT("info", ("ev->server_id=%lu, ::server_id=%lu,"
" rli->replicate_same_server_id=%d,"
" rli->slave_skip_counter=%d",
(ulong) server_id, (ulong) ::server_id,
rli->replicate_same_server_id,
rli->slave_skip_counter));
if ((server_id == ::server_id && !rli->replicate_same_server_id) ||
(rli->slave_skip_counter == 1 && rli->is_in_group()))
return EVENT_SKIP_IGNORE;
else if (rli->slave_skip_counter > 0)
return EVENT_SKIP_COUNT;
else
return EVENT_SKIP_NOT;
}

9、MTS進行分發，單SQL線程進行應用Event處理：
• MTS進行Event分發給工作線程進行Event的apply。
• 單SQL線程進行Event的應用。

三、各個Event大概做了什么
最終從庫就開始應用Event 了，
開啟GTID的情況下一個簡單的‘Delete’語句生成的Event為例，Event包含：

Event:
1. GTID_EVENT
2. QUERY_EVENT
3. MAP_EVENT
4. DELETE_EVENT
5. XID_EVENT

1. GTID_EVENT
• 接口 log_event.cc int Gtid_log_event::do_apply_event(Relay_log_info const *rli)
• 單SQL線程下回滾partial transaction
• 設置事務的GTID 參考 set_gtid_next 函數。

lib_sql.cc
void *create_embedded_thd(int client_flag)
......
if (thd->variables.max_join_size == HA_POS_ERROR)
thd->variables.option_bits |= OPTION_BIG_SELECTS;

語句進入提交流程后會判斷這個標記如下：

sql_class.h
inline bool in_multi_stmt_transaction_mode() const
{
return variables.option_bits & (OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN);
}

如果設罝了‘BEGIN’標記則不會進入order commit流程。

可以看到對於從庫來講同樣需要開啟讀寫事務上鎖等操作，
如果從庫有相應的DML操作，
SQL線程（MTS工作線程）也可能因為不能獲得鎖而堵塞。

binlog_group_commit_sync_delay 造成延遲

binlog.cc

int MYSQL_BIN_LOG::ordered_commit(THD *thd, bool all, bool skip_commit)

....

// 在sync_binlog= 1 才有這個延遲

if (!flush_error && (sync_counter + 1 >= get_sync_period()))
stage_manager.wait_count_or_timeout(opt_binlog_group_commit_sync_no_delay_count,
opt_binlog_group_commit_sync_delay,
Stage_manager::SYNC_STAGE);

 ...

void Stage_manager::wait_count_or_timeout(ulong count, long usec, StageID stage)
{
long to_wait=
DBUG_EVALUATE_IF("bgc_set_infinite_delay", LONG_MAX, usec);
/*
For testing purposes while waiting for inifinity
to arrive, we keep checking the queue size at regular,
small intervals. Otherwise, waiting 0.1 * infinite
is too long.
*/
long delta=
DBUG_EVALUATE_IF("bgc_set_infinite_delay", 100000,
max<long>(1, (to_wait * 0.1)));　　　　　　// binlog_group_commit_sync_delay*0.1 和 1之間 , 1/1000000 秒 , 這參數隨着增大下面循環消耗越多

while (to_wait > 0 && (count == 0 || static_cast<ulong>(m_queue[stage].get_size()) < count)) 　　 // binlog_group_commit_sync_delay時間消耗完
{
#ifndef DBUG_OFF
if (current_thd)
DEBUG_SYNC(current_thd, "bgc_wait_count_or_timeout");
#endif
my_sleep(delta);
to_wait -= delta;
}
}

總結出event 的offset 關系:

在沒有mysql lib情況下 offset + 1:
event_time: EVENT_TIMESTAMP = 4
event_type: EVENT_TYPE_OFFSET = 5 - EVENT_TIMESTAMP

// 向前走
EVENT_LEN_OFFSET = 13 - EVENT_TYPE_OFFSET = 5 , SEEK_CUR

event_next: LOG_POS_OFFSET = 17 - EVENT_LEN_OFFSET = 13

event_size = event_next - event_pos = ( fseek(fds,event_next,SEEK_SET); ftell(fds); )

Insert Event: 30
Update Event: 31
Delete Event: 32
Map Event: 19
Gtid Event: 33

event_size > 44: 偏移 FLAGS_OFFSET = 19 - LOG_POS_OFFSET = 17 , GTID_FLAG = 26

relay_log_space_limit 設置導致從庫 I/O thread 等待復制延遲 , 從庫的I / O線程在等待relay log 空間時中止

rpl_slave.cc: static Log_event* next_event(Relay_log_info* rli)

如果 rli->log_space_limit >= rli->log_space_total
　　進入rli->relay_log.wait_for_update_relay_log(thd, NULL);
　　　　進入Linux 線程互斥鎖: pthread_cond_wait 和 pthread_cond_timedwait

如果 rli->log_space_limit< rli->log_space_total
　　從庫 的I/O線程 unlock
        mysql_mutex_unlock(&rli->log_space_lock);
        mysql_cond_broadcast(&rli->log_space_cond);

Slave_SQL_Running_State狀態分析(參考八爺文章: " MySQL里Wating for Slave workers to free pending events到底在等什么 " ):

“Waiting for dependent transaction to commit”

由於協調線程判定本事務由於last commit大於current_lwm因此不能並行回放 ( 通過clock_leq 進行比較 )，協調線程處於等待，大事務會加劇這種情況。

rpl_mts_submode.cc:
bool Mts_submode_logical_clock::
wait_for_last_committed_trx(Relay_log_info* rli,
                            longlong last_committed_arg,
                            longlong lwm_estimate_arg)
{......

if ((!rli->info_thd->killed && !is_error) &&
      !clock_leq(last_committed_arg, get_lwm_timestamp(rli, true)))
  {
    PSI_stage_info old_stage;
    struct timespec ts[2];
    set_timespec_nsec(&ts[0], 0);

    DBUG_ASSERT(rli->gaq->len >= 2); // there's someone to wait

    thd->ENTER_COND(&rli->logical_clock_cond, &rli->mts_gaq_LOCK,
                    &stage_worker_waiting_for_commit_parent, &old_stage);
    do
    {
      mysql_cond_wait(&rli->logical_clock_cond, &rli->mts_gaq_LOCK);
    }
　　    while ((!rli->info_thd->killed && !is_error) &&
           !clock_leq(last_committed_arg, estimate_lwm_timestamp()));
    my_atomic_store64(&min_waited_timestamp, SEQ_UNINIT);  // reset waiting flag
    mysql_mutex_unlock(&rli->mts_gaq_LOCK);
    thd->EXIT_COND(&old_stage);
    set_timespec_nsec(&ts[1], 0);
    my_atomic_add64(&rli->mts_total_wait_overlap, diff_timespec(&ts[1], &ts[0]));


"Waiting for slave workers to process their queues"

由於沒有空閑的worker線程，協調線程會等待。這種情況說明理論上的並行度是理想的，
但是可能是參數 slave_parallel_workers 設置不夠。當然設置工作線程的個數應該和服務器的配置和負載相結合考慮。

rpl_mts_submode.cc

Slave_worker *
Mts_submode_logical_clock::get_least_occupied_worker(Relay_log_info *rli,
                                                     Slave_worker_array *ws,
                                                     Log_event * ev)
{
.......

else
  {
    worker= get_free_worker(rli);        // 用於獲取一個沒有分配事件的 Worker。隨着Worker數量的增加，掃描成本越來越高。

    DBUG_ASSERT(ev->get_type_code() != binary_log::USER_VAR_EVENT ||
                rli->curr_group_seen_begin || rli->curr_group_seen_gtid);

    if (worker == NULL)        // 沒有 worker 就觸發 Waiting for slave workers to process their queues
    {
      struct timespec ts[2];

      set_timespec_nsec(&ts[0], 0);
      // Update thd info as waiting for workers to finish.
      thd->enter_stage(&stage_slave_waiting_for_workers_to_process_queue,
                       old_stage,
                       __func__, __FILE__, __LINE__);
      while (!worker && !thd->killed)
      {
        /*
          Busy wait with yielding thread control before to next attempt
          to find a free worker. As of current, a worker
          can't have more than one assigned group of events in its
          queue.

          todo: replace this At-Most-One assignment policy with
                First Available Worker as
                this method clearly can't be considered as optimal.
        */
#if !defined(_WIN32)
        sched_yield();
#else
        my_sleep(rli->mts_coordinator_basic_nap);
#endif
        worker= get_free_worker(rli);
      }
      THD_STAGE_INFO(thd, *old_stage);
      set_timespec_nsec(&ts[1], 0);
      rli->mts_total_wait_worker_avail += diff_timespec(&ts[1], &ts[0]);
      rli->mts_wq_no_underrun_cnt++;
      /*
        Even OPTION_BEGIN is set, the 'BEGIN' event is not dispatched to
        any worker thread. So The flag is removed and Coordinator thread
        will not try to finish the group before abort.
      */
      if (worker == NULL)
        rli->info_thd->variables.option_bits&= ~OPTION_BEGIN;
    }
    if (rli->get_commit_order_manager() != NULL && worker != NULL)
      rli->get_commit_order_manager()->register_trx(worker);
  }
  
  DBUG_ASSERT(ptr_group);
  // assert that we have a worker thread for this event or the slave has
  // stopped.
  DBUG_ASSERT(worker != NULL || thd->killed);
  /* The master my have send  db partition info. make sure we never use them*/
  if (ev->get_type_code() == binary_log::QUERY_EVENT)
    static_cast<Query_log_event*>(ev)->mts_accessed_dbs= 0;

  DBUG_RETURN(worker);
}

總結MySQL服務器層面影響主從復制情況:

1. 從庫的系統時間被修改比當前時間更大(推薦配置時間服務器)

2. 在MTS設置不合理 rpl_slave.cc: slave_checkpoint_period 參數 ( 設置在調用檢查點操作之前允許通過的最大時間（以毫秒為單位），以更新多線程Slave狀態 )

4. 大表DDL: 延遲會從0開始，因為Query Event記錄了准確的執行時間。

5. slave_rows_search_algorithms 參數 , 表沒有主鍵或者唯一鍵，就不要設置INDEX_SCAN,HASH_SCAN

6. rpl_slave.cc: slave_checkpoint_period ( 默認 300ms) 在MTS中的Seconds_Behind_Master計算和檢查點有關。( 參考 • DML (MTS) )

7. 根據線程模式 , binlog事務模式, recovery_relay_log , sync_master_info , sync_relay_log , sync_relay_log_info 進行一個性能判斷,如下:

sync_master_info: 如果sync_master_info的值大於 0，則Slave服務器將在每個sync_master_info事件后更新其主信息存儲庫表。如果是 0，則永遠不會更新表。

9. 長時間不提交的事務可能造成延遲瞬間增加，因為GTID_EVENT和XID_EVENT是提交時間,而其他Event是命令發起的時間。

11. 在舊版本MySQL的主服務器和新版本MySQL的從服務器之間復制。

12. 啟動 I/O 線程后人為的修改從庫服務器的時間

rpl_slave.cc
extern "C" void *handle_slave_sql(void *arg)

......

if (opt_slave_preserve_commit_order && rli->opt_slave_parallel_workers > 0 &&
      opt_bin_log && opt_log_slave_updates)
    commit_order_mngr= new Commit_order_manager(rli->opt_slave_parallel_workers);

  rli->set_commit_order_manager(commit_order_mngr);

14. binlog_row_image設置minimal 在不修改非空唯一索引和主鍵時候的字段時候 binlog是不包含非空唯一索引和主鍵的字段

15. recovery_relay_log = 0 打破 relay log 和binary log的事務是連續性，recovery_relay_log = 1 relay log 和binary log當前滿了也不能切換

19.

......

100. 所以我最終遷移到PXC 或者 MGR 再或者遷移到 PostgreSQL物理流復制延遲時間也是十分低（主從模式）解決上述主從復制延遲問題

待續....

binlog_group_commit_sync_delay 的高值可能會影響多線程從庫中的 slave worker 性能。

if (rli->log_space_limit && rli->log_space_limit < rli->log_space_total)

------------恢復內容結束------------

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 如何走出迷茫期？從原因入手，4個辦法走出迷茫期！ [源碼分析] 從實例和源碼入手看 Flink 之廣播 Broadcast mysql復制那點事(2)-binlog組提交源碼分析和實現 ng-grid入手 Blazor入手教程（五）組件造成數據庫表死鎖的原因分析及解決方案 Blazor入手教程（三）列表和條件渲染 Blazor入手教程（十）部署安裝頁面卡頓，可能是哪些原因造成的？ mysql主從復制原理-binlog