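Many companies today run a MySQL (master) --> MySQL (slave) architecture, and one master with several slaves is a natural extension of it. Some companies run master-master instead, which brings all sorts of problems if it is handled poorly. Every database architecture has its own strengths and weaknesses; whichever one fits your company's business needs and is easy for you to maintain can be considered ideal. When replication breaks, should we blindly use --slave-skip-errors=[error_code] to skip the error? No: doing so can leave the data inconsistent. Below I describe the common MySQL Replication errors and how to handle them.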
I. A failure when a row is updated on the master (master and slave are in sync, binlog format is ROW)
On the slave, simulate a missing row by deleting the record with id=6 first:
root@mysql-slave> select * from test;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  6 | aa   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
4 rows in set (0.00 sec)

root@mysql-slave> delete from test where id=6;
Query OK, 1 row affected (0.00 sec)
Then update the row with id 6 on the master:
root@mysql-master> show variables like 'binlog_format';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | ROW   |
+---------------+-------+
1 row in set (0.00 sec)

root@mysql-master> select * from test;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  6 | aa   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
4 rows in set (0.00 sec)

root@mysql-master> update test set name='AA' where id=6;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
Back on the slave, check whether replication is still healthy:
Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table xuanzhi.test; Can't find record in 'test', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 3704
Skip_Counter: 0
Exec_Master_Log_Pos: 3529
Relay_Log_Space: 4183
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Update_rows event on table xuanzhi.test; Can't find record in 'test', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 3704
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
1 row in set (0.00 sec)
As you can see, replication has stopped. Use the slave's error message to find out what the master's binlog actually did: the failing event is in mysql-bin.000004, at end_log_pos=3704.
[root@localhost ~]# /usr/local/services/mysql/bin/mysqlbinlog -v --base64-output=DECODE-ROWS /data/mysql/data/mysql-bin.000004 | grep -A '10' 3704
#150610 22:33:08 server id 1  end_log_pos 3704  Update_rows: table id 34 flags: STMT_END_F
### UPDATE xuanzhi.test
### WHERE
###   @1=6
###   @2='aa'
###   @3=10002011
### SET
###   @1=6
###   @2='AA'
###   @3=10002011
# at 3704
#150610 22:33:08 server id 1  end_log_pos 3731  Xid = 89
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
The event is an UPDATE on xuanzhi.test for the row with id=6. Look up that row on the slave:
root@mysql-slave> select * from xuanzhi.test where id=6;
Empty set (0.00 sec)
The slave has no such row. Back on the master, check the row with id=6:
root@mysql-master> select * from xuanzhi.test where id=6;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  6 | AA   | 10002011 |
+----+------+----------+
1 row in set (0.00 sec)
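The diagnosis above starts from the Last_SQL_Error text. As a minimal sketch (not part of the original post), the fields needed for the mysqlbinlog lookup could be pulled out with a regular expression. The function name parse_sql_error and the pattern are assumptions, tailored to the 1032 and 1062 messages shown in this post; other error texts will simply not match.

```python
import re

# Pattern for the row-event errors shown in this post (1032 and 1062).
# It is tailored to those messages and is not a general parser.
ERROR_RE = re.compile(
    r"Could not execute (?P<event>\w+) event on table (?P<table>[\w.]+); "
    r".*?Error_code: (?P<code>\d+); "
    r".*?master log (?P<log_file>[\w.\-]+), end_log_pos (?P<end_log_pos>\d+)"
)


def parse_sql_error(last_sql_error):
    """Return the key fields of Last_SQL_Error, or None if it does not match."""
    match = ERROR_RE.search(last_sql_error)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"])
    fields["end_log_pos"] = int(fields["end_log_pos"])
    return fields


# The error text reported by the slave in this section.
error = (
    "Could not execute Update_rows event on table xuanzhi.test; "
    "Can't find record in 'test', Error_code: 1032; "
    "handler error HA_ERR_KEY_NOT_FOUND; "
    "the event's master log mysql-bin.000004, end_log_pos 3704"
)
info = parse_sql_error(error)
print(info)
```

The log_file and end_log_pos values are the two arguments used in the mysqlbinlog and grep command shown earlier.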
So how do we fix replication? Insert the missing row on the slave:
root@mysql-slave> stop slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> insert into test (id,name,code) values (6,'AA',10002011);
Query OK, 1 row affected (0.00 sec)

root@mysql-slave> start slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.132
Master_User: root
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000004
Read_Master_Log_Pos: 3731
Relay_Log_File: localhost-relay-bin.000004
Relay_Log_Pos: 253
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
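Replication is back to normal. If many rows are missing, use pt-table-checksum to verify data consistency. Many readers will wonder why a slave would be missing data in the first place. I have summarized a few causes below; there are of course others:
1. When someone runs set session sql_log_bin=0, statements in that session are not written to the binlog.
2. The slave is not set to read only, and someone ran a delete on the slave.
3. After reading the master's binlog, the slave has to persist three files: relay log, relay log info, and master info. If these files are not flushed to disk promptly, a host crash leaves the data inconsistent.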
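II. Probably the most common error: error code 1062, primary key conflict
Insert a row on the slave to simulate stale data that still exists on the slave but not on the master (id is the auto-increment primary key here):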
root@mysql-slave> insert into test value (5,'zz',10002010);
Query OK, 1 row affected (0.00 sec)

root@mysql-slave> select * from test;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  5 | zz   | 10002010 |
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
5 rows in set (0.00 sec)
On the master, insert a row with id 5:
root@mysql-master> select * from test;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
4 rows in set (0.00 sec)

root@mysql-master> insert into test value (5,'ZZ',10002010);
Query OK, 1 row affected (0.00 sec)
Back on the slave, check the replication status:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Could not execute Write_rows event on table xuanzhi.test; Duplicate entry '5' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000004, end_log_pos 3893
Skip_Counter: 0
Exec_Master_Log_Pos: 3731
Relay_Log_Space: 2322
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1062
Last_SQL_Error: Could not execute Write_rows event on table xuanzhi.test; Duplicate entry '5' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000004, end_log_pos 3893
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
1 row in set (0.00 sec)
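The status shows a 1062 primary key conflict on table xuanzhi.test. Whose data should win? The master's, of course, so we need to delete the row with primary key 5 on the slave. Before deleting, run desc on the table to confirm which column is the auto-increment primary key: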
root@mysql-slave> stop slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> desc xuanzhi.test;
+-------+----------+------+-----+---------+----------------+
| Field | Type     | Null | Key | Default | Extra          |
+-------+----------+------+-----+---------+----------------+
| id    | int(11)  | NO   | PRI | NULL    | auto_increment |
| name  | char(10) | YES  |     | NULL    |                |
| code  | int(20)  | YES  |     | NULL    |                |
+-------+----------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

root@mysql-slave> delete from xuanzhi.test where id=5;
Query OK, 1 row affected (0.00 sec)

root@mysql-slave> start slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.132
Master_User: root
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000004
Read_Master_Log_Pos: 3920
Relay_Log_File: localhost-relay-bin.000005
Relay_Log_Pos: 253
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
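Some will point out that with many primary key conflicts, clearing the conflicting rows by hand does not scale. That is true, so I wrote a script for it; see my post on resolving error 1062 in master-slave replication. Others will wonder whether the row gets replicated again once it has been deleted and the SQL thread restarted. It does: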
root@mysql-slave> select * from test;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  5 | ZZ   | 10002010 |
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
5 rows in set (0.00 sec)
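The row with primary key 5 has been replicated. When a replica restarts after an unplanned shutdown, it reads the master.info file to find where replication stopped last time. Unfortunately, that file may not have been synced to disk: the information is held in a cache and may not have been flushed to master.info. The information in the file may therefore be wrong, and the replica may try to re-execute some binary log events, which can cause primary key conflicts. This is the 1062 error we see so often. Unless you can determine where the replica actually stopped (which is hard), the only option is to skip those errors.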
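The author's cleanup script is not reproduced in this post. As a rough sketch of the idea only (not the author's script), the conflicting key can be read out of the 1062 message and turned into a DELETE for the slave. The function name delete_for_duplicate and the pk_column argument are assumptions; the sketch assumes a single-column primary key and the message format shown above, and it only prints the statement so that it can be reviewed before being run.

```python
import re

# Matches the 1062 message format shown above. Assumes a single-column
# primary key; composite keys print several values joined by '-'.
DUPLICATE_RE = re.compile(
    r"on table (?P<table>[\w.]+); "
    r"Duplicate entry '(?P<value>.*?)' for key 'PRIMARY', Error_code: 1062"
)


def delete_for_duplicate(last_sql_error, pk_column):
    """Build a DELETE for the conflicting slave row, or None if no match."""
    match = DUPLICATE_RE.search(last_sql_error)
    if match is None:
        return None
    # Escape backslashes and single quotes so the value is a safe literal.
    value = match.group("value").replace("\\", "\\\\").replace("'", "''")
    return "DELETE FROM {} WHERE {} = '{}';".format(
        match.group("table"), pk_column, value
    )


error = (
    "Could not execute Write_rows event on table xuanzhi.test; "
    "Duplicate entry '5' for key 'PRIMARY', Error_code: 1062; "
    "handler error HA_ERR_FOUND_DUPP_KEY; "
    "the event's master log mysql-bin.000004, end_log_pos 3893"
)
statement = delete_for_duplicate(error, "id")
print(statement)  # DELETE FROM xuanzhi.test WHERE id = '5';
```

The key value is emitted as a quoted literal because the error text does not carry the column type; MySQL converts it when comparing against an integer column.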
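III. A failure when a row is deleted on the master
If a row is deleted on the master but does not exist on the slave, will replication stop? Let's find out.
Delete a row on the slave to simulate the slave having less data than the master: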
root@mysql-slave> select * from test;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  5 | ZZ   | 10002010 |
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
5 rows in set (0.00 sec)

root@mysql-slave> delete from test where id=5;
Query OK, 1 row affected (0.00 sec)
Delete the same row on the master; at this point it no longer exists on the slave:
root@mysql-master> select * from test;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  5 | ZZ   | 10002010 |
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
5 rows in set (0.00 sec)

root@mysql-master> delete from test where id = 5;
Query OK, 1 row affected (0.00 sec)
Back on the slave, the status shows that replication has stopped:
Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table xuanzhi.test; Can't find record in 'test', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 4082
Skip_Counter: 0
Exec_Master_Log_Pos: 3920
Relay_Log_Space: 937
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Delete_rows event on table xuanzhi.test; Can't find record in 'test', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 4082
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
1 row in set (0.00 sec)
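The key term in the error message is Delete_rows, which makes this easy to handle: the master's operation is a delete, so it does not matter that the slave lacks the row, and we can simply skip the error on the slave. (When this happens, treat it as a warning and check whether more data is missing.)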
root@mysql-slave> stop slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> set global sql_slave_skip_counter=1;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> start slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.132
Master_User: root
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000004
Read_Master_Log_Pos: 4109
Relay_Log_File: localhost-relay-bin.000006
Relay_Log_Pos: 253
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
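Replication is healthy again. Note that if you delete a row on the master that the master itself does not have, replication does not stop: in ROW format no row was changed, so there is no row event for the slave to fail on.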
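The three cases so far can be collected into a small lookup. This is only an illustrative summary of the handling described in sections I to III, assuming ROW-format binlogs; the function name suggest_action and the wording of the suggestions are hypothetical, and anything not covered falls through to manual investigation.

```python
# Handling described in sections I to III, keyed by (error code, event type).
ACTIONS = {
    (1032, "Update_rows"): "insert the missing row on the slave using the master's "
                           "current values, then restart the SQL thread",
    (1032, "Delete_rows"): "skip the event with sql_slave_skip_counter=1, "
                           "then check for other missing rows",
    (1062, "Write_rows"): "delete the conflicting row on the slave, "
                          "then restart the SQL thread",
}


def suggest_action(error_code, event_type):
    """Return the handling this post describes, or a fallback for other errors."""
    return ACTIONS.get((error_code, event_type), "not covered here; investigate by hand")


print(suggest_action(1032, "Delete_rows"))
print(suggest_action(1062, "Write_rows"))
```

In every case the post's advice still applies: treat the error as a sign that the slave may have drifted and verify the data afterwards, for example with pt-table-checksum.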
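IV. The slave's relay log is corrupted
Now simulate the slave going down with a corrupted relay log, so that replication cannot proceed.
After a power loss, once the slave is restarted and slave start is run, the status reports that the log cannot be read or is corrupted (a sudden power loss does not necessarily corrupt the slave or lose data, provided the configuration parameters are sensible):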
root@mysql-slave> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.10.132
Master_User: root
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000006
Read_Master_Log_Pos: 401269011
Relay_Log_File: localhost-relay-bin.000010
Relay_Log_Pos: 439914363
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1594
Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
Skip_Counter: 0
Exec_Master_Log_Pos: 590788350
Relay_Log_Space: 2398604371
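Key fields in show slave status:
Slave_IO_Running: whether the I/O thread, which receives binlog events from the master, is running
Master_Log_File: the name of the master binlog file currently being read
Read_Master_Log_Pos: the position in that master binlog file up to which the slave has read
Slave_SQL_Running: whether the SQL thread, which applies the writes, is running
Relay_Master_Log_File: the name of the master binlog file currently being applied
Exec_Master_Log_Pos: the position in that binlog file up to which events have been applied
When the relay log is corrupted, use Relay_Master_Log_File and Exec_Master_Log_Pos as the reference point. In the output above, Relay_Master_Log_File is mysql-bin.000004 and Exec_Master_Log_Pos is 590788350, so what we need to do is run change master: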
root@mysql-slave> stop slave;
Query OK, 0 rows affected (0.01 sec)

root@mysql-slave> change master to master_host='192.168.10.132',master_port=3306,master_user='root',master_password='123456',master_log_file='mysql-bin.000004',master_log_pos=590788350;
Query OK, 0 rows affected (0.04 sec)

root@mysql-slave> start slave;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Connecting to master
Master_Host: 192.168.10.132
Master_User: root
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000004
Read_Master_Log_Pos: 590788573
Relay_Log_File: localhost-relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 590788573
Relay_Log_Space: 107
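Running change master this way discards all relay logs on disk; the slave fetches the events again from the master, starting at the given position.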
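Picking the right file and position is the step that is easy to get wrong, since Master_Log_File and Relay_Master_Log_File can differ, as they do in the output above. A minimal sketch, assuming the vertical status output has been saved as text: parse it into a dictionary and build the statement from Relay_Master_Log_File and Exec_Master_Log_Pos. The function names are hypothetical, and the sketch assumes the host, port, user, and password are left unchanged, so it only sets the log coordinates.

```python
def parse_slave_status(text):
    """Turn the 'Key: value' lines of vertical SHOW SLAVE STATUS output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Skip separators and footers such as '1 row in set (0.00 sec)'.
        if sep and key and " " not in key:
            status[key] = value.strip()
    return status


def build_change_master(status):
    """Build CHANGE MASTER TO from the position the SQL thread had applied."""
    log_file = status["Relay_Master_Log_File"]
    log_pos = int(status["Exec_Master_Log_Pos"])
    return "CHANGE MASTER TO master_log_file='{}', master_log_pos={};".format(
        log_file, log_pos
    )


# A shortened copy of the broken status shown in this section.
sample = """
Master_Log_File: mysql-bin.000006
Read_Master_Log_Pos: 401269011
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1594
Last_Error: Relay log read failure: Could not parse relay log event entry.
Exec_Master_Log_Pos: 590788350
"""
status = parse_slave_status(sample)
statement = build_change_master(status)
print(statement)
```

The sketch only prints the statement; the slave threads still have to be stopped before it is run and started again afterwards.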
If you see the following error, resolve it the same way:
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1593
Last_Error: Error initializing relay log position: I/O error reading the header from the binary log
Skip_Counter: 0
Exec_Master_Log_Pos: 59078
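Does repairing the relay log this way seem like a hassle? MySQL 5.5 already accounts for relay log corruption after a slave crash: just add the parameter relay_log_recovery=1 to the slave's my.cnf.
Summary:
1. When replication stops, do not blindly use --slave-skip-errors=[error_code] to skip the error; doing so easily leads to inconsistent data.
2. With binlog format STATEMENT, updating or deleting a row on the master that the slave does not have does not stop replication, because the slave simply re-executes the statement and it matches no rows there.
3. Check data integrity regularly; pt-table-checksum can verify that master and slave data are consistent. For a company, data integrity is without doubt the most important thing.
4. On the slave, enable the important options: read only, relay_log_recovery, sync_master_info, sync_relay_log_info, and sync_relay_log.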
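As a hedged sketch of how the last recommendation could be checked, the following reads the text of a my.cnf and reports which of the five options are absent from, or switched off in, the [mysqld] section. The function name missing_slave_options and the sample file are assumptions, and the check is deliberately naive: it does not follow !include directives or look at the running server's variables.

```python
import configparser

# The options recommended in the summary above.
RECOMMENDED = (
    "read_only",
    "relay_log_recovery",
    "sync_master_info",
    "sync_relay_log_info",
    "sync_relay_log",
)


def missing_slave_options(my_cnf_text):
    """Return the recommended options not enabled in the [mysqld] section."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(my_cnf_text)
    if not parser.has_section("mysqld"):
        return list(RECOMMENDED)
    enabled = set()
    for name, value in parser.items("mysqld"):
        # my.cnf accepts both dashes and underscores in option names.
        normalized = name.replace("-", "_")
        if value is None or value.strip().lower() not in ("0", "off", "false"):
            enabled.add(normalized)
    return [option for option in RECOMMENDED if option not in enabled]


# Hypothetical slave configuration used only for illustration.
sample = """
[mysqld]
server-id = 2
read_only
relay-log-recovery = 1
sync_master_info = 0
"""
missing = missing_slave_options(sample)
print(missing)
```

A value of 0 is treated as disabled because the sync_* options count events between syncs, and 0 turns the explicit sync off.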
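Author: 陸炫志. Source: xuanzhi's blog, http://www.cnblogs.com/xuanzhi201111. Your support is the greatest encouragement to the author; thank you for reading. Copyright belongs to the author; reposting is welcome, but please keep this notice.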
