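Preface
After a failover, the problem we run into most often is replication errors. When the database is small, dumping it and re-importing is a quick fix, but our production databases are 150G-200G each, and that approach alone is too costly. After some time exploring alternatives, I have summarized several ways to handle these errors.
Production environment architecture diagram
The current production architecture keeps two copies of the data in a high-availability cluster built on asynchronous replication, with two machines serving external traffic. When a failure occurs, we switch over to the slave and promote it to master; the failed machine then replicates in the reverse direction from the new master. During failure handling, the most frequent problem is master-slave replication errors. Below are the error messages I have collected.
Common errors
The three most common cases
These three cases occur during an HA switchover. Because replication is asynchronous and sync_binlog=0, a small portion of the binlog may not have been fully received, which causes replication errors.
Case 1: a record is deleted on the master, but it cannot be found on the slave.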
Last_SQL_Error: Could not execute Delete_rows event on table hcy.t1;
Can't find record in 't1',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND;
the event's master log mysql-bin.000006, end_log_pos 254
Case 2: duplicate primary key. The record already exists on the slave, and the same record is then inserted on the master.
Last_SQL_Error: Could not execute Write_rows event on table hcy.t1;
Duplicate entry '2' for key 'PRIMARY',
Error_code: 1062;
handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000006, end_log_pos 924
Case 3: a record is updated on the master, but it cannot be found on the slave; the data has been lost.
Last_SQL_Error: Could not execute Update_rows event on table hcy.t1;
Can't find record in 't1',
Error_code: 1032;
handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000010, end_log_pos 263
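Asynchronous vs. semi-synchronous replication
Asynchronous replication
Put simply, the master sends the binlog and considers the operation finished, regardless of whether the slave has fully received it or finished executing it.
Semi-synchronous replication
Put simply, the master sends the binlog, and the slave signals the master once it has fully received it, regardless of whether it has finished executing it; only then is the operation finished. (The code was written by Google and became officially available in MySQL 5.5.)
Drawback of asynchronous replication
When the master is busy with writes, its current position might be, say, 10, while the slave's IO_THREAD has only received up to 3. If the master crashes at that moment, the 7 positions that were never sent to the slave are lost.
Special cases
The slave's relay log (relay-bin) is corrupted.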
Last_SQL_Error: Error initializing relay log position: I/O error reading the header from the binary log
Last_SQL_Error: Error initializing relay log position: Binlog has bad magic number;
It's not a binary log file that can be used by this version of MySQL
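This happens when the slave crashes or is shut down uncleanly (for example a power failure or a burnt-out motherboard), which corrupts the relay log and stops replication.
Human error, be careful: multiple slaves with the same server-id
In this case replication lags forever and never catches up, and the two lines below keep appearing in the error log. The fix is to give each slave a different server-id.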
Slave: received end packet from server, apparent master shutdown:
Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000012' at postion 106
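A quick way to spot this mistake is to collect the server_id of every slave and look for repeats. The sketch below is only an illustration: it assumes a hypothetical input of one "host server_id" pair per line (gathered, for example, by running SELECT @@server_id on each host), and the function name find_duplicate_server_ids is made up for this example.

```shell
#!/bin/sh
# Print every server_id that is used by more than one host,
# followed by the comma-separated hosts that share it.
# Input on stdin: lines of "<host> <server_id>" (assumed format).
find_duplicate_server_ids() {
    awk '
        NF >= 2 {
            count[$2]++
            hosts[$2] = (hosts[$2] == "" ? $1 : hosts[$2] "," $1)
        }
        END {
            for (id in count)
                if (count[id] > 1)
                    print id, hosts[id]
        }
    ' | sort
}
```

Empty output means every host in the list has its own server_id.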
Handling the problems
Delete failure
A record is deleted on the master, but it cannot be found on the slave.
Last_SQL_Error: Could not execute Delete_rows event on table hcy.t1;
Can't find record in 't1',
Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND;
the event's master log mysql-bin.000006, end_log_pos 254
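Solution:
The master wants to delete a record that the slave cannot find, hence the error. Since the master has already deleted it, the slave can simply skip the event. Use these commands: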
stop slave;
set global sql_slave_skip_counter=1;
start slave;
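If this happens a lot, you can use a script I wrote, skip_error_replcation.sh, which skips 10 errors by default. It only skips this particular case; for any other error it prints the error and waits for manual handling. The script is written in shell and modeled on mk-slave-restart from the maatkit toolkit, with some behavior of my own: it does not blindly skip every error.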
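The script itself is not reproduced in this article. As a rough sketch of the idea only (this is not the actual skip_error_replcation.sh), the functions below skip an event only when the error is a Delete_rows event failing with error 1032, and stop at anything else. MYSQL_CMD is an assumed variable holding the mysql client command with credentials, and both function names are hypothetical.

```shell
#!/bin/sh
# Return 0 only when the Last_SQL_Error text describes a Delete_rows event
# that failed with error 1032 (row not found on the slave).
is_skippable_delete_error() {
    case "$1" in
        *"Delete_rows event"*"Error_code: 1032"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Skip at most $1 such errors (default 10) and print how many were skipped.
# Stops at the first error that is not a skippable delete failure.
skip_delete_errors() {
    max=${1:-10}
    n=0
    while [ "$n" -lt "$max" ]; do
        err=$($MYSQL_CMD -e 'show slave status\G' | grep 'Last_SQL_Error:')
        is_skippable_delete_error "$err" || break
        $MYSQL_CMD -e 'stop slave; set global sql_slave_skip_counter=1; start slave;'
        sleep 1    # give the SQL thread time to reach the next event
        n=$((n + 1))
    done
    echo "$n"
}
```

After it returns, check show slave status again: if the loop stopped early, whatever error remains is one that needs manual handling.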
Duplicate primary key
The record already exists on the slave, and the same record is then inserted on the master.
Last_SQL_Error: Could not execute Write_rows event on table hcy.t1;
Duplicate entry '2' for key 'PRIMARY',
Error_code: 1062;
handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000006, end_log_pos 924
Solution:
On the slave, run desc hcy.t1; to look at the table structure first:
mysql> desc hcy.t1;
+-------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| id | int(11) | NO | PRI | 0 | |
| name | char(4) | YES | | NULL | |
+-------+---------+------+-----+---------+-------+
Delete the row with the duplicate primary key on the slave:
mysql> delete from t1 where id=2;
Query OK, 1 row affected (0.00 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G;
……
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
……
mysql> select * from t1 where id=2;
Then check the row on both the master and the slave to confirm.
Lost update
A record is updated on the master, but it cannot be found on the slave; the data has been lost.
Last_SQL_Error: Could not execute Update_rows event on table hcy.t1;
Can't find record in 't1',
Error_code: 1032;
handler error HA_ERR_KEY_NOT_FOUND;
the event's master log mysql-bin.000010, end_log_pos 794
Solution:
On the master, use mysqlbinlog to see what the failing binlog event was doing.
/usr/local/mysql/bin/mysqlbinlog --no-defaults -v -v --base64-output=DECODE-ROWS mysql-bin.000010 | grep -A '10' 794
#120302 12:08:36 server id 22 end_log_pos 794 Update_rows: table id 33 flags: STMT_END_F
### UPDATE hcy.t1
### WHERE
### @1=2 /* INT meta=0 nullable=0 is_null=0 */
### @2='bbc' /* STRING(4) meta=65028 nullable=1 is_null=0 */
### SET
### @1=2 /* INT meta=0 nullable=0 is_null=0 */
### @2='BTV' /* STRING(4) meta=65028 nullable=1 is_null=0 */
# at 794
#120302 12:08:36 server id 22 end_log_pos 821 Xid = 60
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
On the slave, look up the record being updated; it should not exist.
mysql> select * from t1 where id=2;
Empty set (0.00 sec)
Then check on the master:
mysql> select * from t1 where id=2;
+----+------+
| id | name |
+----+------+
| 2 | BTV |
+----+------+
1 row in set (0.00 sec)
Fill in the missing row on the slave, then skip the error.
mysql> insert into t1 values (2,'BTV');
Query OK, 1 row affected (0.00 sec)
mysql> select * from t1 where id=2;
+----+------+
| id | name |
+----+------+
| 2 | BTV |
+----+------+
1 row in set (0.00 sec)
mysql> stop slave ;set global sql_slave_skip_counter=1;start slave;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G;
……
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
……
Corrupted relay log
The slave's relay log (relay-bin) is corrupted.
Last_SQL_Error: Error initializing relay log position: I/O error reading the header from the binary log
Last_SQL_Error: Error initializing relay log position: Binlog has bad magic number;
It's not a binary log file that can be used by this version of MySQL
Manual repair
Solution: find the binlog file and position that replication has reached, then set up replication again from there; this generates fresh relay logs.
Example:
mysql> show slave status\G;
*************************** 1. row ***************************
Master_Log_File: mysql-bin.000010
Read_Master_Log_Pos: 1191
Relay_Log_File: vm02-relay-bin.000005
Relay_Log_Pos: 253
Relay_Master_Log_File: mysql-bin.000010
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1593
Last_Error: Error initializing relay log position: I/O error reading the header from the binary log
Skip_Counter: 1
Exec_Master_Log_Pos: 821
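Slave_IO_Running: receives binlog events from the master
Master_Log_File
Read_Master_Log_Pos
Slave_SQL_Running: applies the writes
Relay_Master_Log_File
Exec_Master_Log_Pos
Use the binlog file and position of the executed writes, that is, Relay_Master_Log_File and Exec_Master_Log_Pos.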
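Picking the wrong pair of fields by hand is an easy mistake, so the statement can be generated from the status output instead. This is a minimal sketch, assuming the output of show slave status\G is piped in on stdin; the function name change_master_from_status is made up for this example.

```shell
#!/bin/sh
# Read `show slave status\G` output on stdin and print the CHANGE MASTER TO
# statement that restarts replication from the last executed position.
# Exits non-zero if either field is missing from the input.
change_master_from_status() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }                      # drop the leading padding
        $1 == "Relay_Master_Log_File" { file = $2 }
        $1 == "Exec_Master_Log_Pos"   { pos = $2 }
        END {
            if (file == "" || pos == "") exit 1
            printf "CHANGE MASTER TO MASTER_LOG_FILE=\047%s\047,MASTER_LOG_POS=%s;\n", file, pos
        }
    '
}
```

It deliberately ignores Master_Log_File and Read_Master_Log_Pos, which describe what the IO thread has received rather than what has been executed.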
Relay_Master_Log_File: mysql-bin.000010
Exec_Master_Log_Pos: 821
mysql> stop slave;
Query OK, 0 rows affected (0.01 sec)
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000010',MASTER_LOG_POS=821;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.8.22
Master_User: repl
Master_Port: 3306
Connect_Retry: 10
Master_Log_File: mysql-bin.000010
Read_Master_Log_Pos: 1191
Relay_Log_File: vm02-relay-bin.000002
Relay_Log_Pos: 623
Relay_Master_Log_File: mysql-bin.000010
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1191
Relay_Log_Space: 778
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Ibbackup
Every other trick has been tried, but the slave has simply lost too much data. Time for ibbackup (which costs money) to take the stage.
Ibbackup is a paid hot backup tool. xtrabackup is free and offers the same functionality.
Ibbackup does not lock tables during the backup. It opens a transaction when the backup starts (effectively a snapshot) and records a checkpoint; changes made after that are saved in the ibbackup_logfile file, and during recovery the changes in ibbackup_logfile are applied to ibdata.
Ibbackup only backs up the data (ibdata, .ibd); the .frm table definition files are not backed up.
Here is a demonstration:
Backup: ibbackup /bak/etc/my_local.cnf /bak/etc/my_bak.cnf
Restore: ibbackup --apply-log /bak/etc/my_bak.cnf
[root@vm01 etc]# more my_local.cnf
datadir =/usr/local/mysql/data
innodb_data_home_dir = /usr/local/mysql/data
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = /usr/local/mysql/data
innodb_buffer_pool_size = 100M
innodb_log_file_size = 5M
innodb_log_files_in_group=2
[root@vm01 etc]# ibbackup /bak/etc/my_local.cnf /bak/etc/my_bak.cnf
InnoDB Hot Backup version 3.0.0; Copyright 2002-2005 Innobase Oy
License A21488 is granted to vm01 (chunyang_he@126.com)
(--apply-log works in any computer regardless of the hostname)
Licensed for use in a computer whose hostname is 'vm01'
Expires 2012-5-1 (year-month-day) at 00:00
See http://www.innodb.com for further information
Type ibbackup --license for detailed license terms, --help for help
Contents of /bak/etc/my_local.cnf:
innodb_data_home_dir got value /usr/local/mysql/data
innodb_data_file_path got value ibdata1:10M:autoextend
datadir got value /usr/local/mysql/data
innodb_log_group_home_dir got value /usr/local/mysql/data
innodb_log_files_in_group got value 2
innodb_log_file_size got value 5242880
Contents of /bak/etc/my_bak.cnf:
innodb_data_home_dir got value /bak/data
innodb_data_file_path got value ibdata1:10M:autoextend
datadir got value /bak/data
innodb_log_group_home_dir got value /bak/data
innodb_log_files_in_group got value 2
innodb_log_file_size got value 5242880
ibbackup: Found checkpoint at lsn 0 1636898
ibbackup: Starting log scan from lsn 0 1636864
120302 16:47:43 ibbackup: Copying log...
120302 16:47:43 ibbackup: Log copied, lsn 0 1636898
ibbackup: We wait 1 second before starting copying the data files...
120302 16:47:44 ibbackup: Copying /usr/local/mysql/data/ibdata1
ibbackup: A copied database page was modified at 0 1636898
ibbackup: Scanned log up to lsn 0 1636898
ibbackup: Was able to parse the log up to lsn 0 1636898
ibbackup: Maximum page number for a log record 0
120302 16:47:46 ibbackup: Full backup completed!
[root@vm01 etc]#
[root@vm01 etc]# cd /bak/data/
[root@vm01 data]# ls
ibbackup_logfile ibdata1
[root@vm01 data]# ibbackup --apply-log /bak/etc/my_bak.cnf
InnoDB Hot Backup version 3.0.0; Copyright 2002-2005 Innobase Oy
License A21488 is granted to vm01 (chunyang_he@126.com)
(--apply-log works in any computer regardless of the hostname)
Licensed for use in a computer whose hostname is 'vm01'
Expires 2012-5-1 (year-month-day) at 00:00
See http://www.innodb.com for further information
Type ibbackup --license for detailed license terms, --help for help
Contents of /bak/etc/my_bak.cnf:
innodb_data_home_dir got value /bak/data
innodb_data_file_path got value ibdata1:10M:autoextend
datadir got value /bak/data
innodb_log_group_home_dir got value /bak/data
innodb_log_files_in_group got value 2
innodb_log_file_size got value 5242880
120302 16:48:38 ibbackup: ibbackup_logfile's creation parameters:
ibbackup: start lsn 0 1636864, end lsn 0 1636898,
ibbackup: start checkpoint 0 1636898
ibbackup: start checkpoint 0 1636898
InnoDB: Doing recovery: scanned up to log sequence number 0 1636898
InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 .....99
Setting log file size to 0 5242880
ibbackup: We were able to parse ibbackup_logfile up to
ibbackup: lsn 0 1636898
ibbackup: Last MySQL binlog file position 0 1191, file name ./mysql-bin.000010
ibbackup: The first data file is '/bak/data/ibdata1'
ibbackup: and the new created log files are at '/bak/data/'
120302 16:48:38 ibbackup: Full backup prepared for recovery successfully!
[root@vm01 data]# ls
ibbackup_logfile ibdata1 ib_logfile0 ib_logfile1
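Copy ibdata1, ib_logfile0 and ib_logfile1 to the slave, copy the .frm files over as well, start MySQL, and then set up replication. The position to use is the one printed in the output above: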
ibbackup: Last MySQL binlog file position 0 1191, file name ./mysql-bin.000010
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000010',MASTER_LOG_POS=1191;
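The same statement can be derived from the ibbackup output rather than copied by hand. The sketch below assumes the line format shown above and, like the example, takes the second number as the position (the first number is 0 in this output; what a non-zero value would mean is not covered here). The function name change_master_from_ibbackup is hypothetical.

```shell
#!/bin/sh
# Read `ibbackup --apply-log` output on stdin and print a CHANGE MASTER TO
# statement built from the "Last MySQL binlog file position" line.
change_master_from_ibbackup() {
    awk '
        /Last MySQL binlog file position/ {
            pos = $(NF - 3); sub(/,$/, "", pos)     # "1191," -> "1191"
            file = $NF;      sub(/^.*\//, "", file) # "./mysql-bin.000010" -> "mysql-bin.000010"
            printf "CHANGE MASTER TO MASTER_LOG_FILE=\047%s\047,MASTER_LOG_POS=%s;\n", file, pos
        }
    '
}
```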
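Maatkit toolkit
http://www.maatkit.org/
Introduction
maatkit is an open-source toolkit that helps with day-to-day MySQL administration. It has since been acquired by Percona, which now maintains it. Among its tools:
mk-table-checksum checks whether table structure and data are consistent between master and slave.
mk-table-sync repairs the data when master and slave have diverged.
I have no experience running these two tools in production; this is only an exploration of the technique. The following shows how to use them.
http://www.actionsky.com/products/mysql-others/maatkit.jsp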
[root@vm02]# mk-table-checksum h=vm01,u=admin,p=123456 h=vm02,u=admin,p=123456 -d hcy -t t1
Cannot connect to MySQL because the Perl DBI module is not installed or not found.
Run 'perl -MDBI' to see the directories that Perl searches for DBI.
If DBI is not installed, try:
Debian/Ubuntu apt-get install libdbi-perl
RHEL/CentOS yum install perl-DBI
OpenSolaris pgk install pkg:/SUNWpmdbi
The message says the perl-DBI module is missing, so simply run yum install perl-DBI.
[root@vm02 bin]# mk-table-checksum h=vm01,u=admin,p=123456 h=vm02,u=admin,p=123456 -d hcy -t t1
DATABASE TABLE CHUNK HOST ENGINE COUNT CHECKSUM TIME WAIT STAT LAG
hcy t1 0 vm02 InnoDB NULL 1957752020 0 0 NULL NULL
hcy t1 0 vm01 InnoDB NULL 1957752020 0 0 NULL NULL
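If the table data is inconsistent, the CHECKSUM values differ.
The output columns mean:
DATABASE: database name
TABLE: table name
CHUNK: chunk number within the table (0 when the table is checksummed as a whole)
HOST: MySQL host
ENGINE: storage engine
COUNT: number of rows in the table
CHECKSUM: checksum value
TIME: time taken
WAIT: time spent waiting
STAT: return value of MASTER_POS_WAIT()
LAG: slave lag
To list only the tables whose checksums differ, pipe the output into the mk-checksum-filter tool.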
[root@vm02 ~]# mk-table-checksum h=vm01,u=admin,p=123456 h=vm02,u=admin,p=123456 -d hcy | mk-checksum-filter
hcy t2 0 vm01 InnoDB NULL 1957752020 0 0 NULL NULL
hcy t2 0 vm02 InnoDB NULL 1068689114 0 0 NULL NULL
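To make the comparison concrete, here is a small sketch of the same kind of filtering done with awk. It is not how mk-checksum-filter is implemented; it simply assumes the column layout shown above (DATABASE, TABLE, CHUNK in columns 1 to 3 and CHECKSUM in column 7), and the function name mismatched_tables is made up.

```shell
#!/bin/sh
# Read mk-table-checksum output on stdin and print each database.table
# whose checksum differs between hosts for the same chunk.
mismatched_tables() {
    awk '
        $1 == "DATABASE" { next }               # skip the header row
        NF >= 7 {
            key = $1 "." $2 "." $3              # database.table.chunk
            if (!(key in sum))
                sum[key] = $7                   # first host seen for this chunk
            else if (sum[key] != $7)
                bad[$1 "." $2] = 1              # another host disagrees
        }
        END { for (t in bad) print t }
    ' | sort
}
```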
Once you know which tables are inconsistent, you can repair them with mk-table-sync.
Note: mk-table-checksum locks the table while it runs; how long it takes depends on the size of the table.
Data in table t2 on the MASTER:
mysql> select * from t2;
+----+------+
| id | name |
+----+------+
| 1 | a |
| 2 | b |
| 3 | ss |
| 4 | asd |
| 5 | ss |
+----+------+
5 rows in set (0.00 sec)
mysql> \! hostname;
vm01
Data in table t2 on the SLAVE:
mysql> select * from t2;
+----+------+
| id | name |
+----+------+
| 1 | a |
| 2 | b |
| 3 | ss |
| 4 | asd |
+----+------+
4 rows in set (0.00 sec)
mysql> \! hostname;
vm02
[root@vm02 ~]# mk-table-sync --execute --print --no-check-slave --transaction --databases hcy h=vm01,u=admin,p=123456 h=vm02,u=admin,p=123456
INSERT INTO `hcy`.`t2`(`id`, `name`) VALUES ('5', 'ss') /*maatkit src_db:hcy src_tbl:t2 src_dsn:h=vm01,p=...,u=admin dst_db:hcy dst_tbl:t2
dst_dsn:h=vm02,p=...,u=admin lock:0 transaction:1 changing_src:0 replicate:0 bidirectional:0 pid:3246 user:root host:vm02*/;
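It works like this: it first checks, row by row, whether the tables on the master and slave are identical; wherever they differ, it runs deletes, updates and inserts to bring them in line. How long it takes depends on the size of the table.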
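The compare-then-fix idea can be shown with a toy example. This is not mk-table-sync's actual algorithm, only a sketch under stated assumptions: each table has been dumped to a file of tab-separated id and name columns (for instance with mysql -N -B), the table is the t2 from this article, and names contain no quote characters. The function name sync_statements is hypothetical.

```shell
#!/bin/sh
# Usage: sync_statements MASTER_DUMP SLAVE_DUMP
# Print the statements that would make the slave copy of t2 match the master.
sync_statements() {
    awk -F'\t' -v q="'" '
        FILENAME == ARGV[1] { master[$1] = $2; next }   # first file: master rows
        {
            seen[$1] = 1
            if (!($1 in master))                        # row exists only on the slave
                print "DELETE FROM t2 WHERE id=" $1 ";"
            else if (master[$1] != $2)                  # row differs
                print "UPDATE t2 SET name=" q master[$1] q " WHERE id=" $1 ";"
        }
        END {
            for (id in master)                          # row missing on the slave
                if (!(id in seen))
                    print "INSERT INTO t2 (id, name) VALUES (" id ", " q master[id] q ");"
        }
    ' "$1" "$2" | sort
}
```

For the t2 data shown above, the only statement produced is the insert of row 5, which matches what mk-table-sync printed.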
If C<--transaction> is specified, C<LOCK TABLES> is not used. Instead, lock
and unlock are implemented by beginning and committing transactions.
The exception is if L<"--lock"> is 3.
If C<--no-transaction> is specified, then C<LOCK TABLES> is used for any
value of L<"--lock">. See L<"--[no]transaction">.
When enabled, either explicitly or implicitly, the transaction isolation level
is set C<REPEATABLE READ> and transactions are started C<WITH CONSISTENT SNAPSHOT>
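MySQL replication monitoring
Common MySQL error codes
1005: failed to create table
1006: failed to create database
1007: database already exists; failed to create database
1008: database does not exist; failed to drop database
1009: cannot delete database files; failed to drop database
1010: cannot delete data directory; failed to drop database
1011: failed to delete database file
1012: cannot read record in system table
1020: record has been modified by another user
1021: disk is full; free up or add disk space
1022: duplicate key; failed to change record
1023: error on close
1024: error reading file
1025: error on rename
1026: error writing file
1032: record does not exist
1036: table is read-only and cannot be modified
1037: out of memory; restart the database or the server
1038: out of sort memory; increase the sort buffer
1040: maximum number of database connections reached; increase the allowed connections
1041: out of memory
1042: invalid host name
1043: bad connection
1044: current user has no permission to access the database
1045: cannot connect to the database; wrong user name or password
1048: column cannot be null
1049: database does not exist
1050: table already exists
1051: table does not exist
1054: column does not exist
1065: invalid SQL statement; the statement is empty
1081: cannot create socket connection
1114: table is full and cannot hold any more records
1116: too many open tables
1129: host is blocked because of too many connection errors; unblock it with mysqladmin flush-hosts
1130: connection failed; the host has no permission to connect to the database
1133: database user does not exist
1141: current user has no permission to access the database
1142: current user has no permission to access the table
1143: current user has no permission to access the column
1146: table does not exist
1147: no access privileges defined for the user on the table
1149: SQL syntax error
1158: network error while reading; check the network connection
1159: network error, read timeout; check the network connection
1160: network error while writing; check the network connection
1161: network error, write timeout; check the network connection
1062: duplicate column value; insert failed
1169: duplicate column value; update failed
1177: failed to open table
1180: failed to commit transaction
1181: failed to roll back transaction
1203: the current user has reached the maximum number of connections to the database; increase the allowed connections or restart the database
1205: lock wait timeout
1211: current user has no permission to create users
1216: foreign key constraint check failed; cannot update child table record
1217: foreign key constraint check failed; cannot delete or update parent table record
1226: the current user has exceeded the allowed resources; restart the database or the server
1227: insufficient privileges; you are not allowed to perform this operation
1235: this MySQL version is too old and does not support the feature
Replication monitoring script
Adapted from the original article.
Original script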
#!/bin/bash
#
# check_mysql_slave_replication_status
#
parasum=2
help_msg(){
cat << help
+---------------------+
+Error Cause:
+you must input $parasum parameters!
+1st : Host_IP
+2nd : Host_Port
help
exit
}
[ $# -ne ${parasum} ] && help_msg    # print the help text and exit if arguments are missing
export HOST_IP=$1
export HOST_PORT=$2
MYUSER="root"
MYPASS="123456"
MYSQL_CMD="mysql -u$MYUSER -p$MYPASS"
MailTitle=""    # mail subject
Mail_Address_MysqlStatus="root@localhost.localdomain"    # recipient address
time1=$(date +"%Y%m%d%H%M%S")
time2=$(date +"%Y-%m-%d %H:%M:%S")
SlaveStatusFile=/tmp/slave_status_${HOST_PORT}.${time1}    # file that holds the mail body
echo "--------------------Begin at: "$time2 > $SlaveStatusFile
echo "" >> $SlaveStatusFile
# get slave status
${MYSQL_CMD} -e "show slave status\G" >> $SlaveStatusFile
# get io_thread_status, sql_thread_status, last_errno and the lag
IOStatus=$(cat $SlaveStatusFile|grep Slave_IO_Running|awk '{print $2}')
SQLStatus=$(cat $SlaveStatusFile|grep Slave_SQL_Running|awk '{print $2}')
Errno=$(cat $SlaveStatusFile|grep Last_Errno|awk '{print $2}')
Behind=$(cat $SlaveStatusFile|grep Seconds_Behind_Master|awk '{print $2}')
echo "" >> $SlaveStatusFile
if [ "$IOStatus" == "No" ] || [ "$SQLStatus" == "No" ];then    # work out the kind of error
    if [ "$Errno" -eq 0 ];then    # the slave threads may simply not be started
        $MYSQL_CMD -e "start slave io_thread;start slave sql_thread;"
        echo "Cause: slave threads are not running, trying: start slave io_thread;start slave sql_thread;" >> $SlaveStatusFile
        MailTitle="[Warning] Slave threads stopped on $HOST_IP $HOST_PORT"
    elif [ "$Errno" -eq 1007 ] || [ "$Errno" -eq 1053 ] || [ "$Errno" -eq 1062 ] || [ "$Errno" -eq 1213 ] || [ "$Errno" -eq 1032 ] \
        || [ "$Errno" -eq 1158 ] || [ "$Errno" -eq 1159 ] || [ "$Errno" -eq 1008 ];then    # skip these errors
        $MYSQL_CMD -e "stop slave;set global sql_slave_skip_counter=1;start slave;"
        echo "Cause: slave replication caught errors, trying to skip one event and restart: stop slave;set global sql_slave_skip_counter=1;start slave;" >> $SlaveStatusFile
        MailTitle="[Warning] Slave error on $HOST_IP $HOST_PORT! ErrNum: $Errno"
    else
        echo "Slave $HOST_IP $HOST_PORT is down!" >> $SlaveStatusFile
        MailTitle="[ERROR]Slave replication is down on $HOST_IP $HOST_PORT ! ErrNum:$Errno"
    fi
fi
if [ -n "$Behind" ];then    # as published: this resets every non-empty value to 0
    Behind=0
fi
echo "$Behind" >> $SlaveStatusFile
# delay behind master
if [ $Behind -gt 300 ];then
    echo `date +"%Y-%m-%d %H:%M:%S"` "slave is behind master $Behind seconds!" >> $SlaveStatusFile
    MailTitle="[Warning]Slave delay $Behind seconds,from $HOST_IP $HOST_PORT"
fi
if [ -n "$MailTitle" ];then    # send mail on error or when the lag exceeds 300s
    cat ${SlaveStatusFile} | /bin/mail -s "$MailTitle" $Mail_Address_MysqlStatus
fi
# del tmpfile:SlaveStatusFile
> $SlaveStatusFile
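Revised script
Only lightly tidied up, with the check for Behind being NULL corrected; none of this has been tested.
Possible further improvements:
Check the result of the repair; when there are multiple errors, loop over repair, check, repair again?
Do away with the SlaveStatusFile temporary file.
Send separate mails for the Errno and Behind alerts, and include the raw show slave status output in the alert body.
Set PATH so the script can be added to crontab.
Consider periodic execution from crontab (locking to avoid overlapping runs, choice of interval).
Add an execution log?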
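Two of these suggestions, setting PATH and guarding against overlapping cron runs, could look roughly like the sketch below. It is untested against the monitoring script itself; the lock directory location and the function names are assumptions made for this example, and /usr/local/mysql/bin is simply the install path used elsewhere in this article.

```shell
#!/bin/sh
# cron starts jobs with a minimal PATH, so put the mysql client on it.
PATH=/usr/local/mysql/bin:$PATH
export PATH

LOCK_DIR=${LOCK_DIR:-/tmp/check_mysql_slave.lock}    # hypothetical location

# mkdir is atomic: when several runs race, it succeeds for exactly one.
acquire_lock() {
    mkdir "$LOCK_DIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCK_DIR" 2>/dev/null
}

# Run the given command only if no other run holds the lock,
# and return that command's exit status.
run_exclusively() {
    acquire_lock || { echo "another run is in progress" >&2; return 1; }
    "$@"
    status=$?
    release_lock
    return "$status"
}
```

The body of the monitoring script would then be wrapped in a function and started with run_exclusively. If a run is killed before release_lock, the directory stays behind and has to be removed by hand, so a trap on exit would be a sensible addition.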
#!/bin/sh
#
# check_mysql_slave_replication_status
#
Usage(){
    echo Usage:
    echo "$0 HOST PORT USER PASS"
}
[ -z "$1" -o -z "$2" -o -z "$3" -o -z "$4" ] && Usage && exit 1
HOST=$1
PORT=$2
USER=$3
PASS=$4
MYSQL_CMD="mysql -h$HOST -P$PORT -u$USER -p$PASS"
MailTitle=""    # mail subject
Mail_Address_MysqlStatus="root@localhost.localdomain"    # recipient address
time1=$(date +"%Y%m%d%H%M%S")
time2=$(date +"%Y-%m-%d %H:%M:%S")
SlaveStatusFile=/tmp/slave_status_${PORT}.${time1}    # file that holds the mail body
echo "--------------------Begin at: "$time2 > $SlaveStatusFile
echo "" >> $SlaveStatusFile
# get slave status
${MYSQL_CMD} -e "show slave status\G" >> $SlaveStatusFile
# get io_thread_status, sql_thread_status, last_errno and the lag
IOStatus=$(cat $SlaveStatusFile|grep Slave_IO_Running|awk '{print $2}')
SQLStatus=$(cat $SlaveStatusFile|grep Slave_SQL_Running|awk '{print $2}')
Errno=$(cat $SlaveStatusFile|grep Last_Errno|awk '{print $2}')
Behind=$(cat $SlaveStatusFile|grep Seconds_Behind_Master|awk '{print $2}')
echo "" >> $SlaveStatusFile
if [ "$IOStatus" = "No" -o "$SQLStatus" = "No" ];then
    case "$Errno" in
    0)
        # the slave threads may simply not be started
        $MYSQL_CMD -e "start slave io_thread;start slave sql_thread;"
        echo "Cause: slave threads are not running, trying: start slave io_thread;start slave sql_thread;" >> $SlaveStatusFile
        ;;
    1007|1053|1062|1213|1032|1158|1159|1008)
        # skip these errors
        $MYSQL_CMD -e "stop slave;set global sql_slave_skip_counter=1;start slave;"
        echo "Cause: slave replication caught errors, trying to skip one event and restart: stop slave;set global sql_slave_skip_counter=1;start slave;" >> $SlaveStatusFile
        MailTitle="[Warning] Slave error on $HOST:$PORT! ErrNum: $Errno"
        ;;
    *)
        echo "Slave $HOST:$PORT is down!" >> $SlaveStatusFile
        MailTitle="[ERROR]Slave replication is down on $HOST:$PORT! Errno:$Errno"
        ;;
    esac
fi
# Seconds_Behind_Master is NULL when the SQL thread is not running
if [ "$Behind" = "NULL" -o -z "$Behind" ];then
    Behind=0
fi
echo "Behind:$Behind" >> $SlaveStatusFile
# delay behind master
if [ "$Behind" -gt 300 ];then
    echo `date +"%Y-%m-%d %H:%M:%S"` "slave is behind master $Behind seconds!" >> $SlaveStatusFile
    MailTitle="[Warning]Slave delay $Behind seconds,from $HOST $PORT"
fi
if [ -n "$MailTitle" ];then    # send mail on error or when the lag exceeds 300s
    cat ${SlaveStatusFile} | /bin/mail -s "$MailTitle" $Mail_Address_MysqlStatus
fi
# del tmpfile:SlaveStatusFile
> $SlaveStatusFile
Original URL: https://www.jb51.net/article/109107.htm
