主從報錯1205 Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable
Last_Errno: 1205
Last_Error: Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable

解決辦法:
stop slave;
start slave;
show slave status\G;
大家都知道DBA就像是消(背)防(鍋)員(俠),因為前端應用還有開發上線的新版本都會影響到位於最底層的數據庫,前方稍微有些風吹草動,就能反應在數據庫的性能上。但是有的時候SQL不僅能決定數據庫的性能,還能決定數據庫的生死,今天的案例是開發的一條SQL,引起的MySQL數據庫主從復制報錯,vip漂移問題。
今天下午有客戶反饋發消息有點慢,手機收到幾條報警:
nagios報警如下大致意思就是
14:39分 vip發生了漂移
14:46分主從復制中斷
登錄數據庫服務器查看主從復制的狀態,發現報錯1205錯
查看錯誤日志有一下報錯信息:
錯誤日志中給出了報錯原因和解決方法
tail -70 error.log
180328 14:36:05 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:36:21 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:36:38 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:36:56 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:37:15 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:37:35 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:37:56 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:38:17 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:38:38 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:38:59 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:39:20 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE notice_status_new set valid = 0 WHERE conversation = 384667 AND user_id IN(4337475,4337529,4337471,4337365)', Error_code: 1205
180328 14:39:20 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
180328 14:39:20 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
180328 14:39:20 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000481' position 182088298
查看參數結合着兩個參數可以知道主從復制報錯的原因為:
180328 14:39:20 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
180328 14:39:20 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
180328 14:39:20 [ERROR] Error running query, slave SQL thread aborted,Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000481' position 182088298
因為表notice_status_new產生了鎖,阻塞了這條update語句
參數innodb_lock_wait_time 設置的為15秒,單個事務在等待15秒后開始報1025錯:因為鎖執行超時並重啟事務
參數slave_transaction_retries 設置的為10次,如果事務重試次數超過10次,復制中斷。(但是錯誤日志中嘗試了11次)
解決方法:重啟slave后主從復制恢復,並開始追日志
vip的問題等到晚上在切回來,
---------------------------------我是分割線--------------------------
這次問題雖然解決了,但是發現問題我們一定要找到根音,這樣問題才能算是閉環。
為什么會產生這個鎖呢,因為記錄鎖信息的表都是實時的,現在查不出當時的鎖信息,但是從存儲引擎的運行狀態中還是能發現一些問題,數據庫出現了死鎖!!顯然在update這張表一定會等待的!!
發現了一條奇葩SQL,被鎖的那張表也是notice_status_new
UPDATE notice_status_new SET count = count + 1, push_count = push_count + 1 WHERE conversation = 2609759 AND user_id != 7384662 AND user_id IN(7359498)
問題找到了,找對該業務的開發,溝通后決定在周六上線新版本,徹底解決問題。
可以寫一個實時監控鎖的腳本,把內容輸出打印,這樣方便以后定位問題!
-----------------------------------------------
后面想了想可能是由於vip漂移導致的主從復制報錯
線上架構為MM+heartbeat
vip漂移的原因待定,vip漂移后,數據庫開始在192.168.87.123服務器上寫入數據庫,72數據庫作為備份庫,但是還會同步123數據庫的數據,由於192.168.87.72數據庫表notice_status_new還存在死鎖,所以在slave SQL thread應用來自123服務器的中繼日志時,執行到該SQL時,發生了鎖等待,等待15秒后超時並重新嘗試該事務,嘗試10后,還是在鎖表狀態,slave SQL thread 線程停止工作,主從同步中斷。
----------------------------------------
3月29日和開發復盤
因為報警時間不是很准確,實際時間還是以heartbeat日志和mysql日志為准,通過查看日志確定是heartbeat首先發生漂移,然后才是主從復制報錯
主要是討論的問題有兩個
① Heartbeat為什么會發生漂移

通過查看heartbeat日志,72和123存在心跳檢測,在72ping123的時候發生了丟包,
在123ping72的時候正常,72節點以為自己死掉?在14:35:26發生vip漂移,從72漂移到123。
Mar 28 14:35:26 bjuc-mysql1 pengine: [14176]: notice: LogActions: Move VIP_192_168_87_74 (Started bjuc-mysql1 -> prod-uc-yewu-db-slave2)
② 為什么會出現死鎖

從存儲引擎的運行狀態中查看,如果這兩條SQL的請求同時到達數據庫,因為innodb支持到行鎖的粒度,第二條SQL會鎖住很多行,如果和第一條SQL中鎖住有重復的,就會發生相互等待的狀況,所以出現死鎖。
這兩條SQL的where條件如下:
user_id IN(6032694,5824664,4344233,5450840,6032678,4951596,6032668,4344453,6038786,4344347,6032676,5754768);
user_id != 6185996
這兩個條件可能發生邏輯上的沖突,然后導致死鎖
轉自
MySQL主從復制報錯 Errno 1205-11774929-51CTO博客 http://blog.51cto.com/11784929/2092097