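Reposted from: https://www.cnblogs.com/micrari/p/8029710.html
This article analyzes a deadlock that select for update can cause in MySQL InnoDB under the Repeatable Read isolation level.
1. Business case
The business needs to assign numbers to entities of various types. For entities of type x, for example, the numbers might look like x201712120001, x201712120002, x201712120003. Each number has two parts: a prefix made up of x plus the date, and a serial number (four digits here).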
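To make the numbering scheme concrete, here is a minimal sketch of how such a number could be assembled. The class and method names (SerialNumberFormat, prefix, format) are hypothetical and only serve as illustration; the article itself does not prescribe them.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the original article.
final class SerialNumberFormat {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20171212.
    private static final DateTimeFormatter DATE = DateTimeFormatter.BASIC_ISO_DATE;

    /** Builds the prefix: the entity type code followed by the date. */
    static String prefix(String typeCode, LocalDate date) {
        return typeCode + DATE.format(date);
    }

    /** Appends the serial number, zero-padded to four digits. */
    static String format(String prefix, long serial) {
        return String.format("%s%04d", prefix, serial);
    }

    public static void main(String[] args) {
        String p = prefix("x", LocalDate.of(2017, 12, 12));
        System.out.println(format(p, 1)); // prints x201712120001
    }
}
```

Only the serial part needs coordination between threads, which is why the rest of the article focuses on how that counter is allocated.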
To allocate serial numbers with a database table, we would naturally create a table like the one below:
CREATE TABLE number (
  prefix VARCHAR(20) NOT NULL DEFAULT '' COMMENT 'prefix code',
  value BIGINT NOT NULL DEFAULT 0 COMMENT 'serial number',
  UNIQUE KEY uk_prefix (prefix)
);
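In the business layer, we derive the prefix from the business rules, for example x20171212, then open a transaction in code and use select for update to control allocation as follows.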
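@Transactional
long acquire(String prefix) {
    // Locking read: select ... for update on the row for this prefix
    SerialNumber current = dao.selectAndLock(prefix);
    if (current == null) {
        // No row yet: start the sequence at 1
        dao.insert(new SerialNumber(prefix, 1));
        return 1;
    } else {
        // Row exists and is locked: bump the counter
        current.number++;
        dao.update(current);
        return current.number;
    }
}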
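What this code does is a locking lookup: update the row if it exists, insert it if it does not. Under the Repeatable Read isolation level, however, it has a potential deadlock. (A second issue, related to transaction propagation behavior, is covered later.)
2. Analysis and solution
When the where condition of the select for update matches a record, the code above cannot deadlock. When it matches nothing, however, a deadlock becomes possible if multiple threads execute acquire concurrently.
2.1 A simple reproduction
Let's reproduce the scenario with a fairly simple example.
First, initialize the table with three rows.
insert into number select 'bbb',2;
insert into number select 'hhh',8;
insert into number select 'yyy',25;
Then run the following operations in this order: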
| session 1 | session 2 |
|---|---|
| begin; | |
| | begin; |
| select * from number where prefix='ddd' for update; | |
| | select * from number where prefix='fff' for update; |
| insert into number select 'ddd',1; | |
| waiting for lock | insert into number select 'fff',1; |
| lock wait ends | deadlock; session 2's transaction is rolled back |
2.2 Analyzing the deadlock
By examining the output of show engine innodb status, we can observe what happens at each step:
2.2.1 session 1 runs select for update
------------
TRANSACTIONS
------------
Trx id counter 238435
Purge done for trx's n:o < 238430 undo n:o < 0 state: running but idle
History list length 13
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 281479459589696, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 281479459588792, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 238434, ACTIVE 3 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 160, OS thread handle 123145573965824, query id 69153 localhost root
TABLE LOCK table test.number trx id 238434 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238434 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
Transaction 238434 acquired the gap lock before hhh, that is, the gap lock on ('bbb', 'hhh').
2.2.2 session 2 runs select for update
------------
TRANSACTIONS
------------
Trx id counter 238436
Purge done for trx's n:o < 238430 undo n:o < 0 state: running but idle
History list length 13
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 281479459589696, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 238435, ACTIVE 3 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 161, OS thread handle 123145573408768, query id 69155 localhost root
TABLE LOCK table test.number trx id 238435 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238435 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
---TRANSACTION 238434, ACTIVE 30 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 160, OS thread handle 123145573965824, query id 69153 localhost root
TABLE LOCK table test.number trx id 238434 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238434 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
Transaction 238435 also acquired the gap lock before hhh.

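The reason lies in InnoDB's lock_rec_has_to_wait implementation: a LOCK_GAP lock that does not carry the insert intention flag never has to wait for other locks (table locks excepted).
2.2.3 session 1 attempts the insert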
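The waiting rule can be modeled in a few lines. The sketch below is a simplified Java model, not InnoDB's actual C++ code: it assumes the flag values from InnoDB's lock0lock.h, treats record lock modes as S or X only, and leaves out the same-transaction and supremum-record checks. The class name GapLockWaitSketch is made up for this illustration.

```java
// Simplified model of the record-lock wait rule; not the real InnoDB source.
final class GapLockWaitSketch {
    // Flag values as defined in InnoDB's lock0lock.h.
    static final int LOCK_S = 2;
    static final int LOCK_X = 3;
    static final int LOCK_MODE_MASK = 0xF;
    static final int LOCK_GAP = 512;
    static final int LOCK_REC_NOT_GAP = 1024;
    static final int LOCK_INSERT_INTENTION = 2048;

    // Assumption: only S and X occur on records, and only S/S is compatible.
    static boolean modesCompatible(int a, int b) {
        return (a & LOCK_MODE_MASK) == LOCK_S && (b & LOCK_MODE_MASK) == LOCK_S;
    }

    /** Must a request with typeMode wait for a lock another transaction holds? */
    static boolean hasToWait(int typeMode, int heldTypeMode) {
        if (modesCompatible(typeMode, heldTypeMode)) {
            return false;
        }
        // A gap lock without the insert intention flag waits for nothing.
        if ((typeMode & LOCK_GAP) != 0 && (typeMode & LOCK_INSERT_INTENTION) == 0) {
            return false;
        }
        // A request without insert intention does not wait for a gap lock.
        if ((typeMode & LOCK_INSERT_INTENTION) == 0 && (heldTypeMode & LOCK_GAP) != 0) {
            return false;
        }
        // A gap request does not wait for a lock on the record only.
        if ((typeMode & LOCK_GAP) != 0 && (heldTypeMode & LOCK_REC_NOT_GAP) != 0) {
            return false;
        }
        // Nobody waits for an insert intention lock.
        if ((heldTypeMode & LOCK_INSERT_INTENTION) != 0) {
            return false;
        }
        return true;
    }
}
```

With this model, hasToWait(LOCK_X | LOCK_GAP, LOCK_X | LOCK_GAP) is false, which is why both sessions obtain the same gap lock, while hasToWait(LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION, LOCK_X | LOCK_GAP) is true, which is why an insert into that gap blocks.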
------------
TRANSACTIONS
------------
Trx id counter 238436
Purge done for trx's n:o < 238430 undo n:o < 0 state: running but idle
History list length 13
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 281479459589696, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 238435, ACTIVE 28 sec
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 161, OS thread handle 123145573408768, query id 69155 localhost root
TABLE LOCK table test.number trx id 238435 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238435 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
---TRANSACTION 238434, ACTIVE 55 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 160, OS thread handle 123145573965824, query id 69157 localhost root executing
insert into number select 'ddd',1
------- TRX HAS BEEN WAITING 2 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238434 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
TABLE LOCK table test.number trx id 238434 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238434 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238434 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
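As we can see, when transaction 238434 tries to insert ('ddd', 1), it finds that another transaction (238435) already holds a gap lock on this interval. InnoDB therefore gives transaction 238434 an insert intention lock with mode LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION, and that lock waits for transaction 238435 to release its gap lock.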

This behavior comes from InnoDB's lock_rec_insert_check_and_lock implementation.
2.2.4 session 2 attempts the insert
------------------------
LATEST DETECTED DEADLOCK
------------------------
2017-12-21 22:50:40 0x70001028a000
*** (1) TRANSACTION:
TRANSACTION 238434, ACTIVE 81 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 160, OS thread handle 123145573965824, query id 69157 localhost root executing
insert into number select 'ddd',1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238434 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
*** (2) TRANSACTION:
TRANSACTION 238435, ACTIVE 54 sec inserting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 161, OS thread handle 123145573408768, query id 69159 localhost root executing
insert into number select 'fff',1
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238435 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238435 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
*** WE ROLL BACK TRANSACTION (2)
TRANSACTIONS
Trx id counter 238436
Purge done for trx's n:o < 238430 undo n:o < 0 state: running but idle
History list length 13
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 281479459589696, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 281479459588792, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 238434, ACTIVE 84 sec
3 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 1
MySQL thread id 160, OS thread handle 123145573965824, query id 69157 localhost root
TABLE LOCK table test.number trx id 238434 lock mode IX
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238434 lock_mode X locks gap before rec
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
Record lock, heap no 7 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 646464; asc ddd;;
1: len 6; hex 00000003a362; asc b;;
2: len 7; hex de000001e60110; asc ;;
3: len 8; hex 8000000000000001; asc ;;
RECORD LOCKS space id 1506 page no 3 n bits 80 index uk_prefix of table test.number trx id 238434 lock_mode X locks gap before rec insert intention
Record lock, heap no 3 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 3; hex 686868; asc hhh;;
1: len 6; hex 00000003a350; asc P;;
2: len 7; hex d2000001ff0110; asc ;;
3: len 8; hex 8000000000000008; asc ;;
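At this point, the deadlock information shows that transaction 238435, when inserting, likewise found transaction 238434's gap lock and took an insert intention lock that waits for transaction 238434 to release it. Each transaction now waits for the other, hence the deadlock.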
2.3 debug it!
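Next, we reproduce the scenario above by debugging the MySQL source code.
Here, session 2's transaction 4445 requests a lock with type_mode 515, that is (LOCK_X | LOCK_GAP). Its lock mode is incompatible with that of the gap lock held by session 1's transaction 4444, lock2->type_mode = 547 (LOCK_X | LOCK_REC | LOCK_GAP), since both are LOCK_X. However, because type_mode has LOCK_GAP set and does not carry the LOCK_INSERT_INTENTION flag, the check concludes that no wait is needed. The second session's select for update therefore also succeeds in taking the gap lock.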


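Here, when session 1's transaction 4444 executes its insert, the type_mode is 2563 (LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION). Because it carries the LOCK_INSERT_INTENTION flag, it has to wait for session 2's transaction 4445 to release its gap lock. Session 1's transaction 4444 then ends up with an insert intention lock in the waiting state, blocked on the gap lock of session 2's transaction 4445.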
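The type_mode values seen in the debugger (515, 547 and 2563) are plain bit masks, so they can be decoded mechanically. The small decoder below is a sketch that assumes the constants from InnoDB's lock0lock.h (mode in the low four bits, then LOCK_TABLE = 16, LOCK_REC = 32, LOCK_WAIT = 256, LOCK_GAP = 512, LOCK_REC_NOT_GAP = 1024, LOCK_INSERT_INTENTION = 2048); the class name TypeModeDecoder is invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reading type_mode values while debugging.
final class TypeModeDecoder {
    private static final String[] MODES = {
        "LOCK_IS", "LOCK_IX", "LOCK_S", "LOCK_X", "LOCK_AUTO_INC"
    };

    static String decode(int typeMode) {
        List<String> parts = new ArrayList<>();
        // The low four bits hold the basic lock mode.
        int mode = typeMode & 0xF;
        parts.add(mode < MODES.length ? MODES[mode] : "LOCK_MODE_" + mode);
        // The remaining bits are independent flags.
        if ((typeMode & 16) != 0) parts.add("LOCK_TABLE");
        if ((typeMode & 32) != 0) parts.add("LOCK_REC");
        if ((typeMode & 256) != 0) parts.add("LOCK_WAIT");
        if ((typeMode & 512) != 0) parts.add("LOCK_GAP");
        if ((typeMode & 1024) != 0) parts.add("LOCK_REC_NOT_GAP");
        if ((typeMode & 2048) != 0) parts.add("LOCK_INSERT_INTENTION");
        return String.join(" | ", parts);
    }

    public static void main(String[] args) {
        System.out.println(decode(515));  // LOCK_X | LOCK_GAP
        System.out.println(decode(547));  // LOCK_X | LOCK_REC | LOCK_GAP
        System.out.println(decode(2563)); // LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION
    }
}
```

The arithmetic checks out by hand as well: 3 + 512 = 515, 3 + 32 + 512 = 547, and 3 + 512 + 2048 = 2563.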



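Here session 2's transaction 4445 performs its insert as well, and its insert intention lock has to wait for the gap lock of session 1's transaction 4444 to be released. Deadlock detection finds that a wait cycle has formed, so InnoDB picks one transaction as the victim and rolls it back.
The process is roughly as follows:
- session 2 tries to acquire an insert intention lock and has to wait for session 1's gap lock
- session 1's insert intention lock is itself in the waiting state
- session 1's insert intention lock is waiting for session 2's gap lock
- a cycle has formed, and the deadlock is detected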
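The cycle check in these steps can be illustrated with a toy wait-for graph. This is only a sketch under the assumption that each blocked transaction waits for exactly one other transaction; InnoDB's real deadlock detector works on its lock queues and is considerably more involved. WaitForGraph and addWaitAndCheck are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy wait-for graph; not how InnoDB implements deadlock detection.
final class WaitForGraph {
    // Maps a waiting transaction id to the transaction id it waits for.
    private final Map<Long, Long> waitsFor = new HashMap<>();

    /** Records "waiter waits for holder" and reports whether that closes a cycle. */
    boolean addWaitAndCheck(long waiter, long holder) {
        waitsFor.put(waiter, holder);
        Set<Long> seen = new HashSet<>();
        Long current = holder;
        // Follow the chain of waits; stop at a free transaction or a repeat.
        while (current != null && seen.add(current)) {
            if (current == waiter) {
                return true; // we came back to the waiter: deadlock
            }
            current = waitsFor.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        WaitForGraph graph = new WaitForGraph();
        // session 1 (trx 238434) waits for session 2's gap lock (trx 238435)
        System.out.println(graph.addWaitAndCheck(238434L, 238435L)); // false
        // session 2 then waits for session 1's gap lock: the cycle closes
        System.out.println(graph.addWaitAndCheck(238435L, 238434L)); // true
    }
}
```

The first edge alone is an ordinary lock wait; it is the second edge, added by session 2's insert, that turns the wait into a deadlock.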
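2.4 How to avoid this deadlock
As we have seen, the cause is this: when two sessions both run select for update and neither matches any record, they may acquire locks on the same gap. (Whether they do depends on whether the where conditions fall into the same interval. In the example above, if one session intended to insert 'ddd' and the other 'kkk', there would be no conflict, because those keys lie in different gaps.) If both sessions then insert concurrently, the first insert enters a lock wait, and when the second session inserts, a deadlock occurs. MySQL chooses one transaction to roll back based on transaction weight.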
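Whether two missing keys fall into the same gap can be checked with a sorted set. The sketch below assumes that Java's natural String ordering matches the index ordering for these lowercase ASCII keys, which will not hold for every MySQL collation; GapLookup and gapOf are hypothetical names.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Illustration only: models index gaps with a sorted set of existing keys.
final class GapLookup {
    /** Returns the open interval between existing keys that a missing key falls into. */
    static String gapOf(TreeSet<String> keys, String key) {
        String lower = keys.lower(key);   // greatest existing key below, or null
        String higher = keys.higher(key); // least existing key above, or null
        return "(" + (lower == null ? "-inf" : lower) + ", "
                + (higher == null ? "+inf" : higher) + ")";
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList("bbb", "hhh", "yyy"));
        System.out.println(gapOf(keys, "ddd")); // (bbb, hhh)
        System.out.println(gapOf(keys, "fff")); // (bbb, hhh)
        System.out.println(gapOf(keys, "kkk")); // (hhh, yyy)
    }
}
```

'ddd' and 'fff' share the gap (bbb, hhh), so their gap locks overlap, while 'kkk' sits in (hhh, yyy) and would not collide with either.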
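So how can we avoid it?
One solution is to lower the transaction isolation level to Read Committed. At that level, a locking read such as select for update does not take gap locks, so in the scenario above nothing goes wrong as long as the where conditions differ, that is, as long as the keys finally inserted differ. If different threads in the business code may run select for update on the same key, the code can catch the unique index conflict exception and retry.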
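A retry wrapper for that last case might look like the sketch below. It is written against plain JDBC exceptions and assumes the standard SQLSTATE values 23000 (integrity constraint violation, which covers MySQL's duplicate-key error 1062) and 40001 (which covers MySQL's deadlock error 1213). RetryOnConflict and withRetry are hypothetical names, and the transaction body is assumed to start a fresh transaction on every attempt.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical retry helper; adapt the exception types to your data-access layer.
final class RetryOnConflict {
    /** True for duplicate-key (SQLSTATE 23000) and deadlock (SQLSTATE 40001) errors. */
    static boolean isRetryable(SQLException e) {
        String state = e.getSQLState();
        return "23000".equals(state) || "40001".equals(state);
    }

    /**
     * Runs txBody, retrying on retryable conflicts up to maxAttempts times.
     * Assumption: txBody opens and commits its own transaction on each call.
     */
    static <T> T withRetry(Callable<T> txBody, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return txBody.call();
            } catch (SQLException e) {
                // Give up on other errors, or once the attempts are used up.
                if (!isRetryable(e) || attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

On the retry, the select for update finds the row that the competing thread inserted, so the call takes the update branch instead of inserting again.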
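In addition, the sample code above has another point worth noting: the propagation behavior of the @Transactional annotation. For a method like this one, which has little to do with the transaction of the main flow, the propagation behavior should be changed to REQUIRES_NEW, for example @Transactional(propagation = Propagation.REQUIRES_NEW, isolation = Isolation.READ_COMMITTED).
There are two reasons:
- The solution here lowers the isolation level. If the propagation behavior stays at the default and the outer transaction's isolation level is not RC, an IllegalTransactionStateException is thrown (provided your TransactionManager has validateExistingTransaction enabled).
- If the method joins the outer transaction, a thread acquiring a serial number may be blocked because another thread's transaction code, unrelated to serial numbers, has not finished executing.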
