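I recently ran into an interesting deadlock in a .NET 2.0 program. Deadlocks in .NET are usually caused by application code that explicitly acquires locks in an inconsistent order. The deadlock described in this article is far less obvious and is well hidden.
Debugging process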
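For .NET deadlocks, the !syncblk command provided by sos.dll lists the sync blocks, which shows which threads own locks and which threads are waiting for them. So we start by looking at the output of !syncblk.
The output shows one sync block that is currently owned by debugger thread 3 (OS thread ID 1814). The MonitorHeld value of 3 indicates one owner plus one waiter.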
0:005> .loadby sos mscorwks
0:005> !syncblk
Index SyncBlock        MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
    3 0000000000f4e678           3         1 0000000000f58320 1814   3   0000000002ef6040 System.Object
-----------------------------
Total           3
CCW             0
RCW             0
ComClassFactory 0
Free            0
Switch to thread 3, confirm that its OS thread ID is 1814, and dump its call stack. It turns out that this thread is waiting to enter another critical section.
0:005> ~3s
ntdll!ZwWaitForSingleObject+0xa:
000007f8`04ba2c2a c3 ret
0:003> ~.
. 3 Id: 88c.1814 Suspend: 1 Teb: 000007f5`ff396000 Unfrozen
Start: mscorwks!Thread::intermediateThreadProc (000007ff`e7d5f33c)
Priority: 0 Priority class: 32 Affinity: f
0:003> kvL
Args to Child : Call Site
00000000`00f535d0 00000000`000001f8 00000000`00000000 00000000`00f07c70 : ntdll!ZwWaitForSingleObject+0xa
00000000`00000000 000007ff`43010042 00000000`00f535d0 00000000`00000001 : ntdll!RtlpWaitOnCriticalSection+0xea
00000000`0000000a 00000000`00000000 00000000`00f1c5b0 00000000`00f535d0 : ntdll!RtlpEnterCriticalSectionContended+0x94
00000000`00000a45 00000000`00000000 ffffffff`fffffffe 000007ff`88533480 : mscorwks!UnsafeEEEnterCriticalSection+0x20
00000000`00000000 000007ff`e82758f0 ffffffff`fffffffe 00000000`00000000 : mscorwks!CrstBase::Enter+0x123
ffffffff`00000001 00000000`00f4d920 00000000`00000000 00000000`00000001 : mscorwks!ListLockEntry::FinishDeadlockAwareEnter+0x2b
00000000`1b81e100 00000000`00f48e80 00000000`00000000 00000000`00000000 : mscorwks!ListLockEntry::LockHolder::DeadlockAwareAcquire+0x32
000007f8`01eb798a 000007ff`883f3ba0 00000000`00000003 000007ff`883f3958 : mscorwks!MethodTable::DoRunClassInitThrowing+0x6cb
00000000`00000000 00000000`00000000 ffffffff`fffffffe 000007ff`e1261560 : mscorwks!MethodTable::CheckRunClassInitThrowing+0x68
00000000`02ef8148 00000000`00f0c040 00000000`00000000 000007ff`883f3b90 : mscorwks!MethodDesc::DoPrestub+0x162
00000000`02ef4e40 00000000`00000000 00000000`02ef5fe8 00000000`00f58320 : mscorwks!PreStubWorker+0x1fa
00000000`1b81ec50 00000000`00000000 00000000`1b81ea50 00000000`00000000 : mscorwks!ThePreStubAMD64+0x87
00000000`02ef6028 000007ff`e16334c0 00000000`1b81ed00 00000000`00000000 : StaticConstruction!StaticConstruction.Singleton.LockIt()+0x10c
00000000`02ef4c20 000007ff`e16334c0 00000000`1b81ed00 00000000`00000000 : StaticConstruction!StaticConstruction.Program.Thread1Proc()+0x37
...
00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d
Dumping that critical section shows that its owning thread is 1994, which is thread 4.
0:003> !cs 00f535d0
-----------------------------------------
Critical section = 0x0000000000f535d0 (+0xF535D0)
DebugInfo        = 0x0000000000f402e0
LOCKED
LockCount        = 0x1
WaiterWoken      = No
OwningThread     = 0x0000000000001994
RecursionCount   = 0x1
LockSemaphore    = 0x1F8
SpinCount        = 0x00000000020007d0
0:003> ~
   0 Id: 88c.1f7c Suspend: 1 Teb: 000007f5`ff39e000 Unfrozen
   1 Id: 88c.18a4 Suspend: 1 Teb: 000007f5`ff39c000 Unfrozen
   2 Id: 88c.1438 Suspend: 1 Teb: 000007f5`ff39a000 Unfrozen
.  3 Id: 88c.1814 Suspend: 1 Teb: 000007f5`ff396000 Unfrozen
   4 Id: 88c.1994 Suspend: 1 Teb: 000007f5`ff394000 Unfrozen
#  5 Id: 88c.174c Suspend: 1 Teb: 000007f5`ff1be000 Unfrozen
Switch to thread 4 and look at its call stack. The thread is waiting in the Monitor.Enter call inside StaticConstruction.Singleton.LockIt, that is, for the sync block owned by thread 3.
0:003> ~4s
ntdll!ZwWaitForMultipleObjects+0xa:
000007f8`04ba319b c3 ret
0:004> kL
Child-SP          RetAddr           Call Site
00000000`1b91d348 000007f8`01e812d2 ntdll!ZwWaitForMultipleObjects+0xa
00000000`1b91d350 000007ff`e7c3e809 KERNELBASE!WaitForMultipleObjectsEx+0xe5
00000000`1b91d630 000007ff`e7c431f1 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1
00000000`1b91d6d0 000007ff`e7d403e5 mscorwks!Thread::DoAppropriateAptStateWait+0x41
00000000`1b91d730 000007ff`e7c5e95c mscorwks!Thread::DoAppropriateWaitWorker+0x191
00000000`1b91d830 000007ff`e7c9d17a mscorwks!Thread::DoAppropriateWait+0x5c
00000000`1b91d8a0 000007ff`e7c20fe1 mscorwks!CLREvent::WaitEx+0xbe
00000000`1b91d950 000007ff`e7d6e012 mscorwks!AwareLock::EnterEpilog+0xc9
00000000`1b91da20 000007ff`e817a825 mscorwks!AwareLock::Enter+0x72
00000000`1b91da50 000007ff`88540657 mscorwks!JIT_MonEnterWorker_Portable+0xf5
00000000`1b91dc20 000007ff`885404c3 StaticConstruction!StaticConstruction.Singleton.LockIt()+0xa7
00000000`1b91dcc0 000007ff`e7ddd562 StaticConstruction!StaticConstruction.Static..cctor()+0x93
00000000`1b91dd30 000007ff`e7d1a293 mscorwks!CallDescrWorker+0x82
00000000`1b91dd70 000007ff`e7d1a3da mscorwks!CallDescrWorkerWithHandler+0xd3
00000000`1b91de10 000007ff`e7cfd437 mscorwks!DispatchCallDebuggerWrapper+0x3e
00000000`1b91de70 000007ff`e7cf22bd mscorwks!MethodTable::RunClassInitEx+0x207
00000000`1b91dfc0 000007ff`e8165f98 mscorwks!MethodTable::DoRunClassInitThrowing+0x74d
00000000`1b91ea40 000007ff`e814f162 mscorwks!MethodTable::CheckRunClassInitThrowing+0x68
00000000`1b91ea80 000007ff`e7cf72aa mscorwks!MethodDesc::DoPrestub+0x162
00000000`1b91eb70 000007ff`e7ddd447 mscorwks!PreStubWorker+0x1fa
00000000`1b91ec30 000007ff`88540300 mscorwks!ThePreStubAMD64+0x87
00000000`1b91ed00 000007ff`e14e2bdb StaticConstruction!StaticConstruction.Program.Thread2Proc()+0x20
...
00000000`1b91f8c0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
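So thread 3 and thread 4 each hold the lock that the other one needs, and that is what causes the deadlock.
Thread 4 is easy to understand: it is waiting in Monitor.Enter. But how do we explain the call stack of thread 3? Thread 3 is also inside StaticConstruction.Singleton.LockIt, so where does the critical section it is waiting on come from, and what is it used for?
Let's go back to the program's source code: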
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;

namespace StaticConstruction
{
    class Program
    {
        static void Main(string[] args)
        {
            Thread t1 = new Thread(Thread1Proc);
            Thread t2 = new Thread(Thread2Proc);
            t1.Start(); // lockA -> wait for static construction
            t2.Start(); // static construction -> lockA
            Console.Read();
        }

        static void Thread1Proc()
        {
            Singleton.Instance.LockIt();
        }

        static void Thread2Proc()
        {
            Static.Foo();
        }
    }

    class Singleton
    {
        private object lockA = new object();

        private Singleton() { }

        private static Singleton _instance = new Singleton();

        public static Singleton Instance
        {
            get { return _instance; }
        }

        public void LockIt()
        {
            Console.WriteLine("Thread {0} waiting lock A", Thread.CurrentThread.ManagedThreadId);
            lock (lockA)
            {
                Console.WriteLine("Thread {0} got lock A", Thread.CurrentThread.ManagedThreadId);
                Thread.Sleep(10);
                // First use of Static: triggers, or waits for, its static constructor
                Static.Foo();
            }
            Console.WriteLine("Thread {0} released lock A", Thread.CurrentThread.ManagedThreadId);
        }
    }

    class Static
    {
        private Static() { }

        static Static()
        {
            Console.WriteLine("Static constructor begin by thread {0}", Thread.CurrentThread.ManagedThreadId);
            // Takes lock A while the type initialization of Static is still in progress
            Singleton.Instance.LockIt();
            Console.WriteLine("Static constructor end by thread {0}", Thread.CurrentThread.ManagedThreadId);
        }

        public static void Foo()
        {
            Console.WriteLine("Static method Foo begin");
            Console.WriteLine("Static method Foo end");
        }
    }
}
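It turns out that right after thread 3 acquires the lock in StaticConstruction.Singleton.LockIt, it calls a method on the class Static. Looking back at the call stack of thread 4, the static constructor of Static is still running. Until that static constructor has completed, thread 3 cannot call any method on the class. The CLR guards MethodTable (class) initialization with a critical section to keep it thread safe, and in this particular situation that internal lock becomes one side of a deadlock.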
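The blocking behaviour described above can be reproduced in isolation. The following is a minimal sketch, not taken from the original program: the names Signals, Slow and TypeInitDemo are hypothetical, and the 500 ms sleep only stands in for a slow static constructor. A second thread that touches the type while its static constructor is still running has to wait until the constructor returns.

```csharp
using System;
using System.Threading;

// Hypothetical helper type: the signals live here so that touching them
// does not itself trigger the static constructor of Slow.
static class Signals
{
    public static readonly ManualResetEvent CctorStarted = new ManualResetEvent(false);
    public static volatile bool CctorFinished;
}

class Slow
{
    static Slow()
    {
        Signals.CctorStarted.Set();    // announce that type initialization has begun
        Thread.Sleep(500);             // stand-in for a slow static constructor
        Signals.CctorFinished = true;
    }

    // Any call to this method first requires Slow to be fully initialized.
    public static bool Touch()
    {
        return Signals.CctorFinished;
    }
}

class TypeInitDemo
{
    static void TriggerInit()
    {
        Slow.Touch();                  // first use: this thread runs the static constructor
    }

    public static bool Run()
    {
        Thread initializer = new Thread(TriggerInit);
        initializer.Start();

        // Wait until the static constructor is running on the other thread.
        Signals.CctorStarted.WaitOne();

        // This call cannot proceed until the other thread's static constructor
        // has returned, so it observes CctorFinished == true.
        bool sawFinished = Slow.Touch();

        initializer.Join();
        return sawFinished;
    }

    static void Main()
    {
        Console.WriteLine("Second caller saw a finished constructor: {0}", Run());
    }
}
```

In the deadlocked program the waiting thread is in the same position as the second caller here, except that it already holds lock A, which the static constructor needs in order to finish.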
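Once this is understood, the fix is straightforward: make the lock order consistent, and treat the CLR's type-initialization lock as one of the locks being ordered. In this program that means either not taking lock A inside the static constructor of Static, or making sure Static is already initialized before lock A is acquired. As long as every thread requests the locks in the same order, the deadlock cannot occur.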
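One way to apply a consistent order to the program above is sketched below. It is an assumption about how the fix could look, not the author's actual change: LockIt forces the type initialization of Static before it takes lock A, using RuntimeHelpers.RunClassConstructor, so every thread acquires "type initialization, then lock A". The namespace StaticConstructionFixed, the Run method and the 5-second join timeout are hypothetical additions for illustration.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

namespace StaticConstructionFixed
{
    class Singleton
    {
        private readonly object lockA = new object();
        private static readonly Singleton _instance = new Singleton();

        private Singleton() { }

        public static Singleton Instance
        {
            get { return _instance; }
        }

        public void LockIt()
        {
            // Assumed fix: initialize Static BEFORE taking lock A.
            // This waits if another thread is running the static constructor,
            // and returns immediately if Static is already initialized or if
            // the current thread is the one running that constructor.
            RuntimeHelpers.RunClassConstructor(typeof(Static).TypeHandle);

            lock (lockA)
            {
                Thread.Sleep(10);
                Static.Foo();   // no longer waits on type initialization under lock A
            }
        }
    }

    class Static
    {
        static Static()
        {
            Singleton.Instance.LockIt();
        }

        public static void Foo()
        {
            Console.WriteLine("Static.Foo called by thread {0}",
                Thread.CurrentThread.ManagedThreadId);
        }
    }

    class Program
    {
        static void Thread1Proc() { Singleton.Instance.LockIt(); }
        static void Thread2Proc() { Static.Foo(); }

        // Returns true if both threads finish within the timeout.
        public static bool Run()
        {
            Thread t1 = new Thread(Thread1Proc);
            Thread t2 = new Thread(Thread2Proc);
            t1.IsBackground = true;
            t2.IsBackground = true;
            t1.Start();
            t2.Start();
            return t1.Join(5000) && t2.Join(5000);
        }

        static void Main()
        {
            Console.WriteLine(Run() ? "Both threads completed" : "Threads are still blocked");
        }
    }
}
```

A simpler alternative, where the design allows it, is to keep static constructors free of any lock acquisition, so that the type-initialization lock never has to be ordered against application locks at all.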