Linux Deadlock Detection



Preface

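Resolving deadlocks on Windows is familiar territory. The book Windows via C/C++ ships a sample project, LockCop, that attaches to a process and determines whether it is deadlocked. A deadlock has a distinctive symptom: the program looks perfectly normal on the surface, but certain messages sent to it are never processed. We normally use LockCop to confirm that a deadlock exists, then attach remotely with Visual Studio to see where the relevant threads are stuck. They are usually stuck at a lock acquisition; looking a few steps back in the code of the two deadlocked threads shows whether each holds the lock the other is now requesting. This locates the deadlock quickly.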

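Investigating a deadlock on Linux is similar: first confirm that there is a deadlock, then find which two threads are deadlocked, then debug to see exactly where each thread is stuck.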

Code

This code creates four threads: two that deadlock each other and two that repeatedly write to an array.

#include <unistd.h> 
#include <pthread.h> 
#include <string.h> 
 
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER; 
pthread_mutex_t mutex2 = PTHREAD_MUTEX_INITIALIZER; 
pthread_mutex_t mutex3 = PTHREAD_MUTEX_INITIALIZER; 
pthread_mutex_t mutex4 = PTHREAD_MUTEX_INITIALIZER; 
 
static int sequence1 = 0; 
static int sequence2 = 0; 
 
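int func1() 
{ 
   pthread_mutex_lock(&mutex1);   // take mutex1 first
   ++sequence1; 
   sleep(1);                      // give thread2 time to take mutex2
   pthread_mutex_lock(&mutex2);   // blocks forever once thread2 holds mutex2
   ++sequence2; 
   pthread_mutex_unlock(&mutex2); 
   pthread_mutex_unlock(&mutex1); 
 
   return sequence1; 
} 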
 
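int func2() 
{ 
   pthread_mutex_lock(&mutex2);   // take mutex2 first (opposite order to func1)
   ++sequence2; 
   sleep(1);                      // give thread1 time to take mutex1
   pthread_mutex_lock(&mutex1);   // blocks forever once thread1 holds mutex1
   ++sequence1; 
   pthread_mutex_unlock(&mutex1); 
   pthread_mutex_unlock(&mutex2); 
 
   return sequence2; 
} 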
 
void* thread1(void* arg) 
{ 
   while (1) 
   { 
       int iRetValue = func1(); 
 
       if (iRetValue == 100000) 
       { 
           pthread_exit(NULL); 
       } 
   } 
} 
 
void* thread2(void* arg) 
{ 
   while (1) 
   { 
       int iRetValue = func2(); 
 
       if (iRetValue == 100000) 
       { 
           pthread_exit(NULL); 
       } 
   } 
} 
 
void* thread3(void* arg) 
{ 
   while (1) 
   { 
       sleep(1); 
       char szBuf[128]; 
       memset(szBuf, 0, sizeof(szBuf)); 
       strcpy(szBuf, "thread3"); 
   } 
} 
 
void* thread4(void* arg) 
{ 
   while (1) 
   { 
       sleep(1); 
       char szBuf[128]; 
       memset(szBuf, 0, sizeof(szBuf)); 
       strcpy(szBuf, "thread4"); 
   } 
} 
 
int main() 
{ 
   pthread_t tid[4]; 
   if (pthread_create(&tid[0], NULL, &thread1, NULL) != 0) 
   { 
       _exit(1); 
   } 
   if (pthread_create(&tid[1], NULL, &thread2, NULL) != 0) 
   { 
       _exit(1); 
   } 
   if (pthread_create(&tid[2], NULL, &thread3, NULL) != 0) 
   { 
       _exit(1); 
   } 
   if (pthread_create(&tid[3], NULL, &thread4, NULL) != 0) 
   { 
       _exit(1); 
   } 
 
   sleep(5); 
   //pthread_cancel(tid[0]); 
 
   pthread_join(tid[0], NULL); 
   pthread_join(tid[1], NULL); 
   pthread_join(tid[2], NULL); 
   pthread_join(tid[3], NULL); 
 
   pthread_mutex_destroy(&mutex1); 
   pthread_mutex_destroy(&mutex2); 
   pthread_mutex_destroy(&mutex3); 
   pthread_mutex_destroy(&mutex4); 
 
   return 0; 
}

Compile and run

Method 1: strace

Find our process

$ ps aux -T |grep a.out
root      6794  6794  0.0  0.0  38416  1664 pts/0    Sl+  14:23   0:00 ./a.out
root      6794  6795  0.0  0.0  38416  1664 pts/0    Sl+  14:23   0:00 ./a.out
root      6794  6796  0.0  0.0  38416  1664 pts/0    Sl+  14:23   0:00 ./a.out
root      6794  6797  0.0  0.0  38416  1664 pts/0    Sl+  14:23   0:00 ./a.out
root      6794  6798  0.0  0.0  38416  1664 pts/0    Sl+  14:23   0:00 ./a.out
root      6800  6800  0.0  0.0   3216   892 pts/1    R+   14:23   0:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox a.out

We can see that process 6794, the program we ran, has 5 threads: the main thread created when the program starts, plus the 4 child threads it created.
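The same thread list can also be read from inside a program. The following is a minimal sketch, assuming a Linux system with /proc mounted; list_threads is a hypothetical helper name, not part of any library. It lists the entries of /proc/<pid>/task, where each entry is one thread ID (the second numeric column in the ps output above).

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid(), used when calling list_threads(getpid()) */

/* Hypothetical helper: prints every thread ID (LWP) of `pid` by listing
 * /proc/<pid>/task and returns the number of threads, or -1 on error. */
int list_threads(pid_t pid)
{
    assert(pid > 0);

    char path[64];
    snprintf(path, sizeof(path), "/proc/%ld/task", (long)pid);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;                 /* no such process, or /proc unavailable */

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;              /* skip "." and ".." */
        printf("pid %ld tid %s\n", (long)pid, entry->d_name);
        ++count;
    }
    closedir(dir);
    return count;
}
```

Calling list_threads(getpid()) from the test program's main thread after the four pthread_create calls should report five entries, matching the ps listing.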

Use strace to check the state of each thread

# root @ debian in ~ [14:27:42] C:130
$ strace -p 6794        
strace: Process 6794 attached
futex(0x7f1d36d1a9d0, FUTEX_WAIT, 6795, NULL^Cstrace: Process 6794 detached
 <detached ...>


# root @ debian in ~ [14:27:46] C:130
$ strace -p 6795
strace: Process 6795 attached
futex(0x5608207030e0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 6795 detached
 <detached ...>


# root @ debian in ~ [14:27:51] C:130
$ strace -p 6796
strace: Process 6796 attached
futex(0x5608207030a0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 6796 detached
 <detached ...>


# root @ debian in ~ [14:27:55] C:130
$ strace -p 6797
strace: Process 6797 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7f1d35d17e20) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7f1d35d17e20) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7f1d35d17e20) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7f1d35d17e20) = 0
nanosleep({tv_sec=1, tv_nsec=0}, ^Cstrace: Process 6797 detached
 <detached ...>


# root @ debian in ~ [14:28:02] C:130
$ strace -p 6798
strace: Process 6798 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7f1d35516e20) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7f1d35516e20) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7f1d35516e20) = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7f1d35516e20) = 0
nanosleep({tv_sec=1, tv_nsec=0}, ^Cstrace: Process 6798 detached
 <detached ...>

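The main thread must be blocked or looping, otherwise the program would have finished and exited, so 6794 is in a waiting state (here it is blocked in pthread_join). Running strace several times shows that 6795 and 6796 are also always waiting. In a program that is running normally it is hard to catch a thread waiting on a lock at the moment you sample it, let alone find it in the same wait on every attempt, so this almost certainly indicates a deadlock. Threads 6797 and 6798 match the code's flow: sleep, then do some work; strace logs each nanosleep system call. For more about futex, see the futex documentation.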

Debugging with gdb

$ gdb
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".

(gdb) attach 7291
Attaching to process 7291
[New LWP 7292]
[New LWP 7293]
[New LWP 7294]
[New LWP 7295]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fea5c5c6495 in __GI___pthread_timedjoin_ex (threadid=140644543452928, thread_return=0x0, abstime=0x0, block=<optimized out>) at pthread_join_common.c:89
89	pthread_join_common.c: No such file or directory.

(gdb) info threads 
  Id   Target Id                                Frame 
* 1    Thread 0x7fea5c0d6740 (LWP 7291) "a.out" 0x00007fea5c5c6495 in __GI___pthread_timedjoin_ex (threadid=140644543452928, thread_return=0x0, abstime=0x0, 
    block=<optimized out>) at pthread_join_common.c:89
  2    Thread 0x7fea5c0d5700 (LWP 7292) "a.out" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
  3    Thread 0x7fea5b8d4700 (LWP 7293) "a.out" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
  4    Thread 0x7fea5b0d3700 (LWP 7294) "a.out" 0x00007fea5c1a1720 in __GI___nanosleep (requested_time=requested_time@entry=0x7fea5b0d2e20, 
    remaining=remaining@entry=0x7fea5b0d2e20) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  5    Thread 0x7fea5a8d2700 (LWP 7295) "a.out" 0x00007fea5c1a1720 in __GI___nanosleep (requested_time=requested_time@entry=0x7fea5a8d1e20, 
    remaining=remaining@entry=0x7fea5a8d1e20) at ../sysdeps/unix/sysv/linux/nanosleep.c:28

(gdb) thread apply all bt

Thread 5 (Thread 0x7fea5a8d2700 (LWP 7295)):
#0  0x00007fea5c1a1720 in __GI___nanosleep (requested_time=requested_time@entry=0x7fea5a8d1e20, remaining=remaining@entry=0x7fea5a8d1e20)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1  0x00007fea5c1a162a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2  0x0000558bb85f732c in thread4 (arg=0x0) at test.cpp:80
#3  0x00007fea5c5c4fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#4  0x00007fea5c1d44cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fea5b0d3700 (LWP 7294)):
#0  0x00007fea5c1a1720 in __GI___nanosleep (requested_time=requested_time@entry=0x7fea5b0d2e20, remaining=remaining@entry=0x7fea5b0d2e20)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1  0x00007fea5c1a162a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2  0x0000558bb85f72e7 in thread3 (arg=0x0) at test.cpp:69
#3  0x00007fea5c5c4fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#4  0x00007fea5c1d44cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fea5b8d4700 (LWP 7293)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1  0x00007fea5c5c7714 in __GI___pthread_mutex_lock (mutex=0x558bb85fa0a0 <mutex1>) at ../nptl/pthread_mutex_lock.c:80
#2  0x0000558bb85f724e in func2 () at test.cpp:31
#3  0x0000558bb85f72b5 in thread2 (arg=0x0) at test.cpp:56
#4  0x00007fea5c5c4fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007fea5c1d44cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fea5c0d5700 (LWP 7292)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1  0x00007fea5c5c7714 in __GI___pthread_mutex_lock (mutex=0x558bb85fa0e0 <mutex2>) at ../nptl/pthread_mutex_lock.c:80
#2  0x0000558bb85f71ea in func1 () at test.cpp:18
#3  0x0000558bb85f728e in thread1 (arg=0x0) at test.cpp:43
#4  0x00007fea5c5c4fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007fea5c1d44cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fea5c0d6740 (LWP 7291)):
#0  0x00007fea5c5c6495 in __GI___pthread_timedjoin_ex (threadid=140644543452928, thread_return=0x0, abstime=0x0, block=<optimized out>) at pthread_join_common.c:89
#1  0x0000558bb85f7444 in main () at test.cpp:110
(gdb) p mutex1
$1 = {__data = {__lock = 2, __count = 0, __owner = 7292, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
  __size = "\002\000\000\000\000\000\000\000|\034\000\000\001", '\000' <repeats 26 times>, __align = 2}
(gdb) p mutex2
$2 = {__data = {__lock = 2, __count = 0, __owner = 7293, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
  __size = "\002\000\000\000\000\000\000\000}\034\000\000\001", '\000' <repeats 26 times>, __align = 2}

(gdb) detach 
Detaching from program: /home/arthas/code/a.out, process 7291
[Inferior 1 (process 7291) detached]
(gdb) q

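The thread IDs here differ from the strace session above because the program was originally compiled without -g, so printing mutex1 showed no detail. It was recompiled with debug information and run again.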

attach 7291

Attach to the process.

info threads

Show a summary of the threads.

thread apply all bt

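Print a backtrace for every thread. From this, find the details of the two deadlocked threads, gdb Thread 2 and Thread 3 (running thread1() and thread2()): which function and line each has reached and which lock it is waiting for.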

p mutex1

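Printing the locks shows that mutex2, which Thread 2 is waiting for, is owned by Thread 3 (__owner = 7293), and mutex1, which Thread 3 is waiting for, is owned by Thread 2 (__owner = 7292), so the two threads are deadlocked. With the problem located, run detach to detach from the process and q to quit gdb.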
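One common way to remove the lock-order inversion that gdb exposes here is to make every thread acquire the two mutexes in the same order. The sketch below is not from the original program: lock_pair and unlock_pair are hypothetical helpers, and ordering by address (via uintptr_t) is an assumption that works on mainstream platforms but is implementation-defined in C.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helper: always take two mutexes in ascending address
 * order, so every caller agrees on one global lock order. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    assert(a != NULL && b != NULL && a != b);
    if ((uintptr_t)a > (uintptr_t)b) {   /* swap so `a` is the lower address */
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

/* Hypothetical helper: release both mutexes; the unlock order does not
 * affect correctness. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}
```

In the test program, func1() and func2() would each call lock_pair(&mutex1, &mutex2) before touching sequence1 and sequence2, and unlock_pair afterwards. Because both threads then request the locks in the same order, neither can hold one lock while waiting for the other.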

Method 2: valgrind

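valgrind is a very useful tool that can analyze many run-time errors, such as use of uninitialized memory, use of freed memory, and memory leaks, and it also covers deadlocks.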

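Using valgrind is simple. It is a tool suite, so you must specify which tool to use; Helgrind is the one for analyzing thread errors such as deadlocks and can point to where the problem is. Run the program under valgrind, wait for the problem to occur, then press Ctrl+C to terminate it and read the printed report.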

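The output below shows that valgrind reports two errors and names the threads involved, Thread #2 and Thread #3. It also lists the call stack for each error, so all that remains is to go to the code and fix it.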

$ valgrind --tool=helgrind ./a.out 
==7608== Helgrind, a thread error detector
==7608== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
==7608== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==7608== Command: ./a.out
==7608== 
^C==7608== 
==7608== Process terminating with default action of signal 2 (SIGINT)
==7608==    at 0x4866495: __pthread_timedjoin_ex (pthread_join_common.c:89)
==7608==    by 0x48398F5: pthread_join_WRK (hg_intercepts.c:553)
==7608==    by 0x109443: main (test.cpp:110)
==7608== ---Thread-Announcement------------------------------------------
==7608== 
==7608== Thread #2 was created
==7608==    at 0x4C984BE: clone (clone.S:71)
==7608==    by 0x4863DDE: create_thread (createthread.c:101)
==7608==    by 0x486580D: pthread_create@@GLIBC_2.2.5 (pthread_create.c:826)
==7608==    by 0x483C6B7: pthread_create_WRK (hg_intercepts.c:427)
==7608==    by 0x109379: main (test.cpp:90)
==7608== 
==7608== ----------------------------------------------------------------
==7608== 
==7608== Thread #2: Exiting thread still holds 1 lock
==7608==    at 0x486E29C: __lll_lock_wait (lowlevellock.S:103)
==7608==    by 0x4867713: pthread_mutex_lock (pthread_mutex_lock.c:80)
==7608==    by 0x4839C66: mutex_lock_WRK (hg_intercepts.c:902)
==7608==    by 0x1091E9: func1() (test.cpp:18)
==7608==    by 0x10928D: thread1(void*) (test.cpp:43)
==7608==    by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
==7608==    by 0x4864FA2: start_thread (pthread_create.c:486)
==7608==    by 0x4C984CE: clone (clone.S:95)
==7608== 
==7608== ---Thread-Announcement------------------------------------------
==7608== 
==7608== Thread #3 was created
==7608==    at 0x4C984BE: clone (clone.S:71)
==7608==    by 0x4863DDE: create_thread (createthread.c:101)
==7608==    by 0x486580D: pthread_create@@GLIBC_2.2.5 (pthread_create.c:826)
==7608==    by 0x483C6B7: pthread_create_WRK (hg_intercepts.c:427)
==7608==    by 0x1093AD: main (test.cpp:94)
==7608== 
==7608== ----------------------------------------------------------------
==7608== 
==7608== Thread #3: Exiting thread still holds 1 lock
==7608==    at 0x486E29C: __lll_lock_wait (lowlevellock.S:103)
==7608==    by 0x4867713: pthread_mutex_lock (pthread_mutex_lock.c:80)
==7608==    by 0x4839C66: mutex_lock_WRK (hg_intercepts.c:902)
==7608==    by 0x10924D: func2() (test.cpp:31)
==7608==    by 0x1092B4: thread2(void*) (test.cpp:56)
==7608==    by 0x483C8B6: mythread_wrapper (hg_intercepts.c:389)
==7608==    by 0x4864FA2: start_thread (pthread_create.c:486)
==7608==    by 0x4C984CE: clone (clone.S:95)
==7608== 
==7608== 
==7608== For counts of detected and suppressed errors, rerun with: -v
==7608== Use --history-level=approx or =none to gain increased speed, at
==7608== the cost of reduced accuracy of conflicting-access information
==7608== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 21 from 4)

Method 3: pstack

pstack is a debugging tool that prints stack traces, available on Solaris, the Red Hat family (Fedora, CentOS), the Debian family (Ubuntu), and others.

On Debian 10, however, it installs successfully but fails when run:

$ sudo pstack 1798 

1798: ./a.out
pstack: Input/output error
failed to read target.

