Advanced Linux Debugging and Optimization: A Hands-On Memory Corruption Analysis


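While putting together some documentation on Linux debugging recently, I ran into a memory corruption problem in which data was overwritten ("stomped") by another writer, so I am taking the opportunity to record the analysis.

After hitting the problem, the first thing to check is whether a coredump was produced. Sure enough, there was one, so straight into gdb.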

 

$ arm-buildroot-linux-gnueabi-gdb ./linecard ~/core_tMscRcv_165

GNU gdb (GDB) 7.10.1
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-unknown-linux-gnu --target=arm-buildroot-linux-gnueabi".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./linecard...done.

warning: exec file is newer than core file.

[New LWP 276]

... ...

 [New LWP 303]

warning: .dynamic section for "./libethernet_oam.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libigmp_adapter.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libmsc.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libolt_config.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libonu_config.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libonu_vlan.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libpolicy.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libppt.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./librms.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libtime_sync.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libvoice.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libprivate_com.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libonu_ability.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libomci.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libconfig_data_gpon.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./liblineid_adapter.so" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "./libbep.so" is not at the expected address (wrong library or version mismatch?)

warning: Could not load shared library symbols for 11 libraries, e.g. /usr/local/libfhdrv_kdrv_board_impl.so.

Use the "info sharedlibrary" command to see the complete listing.

Do you need "set solib-search-path" or "set sysroot"?

Core was generated by `./linecard'.

Program terminated with signal SIGABRT, Aborted.                                /* First determine which signal killed the program; that narrows the scope quickly and hints at what probably went wrong */

#0  0xe9d2f630 in ?? ()

[Current thread is 1 (LWP 276)]

gdb cannot find matching shared libraries, so the symbols are missing. What now? Point gdb at the right library directories with set solib-search-path:

(gdb) set solib-search-path /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/mnt/work/linecard_app:/media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/usr/local:/media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib:/media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/usr/lib
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/usr/local/libfhdrv_kdrv_board_impl.so...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/usr/local/libvirtual_netdev_drv.so...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/libpthread.so.0...(no debugging symbols found)...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/librt.so.1...(no debugging symbols found)...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/ld-linux.so.3...(no debugging symbols found)...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/libgcc_s.so.1...(no debugging symbols found)...done.
Reading symbols from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/libnss_files.so.2...(no debugging symbols found)...done.
/media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/mnt/work/linecard_app

Use the bt command to look at the call stack:

(gdb) bt
#0  0xe9d2f630 in raise () from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/libc.so.6
#1  0xe9d309c8 in abort () from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/lib/libc.so.6
#2  0xf5a52158 in diag_displayTermPack (ucTermType=<optimized out>, usTermCmd=<optimized out>, pucData=0xc6e61440 "\001", usDataLen=<optimized out>)
    at ../../code/diag/diag/diag_display.c:147
#3  0xf5a62288 in wos_taskDiagMsgProc (pulPara=0xf5b20da8 <g_astWosPerfMsgDispatchStat+44>) at ../../code/diag/wos/wos_task.c:1973
#4  <signal handler called>         /* the signal handler was entered here */
#5  0xe9c71ca0 in __dynamic_cast () from /media/new/linyao/2_xPON/xPON/src/config_data_gpon/squashfs-root/usr/lib/libstdc++.so.6
#6  0xf6416f70 in cfg_mod_set_onu_local_mng_interface_config_flag (pon_no=0, onu_no=0, cfg_flag=0 '\000')
    at /media/new/linyao/2_xPON/xPON/src/config_data_gpon/config_module_cinterface.cpp:35769
#7  0xf6cfa60c in str_ipaddr_to_general_ip (pgeneral_ip=0x1 <error: Cannot access memory at address 0x1>,
    str_ipaddr=0x80075 <error: Cannot access memory at address 0x80075>) at /media/new/linyao/2_xPON/xPON/src/voice/voice_service_global.cpp:945
#8  0xf5d382bc in CMscCommandObject::Parse (this=0x25a7a38, buf=0xde83a22c "\016\002\237\001\001\004\025", buf_len=676, session_id=0, volt_id=0 '\000', is_head=1 '\001')
    at /media/new/jenkins/workspace/workspace/201716-OLT-CBB-PONSYSTEM_LINUX-fsl61293-coverity/src/service_module/msc_command_object.cpp:201
#9  0xf5d39d48 in CServiceObject::DispatchMscMessage (this=0x2598818, cmd_id=1045, pbuf=0xde83a22c "\016\002\237\001\001\004\025", buf_len=676, session_id=0,
    volt_id=0 '\000', is_head=1 '\001') at /media/new/jenkins/workspace/workspace/201716-OLT-CBB-PONSYSTEM_LINUX-fsl61293-coverity/src/service_module/service_object.cpp:46
#10 0xf5d29f28 in CPonSystem::DispatchMscMessage (this=0x849a80, cmd_id=1045, pbuf=0xde83a22c "\016\002\237\001\001\004\025", buf_len=676, session_id=0, is_head=1 '\001',
    volt_id=0 '\000') at /media/new/jenkins/workspace/workspace/201716-OLT-CBB-PONSYSTEM_LINUX-fsl61293-coverity/src/service_module/pon_system.hpp:566
#11 0xf5d25494 in pon_system_dispatch_msc (cmd_id=1045, pbuf=0xde83a22c "\016\002\237\001\001\004\025", buf_len=676, session_id=0, is_head=1 '\001')
    at /media/new/jenkins/workspace/workspace/201716-OLT-CBB-PONSYSTEM_LINUX-fsl61293-coverity/src/service_module/pon_system_cinterface.cpp:385
#12 0xf728c46c in MSC_RegProcGswCmdFunToManager (cmdId=1045, cmdType=0 '\000', pFun=0xde83a22c, pPrmt=0x2a4 <error: Cannot access memory at address 0x2a4>)
    at /media/new/linyao/2_xPON/xPON/src/msc/msc_main.c:1477
#13 0xf72897dc in __gnu_cxx::__normal_iterator<CServiceObject**, std::vector<CServiceObject*, std::allocator<CServiceObject*> > >::__normal_iterator (this=0x0,
    __i=@0x1: <error reading variable>)
    at /opt/toolchains/crosstools-arm-gcc-5.3-linux-4.1-glibc-2.24-binutils-2.25/usr/arm-buildroot-linux-gnueabi/include/c++/5.3.0/bits/stl_iterator.h:740
#14 0xf72897dc in __gnu_cxx::__normal_iterator<CServiceObject**, std::vector<CServiceObject*, std::allocator<CServiceObject*> > >::__normal_iterator (this=0xe3400002,
    __i=@0xe300c406: <error reading variable>)
    at /opt/toolchains/crosstools-arm-gcc-5.3-linux-4.1-glibc-2.24-binutils-2.25/usr/arm-buildroot-linux-gnueabi/include/c++/5.3.0/bits/stl_iterator.h:740
#15 0xf72897dc in __gnu_cxx::__normal_iterator<CServiceObject**, std::vector<CServiceObject*, std::allocator<CServiceObject*> > >::__normal_iterator (this=0x0,
    __i=@0x0: <error reading variable>)
    at /opt/toolchains/crosstools-arm-gcc-5.3-linux-4.1-glibc-2.24-binutils-2.25/usr/arm-buildroot-linux-gnueabi/include/c++/5.3.0/bits/stl_iterator.h:740
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Frame 4 is where the signal handler was entered, so the fault itself must have occurred in frame 5.

(gdb) frame 5
#5  0xe9c71ca0 in __dynamic_cast () from /media/new/justin/2_xPON/xPON/src/config_data_gpon/squashfs-root/usr/lib/libstdc++.so.6
(gdb) info registers
r0             0xb8d70860 3101100128
r1             0xf5d6c6c0 4124493504
r2             0xf5d6c9e0 4124494304
r3             0x0 0
r4             0xb8d70860 3101100128
r5             0xf65f1808 4133427208
r6             0x79e0 31200
r7             0xf5af8ed4 4121923284
r8             0xf5af903c 4121923644
r9             0x64 100
r10            0xf5b1a118 4122059032
r11            0xc6e581ac 3336929708
r12            0x2564 9572
sp             0xc6e58120 0xc6e58120
lr             0xf6416f70 -163483792
pc             0xe9c71ca0 0xe9c71ca0 <__dynamic_cast+16>
cpsr           0xa0070010 -1610153968
(gdb) disassemble
Dump of assembler code for function __dynamic_cast:
   0xe9c71c90 <+0>: ldr r12, [r0]
   0xe9c71c94 <+4>: push {r4, r5, r6, r7, r8, lr}
   0xe9c71c98 <+8>: mov r4, r0
   0xe9c71c9c <+12>: sub sp, sp, #40 ; 0x28
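=> 0xe9c71ca0 <+16>: ldr lr, [r12, #-8]                                            /* Faulting instruction: it loads the word at address r12-8 into lr. The registers printed above show r12 = 0x2564, a tiny address near zero that is not mapped, so this is clearly an invalid memory access */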
(gdb) x/40wx $sp
0xc6e58120: 0xf5d60101 0xb8d6d50c 0xf5d6d1c4 0xf65f1808
0xc6e58130: 0xc6e5815c 0xf5d09c3c 0xc6e5815c 0x00000002
0xc6e58140: 0x0013849e 0x00080075 0xf6e05bb4 0xf65f1808
0xc6e58150: 0x000079e0 0xf5af8ed4 0xf5af903c 0xf6416f70
0xc6e58160: 0x00000000 0xf5ce3f5c 0x000079e0 0x00000415
0xc6e58170: 0x001381ac 0x00080075 0x023f3460 0x00000101
0xc6e58180: 0x023f3460 0x00080101 0x00083460 0x00000000
0xc6e58190: 0xb8d703b8 0xb8d70338 0xb8d4d4b0 0x8000007a
0xc6e581a0: 0xf6e05bb4 0x00000008 0xc6e60944 0xf6cfa60c
0xc6e581b0: 0xc6e60870 0x00000001 0x00000001 0x00000001

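The faulting instruction is ldr lr, [r12, #-8]. At this point r12 is 0x2564, so the load targets 0x255c, an address in the unmapped region near zero. The error is therefore an invalid address access.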
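As a quick check of the arithmetic, the sketch below computes the address that the faulting `ldr lr, [r12, #-8]` tried to read. The 4 KiB page size is an assumption (the usual choice on 32-bit ARM Linux, not stated in this write-up) and the helper name is made up for illustration. For context: in the Itanium C++ ABI that GCC follows, the word 8 bytes below a 32-bit vtable pointer holds the offset-to-top entry, which is presumably why `__dynamic_cast` reads `[r12, #-8]` right after loading what it expects to be the object's vtable pointer from `[r0]`.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def effective_address(base: int, offset: int) -> int:
    """Address accessed by `ldr rX, [base, #offset]` on a 32-bit CPU."""
    return (base + offset) & 0xFFFFFFFF


r12 = 0x2564  # value from `info registers` in frame 5
addr = effective_address(r12, -8)
page = addr // PAGE_SIZE

print(f"load address = {addr:#x}, page index = {page}")
```

Under the 4 KiB assumption the address falls within the first few pages of the address space, which an ordinary process does not have mapped.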

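The preceding instruction, sub sp, sp, #40, reserves 40 bytes of stack for this function's locals, on top of the 24 bytes taken by the six registers pushed just before it, so the stack contents are dumped with x/40wx $sp.

At this point pc is in the code segment of libstdc++.so.6, inside __dynamic_cast, while lr (the function return address) is in the code segment of libconfig_xxx.so, at the place where the ONU_CONFIGOBJECT() macro is invoked.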

 (gdb) list *0xf6416f70
0xf6416f70 is in cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char) (/media/new/linyao/2_xPON/xPON/src/config_data_gpon/config_module_cinterface.cpp:35769).
35764         MAPLE_LOG(ULLOG_PRI_ERROR, "onuno %d\r\n", onu_no);
35765         return ERR_INVALID_ONU_NO;
35766     }
35767 
35768     /* [non-ASCII comment, mis-encoded in the gdb output] */
35769     COnuConfigObject * onu_config_object = ONU_CONFIGOBJECT(pon_no, onu_no);
35770     if (NULL != onu_config_object)
35771     {
35772         /* [non-ASCII comment, mis-encoded in the gdb output] */
35773         synchronized(onu_config_object)

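Stepping back one more instruction gives push {r4, r5, r6, r7, r8, lr}, which saves lr on the stack. The lr value can be found in the stack dump at 0xc6e5815c, and from there the saved values of r4 to r8 follow: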

r8 0xf5af903c
r7 0xf5af8ed4
r6 0x000079e0
r5 0xf65f1808
r4 0xf6e05bb4
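The recovery of r4 to r8 can be reproduced mechanically. The sketch below is only an illustration with made-up helper names: it copies the two relevant rows of the `x/40wx $sp` dump and undoes the prologue `push {r4, r5, r6, r7, r8, lr}; sub sp, sp, #40`, relying on the fact that `push` stores the lowest-numbered register at the lowest address.

```python
# Two rows from `x/40wx $sp` in frame 5, keyed by row address (4 words per row).
STACK_ROWS = {
    0xC6E58140: [0x0013849E, 0x00080075, 0xF6E05BB4, 0xF65F1808],
    0xC6E58150: [0x000079E0, 0xF5AF8ED4, 0xF5AF903C, 0xF6416F70],
}


def read_word(rows, addr):
    """Return the 32-bit word stored at a 4-byte-aligned address."""
    base = addr & ~0xF
    return rows[base][(addr - base) // 4]


def saved_registers(rows, sp, locals_size, reg_list):
    """Undo `push {reg_list}; sub sp, sp, #locals_size`.

    The saved-register block starts right above the local variable area,
    with the first register of the list at the lowest address.
    """
    block = sp + locals_size
    return {reg: read_word(rows, block + 4 * i) for i, reg in enumerate(reg_list)}


regs = saved_registers(
    STACK_ROWS,
    sp=0xC6E58120,          # sp in frame 5
    locals_size=40,         # from `sub sp, sp, #40`
    reg_list=["r4", "r5", "r6", "r7", "r8", "lr"],
)
for name, value in regs.items():
    print(f"{name:>2} = {value:#010x}")
```

The saved lr comes out at 0xc6e5815c (0xc6e58120 + 40 + 5 * 4), matching the address read off the dump by hand.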

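One instruction further up is ldr r12, [r0], which loads the word that r0 points to into r12.
None of these instructions modifies r0, so dump the memory around the address in r0.

(gdb) x/64wx $r0-64 (dump 256 bytes, starting 64 bytes below the address in r0)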
0xb8d70820: 0x00000000 0x00000000 0x00000000 0x00000000
0xb8d70830: 0xfe0f0000 0xffffffff 0x0a00ffff 0x0e888888
0xb8d70840: 0xf90f0081 0x01000608 0x04060008 0x0a000100
0xb8d70850: 0x0e888888 0x0e64190a 0x00000000 0x190a0000
0xb8d70860: 0x00002564 0x00000000 0x00000000 0x00000000
0xb8d70870: 0x00000000 0x00080001 0x00750000 0x00000013
0xb8d70880: 0x00000000 0x00000002 0x00000000 0x31312d38
0xb8d70890: 0x39312d37 0x0000322f 0x00000000 0x00000000
0xb8d708a0: 0x00000000 0x00000000 0x00000000 0xb8d708e0
0xb8d708b0: 0xb8d4d4b0 0x00000000 0x00000000 0x00000000
0xb8d708c0: 0xb8d708b8 0xb8d708b8 0x00000000 0x00000000
0xb8d708d0: 0x00000000 0x00000013 0x00000002 0x000004ad
0xb8d708e0: 0xf65f1448 0x00000000 0x00000001 0xb8d70860
0xb8d708f0: 0x00000000 0x00000000 0x00000000 0x00000000
0xb8d70900: 0x00000000 0x00000000 0x00000000 0x00000000
0xb8d70910: 0x00000000 0x00000000 0x00000000 0x00000000
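The word at the address in r0 (0xb8d70860) is 0x00002564, the bad value that was loaded into r12. So where did r0 come from? With that question in mind, move up to the calling function.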

(gdb) frame 6
#6  0xf6416f70 in cfg_mod_set_onu_local_mng_interface_config_flag (pon_no=0, onu_no=0, cfg_flag=0 '\000')
    at /media/new/justin/2_xPON/xPON/src/config_data_gpon/config_module_cinterface.cpp:35769
35769     COnuConfigObject * onu_config_object = ONU_CONFIGOBJECT(pon_no, onu_no);
(gdb) info registers
r0             0xb8d70860 3101100128
r1             0xf5d6c6c0 4124493504
r2             0xf5d6c9e0 4124494304
r3             0x0 0
r4             0xf6e05bb4 4141898676
r5             0xf65f1808 4133427208
r6             0x79e0 31200
r7             0xf5af8ed4 4121923284
r8             0xf5af903c 4121923644
r9             0x64 100
r10            0xf5b1a118 4122059032
r11            0xc6e581ac 3336929708
r12            0x2564 9572
sp             0xc6e58160 0xc6e58160
lr             0xf6416f70 -163483792
pc             0xf6416f70 0xf6416f70 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+252>
cpsr           0xa0070010 -1610153968
(gdb) info frame
Stack level 6, frame at 0xc6e581b0:
 pc = 0xf6416f70 in cfg_mod_set_onu_local_mng_interface_config_flag (/media/new/justin/2_xPON/xPON/src/config_data_gpon/config_module_cinterface.cpp:35769); saved pc = 0xf6cfa60c
 called by frame at 0xc6e60948, caller of frame at 0xc6e58160
 source language c++.
 
 Arglist at 0xc6e581ac, args: pon_no=0, onu_no=0, cfg_flag=0 '\000'
 Locals at 0xc6e581ac, Previous frame's sp is 0xc6e581b0
 Saved registers:
  r4 at 0xc6e581a0, r5 at 0xc6e581a4, r11 at 0xc6e581a8, lr at 0xc6e581ac

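Disassemble to locate the return address in frame 6.

(gdb) disassemble (the execution flow is reconstructed from the register values and stack data; in the original formatting, gray marked code that was not executed and blue marked jump targets)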

Dump of assembler code for function cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char):
   0xf6416e74 <+0>: push {r4, r5, r11, lr}
   0xf6416e78 <+4>: add r11, sp, #12
   0xf6416e7c <+8>: sub sp, sp, #32
   0xf6416e80 <+12>: mov r3, r1
   0xf6416e84 <+16>: strh r0, [r11, #-30] ; 0xffffffe2
   0xf6416e88 <+20>: strh r3, [r11, #-32] ; 0xffffffe0
   0xf6416e8c <+24>: mov r3, r2
   0xf6416e90 <+28>: strb r3, [r11, #-33] ; 0xffffffdf
   0xf6416e94 <+32>: ldr r5, [pc, #556] ; 0xf64170c8 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+596>
   0xf6416e98 <+36>: add r5, pc, r5
   0xf6416e9c <+40>: ldrb r3, [r11, #-33] ; 0xffffffdf
   0xf6416ea0 <+44>: cmp r3, #1
   0xf6416ea4 <+48>: bls 0xf6416ed4 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+96>
   0xf6416ea8 <+52>: ldrb r3, [r11, #-33] ; 0xffffffdf
   0xf6416eac <+56>: str r3, [sp]
   0xf6416eb0 <+60>: ldr r3, [pc, #532] ; 0xf64170cc <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+600>
   0xf6416eb4 <+64>: add r3, pc, r3
   0xf6416eb8 <+68>: ldr r2, [pc, #528] ; 0xf64170d0 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+604>
   0xf6416ebc <+72>: add r2, pc, r2
   0xf6416ec0 <+76>: mov r1, #13
   0xf6416ec4 <+80>: mov r0, #10
   0xf6416ec8 <+84>: bl 0xf63afc70 <ulogPrintf@plt>
   0xf6416ecc <+88>: mov r4, #-2147483645 ; 0x80000003
   0xf6416ed0 <+92>: b 0xf64170a4 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+560>
   0xf6416ed4 <+96>: ldrh r3, [r11, #-30] ; 0xffffffe2
   0xf6416ed8 <+100>: mov r0, r3
   0xf6416edc <+104>: bl 0xf63aa6cc <valid_port_no@plt>
                                    (gdb) disassemble 0xf63aa6cc
                                    Dump of assembler code for function valid_port_no@plt:
                                    0xf63aa6cc <+0>: add r12, pc, #2097152 ; 0x200000
                                    0xf63aa6d0 <+4>: add r12, r12, #290816 ; 0x47000
                                    0xf63aa6d4 <+8>: ldr pc, [r12, #3688]! ; 0xe68
                                    End of assembler dump.
   0xf6416ee0 <+108>: mov r3, r0
   0xf6416ee4 <+112>: cmp r3, #0
   0xf6416ee8 <+116>: movne r3, #1
   0xf6416eec <+120>: moveq r3, #0
   0xf6416ef0 <+124>: uxtb r3, r3
   0xf6416ef4 <+128>: cmp r3, #0
   0xf6416ef8 <+132>: beq 0xf6416f20 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+172>
   0xf6416efc <+136>: ldrh r3, [r11, #-30] ; 0xffffffe2
   0xf6416f00 <+140>: ldr r2, [pc, #460] ; 0xf64170d4 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+608>
   0xf6416f04 <+144>: add r2, pc, r2
   0xf6416f08 <+148>: mov r1, #13
   0xf6416f0c <+152>: mov r0, #10
   0xf6416f10 <+156>: bl 0xf63afc70 <ulogPrintf@plt>
   0xf6416f14 <+160>: mov r4, #134 ; 0x86
   0xf6416f18 <+164>: movt r4, #32768 ; 0x8000
   0xf6416f1c <+168>: b 0xf64170a4 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+560>
   0xf6416f20 <+172>: ldrh r3, [r11, #-32] ; 0xffffffe0
   0xf6416f24 <+176>: mov r0, r3  >>>>>>>>>Look Here>>>>>>>>>  r0 = r3 = [r11, #-32],
   0xf6416f28 <+180>: bl 0xf63ae9bc <valid_onu_no@plt>
                                    (gdb) disassemble 0xf63ae9bc
                                    Dump of assembler code for function valid_onu_no@plt:
                                    0xf63ae9bc <+0>: add r12, pc, #2097152 ; 0x200000
                                    0xf63ae9c0 <+4>: add r12, r12, #282624 ; 0x45000
                                    0xf63ae9c4 <+8>: ldr pc, [r12, #456]! ; 0x1c8
                                    End of assembler dump.
   0xf6416f2c <+184>: mov r3, r0
   0xf6416f30 <+188>: cmp r3, #0
   0xf6416f34 <+192>: movne r3, #1
   0xf6416f38 <+196>: moveq r3, #0
   0xf6416f3c <+200>: uxtb r3, r3
   0xf6416f40 <+204>: cmp r3, #0
   0xf6416f44 <+208>: beq 0xf6416f6c <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+248>
   0xf6416f48 <+212>: ldrh r3, [r11, #-32] ; 0xffffffe0
   0xf6416f4c <+216>: ldr r2, [pc, #388] ; 0xf64170d8 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+612>
   0xf6416f50 <+220>: add r2, pc, r2
   0xf6416f54 <+224>: mov r1, #13
   0xf6416f58 <+228>: mov r0, #10
   0xf6416f5c <+232>: bl 0xf63afc70 <ulogPrintf@plt>
   0xf6416f60 <+236>: mov r4, #135 ; 0x87
   0xf6416f64 <+240>: movt r4, #32768 ; 0x8000
   0xf6416f68 <+244>: b 0xf64170a4 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+560>
   0xf6416f6c <+248>: ldrh r1, [r11, #-32] ; 0xffffffe0        >>>>>>>>>> the value at [r11 - 32] (virtual address 0xc6e5818c) is 0x00000000, but r1 currently holds 0xf5d6c6c0, which suggests the stack data was overwritten
=> 0xf6416f70 <+252>: ldrh r3, [r11, #-30] ; 0xffffffe2
   0xf6416f74 <+256>: mov r2, #0
   0xf6416f78 <+260>: mov r0, r3
   0xf6416f7c <+264>: bl 0xf63a8aa0 <_ZN17CCardConfigObject15GetConfigObjectEtth@plt>
   0xf6416f80 <+268>: cmp r0, #0
   0xf6416f84 <+272>: bne 0xf6416f90 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+284>

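Consider r1 = [r11 - 32]. r1 currently holds 0xf5d6c6c0, but the value stored on the stack is 0x00000000, which suggests the stack data was overwritten. (r0 to r3 are caller-saved under the ARM calling convention, so for frame 6 gdb simply shows the values they had at the crash point in frame 5; the mismatch is a hint, not proof.)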

(gdb) p/x $r11-32

$1 = 0xc6e5818c

(gdb) x/wx 0xc6e5818c

0xc6e5818c: 0x00000000

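Now consider r0 = r3 = [r11 - 30], where r11 is the frame pointer and sp is the stack pointer. r0 currently holds 0xb8d70860, but the value stored on the stack is 0x03b80000, which again points to the stack data having been overwritten.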

(gdb) p/x $r11-30
$2 = 0xc6e5818e
(gdb) x/wx 0xc6e5818e
0xc6e5818e: 0x03b80000
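The address 0xc6e5818e is not 4-byte aligned, so the word printed by `x/wx` straddles two stack words. The sketch below rebuilds the bytes from the stack dump to show where 0x03b80000 comes from, and what the 16-bit `ldrh` instructions would load from the same locations. The helper names are illustrative only, and memory is assumed to be little-endian, as the byte dumps in this write-up indicate.

```python
import struct

# Two consecutive words from the stack dump (little-endian memory).
WORDS = {0xC6E5818C: 0x00000000, 0xC6E58190: 0xB8D703B8}


def to_bytes(words):
    """Flatten address->word pairs into bytes, lowest address first."""
    start = min(words)
    data = b"".join(struct.pack("<I", words[a]) for a in sorted(words))
    return start, data


start, mem = to_bytes(WORDS)


def read(addr, fmt):
    """Read a little-endian value of struct format `fmt` at `addr`."""
    return struct.unpack_from("<" + fmt, mem, addr - start)[0]


r11 = 0xC6E581AC
word_at_30 = read(r11 - 30, "I")  # what `x/wx $r11-30` prints
half_at_30 = read(r11 - 30, "H")  # what `ldrh r3, [r11, #-30]` loads
half_at_32 = read(r11 - 32, "H")  # what `ldrh r1, [r11, #-32]` loads

print(f"word  at r11-30: {word_at_30:#010x}")
print(f"half  at r11-30: {half_at_30:#06x}")
print(f"half  at r11-32: {half_at_32:#06x}")
```

Both halfwords are 0, the same values gdb reports for pon_no and onu_no in this frame.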

So who overwrote the stack? With so many threads running in the process, how do we track the culprit down?

(gdb) info registers
r0             0xb8d70860 3101100128
r1             0xf5d6c6c0 4124493504
r2             0xf5d6c9e0 4124494304
r3             0x0 0
r4             0xf6e05bb4 4141898676
r5             0xf65f1808 4133427208
r6             0x79e0 31200
r7             0xf5af8ed4 4121923284
r8             0xf5af903c 4121923644
r9             0x64 100
r10            0xf5b1a118 4122059032
r11            0xc6e581ac 3336929708
r12            0x2564 9572
sp             0xc6e58160 0xc6e58160
lr             0xf6416f70 -163483792
pc             0xf6416f70 0xf6416f70 <cfg_mod_set_onu_local_mng_interface_config_flag(PON_NO, ONU_NO, unsigned char)+252>
cpsr           0xa0070010 -1610153968

A look at the stack data shows nothing obviously abnormal.

(gdb) x /128wx $sp-256
0xc6e58060: 0x008499d0 0xf5b1a118 0x000000a9 0xf5d482d8
0xc6e58070: 0x000000a9 0xf5ce41f4 0xf5d6ca30 0xb8d4d4b0
0xc6e58080: 0x00000000 0xf5b1a118 0x000000a9 0xf5d482d8
0xc6e58090: 0xc6e5809c 0xf5ce41f4 0xf65f1808 0x000079e0
0xc6e580a0: 0xf5af8ed4 0xf5af903c 0x00000000 0xb8d70860
0xc6e580b0: 0xc6e580bc 0xf5a6da18 0xc6e580ec 0xf5ce4310
0xc6e580c0: 0x0000006b 0xf5d4c410 0x000003e8 0x023f3460
0xc6e580d0: 0xf5d6ca30 0x00000000 0x00000000 0x023f3460
0xc6e580e0: 0x0000006b 0x00000000 0xc6e580fc 0xf5ce3f5c
0xc6e580f0: 0xb8d6d50c 0xc6e5811c 0xc6e58134 0xf5d0ee0c
0xc6e58100: 0x000003e8 0xc6e5810c 0xf5d6d1c4 0x00000013
0xc6e58110: 0x00000002 0xb8d4d4b0 0xb8d6d4f4 0x023f3460
0xc6e58120: 0xf5d60101 0xb8d6d50c 0xf5d6d1c4 0xf65f1808
0xc6e58130: 0xc6e5815c 0xf5d09c3c 0xc6e5815c 0x00000002
0xc6e58140: 0x0013849e 0x00080075 0xf6e05bb4 0xf65f1808
0xc6e58150: 0x000079e0 0xf5af8ed4 0xf5af903c 0xf6416f70
0xc6e58160: 0x00000000 0xf5ce3f5c 0x000079e0 0x00000415
0xc6e58170: 0x001381ac 0x00080075 0x023f3460 0x00000101
0xc6e58180: 0x023f3460 0x00080101 0x00083460 0x00000000
0xc6e58190: 0xb8d703b8 0xb8d70338 0xb8d4d4b0 0x8000007a
0xc6e581a0: 0xf6e05bb4 0x00000008 0xc6e60944 0xf6cfa60c
0xc6e581b0: 0xc6e60870 0x00000001 0x00000001 0x00000001
0xc6e581c0: 0x00000000 0xeeeeeeee 0xeeeeeeee 0x0000024f
0xc6e581d0: 0xde83a281 0x025a7a38 0x00000000 0x00000000
0xc6e581e0: 0x00000000 0x00000000 0x00000000 0x00000000
0xc6e581f0: 0x00000000 0x00000000 0x00000000 0x00000000
0xc6e58200: 0x00000000 0x00000000 0x00000000 0x00000000
0xc6e58210: 0x00000000 0x00000000 0x00000000 0x00000000
0xc6e58220: 0x00000000 0x00000000 0x00000000 0x00000000
0xc6e58230: 0x00000000 0x00000000 0x00000000 0x00000000
0xc6e58240: 0x00000000 0x00000000 0x00000000 0x00000000
0xc6e58250: 0x00000000 0x00000000 0x00000000 0x00000000

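Since the problem lies in the memory that r0 points to, and the value 0x2564 found there looks very suspicious (it has clearly been overwritten too), take a look at the data around the address in r0.

Dump 256 bytes starting 64 bytes below the address in r0: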

(gdb) x /64wx $r0-64
0xb8d70820: 0x00000000 0x00000000 0x00000000 0x00000000
0xb8d70830: 0xfe0f0000 0xffffffff 0x0a00ffff 0x0e888888
0xb8d70840: 0xf90f0081 0x01000608 0x04060008 0x0a000100
0xb8d70850: 0x0e888888 0x0e64190a 0x00000000 0x190a0000
0xb8d70860: 0x00002564 0x00000000 0x00000000 0x00000000                           -------> there is 0x2564!
0xb8d70870: 0x00000000 0x00080001 0x00750000 0x00000013
0xb8d70880: 0x00000000 0x00000002 0x00000000 0x31312d38
0xb8d70890: 0x39312d37 0x0000322f 0x00000000 0x00000000
0xb8d708a0: 0x00000000 0x00000000 0x00000000 0xb8d708e0
0xb8d708b0: 0xb8d4d4b0 0x00000000 0x00000000 0x00000000
0xb8d708c0: 0xb8d708b8 0xb8d708b8 0x00000000 0x00000000
0xb8d708d0: 0x00000000 0x00000013 0x00000002 0x000004ad
0xb8d708e0: 0xf65f1448 0x00000000 0x00000001 0xb8d70860
0xb8d708f0: 0x00000000 0x00000000 0x00000000 0x00000000
0xb8d70900: 0x00000000 0x00000000 0x00000000 0x00000000
0xb8d70910: 0x00000000 0x00000000 0x00000000 0x00000000

Displayed as bytes:

(gdb) x/128bx $r0-64
0xb8d70820: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0xb8d70828: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
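0xb8d70830: 0x00 0x00 0x0f 0xfe 0xff 0xff 0xff 0xff                                   ----> ff:ff:ff:ff:ff:ff is the destination MAC address (broadcast), continuing into the next row
0xb8d70838: 0xff 0xff 0x00 0x0a 0x88 0x88 0x88 0x0e                             -----> source MAC address 00:0a:88:88:88:0e
0xb8d70840: 0x81 0x00 0x0f 0xf9 0x08 0x06 0x00 0x01                           -----> 0x8100 is the 802.1Q VLAN tag; the EtherType 0x0806 that follows marks an ARP packet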
0xb8d70848: 0x08 0x00 0x06 0x04 0x00 0x01 0x00 0x0a
0xb8d70850: 0x88 0x88 0x88 0x0e 0x0a 0x19 0x64 0x0e
0xb8d70858: 0x00 0x00 0x00 0x00 0x00 0x00 0x0a 0x19
0xb8d70860: 0x64 0x25 0x00 0x00 0x00 0x00 0x00 0x00
0xb8d70868: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0xb8d70870: 0x00 0x00 0x00 0x00 0x01 0x00 0x08 0x00
0xb8d70878: 0x00 0x00 0x75 0x00 0x13 0x00 0x00 0x00
0xb8d70880: 0x00 0x00 0x00 0x00 0x02 0x00 0x00 0x00
0xb8d70888: 0x00 0x00 0x00 0x00 0x38 0x2d 0x31 0x31
0xb8d70890: 0x37 0x2d 0x31 0x39 0x2f 0x32 0x00 0x00
0xb8d70898: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
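To make the byte dump easier to read, here is a small sketch that parses bytes 0xb8d70834 through 0xb8d70861 as a VLAN-tagged Ethernet frame carrying ARP. The field layout follows the standard Ethernet, 802.1Q and ARP formats; the function and variable names are my own, and nothing here comes from the driver itself. It also shows where the bogus 0x2564 comes from: the last two bytes of the ARP target IP land exactly on the address held in r0.

```python
import struct

FRAME_START = 0xB8D70834  # first byte of the destination MAC in the dump
FRAME = bytes.fromhex(
    "ffffffffffff"                    # destination MAC (broadcast)
    "000a8888880e"                    # source MAC
    "8100" "0ff9"                     # 802.1Q tag: TPID, TCI
    "0806"                            # EtherType: ARP
    "0001" "0800" "06" "04" "0001"    # htype, ptype, hlen, plen, oper
    "000a8888880e" "0a19640e"         # sender MAC, sender IP
    "000000000000" "0a196425"         # target MAC, target IP
)


def mac(raw):
    return ":".join(f"{b:02x}" for b in raw)


def ipv4(raw):
    return ".".join(str(b) for b in raw)


def parse_vlan_arp(frame):
    """Decode a VLAN-tagged ARP frame (network byte order)."""
    tpid, tci, ethertype = struct.unpack_from("!HHH", frame, 12)
    htype, ptype, hlen, plen, oper = struct.unpack_from("!HHBBH", frame, 18)
    return {
        "dst_mac": mac(frame[0:6]),
        "src_mac": mac(frame[6:12]),
        "tpid": tpid,
        "vlan_id": tci & 0x0FFF,
        "ethertype": ethertype,
        "oper": oper,                  # 1 = request, 2 = reply
        "sender_mac": mac(frame[26:32]),
        "sender_ip": ipv4(frame[32:36]),
        "target_ip": ipv4(frame[42:46]),
    }


fields = parse_vlan_arp(FRAME)
for key, value in fields.items():
    print(f"{key:>10}: {value}")

# The word that __dynamic_cast loaded from [r0]: the frame's last two bytes,
# followed by the two 0x00 bytes that come next in the dump.
R0 = 0xB8D70860
offset = R0 - FRAME_START
tail = FRAME[offset:offset + 4].ljust(4, b"\x00")
word_at_r0 = int.from_bytes(tail, "little")
print(f"word at r0: {word_at_r0:#010x}")
```

Read this way, the packet is an ARP request from 10.25.100.14 asking for 10.25.100.37, and the trailing bytes 0x64 0x25 of the target IP are what show up as 0x00002564 at the address in r0.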

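At this point an expert colleague showed up and saw at a glance that the data stomping on this memory was an ARP packet. A quick check with ifconfig confirmed it: the source MAC address in the packet is identical to the MAC address of eth0. The culprit was finally caught!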

 

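The root cause turned out to be this: the NIC driver uses a reserved memory region, but when the Linux kernel initialized DDR, the reserved region was not set up with proper alignment, so that memory also ended up in the pool used for dynamic allocation. The user-space process and the kernel NIC driver were both writing to the same memory, and an ARP packet overwrote the stack of one of the process's child threads, which caused the crash.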
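The exact reservation mechanism and addresses are not given here, so the following is only a hypothetical illustration of why alignment matters. It lists the pages that a region covers only in part; if such a boundary page is treated as ordinary memory, the allocator can hand it out while the driver still writes into its share of it. The 4 KiB page size and both example regions are assumptions, not values from the actual board.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def partially_covered_pages(start, size, page_size=PAGE_SIZE):
    """Page indices that the region [start, start + size) covers only in part."""
    end = start + size
    pages = set()
    if start % page_size:
        pages.add(start // page_size)
    if end % page_size:
        pages.add(end // page_size)
    return sorted(pages)


# Hypothetical 1 MiB reservations, one aligned and one shifted by 2 KiB.
aligned = partially_covered_pages(0x3FF00000, 0x100000)
shifted = partially_covered_pages(0x3FF00800, 0x100000)

print("aligned region, shared pages:", [hex(p) for p in aligned])
print("shifted region, shared pages:", [hex(p) for p in shifted])
```

A check along these lines on the reserved region's start and size is a cheap sanity test when bringing up a board with driver-owned memory.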

