A NIC failure analysis on a Mellanox SmartNIC
Background: this was reproduced on CentOS 7.6.1810. SmartNICs are now standard equipment on many cloud servers;
at OPPO they are mainly used for VPC and similar scenarios. As SmartNIC features keep growing, so does the
complexity of the driver code, and driver bugs have long made up the largest share of kernel bugs.
When such a problem hits, kernel developers who are not familiar with the driver code can have a hard time
tracking it down. The background knowledge involved here includes dma_pool, dma_page, net_device,
the mlx5_core_dev device, device teardown, and UAF (use-after-free) issues. As far as we can tell, this bug is
still not fixed even in the latest Red Hat kernel baseline. We single it out here because the UAF involved is fairly unusual.
Below is how we investigated and resolved it.
1. Failure symptoms
The OPPO Cloud kernel team received a connectivity alarm and found that the machine had reset:
UPTIME: 00:04:16-------------very short uptime
LOAD AVERAGE: 0.25, 0.23, 0.11
TASKS: 2027
RELEASE: 3.10.0-1062.18.1.el7.x86_64
MEMORY: 127.6 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
PID: 23283
COMMAND: "spider-agent"
TASK: ffff9d1fbb090000 [THREAD_INFO: ffff9d1f9a0d8000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 23283 TASK: ffff9d1fbb090000 CPU: 0 COMMAND: "spider-agent"
#0 [ffff9d1f9a0db650] machine_kexec at ffffffffb6665b34
#1 [ffff9d1f9a0db6b0] __crash_kexec at ffffffffb6722592
#2 [ffff9d1f9a0db780] crash_kexec at ffffffffb6722680
#3 [ffff9d1f9a0db798] oops_end at ffffffffb6d85798
#4 [ffff9d1f9a0db7c0] no_context at ffffffffb6675bb4
#5 [ffff9d1f9a0db810] __bad_area_nosemaphore at ffffffffb6675e82
#6 [ffff9d1f9a0db860] bad_area_nosemaphore at ffffffffb6675fa4
#7 [ffff9d1f9a0db870] __do_page_fault at ffffffffb6d88750
#8 [ffff9d1f9a0db8e0] do_page_fault at ffffffffb6d88975
#9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
[exception RIP: dma_pool_alloc+427]//caq: the faulting instruction
RIP: ffffffffb680efab RSP: ffff9d1f9a0db9c8 RFLAGS: 00010046
RAX: 0000000000000246 RBX: ffff9d0fa45f4c80 RCX: 0000000000001000
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff9d0fa45f4c10
RBP: ffff9d1f9a0dba20 R8: 000000000001f080 R9: ffff9d00ffc07c00
R10: ffffffffc03e10c4 R11: ffffffffb67dd6fd R12: 00000000000080d0
R13: ffff9d0fa45f4c10 R14: ffff9d0fa45f4c00 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core]//the module involved
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]
#12 [ffff9d1f9a0dbb18] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
#13 [ffff9d1f9a0dbb48] mlx5_core_access_reg at ffffffffc03ee354 [mlx5_core]
#14 [ffff9d1f9a0dbba0] mlx5_query_port_ptys at ffffffffc03ee411 [mlx5_core]
#15 [ffff9d1f9a0dbc10] mlx5e_get_link_ksettings at ffffffffc0413035 [mlx5_core]
#16 [ffff9d1f9a0dbce8] __ethtool_get_link_ksettings at ffffffffb6c56d06
#17 [ffff9d1f9a0dbd48] speed_show at ffffffffb6c705b8
#18 [ffff9d1f9a0dbdd8] dev_attr_show at ffffffffb6ab1643
#19 [ffff9d1f9a0dbdf8] sysfs_kf_seq_show at ffffffffb68d709f
#20 [ffff9d1f9a0dbe18] kernfs_seq_show at ffffffffb68d57d6
#21 [ffff9d1f9a0dbe28] seq_read at ffffffffb6872a30
#22 [ffff9d1f9a0dbe98] kernfs_fop_read at ffffffffb68d6125
#23 [ffff9d1f9a0dbed8] vfs_read at ffffffffb684a8ff
#24 [ffff9d1f9a0dbf08] sys_read at ffffffffb684b7bf
#25 [ffff9d1f9a0dbf50] system_call_fastpath at ffffffffb6d8dede
RIP: 00000000004a5030 RSP: 000000c001099378 RFLAGS: 00000212
RAX: 0000000000000000 RBX: 000000c000040000 RCX: ffffffffffffffff
RDX: 000000000000000a RSI: 000000c00109976e RDI: 000000000000000d---fd number passed to read()
RBP: 000000c001099640 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: 000000000000000c
R13: 0000000000000032 R14: 0000000000f710c4 R15: 0000000000000000
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
From the stack, a process reading a file triggered a NULL pointer dereference in kernel mode.
2. Analyzing the failure
From the stack trace:
1. The process was reading the file with fd 13, as shown by the value of RDI.
2. speed_show and __ethtool_get_link_ksettings indicate that it was reading the NIC's speed.
Let's check which file that fd refers to:
crash> files 23283
PID: 23283 TASK: ffff9d1fbb090000 CPU: 0 COMMAND: "spider-agent"
ROOT: /rootfs CWD: /rootfs/home/service/app/spider
FD FILE DENTRY INODE TYPE PATH
....
9 ffff9d0f5709b200 ffff9d1facc80a80 ffff9d1069a194d0 REG /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/net/p1p1/speed---still present
10 ffff9d0f4a45a400 ffff9d0f9982e240 ffff9d0fb7b873a0 REG /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/net/p3p1/speed---note the mapping: 0000:5e:00.0 corresponds to p3p1
11 ffff9d0f57098f00 ffff9d1facc80240 ffff9d1069a1b530 REG /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.1/net/p1p2/speed---still present
13 ffff9d0f4a458a00 ffff9d0f9982e0c0 ffff9d0fb7b875f0 REG /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.1/net/p3p2/speed---note the mapping: 0000:5e:00.1 corresponds to p3p2
....
Note the mapping above between PCI addresses and NIC names; we will use it later.
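For context, reading a NIC's speed from userspace is nothing more than reading this sysfs attribute (hypothetical example; the attribute reports the link speed in Mb/s):
$ cat /sys/class/net/p3p2/speed    # same attribute as the fd-13 file above, reached via /sys/class/net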
Opening and reading a speed file is itself a very common operation.
Next, starting from exception RIP: dma_pool_alloc+427, let's analyze why a NULL pointer dereference was triggered.
The expanded stack frame is:
#9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
[exception RIP: dma_pool_alloc+427]
RIP: ffffffffb680efab RSP: ffff9d1f9a0db9c8 RFLAGS: 00010046
RAX: 0000000000000246 RBX: ffff9d0fa45f4c80 RCX: 0000000000001000
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff9d0fa45f4c10
RBP: ffff9d1f9a0dba20 R8: 000000000001f080 R9: ffff9d00ffc07c00
R10: ffffffffc03e10c4 R11: ffffffffb67dd6fd R12: 00000000000080d0
R13: ffff9d0fa45f4c10 R14: ffff9d0fa45f4c00 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
ffff9d1f9a0db918: 0000000000000000 ffff9d0fa45f4c00
ffff9d1f9a0db928: ffff9d0fa45f4c10 00000000000080d0
ffff9d1f9a0db938: ffff9d1f9a0dba20 ffff9d0fa45f4c80
ffff9d1f9a0db948: ffffffffb67dd6fd ffffffffc03e10c4
ffff9d1f9a0db958: ffff9d00ffc07c00 000000000001f080
ffff9d1f9a0db968: 0000000000000246 0000000000001000
ffff9d1f9a0db978: 0000000000000000 0000000000000246
ffff9d1f9a0db988: ffff9d0fa45f4c10 ffffffffffffffff
ffff9d1f9a0db998: ffffffffb680efab 0000000000000010
ffff9d1f9a0db9a8: 0000000000010046 ffff9d1f9a0db9c8
ffff9d1f9a0db9b8: 0000000000000018 ffffffffb680ee45
ffff9d1f9a0db9c8: ffff9d0faf9fec40 0000000000000000
ffff9d1f9a0db9d8: ffff9d0faf9fec48 ffffffffb682669c
ffff9d1f9a0db9e8: ffff9d00ffc07c00 00000000618746c1
ffff9d1f9a0db9f8: 0000000000000000 0000000000000000
ffff9d1f9a0dba08: ffff9d0faf9fec40 0000000000000000
ffff9d1f9a0dba18: ffff9d0fa3c800c0 ffff9d1f9a0dba70
ffff9d1f9a0dba28: ffffffffc03e10e3
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core]
ffff9d1f9a0dba30: ffff9d0f4eebee00 0000000000000001
ffff9d1f9a0dba40: 000000d0000080d0 0000000000000050
ffff9d1f9a0dba50: ffff9d0fa3c800c0 0000000000000005 --r12 holds the rdi argument, ffff9d0fa3c800c0
ffff9d1f9a0dba60: ffff9d0fa3c803e0 ffff9d1f9d87ccc0
ffff9d1f9a0dba70: ffff9d1f9a0dbb10 ffffffffc03e3c92
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]
From the stack we recover the corresponding mlx5_core_dev: ffff9d0fa3c800c0.
crash> mlx5_core_dev.cmd ffff9d0fa3c800c0 -xo
struct mlx5_core_dev {
[ffff9d0fa3c80138] struct mlx5_cmd cmd;
}
crash> mlx5_cmd.pool ffff9d0fa3c80138
pool = 0xffff9d0fa45f4c00------this is the dma_pool, something driver developers deal with all the time
The faulting source line is:
crash> dis -l dma_pool_alloc+427 -B 5
/usr/src/debug/kernel-3.10.0-1062.18.1.el7/linux-3.10.0-1062.18.1.el7.x86_64/mm/dmapool.c: 334
0xffffffffb680efab <dma_pool_alloc+427>: mov (%r15),%ecx
And R15, as the register dump above shows, is indeed NULL.
305 void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
306 dma_addr_t *handle)
307 {
...
315 spin_lock_irqsave(&pool->lock, flags);
316 list_for_each_entry(page, &pool->page_list, page_list) {
317 if (page->offset < pool->allocation)---//caq: this condition holds here
318 goto ready;//caq: jump to ready
319 }
320
321 /* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
322 spin_unlock_irqrestore(&pool->lock, flags);
323
324 page = pool_alloc_page(pool, mem_flags & (~__GFP_ZERO));
325 if (!page)
326 return NULL;
327
328 spin_lock_irqsave(&pool->lock, flags);
329
330 list_add(&page->page_list, &pool->page_list);
331 ready:
332 page->in_use++;//caq: mark the page as in use
333 offset = page->offset;//caq: continue from where the previous allocation left off
334 page->offset = *(int *)(page->vaddr + offset);//caq: the faulting line
...
}
From the code above, page->vaddr must be NULL and offset must be 0 for line 334 to dereference address 0.
(Each free block's first 4 bytes hold the offset of the next free block, set up by pool_initialise_page(), which is why the allocator dereferences page->vaddr here.)
A page can come from two sources: the first is taken from the pool's page_list;
the second is freshly allocated by pool_alloc_page(), after which it is also linked into the pool's page_list.
Let's take a look at this page_list.
crash> dma_pool ffff9d0fa45f4c00 -x
struct dma_pool {
page_list = {
next = 0xffff9d0fa45f4c80,
prev = 0xffff9d0fa45f4c00
},
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x1
}
}
}
}
},
size = 0x400,
dev = 0xffff9d1fbddec098,
allocation = 0x1000,
boundary = 0x1000,
name = "mlx5_cmd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
pools = {
next = 0xdead000000000100,
prev = 0xdead000000000200
}
}
crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
offset = 0
vaddr = 0x0
ffff9d0fa45f4d00
offset = 0
vaddr = 0x0
By the logic of dma_pool_alloc, pool->page_list is indeed non-empty, and the first entry satisfies
if (page->offset < pool->allocation), so the page used should be the first one, ffff9d0fa45f4c80,
i.e. it was taken from the first source (the page_list):
crash> dma_page ffff9d0fa45f4c80
struct dma_page {
page_list = {
next = 0xffff9d0fa45f4d00,
prev = 0xffff9d0fa45f4c80
},
vaddr = 0x0, //caq: abnormal; dereferencing this is what causes the crash
dma = 0,
in_use = 1, //caq: marked as in use, consistent with page->in_use++
offset = 0
}
At this point the puzzle is: every dma_page in a dma_pool has its vaddr initialized right after it is allocated,
normally inside pool_alloc_page(), so how could it possibly be NULL?
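For reference, pool_alloc_page() in mm/dmapool.c of this kernel generation looks roughly like this (abridged excerpt): vaddr is filled in from dma_alloc_coherent(), and if that fails the dma_page is freed and never reaches pool->page_list, so a dma_page sitting on the list should never have vaddr == NULL.
static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
{
	struct dma_page *page;

	page = kmalloc(sizeof(*page), mem_flags);
	if (!page)
		return NULL;
	/* vaddr is always initialized here ... */
	page->vaddr = dma_alloc_coherent(pool->dev, pool->allocation,
					 &page->dma, mem_flags);
	if (page->vaddr) {
		/* chain the free blocks inside the page via their first 4 bytes */
		pool_initialise_page(pool, page);
		page->in_use = 0;
		page->offset = 0;
	} else {
		/* ... or the page is freed and never linked into the pool */
		kfree(page);
		page = NULL;
	}
	return page;
}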
Let's examine this address:
crash> kmem ffff9d0fa45f4c80-------this is the dma_page from the dma_pool
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff9d00ffc07900 kmalloc-128//caq: note this object size 128 8963 14976 234 8k
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffe299c0917d00 ffff9d0fa45f4000 0 64 29 35
FREE / [ALLOCATED]
ffff9d0fa45f4c80
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffe299c0917d00 10245f4000 0 ffff9d0fa45f4c00 1 2fffff00004080 slab,head
Having used these DMA pool APIs before, I remembered that a dma_page is not that large. Let's look at the second dma_page:
crash> kmem ffff9d0fa45f4d00
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff9d00ffc07900 kmalloc-128 128 8963 14976 234 8k
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffe299c0917d00 ffff9d0fa45f4000 0 64 29 35
FREE / [ALLOCATED]
ffff9d0fa45f4d00
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffe299c0917d00 10245f4000 0 ffff9d0fa45f4c00 1 2fffff00004080 slab,head
crash> dma_page ffff9d0fa45f4d00
struct dma_page {
page_list = {
next = 0xffff9d0fa45f5000,
prev = 0xffff9d0fa45f4d00
},
vaddr = 0x0, -----------caq: also NULL
dma = 0,
in_use = 0,
offset = 0
}
crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
offset = 0
vaddr = 0x0
ffff9d0fa45f4d00
offset = 0
vaddr = 0x0
ffff9d0fa45f5000
offset = 0
vaddr = 0x0
.........
So it's not just the first dma_page: every dma_page in this pool looks the same.
Let's simply check the normal size of a dma_page:
crash> p sizeof(struct dma_page)
$3 = 40
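For reference, struct dma_page in mm/dmapool.c is just:
struct dma_page {		/* cacheable header for 'allocation' bytes */
	struct list_head page_list;
	void *vaddr;
	dma_addr_t dma;
	unsigned int in_use;
	unsigned int offset;
};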
It is only 40 bytes, so a slab allocation should land in the kmalloc-64 cache at most; how can the dma_page
above be a 128-byte object? To resolve this doubt, let's compare against a healthy node:
crash> net
NET_DEVICE NAME IP ADDRESS(ES)
ffff8f9e800be000 lo 127.0.0.1
ffff8f9e62640000 p1p1
ffff8f9e626c0000 p1p2
ffff8f9e627c0000 p3p1 -----//caq: take this one as the example
ffff8f9e62100000 p3p2
Then, following the code, get the mlx5e_priv from the net_device:
static int mlx5e_get_link_ksettings(struct net_device *netdev,
struct ethtool_link_ksettings *link_ksettings)
{
...
struct mlx5e_priv *priv = netdev_priv(netdev);
...
}
static inline void *netdev_priv(const struct net_device *dev)
{
return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
}
crash> px sizeof(struct net_device)
$2 = 0x8c0
crash> mlx5e_priv.mdev ffff8f9e627c08c0---address computed from the offset
mdev = 0xffff8f9e67c400c0
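(Quick sanity check of that arithmetic: the mlx5e_priv area sits right after the net_device, so its address is just the net_device address plus 0x8c0. A hypothetical check with crash's eval command, output abridged:)
crash> eval 0xffff8f9e627c0000 + 0x8c0
hexadecimal: ffff8f9e627c08c0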
crash> mlx5_core_dev.cmd 0xffff8f9e67c400c0 -xo
struct mlx5_core_dev {
[ffff8f9e67c40138] struct mlx5_cmd cmd;
}
crash> mlx5_cmd.pool ffff8f9e67c40138
pool = 0xffff8f9e7bf48f80
crash> dma_pool 0xffff8f9e7bf48f80
struct dma_pool {
page_list = {
next = 0xffff8f9e79c60880, //caq: one of the dma_pages
prev = 0xffff8fae6e4db800
},
.......
size = 1024,
dev = 0xffff8f9e800b3098,
allocation = 4096,
boundary = 4096,
name = "mlx5_cmd\000\217\364{\236\217\377\377\300\217\364{\236\217\377\377\200\234>\250\217\217\377\377",
pools = {
next = 0xffff8f9e800b3290,
prev = 0xffff8f9e800b3290
}
}
crash> dma_page 0xffff8f9e79c60880 //caq: inspect this dma_page
struct dma_page {
page_list = {
next = 0xffff8f9e79c60840, -------one of the dma_pages
prev = 0xffff8f9e7bf48f80
},
vaddr = 0xffff8f9e6fc9b000, //caq: a healthy vaddr is never NULL
dma = 69521223680,
in_use = 0,
offset = 0
}
crash> kmem 0xffff8f9e79c60880
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff8f8fbfc07b00 kmalloc-64--the normal size 64 667921 745024 11641 4k
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffde5140e71800 ffff8f9e79c60000 0 64 64 0
FREE / [ALLOCATED]
[ffff8f9e79c60880]
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffde5140e71800 1039c60000 0 0 1 2fffff00000080 slab
The steps above require some familiarity with net_device and the mlx5 driver code.
Compared with the corrupted dma_page, a healthy dma_page is a 64-byte slab object. So clearly this is
either a memory-stomping problem or a UAF (use-after-free) problem.
Having gotten this far, how do we quickly tell which of the two it is? Both involve memory corruption
and are usually hard to chase down, so it helps to step back
and first look at what the other running processes were doing. We found this one:
crash> bt 48263
PID: 48263 TASK: ffff9d0f4ee0a0e0 CPU: 56 COMMAND: "reboot"
#0 [ffff9d0f95d7f958] __schedule at ffffffffb6d80d4a
#1 [ffff9d0f95d7f9e8] schedule at ffffffffb6d811f9
#2 [ffff9d0f95d7f9f8] schedule_timeout at ffffffffb6d7ec48
#3 [ffff9d0f95d7faa8] wait_for_completion_timeout at ffffffffb6d81ae5
#4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
#5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
#6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]
#7 [ffff9d0f95d7fc40] mlx5_mr_cache_cleanup at ffffffffc0c60aab [mlx5_ib]
#8 [ffff9d0f95d7fca8] mlx5_ib_stage_pre_ib_reg_umr_cleanup at ffffffffc0c45d32 [mlx5_ib]
#9 [ffff9d0f95d7fcc0] __mlx5_ib_remove at ffffffffc0c4f450 [mlx5_ib]
#10 [ffff9d0f95d7fce8] mlx5_ib_remove at ffffffffc0c4f4aa [mlx5_ib]
#11 [ffff9d0f95d7fd00] mlx5_detach_device at ffffffffc03fe231 [mlx5_core]
#12 [ffff9d0f95d7fd30] mlx5_unload_one at ffffffffc03dee90 [mlx5_core]
#13 [ffff9d0f95d7fd60] shutdown at ffffffffc03def80 [mlx5_core]
#14 [ffff9d0f95d7fd80] pci_device_shutdown at ffffffffb69d1cda
#15 [ffff9d0f95d7fda8] device_shutdown at ffffffffb6ab3beb
#16 [ffff9d0f95d7fdd8] kernel_restart_prepare at ffffffffb66b7916
#17 [ffff9d0f95d7fde8] kernel_restart at ffffffffb66b7932
#18 [ffff9d0f95d7fe00] SYSC_reboot at ffffffffb66b7ba9
#19 [ffff9d0f95d7ff40] sys_reboot at ffffffffb66b7c4e
#20 [ffff9d0f95d7ff50] system_call_fastpath at ffffffffb6d8dede
RIP: 00007fc9be7a5226 RSP: 00007ffd9a19e448 RFLAGS: 00010246
RAX: 00000000000000a9 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead
RBP: 0000000000000002 R8: 00005575d529558c R9: 0000000000000000
R10: 00007fc9bea767b8 R11: 0000000000000206 R12: 0000000000000000
R13: 00007ffd9a19e690 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b
Why pay attention to this process? Because over the years we have debugged no fewer than 20 UAF problems caused by teardown paths:
sometimes a reboot, sometimes a module unload, sometimes resources freed from a work item.
So intuition says this teardown is very likely related.
Let's work out how far the reboot path had progressed.
2141 void device_shutdown(void)
2142 {
2143 struct device *dev, *parent;
2144
2145 spin_lock(&devices_kset->list_lock);
2146 /*
2147 * Walk the devices list backward, shutting down each in turn.
2148 * Beware that device unplug events may also start pulling
2149 * devices offline, even as the system is shutting down.
2150 */
2151 while (!list_empty(&devices_kset->list)) {
2152 dev = list_entry(devices_kset->list.prev, struct device,
2153 kobj.entry);
........
2178 if (dev->device_rh && dev->device_rh->class_shutdown_pre) {
2179 if (initcall_debug)
2180 dev_info(dev, "shutdown_pre\n");
2181 dev->device_rh->class_shutdown_pre(dev);
2182 }
2183 if (dev->bus && dev->bus->shutdown) {
2184 if (initcall_debug)
2185 dev_info(dev, "shutdown\n");
2186 dev->bus->shutdown(dev);
2187 } else if (dev->driver && dev->driver->shutdown) {
2188 if (initcall_debug)
2189 dev_info(dev, "shutdown\n");
2190 dev->driver->shutdown(dev);
2191 }
}
Two things follow from the code above:
1. Each device is linked into devices_kset->list through its kobj.entry member.
2. As driven by device_shutdown(), the shutdown of each device is serial.
From the reboot stack, tearing down one mlx device goes through:
pci_device_shutdown-->shutdown-->mlx5_unload_one-->mlx5_detach_device
-->mlx5_cmd_cleanup-->dma_pool_destroy
The dma_pool_destroy step at the end of that chain does the following:
void dma_pool_destroy(struct dma_pool *pool)
{
.......
while (!list_empty(&pool->page_list)) {//caq: remove the dma_pages from the pool one by one
struct dma_page *page;
page = list_entry(pool->page_list.next,
struct dma_page, page_list);
if (is_page_busy(page)) {
.......
list_del(&page->page_list);
kfree(page);
} else
pool_free_page(pool, page);//free each dma_page
}
kfree(pool);//caq: free the pool itself
.......
}
static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
{
dma_addr_t dma = page->dma;
#ifdef DMAPOOL_DEBUG
memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
#endif
dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
list_del(&page->page_list);//caq: list_del() poisons the page_list member
kfree(page);
}
Now pull the relevant data out of the reboot stack:
#4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
ffff9d0f95d7fb10: ffffffffb735b580 ffff9d0f904caf18
ffff9d0f95d7fb20: ffff9d00ff801da8 ffff9d0f23121200
ffff9d0f95d7fb30: ffff9d0f23121740 ffff9d0fa7480138
ffff9d0f95d7fb40: 0000000000000000 0000001002020000
ffff9d0f95d7fb50: 0000000000000000 ffff9d0f95d7fbe8
ffff9d0f95d7fb60: ffff9d0f00000000 0000000000000000
ffff9d0f95d7fb70: 00000000756415e3 ffff9d0fa74800c0 ----the mlx5_core_dev, which corresponds to p3p1
ffff9d0f95d7fb80: ffff9d0f95d7fbf8 ffff9d0f95d7fbe8
ffff9d0f95d7fb90: 0000000000000246 ffff9d0f8f3a20b8
ffff9d0f95d7fba0: ffff9d0f95d7fbd0 ffffffffc03e442b
#5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
ffff9d0f95d7fbb0: 0000000000000000 ffff9d0fa74800c0
ffff9d0f95d7fbc0: ffff9d0f8f3a20b8 ffff9d0fa74bea00
ffff9d0f95d7fbd0: ffff9d0f95d7fc38 ffffffffc03f085d
#6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]
Note that the mlx5_core_dev the reboot path is currently releasing is ffff9d0fa74800c0, whose net_device is
p3p1, whereas process 23283 is accessing mlx5_core_dev ffff9d0fa3c800c0, which corresponds to p3p2.
crash> net
NET_DEVICE NAME IP ADDRESS(ES)
ffff9d0fc003e000 lo 127.0.0.1
ffff9d1fad200000 p1p1
ffff9d0fa0700000 p1p2
ffff9d0fa00c0000 p3p1 its mlx5_core_dev is ffff9d0fa74800c0
ffff9d0fa0200000 p3p2 its mlx5_core_dev is ffff9d0fa3c800c0
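(These two mlx5_core_dev values come from the same netdev_priv() offset trick used on the healthy node earlier; hypothetically, in the same crash session:)
crash> mlx5e_priv.mdev ffff9d0fa00c08c0   ---p3p1: net_device ffff9d0fa00c0000 + 0x8c0
crash> mlx5e_priv.mdev ffff9d0fa02008c0   ---p3p2: net_device ffff9d0fa0200000 + 0x8c0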
Let's see which devices are still left in devices_kset:
crash> p devices_kset
devices_kset = $4 = (struct kset *) 0xffff9d1fbf4e70c0
crash> p devices_kset.list
$5 = {
next = 0xffffffffb72f2a38,
prev = 0xffff9d0fbe0ea130
}
crash> list -H -o 0x18 0xffffffffb72f2a38 -s device.kobj.name >device.list
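(The -o 0x18 above is the offset of device.kobj.entry, the very member device_shutdown() walks; it can be confirmed with something like:)
crash> struct device.kobj.entry -xo
struct device {
   [0x18] struct list_head entry;
}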
Neither p3p1 nor p3p2 shows up in device.list:
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.0 device.list //caq: not found; this is p3p1, which the reboot flow is unloading right now
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.1 device.list //caq: not found; this is p3p2, already fully unloaded
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.0 device.list //caq: this mlx5 device has not been unloaded yet
kobj.name = 0xffff9d1fbe82aa70 "0000:3b:00.0",
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.1 device.list //caq: this mlx5 device has not been unloaded yet
kobj.name = 0xffff9d1fbe82aae0 "0000:3b:00.1",
Since neither p3p2 nor p3p1 is in device.list, and pci_device_shutdown unloads devices serially,
the device being unloaded right now is p3p1 and p3p2 has already been torn down. So we can be certain that process 23283 was accessing cmd.pool after it had been destroyed.
Per the teardown flow described earlier:
pci_device_shutdown-->shutdown-->mlx5_unload_one-->mlx5_cmd_cleanup-->dma_pool_destroy
by this point the pool had already been freed and every dma_page in it was invalid. This also matches the earlier dump of the pool at ffff9d0fa45f4c00, where pools.next/prev held the poison values 0xdead000000000100 / 0xdead000000000200 that list_del() leaves behind when dma_pool_destroy() unlinks the pool.
We then tried googling the bug and found a case that looks extremely similar to ours;
Red Hat has hit the same kind of problem: https://access.redhat.com/solutions/5132931
However, while Red Hat considers the UAF solved in that link, the patch that was merged is:
commit 4cca96a8d9da0ed8217cfdf2aec0c3c8b88e8911
Author: Parav Pandit <parav@mellanox.com>
Date: Thu Dec 12 13:30:21 2019 +0200
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 997cbfe..05b557d 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6725,6 +6725,8 @@ void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
const struct mlx5_ib_profile *profile,
int stage)
{
+ dev->ib_active = false;
+
/* Number of stages to cleanup */
while (stage) {
stage--;
Key point, worth repeating three times:
this patch cannot fix the bug. Consider, for example, the following interleaving.
A simple diagram of the race:
CPU1 (reboot)                               CPU2 (spider-agent)
                                            dev_attr_show
pci_device_shutdown                         speed_show
  shutdown
    mlx5_unload_one
      mlx5_detach_device
        mlx5_detach_interface
          mlx5e_detach
            mlx5e_detach_netdev
              mlx5e_nic_disable
                rtnl_lock
                  mlx5e_close_locked
                    clear_bit(MLX5E_STATE_OPENED, &priv->state)---only this bit is cleared
                rtnl_unlock
                                            rtnl_trylock---succeeds once CPU1 drops the lock
                                            netif_running---only checks the lowest bit
                                                            (__LINK_STATE_START) of net_device.state
                                            __ethtool_get_link_ksettings
                                            mlx5e_get_link_ksettings
                                            mlx5_query_port_ptys()
                                            mlx5_core_access_reg()
                                            mlx5_cmd_exec
                                            cmd_exec
                                            mlx5_alloc_cmd_msg
      mlx5_cmd_cleanup---destroys the dma_pool
                                            dma_pool_alloc---accesses cmd.pool, triggers the crash
So a real fix needs either netif_device_detach() to also clear the __LINK_STATE_START bit,
or speed_show to check the __LINK_STATE_PRESENT bit. If we want to limit the blast radius and avoid touching
common code paths, then the __LINK_STATE_PRESENT check should go into mlx5e_get_link_ksettings (a sketch follows the snippet below).
We'll leave pushing this upstream to whoever enjoys working with the community.
static void mlx5e_nic_disable(struct mlx5e_priv *priv)
{
.......
rtnl_lock();
if (netif_running(priv->netdev))
mlx5e_close(priv->netdev);
netif_device_detach(priv->netdev);
//caq: additionally clear the __LINK_STATE_START bit here (netif_device_detach() above already clears __LINK_STATE_PRESENT)
rtnl_unlock();
.......
}
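A minimal sketch of the driver-local option mentioned above (our suggestion, not a merged upstream patch): bail out of mlx5e_get_link_ksettings() when the device has already been detached, instead of issuing a firmware command that needs the (possibly already destroyed) cmd.pool.
static int mlx5e_get_link_ksettings(struct net_device *netdev,
                                    struct ethtool_link_ksettings *link_ksettings)
{
        /* netif_device_present() tests __LINK_STATE_PRESENT, which
         * netif_device_detach() clears in mlx5e_nic_disable() during
         * shutdown, before mlx5_cmd_cleanup() destroys the dma_pool.
         */
        if (!netif_device_present(netdev))
                return -ENODEV;

        /* ... original body unchanged: netdev_priv(), mlx5_query_port_ptys(), ... */
        return 0;
}
Whether -ENODEV is the right return value, and whether this check alone fully closes the window against a concurrent shutdown, would still need review.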
3. Reproducing the failure
1. This is a race; it can be reproduced by constructing a scenario like the CPU1 vs. CPU2 race in the diagram above (a rough sketch follows).
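A hedged sketch of one way to try it (hypothetical commands; interface names and timing are environment-specific, and artificially widening the window between mlx5_detach_device and mlx5_cmd_cleanup makes the race much easier to hit):
# terminal 1: hammer the sysfs speed attribute of the mlx5 ports
while true; do
    cat /sys/class/net/p3p1/speed /sys/class/net/p3p2/speed >/dev/null 2>&1
done

# terminal 2: trigger the serial device_shutdown() path
reboot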
4. Workaround and fix
Possible solutions:
1. Do not count on upgrading along the lines of the Red Hat solution at https://access.redhat.com/solutions/5132931; as analyzed above, that patch does not fix the race.
2. Carry a dedicated patch instead.
5. About the author
Chen Anqing currently works at OPPO Hybrid Cloud on the Linux kernel, containers, virtual machines, and other virtualization topics.
Contact: WeChat and mobile (same number): 18752035557.
