Experiment Requirements
- Pick a system call whose number matches the last two digits of your student ID
- Trigger that system call with assembly instructions
- Trace the kernel's handling of the system call with gdb
- Focus on how the system call entry saves the context, how the context is restored, and how the call returns, paying particular attention to how the kernel stack changes during the system call
1 Target System Call
The last two digits of my student ID are 61, so the chosen system call number is 61, i.e. wait4.
After a process terminates, we want its parent to be able to obtain information about the termination. The parent calls a wait function to retrieve the child's exit status, after which the final cleanup can be performed and the memory resources occupied by the child released completely. The wait-related functions include wait, wait3, waitpid and so on, and all of them are implemented on top of the wait4 system call.
The underlying principles are covered in the following reference: the exit and wait4 system calls.
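To make this relationship concrete, below is a minimal sketch (not part of the assignment code itself) in which the parent collects a child's exit status through a direct wait4 syscall, the same call that wait and waitpid wrap:

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0)
        exit(42);                        /* child terminates with status 42 */
    int status = 0;
    /* Equivalent to waitpid(pid, &status, 0); SYS_wait4 is 61 on x86-64. */
    long ret = syscall(SYS_wait4, pid, &status, 0, NULL);
    if (ret == pid && WIFEXITED(status))
        printf("child %ld exited with %d\n", (long)ret, WEXITSTATUS(status));
    return 0;
}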
2 Environment Setup
2.1 Host Environment
VirtualBox 6.1.6 + Manjaro 20.0.1
Manjaro is an Arch-based distribution aimed at newcomers and easy to get started with; I used it for this experiment mainly because I already had a pre-configured Manjaro virtual machine.
2.2 Setting Up the Build Environment
sudo pacman -Syu # Manjaro is a rolling-release distribution; update to the latest packages before use
sudo pacman -S axel
axel -n 20 https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.40.tar.xz
xz -d linux-5.4.40.tar.xz
tar -xvf linux-5.4.40.tar
cd linux-5.4.40
sudo pacman -S base-devel ncurses bison flex openssl libelf # in pacman, Debian's build-essential corresponds to the base-devel group, libncurses-dev to ncurses, libssl-dev to openssl, and libelf-dev to libelf
2.3 Kernel Configuration and a First Test
make defconfig
make menuconfig
# Enable the debug-related options
Kernel hacking --->
Compile-time checks and compiler options --->
[*] Compile the kernel with debug info
[*] Provide GDB scripts for kernel debugging
[*] Kernel debugging
# Disable KASLR, otherwise breakpoints will fail to hit
Processor type and features --->
[ ] Randomize the address of the kernel image (KASLR)
# Check that the kernel builds and boots in qemu
make -j$(nproc)
sudo pacman -S qemu
qemu-system-x86_64 -kernel arch/x86/boot/bzImage
The test shows that the kernel boots normally; since there is no file system yet, it ends with a Kernel panic message.
2.4 Building the Root File System Image
# Fetch the BusyBox toolbox
axel -n 20 https://busybox.net/downloads/busybox-1.31.1.tar.bz2
tar -jxvf busybox-1.31.1.tar.bz2
cd busybox-1.31.1
make menuconfig
# Select static linking
Settings --->
[*] Build static binary (no shared libs)
make -j$(nproc) && make install
The build is likely to fail with error messages like these:
/usr/bin/ld: libbb/lib.a(xconnect.o): in function `str2sockaddr':
xconnect.c:(.text.str2sockaddr+0x116): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: coreutils/lib.a(mktemp.o): in function `mktemp_main':
mktemp.c:(.text.mktemp_main+0x94): warning: the use of `mktemp' is dangerous, better use `mkstemp' or `mkdtemp'
/usr/bin/ld: libbb/lib.a(xgethostbyname.o): in function `xgethostbyname':
xgethostbyname.c:(.text.xgethostbyname+0x5): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: libbb/lib.a(xconnect.o): in function `bb_lookup_port':
xconnect.c:(.text.bb_lookup_port+0x44): warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: util-linux/lib.a(rdate.o): in function `rdate_main':
rdate.c:(.text.rdate_main+0xf8): undefined reference to `stime'
/usr/bin/ld: coreutils/lib.a(date.o): in function `date_main':
date.c:(.text.date_main+0x22d): undefined reference to `stime'
collect2: error: ld returned 1 exit status
Note: if build needs additional libraries, put them in CONFIG_EXTRA_LDLIBS.
Example: CONFIG_EXTRA_LDLIBS="pthread dl tirpc audit pam"
make: *** [Makefile:718: busybox_unstripped] Error 1
The warnings are harmless; the real failures are the undefined references to stime in date and rdate. My fix was to deselect those offending items in menuconfig, after which the build completed.
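For background, stime() was removed from recent glibc (2.31 and later), while the date and rdate applets in BusyBox 1.31.1 still call it, which is why the static link fails; later BusyBox releases replace those calls with clock_settime(). A hedged sketch of that replacement, assuming nothing beyond the standard POSIX time API:

#include <time.h>

/* Minimal stand-in for the removed stime(): set the realtime clock
 * through clock_settime(2). */
static int set_system_time(time_t t)
{
    struct timespec ts = { .tv_sec = t, .tv_nsec = 0 };
    return clock_settime(CLOCK_REALTIME, &ts);
}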
Next, continue building the root file system image:
mkdir rootfs
cd rootfs
cp ../busybox-1.31.1/_install/* ./ -rf
mkdir dev proc sys home
sudo cp -a /dev/{null,console,tty,tty1,tty2,tty3,tty4} dev/
# Create an init script in the root of the root file system (rootfs/init) and put the following contents in it.
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
echo "Welcome MengningOS!"
echo "--------------------"
cd home
/bin/sh
# Make the init script executable
chmod +x init
# Pack everything into an initramfs (in-memory root file system) image
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# Boot with the root file system mounted and check that the init script runs once the kernel has started
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz
It runs successfully:

3 Tracing the System Call
3.1 Writing Assembly Code to Trigger the System Call
In the rootfs/home directory, create the file callwait4.c:
#include <stdio.h>
#include <unistd.h>
int main() {
    sleep(1);
    /* Load the wait4 syscall number (61 = 0x3D) into eax and enter the kernel
     * with the syscall instruction. The argument registers are left untouched,
     * so wait4 simply receives whatever they happen to hold, which is enough
     * to hit the kernel breakpoint. syscall itself clobbers rax, rcx and r11,
     * so declare them to the compiler. */
    asm volatile(
        "movl $0x3D, %%eax\n\t"
        "syscall\n\t"
        : /* no outputs */
        : /* no inputs */
        : "rax", "rcx", "r11", "memory");
    printf("Hello wait4\n");
    return 0;
}
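For reference, here is a variant (an illustrative sketch, not the exact program traced below) that loads the wait4 arguments explicitly, following the x86-64 syscall ABI in which rdi, rsi, rdx and r10 carry the first four arguments:

#include <unistd.h>

int main() {
    long ret;
    /* wait4(-1, NULL, 0, NULL): wait for any child, no status pointer,
     * no options, no rusage. With no children it returns -ECHILD at once. */
    asm volatile(
        "movq $-1, %%rdi\n\t"    /* upid      = -1 (any child) */
        "xorq %%rsi, %%rsi\n\t"  /* stat_addr = NULL           */
        "xorq %%rdx, %%rdx\n\t"  /* options   = 0              */
        "xorq %%r10, %%r10\n\t"  /* rusage    = NULL           */
        "movl $61, %%eax\n\t"    /* __NR_wait4                 */
        "syscall\n\t"
        : "=a"(ret)
        :
        : "rcx", "r11", "rdi", "rsi", "rdx", "r10", "memory");
    (void)ret;
    return 0;
}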
Rebuild the root file system image:
# Statically compile the program with gcc
gcc -o callwait4 callwait4.c -static
# Rebuild the root file system image
cd ..
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# Check that the newly written program runs
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz
./callwait4
3.2 Debugging with gdb
# Start qemu with the following command: -s opens a gdbserver on TCP port 1234, and -S halts the CPU at power-on until the debugger tells it to continue.
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz -S -s
# In another terminal, start gdb, connect to the gdbserver port, and set a breakpoint
cd linux-5.4.40/
sudo pacman -S gdb
gdb vmlinux
(gdb) target remote :1234
(gdb) b __x64_sys_wait4
(gdb) c
# Run callwait4 inside qemu
./callwait4
# gdb stops at the breakpoint set earlier; use bt to inspect the call stack
(gdb) bt
Below is a gdb screenshot covering everything from connecting to the port through viewing the backtrace at the breakpoint:

Now start single-stepping:
Breakpoint 1, __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
1625 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) bt
#0 __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
#1 0xffffffff810025e3 in do_syscall_64 (nr=<optimized out>, regs=0xffffc900001aff58)
at arch/x86/entry/common.c:290
#2 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:175
#3 0x00007ffd653be83c in ?? ()
#4 0x0000000000622050 in ?? ()
#5 0x0000000000aab990 in ?? ()
#6 0x0000000000aab990 in ?? ()
#7 0x0000000000000001 in fixed_percpu_data ()
#8 0x0000000000aa9a90 in ?? ()
#9 0x0000000000000246 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) n
__do_sys_wait4 (upid=-1, stat_addr=0x7ffd653be83c,
options=0, ru=0x0 <fixed_percpu_data>)
at kernel/exit.c:1627
1627 {
(gdb)
1629 long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL);
(gdb)
Breakpoint 1, __x64_sys_wait4 (
regs=0xffffc900001b7f58) at kernel/exit.c:1625
1625 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb)
__do_sys_wait4 (upid=0,
stat_addr=0x0 <fixed_percpu_data>, options=0,
ru=0x7fffb9283030) at kernel/exit.c:1627
1627 {
(gdb)
1629 long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL);
(gdb)
1631 if (err > 0) {
(gdb)
do_syscall_64 (nr=18446744071600093856,
regs=0xffffc900001b7f58)
at arch/x86/entry/common.c:300
300 syscall_return_slowpath(regs);
(gdb)
entry_SYSCALL_64 ()
at arch/x86/entry/entry_64.S:184
184 movq RCX(%rsp), %rcx
(gdb)
185 movq RIP(%rsp), %r11
(gdb)
187 cmpq %rcx, %r11 /* SYSRET requires RCX == RIP */
(gdb)
188 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
205 shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
(gdb)
206 sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
(gdb)
210 cmpq %rcx, %r11
(gdb)
211 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
213 cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
(gdb)
214 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
216 movq R11(%rsp), %r11
(gdb)
217 cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
(gdb)
218 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
238 testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
(gdb)
239 jnz swapgs_restore_regs_and_return_to_usermode
(gdb)
243 cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
(gdb)
244 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
253 POP_REGS pop_rdi=0 skip_r11rcx=1
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:259
259 movq %rsp, %rdi
(gdb)
260 movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:262
262 pushq RSP-RDI(%rdi) /* RSP */
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:263
263 pushq (%rdi) /* RDI */
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:271
271 SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
(gdb)
273 popq %rdi
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:274
274 popq %rsp
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:275
275 USERGS_SYSRET64
(gdb)
0x0000000000401c6a in ?? ()
3.3 Analysis of the Results
The call stack at the moment the breakpoint is hit:
Breakpoint 1, __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
1625 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) bt
#0 __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
#1 0xffffffff810025e3 in do_syscall_64 (nr=<optimized out>, regs=0xffffc900001aff58)
at arch/x86/entry/common.c:290
#2 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:175
As the backtrace shows, by the time execution pauses in __x64_sys_wait4, the kernel has already saved the user context in entry_SYSCALL_64 and dispatched into do_syscall_64. Let us look at the relevant code of entry_SYSCALL_64 first:
ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
/*
* Interrupts are off on entry.
* We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
* it is too small to ever cause noticeable irq latency.
*/
// swapgs switches to the kernel GS base so per-CPU data (including the kernel stack pointer) becomes reachable; saving the user context starts here
swapgs
/* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
// Push the user context onto the kernel stack as a struct pt_regs
/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
GLOBAL(entry_SYSCALL_64_after_hwframe)
pushq %rax /* pt_regs->orig_ax */
PUSH_AND_CLEAR_REGS rax=$-ENOSYS
TRACE_IRQS_OFF
// rax already holds the system call number; pass it as the first argument and the pt_regs pointer (rsp) as the second, then call do_syscall_64
/* IRQs are off. */
movq %rax, %rdi
movq %rsp, %rsi
call do_syscall_64 /* returns with IRQs disabled */
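The pushes above lay out a struct pt_regs on the kernel stack; a pointer to it is what do_syscall_64 and __x64_sys_wait4 later receive (the regs=0xffffc900001aff58 seen in the backtrace). For orientation, an abridged sketch of its x86-64 layout (the authoritative definition lives in arch/x86/include/asm/ptrace.h):

struct pt_regs {
    /* Pushed last, by PUSH_AND_CLEAR_REGS: the general-purpose registers. */
    unsigned long r15, r14, r13, r12, bp, bx;
    unsigned long r11, r10, r9, r8, ax, cx, dx, si, di;
    /* Pushed explicitly: the system call number (rax on entry). */
    unsigned long orig_ax;
    /* Pushed first, mimicking a hardware interrupt frame. */
    unsigned long ip, cs, flags, sp, ss;
};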
Next, the relevant code of do_syscall_64:
__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
struct thread_info *ti;
enter_from_user_mode();
local_irq_enable();
ti = current_thread_info();
if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
nr = syscall_trace_enter(regs);
// Look up the handler for this system call number in sys_call_table and call it; in this experiment the handler is __x64_sys_wait4
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
regs->ax = sys_call_table[nr](regs);
#ifdef CONFIG_X86_X32_ABI
} else if (likely((nr & __X32_SYSCALL_BIT) &&
(nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
X32_NR_syscalls);
regs->ax = x32_sys_call_table[nr](regs);
#endif
}
// Perform the system-call exit work before returning to entry_SYSCALL_64
syscall_return_slowpath(regs);
}
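The entry sys_call_table[61] is the __x64_sys_wait4 wrapper generated by the SYSCALL_DEFINE4 macro; it unpacks the arguments from pt_regs according to the syscall ABI and calls the real body, which is why gdb steps from __x64_sys_wait4 into __do_sys_wait4. A simplified sketch of what the expansion looks like (not the literal kernel code; annotations such as __user and asmlinkage are omitted):

/* The actual body, defined by SYSCALL_DEFINE4(wait4, ...) in kernel/exit.c. */
long __do_sys_wait4(pid_t upid, int *stat_addr, int options, struct rusage *ru);

/* The wrapper placed in sys_call_table. */
long __x64_sys_wait4(const struct pt_regs *regs)
{
    /* Per the syscall ABI, the four arguments arrive in rdi, rsi, rdx, r10. */
    return __do_sys_wait4((pid_t)regs->di, (int *)regs->si,
                          (int)regs->dx, (struct rusage *)regs->r10);
}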
After the handler returns, control comes back to entry_SYSCALL_64, which restores the user context:
TRACE_IRQS_IRETQ /* we're about to change IF */
/*
* Try to use SYSRET instead of IRET if we're returning to
* a completely clean 64-bit userspace context. If we're not,
* go to the slow exit path.
*/
movq RCX(%rsp), %rcx
movq RIP(%rsp), %r11
cmpq %rcx, %r11 /* SYSRET requires RCX == RIP */
jne swapgs_restore_regs_and_return_to_usermode
// Sign-extend RCX and check that the return address is canonical; if any of these checks fails, SYSRET cannot be used and the slow path swapgs_restore_regs_and_return_to_usermode is taken instead
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
#endif
/* If this changed %rcx, it was not canonical */
cmpq %rcx, %r11
jne swapgs_restore_regs_and_return_to_usermode
cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
jne swapgs_restore_regs_and_return_to_usermode
movq R11(%rsp), %r11
cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
jne swapgs_restore_regs_and_return_to_usermode
/*
* SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
* restore RF properly. If the slowpath sets it for whatever reason, we
* need to restore it correctly.
*
* SYSRET can restore TF, but unlike IRET, restoring TF results in a
* trap from userspace immediately after SYSRET. This would cause an
* infinite loop whenever #DB happens with register state that satisfies
* the opportunistic SYSRET conditions. For example, single-stepping
* this user code:
*
* movq $stuck_here, %rcx
* pushfq
* popq %r11
* stuck_here:
*
* would never get past 'stuck_here'.
*/
testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
jnz swapgs_restore_regs_and_return_to_usermode
/* nothing to check for RSP */
cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
jne swapgs_restore_regs_and_return_to_usermode
/*
* We win! This label is here just for ease of understanding
* perf profiles. Nothing jumps here.
*/
syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
UNWIND_HINT_EMPTY
POP_REGS pop_rdi=0 skip_r11rcx=1
/*
* Now all regs are restored except RSP and RDI.
* Save old stack pointer and switch to trampoline stack.
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
pushq RSP-RDI(%rdi) /* RSP */
pushq (%rdi) /* RDI */
/*
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
popq %rdi
popq %rsp
USERGS_SYSRET64
END(entry_SYSCALL_64)
With that, one complete system call has finished: the context was saved on entry, the handler was dispatched through the system call table, and the context was restored on the return to user space.
