Lab Requirements
- Pick a system call whose number equals the last two digits of your student ID
- Trigger that system call with assembly instructions
- Trace the kernel's handling of the system call with gdb
- Focus on reading and analyzing how the syscall entry saves the context, restores it, and returns, and pay particular attention to how the kernel stack changes during the system call
1 Target System Call
The last two digits of my student ID are 61, so I picked system call number 61, i.e. wait4.
After a process terminates, we want its parent to be able to learn about the termination. The parent calls one of the wait functions to retrieve the child's exit status and then performs the final cleanup, fully releasing the memory resources the process occupied. The wait family includes wait, wait3, waitpid and others; on Linux they are all implemented on top of the wait4 system call.
The underlying principles are covered in the following material: the exit and wait4 system calls.
2 Environment Setup
2.1 Host Environment
VirtualBox 6.1.6 + Manjaro 20.0.1
Manjaro is an Arch-based, beginner-friendly distribution. I used it for this lab mainly because I already had a preconfigured Manjaro virtual machine.
2.2 搭建编译环境
sudo pacman -Syu # Manjaro is a rolling-release distribution; bring it fully up to date before building
sudo pacman -S axel
axel -n 20 https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.40.tar.xz
xz -d linux-5.4.40.tar.xz
tar -xvf linux-5.4.40.tar
cd linux-5.4.40
sudo pacman -S base-devel ncurses bison flex openssl libelf # in pacman, build-essential is called base-devel, libncurses-dev is ncurses, libssl-dev is openssl, and libelf-dev is libelf
2.3 内核配置与初步测试
make defconfig
make menuconfig
# enable the debug-related options
Kernel hacking --->
Compile-time checks and compiler options --->
[*] Compile the kernel with debug info
[*] Provide GDB scripts for kernel debugging
[*] Kernel debugging
# disable KASLR, otherwise breakpoints will fail to hit
Processor type and features --->
[ ] Randomize the address of the kernel image (KASLR)
# check that the kernel builds and boots under qemu
make -j$(nproc)
sudo pacman -S qemu
qemu-system-x86_64 -kernel arch/x86/boot/bzImage
The test showed the kernel boots normally; since there is no file system yet, it ends with a Kernel panic message.
2.4 Building the Root File System Image
# install the Busybox toolbox
axel -n 20 https://busybox.net/downloads/busybox-1.31.1.tar.bz2
tar -jxvf busybox-1.31.1.tar.bz2
cd busybox-1.31.1
make menuconfig
# choose static linking
Settings --->
[*] Build static binary (no shared libs)
make -j$(nproc) && make install
The build is quite likely to fail with messages like these:
/usr/bin/ld: libbb/lib.a(xconnect.o): in function `str2sockaddr':
xconnect.c:(.text.str2sockaddr+0x116): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: coreutils/lib.a(mktemp.o): in function `mktemp_main':
mktemp.c:(.text.mktemp_main+0x94): warning: the use of `mktemp' is dangerous, better use `mkstemp' or `mkdtemp'
/usr/bin/ld: libbb/lib.a(xgethostbyname.o): in function `xgethostbyname':
xgethostbyname.c:(.text.xgethostbyname+0x5): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: libbb/lib.a(xconnect.o): in function `bb_lookup_port':
xconnect.c:(.text.bb_lookup_port+0x44): warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: util-linux/lib.a(rdate.o): in function `rdate_main':
rdate.c:(.text.rdate_main+0xf8): undefined reference to `stime'
/usr/bin/ld: coreutils/lib.a(date.o): in function `date_main':
date.c:(.text.date_main+0x22d): undefined reference to `stime'
collect2: error: ld returned 1 exit status
Note: if build needs additional libraries, put them in CONFIG_EXTRA_LDLIBS.
Example: CONFIG_EXTRA_LDLIBS="pthread dl tirpc audit pam"
make: *** [Makefile:718: busybox_unstripped] Error 1
My fix was to uncheck all of the related items (the applets that failed to link) in menuconfig, after which the build completed.
Next, build the root file system image:
mkdir rootfs
cd rootfs
cp ../busybox-1.31.1/_install/* ./ -rf
mkdir dev proc sys home
sudo cp -a /dev/{null,console,tty,tty1,tty2,tty3,tty4} dev/
# Put an init script at the root of the root file system (rootfs/init), with the following content.
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
echo "Welcome MengningOS!"
echo "--------------------"
cd home
/bin/sh
# make the init script executable
chmod +x init
# pack everything into an in-memory root file system image
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# test-boot with the root file system and check that the kernel runs init after it finishes booting
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz
It runs successfully:
3 Tracing the System Call
3.1 Writing Assembly to Trigger the System Call
Create callwait4.c in the rootfs/home directory:
#include <stdio.h>
#include <unistd.h>
int main() {
sleep(1);
asm volatile(
"movl $0x3D,%%eax\n\t"   /* 0x3D == 61, the syscall number of wait4 */
"syscall\n\t"
: : : "rax", "rcx", "r11", "memory"  /* syscall clobbers rax, rcx and r11 */
);
printf("Hello wait4\n");
return 0;
}
Rebuild the root file system image:
# statically compile with gcc
gcc -o callwait4 callwait4.c -static
# rebuild the root file system image
cd ..
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# check that the program runs
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz
./callwait4
3.2 Debugging with gdb
# start qemu as follows: -s opens a gdbserver on TCP port 1234, and -S halts the CPU at power-on until gdb connects.
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz -S -s
# in another terminal, start gdb, connect to the port and set a breakpoint
cd linux-5.4.40/
sudo pacman -S gdb
gdb vmlinux
(gdb) target remote :1234
(gdb) b __x64_sys_wait4
(gdb) c
# run callwait4 inside qemu
./callwait4
# gdb hits the breakpoint set earlier; use bt to inspect the stack
(gdb) bt
The gdb screenshots below cover everything from connecting to the port through inspecting the stack at the breakpoint:
Now start single-stepping:
Breakpoint 1, __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
1625 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) bt
#0 __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
#1 0xffffffff810025e3 in do_syscall_64 (nr=<optimized out>, regs=0xffffc900001aff58)
at arch/x86/entry/common.c:290
#2 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:175
#3 0x00007ffd653be83c in ?? ()
#4 0x0000000000622050 in ?? ()
#5 0x0000000000aab990 in ?? ()
#6 0x0000000000aab990 in ?? ()
#7 0x0000000000000001 in fixed_percpu_data ()
#8 0x0000000000aa9a90 in ?? ()
#9 0x0000000000000246 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) n
__do_sys_wait4 (upid=-1, stat_addr=0x7ffd653be83c,
options=0, ru=0x0 <fixed_percpu_data>)
at kernel/exit.c:1627
1627 {
(gdb)
1629 long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL);
(gdb)
Breakpoint 1, __x64_sys_wait4 (
regs=0xffffc900001b7f58) at kernel/exit.c:1625
1625 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb)
__do_sys_wait4 (upid=0,
stat_addr=0x0 <fixed_percpu_data>, options=0,
ru=0x7fffb9283030) at kernel/exit.c:1627
1627 {
(gdb)
1629 long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL);
(gdb)
1631 if (err > 0) {
(gdb)
do_syscall_64 (nr=18446744071600093856,
regs=0xffffc900001b7f58)
at arch/x86/entry/common.c:300
300 syscall_return_slowpath(regs);
(gdb)
entry_SYSCALL_64 ()
at arch/x86/entry/entry_64.S:184
184 movq RCX(%rsp), %rcx
(gdb)
185 movq RIP(%rsp), %r11
(gdb)
187 cmpq %rcx, %r11 /* SYSRET requires RCX == RIP */
(gdb)
188 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
205 shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
(gdb)
206 sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
(gdb)
210 cmpq %rcx, %r11
(gdb)
211 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
213 cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
(gdb)
214 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
216 movq R11(%rsp), %r11
(gdb)
217 cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
(gdb)
218 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
238 testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
(gdb)
239 jnz swapgs_restore_regs_and_return_to_usermode
(gdb)
243 cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
(gdb)
244 jne swapgs_restore_regs_and_return_to_usermode
(gdb)
253 POP_REGS pop_rdi=0 skip_r11rcx=1
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:259
259 movq %rsp, %rdi
(gdb)
260 movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:262
262 pushq RSP-RDI(%rdi) /* RSP */
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:263
263 pushq (%rdi) /* RDI */
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:271
271 SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
(gdb)
273 popq %rdi
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:274
274 popq %rsp
(gdb)
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:275
275 USERGS_SYSRET64
(gdb)
0x0000000000401c6a in ?? ()
3.3 Analysis
The stack when the breakpoint is hit:
Breakpoint 1, __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
1625 SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) bt
#0 __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
#1 0xffffffff810025e3 in do_syscall_64 (nr=<optimized out>, regs=0xffffc900001aff58)
at arch/x86/entry/common.c:290
#2 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:175
As the backtrace shows, by the time execution pauses in __x64_sys_wait4, the kernel has already saved the user context via entry_SYSCALL_64 and do_syscall_64. Let us look at the relevant entry_SYSCALL_64 code:
ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
/*
* Interrupts are off on entry.
* We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
* it is too small to ever cause noticeable irq latency.
*/
// swapgs switches to the kernel GS base as the first step of saving the user context
swapgs
/* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
// push the user context onto the kernel stack
/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
GLOBAL(entry_SYSCALL_64_after_hwframe)
pushq %rax /* pt_regs->orig_ax */
PUSH_AND_CLEAR_REGS rax=$-ENOSYS
TRACE_IRQS_OFF
// rax carries the syscall number; pass it to do_syscall_64 as the first argument
/* IRQs are off. */
movq %rax, %rdi
movq %rsp, %rsi
call do_syscall_64 /* returns with IRQs disabled */
Then look at the relevant do_syscall_64 code:
__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
struct thread_info *ti;
enter_from_user_mode();
local_irq_enable();
ti = current_thread_info();
if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
nr = syscall_trace_enter(regs);
// look up the handler for this syscall number in the system call table
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
regs->ax = sys_call_table[nr](regs);
#ifdef CONFIG_X86_X32_ABI
} else if (likely((nr & __X32_SYSCALL_BIT) &&
(nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
X32_NR_syscalls);
regs->ax = x32_sys_call_table[nr](regs);
#endif
}
// the handler found in the table (here __x64_sys_wait4) has been called; now do the syscall exit work
syscall_return_slowpath(regs);
}
Once the call completes, control returns to entry_SYSCALL_64, which restores the user context:
TRACE_IRQS_IRETQ /* we're about to change IF */
/*
* Try to use SYSRET instead of IRET if we're returning to
* a completely clean 64-bit userspace context. If we're not,
* go to the slow exit path.
*/
movq RCX(%rsp), %rcx
movq RIP(%rsp), %r11
cmpq %rcx, %r11 /* SYSRET requires RCX == RIP */
jne swapgs_restore_regs_and_return_to_usermode
/* shift to sign-extend RCX and check that the return address is canonical */
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
/* If this changed %rcx, it was not canonical */
cmpq %rcx, %r11
jne swapgs_restore_regs_and_return_to_usermode
cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
jne swapgs_restore_regs_and_return_to_usermode
movq R11(%rsp), %r11
cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
jne swapgs_restore_regs_and_return_to_usermode
/*
* SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
* restore RF properly. If the slowpath sets it for whatever reason, we
* need to restore it correctly.
*
* SYSRET can restore TF, but unlike IRET, restoring TF results in a
* trap from userspace immediately after SYSRET. This would cause an
* infinite loop whenever #DB happens with register state that satisfies
* the opportunistic SYSRET conditions. For example, single-stepping
* this user code:
*
* movq $stuck_here, %rcx
* pushfq
* popq %r11
* stuck_here:
*
* would never get past 'stuck_here'.
*/
testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
jnz swapgs_restore_regs_and_return_to_usermode
/* nothing to check for RSP */
cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
jne swapgs_restore_regs_and_return_to_usermode
/*
* We win! This label is here just for ease of understanding
* perf profiles. Nothing jumps here.
*/
syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
UNWIND_HINT_EMPTY
POP_REGS pop_rdi=0 skip_r11rcx=1
/*
* Now all regs are restored except RSP and RDI.
* Save old stack pointer and switch to trampoline stack.
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
pushq RSP-RDI(%rdi) /* RSP */
pushq (%rdi) /* RDI */
/*
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
popq %rdi
popq %rsp
USERGS_SYSRET64
END(entry_SYSCALL_64)
With that, one complete system call round trip is done.