Understanding System Calls in Depth


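Lab Requirements

  • Pick the system call whose number matches the last two digits of your student ID
  • Trigger that system call with assembly instructions
  • Use gdb to trace how the kernel handles the system call
  • Focus on how the system call entry saves and restores context and how the system call returns, paying particular attention to how the kernel stack changes along the way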

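1 Target System Call

My student ID ends in 61, so the system call I chose is number 61, wait4.

After a process terminates, its parent needs a way to learn how it ended. The parent calls one of the wait functions to collect the child's termination status, after which the kernel can do the final cleanup and release the remaining resources the process held. The wait family includes wait, wait3, and waitpid; on Linux these are all implemented on top of the wait4 system call.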
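The parent-reaps-child behaviour described above can be sketched in user space. This is a minimal sketch, not code from the lab: the helper name reap_child_with_wait4 is made up for illustration, and it goes through the C library's wait4() wrapper rather than raw assembly.

```c
#define _DEFAULT_SOURCE          /* expose wait4() in glibc headers */
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits with `code`, then reap it
 * with wait4(). Returns the child's exit status, or -1 on failure.
 */
int reap_child_with_wait4(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);              /* child: terminate immediately */

    int status = 0;
    struct rusage usage;          /* filled with the child's resource usage */
    if (wait4(pid, &status, 0, &usage) != pid)
        return -1;

    /* Only a normal exit carries an exit status. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Until the parent makes this call, the terminated child remains in the process table as a zombie; wait4 is what lets the kernel discard it.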

For the underlying principles, see the article "系统调用exit与wait4" (the exit and wait4 system calls).

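2 Environment Preparation

2.1 Host Environment

VirtualBox 6.1.6 + Manjaro 20.0.1

Manjaro is an Arch-based, beginner-friendly distribution. I used it for this lab mainly because I already had a preconfigured Manjaro virtual machine.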

2.2 Setting Up the Build Environment

sudo pacman -Syu # Manjaro is a rolling-release distribution, so update to the latest packages first
sudo pacman -S axel
axel -n 20 https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.40.tar.xz
xz -d linux-5.4.40.tar.xz
tar -xvf linux-5.4.40.tar
cd linux-5.4.40
# pacman package names differ from Debian's: build-essential is base-devel, libncurses-dev is ncurses, libssl-dev is openssl, and libelf-dev is libelf
sudo pacman -S base-devel ncurses bison flex openssl libelf

2.3 Kernel Configuration and Initial Test

make defconfig
make menuconfig
# Enable the debug-related options
Kernel hacking --->
	Compile-time checks and compiler options --->
		[*] Compile the kernel with debug info
		[*] Provide GDB scripts for kernel debugging 
	[*] Kernel debugging
# Disable KASLR, otherwise breakpoints will fail to hit
Processor type and features ---->
	[ ] Randomize the address of the kernel image (KASLR)
# Check that the kernel builds and boots under QEMU
make -j$(nproc) 
sudo pacman -S qemu 
qemu-system-x86_64 -kernel arch/x86/boot/bzImage

The kernel boots normally. Because there is no root filesystem yet, the boot ends with a Kernel panic message, which is expected at this stage.

2.4 Building the Root Filesystem Image

# Download and build the BusyBox toolbox
axel -n 20 https://busybox.net/downloads/busybox-1.31.1.tar.bz2
tar -jxvf busybox-1.31.1.tar.bz2
cd busybox-1.31.1
make menuconfig
	# Select static linking
	Settings --->
		[*] Build static binary (no shared libs)
make -j$(nproc) && make install

The build is likely to fail with errors like these:

/usr/bin/ld: libbb/lib.a(xconnect.o): in function `str2sockaddr':
xconnect.c:(.text.str2sockaddr+0x116): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: coreutils/lib.a(mktemp.o): in function `mktemp_main':
mktemp.c:(.text.mktemp_main+0x94): warning: the use of `mktemp' is dangerous, better use `mkstemp' or `mkdtemp'
/usr/bin/ld: libbb/lib.a(xgethostbyname.o): in function `xgethostbyname':
xgethostbyname.c:(.text.xgethostbyname+0x5): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: libbb/lib.a(xconnect.o): in function `bb_lookup_port':
xconnect.c:(.text.bb_lookup_port+0x44): warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: util-linux/lib.a(rdate.o): in function `rdate_main':
rdate.c:(.text.rdate_main+0xf8): undefined reference to `stime'
/usr/bin/ld: coreutils/lib.a(date.o): in function `date_main':
date.c:(.text.date_main+0x22d): undefined reference to `stime'
collect2: error: ld returned 1 exit status
Note: if build needs additional libraries, put them in CONFIG_EXTRA_LDLIBS.
Example: CONFIG_EXTRA_LDLIBS="pthread dl tirpc audit pam"
make: *** [Makefile:718:busybox_unstripped] Error 1

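Only the two undefined references to `stime' are fatal; the other messages are warnings. My fix was to uncheck the applets involved in menuconfig (the errors above come from rdate and date), after which the build completed.

Next, continue building the root filesystem image: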

mkdir rootfs
cd rootfs
cp ../busybox-1.31.1/_install/* ./ -rf
mkdir dev proc sys home
sudo cp -a /dev/{null,console,tty,tty1,tty2,tty3,tty4} dev/


# Create the init script at the top of the root filesystem (rootfs/init) with the following contents.
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
echo "Wellcome MengningOS!"
echo "--------------------"
cd home
/bin/sh


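# Make the init script executable
chmod +x init
# Pack the directory into an in-memory root filesystem (initramfs) image
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# Boot with the root filesystem to check that the kernel runs the init script once startup completes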
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz

The kernel boots and runs the init script successfully.

3 Tracing the System Call

3.1 Writing Assembly Code That Triggers the System Call

Create the file callwait4.c in the rootfs/home directory:

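#include <stdio.h>
#include <unistd.h>

int main(void) {
        sleep(1);
        /*
         * Put system call number 61 (0x3D, wait4) in eax and trap into the kernel.
         * The argument registers (rdi, rsi, rdx, r10) are not set here, so wait4
         * receives whatever they happen to hold. syscall overwrites rcx and r11
         * and returns its result in rax, so all three are listed as clobbers.
         */
        asm volatile(
                "movl $0x3D, %%eax\n\t"
                "syscall\n\t"
                : /* no outputs */
                : /* no inputs */
                : "rax", "rcx", "r11", "memory"
        );
        printf("Hello wait4\n");
        return 0;
}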
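For comparison, here is a sketch of how the same system call could be issued with its four arguments set explicitly, following the x86-64 convention (number in rax; arguments in rdi, rsi, rdx, r10; result in rax). The names raw_wait4 and raw_wait4_demo are hypothetical and are not part of the lab code. On other architectures the sketch falls back to the C library's syscall() wrapper, which reports errors through errno rather than a negative return value.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: invoke wait4 directly with the syscall instruction. */
long raw_wait4(pid_t pid, int *status, int options, void *rusage)
{
#if defined(__x86_64__)
    long ret;
    /* The 4th argument travels in r10, which has no constraint letter. */
    register long r10 __asm__("r10") = (long)rusage;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)                                  /* result in rax */
        : "a"(61L),                                  /* wait4 is number 61 */
          "D"((long)pid), "S"(status), "d"((long)options), "r"(r10)
        : "rcx", "r11", "memory");                   /* clobbered by syscall */
    return ret;
#else
    /* Assumption: non-x86-64 builds use the libc wrapper instead. */
    return syscall(SYS_wait4, pid, status, options, rusage);
#endif
}

/* Fork a child that exits with `code` and reap it through raw_wait4(). */
int raw_wait4_demo(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);

    int status = 0;
    if (raw_wait4(pid, &status, 0, NULL) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Binding each argument to its register through constraints lets the compiler load the values and account for the clobbered registers itself.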

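Compile the program and rebuild the root filesystem image:

# Compile statically with gcc
gcc -o callwait4 callwait4.c -static
# Rebuild the root filesystem image
cd ..
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# Boot the image and check that the program runs
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz
# Inside the QEMU guest shell:
./callwait4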

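3.2 Debugging with gdb

# Start QEMU with the following command. -s opens a gdbserver on TCP port 1234, and -S keeps the CPU halted at startup until gdb tells it to continue.
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz -S -s
# In another terminal, start gdb, connect to that port, and set a breakpoint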
cd linux-5.4.40/
sudo pacman -S gdb
gdb vmlinux
(gdb) target remote:1234
(gdb) b __x64_sys_wait4
(gdb) c
# Run callwait4 inside QEMU
./callwait4
# gdb stops at the breakpoint set earlier; use bt to view the stack trace
(gdb) bt

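The gdb output from hitting the breakpoint and viewing its stack trace is shown below, followed by single-stepping. Pressing Enter at an empty (gdb) prompt repeats the previous n command: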

Breakpoint 1, __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
1625	SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) bt
#0  __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
#1  0xffffffff810025e3 in do_syscall_64 (nr=<optimized out>, regs=0xffffc900001aff58)
    at arch/x86/entry/common.c:290
#2  0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:175
#3  0x00007ffd653be83c in ?? ()
#4  0x0000000000622050 in ?? ()
#5  0x0000000000aab990 in ?? ()
#6  0x0000000000aab990 in ?? ()
#7  0x0000000000000001 in fixed_percpu_data ()
#8  0x0000000000aa9a90 in ?? ()
#9  0x0000000000000246 in ?? ()
#10 0x0000000000000000 in ?? ()

(gdb) n
__do_sys_wait4 (upid=-1, stat_addr=0x7ffd653be83c, 
    options=0, ru=0x0 <fixed_percpu_data>)
    at kernel/exit.c:1627
1627	{
(gdb) 
1629		long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL);
(gdb) 


Breakpoint 1, __x64_sys_wait4 (
    regs=0xffffc900001b7f58) at kernel/exit.c:1625
1625	SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) 
__do_sys_wait4 (upid=0, 
    stat_addr=0x0 <fixed_percpu_data>, options=0, 
    ru=0x7fffb9283030) at kernel/exit.c:1627
1627	{
(gdb) 
1629		long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL);
(gdb) 
1631		if (err > 0) {
(gdb) 
do_syscall_64 (nr=18446744071600093856, 
    regs=0xffffc900001b7f58)
    at arch/x86/entry/common.c:300
300		syscall_return_slowpath(regs);
(gdb) 
entry_SYSCALL_64 ()
    at arch/x86/entry/entry_64.S:184
184		movq	RCX(%rsp), %rcx
(gdb) 
185		movq	RIP(%rsp), %r11
(gdb) 
187		cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
(gdb) 
188		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
205		shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
(gdb) 
206		sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
(gdb) 
210		cmpq	%rcx, %r11
(gdb) 
211		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
213		cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
(gdb) 
214		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
216		movq	R11(%rsp), %r11
(gdb) 
217		cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
(gdb) 
218		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
238		testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
(gdb) 
239		jnz	swapgs_restore_regs_and_return_to_usermode
(gdb) 
243		cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
(gdb) 
244		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
253		POP_REGS pop_rdi=0 skip_r11rcx=1
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:259
259		movq	%rsp, %rdi
(gdb) 
260		movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:262
262		pushq	RSP-RDI(%rdi)	/* RSP */
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:263
263		pushq	(%rdi)		/* RDI */
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:271
271		SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
(gdb) 
273		popq	%rdi
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:274
274		popq	%rsp
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:275
275		USERGS_SYSRET64
(gdb) 
0x0000000000401c6a in ?? ()

3.3 Analysis of the Results

First, the stack trace at the breakpoint:

Breakpoint 1, __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
1625	SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) bt
#0  __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
#1  0xffffffff810025e3 in do_syscall_64 (nr=<optimized out>, regs=0xffffc900001aff58)
    at arch/x86/entry/common.c:290
#2  0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:175

By the time execution stops at __x64_sys_wait4, entry_SYSCALL_64 has already saved the user context and do_syscall_64 has dispatched to the handler. The relevant part of entry_SYSCALL_64 is:

ENTRY(entry_SYSCALL_64)
	UNWIND_HINT_EMPTY
	/*
	 * Interrupts are off on entry.
	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
	 * it is too small to ever cause noticeable irq latency.
	 */

// swapgs switches GS to the kernel's per-CPU base; the user context itself is saved by the pushes below
	swapgs
	/* tss.sp2 is scratch space. */
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
	// Push the saved user context onto the kernel stack
	/* Construct struct pt_regs on stack */
	pushq	$__USER_DS				/* pt_regs->ss */
	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
	pushq	%r11					/* pt_regs->flags */
	pushq	$__USER_CS				/* pt_regs->cs */
	pushq	%rcx					/* pt_regs->ip */
GLOBAL(entry_SYSCALL_64_after_hwframe)
	pushq	%rax					/* pt_regs->orig_ax */

	PUSH_AND_CLEAR_REGS rax=$-ENOSYS

	TRACE_IRQS_OFF
// Pass the system call number (rax) and the pt_regs pointer (rsp) to do_syscall_64 as its two arguments
	/* IRQs are off. */
	movq	%rax, %rdi
	movq	%rsp, %rsi
	call	do_syscall_64		/* returns with IRQs disabled */
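The push sequence above builds a struct pt_regs from the top of the kernel stack downward. The sketch below mirrors the x86-64 field order in a stand-in type (pt_regs_sketch is a made-up name, not the kernel's definition) to show how the regs pointer seen in gdb relates to the stack top. The stack-top address used in the usage note is an assumption inferred from the trace; the trace does not print it.

```c
#include <stddef.h>
#include <stdint.h>

/* Field order mirrors x86-64 struct pt_regs, lowest address first. */
struct pt_regs_sketch {
    uint64_t r15, r14, r13, r12, bp, bx;            /* pushed last */
    uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;  /* PUSH_AND_CLEAR_REGS */
    uint64_t orig_ax;                               /* system call number */
    uint64_t ip, cs, flags, sp, ss;                 /* pushed first */
};

/* 21 eight-byte slots in total. */
_Static_assert(sizeof(struct pt_regs_sketch) == 21 * 8,
               "pt_regs occupies 168 bytes");

/* Address of pt_regs when it is built downward from the stack top. */
uint64_t pt_regs_address(uint64_t stack_top)
{
    return stack_top - sizeof(struct pt_regs_sketch);
}
```

If the kernel stack top were 0xffffc900001b0000, pt_regs_address() would give 0xffffc900001aff58, which is the regs value shown in the first backtrace.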

Next, the code of do_syscall_64:

__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
	struct thread_info *ti;

	enter_from_user_mode();
	local_irq_enable();
	ti = current_thread_info();
	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
		nr = syscall_trace_enter(regs);
	
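	// Look up the handler for this system call number in the system call table and call it; in this lab the handler is __x64_sys_wait4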
	if (likely(nr < NR_syscalls)) {
		nr = array_index_nospec(nr, NR_syscalls);
		regs->ax = sys_call_table[nr](regs);
#ifdef CONFIG_X86_X32_ABI
	} else if (likely((nr & __X32_SYSCALL_BIT) &&
			  (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
		nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
					X32_NR_syscalls);
		regs->ax = x32_sys_call_table[nr](regs);
#endif
	}
	// The handler has returned and its result is in regs->ax; do the exit work before going back to user mode
	syscall_return_slowpath(regs);
}
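The table lookup in do_syscall_64 can be illustrated with a small user-space sketch. Everything here (regs_sketch, sketch_table, sketch_do_syscall, sketch_result) is hypothetical and greatly simplified: the speculation barrier array_index_nospec and the tracing hooks are left out, and empty table slots are simply skipped.

```c
#include <errno.h>
#include <stddef.h>

struct regs_sketch { long ax, di, si, dx, r10; };

typedef long (*syscall_fn)(const struct regs_sketch *);

#define NR_SKETCH_SYSCALLS 64

/* Stand-in handler for slot 61; it just echoes its first argument. */
static long sketch_wait4(const struct regs_sketch *regs)
{
    return regs->di;
}

/* Function-pointer table indexed by system call number. */
static const syscall_fn sketch_table[NR_SKETCH_SYSCALLS] = {
    [61] = sketch_wait4,
};

/* Bounds-check nr, call the handler, and store its result in regs->ax. */
void sketch_do_syscall(unsigned long nr, struct regs_sketch *regs)
{
    if (nr < NR_SKETCH_SYSCALLS && sketch_table[nr] != NULL)
        regs->ax = sketch_table[nr](regs);
    /* Otherwise regs->ax keeps the -ENOSYS preset by the entry code. */
}

/* Convenience wrapper: run one dispatch and return what lands in ax. */
long sketch_result(unsigned long nr, long first_arg)
{
    struct regs_sketch regs = { .ax = -ENOSYS, .di = first_arg };
    sketch_do_syscall(nr, &regs);
    return regs.ax;
}
```

The preset mirrors PUSH_AND_CLEAR_REGS rax=$-ENOSYS in the entry code: an out-of-range number never reaches a handler, so user space sees -ENOSYS.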

Once the call completes, control returns to entry_SYSCALL_64, which restores the saved context:

	TRACE_IRQS_IRETQ		/* we're about to change IF */

	/*
	 * Try to use SYSRET instead of IRET if we're returning to
	 * a completely clean 64-bit userspace context.  If we're not,
	 * go to the slow exit path.
	 */
	movq	RCX(%rsp), %rcx
	movq	RIP(%rsp), %r11

	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
	jne	swapgs_restore_regs_and_return_to_usermode
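// If RCX != RIP, take the slow path through swapgs_restore_regs_and_return_to_usermode.
// Otherwise sign-extend RCX to check that it is a canonical address
// (the preprocessor conditional around the next two lines is omitted from this excerpt):
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx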

	/* If this changed %rcx, it was not canonical */
	cmpq	%rcx, %r11
	jne	swapgs_restore_regs_and_return_to_usermode

	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

	movq	R11(%rsp), %r11
	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
	 * restore RF properly. If the slowpath sets it for whatever reason, we
	 * need to restore it correctly.
	 *
	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
	 * trap from userspace immediately after SYSRET.  This would cause an
	 * infinite loop whenever #DB happens with register state that satisfies
	 * the opportunistic SYSRET conditions.  For example, single-stepping
	 * this user code:
	 *
	 *           movq	$stuck_here, %rcx
	 *           pushfq
	 *           popq %r11
	 *   stuck_here:
	 *
	 * would never get past 'stuck_here'.
	 */
	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
	jnz	swapgs_restore_regs_and_return_to_usermode

	/* nothing to check for RSP */

	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * We win! This label is here just for ease of understanding
	 * perf profiles. Nothing jumps here.
	 */
syscall_return_via_sysret:
	/* rcx and r11 are already restored (see code above) */
	UNWIND_HINT_EMPTY
	POP_REGS pop_rdi=0 skip_r11rcx=1

	/*
	 * Now all regs are restored except RSP and RDI.
	 * Save old stack pointer and switch to trampoline stack.
	 */
	movq	%rsp, %rdi
	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp

	pushq	RSP-RDI(%rdi)	/* RSP */
	pushq	(%rdi)		/* RDI */

	/*
	 * We are on the trampoline stack.  All regs except RDI are live.
	 * We can do future final exit work right here.
	 */
	STACKLEAK_ERASE_NOCLOBBER

	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

	popq	%rdi
	popq	%rsp
	USERGS_SYSRET64
END(entry_SYSCALL_64)
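The chain of comparisons before syscall_return_via_sysret decides whether the fast SYSRET path is safe. A minimal C sketch of the same checks follows. The struct and function names are made up, the selector values 0x33 and 0x2b are the usual x86-64 __USER_CS and __USER_DS, and the canonical-address test assumes 4-level paging (48-bit virtual addresses).

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_USER_CS   0x33u        /* __USER_CS on x86-64 */
#define SK_USER_DS   0x2bu        /* __USER_DS on x86-64 */
#define SK_EFLAGS_TF 0x00000100u  /* trap flag */
#define SK_EFLAGS_RF 0x00010000u  /* resume flag */

/* The saved pt_regs fields that the return path inspects. */
struct frame_sketch { uint64_t cx, r11, ip, cs, flags, ss; };

/* Same effect as the shl/sar pair: bits 63..47 must be all 0 or all 1. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;
    return top == 0 || top == 0x1ffff;
}

/* True when every condition for returning with SYSRET holds. */
bool can_use_sysret(const struct frame_sketch *f)
{
    if (f->cx != f->ip)                          return false; /* RCX == RIP */
    if (!is_canonical48(f->cx))                  return false;
    if (f->cs != SK_USER_CS)                     return false;
    if (f->r11 != f->flags)                      return false; /* R11 == RFLAGS */
    if (f->r11 & (SK_EFLAGS_RF | SK_EFLAGS_TF))  return false;
    if (f->ss != SK_USER_DS)                     return false;
    return true;
}
```

In the trace above the saved flags are 0x246 and execution resumes at 0x401c6a, so none of the jumps to swapgs_restore_regs_and_return_to_usermode is taken and the kernel returns through USERGS_SYSRET64.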

This completes one system call.

References

[1] https://my.oschina.net/u/3857782/blog/1857551

