A Deep Dive into System Calls


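Lab requirements

  • Pick the system call whose number matches the last two digits of your student ID
  • Trigger that system call with assembly instructions
  • Trace the kernel's handling of the system call with gdb
  • Focus on how the system call entry saves the context, how the context is restored, and how the system call returns, paying particular attention to how the kernel stack changes along the way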

1 Target system call

My student ID ends in 61, so I chose system call number 61, which is wait4 on x86-64.

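After a process terminates, its parent usually needs the termination information. The parent calls one of the wait functions to collect the child's exit status, after which the final cleanup can run and the memory the process still occupies is released for good. The wait family includes wait, wait3, waitpid, and others; these functions are implemented on top of the wait4 system call.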
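To make the relationship concrete, here is a minimal sketch that reaps a child by invoking wait4 directly through the generic syscall(2) wrapper instead of going through wait or waitpid. The function name reap_child_with_wait4 is hypothetical and exists only for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits with `code`, then reap it
 * by calling wait4 through syscall(2). Returns the child's exit status,
 * or -1 if anything goes wrong.
 */
int reap_child_with_wait4(int code)
{
        pid_t pid = fork();
        if (pid < 0)
                return -1;
        if (pid == 0)
                _exit(code);    /* child: terminate immediately */

        int status = 0;
        /* wait4(pid, &status, options = 0, rusage = NULL) */
        long ret = syscall(SYS_wait4, pid, &status, 0, NULL);
        if (ret != pid || !WIFEXITED(status))
                return -1;
        return WEXITSTATUS(status);
}
```

Calling reap_child_with_wait4(7) should return 7, because the parent blocks in wait4 until the child has exited and then decodes the status word with WEXITSTATUS.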

For the underlying principles, see the article "The exit and wait4 system calls".

2 Environment setup

2.1 Host environment

VirtualBox 6.1.6 + Manjaro 20.0.1

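Manjaro is an Arch-based, beginner-friendly distribution. I used it for this lab mainly because I already had a preconfigured Manjaro virtual machine.

2.2 Setting up the build environment

sudo pacman -Syu # Manjaro is a rolling-release distribution, so bring the system fully up to date first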
sudo pacman -S axel
axel -n 20 https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.40.tar.xz
xz -d linux-5.4.40.tar.xz
tar -xvf linux-5.4.40.tar
cd linux-5.4.40
sudo pacman -S base-devel ncurses bison flex openssl libelf # pacman names for the usual Debian/Ubuntu packages: build-essential is base-devel, libncurses-dev is ncurses, libssl-dev is openssl, and libelf-dev is libelf

2.3 Kernel configuration and a first test

make defconfig
make menuconfig
# Enable the debugging options
Kernel hacking --->
	Compile-time checks and compiler options --->
		[*] Compile the kernel with debug info
		[*] Provide GDB scripts for kernel debugging 
	[*] Kernel debugging
# Disable KASLR; otherwise setting breakpoints will fail
Processor type and features ---->
	[ ] Randomize the address of the kernel image (KASLR)
# Check that the kernel builds and boots under QEMU
make -j$(nproc) 
sudo pacman -S qemu 
qemu-system-x86_64 -kernel arch/x86/boot/bzImage

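The kernel boots as expected. Because no root file system is provided yet, it stops with a Kernel panic message.

2.4 Building the root file system image

# Build and install the BusyBox toolbox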
axel -n 20 https://busybox.net/downloads/busybox-1.31.1.tar.bz2
tar -jxvf busybox-1.31.1.tar.bz2
cd busybox-1.31.1
make menuconfig
	# Select static linking
	Settings --->
		[*] Build static binary (no shared libs)
make -j$(nproc) && make install

The build may well fail with errors like these:

/usr/bin/ld: libbb/lib.a(xconnect.o): in function `str2sockaddr':
xconnect.c:(.text.str2sockaddr+0x116): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: coreutils/lib.a(mktemp.o): in function `mktemp_main':
mktemp.c:(.text.mktemp_main+0x94): warning: the use of `mktemp' is dangerous, better use `mkstemp' or `mkdtemp'
/usr/bin/ld: libbb/lib.a(xgethostbyname.o): in function `xgethostbyname':
xgethostbyname.c:(.text.xgethostbyname+0x5): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: libbb/lib.a(xconnect.o): in function `bb_lookup_port':
xconnect.c:(.text.bb_lookup_port+0x44): warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: util-linux/lib.a(rdate.o): in function `rdate_main':
rdate.c:(.text.rdate_main+0xf8): undefined reference to `stime'
/usr/bin/ld: coreutils/lib.a(date.o): in function `date_main':
date.c:(.text.date_main+0x22d): undefined reference to `stime'
collect2: error: ld returned 1 exit status
Note: if build needs additional libraries, put them in CONFIG_EXTRA_LDLIBS.
Example: CONFIG_EXTRA_LDLIBS="pthread dl tirpc audit pam"
make: *** [Makefile:718:busybox_unstripped] Error 1

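My workaround was to deselect the affected applets in menuconfig (here date and rdate, the two that reference stime), after which the build completed.

Next, continue building the root file system image: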

mkdir rootfs
cd rootfs
cp ../busybox-1.31.1/_install/* ./ -rf
mkdir dev proc sys home
sudo cp -a /dev/{null,console,tty,tty1,tty2,tty3,tty4} dev/


# Create the init script in the root of the root file system (rootfs/init) with the following contents.
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
echo "Welcome MengningOS!"
echo "--------------------"
cd home
/bin/sh


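# Make the init script executable
chmod +x init
# Pack the directory into an in-memory root file system image
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# Boot with the root file system and check that the kernel runs the init script once it has started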
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz

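The kernel boots and runs the init script successfully.

3 Tracing the system call

3.1 Writing assembly code that triggers the system call

Create the file callwait4.c in the rootfs/home directory: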

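#include <stdio.h>
#include <unistd.h>

int main(void) {
        sleep(1);
        /*
         * Trigger system call 61 (0x3D, wait4) directly. Only rax is set here;
         * the argument registers keep whatever values they already hold.
         * The clobber list tells the compiler that syscall overwrites
         * rax, rcx and r11.
         */
        asm volatile(
                "movl $0x3D, %%eax\n\t"
                "syscall\n\t"
                :
                :
                : "rax", "rcx", "r11", "memory"
        );
        printf("Hello wait4\n");
        return 0;
}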

Compile the program and rebuild the root file system image:

# Compile statically with gcc
gcc  -o callwait4 callwait4.c -static
# Rebuild the root file system image
cd ..
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# Check that the program runs
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz
# Then, in the guest shell:
./callwait4

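3.2 Debugging with gdb

# Start QEMU with the following command. -s opens a gdbserver on TCP port 1234, and -S keeps the CPU halted at startup until gdb resumes it.
qemu-system-x86_64 -kernel ~/linux-5.4.40/arch/x86/boot/bzImage -initrd ../rootfs.cpio.gz -S -s
# In another terminal, start gdb, connect to that port, and set the breakpoint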
cd linux-5.4.40/
sudo pacman -S gdb
gdb vmlinux
(gdb) target remote:1234
(gdb) b __x64_sys_wait4
(gdb) c
# Run callwait4 inside QEMU
./callwait4
# gdb stops at the breakpoint set earlier; use bt to view the call stack
(gdb) bt

gdb stops at the breakpoint and bt prints the call stack. Single-stepping from there gives the following session:

Breakpoint 1, __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
1625	SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) bt
#0  __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
#1  0xffffffff810025e3 in do_syscall_64 (nr=<optimized out>, regs=0xffffc900001aff58)
    at arch/x86/entry/common.c:290
#2  0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:175
#3  0x00007ffd653be83c in ?? ()
#4  0x0000000000622050 in ?? ()
#5  0x0000000000aab990 in ?? ()
#6  0x0000000000aab990 in ?? ()
#7  0x0000000000000001 in fixed_percpu_data ()
#8  0x0000000000aa9a90 in ?? ()
#9  0x0000000000000246 in ?? ()
#10 0x0000000000000000 in ?? ()

(gdb) n
__do_sys_wait4 (upid=-1, stat_addr=0x7ffd653be83c, 
    options=0, ru=0x0 <fixed_percpu_data>)
    at kernel/exit.c:1627
1627	{
(gdb) 
1629		long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL);
(gdb) 


Breakpoint 1, __x64_sys_wait4 (
    regs=0xffffc900001b7f58) at kernel/exit.c:1625
1625	SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) 
__do_sys_wait4 (upid=0, 
    stat_addr=0x0 <fixed_percpu_data>, options=0, 
    ru=0x7fffb9283030) at kernel/exit.c:1627
1627	{
(gdb) 
1629		long err = kernel_wait4(upid, stat_addr, options, ru ? &r : NULL);
(gdb) 
1631		if (err > 0) {
(gdb) 
do_syscall_64 (nr=18446744071600093856, 
    regs=0xffffc900001b7f58)
    at arch/x86/entry/common.c:300
300		syscall_return_slowpath(regs);
(gdb) 
entry_SYSCALL_64 ()
    at arch/x86/entry/entry_64.S:184
184		movq	RCX(%rsp), %rcx
(gdb) 
185		movq	RIP(%rsp), %r11
(gdb) 
187		cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
(gdb) 
188		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
205		shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
(gdb) 
206		sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
(gdb) 
210		cmpq	%rcx, %r11
(gdb) 
211		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
213		cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
(gdb) 
214		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
216		movq	R11(%rsp), %r11
(gdb) 
217		cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
(gdb) 
218		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
238		testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
(gdb) 
239		jnz	swapgs_restore_regs_and_return_to_usermode
(gdb) 
243		cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
(gdb) 
244		jne	swapgs_restore_regs_and_return_to_usermode
(gdb) 
253		POP_REGS pop_rdi=0 skip_r11rcx=1
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:259
259		movq	%rsp, %rdi
(gdb) 
260		movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:262
262		pushq	RSP-RDI(%rdi)	/* RSP */
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:263
263		pushq	(%rdi)		/* RDI */
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:271
271		SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
(gdb) 
273		popq	%rdi
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:274
274		popq	%rsp
(gdb) 
entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:275
275		USERGS_SYSRET64
(gdb) 
0x0000000000401c6a in ?? ()

3.3 Analysis

The call stack at the breakpoint:

Breakpoint 1, __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
1625	SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr,
(gdb) bt
#0  __x64_sys_wait4 (regs=0xffffc900001aff58) at kernel/exit.c:1625
#1  0xffffffff810025e3 in do_syscall_64 (nr=<optimized out>, regs=0xffffc900001aff58)
    at arch/x86/entry/common.c:290
#2  0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:175

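When execution stops at __x64_sys_wait4, the kernel has already passed through entry_SYSCALL_64, which saves the user-mode context, and do_syscall_64, which dispatches to the handler. Here is the relevant part of entry_SYSCALL_64: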

ENTRY(entry_SYSCALL_64)
	UNWIND_HINT_EMPTY
	/*
	 * Interrupts are off on entry.
	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
	 * it is too small to ever cause noticeable irq latency.
	 */

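// swapgs switches GS to the kernel's per-CPU base, so the PER_CPU_VAR accesses below work; this is the first step of saving the context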
	swapgs
	/* tss.sp2 is scratch space. */
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
	// Push the saved user context onto the kernel stack
	/* Construct struct pt_regs on stack */
	pushq	$__USER_DS				/* pt_regs->ss */
	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
	pushq	%r11					/* pt_regs->flags */
	pushq	$__USER_CS				/* pt_regs->cs */
	pushq	%rcx					/* pt_regs->ip */
GLOBAL(entry_SYSCALL_64_after_hwframe)
	pushq	%rax					/* pt_regs->orig_ax */

	PUSH_AND_CLEAR_REGS rax=$-ENOSYS

	TRACE_IRQS_OFF
// rax holds the system call number; it and the pt_regs pointer (rsp) become the two arguments of do_syscall_64
	/* IRQs are off. */
	movq	%rax, %rdi
	movq	%rsp, %rsi
	call	do_syscall_64		/* returns with IRQs disabled */
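The push sequence above builds a struct pt_regs downward from the top of the kernel stack, which is why the last field of the structure (ss) is pushed first. The sketch below mirrors that layout in user-space C so the offsets can be checked; the names pt_regs_sketch and stack_top_from_regs are hypothetical, and the field order is written from the x86-64 definition as I understand it for this kernel version.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the x86-64 pt_regs layout, lowest address first. */
struct pt_regs_sketch {
        /* Pushed last, by PUSH_AND_CLEAR_REGS */
        uint64_t r15, r14, r13, r12, bp, bx;
        uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
        /* Pushed after the hardware-style frame: the system call number */
        uint64_t orig_ax;
        /* Pushed first, at the start of entry_SYSCALL_64 */
        uint64_t ip, cs, flags, sp, ss;
};

/*
 * Hypothetical helper: given the regs pointer that do_syscall_64 receives,
 * compute where the stack pointer was before the frame was pushed.
 */
static uint64_t stack_top_from_regs(uint64_t regs)
{
        return regs + sizeof(struct pt_regs_sketch);
}
```

The structure is 21 × 8 = 168 (0xa8) bytes. Applied to the regs value from the backtrace, 0xffffc900001aff58 + 0xa8 = 0xffffc900001b0000, a page-aligned address, which is consistent with the frame sitting at the very top of the task's kernel stack.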

Next, the relevant code of do_syscall_64:

__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
	struct thread_info *ti;

	enter_from_user_mode();
	local_irq_enable();
	ti = current_thread_info();
	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
		nr = syscall_trace_enter(regs);
	
    // Look up the handler in the system call table by number and call it
	if (likely(nr < NR_syscalls)) {
		nr = array_index_nospec(nr, NR_syscalls);
		regs->ax = sys_call_table[nr](regs);
#ifdef CONFIG_X86_X32_ABI
	} else if (likely((nr & __X32_SYSCALL_BIT) &&
			  (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
		nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
					X32_NR_syscalls);
		regs->ax = x32_sys_call_table[nr](regs);
#endif
	}
	// The handler called above is __x64_sys_wait4 in this experiment; what remains is the exit-to-user work
	syscall_return_slowpath(regs);
}
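The dispatch step in do_syscall_64 boils down to a bounds check followed by an indirect call through a table of function pointers, with the result stored back into the saved ax. The following is a simplified user-space sketch of that pattern, not kernel code: every name ending in _sketch is hypothetical, the table is tiny, and empty slots are NULL here, whereas the kernel fills them with a stub that returns -ENOSYS.

```c
#include <errno.h>
#include <stddef.h>

#define NR_SYSCALLS_SKETCH 64

/* Cut-down stand-in for struct pt_regs: first argument and return value. */
struct regs_sketch {
        unsigned long di;
        long ax;
};

typedef long (*sys_call_ptr_sketch)(const struct regs_sketch *);

/* Stand-in handler: reports the pid it was asked to wait for. */
static long wait4_sketch(const struct regs_sketch *regs)
{
        return (long)regs->di;
}

/* Table indexed by system call number; unlisted slots stay NULL. */
static const sys_call_ptr_sketch table_sketch[NR_SYSCALLS_SKETCH] = {
        [61] = wait4_sketch,
};

static void do_syscall_sketch(unsigned long nr, struct regs_sketch *regs)
{
        /* The entry code preloads the saved ax with -ENOSYS. */
        regs->ax = -ENOSYS;
        if (nr < NR_SYSCALLS_SKETCH && table_sketch[nr] != NULL)
                regs->ax = table_sketch[nr](regs);
}
```

With di set to 42, do_syscall_sketch(61, &regs) leaves 42 in regs.ax, while an unknown number such as 1000 leaves -ENOSYS there. This is the same reason PUSH_AND_CLEAR_REGS is invoked with rax=$-ENOSYS in the entry code.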

When the call returns, control goes back to entry_SYSCALL_64, which restores the saved context:

	TRACE_IRQS_IRETQ		/* we're about to change IF */

	/*
	 * Try to use SYSRET instead of IRET if we're returning to
	 * a completely clean 64-bit userspace context.  If we're not,
	 * go to the slow exit path.
	 */
	movq	RCX(%rsp), %rcx
	movq	RIP(%rsp), %r11

	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
	jne	swapgs_restore_regs_and_return_to_usermode
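// The jne above is taken only when RCX != RIP; the kernel then falls back to the slower IRET path at swapgs_restore_regs_and_return_to_usermode.
// Next comes the canonical-address check on RCX. This excerpt omits the matching #ifdef CONFIG_X86_5LEVEL branch, so only the #else code and its #endif appear.
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
#endif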

	/* If this changed %rcx, it was not canonical */
	cmpq	%rcx, %r11
	jne	swapgs_restore_regs_and_return_to_usermode

	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

	movq	R11(%rsp), %r11
	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
	 * restore RF properly. If the slowpath sets it for whatever reason, we
	 * need to restore it correctly.
	 *
	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
	 * trap from userspace immediately after SYSRET.  This would cause an
	 * infinite loop whenever #DB happens with register state that satisfies
	 * the opportunistic SYSRET conditions.  For example, single-stepping
	 * this user code:
	 *
	 *           movq	$stuck_here, %rcx
	 *           pushfq
	 *           popq %r11
	 *   stuck_here:
	 *
	 * would never get past 'stuck_here'.
	 */
	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
	jnz	swapgs_restore_regs_and_return_to_usermode

	/* nothing to check for RSP */

	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * We win! This label is here just for ease of understanding
	 * perf profiles. Nothing jumps here.
	 */
syscall_return_via_sysret:
	/* rcx and r11 are already restored (see code above) */
	UNWIND_HINT_EMPTY
	POP_REGS pop_rdi=0 skip_r11rcx=1

	/*
	 * Now all regs are restored except RSP and RDI.
	 * Save old stack pointer and switch to trampoline stack.
	 */
	movq	%rsp, %rdi
	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp

	pushq	RSP-RDI(%rdi)	/* RSP */
	pushq	(%rdi)		/* RDI */

	/*
	 * We are on the trampoline stack.  All regs except RDI are live.
	 * We can do future final exit work right here.
	 */
	STACKLEAK_ERASE_NOCLOBBER

	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

	popq	%rdi
	popq	%rsp
	USERGS_SYSRET64
END(entry_SYSCALL_64)
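One of the SYSRET preconditions checked above is that the return address in RCX is canonical. The shl/sar pair sign-extends the address from its top implemented bit, and the following cmpq detects whether that changed anything. Here is a minimal C sketch of the same test, assuming 48-bit virtual addresses (that is, __VIRTUAL_MASK_SHIFT equal to 47); the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if addr is canonical for the given
 * virtual address width, i.e. all bits above the top implemented bit are
 * copies of that bit.
 *
 * Relies on arithmetic right shift of negative values, which GCC and
 * Clang provide but the C standard leaves implementation-defined.
 */
static bool is_canonical_sketch(uint64_t addr, unsigned vaddr_bits)
{
        unsigned shift = 64 - vaddr_bits;
        /* shl discards the upper bits; sar refills them with the sign bit. */
        int64_t extended = (int64_t)(addr << shift) >> shift;

        return (uint64_t)extended == addr;
}
```

The user-space address the trace returns to, 0x401c6a, passes this check, so the fast SYSRET path is allowed. An address such as 0x0000800000000000 fails it and would force the IRET path.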

That completes one system call.

References

[1] https://my.oschina.net/u/3857782/blog/1857551

