1. Experiment Requirements
- Pick a system call whose number matches the last two digits of my student ID.
- Trigger that system call with an assembly instruction.
- Trace the kernel's handling of the system call with gdb.
- Focus on how the system-call entry saves the execution context, how the context is restored, and how the system call returns; pay particular attention to how the kernel stack changes during the system call.
2. Theoretical Background
2.1 Overall Linux Architecture Diagram
2.2 System Calls
A system call is the smallest functional unit the operating system exposes. Depending on the application scenario, different Linux distributions provide different numbers of system calls, roughly between 240 and 350. Together these system calls form the basic interface between user space and the kernel. For example, for user-space code to request a 20 KB block of dynamic memory, the brk system call is needed to move the program break (the end of the data segment). But what if user code requests 20 KB in several places and frees memory at the same time? Managing that memory becomes very complex.
2.3 Library Functions
Typical examples are open(), write(), read(), and so on. Library functions also come in different versions depending on the standard they follow, for example the glibc library and the posix library.
2.4 User Mode and Kernel Mode
At the top level, the Linux architecture divides into user mode and kernel mode. A computer's hardware resources are finite, and to reduce conflicts over accessing and using them, the CPU and the operating system must provide mechanisms that restrict what user programs are allowed to do. Modern CPUs generally offer several instruction execution levels; that is, which instructions a program may execute is a matter of privilege. At a high execution level, code can execute privileged instructions and access arbitrary memory; this level corresponds to kernel mode, where all instructions, including privileged ones, may run. Correspondingly, in user mode (the low-privilege level), what code can do is restricted.

Why is this necessary? It is easy to see: if there were no privilege levels, all code written by any programmer could use privileged instructions, and the system would crash easily, because not every programmer writes robust code, and code might illegally access the resources of other processes or even the kernel, creating security problems. Privilege separation is thus one of the mechanisms operating systems developed to guarantee stability and security: it makes it hard for ordinary user-mode code to bring the whole system down, while kernel code, written by more specialized programmers and subjected to systematic testing, is comparatively more stable and robust.
2.5 Interrupts
Interrupts divide into external interrupts (hardware interrupts) and internal interrupts (software interrupts); internal interrupts are also called exceptions, which further divide into faults and traps. A system call uses a trap, a form of software interrupt, to enter kernel mode from user mode deliberately.

In general, the transition from user mode to kernel mode is triggered by an interrupt. It may be a hardware interrupt: while a user-mode process is executing, a hardware interrupt signal arrives, the CPU enters kernel mode, and the interrupt service routine for that interrupt runs. Or a user-mode program may, during execution, make a system call and trap into kernel mode; this is called a trap (a system call is a special kind of interrupt).
2.6 The Relationship between the Programming Interface (API) and System Calls
The library functions that wrap system calls are the API (application programming interface) the operating system provides to us; an API is just a function definition. A system call issues a service request to the kernel through a specific software interrupt (trap): executing the int $0x80 or syscall instruction triggers a system call. C library functions internally use system-call wrapper routines, whose main purpose is to issue the system call so that programmers do not have to use assembly instructions and pass arguments in registers themselves. Generally each system call has one wrapper routine, and the library uses these wrappers to define the APIs exposed to programmers; in this way, system calls are ultimately packaged as convenient C library functions.

Some C library APIs provide purely user-mode services and never need to interact with the kernel, for example some math functions; but any C library API that must interact with kernel space wraps a system call internally. One API may correspond to exactly one system call or be implemented by several system calls internally, and one system call may be used by several APIs. APIs that do not interact with the kernel wrap no system call at all, for example the absolute-value function abs(). As for return values, most system-call wrapper routines return an integer whose meaning depends on the particular system call; a return value of -1 usually means the kernel could not satisfy the process's request, and the errno variable defined by the C library holds the specific error code.
2.7 Linux System Calls
When a user-mode process makes a system call, the CPU switches to kernel mode and starts executing the system_call assembly code (entry_INT80_32 or entry_SYSCALL_64), which dispatches to the corresponding kernel handler based on the system call number. Concretely, on Linux a system call is triggered by executing the int $0x80 or syscall instruction; the int $0x80 instruction raises a programmed exception (trap) with interrupt vector 128. Intel processors also introduced the sysenter instruction ("fast system call"), but since it is Intel-specific and AMD does not support it, it is not covered here. We focus only on system calls triggered by the int and syscall instructions: after entering the kernel, the corresponding entry code, entry_INT80_32 or entry_SYSCALL_64, begins executing.
3. Setting Up the Experiment Environment
3.1 Install Development Tools
sudo apt install build-essential
sudo apt install qemu # install QEMU
sudo apt install libncurses5-dev bison flex libssl-dev libelf-dev
3.2 Download the Kernel Source
sudo apt install axel
axel -n 20 https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.34.tar.xz
xz -d linux-5.4.34.tar.xz
tar -xvf linux-5.4.34.tar
cd linux-5.4.34
3.3 Configure Kernel Options
make defconfig # Default configuration is based on 'x86_64_defconfig'
make menuconfig
# enable the debug-related options
Kernel hacking --->
    Compile-time checks and compiler options --->
        [*] Compile the kernel with debug info
        [*] Provide GDB scripts for kernel debugging
    [*] Kernel debugging
# disable KASLR, otherwise breakpoints will fail to hit
Processor type and features --->
    [ ] Randomize the address of the kernel image (KASLR)
3.4 Build and Run the Kernel
make -j$(nproc) # nproc gives the number of CPU cores/threads available
# test whether the kernel can boot; with no file system it will end in a kernel panic
qemu-system-x86_64 -kernel arch/x86/boot/bzImage # it will not boot all the way at this point
Since there is no file system yet, the boot eventually ends in a kernel panic; this is expected.
3.5 Build the Root File System
First download the busybox source from https://www.busybox.net and extract it; then, as with the kernel, configure, build, and install it.
axel -n 20 https://busybox.net/downloads/busybox-1.31.1.tar.bz2
tar -jxvf busybox-1.31.1.tar.bz2
cd busybox-1.31.1
make menuconfig # if menuconfig fails to open, restarting the VM may help
# remember to build a static binary rather than using shared libraries
Settings --->
    [*] Build static binary (no shared libs)
# then build and install; by default this installs into the _install directory under the source tree
make -j$(nproc) && make install
Build the in-memory root file system image:
mkdir rootfs
cd rootfs
cp ../busybox-1.31.1/_install/* ./ -rf
mkdir dev proc sys home
sudo cp -a /dev/{null,console,tty,tty1,tty2,tty3,tty4} dev/
Prepare an init script in the root directory of the root file system (rootfs/init) with the following content:
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
echo "Welcome MengningOS!"
echo "--------------------"
cd home
/bin/sh
Make the init script executable:
chmod +x init
# pack the in-memory root file system image
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# test mounting the root file system: check whether the init script runs after the kernel boots
qemu-system-x86_64 -kernel linux-5.4.34/arch/x86/boot/bzImage -initrd rootfs.cpio.gz
4. The System Call
The last two digits of my student ID are 03. Looking it up in linux-5.4.34/arch/x86/entry/syscalls/syscall_32.tbl, number 3 is the read system call.
View the read() function with the man read command.
Write read.c:
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    char buf[10];
    int n = read(STDIN_FILENO, buf, sizeof(buf) - 1);
    if (n >= 0)
        buf[n] = '\0';   /* read() does not NUL-terminate the buffer */
    printf("%d: %s\n", n, buf);
    return 0;
}
Compile and run.
Now trigger the system call with inline assembly instead: vi read_asm.c
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#define SYS_READ 3 /* system call number 3 in syscall_32.tbl */

int main(void)
{
    char buff[10];
    ssize_t charsread;

    /* int $0x80: eax = syscall number, ebx/ecx/edx = the arguments;
     * the return value comes back in eax */
    asm volatile("int $0x80"
                 : "=a" (charsread)
                 : "0" (SYS_READ), "b" (STDIN_FILENO), "c" (buff),
                   "d" (sizeof(buff) - 1)
                 : "memory", "cc");

    if (charsread >= 0)
        buff[charsread] = '\0'; /* read() does not NUL-terminate the buffer */
    printf("%d: %s", (int)charsread, buff);

    return 0;
}
Compile and run.
Analysis: the read system call reads input (file contents) into buf and returns the number of bytes read.
5. Tracing the Kernel Handling of the read System Call with gdb
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
qemu-system-x86_64 -kernel linux-5.4.34/arch/x86/boot/bzImage -initrd rootfs.cpio.gz -S -s -nographic -append "console=ttyS0"
# open a new terminal
cd linux-5.4.34
gdb vmlinux
target remote:1234
# set a breakpoint
b __ia32_sys_read
(gdb) target remote:1234
Remote debugging using :1234
default_idle () at arch/x86/kernel/process.c:581
581             trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
(gdb) b __ia32_sys_read
Breakpoint 1 at 0xffffffff811d0b50: file fs/read_write.c, line 597.
(gdb) bt
#0  default_idle () at arch/x86/kernel/process.c:581
#1  0xffffffff81095ebe in cpuidle_idle_call () at kernel/sched/idle.c:154
#2  do_idle () at kernel/sched/idle.c:263
#3  0xffffffff810960f4 in cpu_startup_entry (state=CPUHP_ONLINE) at kernel/sched/idle.c:355
#4  0xffffffff81a7eec5 in rest_init () at init/main.c:451
#5  0xffffffff829aeab7 in arch_call_rest_init () at init/main.c:573
#6  0xffffffff829aef74 in start_kernel () at init/main.c:785
#7  0xffffffff810000d4 in secondary_startup_64 () at arch/x86/kernel/head_64.S:241
#8  0x0000000000000000 in ?? ()
(gdb) l
576      */
577     void __cpuidle default_idle(void)
578     {
579             trace_cpu_idle_rcuidle(1, smp_processor_id());
580             safe_halt();
581             trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
582     }
583     #if defined(CONFIG_APM_MODULE) || defined(CONFIG_HALTPOLL_CPUIDLE_MODULE)
584     EXPORT_SYMBOL(default_idle);
585     #endif