1. Experiment Requirements
- Pick a system call whose number matches the last two digits of my student ID.
- Trigger that system call with an assembly instruction.
- Trace the kernel's handling of the system call with gdb.
- Focus on how the system-call entry saves the execution context, how the context is restored, and how the system call returns; pay particular attention to how the kernel stack changes during the system call.
2. Theoretical Background
2.1 Overall Linux Architecture Diagram
2.2 System Calls
A system call is the smallest functional unit the operating system exposes. Depending on the application scenario, different Linux distributions provide different numbers of system calls, roughly between 240 and 350. Together these system calls form the basic interface between user space and the kernel. For example, for user-space code to request a 20 KB block of dynamic memory, the brk system call is needed to move the program break (the end of the data segment). But what if user code requests 20 KB in several places and frees memory at the same time? Managing that memory becomes very complex.
2.3 Library Functions
Typical examples are open(), write(), read(), and so on. Library functions also come in different versions depending on the standard they follow, for example the glibc library and the posix library.
2.4 User Mode and Kernel Mode
At the top level, the Linux architecture divides into user mode and kernel mode. A computer's hardware resources are finite, and to reduce conflicts over accessing and using them, the CPU and the operating system must provide mechanisms that restrict what user programs are allowed to do. Modern CPUs generally offer several instruction execution levels; that is, which instructions a program may execute is a matter of privilege. At a high execution level, code can execute privileged instructions and access arbitrary memory; this level corresponds to kernel mode, where all instructions, including privileged ones, may run. Correspondingly, in user mode (the low-privilege level), what code can do is restricted.

Why is this necessary? It is easy to see: if there were no privilege levels, all code written by any programmer could use privileged instructions, and the system would crash easily, because not every programmer writes robust code, and code might illegally access the resources of other processes or even the kernel, creating security problems. Privilege separation is thus one of the mechanisms operating systems developed to guarantee stability and security: it makes it hard for ordinary user-mode code to bring the whole system down, while kernel code, written by more specialized programmers and subjected to systematic testing, is comparatively more stable and robust.
2.5 Interrupts
Interrupts divide into external interrupts (hardware interrupts) and internal interrupts (software interrupts); internal interrupts are also called exceptions, which further divide into faults and traps. A system call uses a trap, a form of software interrupt, to enter kernel mode from user mode deliberately.

In general, the transition from user mode to kernel mode is triggered by an interrupt. It may be a hardware interrupt: while a user-mode process is executing, a hardware interrupt signal arrives, the CPU enters kernel mode, and the interrupt service routine for that interrupt runs. Or a user-mode program may, during execution, make a system call and trap into kernel mode; this is called a trap (a system call is a special kind of interrupt).
2.6 The Relationship between the Programming Interface (API) and System Calls
The library functions that wrap system calls are the API (application programming interface) the operating system provides to us; an API is just a function definition. A system call issues a service request to the kernel through a specific software interrupt (trap): executing the int $0x80 or syscall instruction triggers a system call. C library functions internally use system-call wrapper routines, whose main purpose is to issue the system call so that programmers do not have to use assembly instructions and pass arguments in registers themselves. Generally each system call has one wrapper routine, and the library uses these wrappers to define the APIs exposed to programmers; in this way, system calls are ultimately packaged as convenient C library functions.

Some C library APIs provide purely user-mode services and never need to interact with the kernel, for example some math functions; but any C library API that must interact with kernel space wraps a system call internally. One API may correspond to exactly one system call or be implemented by several system calls internally, and one system call may be used by several APIs. APIs that do not interact with the kernel wrap no system call at all, for example the absolute-value function abs(). As for return values, most system-call wrapper routines return an integer whose meaning depends on the particular system call; a return value of -1 usually means the kernel could not satisfy the process's request, and the errno variable defined by the C library holds the specific error code.
2.7 Linux System Calls
When a user-mode process makes a system call, the CPU switches to kernel mode and starts executing the system_call assembly code (entry_INT80_32 or entry_SYSCALL_64), which dispatches to the corresponding kernel handler based on the system call number. Concretely, on Linux a system call is triggered by executing the int $0x80 or syscall instruction; the int $0x80 instruction raises a programmed exception (trap) with interrupt vector 128. Intel processors also introduced the sysenter instruction ("fast system call"), but since it is Intel-specific and AMD does not support it, it is not covered here. We focus only on system calls triggered by the int and syscall instructions: after entering the kernel, the corresponding entry code, entry_INT80_32 or entry_SYSCALL_64, begins executing.
3. Setting Up the Experiment Environment
3.1 Install Development Tools
sudo apt install build-essential
sudo apt install qemu # install QEMU
sudo apt install libncurses5-dev bison flex libssl-dev libelf-dev
3.2 Download the Kernel Source
sudo apt install axel
axel -n 20 https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.34.tar.xz
xz -d linux-5.4.34.tar.xz
tar -xvf linux-5.4.34.tar
cd linux-5.4.34
3.3 Configure Kernel Options
make defconfig # Default configuration is based on 'x86_64_defconfig'
make menuconfig
# enable the debug-related options
Kernel hacking --->
    Compile-time checks and compiler options --->
        [*] Compile the kernel with debug info
        [*] Provide GDB scripts for kernel debugging
    [*] Kernel debugging
# disable KASLR, otherwise breakpoints will fail to hit
Processor type and features --->
    [ ] Randomize the address of the kernel image (KASLR)
3.4 Build and Run the Kernel
make -j$(nproc) # nproc gives the number of CPU cores/threads available
# test whether the kernel can boot; with no file system it will end in a kernel panic
qemu-system-x86_64 -kernel arch/x86/boot/bzImage # it will not boot all the way at this point
Since there is no file system yet, the boot eventually ends in a kernel panic; this is expected.
3.5 Build the Root File System
First download the busybox source from https://www.busybox.net and extract it; then, as with the kernel, configure, build, and install it.
axel -n 20 https://busybox.net/downloads/busybox-1.31.1.tar.bz2
tar -jxvf busybox-1.31.1.tar.bz2
cd busybox-1.31.1
make menuconfig # if menuconfig fails to open, restarting the VM may help
# remember to build a static binary rather than using shared libraries
Settings --->
    [*] Build static binary (no shared libs)
# then build and install; by default this installs into the _install directory under the source tree
make -j$(nproc) && make install
Build the in-memory root file system image:
mkdir rootfs
cd rootfs
cp ../busybox-1.31.1/_install/* ./ -rf
mkdir dev proc sys home
sudo cp -a /dev/{null,console,tty,tty1,tty2,tty3,tty4} dev/
Prepare an init script in the root directory of the root file system (rootfs/init) with the following content:
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
echo "Welcome MengningOS!"
echo "--------------------"
cd home
/bin/sh
Make the init script executable:
chmod +x init
# pack the in-memory root file system image
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
# test mounting the root file system: check whether the init script runs after the kernel boots
qemu-system-x86_64 -kernel linux-5.4.34/arch/x86/boot/bzImage -initrd rootfs.cpio.gz
4. The System Call
The last two digits of my student ID are 03. Looking it up in linux-5.4.34/arch/x86/entry/syscalls/syscall_32.tbl, number 3 is the read system call.
View the read() function with the man read command.
Write read.c:
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    char buf[10];
    int n = read(STDIN_FILENO, buf, sizeof(buf) - 1);
    if (n >= 0)
        buf[n] = '\0';   /* read() does not NUL-terminate the buffer */
    printf("%d: %s\n", n, buf);
    return 0;
}
Compile and run.
Now trigger the system call with inline assembly instead: vi read_asm.c
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#define SYS_READ 3 /* system call number 3 in syscall_32.tbl */

int main(void)
{
    char buff[10];
    ssize_t charsread;

    /* int $0x80: eax = syscall number, ebx/ecx/edx = the arguments;
     * the return value comes back in eax */
    asm volatile("int $0x80"
                 : "=a" (charsread)
                 : "0" (SYS_READ), "b" (STDIN_FILENO), "c" (buff),
                   "d" (sizeof(buff) - 1)
                 : "memory", "cc");

    if (charsread >= 0)
        buff[charsread] = '\0'; /* read() does not NUL-terminate the buffer */
    printf("%d: %s", (int)charsread, buff);

    return 0;
}
Compile and run.
Analysis: the read system call reads input (file contents) into buf and returns the number of bytes read.
5. Tracing the Kernel Handling of the read System Call with gdb
find . -print0 | cpio --null -ov --format=newc | gzip -9 > ../rootfs.cpio.gz
qemu-system-x86_64 -kernel linux-5.4.34/arch/x86/boot/bzImage -initrd rootfs.cpio.gz -S -s -nographic -append "console=ttyS0"
# open a new terminal
cd linux-5.4.34
gdb vmlinux
target remote:1234
# set a breakpoint
b __ia32_sys_read
(gdb) target remote:1234
Remote debugging using :1234
default_idle () at arch/x86/kernel/process.c:581
581             trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
(gdb) b __ia32_sys_read
Breakpoint 1 at 0xffffffff811d0b50: file fs/read_write.c, line 597.
(gdb) bt
#0  default_idle () at arch/x86/kernel/process.c:581
#1  0xffffffff81095ebe in cpuidle_idle_call () at kernel/sched/idle.c:154
#2  do_idle () at kernel/sched/idle.c:263
#3  0xffffffff810960f4 in cpu_startup_entry (state=CPUHP_ONLINE) at kernel/sched/idle.c:355
#4  0xffffffff81a7eec5 in rest_init () at init/main.c:451
#5  0xffffffff829aeab7 in arch_call_rest_init () at init/main.c:573
#6  0xffffffff829aef74 in start_kernel () at init/main.c:785
#7  0xffffffff810000d4 in secondary_startup_64 () at arch/x86/kernel/head_64.S:241
#8  0x0000000000000000 in ?? ()
(gdb) l
576      */
577     void __cpuidle default_idle(void)
578     {
579             trace_cpu_idle_rcuidle(1, smp_processor_id());
580             safe_halt();
581             trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
582     }
583     #if defined(CONFIG_APM_MODULE) || defined(CONFIG_HALTPOLL_CPUIDLE_MODULE)
584     EXPORT_SYMBOL(default_idle);
585     #endif