Linux kernel boot source code analysis


The kernel boot process starts from the start_kernel() function in main.c, which lives in the init directory of the Linux source tree.

Let's take a look at this function. It is long, so a quick read-through to get the general shape is enough:

asmlinkage __visible void __init start_kernel(void)
{
    char *command_line;
    char *after_dashes;

    set_task_stack_end_magic(&init_task);
    smp_setup_processor_id();
    debug_objects_early_init();

    cgroup_init_early();

    local_irq_disable();
    early_boot_irqs_disabled = true;

    /*
     * Interrupts are still disabled. Do necessary setups, then
     * enable them.
     */
    boot_cpu_init();
    page_address_init();
    pr_notice("%s", linux_banner);
    setup_arch(&command_line);
    /*
     * Set up the the initial canary and entropy after arch
     * and after adding latent and command line entropy.
     */
    add_latent_entropy();
    add_device_randomness(command_line, strlen(command_line));
    boot_init_stack_canary();
    mm_init_cpumask(&init_mm);
    setup_command_line(command_line);
    setup_nr_cpu_ids();
    setup_per_cpu_areas();
    smp_prepare_boot_cpu();    /* arch-specific boot-cpu hooks */
    boot_cpu_hotplug_init();

    build_all_zonelists(NULL);
    page_alloc_init();

    pr_notice("Kernel command line: %s\n", boot_command_line);
    parse_early_param();
    after_dashes = parse_args("Booting kernel",
                  static_command_line, __start___param,
                  __stop___param - __start___param,
                  -1, -1, NULL, &unknown_bootoption);
    if (!IS_ERR_OR_NULL(after_dashes))
        parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,
               NULL, set_init_arg);

    jump_label_init();

    /*
     * These use large bootmem allocations and must precede
     * kmem_cache_init()
     */
    setup_log_buf(0);
    vfs_caches_init_early();
    sort_main_extable();
    trap_init();
    mm_init();

    ftrace_init();

    /* trace_printk can be enabled here */
    early_trace_init();

    /*
     * Set up the scheduler prior starting any interrupts (such as the
     * timer interrupt). Full topology setup happens at smp_init()
     * time - but meanwhile we still have a functioning scheduler.
     */
    sched_init();
    /*
     * Disable preemption - early bootup scheduling is extremely
     * fragile until we cpu_idle() for the first time.
     */
    preempt_disable();
    if (WARN(!irqs_disabled(),
         "Interrupts were enabled *very* early, fixing it\n"))
        local_irq_disable();
    radix_tree_init();

    /*
     * Set up housekeeping before setting up workqueues to allow the unbound
     * workqueue to take non-housekeeping into account.
     */
    housekeeping_init();

    /*
     * Allow workqueue creation and work item queueing/cancelling
     * early.  Work item execution depends on kthreads and starts after
     * workqueue_init().
     */
    workqueue_init_early();

    rcu_init();

    /* Trace events are available after this */
    trace_init();

    if (initcall_debug)
        initcall_debug_enable();

    context_tracking_init();
    /* init some links before init_ISA_irqs() */
    early_irq_init();
    init_IRQ();
    tick_init();
    rcu_init_nohz();
    init_timers();
    hrtimers_init();
    softirq_init();
    timekeeping_init();
    time_init();
    printk_safe_init();
    perf_event_init();
    profile_init();
    call_function_init();
    WARN(!irqs_disabled(), "Interrupts were enabled early\n");

    early_boot_irqs_disabled = false;
    local_irq_enable();

    kmem_cache_init_late();

    /*
     * HACK ALERT! This is early. We're enabling the console before
     * we've done PCI setups etc, and console_init() must be aware of
     * this. But we do want output early, in case something goes wrong.
     */
    console_init();
    if (panic_later)
        panic("Too many boot %s vars at `%s'", panic_later,
              panic_param);

    lockdep_init();

    /*
     * Need to run this when irqs are enabled, because it wants
     * to self-test [hard/soft]-irqs on/off lock inversion bugs
     * too:
     */
    locking_selftest();

    /*
     * This needs to be called before any devices perform DMA
     * operations that might use the SWIOTLB bounce buffers. It will
     * mark the bounce buffers as decrypted so that their usage will
     * not cause "plain-text" data to be decrypted when accessed.
     */
    mem_encrypt_init();

#ifdef CONFIG_BLK_DEV_INITRD
    if (initrd_start && !initrd_below_start_ok &&
        page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
        pr_crit("initrd overwritten (0x%08lx < 0x%08lx) - disabling it.\n",
            page_to_pfn(virt_to_page((void *)initrd_start)),
            min_low_pfn);
        initrd_start = 0;
    }
#endif
    kmemleak_init();
    setup_per_cpu_pageset();
    numa_policy_init();
    acpi_early_init();
    if (late_time_init)
        late_time_init();
    sched_clock_init();
    calibrate_delay();
    pid_idr_init();
    anon_vma_init();
#ifdef CONFIG_X86
    if (efi_enabled(EFI_RUNTIME_SERVICES))
        efi_enter_virtual_mode();
#endif
    thread_stack_cache_init();
    cred_init();
    fork_init();
    proc_caches_init();
    uts_ns_init();
    buffer_init();
    key_init();
    security_init();
    dbg_late_init();
    vfs_caches_init();
    pagecache_init();
    signals_init();
    seq_file_init();
    proc_root_init();
    nsfs_init();
    cpuset_init();
    cgroup_init();
    taskstats_init_early();
    delayacct_init();

    check_bugs();

    acpi_subsystem_init();
    arch_post_acpi_subsys_init();
    sfi_init_late();

    /* Do the rest non-__init'ed, we're now alive */
    arch_call_rest_init();
}

In this function you can see a long series of init calls, i.e. initializations of the various subsystems. We will only walk through a few of the key steps.

First, look at set_task_stack_end_magic(&init_task); it writes a magic value at the end of init_task's kernel stack so that a later stack overflow can be detected.

In Linux every process is created by a parent process, so when the kernel boots there has to be an ancestor process that the system itself creates. init_task is that first process, known as process 0, and it is the only process that is not created through fork() or kernel_thread().
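For reference, the function itself is tiny. A sketch of its definition, based on kernel/fork.c in kernels of roughly this vintage (the exact form may differ between versions):

/* kernel/fork.c (abridged): plant a magic value at the end of the task's
 * kernel stack; if it is ever overwritten, the stack has overflowed. */
void set_task_stack_end_magic(struct task_struct *tsk)
{
    unsigned long *stackend;

    stackend = end_of_stack(tsk);
    *stackend = STACK_END_MAGIC;    /* for stack-overflow detection */
}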

 

Next comes system call and exception setup, trap_init(). It installs a large number of interrupt and trap gates in the IDT, used to handle the various exceptions and interrupts; system calls also enter the kernel through this kind of trap/interrupt mechanism.
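To make the "system call through an interrupt" idea concrete, here is a minimal user-space sketch. It assumes a 32-bit x86 build (gcc -m32) and issues write(2) through the legacy int 0x80 software-interrupt gate; 64-bit programs normally use the dedicated syscall instruction instead.

/* Sketch, 32-bit x86 only (build with: gcc -m32 int80.c):
 * call write(2) by raising software interrupt 0x80. */
int main(void)
{
    const char msg[] = "hello from int 0x80\n";
    long ret;

    asm volatile ("int $0x80"
                  : "=a" (ret)              /* return value comes back in eax */
                  : "a" (4),                /* 4 = __NR_write on 32-bit x86 */
                    "b" (1),                /* ebx: fd 1 = stdout */
                    "c" (msg),              /* ecx: buffer */
                    "d" (sizeof(msg) - 1)   /* edx: length */
                  : "memory");
    return ret == (long)(sizeof(msg) - 1) ? 0 : 1;
}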

Next comes the initialization of the memory management subsystem; the corresponding function is mm_init();
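mm_init() itself is mostly a sequence of lower-level initializers. An abridged sketch of its body in init/main.c for kernels of roughly this era (the exact calls vary by version and configuration):

/* init/main.c (abridged, version-dependent) */
static void __init mm_init(void)
{
    page_ext_init_flatmem();    /* page_ext needs memory before the buddy allocator is up */
    mem_init();                 /* hand free memory over to the buddy allocator */
    kmem_cache_init();          /* slab allocator */
    pgtable_init();             /* page-table caches */
    vmalloc_init();             /* vmalloc area bookkeeping */
    ioremap_huge_init();
}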

 

 

Then comes scheduler initialization, sched_init(). What is the scheduler for? It is how the operating system arbitrates between processes and CPUs: it decides which process gets to run on which CPU, and if a process has been hogging a CPU for too long, the OS kicks it off and lets another process run.
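As a user-space illustration of what that buys you, here is a small sketch (Linux with glibc assumed): two CPU-bound children are pinned to the same CPU and never yield, yet both keep making progress, because the scheduler preempts whichever one is currently running.

/* Sketch: pin two busy-looping children to CPU 0 and watch the scheduler
 * time-slice between them even though neither ever yields. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void busy_child(int id)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);    /* force both children onto CPU 0 */

    for (int round = 0; round < 5; round++) {
        for (volatile unsigned long spin = 0; spin < 200000000UL; spin++)
            ;                                   /* burn CPU, never yield */
        printf("child %d finished round %d\n", id, round);
    }
    _exit(0);
}

int main(void)
{
    for (int id = 0; id < 2; id++)
        if (fork() == 0)
            busy_child(id);
    while (wait(NULL) > 0)
        ;
    return 0;
}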

 

Then we reach preempt_disable(). This call disables kernel preemption: after it runs, unless the current task voluntarily gives up the CPU, no other task can preempt it (interrupts themselves are still handled).
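For reference, preempt_disable() is a very cheap operation; with preemption counting enabled it is essentially a counter increment plus a compiler barrier. A sketch of the macro from include/linux/preempt.h (exact form is version- and config-dependent):

/* include/linux/preempt.h (abridged): bump the preemption counter; while it
 * is non-zero the scheduler will not preempt the current task. */
#define preempt_disable() \
do { \
    preempt_count_inc(); \
    barrier(); \
} while (0)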

 

Now look at tick_init(), which initializes the tick infrastructure. What is the tick? A hardware timer notifies the operating system at a fixed period, like a clock going tick, tick: every tick means one more period has elapsed. On each tick the OS checks whether the process currently running on the CPU has been running for too long; if it has, the process is marked as needing to be rescheduled, and at a suitable point it is switched out and another process gets the CPU.
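The tick is also the unit in which the kernel accounts and reports CPU time to user space. A small sketch (Linux assumed) that prints this reporting unit via sysconf(); note that the user-visible USER_HZ value (typically 100) is fixed by the ABI and is separate from the kernel's internal tick frequency CONFIG_HZ, which drives the scheduling tick described above:

/* Sketch: print the tick unit used for reporting CPU time, then show this
 * process's own user CPU time measured in those ticks. */
#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

int main(void)
{
    long user_hz = sysconf(_SC_CLK_TCK);
    struct tms t;

    for (volatile unsigned long spin = 0; spin < 100000000UL; spin++)
        ;                                   /* burn a little CPU time */

    times(&t);
    printf("USER_HZ = %ld ticks per second\n", user_hz);
    printf("user CPU time so far: %ld ticks (~%.2f s)\n",
           (long)t.tms_utime, (double)t.tms_utime / user_hz);
    return 0;
}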

 

The last thing start_kernel() does is call rest_init() (reached here through arch_call_rest_init()) to take care of the remaining initialization, and it does quite a lot:

noinline void __ref rest_init(void)
{
    struct task_struct *tsk;
    int pid;

    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    pid = kernel_thread(kernel_init, NULL, CLONE_FS);
    /*
     * Pin init on the boot CPU. Task migration is not properly working
     * until sched_init_smp() has been run. It will set the allowed
     * CPUs for init to the non isolated CPUs.
     */
    rcu_read_lock();
    tsk = find_task_by_pid_ns(pid, &init_pid_ns);
    set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
    rcu_read_unlock();

    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();

    /*
     * Enable might_sleep() and smp_processor_id() checks.
     * They cannot be enabled earlier because with CONFIG_PREEMPT=y
     * kernel_thread() would trigger might_sleep() splats. With
     * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
     * already, but it's stuck on the kthreadd_done completion.
     */
    system_state = SYSTEM_SCHEDULING;

    complete(&kthreadd_done);

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    schedule_preempt_disabled();
    /* Call into cpu_idle with preempt disabled */
    cpu_startup_entry(CPUHP_ONLINE);
}

 

First it calls kernel_thread() with kernel_init as the entry point, creating what becomes the first user-mode process, the ancestor of all user-mode processes, known as process 1.

Once this process 1 reaches user mode it branches out: it creates many child processes, those children create children of their own, and the result is the process tree.
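What does process 1 actually execute? Its kernel-side entry is kernel_init(), which finishes the remaining setup and then execs the first user-space program. An abridged sketch of the tail end of kernel_init() in init/main.c (kernels of this era; an initramfs /init or an explicit init= on the command line is tried before this fallback list):

/* init/main.c, kernel_init() (abridged): exec the first user-space program;
 * from the successful exec onward, process 1 runs in user mode as init. */
if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    !try_to_run_init_process("/bin/sh"))
    return 0;

panic("No working init found.  Try passing init= option to kernel. "
      "See Linux Documentation/admin-guide/init.rst for guidance.");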

Once there are user processes, access to resources has to be mediated. For example, a user-mode process that wants to send data through the network card cannot drive the card directly; instead it goes through a system call provided by the operating system, the kernel sends the data on its behalf, and when that is done control returns to the user-mode process along with the result. The low-level details are encapsulated: whatever a user-mode process wants done, it just issues the corresponding system call. When a user-mode process makes a system call, its current register state is saved. If registers are unfamiliar, think of them as the CPU's variables: variables are where a programming language keeps its data, registers are where the CPU keeps its data. When the system call returns from kernel mode to user mode, the saved register state is restored and the process carries on running.

So the whole flow is: user mode → system call → save registers → run the system call in kernel mode → restore registers → return to user mode and continue.
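From the application's point of view that whole save/restore dance is invisible. A minimal sketch (Linux with glibc assumed) that issues a system call through the generic syscall(2) wrapper:

/* Sketch: call getpid(2) via the raw syscall(2) wrapper. The register save
 * on kernel entry and restore on return are done by the kernel; to user
 * code it looks like an ordinary function call. */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    long pid = syscall(SYS_getpid);    /* user mode -> kernel mode -> back */

    printf("my pid is %ld\n", pid);
    return 0;
}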

 

 

 

Back to how process 1 gets started. At this point it is still running in kernel mode, so how does it get into user mode? The usual direction is user mode → kernel mode → back to user mode; starting in kernel mode and then dropping into user mode, as happens here, is the unusual case. Look at the following code:

void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
    set_user_gs(regs, 0);
    regs->fs    = 0;
    regs->ds    = __USER_DS;
    regs->es    = __USER_DS;
    regs->ss    = __USER_DS;
    regs->cs    = __USER_CS;
    regs->ip    = new_ip;
    regs->sp    = new_sp;
    regs->flags = X86_EFLAGS_IF;
    force_iret();
}
EXPORT_SYMBOL_GPL(start_thread);

 

At the end of setting up the new program, start_thread() is called. It fills the saved register frame (pt_regs) with user-mode values: the code segment CS is set to __USER_CS, the data and stack segments DS/ES/SS are set to __USER_DS, the instruction pointer IP is set to the new user-mode entry point and the stack pointer SP to the new user-mode stack top. The final force_iret() forces the return path that restores the full register set. Normally what gets restored there is the state saved when the process entered the kernel via a system call; here, what gets "restored" is exactly the state we just wrote. With CS and IP restored they point at the next instruction to execute in user space, and with SS and SP restored they point at the top of the user-mode stack, so the next instruction runs in user mode.
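There is a user-space way to watch the same mechanism at work: when a process calls execve(), the kernel's ELF loader ends up calling start_thread() to aim the saved CS:IP and SS:SP at the new program, and execution resumes in user mode inside the new image. This is also essentially what process 1 does with /sbin/init. A minimal sketch:

/* Sketch: execve() replaces this process's image; on success it never
 * returns here - execution continues at the new program's entry point,
 * which the kernel set up via start_thread(). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char *argv[] = { "/bin/echo", "now running in the new image", NULL };
    char *envp[] = { NULL };

    execve("/bin/echo", argv, envp);
    perror("execve");                  /* reached only if the exec failed */
    return 1;
}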

 

The ancestor of all user-mode processes has now been created. Is there an ancestor on the kernel side as well? There is: the second big thing rest_init() does is start another kernel thread, kthreadd, which becomes process 2, the ancestor of all kernel threads.
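You can check this on a running system; a quick sketch that reads /proc/2/comm (on a typical mainline Linux box this prints kthreadd, though containers and unusual configurations may not show it):

/* Sketch: PID 2 is normally kthreadd, the ancestor of all kernel threads
 * (ps shows kernel threads in brackets, e.g. "[kthreadd]"). */
#include <stdio.h>

int main(void)
{
    char name[64];
    FILE *f = fopen("/proc/2/comm", "r");

    if (f && fgets(name, sizeof(name), f))
        printf("PID 2 is: %s", name);    /* expected: kthreadd */
    else
        perror("/proc/2/comm");
    if (f)
        fclose(f);
    return 0;
}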

Read more: https://www.toutiao.com/c/user/83293539887/#mid=1633933053814798

