Deep Dive into kdump Internals [Repost]


Reposted from: https://dumphex.github.io/2020/02/15/kdump/

This article walks through the kdump-related code and lays out the full kdump flow.

1. Overview

1.1 What is kdump

  • when the kernel crashes, a core dump is created

  • kdump exports /proc/vmcore, making offline analysis of the crash possible

1.2 Kernel roles

  • first kernel (production kernel)

  • second kernel (dump-capture kernel)

1.3 kdump flow

kdump_sub.png

1.4 Version information

  • kernel version: 4.9.38
  • kexec-tools version: v2.0.16
  • architecture: arm64

2. kexec

2.1 kexec overview

2.1.1 kexec features

kexec provides two main features:

  • fast kernel switching

  • kdump

2.1.2 Downloading and building the code

kexec can be downloaded from GitHub:

$ git clone https://github.com/horms/kexec-tools.git
 

After cloning the repository, cross-compile an arm64 executable:

$ ./bootstrap
$ mkdir timzhang
$ ./configure --prefix=/home/timzhang/work/project/github/kexec-tools/timzhang --build=x86_64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
$ make
$ make install
 

2.1.3 Running kexec

  • fast kernel switching
$ sudo ./kexec -l ./Image --initrd=kdump.cpio --dtb=./chip_asic.dtb --append="1 maxcpus=1 reset_devices console=ttyS0,115200 earlycon debug user_debug=31 loglevel=10"

$ sudo ./kexec -e
 
  • kdump
$ sudo ./kexec -p ./Image --initrd=./kdump.cpio --dtb=./chip_asic.dtb --append="1 maxcpus=1 reset_devices console=ttyS0,115200 earlycon debug user_debug=31 loglevel=10"
 

Note: the Image and dtb files should match the currently running kernel version!

2.2 kexec in user space

2.2.1 Configuration parsing

Whether crash-kernel memory has been reserved is checked by is_crashkernel_mem_reserved().

The function mainly:

  • reads /proc/iomem
  • invokes a callback to handle the memory regions of interest: Crash kernel, System RAM, Kernel code, Kernel data, etc.
# cat /proc/iomem 
... ...
105100000-2ffffffff : System RAM
300020000-3ffffffff : System RAM
300080000-300aaffff : Kernel code
300b40000-300c85fff : Kernel data
310000000-313ffffff : Crash kernel
408000000-4ffffffff : System RAM
 

Everything in System RAM except the crash kernel region needs to be dumped.

memory.jpg

2.2.2 Collecting segments

kexec's user-space state is kept in struct kexec_info.

struct kexec_info is defined as follows:

struct kexec_info {
        struct kexec_segment *segment;
        int nr_segments;
        struct memory_range *memory_range;
        int memory_ranges;
        struct memory_range *crash_range;
        int nr_crash_ranges;
        void *entry;
        struct mem_ehdr rhdr;
        unsigned long backup_start;
        unsigned long kexec_flags;
        unsigned long backup_src_start;
        unsigned long backup_src_size;
        /* Set to 1 if we are using kexec file syscall */
        unsigned long file_mode :1;

        /* Filled by kernel image processing code */
        int initrd_fd;
        char *command_line;
        int command_line_len;
};
 

In user space, kexec mainly assembles 5 segments:

  • kernel segment

  • ELF core header segment

  • initrd segment

  • dtb segment

  • purgatory segment

kernel segment

Purpose: read the kernel image specified on the kexec command line

file type   probe               load               usage
vmlinux     elf_arm64_probe     elf_arm64_load     elf_arm64_usage
Image       image_arm64_probe   image_arm64_load   image_arm64_usage
uImage      uImage_arm64_probe  uImage_arm64_load  uImage_arm64_usage

ELF core header segment

Purpose: prepare the ELF core header that kdump later uses to generate vmcore

Building this segment mainly means constructing the ELF header and the program headers.

The program headers are of two kinds, PT_NOTE and PT_LOAD:

program header type  contents
PT_NOTE              cpu notes, vmcoreinfo
PT_LOAD              kernel text, System RAM chunks

Once construction finishes, the start address of the ELF core header is saved in elfcorehdr_mem.

initrd segment

Purpose: read the initrd specified on the kexec command line

dtb segment

Purpose: read the dtb specified on the kexec command line

Properties set:

  • “linux,elfcorehdr”: set to the address of the ELF core header (elfcorehdr_mem)

  • “linux,usable-memory-range”: set to crash_reserved_mem (the memory range usable by the capture kernel)

purgatory segment

Purpose: verify the integrity of the crash kernel segments and jump to the new kernel

The main flow is:

  • the purgatory source files are compiled into purgatory.ro
  • bin-to-hex converts purgatory.ro into purgatory.c (the ELF contents become the purgatory[] array)
  • purgatory.c is compiled together with the other sources into the kexec executable
  • at run time, kexec builds the purgatory relocatable object and places it in the purgatory buffer

purgatory.ro contains three important symbols:

symbol              description
purgatory_start     entry address of purgatory, saved in the info->entry field
arm64_kernel_entry  start address of the capture kernel's text
arm64_dtb_addr      address of the dtb used by the capture kernel
$ aarch64-linux-gnu-readelf -s purgatory.ro.sym |grep -e purgatory_start -e arm64_
66: 0000000000000120 8 NOTYPE GLOBAL DEFAULT 4 arm64_kernel_entry
68: 0000000000000660 36 NOTYPE GLOBAL DEFAULT 1 purgatory_start
69: 0000000000000128 8 NOTYPE GLOBAL DEFAULT 4 arm64_dtb_addr
 

Summary

After the 5 segments are collected, the layout of info.segment[] is as follows:

kexec_info_segment_01.jpg

2.2.3 Updating hashes

update_purgatory() computes the sha256 digest of the 4 segments other than purgatory itself and stores it at the sha256_digest symbol:

$ aarch64-linux-gnu-readelf -s purgatory.ro.sym |grep -e " sha256_regions" -e " sha256_digest"
56: 0000000000000000 256 OBJECT GLOBAL DEFAULT 4 sha256_regions
57: 0000000000000100 32 OBJECT GLOBAL DEFAULT 4 sha256_digest
 

2.2.4 Loading

kexec completes the load through the kexec_load() wrapper around the system call:

static inline long kexec_load(void *entry, unsigned long nr_segments,
                              struct kexec_segment *segments, unsigned long flags)
{
        return (long) syscall(__NR_kexec_load, entry, nr_segments, segments, flags);
}
 

2.3 kexec in kernel space

2.3.1 crashkernel

crashkernel is the memory reserved for the capture kernel.

On a kernel with kdump enabled, the boot log contains lines like:

... ...
crashkernel reserved: 0x0000000310000000 - 0x0000000314000000 (64 MB)
... ...
Kernel command line: console=ttyS0,115200 earlycon user_debug=31 crashkernel=64M@0x310000000
... ...
 

Here the capture kernel is given the starting physical address 0x310000000 and a size of 64MB.

In the kernel this is done by reserve_crashkernel():

static void __init reserve_crashkernel(void)
{
        unsigned long long crash_base, crash_size;
        int ret;

        ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
                                &crash_size, &crash_base);
        /* no crashkernel= or invalid value specified */
        if (ret || !crash_size)
                return;

        crash_size = PAGE_ALIGN(crash_size);

        if (crash_base == 0) {
                /* Current arm64 boot protocol requires 2MB alignment */
                crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
                                                    crash_size, SZ_2M);
                if (crash_base == 0) {
                        pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
                                crash_size);
                        return;
                }
        } else {
                /* User specifies base address explicitly. */
                if (!memblock_is_region_memory(crash_base, crash_size)) {
                        pr_warn("cannot reserve crashkernel: region is not memory\n");
                        return;
                }

                if (memblock_is_region_reserved(crash_base, crash_size)) {
                        pr_warn("cannot reserve crashkernel: region overlaps reserved memory\n");
                        return;
                }

                if (!IS_ALIGNED(crash_base, SZ_2M)) {
                        pr_warn("cannot reserve crashkernel: base address is not 2MB aligned\n");
                        return;
                }
        }
        memblock_reserve(crash_base, crash_size);

        pr_info("crashkernel reserved: 0x%016llx - 0x%016llx (%lld MB)\n",
                crash_base, crash_base + crash_size, crash_size >> 20);

        crashk_res.start = crash_base;
        crashk_res.end = crash_base + crash_size - 1;
}
 

The kernel parses boot_command_line to obtain the crashkernel base address and size, and saves them in the crashk_res resource.

2.3.2 sys_kexec_load

After user space issues the __NR_kexec_load (104) system call, execution eventually reaches sys_kexec_load in the kernel:

SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
struct kexec_segment __user *, segments, unsigned long, flags)
{
int result;

/* We only trust the superuser with rebooting the system. */
if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
return -EPERM;

/*
* Verify we have a legal set of flags
* This leaves us room for future extensions.
*/
if ((flags & KEXEC_FLAGS) != (flags & ~KEXEC_ARCH_MASK))
return -EINVAL;

/* Verify we are on the appropriate architecture */
if (((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH) &&
((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT))
return -EINVAL;

/* Put an artificial cap on the number
* of segments passed to kexec_load.
*/
if (nr_segments > KEXEC_SEGMENT_MAX)
return -EINVAL;

/* Because we write directly to the reserved memory
* region when loading crash kernels we need a mutex here to
* prevent multiple crash kernels from attempting to load
* simultaneously, and to prevent a crash kernel from loading
* over the top of a in use crash kernel.
*
* KISS: always take the mutex.
*/
if (!mutex_trylock(&kexec_mutex))
return -EBUSY;

result = do_kexec_load(entry, nr_segments, segments, flags);

mutex_unlock(&kexec_mutex);

return result;
}
 

Within sys_kexec_load(), the real work is done by do_kexec_load():

static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
struct kexec_segment __user *segments, unsigned long flags)
{
struct kimage **dest_image, *image;
unsigned long i;
int ret;

if (flags & KEXEC_ON_CRASH) {
dest_image = &kexec_crash_image;
if (kexec_crash_image)
arch_kexec_unprotect_crashkres();
} else {
dest_image = &kexec_image;
}

if (nr_segments == 0) {
/* Uninstall image */
kimage_free(xchg(dest_image, NULL));
return 0;
}
if (flags & KEXEC_ON_CRASH) {
/*
* Loading another kernel to switch to if this one
* crashes. Free any current crash dump kernel before
* we corrupt it.
*/
kimage_free(xchg(&kexec_crash_image, NULL));
}

ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags);
if (ret)
return ret;

if (flags & KEXEC_PRESERVE_CONTEXT)
image->preserve_context = 1;

ret = machine_kexec_prepare(image);
if (ret)
goto out;

for (i = 0; i < nr_segments; i++) {
ret = kimage_load_segment(image, &image->segment[i]);
if (ret)
goto out;
}

kimage_terminate(image);

/* Install the new kernel and uninstall the old */
image = xchg(dest_image, image);

out:
if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
arch_kexec_protect_crashkres();

kimage_free(image);
return ret;
}
 

kexec's kernel-side state is stored in kexec_crash_image, which has type struct kimage *.

struct kimage is defined as follows:

struct kimage {
        kimage_entry_t head;
        kimage_entry_t *entry;
        kimage_entry_t *last_entry;

        unsigned long start;
        struct page *control_code_page;
        struct page *swap_page;

        unsigned long nr_segments;
        struct kexec_segment segment[KEXEC_SEGMENT_MAX];

        struct list_head control_pages;
        struct list_head dest_pages;
        struct list_head unusable_pages;

        /* Address of next control page to allocate for crash kernels. */
        unsigned long control_page;

        /* Flags to indicate special processing */
        unsigned int type : 1;
#define KEXEC_TYPE_DEFAULT 0
#define KEXEC_TYPE_CRASH   1
        unsigned int preserve_context : 1;
        /* If set, we are using file mode kexec syscall */
        unsigned int file_mode:1;

#ifdef ARCH_HAS_KIMAGE_ARCH
        struct kimage_arch arch;
#endif

#ifdef CONFIG_KEXEC_FILE
        /* Additional fields for file based kexec syscall */
        void *kernel_buf;
        unsigned long kernel_buf_len;

        void *initrd_buf;
        unsigned long initrd_buf_len;

        char *cmdline_buf;
        unsigned long cmdline_buf_len;

        /* File operations provided by image loader */
        struct kexec_file_ops *fops;

        /* Image loader handling the kernel can store a pointer here */
        void *image_loader_data;

        /* Information for loading purgatory */
        struct purgatory_info purgatory_info;
#endif
};
 

kimage_alloc_init() allocates and initializes the struct kimage; kimage_load_segment() then loads the 5 segments prepared in user space into the crash kernel region.

kexec_info_segment_02.jpg

kexec_crash_image->start is the entry value passed in from user space, i.e. purgatory_start:

image->start = entry
 

2.3.3 kexec sysfs nodes

/sys/kernel/kexec_loaded: whether a kernel has been loaded for fast switching

static ssize_t kexec_loaded_show(struct kobject *kobj,
                                 struct kobj_attribute *attr, char *buf)
{
        return sprintf(buf, "%d\n", !!kexec_image);
}
KERNEL_ATTR_RO(kexec_loaded);
 

/sys/kernel/kexec_crash_loaded: whether a crash kernel has been loaded

static ssize_t kexec_crash_loaded_show(struct kobject *kobj,
                                       struct kobj_attribute *attr, char *buf)
{
        return sprintf(buf, "%d\n", kexec_crash_loaded());
}
KERNEL_ATTR_RO(kexec_crash_loaded);

int kexec_crash_loaded(void)
{
        return !!kexec_crash_image;
}
EXPORT_SYMBOL_GPL(kexec_crash_loaded);
 

/sys/kernel/kexec_crash_size: size of the crash kernel memory

static ssize_t kexec_crash_size_show(struct kobject *kobj,
                                     struct kobj_attribute *attr, char *buf)
{
        return sprintf(buf, "%zu\n", crash_get_memory_size());
}

size_t crash_get_memory_size(void)
{
        size_t size = 0;

        mutex_lock(&kexec_mutex);
        if (crashk_res.end != crashk_res.start)
                size = resource_size(&crashk_res);
        mutex_unlock(&kexec_mutex);
        return size;
}
 

/sys/kernel/vmcoreinfo: address and size of vmcoreinfo_note

static ssize_t vmcoreinfo_show(struct kobject *kobj,
                               struct kobj_attribute *attr, char *buf)
{
        phys_addr_t vmcore_base = paddr_vmcoreinfo_note();

        return sprintf(buf, "%pa %x\n", &vmcore_base,
                       (unsigned int)sizeof(vmcoreinfo_note));
}
KERNEL_ATTR_RO(vmcoreinfo);
 

3. kdump

3.1 Triggering kdump

When the kernel panics, execution eventually reaches __crash_kexec:

/*
 * No panic_cpu check version of crash_kexec(). This function is called
 * only when panic_cpu holds the current CPU number; this is the only CPU
 * which processes crash_kexec routines.
 */
void __crash_kexec(struct pt_regs *regs)
{
        /* Take the kexec_mutex here to prevent sys_kexec_load
         * running on one cpu from replacing the crash kernel
         * we are using after a panic on a different cpu.
         *
         * If the crash kernel was not located in a fixed area
         * of memory the xchg(&kexec_crash_image) would be
         * sufficient. But since I reuse the memory...
         */
        if (mutex_trylock(&kexec_mutex)) {
                if (kexec_crash_image) {
                        struct pt_regs fixed_regs;

                        crash_setup_regs(&fixed_regs, regs);
                        crash_save_vmcoreinfo();
                        machine_crash_shutdown(&fixed_regs);
                        machine_kexec(kexec_crash_image);
                }
                mutex_unlock(&kexec_mutex);
        }
}
 

crash_setup_regs() and machine_crash_shutdown() save the current register state for the vmcore.

crash_save_vmcoreinfo() saves vmcore metadata such as the crash time.

Finally, control enters machine_kexec():

/**
* machine_kexec - Do the kexec reboot.
*
* Called from the core kexec code for a sys_reboot with LINUX_REBOOT_CMD_KEXEC.
*/
void machine_kexec(struct kimage *kimage)
{
phys_addr_t reboot_code_buffer_phys;
void *reboot_code_buffer;
bool in_kexec_crash = (kimage == kexec_crash_image);
bool stuck_cpus = cpus_are_stuck_in_kernel();

clear_abnormal_magic();

/*
* New cpus may have become stuck_in_kernel after we loaded the image.
*/
BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1)));
WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
"Some CPUs may be stale, kdump will be unreliable.\n");

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);

kexec_image_info(kimage);

pr_debug("%s:%d: control_code_page: %p\n", __func__, __LINE__,
kimage->control_code_page);
pr_debug("%s:%d: reboot_code_buffer_phys: %pa\n", __func__, __LINE__,
&reboot_code_buffer_phys);
pr_debug("%s:%d: reboot_code_buffer: %p\n", __func__, __LINE__,
reboot_code_buffer);
pr_debug("%s:%d: relocate_new_kernel: %p\n", __func__, __LINE__,
arm64_relocate_new_kernel);
pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
__func__, __LINE__, arm64_relocate_new_kernel_size,
arm64_relocate_new_kernel_size);

/*
* Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
* after the kernel is shut down.
*/
memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
arm64_relocate_new_kernel_size);

/* Flush the reboot_code_buffer in preparation for its execution. */
__flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
flush_icache_range((uintptr_t)reboot_code_buffer,
arm64_relocate_new_kernel_size);

/* Flush the kimage list and its buffers. */
kexec_list_flush(kimage);

/* Flush the new image if already in place. */
if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE))
kexec_segment_flush(kimage);

pr_info("Bye!\n");

/* Disable all DAIF exceptions. */
asm volatile ("msr daifset, #0xf" : : : "memory");

/*
* cpu_soft_restart will shutdown the MMU, disable data caches, then
* transfer control to the reboot_code_buffer which contains a copy of
* the arm64_relocate_new_kernel routine. arm64_relocate_new_kernel
* uses physical addressing to relocate the new image to its final
* position and transfers control to the image entry point when the
* relocation is complete.
*/

cpu_soft_restart(kimage != kexec_crash_image,
reboot_code_buffer_phys, kimage->head, kimage->start, 0);

BUG(); /* Should never get here. */
}
 

machine_kexec() mainly:

  • copies arm64_relocate_new_kernel into the kimage control code page
  • calls cpu_soft_restart(), passing in the key kimage parameters
static inline void __noreturn cpu_soft_restart(unsigned long el2_switch,
        unsigned long entry, unsigned long arg0, unsigned long arg1,
        unsigned long arg2)
{
        typeof(__cpu_soft_restart) *restart;

        el2_switch = el2_switch && !is_kernel_in_hyp_mode() &&
                     is_hyp_mode_available();
        restart = (void *)virt_to_phys(__cpu_soft_restart);

        cpu_install_idmap();
        restart(el2_switch, entry, arg0, arg1, arg2);
        unreachable();
}
 

It in turn calls __cpu_soft_restart, located in arch/arm64/kernel/relocate_kernel.S:

/*
* __cpu_soft_restart(el2_switch, entry, arg0, arg1, arg2) - Helper for
* cpu_soft_restart.
*
* @el2_switch: Flag to indicate a swich to EL2 is needed.
* @entry: Location to jump to for soft reset.
* arg0: First argument passed to @entry.
* arg1: Second argument passed to @entry.
* arg2: Third argument passed to @entry.
*
* Put the CPU into the same state as it would be if it had been reset, and
* branch to what would be the reset vector. It must be executed with the
* flat identity mapping.
*/
ENTRY(__cpu_soft_restart)
/* Clear sctlr_el1 flags. */
mrs x12, sctlr_el1
ldr x13, =SCTLR_ELx_FLAGS
bic x12, x12, x13
msr sctlr_el1, x12
isb

cbz x0, 1f // el2_switch?
mov x0, #HVC_SOFT_RESTART
hvc #0 // no return

1: mov x18, x1 // entry
mov x0, x2 // arg0
mov x1, x3 // arg1
mov x2, x4 // arg2
br x18
ENDPROC(__cpu_soft_restart)
 

x18 holds arm64_relocate_new_kernel, also in arch/arm64/kernel/relocate_kernel.S:

/*
* arm64_relocate_new_kernel - Put a 2nd stage image in place and boot it.
*
* The memory that the old kernel occupies may be overwritten when coping the
* new image to its final location. To assure that the
* arm64_relocate_new_kernel routine which does that copy is not overwritten,
* all code and data needed by arm64_relocate_new_kernel must be between the
* symbols arm64_relocate_new_kernel and arm64_relocate_new_kernel_end. The
* machine_kexec() routine will copy arm64_relocate_new_kernel to the kexec
* control_code_page, a special page which has been set up to be preserved
* during the copy operation.
*/
ENTRY(arm64_relocate_new_kernel)

/* Setup the list loop variables. */
mov x17, x1 /* x17 = kimage_start */
mov x16, x0 /* x16 = kimage_head */
raw_dcache_line_size x15, x0 /* x15 = dcache line size */
mov x14, xzr /* x14 = entry ptr */
mov x13, xzr /* x13 = copy dest */

/* Clear the sctlr_el2 flags. */
mrs x0, CurrentEL
cmp x0, #CurrentEL_EL2
b.ne 1f
mrs x0, sctlr_el2
ldr x1, =SCTLR_ELx_FLAGS
bic x0, x0, x1
msr sctlr_el2, x0
isb
1:

/* Check if the new image needs relocation. */
tbnz x16, IND_DONE_BIT, .Ldone

.Lloop:
and x12, x16, PAGE_MASK /* x12 = addr */

/* Test the entry flags. */
.Ltest_source:
tbz x16, IND_SOURCE_BIT, .Ltest_indirection

/* Invalidate dest page to PoC. */
mov x0, x13
add x20, x0, #PAGE_SIZE
sub x1, x15, #1
bic x0, x0, x1
2: dc ivac, x0
add x0, x0, x15
cmp x0, x20
b.lo 2b
dsb sy

mov x20, x13
mov x21, x12
copy_page x20, x21, x0, x1, x2, x3, x4, x5, x6, x7

/* dest += PAGE_SIZE */
add x13, x13, PAGE_SIZE
b .Lnext

.Ltest_indirection:
tbz x16, IND_INDIRECTION_BIT, .Ltest_destination

/* ptr = addr */
mov x14, x12
b .Lnext

.Ltest_destination:
tbz x16, IND_DESTINATION_BIT, .Lnext

/* dest = addr */
mov x13, x12

.Lnext:
/* entry = *ptr++ */
ldr x16, [x14], #8

/* while (!(entry & DONE)) */
tbz x16, IND_DONE_BIT, .Lloop

.Ldone:
/* wait for writes from copy_page to finish */
dsb nsh
ic iallu
dsb nsh
isb

/* Start new image. */
mov x0, xzr
mov x1, xzr
mov x2, xzr
mov x3, xzr
br x17

ENDPROC(arm64_relocate_new_kernel)
 

Finally, it branches to kimage->start, which is the info.entry passed from user space, i.e. purgatory_start.

purgatory_start lives in purgatory/arch/arm64/entry.S in the kexec-tools repository:

/*
* ARM64 purgatory.
*/

.macro size, sym:req
.size \sym, . - \sym
.endm

.text

.globl purgatory_start
purgatory_start:

adr x19, .Lstack
mov sp, x19

bl purgatory

/* Start new image. */
ldr x17, arm64_kernel_entry
ldr x0, arm64_dtb_addr
mov x1, xzr
mov x2, xzr
mov x3, xzr
br x17

size purgatory_start

.ltorg

.align 4
.rept 256
.quad 0
.endr
.Lstack:

.data

.align 3

.globl arm64_kernel_entry
arm64_kernel_entry:
.quad 0
size arm64_kernel_entry

.globl arm64_dtb_addr
arm64_dtb_addr:
.quad 0
size arm64_dtb_addr

.end
 

purgatory_start does two things:

  • calls purgatory(), which verifies the sha256 digests of the recorded sha256_regions
  • jumps to the capture kernel entry point, booting the new kernel

3.2 dump vmcore

3.2.1 elfcorehdr

On every boot, the kernel checks whether an elfcorehdr is present. elfcorehdr is the memory reserved for the ELF core header.

The kernel-side implementation is reserve_elfcorehdr():

static void __init reserve_elfcorehdr(void)
{
        of_scan_flat_dt(early_init_dt_scan_elfcorehdr, NULL);

        if (!elfcorehdr_size)
                return;

        if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
                pr_warn("elfcorehdr is overlapped\n");
                return;
        }

        memblock_reserve(elfcorehdr_addr, elfcorehdr_size);

        pr_info("Reserving %lldKB of memory at 0x%llx for elfcorehdr\n",
                elfcorehdr_size >> 10, elfcorehdr_addr);
}
 

early_init_dt_scan_elfcorehdr() looks up the “linux,elfcorehdr” property, which kexec set in the dtb used by the capture kernel:

static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
                const char *uname, int depth, void *data)
{
        const __be32 *reg;
        int len;

        if (depth != 1 || strcmp(uname, "chosen") != 0)
                return 0;

        reg = of_get_flat_dt_prop(node, "linux,elfcorehdr", &len);
        if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
                return 1;

        elfcorehdr_addr = dt_mem_next_cell(dt_root_addr_cells, &reg);
        elfcorehdr_size = dt_mem_next_cell(dt_root_size_cells, &reg);

        return 1;
}
 

The ELF core header location is stored in elfcorehdr_addr/elfcorehdr_size.

3.2.2 vmcore_init

vmcore_init is implemented as follows:

/* Init function for vmcore module. */
static int __init vmcore_init(void)
{
        int rc = 0;

        /* Allow architectures to allocate ELF header in 2nd kernel */
        rc = elfcorehdr_alloc(&elfcorehdr_addr, &elfcorehdr_size);
        if (rc)
                return rc;
        /*
         * If elfcorehdr= has been passed in cmdline or created in 2nd kernel,
         * then capture the dump.
         */
        if (!(is_vmcore_usable()))
                return rc;
        rc = parse_crash_elf_headers();
        if (rc) {
                pr_warn("Kdump: vmcore not initialized\n");
                return rc;
        }
        elfcorehdr_free(elfcorehdr_addr);
        elfcorehdr_addr = ELFCORE_ADDR_ERR;

        proc_vmcore = proc_create("vmcore", S_IRUSR, NULL, &proc_vmcore_operations);
        if (proc_vmcore)
                proc_vmcore->size = vmcore_size;
        return 0;
}
fs_initcall(vmcore_init);
 

vmcore_init mainly:

  • calls is_vmcore_usable() to check whether an ELF core header exists, which decides whether vmcore is available

  • calls parse_crash_elf_headers() to parse the ELF core header

  • calls proc_create() to create the /proc/vmcore node

is_vmcore_usable

/*
 * is_kdump_kernel() checks whether this kernel is booting after a panic of
 * previous kernel or not. This is determined by checking if previous kernel
 * has passed the elf core header address on command line.
 *
 * This is not just a test if CONFIG_CRASH_DUMP is enabled or not. It will
 * return 1 if CONFIG_CRASH_DUMP=y and if kernel is booting after a panic of
 * previous kernel.
 */

static inline int is_kdump_kernel(void)
{
        return (elfcorehdr_addr != ELFCORE_ADDR_MAX) ? 1 : 0;
}

/* is_vmcore_usable() checks if the kernel is booting after a panic and
 * the vmcore region is usable.
 *
 * This makes use of the fact that due to alignment -2ULL is not
 * a valid pointer, much in the vain of IS_ERR(), except
 * dealing directly with an unsigned long long rather than a pointer.
 */

static inline int is_vmcore_usable(void)
{
        return is_kdump_kernel() && elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0;
}
 

parse_crash_elf_headers

  • merges the PT_NOTE program headers into one; the note data is stored in elfnotes_buf
  • stores the PT_LOAD program header information in vmcore_list

vmcore_list is a doubly linked list whose nodes are struct vmcore, defined as follows:

struct vmcore {
        struct list_head list;
        unsigned long long paddr;
        unsigned long long size;
        loff_t offset;
};
 

The vmcore header layout is as follows:

vmcore_01.jpg

proc_create

static inline struct proc_dir_entry *proc_create(
const char *name, umode_t mode, struct proc_dir_entry *parent,
const struct file_operations *proc_fops)
{
return proc_create_data(name, mode, parent, proc_fops, NULL);
}

struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
struct proc_dir_entry *parent,
const struct file_operations *proc_fops,
void *data)
{
struct proc_dir_entry *pde;
if ((mode & S_IFMT) == 0)
mode |= S_IFREG;

if (!S_ISREG(mode)) {
WARN_ON(1); /* use proc_mkdir() */
return NULL;
}

BUG_ON(proc_fops == NULL);

if ((mode & S_IALLUGO) == 0)
mode |= S_IRUGO;
pde = __proc_create(&parent, name, mode, 1);
if (!pde)
goto out;
pde->proc_fops = proc_fops;
pde->data = data;
pde->proc_iops = &proc_file_inode_operations;
if (proc_register(parent, pde) < 0)
goto out_free;
return pde;
out_free:
kfree(pde);
out:
return NULL;
}
 

After the /proc/vmcore node is created, it is served by proc_vmcore_operations:

static const struct file_operations proc_vmcore_operations = {
        .read   = read_vmcore,
        .llseek = default_llseek,
        .mmap   = mmap_vmcore,
};
 

Reading the node ends up in read_vmcore() -> __read_vmcore():

/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
int userbuf)
{
ssize_t acc = 0, tmp;
size_t tsz;
u64 start;
struct vmcore *m = NULL;

if (buflen == 0 || *fpos >= vmcore_size)
return 0;

/* trim buflen to not go beyond EOF */
if (buflen > vmcore_size - *fpos)
buflen = vmcore_size - *fpos;

/* Read ELF core header */
if (*fpos < elfcorebuf_sz) {
tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
buffer += tsz;
acc += tsz;

/* leave now if filled buffer already */
if (buflen == 0)
return acc;
}

/* Read Elf note segment */
if (*fpos < elfcorebuf_sz + elfnotes_sz) {
void *kaddr;

tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
if (copy_to(buffer, kaddr, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
buffer += tsz;
acc += tsz;

/* leave now if filled buffer already */
if (buflen == 0)
return acc;
}

list_for_each_entry(m, &vmcore_list, list) {
if (*fpos < m->offset + m->size) {
tsz = (size_t)min_t(unsigned long long,
m->offset + m->size - *fpos,
buflen);
start = m->paddr + *fpos - m->offset;
tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
if (tmp < 0)
return tmp;
buflen -= tsz;
*fpos += tsz;
buffer += tsz;
acc += tsz;

/* leave now if filled buffer already */
if (buflen == 0)
return acc;
}
}

return acc;
}
 

The function reads the ELF header, elfnotes_buf and the vmcore_list regions in order, producing the complete vmcore file.

vmcore_02.jpg

3.2.3 Exporting vmcore

Once the new kernel has booted, /proc/vmcore can be exported directly; for example, after mounting a filesystem, compress /proc/vmcore onto local disk:

$ tar -czf /mnt/vmcore.tar.gz /proc/vmcore
 

3.3 Analyzing vmcore

Once /proc/vmcore has been exported, the crash utility is typically used for offline analysis:

$ crash vmlinux proc/vmcore 

crash 7.2.5
Copyright (C) 2002-2019 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-unknown-linux-gnu --target=aarch64-elf-linux"...
Redefine command "pstring"? (y or n) [answered Y; input not from terminal]

KERNEL: vmlinux
DUMPFILE: proc/vmcore
CPUS: 8
DATE: Mon Oct 28 21:44:54 2019
UPTIME: 00:29:23
LOAD AVERAGE: 0.00, 0.00, 0.00
TASKS: 176
NODENAME: chiptim
RELEASE: 4.9.38-chip_v1.0.0-00253-gece0e28-dirty
VERSION: #1 SMP Thu Aug 22 21:27:26 HKT 2019
MACHINE: aarch64 (unknown Mhz)
MEMORY: 15.8 GB
PANIC: "sysrq: SysRq : Trigger a crash"
PID: 978
COMMAND: "bash"
TASK: ffffffc2ea91bc00 [THREAD_INFO: ffffffc2ea91bc00]
CPU: 0
STATE: TASK_RUNNING (SYSRQ)

crash> bt
PID: 978 TASK: ffffffc2ea91bc00 CPU: 0 COMMAND: "bash"
#0 [ffffffc2e5253880] machine_kexec at ffffff80080940e0
#1 [ffffffc2e52538e0] __crash_kexec at ffffff800811d430
#2 [ffffffc2e5253a30] __crash_kexec at ffffff800811d4e8
#3 [ffffffc2e5253a50] crash_kexec at ffffff800811d558
#4 [ffffffc2e5253a70] die at ffffff8008088db4
#5 [ffffffc2e5253ab0] __do_kernel_fault at ffffff8008099c64
#6 [ffffffc2e5253ae0] do_page_fault at ffffff8008097560
#7 [ffffffc2e5253b50] do_translation_fault at ffffff8008097668
#8 [ffffffc2e5253b60] do_mem_abort at ffffff8008081294
#9 [ffffffc2e5253d40] el1_ia at ffffff800808260c
PC: ffffff80084915ac [sysrq_handle_crash+20]
LR: ffffff800849202c [__handle_sysrq+284]
SP: ffffffc2e5253d40 PSTATE: 00000145
X29: ffffffc2e5253d40 X28: ffffffc2ea91bc00 X27: ffffff8008852000
X26: 0000000000000040 X25: 0000000000000123 X24: 0000000000000015
X23: 0000000000000000 X22: 0000000000000007 X21: ffffff8008bb0da8
X20: 0000000000000063 X19: ffffff8008b6c000 X18: 0000000000000000
X17: 0000007f9d45b120 X16: ffffff80081d5700 X15: ffffffffffffffff
X14: 0000000000000000 X13: 0000000000000007 X12: 0000000000000161
X11: 0000000000000006 X10: 0000000000000161 X9: 0000000000000001
X8: ffffff800839a988 X7: 0000000000000008 X6: ffffff8008db3c08
X5: 0000000000000000 X4: 0000000000000000 X3: 0000000000000000
X2: ffffffc2ffb0e700 X1: 0000000000000000 X0: 0000000000000001
#10 [ffffffc2e5253d80] write_sysrq_trigger at ffffff8008492494
#11 [ffffffc2e5253da0] proc_reg_write at ffffff8008234090
#12 [ffffffc2e5253dc0] __vfs_write at ffffff80081d3758
#13 [ffffffc2e5253e40] vfs_write at ffffff80081d4518
#14 [ffffffc2e5253e80] sys_write at ffffff80081d5740
#15 [ffffffc2e5253ec0] el0_svc_naked at ffffff8008082f2c
PC: 0000007f9d4af078 LR: 0000007f9d45e2f8 SP: 0000007fdabec300
X29: 0000007fdabec300 X28: 0000007fdabec484 X27: 0000000000000000
X26: 0000000000000000 X25: 0000000000000000 X24: 0000000000000002
X23: 0000007f9d534638 X22: 0000000000000002 X21: 0000007f9d538480
X20: 0000000000516808 X19: 0000000000000002 X18: 0000000000000000
X17: 0000007f9d45b120 X16: 0000000000000000 X15: 0000000000000000
X14: 0000000000000000 X13: 0000000000000000 X12: 0000000000000000
X11: 0000000000000020 X10: 0000000000000000 X9: 0000000000000000
X8: 0000000000000040 X7: 0000000000000001 X6: 0000000000000063
X5: 5551000454000000 X4: 0000000000000888 X3: 0000000000000000
X2: 0000000000000002 X1: 0000000000516808 X0: 0000000000000001
ORIG_X0: 0000000000000001 SYSCALLNO: 40 PSTATE: 20000000
 

