ESR information printed on a kernel exception
<1>[ 7766.006249] Unhandled fault at 0xffffff800188d408
<1>[ 7766.006256] Mem abort info:
<1>[ 7766.006259] ESR = 0x86000003
<1>[ 7766.006264] Exception class = IABT (current EL), IL = 32 bits
<1>[ 7766.006268] SET = 0, FnV = 0
<1>[ 7766.006271] EA = 0, S1PTW = 0
<1>[ 7766.006277] swapper pgtable: 4k pages, 39-bit VAs, pgdp = 00000000352033d5
<1>[ 7766.006281] [ffffff800188d408] pgd=000000009d7fe003, pud=000000009d7fe003, pmd=00000000625c6003, pte=0040080063544793
<0>[ 7766.006294] Internal error: level 3 address size fault: 86000003 [#1] PREEMPT SMP
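Explanation of the ESR information
The ESR (Exception Syndrome Register (EL1)) value printed in the kernel exception above is 0x86000003. Consider the ESR_EL1 register bit assignment:
ESR_EL1 is a 64-bit register. The first field to check is EC (exception class), which occupies bits [31:26] of the register, 6 bits in total.
The meaning of ISS depends on EC.
In this example EC is 0x21 (0b100001). The EC table below shows that 0b100001 is an instruction abort, so the next step is to look at the ISS encoding for an instruction abort.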
EC | Meaning | ISS | Applies when |
---|---|---|---|
0b000000 | Unknown reason. | ISS encoding for exceptions with an unknown reason | |
0b000001 | Trapped WF* instruction execution. Conditional WF* instructions that fail their condition code check do not cause an exception. | ISS encoding for an exception from a WF* instruction | |
0b100001 | Instruction Abort taken without a change in Exception level. Used for MMU faults generated by instruction accesses and synchronous External aborts, including synchronous parity or ECC errors. Not used for debug-related exceptions. | ISS encoding for an exception from an Instruction Abort | |
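The EC extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not kernel code: the helper name `split_esr` and the `EC_NAMES` dictionary are made up for this example, and the dictionary holds only the three rows quoted in the table. The positions of IL (bit 25) and ISS (bits [24:0]) are taken from the ESR_EL1 layout in the Arm documentation linked further down.

```python
# Top-level ESR_EL1 fields: EC is bits [31:26], IL is bit 25, ISS is bits [24:0].
# EC_NAMES holds only the rows quoted in this article; the real table is larger.
EC_NAMES = {
    0b000000: "Unknown reason",
    0b000001: "Trapped WF* instruction execution",
    0b100001: "Instruction Abort taken without a change in Exception level",
}


def split_esr(esr):
    """Return (ec, il, iss) extracted from a raw ESR value."""
    ec = (esr >> 26) & 0x3F   # 6-bit exception class
    il = (esr >> 25) & 0x1    # 1 means a 32-bit instruction
    iss = esr & 0x1FFFFFF     # 25-bit instruction specific syndrome
    return ec, il, iss


ec, il, iss = split_esr(0x86000003)
name = EC_NAMES.get(ec, "not in this excerpt")
print(f"EC={ec:#x} ({name}), IL={il}, ISS={iss:#x}")
```

For the value from the log, this yields EC 0x21, IL 1, and ISS 0x3, which agrees with the "IABT (current EL), IL = 32 bits" line the kernel printed.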
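The field that matters most is IFSC; its values are explained in the table below. In this example IFSC is 3, so the fault is "Address size fault, level 3", which matches the "level 3 address size fault" text in the Internal error line of the log.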
ISS encoding for an exception from an Instruction Abort
Bits | [24:13] | [12:11] | [10] | [9] | [8] | [7] | [6] | [5:0] |
---|---|---|---|---|---|---|---|---|
Field | RES0 | SET | FnV | EA | RES0 | S1PTW | RES0 | IFSC |
IFSC, bits [5:0]
Instruction Fault Status Code.
IFSC | Meaning | Applies when |
---|---|---|
0b000000 | Address size fault, level 0 of translation or translation table base register. | |
0b000001 | Address size fault, level 1. | |
0b000010 | Address size fault, level 2. | |
0b000011 | Address size fault, level 3. | |
0b000100 | Translation fault, level 0. | |
0b000101 | Translation fault, level 1. | |
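As a rough sketch, the ISS fields for an instruction abort can be pulled apart as follows. The function name `decode_iabt_iss` and the `IFSC_NAMES` dictionary are hypothetical names for this illustration, the dictionary covers only the IFSC rows listed above, and the bit positions assume the instruction abort ISS layout from the Arm documentation (SET [12:11], FnV [10], EA [9], S1PTW [7], IFSC [5:0]).

```python
# IFSC values quoted in this article only; other codes exist in the full table.
IFSC_NAMES = {
    0b000000: "Address size fault, level 0 of translation or translation table base register",
    0b000001: "Address size fault, level 1",
    0b000010: "Address size fault, level 2",
    0b000011: "Address size fault, level 3",
    0b000100: "Translation fault, level 0",
    0b000101: "Translation fault, level 1",
}


def decode_iabt_iss(iss):
    """Split the ISS of an Instruction Abort into its named fields."""
    return {
        "SET": (iss >> 11) & 0x3,    # bits [12:11]
        "FnV": (iss >> 10) & 0x1,    # bit 10
        "EA": (iss >> 9) & 0x1,      # bit 9
        "S1PTW": (iss >> 7) & 0x1,   # bit 7
        "IFSC": iss & 0x3F,          # bits [5:0]
    }


fields = decode_iabt_iss(0x86000003 & 0x1FFFFFF)  # ISS is ESR bits [24:0]
print(fields)
print(IFSC_NAMES.get(fields["IFSC"], "not in this excerpt"))
```

The decoded SET, FnV, EA and S1PTW values are all 0 and IFSC is 3, the same values the kernel printed under "Mem abort info".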
The printed IL = 32 bits means the instruction length is 32 bits, i.e. the instruction is 4 bytes long.
For the full description of the ESR_EL1 register, see the following link:
https://developer.arm.com/documentation/ddi0595/2021-06/AArch64-Registers/ESR-EL1--Exception-Syndrome-Register--EL1-?lang=en#fieldset_0-24_0_14-5_0
On a kernel exception, the kernel also prints the PGD/PUD/PMD/PTE for the current fault address
<1>[ 7766.006281] [ffffff800188d408] pgd=000000009d7fe003, pud=000000009d7fe003, pmd=00000000625c6003, pte=0040080063544793
pgd=000000009d7fe003,
pud=000000009d7fe003,
pmd=00000000625c6003,
pte=0040080063544793
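This kernel exception (KE) occurred on an ARM64 machine with 2G of DRAM, so the PGD/PUD/PMD page table descriptor values look normal. The PTE page table descriptor value is the problem: the physical address it encodes is 0x80063544000, whereas on a machine with 2G of DRAM the physical address should be below 0xFFFFFFFF. A bad output address in the last-level descriptor is consistent with the level 3 address size fault reported above.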
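The address check above can be reproduced with a small sketch. It assumes the 4k granule shown in the log and that the descriptor's output address lives in bits [47:12] (that is, no 52-bit physical address extension is in use); `descriptor_output_address` is a hypothetical helper name, and `PA_LIMIT` is simply the 0xFFFFFFFF bound used in this article, not a value read from the hardware.

```python
PAGE_SHIFT = 12                      # 4k pages, per "swapper pgtable: 4k pages"
OA_MASK = ((1 << 48) - 1) & ~((1 << PAGE_SHIFT) - 1)   # bits [47:12]
PA_LIMIT = 0xFFFFFFFF                # bound quoted in the article for this machine


def descriptor_output_address(desc):
    """Return the output (physical) address field of a table/page descriptor."""
    return desc & OA_MASK


# Descriptor values copied from the oops log.
descriptors = {
    "pgd": 0x000000009D7FE003,
    "pud": 0x000000009D7FE003,
    "pmd": 0x00000000625C6003,
    "pte": 0x0040080063544793,
}

for level, desc in descriptors.items():
    pa = descriptor_output_address(desc)
    verdict = "ok" if pa <= PA_LIMIT else "out of range"
    print(f"{level}: pa={pa:#x} {verdict}")
```

Only the pte entry comes out above the limit, at 0x80063544000; the other three descriptors point below 4G.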
The Code line in the kernel oops log
[ 794.274311] Code: f946a2c9 12001eea 0b350157 9b1b2789 (39402529)
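When an oops such as a data abort or instruction abort occurs in the kernel, the kernel prints the instruction that triggered the abort together with the few instructions preceding it. The triggering instruction is the one in parentheses, and it can be used to locate the corresponding position in the source code.
For example, if the oops happened in a function inside some ko, disassemble that function and search it for 39402529. The instruction and the ones before it are shown below, so running llvm-symbolizer on the address printed in front of the 39402529 instruction (0x227c88 here, not the instruction encoding itself) locates the corresponding source code position:
227c7c: 12001eea and w10, w23, #0xff
227c80: 0b350157 add w23, w10, w21, uxtb
227c84: 9b1b2789 madd x9, x28, x27, x9
227c88: 39402529 ldrb w9, [x9,#9]
llvm-symbolizer -e xxx.ko 0x227c88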
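The search step described above can be automated roughly as follows. This is a sketch under assumptions: `faulting_word` and `find_addresses` are hypothetical helper names, and the disassembly text is expected in the `address: encoding mnemonic ...` layout shown in this article (as produced by objdump-style tools).

```python
import re


def faulting_word(code_line):
    """Return the parenthesised instruction word from an oops 'Code:' line."""
    match = re.search(r"\(([0-9a-fA-F]{8})\)", code_line)
    return match.group(1).lower() if match else None


def find_addresses(disassembly, word):
    """Return addresses of all disassembly lines whose encoding equals word."""
    hits = []
    for line in disassembly.splitlines():
        m = re.match(r"\s*([0-9a-fA-F]+):\s+([0-9a-fA-F]{8})\b", line)
        if m and m.group(2).lower() == word:
            hits.append(int(m.group(1), 16))
    return hits


code_line = "[  794.274311] Code: f946a2c9 12001eea 0b350157 9b1b2789 (39402529)"
listing = """\
227c7c: 12001eea and w10, w23, #0xff
227c80: 0b350157 add w23, w10, w21, uxtb
227c84: 9b1b2789 madd x9, x28, x27, x9
227c88: 39402529 ldrb w9, [x9,#9]
"""

word = faulting_word(code_line)
for addr in find_addresses(listing, word):
    print(f"{word} found at {addr:#x}")
```

Each address it reports is a candidate to pass to llvm-symbolizer. If more than one address is reported, the preceding words from the Code line can be compared against the instructions before each candidate to narrow it down.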
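Before doing this, compare the size of the function that PC points to with the size of that function in your disassembly. If they are equal, the ko or vmlinux matches the image on which the problem occurred. For example, the function that PC points to below has a size of 0xb10: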
[ 794.235944] XXX_OSD_WindowDestroy+0xb0/0xb10 [xxx.ko]
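The `symbol+offset/size` notation in this line can be unpacked with a short script, which makes both the size comparison and a base-plus-offset address calculation mechanical. In this sketch `parse_pc_symbol` is a hypothetical helper, and `DISASM_FUNC_BASE` and `DISASM_FUNC_SIZE` are assumed values standing in for what you would read from your own disassembly or symbol table; the base is chosen here only so that base plus offset lands on the 0x227c88 line of the listing shown earlier.

```python
import re


def parse_pc_symbol(text):
    """Parse 'name+0xOFFSET/0xSIZE' and return (name, offset, size) or None."""
    m = re.search(r"([A-Za-z_]\w*)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)", text)
    if not m:
        return None
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


pc_line = "[  794.235944] XXX_OSD_WindowDestroy+0xb0/0xb10 [xxx.ko]"
name, offset, size = parse_pc_symbol(pc_line)

# Assumed values for illustration; take the real ones from your disassembly.
DISASM_FUNC_BASE = 0x227BD8
DISASM_FUNC_SIZE = 0xB10

if size == DISASM_FUNC_SIZE:
    target = DISASM_FUNC_BASE + offset
    print(f"{name}: sizes match, faulting address is {target:#x}")
else:
    print(f"{name}: size mismatch, the binary may not match the running image")
```

The resulting address is what would be handed to llvm-symbolizer. A size mismatch suggests the ko or vmlinux was built from different sources or with different options than the image that crashed.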
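When searching the disassembled function for the faulting instruction, more than one match may turn up. In that case either analyze the surrounding assembly to decide which one it is, or, once the function size reported at PC has been confirmed to equal the size of the disassembled function, add the offset to the function's base address and use the sum to locate the source code position. For example, the offset of the PC above within XXX_OSD_WindowDestroy() is 0xb0.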