A Detailed Walkthrough of the ARM Linux System Call Flow


 

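Linux implements most of the interface between user-mode processes and hardware devices through system calls issued to the kernel.

A system call is a service provided by the operating system. User programs invoke the kernel's services through the various system calls, and executing a system call traps the program into the kernel. On ARM, that trap is triggered by the swi software-interrupt instruction.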

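1 User programs can invoke system calls in two ways

The first is through C library functions, which include the library's wrapper functions for system calls as well as other ordinary functions.

The second is the _syscall macros. Kernels up to 2.6.18 define seven of them in include/asm-i386/unistd.h: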

_syscall0(type,name)  
_syscall1(type,name,type1,arg1)  
_syscall2(type,name,type1,arg1,type2,arg2)  
_syscall3(type,name,type1,arg1,type2,arg2,type3,arg3)  
_syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4)  
_syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,type5,arg5)  
_syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,type5,arg5,type6,arg6) 

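Here type is the return type of the generated system call function, name is the name of the system call, and typeN and argN are the type and name of the Nth parameter. The number of typeN/argN pairs equals the digit that follows _syscall.

Each macro creates a function called name; the digit after _syscall gives the number of parameters that function takes.

For example, the sysinfo system call, which returns overall system statistics, is defined with a _syscall macro as: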

_syscall1(int, sysinfo, struct sysinfo *, info); 

which expands to:

int sysinfo(struct sysinfo * info)  
{
  long __res;
  __asm__ volatile("int $0x80" : "=a" (__res) : "0" (116),"b" ((long)(info)));

  do {
    if ((unsigned long)(__res) >= (unsigned long)(-(128 + 1)))
    {
      errno = -(__res);
      __res = -1;
    }

    return (int) (__res);
  } while (0);
}

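As the expansion shows, _syscall1(int, sysinfo, struct sysinfo *, info) becomes a function named sysinfo. The macro argument int becomes the return type, and struct sysinfo * and info become the type and name of its parameter. The constant 116 is the i386 system call number of sysinfo, passed in eax.

Once a program file defines the system call it needs with a _syscall macro, the code that follows can invoke it directly by name. The listing below uses the sysinfo system call.

Listing 5.1  Using the sysinfo system call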

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <linux/unistd.h>
#include <linux/kernel.h>       /* for struct sysinfo */

_syscall1(int, sysinfo, struct sysinfo *, info);

int main(void)
{
  struct sysinfo s_info;
  int error;

  error = sysinfo(&s_info);
  printf("code error = %d\n", error);
  printf("Uptime = %lds\nLoad: 1 min %lu / 5 min %lu / 15 min %lu\n"
         "RAM: total %lu / free %lu / shared %lu\n"
         "Memory in buffers = %lu\nSwap: total %lu / free %lu\n"
         "Number of processes = %d\n",
         s_info.uptime,
         s_info.loads[0], s_info.loads[1], s_info.loads[2],
         s_info.totalram, s_info.freeram, s_info.sharedram,
         s_info.bufferram, s_info.totalswap, s_info.freeswap,
         s_info.procs);
  exit(EXIT_SUCCESS);
}

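Starting with 2.6.19, however, the _syscall macros are no longer available, and the syscall function must be used instead. It invokes a system call given its number and a list of arguments.

The prototype of syscall is: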

int syscall(int number, ...); 

Here number is the system call number, followed in order by all of the arguments of that system call. The listing below calls gettid.

Listing 5.2  Using the gettid system call

#include <unistd.h> 
#include <sys/syscall.h> 
#include <sys/types.h> 

#define __NR_gettid      224  

int main(int argc, char *argv[])  
{       
    pid_t tid;  
  
    tid = syscall(__NR_gettid);  
}

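Most system calls also have a SYS_ symbolic constant that maps to the system call number, so the call to syscall in the listing above can be rewritten as: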

tid = syscall(SYS_gettid);  
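Since Listing 5.1 depends on the removed _syscall1 macro, here is a minimal sketch of the same sysinfo query issued through syscall() instead. The helper names demo_sysinfo and demo_print_sysinfo are made up for illustration, and the sketch assumes a Linux system with glibc-style headers, where syscall() is declared as returning long.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>   /* SYS_sysinfo */
#include <sys/sysinfo.h>   /* struct sysinfo */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: invoke sysinfo by number rather than via _syscall1. */
static long demo_sysinfo(struct sysinfo *info)
{
    return syscall(SYS_sysinfo, info);
}

/* Print a few fields; returns 0 on success and -1 on failure. */
static int demo_print_sysinfo(void)
{
    struct sysinfo s_info;

    if (demo_sysinfo(&s_info) != 0) {
        perror("sysinfo");
        return -1;
    }
    printf("Uptime = %lds\n", s_info.uptime);
    printf("RAM: total %lu / free %lu\n", s_info.totalram, s_info.freeram);
    printf("Number of processes = %d\n", (int)s_info.procs);
    return 0;
}
```

Calling demo_print_sysinfo() from main should print the same kind of statistics as Listing 5.1, without relying on any kernel-private macro.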

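2 System calls versus application programming interfaces (APIs)

An application programming interface (API) differs from a system call: the former is just a function definition that specifies how to obtain a given service, while the latter is an explicit request to the kernel made through a software interrupt. The POSIX standard addresses APIs, not system calls. Unix systems provide programmers with many API library functions. Some of the APIs defined by the standard C library, libc, refer to wrapper routines, whose only purpose is to issue a system call. Usually each system call has a corresponding wrapper routine, and the wrapper routine defines the API that applications use. The converse does not hold: an API need not correspond to any particular system call. From the programmer's point of view, the difference between an API and a system call is irrelevant: all that matters is the function name, the parameter types, and the meaning of the return code. From the kernel designer's point of view the difference does matter, because system calls belong to the kernel while user-mode library functions do not.

Most wrapper routines return an integer whose meaning depends on the corresponding system call. A return value of -1 usually means that the kernel could not satisfy the process's request. A failure in the system call handler may be caused by invalid parameters, a lack of available resources, a hardware problem, and so on. The errno variable, defined in libc, holds the specific error code, and each error code is defined as a macro constant.

When a user-mode process invokes a system call, the CPU switches to kernel mode and starts executing a kernel function. Because the kernel implements many different system calls, the process must pass a parameter called the system call number to identify the one it wants. Every system call returns an integer value, but the convention differs from that of the wrapper routines. In the kernel, a positive value or 0 means the system call completed successfully, and a negative value signals an error condition. In the latter case the value is the negated error code that must be handed back to the application in errno.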
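The two return conventions can be seen side by side in a small sketch. demo_wrap is a hypothetical helper that mimics what a wrapper routine does with the kernel's raw return value; the -4095 threshold is an assumption borrowed from common libc practice (the i386 expansion quoted earlier uses -(128 + 1) instead). demo_errno_after_bad_close shows the user-visible result for a real call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical model of a wrapper routine: the kernel hands back either a
 * non-negative result or a negated error code; the wrapper converts the
 * latter into -1 plus errno. The 4095 bound is an assumption. */
static long demo_wrap(long kernel_ret)
{
    if ((unsigned long)kernel_ret >= (unsigned long)-4095L) {
        errno = (int)-kernel_ret;
        return -1;
    }
    return kernel_ret;
}

/* Close an invalid descriptor through syscall() and report the errno that
 * the library left behind (0 if the call unexpectedly succeeded). */
static int demo_errno_after_bad_close(void)
{
    errno = 0;
    long ret = syscall(SYS_close, -1);
    return (ret == -1) ? errno : 0;
}
```

Inside the kernel the failing close returns -EBADF; by the time the application sees it, the value has become -1 with errno set to EBADF.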

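3 How a system call is executed

ARM Linux uses the SWI instruction to enter kernel space from user space, so let us look at that instruction first. SWI generates a software interrupt: the processor switches from User mode to Supervisor mode, saves CPSR into the Supervisor-mode SPSR, and branches to the SWI vector. SWI can also be used in other modes, and the processor switches to Supervisor mode in the same way. The instruction format is: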

SWI{cond} immed_24

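where:

immed_24  a 24-bit immediate, an integer in the range 0 to 16777215.

Parameters are usually passed to SWI in one of the two ways below, and the SWI exception handler provides the corresponding service. Both are software conventions. To obtain the 24-bit immediate, the SWI exception handler must read back the SWI instruction that raised the software interrupt.

1) The 24-bit immediate in the instruction selects the service the user requests, and parameters are passed in general-purpose registers. For example: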

MOV R0,#34
SWI 12

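2) The 24-bit immediate is ignored; the requested service is determined by the value of register R0, and parameters are passed in the other general-purpose registers. For example: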

MOV R0, #12
MOV R1, #34
SWI 0

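Inside the SWI exception handler, the immediate is extracted as follows: first determine whether the SWI instruction that raised the interrupt is an ARM or a Thumb instruction, which can be read from SPSR; then obtain the address of the SWI instruction from the LR register; finally read the instruction and mask out the immediate (the low 24 bits of an ARM instruction, or the low 8 bits of a Thumb one).

Entering a system call from user space

Code normally reaches a system call through the C library's wrappers. Taking open in uClibc 0.9.30 as an example, let us trace step by step how the wrapper ends up issuing the system call. include/fcntl.h contains: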
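The masking step can be sketched in C on plain instruction words. The function names are illustrative only; the bit layouts follow the instruction format described above (ARM SWI: bits 27..24 are all ones and the immediate is in the low 24 bits; Thumb SWI: immediate in the low 8 bits).

```c
#include <stdint.h>

/* True if a 32-bit ARM-state instruction is an SWI: bits 27..24 == 1111. */
static int demo_is_arm_swi(uint32_t instr)
{
    return (instr & 0x0f000000u) == 0x0f000000u;
}

/* Immediate of an ARM-state SWI: the low 24 bits of the instruction. */
static uint32_t demo_arm_swi_imm(uint32_t instr)
{
    return instr & 0x00ffffffu;
}

/* Immediate of a Thumb-state SWI: the low 8 bits of the 16-bit instruction. */
static uint32_t demo_thumb_swi_imm(uint16_t instr)
{
    return instr & 0x00ffu;
}
```

For instance, the encoding 0xEF00000C is an unconditional SWI 12, so demo_arm_swi_imm returns 12 for it.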

# define open open64

open is in fact just an alias for open64.

libc/sysdeps/linux/common/open64.c contains:

extern __typeof(open64) __libc_open64;
extern __typeof(open) __libc_open;

open64 in turn is just an alias for __libc_open64, which is defined in the same file:

libc_hidden_proto(__libc_open64)
int __libc_open64 (const char *file, int oflag, ...)
{
    mode_t mode = 0;

    if (oflag & O_CREAT)
    {
       va_list arg;
       va_start (arg, oflag);
       mode = va_arg (arg, mode_t);
       va_end (arg);
    }
 
    return __libc_open(file, oflag | O_LARGEFILE, mode);
}
libc_hidden_def(__libc_open64)

__libc_open64 ends up calling __libc_open, which is defined in libc/sysdeps/linux/common/open.c:

libc_hidden_proto(__libc_open)
int __libc_open(const char *file, int oflag, ...)
{
   mode_t mode = 0;

   if (oflag & O_CREAT) {
      va_list arg;
      va_start (arg, oflag);
      mode = va_arg (arg, mode_t);
      va_end (arg);
   }

   return __syscall_open(file, oflag, mode);
}
libc_hidden_def(__libc_open)

__syscall_open is defined in the same file:

static __inline__ _syscall3(int, __syscall_open, const char *, file, int, flags, __kernel_mode_t, mode)

libc/sysdeps/linux/arm/bits/syscalls.h contains:

#undef _syscall3
#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \
type name(type1 arg1,type2 arg2,type3 arg3) \
{ \
return (type) (INLINE_SYSCALL(name, 3, arg1, arg2, arg3)); \
}

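This macro defines a function. Its first argument is the return type, its second is the function name, and the remaining arguments are, as their names suggest, the parameter types and parameter names. __syscall_open therefore expands to: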

int __syscall_open (const char * file,int flags, __kernel_mode_t mode)
{
    return (int) (INLINE_SYSCALL(__syscall_open, 3, file, flags, mode));
}

INLINE_SYSCALL is a macro defined in the same file:

#undef INLINE_SYSCALL
#define INLINE_SYSCALL(name, nr, args...)            \
  ({ unsigned int _inline_sys_result = INTERNAL_SYSCALL (name, , nr, args);  \
     if (__builtin_expect (INTERNAL_SYSCALL_ERROR_P (_inline_sys_result, ), 0))  \
       {                        \
    __set_errno (INTERNAL_SYSCALL_ERRNO (_inline_sys_result, ));    \
    _inline_sys_result = (unsigned int) -1;          \
       }                        \
     (int) _inline_sys_result; })

#undef INTERNAL_SYSCALL
#if !defined(__thumb__)
#if defined(__ARM_EABI__)
#define INTERNAL_SYSCALL(name, err, nr, args...)        \
  ({unsigned int __sys_result;                 \
     {                          \
       register int _a1 __asm__ ("r0"), _nr __asm__ ("r7");    \
       LOAD_ARGS_##nr (args)                \
       _nr = SYS_ify(name);                 \
       __asm__ __volatile__ ("swi  0x0   @ syscall " #name  \
              : "=r" (_a1)            \
              : "r" (_nr) ASM_ARGS_##nr        \
              : "memory");            \
          __sys_result = _a1;               \
     }                          \
     (int) __sys_result; })
#else /* defined(__ARM_EABI__) */

#define INTERNAL_SYSCALL(name, err, nr, args...)        \
  ({ unsigned int __sys_result;                \
     {                          \
       register int _a1 __asm__ ("a1");               \
       LOAD_ARGS_##nr (args)                \
       __asm__ __volatile__ ("swi  %1 @ syscall " #name  \
           : "=r" (_a1)               \
           : "i" (SYS_ify(name)) ASM_ARGS_##nr    \
           : "memory");               \
       __sys_result = _a1;                  \
     }                          \
     (int) __sys_result; })
#endif
#else /* !defined(__thumb__) */
/* We can't use push/pop inside the asm because that breaks
   unwinding (ie. thread cancellation).
 */
#define INTERNAL_SYSCALL(name, err, nr, args...)        \
  ({ unsigned int __sys_result;                \
    {                           \
      int _sys_buf[2];                   \
      register int _a1 __asm__ ("a1");                \
      register int *_v3 __asm__ ("v3") = _sys_buf;       \
      *_v3 = (int) (SYS_ify(name));               \
      LOAD_ARGS_##nr (args)                 \
      __asm__ __volatile__ ("str   r7, [v3, #4]\n"       \
          "\tldr   r7, [v3]\n"           \
          "\tswi   0  @ syscall " #name "\n"      \
          "\tldr   r7, [v3, #4]"            \
          : "=r" (_a1)                \
          : "r" (_v3) ASM_ARGS_##nr            \
          : "memory");              \
      __sys_result = _a1;                  \
    }                           \
    (int) __sys_result; })
#endif /*!defined(__thumb__)*/

The LOAD_ARGS macros from the same file are reproduced here as well:

#define LOAD_ARGS_0()
#define ASM_ARGS_0
#define LOAD_ARGS_1(a1)                \
  _a1 = (int) (a1);                \
  LOAD_ARGS_0 ()
#define ASM_ARGS_1    ASM_ARGS_0, "r" (_a1)
#define LOAD_ARGS_2(a1, a2)            \
  register int _a2 __asm__ ("a2") = (int) (a2);    \
  LOAD_ARGS_1 (a1)
#define ASM_ARGS_2    ASM_ARGS_1, "r" (_a2)
#define LOAD_ARGS_3(a1, a2, a3)            \
  register int _a3 __asm__ ("a3") = (int) (a3);    \
  LOAD_ARGS_2 (a1, a2)

These macros load the arguments into the corresponding registers. The SYS_ify macro produces the system call number:

#define SYS_ify(syscall_name)  (__NR_##syscall_name)

which here is __NR___syscall_open. Its definition can be found in libc/sysdeps/linux/common/open.c:

#define __NR___syscall_open __NR_open

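__NR_open is defined in the kernel headers. The system call number is placed in register r7, and the arguments appear to be passed in much the same way as for an ordinary function call.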
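The chain from SYS_ify(__syscall_open) to a number relies only on token pasting followed by ordinary macro expansion. The sketch below reproduces it with stand-in names prefixed DEMO_ (hypothetical, chosen to avoid clashing with real headers); the value 5 matches the __NR_open offset that appears in the kernel header quoted later.

```c
/* Hypothetical stand-ins for the kernel and uClibc definitions. */
#define DEMO_NR_open            5
#define DEMO_NR___syscall_open  DEMO_NR_open

/* Same shape as SYS_ify: paste the name onto a prefix, then let the
 * preprocessor expand the resulting identifier. */
#define DEMO_SYS_ify(syscall_name) (DEMO_NR_##syscall_name)

/* DEMO_SYS_ify(__syscall_open)
 *   -> (DEMO_NR___syscall_open) -> (DEMO_NR_open) -> (5) */
static int demo_open_number(void)
{
    return DEMO_SYS_ify(__syscall_open);
}
```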

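Note the EABI conditionals here. What is EABI? ABI stands for Application Binary Interface, and EABI is the newer ARM variant of it. Under the newer EABI convention the system call number is placed in register r7, whereas the old OABI encodes it in the swi instruction itself: the old calling convention (Old ABI) carries the call number in the immediate that follows swi. The two conventions also use different system call numbers, as the kernel file arch/arm/include/asm/unistd.h shows: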
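As a rough illustration of the EABI convention, the sketch below issues getpid directly: the number goes into r7, swi 0 traps into the kernel, and the result comes back in r0. It assumes GCC or Clang and takes the inline-assembly path only when compiled for ARM state with EABI; on any other target it falls back to syscall() so that it still builds. demo_raw_getpid is a made-up name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: getpid without going through the libc wrapper. */
static long demo_raw_getpid(void)
{
#if defined(__arm__) && defined(__ARM_EABI__) && !defined(__thumb__)
    /* EABI: system call number in r7, immediate of swi is 0. */
    register long nr  __asm__("r7") = __NR_getpid;
    register long ret __asm__("r0");

    __asm__ __volatile__("swi 0x0"
                         : "=r"(ret)
                         : "r"(nr)
                         : "memory");
    return ret;
#else
    /* Not ARM-state EABI: let the C library pick the right instruction. */
    return syscall(SYS_getpid);
#endif
}
```

An OABI build would instead encode the number in the instruction, as in swi 0x900014 for getpid, which is why the two conventions need different number bases.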

#define __NR_OABI_SYSCALL_BASE    0x900000

#if defined(__thumb__) || defined(__ARM_EABI__)
#define __NR_SYSCALL_BASE    0
#else
#define __NR_SYSCALL_BASE    __NR_OABI_SYSCALL_BASE
#endif

/*
 * This file contains the system call numbers.
 */

#define __NR_restart_syscall        (__NR_SYSCALL_BASE+  0)
#define __NR_exit            (__NR_SYSCALL_BASE+  1)
#define __NR_fork            (__NR_SYSCALL_BASE+  2)
#define __NR_read            (__NR_SYSCALL_BASE+  3)
#define __NR_write            (__NR_SYSCALL_BASE+  4)
#define __NR_open            (__NR_SYSCALL_BASE+  5)
……

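Now for how the operating system handles the system call. We return to the ARM Linux exception vector table, because executing swi fetches the handler's address from that table and jumps to it. In arch/arm/kernel/entry-armv.S: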

/*
 * We group all the following data together to optimise
 * for CPUs with separate I & D caches.
 */
    .align    5

.LCvswi:
    .word    vector_swi

    .globl    __stubs_end
__stubs_end:

    .equ    stubs_offset, __vectors_start + 0x200 - __stubs_start

    .globl    __vectors_start
__vectors_start:
 ARM(    swi    SYS_ERROR0    )
 THUMB(    svc    #0        )
 THUMB(    nop            )
    W(b)    vector_und + stubs_offset
    W(ldr)    pc, .LCvswi + stubs_offset
    W(b)    vector_pabt + stubs_offset
    W(b)    vector_dabt + stubs_offset
    W(b)    vector_addrexcptn + stubs_offset
    W(b)    vector_irq + stubs_offset
    W(b)    vector_fiq + stubs_offset

    .globl    __vectors_end
__vectors_end:

.LCvswi is defined in the same file as:

.LCvswi:
   .word vector_swi

So the vector_swi routine is what finally handles the system call. It is defined in arch/arm/kernel/entry-common.S:

/*=============================================================================
 * SWI handler
 *-----------------------------------------------------------------------------
 */

    /* If we're optimising for StrongARM the resulting code won't 
       run on an ARM7 and we can save a couple of instructions.  
                                --pb */
#ifdef CONFIG_CPU_ARM710
#define A710(code...) code
.Larm710bug:
    ldmia    sp, {r0 - lr}^            @ Get calling r0 - lr
    mov    r0, r0
    add    sp, sp, #S_FRAME_SIZE
    subs    pc, lr, #4
#else
#define A710(code...)
#endif

    .align    5
ENTRY(vector_swi)
    sub    sp, sp, #S_FRAME_SIZE
    stmia    sp, {r0 - r12}            @ Calling r0 - r12
 ARM(    add    r8, sp, #S_PC        )
 ARM(    stmdb    r8, {sp, lr}^        )    @ Calling sp, lr
 THUMB(    mov    r8, sp            )
 THUMB(    store_user_sp_lr r8, r10, S_SP    )    @ calling sp, lr
    mrs    r8, spsr            @ called from non-FIQ mode, so ok.
    str    lr, [sp, #S_PC]            @ Save calling PC
    str    r8, [sp, #S_PSR]        @ Save CPSR
    str    r0, [sp, #S_OLD_R0]        @ Save OLD_R0
    zero_fp

    /*
     * Get the system call number.
     */

#if defined(CONFIG_OABI_COMPAT)

    /*
     * If we have CONFIG_OABI_COMPAT then we need to look at the swi
     * value to determine if it is an EABI or an old ABI call.
     */
#ifdef CONFIG_ARM_THUMB
    tst    r8, #PSR_T_BIT
    movne    r10, #0                @ no thumb OABI emulation
    ldreq    r10, [lr, #-4]            @ get SWI instruction
#else
    ldr    r10, [lr, #-4]            @ get SWI instruction
  A710(    and    ip, r10, #0x0f000000        @ check for SWI        )
  A710(    teq    ip, #0x0f000000                        )
  A710(    bne    .Larm710bug                        )
#endif
#ifdef CONFIG_CPU_ENDIAN_BE8
    rev    r10, r10            @ little endian instruction
#endif

#elif defined(CONFIG_AEABI)

    /*
     * Pure EABI user space always put syscall number into scno (r7).
     */
  A710(    ldr    ip, [lr, #-4]            @ get SWI instruction    )
  A710(    and    ip, ip, #0x0f000000        @ check for SWI        )
  A710(    teq    ip, #0x0f000000                        )
  A710(    bne    .Larm710bug                        )

#elif defined(CONFIG_ARM_THUMB)

    /* Legacy ABI only, possibly thumb mode. */
    tst    r8, #PSR_T_BIT            @ this is SPSR from save_user_regs
    addne    scno, r7, #__NR_SYSCALL_BASE    @ put OS number in
    ldreq    scno, [lr, #-4]

#else

    /* Legacy ABI only. */
    ldr    scno, [lr, #-4]            @ get SWI instruction
  A710(    and    ip, scno, #0x0f000000        @ check for SWI        )
  A710(    teq    ip, #0x0f000000                        )
  A710(    bne    .Larm710bug                        )

#endif

#ifdef CONFIG_ALIGNMENT_TRAP
    ldr    ip, __cr_alignment
    ldr    ip, [ip]
    mcr    p15, 0, ip, c1, c0        @ update control register
#endif
    enable_irq

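    // tsk is an alias for register r9, defined in arch/arm/kernel/entry-header.S:
    //   tsk  .req   r9     @ current thread_info
    // Get the base address of the current thread_info.

    get_thread_info tsk

    // tbl is an alias for register r8, defined in arch/arm/kernel/entry-header.S:
    //   tbl  .req   r8     @ syscall table pointer
    // It holds the pointer to the system call table, which is used further below.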

    adr    tbl, sys_call_table        @ load syscall table pointer

#if defined(CONFIG_OABI_COMPAT)
    /*
     * If the swi argument is zero, this is an EABI call and we do nothing.
     *
     * If this is an old ABI call, get the syscall number into scno and
     * get the old ABI syscall table address.
     */
    bics    r10, r10, #0xff000000
    eorne    scno, r10, #__NR_OABI_SYSCALL_BASE
    ldrne    tbl, =sys_oabi_call_table
#elif !defined(CONFIG_AEABI)
    // scno is an alias for register r7
    bic    scno, scno, #0xff000000        @ mask off SWI op-code
    eor    scno, scno, #__NR_SYSCALL_BASE    @ check OS number
#endif

    ldr    r10, [tsk, #TI_FLAGS]        @ check for syscall tracing
    stmdb    sp!, {r4, r5}            @ push fifth and sixth args

#ifdef CONFIG_SECCOMP
    tst    r10, #_TIF_SECCOMP
    beq    1f
    mov    r0, scno
    bl    __secure_computing
    add    r0, sp, #S_R0 + S_OFF        @ pointer to regs
    ldmia    r0, {r0 - r3}            @ have to reload r0 - r3
1:
#endif

    tst    r10, #_TIF_SYSCALL_TRACE        @ are we tracing syscalls?
    bne    __sys_trace

    cmp    scno, #NR_syscalls        @ check upper syscall limit
    adr    lr, BSYM(ret_fast_syscall)    @ return address
    ldrcc    pc, [tbl, scno, lsl #2]        @ call sys_* routine

    add    r1, sp, #S_OFF
    // why is also an alias for register r8
2:    mov    why, #0                @ no longer a real syscall

    cmp    scno, #(__ARM_NR_BASE - __NR_SYSCALL_BASE)
    eor    r0, scno, #__NR_SYSCALL_BASE    @ put OS number back
    bcs    arm_syscall    
    b    sys_ni_syscall            @ not private func
ENDPROC(vector_swi)

    /*
     * This is the really slow path.  We're going to be doing
     * context switches, and waiting for our parent to respond.
     */
__sys_trace:
    mov    r2, scno
    add    r1, sp, #S_OFF
    mov    r0, #0                @ trace entry [IP = 0]
    bl    syscall_trace

    adr    lr, BSYM(__sys_trace_return)    @ return address
    mov    scno, r0            @ syscall number (possibly new)
    add    r1, sp, #S_R0 + S_OFF        @ pointer to regs
    cmp    scno, #NR_syscalls        @ check upper syscall limit
    ldmccia    r1, {r0 - r3}            @ have to reload r0 - r3
    ldrcc    pc, [tbl, scno, lsl #2]        @ call sys_* routine
    b    2b

__sys_trace_return:
    str    r0, [sp, #S_R0 + S_OFF]!    @ save returned r0
    mov    r2, scno
    mov    r1, sp
    mov    r0, #1                @ trace exit [IP = 1]
    bl    syscall_trace
    b    ret_slow_syscall

    .align    5
#ifdef CONFIG_ALIGNMENT_TRAP
    .type    __cr_alignment, #object
__cr_alignment:
    .word    cr_alignment
#endif
    .ltorg

/*
 * This is the syscall table declaration for native ABI syscalls.
 * With EABI a couple syscalls are obsolete and defined as sys_ni_syscall.
 */
#define ABI(native, compat) native
#ifdef CONFIG_AEABI
#define OBSOLETE(syscall) sys_ni_syscall
#else
#define OBSOLETE(syscall) syscall
#endif

    .type    sys_call_table, #object
ENTRY(sys_call_table)
#include "calls.S"
#undef ABI
#undef OBSOLETE

zero_fp, used above, is a macro defined in arch/arm/kernel/entry-header.S:

    .macro zero_fp
#ifdef CONFIG_FRAME_POINTER
    mov    fp, #0
#endif
    .endm

// fp is register r11.

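Like every exception handler, the first thing vector_swi does is save the context. It then obtains the system call number.

Next it uses the system call number as an index into the system call table. If the number is valid, the corresponding handler is called, which is what the ldrcc  pc, [tbl, scno, lsl #2] instruction above does, and control later returns through the ret_fast_syscall routine.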
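The bounds check followed by an indexed jump can be modelled in a few lines of C. Everything here is a toy: the table has two made-up entries and the names carry a demo_ prefix, but the control flow mirrors the cmp / ldrcc pair and the fall-through to a "not implemented" routine.

```c
#include <errno.h>

typedef long (*demo_syscall_fn)(long, long, long);

/* Toy handlers standing in for sys_* routines. */
static long demo_sys_ni(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOSYS;                 /* "not implemented" */
}

static long demo_sys_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

/* Toy system call table: the index is the system call number. */
static const demo_syscall_fn demo_table[] = { demo_sys_ni, demo_sys_add };
#define DEMO_NR_SYSCALLS (sizeof demo_table / sizeof demo_table[0])

static long demo_dispatch(unsigned long scno, long a, long b, long c)
{
    if (scno < DEMO_NR_SYSCALLS)            /* cmp   scno, #NR_syscalls      */
        return demo_table[scno](a, b, c);   /* ldrcc pc, [tbl, scno, lsl #2] */
    return demo_sys_ni(a, b, c);            /* out of range                  */
}
```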

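This is the place to continue the ABI discussion. Consider two configuration macros: CONFIG_OABI_COMPAT, which means compatibility with the old ABI, and CONFIG_AEABI, which selects EABI as the current convention. Both can be set, neither can be set, or just one of them. Let us see how the kernel handles this. sys_call_table is a jump table in the kernel that stores a series of function pointers, each pointing to a system call function such as sys_open. A system call is carried out by using the system call number (normally the table index) to find which kernel function to call, and then running that function.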
    
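For the old ABI, the kernel builds a separate system call table, sys_oabi_call_table. In compatibility mode there are therefore two system call tables: old ABI calls run the functions in sys_oabi_call_table, and EABI calls use the function pointers in sys_call_table.
There are four possible configurations:
First, both macros are set: the behaviour is as just described.
Second, only CONFIG_OABI_COMPAT is set: old ABI calls use sys_oabi_call_table and EABI calls use sys_call_table, which is essentially the same as the first case, only less explicit. (In the kernel's Kconfig, CONFIG_OABI_COMPAT depends on CONFIG_AEABI, so this combination is not normally selectable.)
Third, only CONFIG_AEABI is set: sys_oabi_call_table does not exist and old ABI calls are not supported. Only EABI calls are possible, and they use sys_call_table.
Fourth, neither is set: only old ABI calls are allowed, but sys_oabi_call_table does not exist, so the calls are completed through sys_call_table.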
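For the case where both macros are set, the decision made in vector_swi can be sketched as follows. The types and names are hypothetical; the logic follows the bics / eorne / ldrne sequence in the listing: a zero immediate means an EABI call with the number in r7, and anything else is an old ABI call whose number is the immediate with the 0x900000 base removed.

```c
#include <stdint.h>

#define DEMO_OABI_SYSCALL_BASE 0x900000u

enum demo_table { DEMO_EABI_TABLE, DEMO_OABI_TABLE };

struct demo_call {
    enum demo_table table;   /* which system call table to index */
    uint32_t scno;           /* index into that table            */
};

/* swi_instr: the SWI instruction word that trapped; r7: user r7 value. */
static struct demo_call demo_resolve(uint32_t swi_instr, uint32_t r7)
{
    struct demo_call c;
    uint32_t imm = swi_instr & 0x00ffffffu;   /* bics r10, r10, #0xff000000 */

    if (imm == 0) {
        c.table = DEMO_EABI_TABLE;            /* EABI: number already in r7 */
        c.scno  = r7;
    } else {
        c.table = DEMO_OABI_TABLE;            /* ldrne tbl, =sys_oabi_call_table */
        c.scno  = imm ^ DEMO_OABI_SYSCALL_BASE;   /* eorne scno, r10, #base */
    }
    return c;
}
```

So swi 0 with r7 = 5 and swi 0x900005 both end up requesting entry 5 (open), but from different tables.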

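Depending on the ABI, the base address of the matching system call table is loaded into the tbl register, that is r8. Now for the system call tables themselves. As mentioned, there are two, both in arch/arm/kernel/entry-common.S: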

/*
 * This is the syscall table declaration for native ABI syscalls.
 * With EABI a couple syscalls are obsolete and defined as sys_ni_syscall.
 */
#define ABI(native, compat) native
#ifdef CONFIG_AEABI
#define OBSOLETE(syscall) sys_ni_syscall
#else
#define OBSOLETE(syscall) syscall
#endif

    .type    sys_call_table, #object
ENTRY(sys_call_table)
#include "calls.S"
#undef ABI
#undef OBSOLETE

The other one is:

/*
 * Let's declare a second syscall table for old ABI binaries
 * using the compatibility syscall entries.
 */
#define ABI(native, compat) compat
#define OBSOLETE(syscall) syscall

    .type    sys_oabi_call_table, #object
ENTRY(sys_oabi_call_table)
#include "calls.S"
#undef ABI
#undef OBSOLETE

The two tables look almost identical: they differ only in how ABI() and OBSOLETE() expand. This use of the #include preprocessor directive is also interesting: the body of each system call table is the entire contents of arch/arm/kernel/calls.S. That file looks like this (it is too long to list in full):

/*
 *  linux/arch/arm/kernel/calls.S
 *
 *  Copyright (C) 1995-2005 Russell King
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 *  This file is included thrice in entry-common.S
 */
/* 0 */        CALL(sys_restart_syscall)
        CALL(sys_exit)
        CALL(sys_fork_wrapper)
        CALL(sys_read)
        CALL(sys_write)
/* 5 */        CALL(sys_open)
        CALL(sys_close)
        CALL(sys_ni_syscall)        /* was sys_waitpid */
        CALL(sys_creat)
        CALL(sys_link)
                ...

The CALL() macro is defined in the same file, arch/arm/kernel/entry-common.S:

    .equ NR_syscalls,0
#define CALL(x) .equ NR_syscalls,NR_syscalls+1
#include "calls.S"
#undef CALL
#define CALL(x) .long x
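The same two-pass trick can be tried in plain C. Because a single C file cannot include itself the way entry-common.S includes calls.S, this sketch keeps the list in a macro (DEMO_CALLS, a hypothetical name) and expands it twice with different definitions of CALL: once to count the entries and once to emit the table.

```c
typedef long (*demo_fn)(void);

/* Toy handlers; each returns its own index so the table can be checked. */
static long demo_restart(void) { return 0; }
static long demo_exit(void)    { return 1; }
static long demo_fork(void)    { return 2; }

/* Plays the role of calls.S: one CALL(...) per system call. */
#define DEMO_CALLS      \
    CALL(demo_restart)  \
    CALL(demo_exit)     \
    CALL(demo_fork)

/* Pass 1: every CALL adds one, like ".equ NR_syscalls,NR_syscalls+1". */
#define CALL(x) + 1
enum { DEMO_COUNT = 0 DEMO_CALLS };
#undef CALL

/* Pass 2: every CALL emits a table slot, like ".long x". */
#define CALL(x) x,
static const demo_fn demo_call_table[] = { DEMO_CALLS };
#undef CALL
```

Adding a line to DEMO_CALLS grows both DEMO_COUNT and the table, with no separate constant to keep in sync.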

One last remark: searching for sys_open will not turn up the definition of the open system call, because system call functions are defined through macros. For open, fs/open.c has:

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
{
    long ret;

    if (force_o_largefile())
        flags |= O_LARGEFILE;

    ret = do_sys_open(AT_FDCWD, filename, flags, mode);
    /* avoid REGPARM breakage on x86: */
    asmlinkage_protect(3, ret, filename, flags, mode);
    return ret;
}

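Back in vector_swi: if the system call number is out of range, control falls through to label 2. Numbers at or above __ARM_NR_BASE are passed to the arm_syscall function, defined in arch/arm/kernel/traps.c, and all others go to sys_ni_syscall: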

/*
 * Handle all unrecognised system calls.
 *  0x9f0000 - 0x9fffff are some more esoteric system calls
 */
#define NR(x) ((__ARM_NR_##x) - __ARM_NR_BASE)
asmlinkage int arm_syscall(int no, struct pt_regs *regs)
{
    struct thread_info *thread = current_thread_info();
    siginfo_t info;

    if ((no >> 16) != (__ARM_NR_BASE>> 16))
        return bad_syscall(no, regs);

    switch (no & 0xffff) {
    case 0: /* branch through 0 */
        info.si_signo = SIGSEGV;
        info.si_errno = 0;
        info.si_code  = SEGV_MAPERR;
        info.si_addr  = NULL;

        arm_notify_die("branch through zero", regs, &info, 0, 0);
        return 0;

    case NR(breakpoint): /* SWI BREAK_POINT */
        regs->ARM_pc -= thumb_mode(regs) ? 2 : 4;
        ptrace_break(current, regs);
        return regs->ARM_r0;

    /*
     * Flush a region from virtual address 'r0' to virtual address 'r1'
     * _exclusive_.  There is no alignment requirement on either address;
     * user space does not need to know the hardware cache layout.
     *
     * r2 contains flags.  It should ALWAYS be passed as ZERO until it
     * is defined to be something else.  For now we ignore it, but may
     * the fires of hell burn in your belly if you break this rule. ;)
     *
     * (at a later date, we may want to allow this call to not flush
     * various aspects of the cache.  Passing '0' will guarantee that
     * everything necessary gets flushed to maintain consistency in
     * the specified region).
     */
    case NR(cacheflush):
        do_cache_op(regs->ARM_r0, regs->ARM_r1, regs->ARM_r2);
        return 0;

    case NR(usr26):
        if (!(elf_hwcap & HWCAP_26BIT))
            break;
        regs->ARM_cpsr &= ~MODE32_BIT;
        return regs->ARM_r0;

    case NR(usr32):
        if (!(elf_hwcap & HWCAP_26BIT))
            break;
        regs->ARM_cpsr |= MODE32_BIT;
        return regs->ARM_r0;

    case NR(set_tls):
        thread->tp_value = regs->ARM_r0;
        if (tls_emu)
            return 0;
        if (has_tls_reg) {
            asm ("mcr p15, 0, %0, c13, c0, 3"
                : : "r" (regs->ARM_r0));
        } else {
            /*
             * User space must never try to access this directly.
             * Expect your app to break eventually if you do so.
             * The user helper at 0xffff0fe0 must be used instead.
             * (see entry-armv.S for details)
             */
            *((unsigned int *)0xffff0ff0) = regs->ARM_r0;
        }
        return 0;

#ifdef CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG
    /*
     * Atomically store r1 in *r2 if *r2 is equal to r0 for user space.
     * Return zero in r0 if *MEM was changed or non-zero if no exchange
     * happened.  Also set the user C flag accordingly.
     * If access permissions have to be fixed up then non-zero is
     * returned and the operation has to be re-attempted.
     *
     * *NOTE*: This is a ghost syscall private to the kernel.  Only the
     * __kuser_cmpxchg code in entry-armv.S should be aware of its
     * existence.  Don't ever use this from user code.
     */
    case NR(cmpxchg):
    for (;;) {
        extern void do_DataAbort(unsigned long addr, unsigned int fsr,
                     struct pt_regs *regs);
        unsigned long val;
        unsigned long addr = regs->ARM_r2;
        struct mm_struct *mm = current->mm;
        pgd_t *pgd; pmd_t *pmd; pte_t *pte;
        spinlock_t *ptl;

        regs->ARM_cpsr &= ~PSR_C_BIT;
        down_read(&mm->mmap_sem);
        pgd = pgd_offset(mm, addr);
        if (!pgd_present(*pgd))
            goto bad_access;
        pmd = pmd_offset(pgd, addr);
        if (!pmd_present(*pmd))
            goto bad_access;
        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte_present(*pte) || !pte_dirty(*pte)) {
            pte_unmap_unlock(pte, ptl);
            goto bad_access;
        }
        val = *(unsigned long *)addr;
        val -= regs->ARM_r0;
        if (val == 0) {
            *(unsigned long *)addr = regs->ARM_r1;
            regs->ARM_cpsr |= PSR_C_BIT;
        }
        pte_unmap_unlock(pte, ptl);
        up_read(&mm->mmap_sem);
        return val;

        bad_access:
        up_read(&mm->mmap_sem);
        /* simulate a write access fault */
        do_DataAbort(addr, 15 + (1 << 11), regs);
    }
#endif

    default:
        /* Calls 9f00xx..9f07ff are defined to return -ENOSYS
           if not implemented, rather than raising SIGILL.  This
           way the calling program can gracefully determine whether
           a feature is supported.  */
        if ((no & 0xffff) <= 0x7ff)
            return -ENOSYS;
        break;
    }
#ifdef CONFIG_DEBUG_USER
    /*
     * experience shows that these seem to indicate that
     * something catastrophic has happened
     */
    if (user_debug & UDBG_SYSCALL) {
        printk("[%d] %s: arm syscall %d\n",
               task_pid_nr(current), current->comm, no);
        dump_instr("", regs);
        if (user_mode(regs)) {
            __show_regs(regs);
            c_backtrace(regs->ARM_fp, processor_mode(regs));
        }
    }
#endif
    info.si_signo = SIGILL;
    info.si_errno = 0;
    info.si_code  = ILL_ILLTRP;
    info.si_addr  = (void __user *)instruction_pointer(regs) -
             (thumb_mode(regs) ? 2 : 4);

    arm_notify_die("Oops - bad syscall(2)", regs, &info, no, 0);
    return 0;
}

Then there is sys_ni_syscall, defined in kernel/sys_ni.c. All it appears to do is return the error code ENOSYS to user space:

/*  we can't #include <linux/syscalls.h> here,
    but tell gcc to not warn with -Wmissing-prototypes  */
asmlinkage long sys_ni_syscall(void);

/*
 * Non-implemented system calls get redirected here.
 */
asmlinkage long sys_ni_syscall(void)
{
    return -ENOSYS;
}

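Whether or not the system call number was valid, control finally returns through the ret_fast_syscall routine, also in arch/arm/kernel/entry-common.S (the exception is a traced call, which __sys_trace_return sends through ret_slow_syscall):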

    .align    5
/*
 * This is the fast syscall return path.  We do as little as
 * possible here, and this includes saving r0 back into the SVC
 * stack.
 */
ret_fast_syscall:
 UNWIND(.fnstart    )
 UNWIND(.cantunwind    )
    disable_irq                @ disable interrupts
    ldr    r1, [tsk, #TI_FLAGS]
    tst    r1, #_TIF_WORK_MASK
    bne    fast_work_pending
#if defined(CONFIG_IRQSOFF_TRACER)
    asm_trace_hardirqs_on
#endif

    /* perform architecture specific actions before user return */
    arch_ret_to_user r1, lr

    restore_user_regs fast = 1, offset = S_OFF
 UNWIND(.fnend        )

 

4 Macros for declaring system calls

The interface for defining system call functions in Linux:

1.SYSCALL_DEFINE1~6(include/linux/syscalls.h )

#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)

2.SYSCALL_DEFINEx

#ifdef CONFIG_FTRACE_SYSCALLS
#define SYSCALL_DEFINEx(x, sname, ...)                \
    static const char *types_##sname[] = {            \
        __SC_STR_TDECL##x(__VA_ARGS__)            \
    };                            \
    static const char *args_##sname[] = {            \
        __SC_STR_ADECL##x(__VA_ARGS__)            \
    };                            \
    SYSCALL_METADATA(sname, x);                \
    __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
#else
#define SYSCALL_DEFINEx(x, sname, ...)                \
    __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
#endif

3.__SYSCALL_DEFINEx

#ifdef CONFIG_HAVE_SYSCALL_WRAPPERS

#define SYSCALL_DEFINE(name) static inline long SYSC_##name

#define __SYSCALL_DEFINEx(x, name, ...)                    \
    asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__));        \
    static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__));    \
    asmlinkage long SyS##name(__SC_LONG##x(__VA_ARGS__))        \
    {                                \
        __SC_TEST##x(__VA_ARGS__);                \
        return (long) SYSC##name(__SC_CAST##x(__VA_ARGS__));    \
    }                                \
    SYSCALL_ALIAS(sys##name, SyS##name);                \
    static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__))

#else /* CONFIG_HAVE_SYSCALL_WRAPPERS */

#define SYSCALL_DEFINE(name) asmlinkage long sys_##name
#define __SYSCALL_DEFINEx(x, name, ...)                    \
    asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__))

#endif /* CONFIG_HAVE_SYSCALL_WRAPPERS */

4.Macros beginning with __SC_

#define __SC_DECL1(t1, a1)    t1 a1
#define __SC_DECL2(t2, a2, ...) t2 a2, __SC_DECL1(__VA_ARGS__)
#define __SC_DECL3(t3, a3, ...) t3 a3, __SC_DECL2(__VA_ARGS__)
#define __SC_DECL4(t4, a4, ...) t4 a4, __SC_DECL3(__VA_ARGS__)
#define __SC_DECL5(t5, a5, ...) t5 a5, __SC_DECL4(__VA_ARGS__)
#define __SC_DECL6(t6, a6, ...) t6 a6, __SC_DECL5(__VA_ARGS__)

#define __SC_LONG1(t1, a1)     long a1
#define __SC_LONG2(t2, a2, ...) long a2, __SC_LONG1(__VA_ARGS__)
#define __SC_LONG3(t3, a3, ...) long a3, __SC_LONG2(__VA_ARGS__)
#define __SC_LONG4(t4, a4, ...) long a4, __SC_LONG3(__VA_ARGS__)
#define __SC_LONG5(t5, a5, ...) long a5, __SC_LONG4(__VA_ARGS__)
#define __SC_LONG6(t6, a6, ...) long a6, __SC_LONG5(__VA_ARGS__)

#define __SC_CAST1(t1, a1)    (t1) a1
#define __SC_CAST2(t2, a2, ...) (t2) a2, __SC_CAST1(__VA_ARGS__)
#define __SC_CAST3(t3, a3, ...) (t3) a3, __SC_CAST2(__VA_ARGS__)
#define __SC_CAST4(t4, a4, ...) (t4) a4, __SC_CAST3(__VA_ARGS__)
#define __SC_CAST5(t5, a5, ...) (t5) a5, __SC_CAST4(__VA_ARGS__)
#define __SC_CAST6(t6, a6, ...) (t6) a6, __SC_CAST5(__VA_ARGS__)
...

5.Walking through SYSCALL_DEFINE1(close, unsigned int, fd)

By the definition #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__), SYSCALL_DEFINE1(close, unsigned int, fd)

reduces to SYSCALL_DEFINEx(1, _close, __VA_ARGS__) (## is the token-pasting operator), which by the definition of SYSCALL_DEFINEx

reduces to __SYSCALL_DEFINEx(1, _close, __VA_ARGS__), which by the definition of __SYSCALL_DEFINEx becomes:

#define __SYSCALL_DEFINEx(1, _close, ...)                \
    asmlinkage long sys_close(__SC_DECL1(__VA_ARGS__));        \
    static inline long SYSC_close(__SC_DECL1(__VA_ARGS__));    \
    asmlinkage long SyS_close(__SC_LONG1(__VA_ARGS__))        \
    {                            \
        __SC_TEST1(__VA_ARGS__);                \
        return (long) SYSC_close(__SC_CAST1(__VA_ARGS__));    \
    }                            \
    SYSCALL_ALIAS(sys_close, SyS_close);                \
    static inline long SYSC_close(__SC_DECL1(__VA_ARGS__))

Here __VA_ARGS__ stands for the variadic macro arguments, which in this case are unsigned int, fd.

Expanding the __SC_ macros gives:

#define __SYSCALL_DEFINEx(1, _close, ...)                \
    asmlinkage long sys_close(unsigned int fd);            \
    static inline long SYSC_close(unsigned int fd);        \
    asmlinkage long SyS_close(long fd)                \
    {                            \
        BUILD_BUG_ON(sizeof(unsigned int) > sizeof(long));    \
        return (long) SYSC_close((unsigned int)fd);        \
    }                            \
    SYSCALL_ALIAS(sys_close, SyS_close);                \
    static inline long SYSC_close(unsigned int fd)

This declares the function sys_close.

It defines the function SyS_close, whose body calls SYSC_close and returns its result.

The SYSCALL_ALIAS macro:

#define SYSCALL_ALIAS(alias, name)                    \
    asm ("\t.globl " #alias "\n\t.set " #alias ", " #name)

inserts assembly that makes sys_close an alias of SyS_close, so calling sys_close is the same as calling SyS_close. (# is the preprocessor's stringification operator.)

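BUILD_BUG_ON is a compile-time check: the build fails if its condition is true.

The last line is the start of the definition of SYSC_close,

which is why a SYSCALL_DEFINE1 invocation is immediately followed by a function body enclosed in {}.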

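6.From the analysis in 5 it follows that

the '1' in SYSCALL_DEFINE1 means that sys_close takes one parameter.

Likewise, the '?' in SYSCALL_DEFINE? means that sys_name takes '?' parameters.

7.System call functions are defined with the SYSCALL_DEFINE macros

and declared externally in the header include/linux/syscalls.h.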
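To see the wrapper-plus-body pattern in action outside the kernel, here is a cut-down, user-space imitation of the one-argument case. All DEMO_ names are hypothetical and the alias and BUILD_BUG_ON steps are left out; what remains is the part that matters for the calling convention: the outer function takes a long, casts it to the declared type, and hands it to the inline body that follows the macro.

```c
/* One-argument versions of the __SC_DECL / __SC_LONG / __SC_CAST helpers. */
#define DEMO_DECL1(t1, a1) t1 a1
#define DEMO_LONG1(t1, a1) long a1
#define DEMO_CAST1(t1, a1) (t1) a1

/* Declares the inline body, defines a long-taking wrapper that casts its
 * argument, then opens the definition of the body itself. */
#define DEMO_SYSCALL_DEFINE1(name, ...)                            \
    static inline long DEMO_SYSC_##name(DEMO_DECL1(__VA_ARGS__));  \
    long demo_sys_##name(DEMO_LONG1(__VA_ARGS__))                  \
    {                                                              \
        return (long) DEMO_SYSC_##name(DEMO_CAST1(__VA_ARGS__));   \
    }                                                              \
    static inline long DEMO_SYSC_##name(DEMO_DECL1(__VA_ARGS__))

/* The braces below become the body of DEMO_SYSC_low_byte. */
DEMO_SYSCALL_DEFINE1(low_byte, unsigned char, value)
{
    return value;
}
```

Because the wrapper casts its long argument to unsigned char before calling the body, demo_sys_low_byte(0x1234) yields 0x34.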

 

5 Adding a new system call

First, open arch/arm/kernel/calls.S and append a pointer to the new system call function at the end, for example:

CALL(sys_set_senda)

A side note on NR_syscalls: this constant is the total number of system calls. In newer kernels it is computed in arch/arm/kernel/entry-common.S:

   .equ NR_syscalls,0
#define CALL(x) .equ NR_syscalls,NR_syscalls+1
#include "calls.S"
#undef CALL
#define CALL(x) .long x

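Rather neat, isn't it? Each system call added to the table automatically increments NR_syscalls by one. NR_syscalls is computed first, and CALL(x) is then redefined, so building the system call table later in the file is unaffected.

Second, open include/asm-arm/unistd.h and add a macro for the system call number. This step can arguably be skipped, because the numbers defined here are mainly consumed by C libraries such as uClibc and Glibc. For example: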

    #define __NR_plan_set_senda             (__NR_SYSCALL_BASE+365)

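For backward compatibility, system calls can only be added, never removed, and new numbers must be assigned in sequence; otherwise the kernel will misbehave.

Third, implement the system call, that is, write the code for the newly added call, for example: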

SYSCALL_DEFINE1(set_senda, int,iset)
{
       if(iset)
          UART_PUT_CR(&at91_port[2],AT91C_US_SENDA);
       else
          UART_PUT_CR(&at91_port[2],AT91C_US_RSTSTA);

       return 0;
}

Fourth, open include/linux/syscalls.h and add the function declaration:

asmlinkage long sys_set_senda(int iset);

Fifth, invoke the system call from an application; the uClibc implementation can serve as a reference.

Sixth, done.
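Without patching the C library, the simplest way to reach the new call from an application is probably syscall(). The sketch below assumes an EABI user space, where __NR_SYSCALL_BASE is 0, so the number is the 365 used in the example macro above; on a kernel without this patch that number is either unassigned or means something else entirely, so treat the constant and both helper names as hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Assumption: EABI user space, number taken from the example macro above. */
#define DEMO_NR_set_senda 365L

/* Issue a one-argument system call by number. Returns -1 with errno set on
 * failure, following the usual wrapper convention. */
static long demo_call_by_number(long nr, int iset)
{
    return syscall(nr, iset);
}

/* Convenience wrapper for the hypothetical set_senda system call. */
long demo_set_senda(int iset)
{
    return demo_call_by_number(DEMO_NR_set_senda, iset);
}
```

On the patched kernel, demo_set_senda(1) should return 0; anywhere else the call is expected to fail, typically with errno set to ENOSYS for an unassigned number.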

