Notes on Linux Kernel Data Structures


Contents

1. Process-related data structures
    1) struct task_struct
    2) struct cred 
    3) struct pid_link
    4) struct pid 
    5) struct signal_struct 
    6) struct rlimit
2. Kernel queue/list objects
    1) singly-linked lists
    2) singly-linked tail queues
    3) doubly-linked lists
    4) doubly-linked tail queues
3. Kernel-module data structures
    1) struct module 
4. File-system data structures
    1) struct file
    2) struct inode 
    3) struct stat
    4) struct fs_struct 
    5) struct files_struct
    6) struct fdtable 
    7) struct dentry 
    8) struct vfsmount
    9) struct nameidata
    10) struct super_block
    11) struct file_system_type
5. Kernel-security data structures
    1) struct security_operations
    2) struct kprobe
    3) struct jprobe
    4) struct kretprobe
    5) struct kretprobe_instance 
    6) struct kretprobe_blackpoint, struct kprobe_blacklist_entry 
    7) struct linux_binprm
    8) struct linux_binfmt 
6. Network-state data structures
    1) struct ifconf
    2) struct ifreq 
    3) struct socket
    4) struct sock
    5) struct proto_ops
    6) struct inet_sock
    7) struct sockaddr     
7. Memory-management data structures
    1) struct mm_struct
    2) struct vm_area_struct
    3) struct pg_data_t
    4) struct zone
    5) struct page
8. Interrupt-related data structures
    1) struct irq_desc
    2) struct irq_chip
    3) struct irqaction
9. Inter-process communication (IPC) data structures
    1) struct ipc_namespace
    2) struct ipc_ids
    3) struct kern_ipc_perm
    4) struct sysv_sem
    5) struct sem_queue
    6) struct msg_queue 
    7) struct msg_msg
    8) struct msg_sender
    9) struct msg_receiver
    10) struct msqid_ds
10. Namespace-related data structures
    1) struct pid_namespace 
    2) struct pid, struct upid
    3) struct nsproxy
    4) struct mnt_namespace

 

1. Process-related data structures

0x0: the current macro

Windows describes a process's run state with the PCB (Process Control Block); correspondingly, Linux stores process information in the task_struct structure.

task_struct is defined in linux/sched.h (this header must be included in order to use the current macro).

Note that the current macro, ubiquitous in Linux kernel programming, yields a pointer to the current task_struct very cheaply. The macro is architecture-specific; in the common x86 case its implementation lives under arch/x86, and the other architectures follow the same pattern.

The mainstream architectures today are x86, ARM and MIPS. Before going on, let's briefly look at what an architecture is.

In computing, the term "architecture" describes an abstract machine rather than a concrete machine implementation. In general, a CPU architecture consists of an instruction set plus a set of registers; the terms "instruction set architecture" and "architecture" are used synonymously.

Architectures and features of the x86, MIPS and ARM CPUs

1. X86 
X86 uses a CISC instruction set. Of the CISC instructions, roughly 20% are used repeatedly and make up 80% of program code, while the remaining 80% of the instructions are rarely used and account for only 20% of programs.
    1.1 Bus Interface Unit (BIU)
    The bus interface unit consists of:
        1) four 16-bit segment registers (DS, ES, SS, CS)
        2) a 16-bit instruction pointer register (IP)
        3) a 20-bit physical address adder
        4) a 6-byte instruction queue (4 bytes on the 8088)
        5) the bus control circuitry, responsible for data transfers with memory and I/O ports   
    1.2 Execution Unit (EU)
    The execution unit fetches instructions from the instruction queue, decodes and executes them, and computes the 16-bit offset addresses of operands. It consists of:
        1) the ALU
        2) the register array (AX, BX, CX, DX, SI, DI, BP, SP)
        3) the flags register (PSW)
    1.3 Register structure 
        1) Data registers: AX, BX, CX and DX are 16-bit registers, each of which splits into a high byte (H) and a low byte (L), so AH, BH, CH, DH and AL, BL, CL, DL can be used as independent 8-bit registers. Both the 16-bit and 8-bit forms hold operands and intermediate results. A few instructions dedicate a particular register, e.g. the string instructions use CX as the element counter.
        2) Segment registers: CS, DS, SS, ES. The 8086/8088 forms its 20-bit physical address inside the CPU from two parts:
            2.1) the offset: SP, BP, SI and DI supply the low 16 bits of the 20-bit physical address
            2.2) the segment: the segment registers supply the high 16 bits; the four have fixed roles and cannot be interchanged
                2.2.1) CS: identifies the current code segment
                2.2.2) DS: identifies the current data segment
                2.2.3) SS: identifies the current stack segment
                2.2.4) ES: identifies the current extra segment
            Normally the program must initialize DS and ES itself.
        3) Control registers
            3.1) IP: the instruction pointer gives the offset of the next instruction to execute (the segment comes from CS)
            3.2) FLAG: the 16-bit flags register uses nine of its bits, in two groups:
                3.2.1) status flags: six bits recording state — CF, AF, OF, SF, PF and ZF — reflecting the result of the last ALU operation; to the user they are "read-only"
                3.2.2) control flags: three bits recording control information — the direction flag DF, the interrupt-enable flag IF and the trap flag TF — which can be set by instructions
2. MIPS
    1) all instructions are 32-bit encoded
    2) some instructions carry 26 bits of target address, others only 16, so loading an arbitrary 32-bit value takes two load instructions; a 16-bit target means a branch or subroutine must lie within 64K (32K either way)
    3) in principle every operation completes in one clock cycle, one action per stage
    4) 32 general-purpose registers, each 32 bits wide (64 bits on 64-bit machines)
    5) the MIPS architecture has no flags register to help with comparisons; conditional behavior is implemented by testing whether two registers are equal
    6) all arithmetic is 32-bit; there is no byte or halfword arithmetic (in MIPS a word is defined as 32 bits, a halfword as 16)
    7) there are no dedicated stack instructions; all stack operations are ordinary memory accesses, because push and pop are really compound operations combining a memory access with a stack-pointer adjustment
    8) because of the fixed instruction length, compiled MIPS binaries occupy more space than x86 ones (the average x86 instruction is a little over 3 bytes, versus 4 for MIPS)
    9) addressing: there is a single memory addressing mode, base register plus a 16-bit offset
    10) memory accesses must be strictly aligned (at least 4-byte alignment)
    11) jump instructions have a 26-bit target which, with 2 alignment bits, addresses a 28-bit (256MB) range
    12) conditional branches have a 16-bit target which, with 2 alignment bits, gives 18 bits of reach, i.e. 256K
    13) by default MIPS does not push a subroutine's return address (the address following the call instruction) onto the stack but stores it in register $31, which benefits leaf functions; nested calls are handled by a separate mechanism
    14) deep pipelining — the classic MIPS five-stage pipeline (every instruction passes through five stages):
        14.1) stage 1: fetch the instruction from the instruction buffer — one clock cycle
        14.2) stage 2: read the (up to two) source registers named in the instruction (each field is a number selecting one of $0-$31) — half a clock cycle
        14.3) stage 3: perform one arithmetic or logic operation — one clock cycle
        14.4) stage 4: read memory operands from the data buffer; on average about 3/4 of instructions do nothing in this stage, but it guarantees instruction ordering — one clock cycle
        14.5) stage 5: write the result back to a buffer or memory — half a clock cycle
    so one instruction occupies four clock cycles in total
3. ARM
ARM is a 32-bit reduced-instruction-set (RISC) processor architecture widely used in embedded system designs.
    1) RISC (Reduced Instruction Set Computer) characteristics:
        1.1) fixed-length, regular, simple instruction formats with 2-3 basic addressing modes
        1.2) single-cycle instructions, convenient for pipelined execution
        1.3) heavy use of registers: data-processing instructions operate only on registers, and only load/store instructions access memory, which raises execution efficiency
    2) the ARM architecture also adopts some special techniques that minimize die area and power consumption while preserving performance:
        2.1) every instruction can be conditionally executed based on preceding results, improving execution efficiency
        2.2) load/store-multiple instructions transfer data in batches, improving transfer efficiency
    3) register structure: an ARM processor has 37 registers arranged in banks, comprising:
        3.1) 31 general-purpose registers, including the program counter (PC), all 32 bits wide
        3.2) 6 status registers marking the CPU's working state and the program's run state, all 32 bits wide
    4) instruction structure: newer ARM architectures support two instruction sets, ARM and Thumb. ARM instructions are 32 bits long, Thumb instructions 16. The Thumb set is a functional subset of the ARM set, but compared with equivalent ARM code it saves 30%-40% or more storage space while keeping all the advantages of 32-bit code.

Now let's look at how the kernel code implements the current macro.

#ifndef __ASSEMBLY__
    struct task_struct;

    //Declares a per-CPU variable at compile time, placed in a special section; the prototype is DECLARE_PER_CPU(type, name), and it creates one variable of the given type, named name, for each processor
    DECLARE_PER_CPU(struct task_struct *, current_task);

    static __always_inline struct task_struct *get_current(void)
    {
        return percpu_read_stable(current_task);
    }

    #define current get_current()
    #endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_CURRENT_H */

Next, trace the percpu_read_stable() macro:

\linux-2.6.32.63\arch\x86\include\asm\percpu.h

#define percpu_read_stable(var)    percpu_from_op("mov", per_cpu__##var, "p" (&per_cpu__##var))

and then percpu_from_op():

/*
The percpu_from_op macro selects a branch by sizeof(var) and emits different code for each size; on 32-bit x86, sizeof(current_task) is 4.
Each branch is a single inline-assembly instruction in which __percpu_arg(1) expands to %%fs:%P1 (x86) or %%gs:%P1 (x86_64). After expansion, the code that fetches current boils down to:
1. x86:    asm("movl %%fs:%P1,%0" : "=r" (ret__) : "p" (&(var)))
2. x86_64: asm("movq %%gs:%P1,%0" : "=r" (ret__) : "p" (&(var)))
*/
#define percpu_from_op(op, var, constraint)        \
({                            \
    typeof(var) ret__;                \
    switch (sizeof(var)) {                \
    case 1:                        \
        asm(op "b "__percpu_arg(1)",%0"        \
            : "=q" (ret__)            \
            : constraint);            \
        break;                    \
    case 2:                        \
        asm(op "w "__percpu_arg(1)",%0"        \
            : "=r" (ret__)            \
            : constraint);            \
        break;                    \
    case 4:                        \
        asm(op "l "__percpu_arg(1)",%0"        \
            : "=r" (ret__)            \
            : constraint);            \
        break;                    \
    case 8:                        \
        asm(op "q "__percpu_arg(1)",%0"        \
            : "=r" (ret__)            \
            : constraint);            \
        break;                    \
    default: __bad_percpu_size();            \
    }                        \
    ret__;                        \
})

This moves the value at offset %P1 of the fs (or gs) segment into the ret__ variable.

Next, follow the definitions of these per-CPU variables:

linux-2.6.32.63\arch\x86\kernel\cpu\common.c

/*
The following four percpu variables are hot.  Align current_task to
cacheline size such that all four fall in the same cacheline.
*/
DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned = &init_task;
EXPORT_PER_CPU_SYMBOL(current_task);

DEFINE_PER_CPU(unsigned long, kernel_stack) = (unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;
EXPORT_PER_CPU_SYMBOL(kernel_stack);

DEFINE_PER_CPU(char *, irq_stack_ptr) = init_per_cpu_var(irq_stack_union.irq_stack) + IRQ_STACK_SIZE - 64;

DEFINE_PER_CPU(unsigned int, irq_count) = -1;

Now the key line of process kernel-stack initialization: DEFINE_PER_CPU(unsigned long, kernel_stack) = (unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;

//linux-2.6.32.63\arch\x86\kernel\init_task.c
/*
 * Initial task structure.
 *
 * All other task structs will be allocated on slabs in fork.c
 */
struct task_struct init_task = INIT_TASK(init_task);
EXPORT_SYMBOL(init_task);

/*
 * Initial thread structure.
 *
 * We need to make sure that this is THREAD_SIZE aligned due to the
 * way process stacks are handled. This is done by having a special
 * "init_task" linker map entry..
 */
union thread_union init_thread_union __init_task_data =
{ 
    INIT_THREAD_INFO(init_task) 
};

\linux-2.6.32.63\include\linux\init_task.h

/*
 *  INIT_TASK is used to set up the first task table, touch at
 * your own risk!. Base=0, limit=0x1fffff (=2MB)
 */
#define INIT_TASK(tsk)    \
{                                    \
    .state        = 0,                        \
    .stack        = &init_thread_info,                \
    .usage        = ATOMIC_INIT(2),                \
    .flags        = PF_KTHREAD,                    \
    .lock_depth    = -1,                        \
    .prio        = MAX_PRIO-20,                    \
    .static_prio    = MAX_PRIO-20,                    \
    .normal_prio    = MAX_PRIO-20,                    \
    .policy        = SCHED_NORMAL,                    \
    .cpus_allowed    = CPU_MASK_ALL,                    \
    .mm        = NULL,                        \
    .active_mm    = &init_mm,                    \
    .se        = {                        \
        .group_node     = LIST_HEAD_INIT(tsk.se.group_node),    \
    },                                \
    .rt        = {                        \
        .run_list    = LIST_HEAD_INIT(tsk.rt.run_list),    \
        .time_slice    = HZ,                     \
        .nr_cpus_allowed = NR_CPUS,                \
    },                                \
    .tasks        = LIST_HEAD_INIT(tsk.tasks),            \
    .pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \
    .ptraced    = LIST_HEAD_INIT(tsk.ptraced),            \
    .ptrace_entry    = LIST_HEAD_INIT(tsk.ptrace_entry),        \
    .real_parent    = &tsk,                        \
    .parent        = &tsk,                        \
    .children    = LIST_HEAD_INIT(tsk.children),            \
    .sibling    = LIST_HEAD_INIT(tsk.sibling),            \
    .group_leader    = &tsk,                        \
    .real_cred    = &init_cred,                    \
    .cred        = &init_cred,                    \
    .cred_guard_mutex =                        \
         __MUTEX_INITIALIZER(tsk.cred_guard_mutex),        \
    .comm        = "swapper",                    \
    .thread        = INIT_THREAD,                    \
    .fs        = &init_fs,                    \
    .files        = &init_files,                    \
    .signal        = &init_signals,                \
    .sighand    = &init_sighand,                \
    .nsproxy    = &init_nsproxy,                \
    .pending    = {                        \
        .list = LIST_HEAD_INIT(tsk.pending.list),        \
        .signal = {{0}}},                    \
    .blocked    = {{0}},                    \
    .alloc_lock    = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock),        \
    .journal_info    = NULL,                        \
    .cpu_timers    = INIT_CPU_TIMERS(tsk.cpu_timers),        \
    .fs_excl    = ATOMIC_INIT(0),                \
    .pi_lock    = __SPIN_LOCK_UNLOCKED(tsk.pi_lock),        \
    .timer_slack_ns = 50000, /* 50 usec default slack */        \
    .pids = {                            \
        [PIDTYPE_PID]  = INIT_PID_LINK(PIDTYPE_PID),        \
        [PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),        \
        [PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),        \
    },                                \
    .dirties = INIT_PROP_LOCAL_SINGLE(dirties),            \
    INIT_IDS                            \
    INIT_PERF_EVENTS(tsk)                        \
    INIT_TRACE_IRQFLAGS                        \
    INIT_LOCKDEP                            \
    INIT_FTRACE_GRAPH                        \
    INIT_TRACE_RECURSION                        \
    INIT_TASK_RCU_PREEMPT(tsk)                    \
}

Let's continue with the data structure closely tied to the process information:

\linux-2.6.32.63\include\linux\sched.h

/*
THREAD_SIZE is normally defined as 4K on 32-bit platforms, so the stack is 4KB in total — all the space the initial task owns inside the kernel. Subtracting what thread_info and KERNEL_STACK_OFFSET occupy leaves the stack space the task actually has.
KERNEL_STACK_OFFSET is defined as 5*8 = 40 bytes, reserved at the top of the stack for runtime environment parameters.
*/
union thread_union 
{ 
    struct thread_info thread_info; 
    unsigned long stack[THREAD_SIZE/sizeof(long)]; 
};

Having come this far, let's summarize:

1. Each process carves its own fixed-size region out of kernel memory for its kernel stack, like a slice cut from a cake
2. That per-process region is a thread_union, and it is split into two parts:
    1) the low addresses hold: thread_info 
    2) the remaining high addresses hold: the process's kernel stack (stack)
3. struct thread_info holds the current process's information, so at its core the current macro is nothing mysterious: it is essentially an address computation on the kernel stack

Relevant Link:

http://www.pagefault.info/?p=36
http://www.cnblogs.com/justinzhang/archive/2011/07/18/2109923.html

By merely inspecting the value of the kernel stack pointer — without additional memory lookups — the kernel can derive the address of the task_struct, which can then be used like a global variable.

0x1: struct task_struct

struct task_struct 
{
    /* 
    1. state: the process state changes as the process executes; it is the basis for scheduling and swapping. The main Linux states are:
        1) TASK_RUNNING: runnable. A process in this state is either:
            1.1) running — the currently running process is the one current points to
            1.2) ready to run — it needs only the CPU to start immediately; the run queue (run_queue) holds all runnable processes, and the scheduler picks the next one from it
        2) TASK_INTERRUPTIBLE: interruptible sleep, set for processes sleeping on an event or resource; when the kernel signals the process that the event has occurred, the state becomes TASK_RUNNING and the process resumes as soon as the scheduler selects it
        3) TASK_UNINTERRUPTIBLE: uninterruptible sleep. Such a process waits on some wait queue (wait_queue) for an event or resource the hardware cannot yet satisfy; it cannot be interrupted by external signals and can only be woken by the kernel itself, e.g. via wake_up()
        4) TASK_ZOMBIE: the process has terminated, but its parent has not yet called wait(), so its exit information has not been reaped; such processes are effectively garbage and must be cleaned up to release their resources
        5) TASK_STOPPED: execution temporarily suspended for special handling, typically after receiving SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU (e.g. a process being debugged)
        6) TASK_TRACED: essentially TASK_STOPPED, but distinguishes a process stopped by a debugger from an ordinarily stopped one
        7) TASK_DEAD: after the parent has issued wait, an exiting child enters TASK_DEAD while the parent reclaims all of its resources
        8) TASK_SWAPPING: being swapped in/out
    */
    volatile long state;

    /* 2. stack: the process kernel stack, allocated with alloc_thread_info and freed with free_thread_info */
    void *stack;

    /* 3. usage: descriptor usage count; 2 means the descriptor is in use and the process is active */
    atomic_t usage;

    /* 4. flags: the process's current flag bits (distinct from the run state):
        1) PF_ALIGNWARN  0x00000001: print alignment warnings
        2) PF_PTRACED    0x00000010: ptrace has been called on the process
        3) PF_TRACESYS   0x00000020: tracing system calls
        4) PF_FORKNOEXEC 0x00000040: forked but has not called exec yet
        5) PF_SUPERPRIV  0x00000100: used super-user (root) privileges
        6) PF_DUMPCORE   0x00000200: dumped core
        7) PF_SIGNALED   0x00000400: killed by a signal sent from another process
        8) PF_STARTING   0x00000002: currently being created
        9) PF_EXITING    0x00000004: currently shutting down
        10) PF_USEDFPU   0x00100000: process used the FPU this quantum (SMP only)
            PF_DTRACE    0x00200000: delayed trace (used on m68k)
    */
    unsigned int flags;

    /* 5. ptrace: 0 means the process is not traced; otherwise a combination of the PT_* flags
       from linux-2.6.38.8/include/linux/ptrace.h:
        1) PT_PTRACED         0x00000001
        2) PT_DTRACE          0x00000002: delayed trace (used on m68k, i386)
        3) PT_TRACESYSGOOD    0x00000004
        4) PT_PTRACE_CAP      0x00000008: ptracer can follow suid-exec
        5) PT_TRACE_FORK      0x00000010
        6) PT_TRACE_VFORK     0x00000020
        7) PT_TRACE_CLONE     0x00000040
        8) PT_TRACE_EXEC      0x00000080
        9) PT_TRACE_VFORK_DONE 0x00000100
        10) PT_TRACE_EXIT     0x00000200
    */
    unsigned int ptrace;
    unsigned long ptrace_message;
    siginfo_t *last_siginfo;

    /* 6. lock_depth: how many times the big kernel lock has been acquired; -1 if never */
    int lock_depth;

    /* 7. oncpu: helps implement unlocked context switches on SMP */
#ifdef CONFIG_SMP
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
    int oncpu;
#endif
#endif

    /* 8. scheduling
        1) prio: the priority the scheduler actually considers; kept separate because the kernel sometimes needs to boost a process temporarily, and such changes are not persistent, so the static (static_prio) and normal (normal_prio) priorities are unaffected
        2) static_prio: the "static" priority assigned when the process starts; changeable with nice or the sched_setscheduler system call, otherwise constant while the process runs
        3) normal_prio: computed from the static priority and the scheduling policy, so a normal process and a real-time process with the same static_prio get different normal_prio values; children inherit the normal priority on fork
    */
    int prio, static_prio, normal_prio;
    /*
        4) rt_priority: the real-time priority; note that real-time and normal priorities are two independent ranges — even the lowest real-time priority beats any normal process; the lowest real-time priority is 0, the highest 99, and larger means higher
    */
    unsigned int rt_priority;
    /*
        5) sched_class: the scheduling class this process belongs to; the kernel currently implements four:
            5.1) static const struct sched_class fair_sched_class;
            5.2) static const struct sched_class rt_sched_class;
            5.3) static const struct sched_class idle_sched_class;
            5.4) static const struct sched_class stop_sched_class;
    */
    const struct sched_class *sched_class;
    /*
        6) se: the scheduling entity for normal processes.
        The scheduler is not limited to scheduling processes; it can handle larger entities, which enables "group scheduling": available CPU time is first divided among groups of processes (e.g. grouped by owner) and then divided again within each group.
        This generality requires the scheduler to operate not on processes directly but on "schedulable entities", each identified by a sched_entity instance; in the simplest case scheduling acts on individual processes, so a sched_entity is embedded here and the scheduler manipulates each task_struct through it
    */
    struct sched_entity se;
    /* 7) rt: the scheduling entity for real-time processes */
    struct sched_rt_entity rt;

#ifdef CONFIG_PREEMPT_NOTIFIERS
    /* 9. preempt_notifiers: list of preempt_notifier structures */
    struct hlist_head preempt_notifiers;
#endif

    /* 10. fpu_counter: FPU usage counter */
    unsigned char fpu_counter;

#ifdef CONFIG_BLK_DEV_IO_TRACE
    /* 11. btrace_seq: blktrace is a tracing tool for the kernel's block-I/O layer */
    unsigned int btrace_seq;
#endif

    /* 12. policy: the scheduling policy, currently one of five:
        1) SCHED_NORMAL 0: normal processes, handled by the completely fair scheduler (CFS)
        2) SCHED_FIFO   1: first-come first-served, handled by the real-time class
        3) SCHED_RR     2: round-robin time slicing, handled by the real-time class
        4) SCHED_BATCH  3: non-interactive, CPU-intensive batch processes on CFS; such processes are deliberately de-prioritized and never preempt CFS-scheduled tasks, so they do not disturb interactive processes — the right choice when you do not want to lower a process's static priority with nice but also do not want it hurting interactivity
        5) SCHED_IDLE   5: minimal-weight background work, also on CFS; note that SCHED_IDLE does NOT schedule the idle task, for which the kernel provides a separate mechanism
       Only root can change the policy via the sched_setscheduler() system call.
    */
    unsigned int policy;

    /* 13. cpus_allowed: a bit field used on multiprocessor systems to control which CPUs the process may run on */
    cpumask_t cpus_allowed;

    /* 14. RCU synchronization primitives */
#ifdef CONFIG_TREE_PREEMPT_RCU
    int rcu_read_lock_nesting;
    char rcu_read_unlock_special;
    struct rcu_node *rcu_blocked_node;
    struct list_head rcu_node_entry;
#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
    /* 15. sched_info: run statistics kept by the scheduler for this process */
    struct sched_info sched_info;
#endif

    /* 16. tasks: links this task_struct into the kernel's global process list */
    struct list_head tasks;
    /* 17. pushable_tasks: limit pushing to one attempt */
    struct plist_node pushable_tasks;

    /* 18. process address space
        1) mm: the memory descriptor the process owns
        2) active_mm: the memory descriptor in use while the process runs
       For ordinary processes the two pointers are equal. Kernel threads own no memory descriptor, so their mm is always NULL; when a kernel thread runs, its active_mm is initialized to the active_mm of the previously running process
    */
    struct mm_struct *mm, *active_mm;

    /* 19. exit_state: process exit state code */
    int exit_state;

    /* 20. exit bookkeeping
        1) exit_code: the termination code — either the argument of _exit()/exit_group() (normal termination) or an error code supplied by the kernel (abnormal termination)
        2) exit_signal: set to -1 for a non-leader member of a thread group; a signal notifying the parent of the group leader is generated only when the last member of the group terminates
    */
    int exit_code, exit_signal;
    /* 3) pdeath_signal: the signal sent to this process when its parent terminates */
    int pdeath_signal;
    /* 4) personality: selects the ABI to emulate; possible values include PER_LINUX,
       PER_LINUX_32BIT, PER_LINUX_FDPIC, PER_SVR4, PER_SVR3, PER_SCOSVR3, PER_OSR5,
       PER_WYSEV386, PER_ISCR4, PER_BSD, PER_SUNOS, PER_XENIX, PER_LINUX32,
       PER_LINUX32_3GB, PER_IRIX32, PER_IRIXN32, PER_IRIX64, PER_RISCOS, PER_SOLARIS,
       PER_UW7, PER_OSF4, PER_HPUX (see include/linux/personality.h) */
    unsigned int personality;
    /* 5) did_exec: records whether the process has executed code via execve() */
    unsigned did_exec:1;
    /* 6) in_execve: tells the LSM that do_execve() is in progress */
    unsigned in_execve:1;
    /* 7) in_iowait: whether to count this task for iowait */
    unsigned in_iowait:1;
    /* 8) sched_reset_on_fork: whether to restore the default priority/policy on fork */
    unsigned sched_reset_on_fork:1;

    /* 21. process identifiers (PID)
       With CONFIG_BASE_SMALL configured to 0, PIDs range from 0 to 32767, i.e. at most 32768 processes:
           #define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)
       In Linux, all threads in a thread group use the PID of the group leader (the first lightweight process in the group), stored in tgid; only the group leader has pid set equal to tgid. Note that the getpid() system call returns the current process's tgid, not its pid.
    */
    pid_t pid;
    pid_t tgid;

#ifdef CONFIG_CC_STACKPROTECTOR
    /* 22. stack_canary: guards against kernel stack overflows; requires compiling the kernel with -fstack-protector */
    unsigned long stack_canary;
#endif

    /* 23. process relationships
        1) real_parent: the parent; points to the init process (PID 1) if the creating parent no longer exists
        2) parent: the process to signal on termination; usually equal to real_parent
    */
    struct task_struct *real_parent;
    struct task_struct *parent;
    /*
        3) children: head of the list whose elements are this process's children
        4) sibling: links this process into its parent's children list (the sibling list)
        5) group_leader: the leader of the process group this process belongs to
    */
    struct list_head children;
    struct list_head sibling;
    struct task_struct *group_leader;
    struct list_head ptraced;
    struct list_head ptrace_entry;
    struct bts_context *bts;

    /* 24. pids: PID hash table and list linkage */
    struct pid_link pids[PIDTYPE_MAX];
    /* 25. thread_group: list of all processes in the thread group */
    struct list_head thread_group;

    /* 26. do_fork support
        1) vfork_done: points to a special completion when do_fork() is given the corresponding flag
        2) set_child_tid / clear_child_tid: if copy_process is called with CLONE_CHILD_SETTID or CLONE_CHILD_CLEARTID in clone_flags, the child_tidptr argument is copied into these members; the kernel must then change the variable in the child's user-space address space that child_tidptr points to
    */
    struct completion *vfork_done;
    int __user *set_child_tid;
    int __user *clear_child_tid;

    /* 27. process I/O and time accounting
        1) utime: ticks spent in user mode
        2) stime: ticks spent in kernel mode
        3) utimescaled / 4) stimescaled: the same times, but scaled by the processor frequency
    */
    cputime_t utime, stime, utimescaled, stimescaled;
    /* 5) gtime: guest (virtual machine) run time in ticks */
    cputime_t gtime;
    /* 6) prev_utime, prev_stime: the previous run times */
    cputime_t prev_utime, prev_stime;
    /* 7) nvcsw: voluntary context-switch count
       8) nivcsw: involuntary context-switch count */
    unsigned long nvcsw, nivcsw;
    /* 9) start_time: process creation time
       10) real_start_time: creation time including time asleep; used for /proc/pid/stat */
    struct timespec start_time;
    struct timespec real_start_time;
    /* 11) cputime_expires: tracks the process's (or group's) CPU-timer expirations; its three members correspond to the three lists in cpu_timers[3] */
    struct task_cputime cputime_expires;
    struct list_head cpu_timers[3];
#ifdef CONFIG_DETECT_HUNG_TASK
    /* 12) last_switch_count: the sum of nvcsw and nivcsw */
    unsigned long last_switch_count;
#endif
    struct task_io_accounting ioac;
#if defined(CONFIG_TASK_XACCT)
    u64 acct_rss_mem1;
    u64 acct_vm_mem1;
    cputime_t acct_timexpd;
#endif

    /* 28. page-fault statistics (minor / major) */
    unsigned long min_flt, maj_flt;

    /* 29. process credentials */
    const struct cred *real_cred;
    const struct cred *cred;
    struct mutex cred_guard_mutex;
    struct cred *replacement_session_keyring;

    /* 30. comm: the name of the corresponding program */
    char comm[TASK_COMM_LEN];

    /* 31. files
        1) fs: the process's relationship with the file system, including its current and root directories
        2) files: the files the process currently has open
    */
    int link_count, total_link_count;
    struct fs_struct *fs;
    struct files_struct *files;

#ifdef CONFIG_SYSVIPC
    /* 32. sysvsem: process communication (SYSVIPC) */
    struct sysv_sem sysvsem;
#endif
    /* 33. thread: processor-specific state */
    struct thread_struct thread;
    /* 34. nsproxy: namespaces */
    struct nsproxy *nsproxy;

    /* 35. signal handling
        1) signal: the process's signal descriptor
        2) sighand: the process's signal-handler descriptor
    */
    struct signal_struct *signal;
    struct sighand_struct *sighand;
    /* 3) blocked: mask of blocked signals
       4) real_blocked: temporary mask */
    sigset_t blocked, real_blocked;
    sigset_t saved_sigmask;
    /* 5) pending: structure holding private pending signals */
    struct sigpending pending;
    /* 6) sas_ss_sp: address of the alternate signal-handler stack
       7) sas_ss_size: its size */
    unsigned long sas_ss_sp;
    size_t sas_ss_size;
    /* 8) notifier: device drivers often use this function to block certain signals for the process
       9) notifier_data: the data the notifier function may use
       10) notifier_mask: the bitmask identifying those signals */
    int (*notifier)(void *priv);
    void *notifier_data;
    sigset_t *notifier_mask;

    /* 36. process auditing */
    struct audit_context *audit_context;
#ifdef CONFIG_AUDITSYSCALL
    uid_t loginuid;
    unsigned int sessionid;
#endif
    /* 37. secure computing */
    seccomp_t seccomp;

    /* 38. used by copy_process when the CLONE_PARENT flag is set */
    u32 parent_exec_id;
    u32 self_exec_id;

    /* 39. alloc_lock: spinlock protecting resource allocation and release */
    spinlock_t alloc_lock;

    /* 40. interrupts */
#ifdef CONFIG_GENERIC_HARDIRQS
    struct irqaction *irqaction;
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
    unsigned int irq_events;
    int hardirqs_enabled;
    unsigned long hardirq_enable_ip;
    unsigned int hardirq_enable_event;
    unsigned long hardirq_disable_ip;
    unsigned int hardirq_disable_event;
    int softirqs_enabled;
    unsigned long softirq_disable_ip;
    unsigned int softirq_disable_event;
    unsigned long softirq_enable_ip;
    unsigned int softirq_enable_event;
    int hardirq_context;
    int softirq_context;
#endif

    /* 41. pi_lock: the lock used by the task_rq_lock function */
    spinlock_t pi_lock;
#ifdef CONFIG_RT_MUTEXES
    /* 42. mutex waiters under the PI (priority inheritance) protocol */
    struct plist_head pi_waiters;
    struct rt_mutex_waiter *pi_blocked_on;
#endif
#ifdef CONFIG_DEBUG_MUTEXES
    /* 43. blocked_on: deadlock detection */
    struct mutex_waiter *blocked_on;
#endif
    /* 44. lockdep */
#ifdef CONFIG_LOCKDEP
# define MAX_LOCK_DEPTH 48UL
    u64 curr_chain_key;
    int lockdep_depth;
    unsigned int lockdep_recursion;
    struct held_lock held_locks[MAX_LOCK_DEPTH];
    gfp_t lockdep_reclaim_gfp;
#endif

    /* 45. journal_info: journalling file systems (e.g. JFS) */
    void *journal_info;
    /* 46. block-device lists */
    struct bio *bio_list, **bio_tail;
    /* 47. reclaim_state: memory reclaim */
    struct reclaim_state *reclaim_state;
    /* 48. backing_dev_info: block-device I/O traffic information */
    struct backing_dev_info *backing_dev_info;
    /* 49. io_context: information used by the I/O scheduler */
    struct io_context *io_context;

    /* 50. cpusets */
#ifdef CONFIG_CPUSETS
    nodemask_t mems_allowed;
    int cpuset_mem_spread_rotor;
#endif
    /* 51. control groups */
#ifdef CONFIG_CGROUPS
    struct css_set *cgroups;
    struct list_head cg_list;
#endif
    /* 52. robust_list: futex synchronization */
#ifdef CONFIG_FUTEX
    struct robust_list_head __user *robust_list;
#ifdef CONFIG_COMPAT
    struct compat_robust_list_head __user *compat_robust_list;
#endif
    struct list_head pi_state_list;
    struct futex_pi_state *pi_state_cache;
#endif
#ifdef CONFIG_PERF_EVENTS
    struct perf_event_context *perf_event_ctxp;
    struct mutex perf_event_mutex;
    struct list_head perf_event_list;
#endif
    /* 53. NUMA (Non-Uniform Memory Access) */
#ifdef CONFIG_NUMA
    struct mempolicy *mempolicy;    /* Protected by alloc_lock */
    short il_next;
#endif
    /* 54. fs_excl: exclusive file-system resources held */
    atomic_t fs_excl;
    /* 55. rcu: RCU list */
    struct rcu_head rcu;
    /* 56. splice_pipe: pipes */
    struct pipe_inode_info *splice_pipe;
    /* 57. delays: delay accounting */
#ifdef CONFIG_TASK_DELAY_ACCT
    struct task_delay_info *delays;
#endif
    /* 58. make_it_fail: fault injection */
#ifdef CONFIG_FAULT_INJECTION
    int make_it_fail;
#endif
    /* 59. dirties: floating proportions */
    struct prop_local_single dirties;
    /* 60. infrastructure for displaying latency */
#ifdef CONFIG_LATENCYTOP
    int latency_record_count;
    struct latency_record latency_record[LT_SAVECOUNT];
#endif
    /* 61. timer slack values, used by poll() and select() */
    unsigned long timer_slack_ns;
    unsigned long default_timer_slack_ns;
    /* 62. scm_work_list: socket control messages */
    struct list_head *scm_work_list;
    /* 63. ftrace tracer */
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
    int curr_ret_stack;
    struct ftrace_ret_stack *ret_stack;
    unsigned long long ftrace_timestamp;
    atomic_t trace_overrun;
    atomic_t tracing_graph_pause;
#endif
#ifdef CONFIG_TRACING
    unsigned long trace;
    unsigned long trace_recursion;
#endif
};

Relevant Link:
http://oss.org.cn/kernel-book/ch04/4.3.htm
http://www.eecs.harvard.edu/~margo/cs161/videos/sched.h.html
http://memorymyann.iteye.com/blog/235363
http://blog.csdn.net/hongchangfirst/article/details/7075026
http://oss.org.cn/kernel-book/ch04/4.4.2.htm
http://blog.csdn.net/npy_lp/article/details/7335187
http://blog.csdn.net/npy_lp/article/details/7292563

0x2: struct cred

\linux-2.6.32.63\include\linux\cred.h

//Holds the credential (permission) information of a process
struct cred 
{
    atomic_t    usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
    atomic_t    subscribers;    /* number of processes subscribed */
    void        *put_addr;
    unsigned    magic;
#define CRED_MAGIC    0x43736564
#define CRED_MAGIC_DEAD    0x44656144
#endif
    uid_t        uid;        /* real UID of the task */
    gid_t        gid;        /* real GID of the task */
    uid_t        suid;        /* saved UID of the task */
    gid_t        sgid;        /* saved GID of the task */
    uid_t        euid;        /* effective UID of the task */
    gid_t        egid;        /* effective GID of the task */
    uid_t        fsuid;        /* UID for VFS ops */
    gid_t        fsgid;        /* GID for VFS ops */
    unsigned    securebits;    /* SUID-less security management */
    kernel_cap_t    cap_inheritable; /* caps our children can inherit */
    kernel_cap_t    cap_permitted;    /* caps we're permitted */
    kernel_cap_t    cap_effective;    /* caps we can actually use */
    kernel_cap_t    cap_bset;    /* capability bounding set */
#ifdef CONFIG_KEYS
    unsigned char    jit_keyring;    /* default keyring to attach requested
                     * keys to */
    struct key    *thread_keyring; /* keyring private to this thread */
    struct key    *request_key_auth; /* assumed request_key authority */
    struct thread_group_cred *tgcred; /* thread-group shared credentials */
#endif
#ifdef CONFIG_SECURITY
    void        *security;    /* subjective LSM security */
#endif
    struct user_struct *user;    /* real user ID subscription */
    struct group_info *group_info;    /* supplementary groups for euid/fsgid */
    struct rcu_head    rcu;        /* RCU deletion hook */
};

0x3: struct pid_link

/* PID/PID hash table linkage. */
struct pid_link pids[PIDTYPE_MAX];

/include/linux/pid.h

enum pid_type
{
    PIDTYPE_PID,
    PIDTYPE_PGID,
    PIDTYPE_SID,
    PIDTYPE_MAX
};

The structure is defined as follows:

struct pid_link
{
    struct hlist_node node;
    struct pid *pid;
};

/include/linux/types.h

struct hlist_node 
{
    struct hlist_node *next, **pprev;
};

0x4: struct pid

struct pid
{
    //1. reference count on this structure
    atomic_t count;

    /*
    2. the depth of this pid within the pid_namespace hierarchy
        1) level = 0
        means the global namespace, i.e. the top level 
    */
    unsigned int level;
    
    /* lists of tasks that use this pid */
    //3. each tasks[i] heads a hash chain; e.g. tasks[PIDTYPE_PID] chains the tasks using this as their PID
    struct hlist_head tasks[PIDTYPE_MAX];

    //4. RCU callback head used when freeing the structure
    struct rcu_head rcu;

    /*
    5. numbers[] is an array of upid structures; its purpose is to represent the different pid_namespaces, since one PID can belong to several namespaces
        1) numbers[0] is the global namespace
        2) numbers[i] is the level-i namespace
        3) the larger i is, the deeper (lower) the level
    The array is declared with a single element, the global namespace; when a pid lives in nested namespaces, the structure is allocated large enough to hold one upid entry per level
    */
    struct upid numbers[1];
};

Relevant Link:

http://blog.csdn.net/zhanglei4214/article/details/6765913

0x5: struct signal_struct

/*
NOTE! "signal_struct" does not have it's own locking, because a shared signal_struct always implies a shared sighand_struct, so locking sighand_struct is always a proper superset of the locking of signal_struct.
*/
struct signal_struct 
{
    atomic_t        count;
    atomic_t        live;

    /* for wait4() */
    wait_queue_head_t    wait_chldexit;    

    /* current thread group signal load-balancing target: */
    struct task_struct    *curr_target;

    /* shared signal handling: */
    struct sigpending    shared_pending;

    /* thread group exit support */
    int            group_exit_code;

    /* 
    overloaded:
    notify group_exit_task when ->count is equal to notify_count,everyone except group_exit_task is stopped during signal delivery of fatal signals, group_exit_task processes the signal.
    */
    int            notify_count;
    struct task_struct    *group_exit_task;

    /* thread group stop support, overloads group_exit_code too */
    int            group_stop_count;
    unsigned int        flags; /* see SIGNAL_* flags below */

    /* POSIX.1b Interval Timers */
    struct list_head posix_timers;

    /* ITIMER_REAL timer for the process */
    struct hrtimer real_timer;
    struct pid *leader_pid;
    ktime_t it_real_incr;

    /*
    ITIMER_PROF and ITIMER_VIRTUAL timers for the process, we use CPUCLOCK_PROF and CPUCLOCK_VIRT for indexing array as these values are defined to 0 and 1 respectively
    */
    struct cpu_itimer it[2];

    /*
    Thread group totals for process CPU timers. See thread_group_cputimer(), et al, for details.
    */
    struct thread_group_cputimer cputimer;

    /* Earliest-expiration cache. */
    struct task_cputime cputime_expires;

    struct list_head cpu_timers[3];

    struct pid *tty_old_pgrp;

    /* boolean value for session group leader */
    int leader;

    struct tty_struct *tty; /* NULL if no tty */

    /*
    Cumulative resource counters for dead threads in the group, and for reaped dead child processes forked by this group.
    Live threads maintain their own counters and add to these in __exit_signal, except for the group leader.
    */
    cputime_t utime, stime, cutime, cstime;
    cputime_t gtime;
    cputime_t cgtime;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
    cputime_t prev_utime, prev_stime;
#endif
    unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
    unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
    unsigned long inblock, oublock, cinblock, coublock;
    unsigned long maxrss, cmaxrss;
    struct task_io_accounting ioac;

    /*
    Cumulative ns of schedule CPU time of dead threads in the group, not including a zombie group leader. (This only differs from jiffies_to_ns(utime + stime) if sched_clock uses something other than jiffies.)
    */
    unsigned long long sum_sched_runtime;

    /*
    We don't bother to synchronize most readers of this at all, because there is no reader checking a limit that actually needs to get both rlim_cur and rlim_max atomically, 
    and either one alone is a single word that can safely be read normally.
    getrlimit/setrlimit use task_lock(current->group_leader) to protect this instead of the siglock, because they really have no need to disable irqs.
    struct rlimit 
    {
        rlim_t rlim_cur;     //soft limit: the process's current resource limit
        rlim_t rlim_max;    //hard limit: the maximum allowed value (ceiling for rlim_cur)    
    };
    rlim is an array with one entry per kind of resource limit; RLIM_NLIMITS is the number of limit types.
    Note that the hard limit is binding only for unprivileged processes, i.e. processes whose effective user ID is not 0
    */
    struct rlimit rlim[RLIM_NLIMITS];

#ifdef CONFIG_BSD_PROCESS_ACCT
    struct pacct_struct pacct;    /* per-process accounting information */
#endif
#ifdef CONFIG_TASKSTATS
    struct taskstats *stats;
#endif
#ifdef CONFIG_AUDIT
    unsigned audit_tty;
    struct tty_audit_buf *tty_audit_buf;
#endif

    int oom_adj;    /* OOM kill score adjustment (bit shift) */
};

Relevant Link:

http://blog.csdn.net/walkingman321/article/details/6167435

0x6: struct rlimit

\linux-2.6.32.63\include\linux\resource.h

struct rlimit 
{
    //Soft limit: the process's current resource limit
    unsigned long    rlim_cur;

    //Hard limit: the maximum allowed value (ceiling for rlim_cur)    
    unsigned long    rlim_max;
};

Linux provides a resource-limit (rlimit) mechanism that constrains the system resources a process may use; the mechanism is built on the rlim array in task_struct
The position within the rlim array identifies the type of resource being limited, which is why the kernel defines preprocessor constants associating each resource with a position. The constants and their meanings are:

1. RLIMIT_CPU: CPU time in seconds
Maximum amount of CPU time (in seconds); when the soft limit is exceeded, SIGXCPU is sent to the process

2. RLIMIT_FSIZE: Maximum file size
Maximum size in bytes of files the process may create; when the soft limit is exceeded, SIGXFSZ is sent to the process

3. RLIMIT_DATA: Maximum size of the data segment
Maximum size in bytes of the data segment

4. RLIMIT_STACK: Maximum stack size
Maximum size of the stack

5. RLIMIT_CORE: Maximum core file size
Caps the size of core files: a value of 0 disables core files entirely, any other value limits the core file produced to that size

6. RLIMIT_RSS: Maximum resident set size
Maximum resident set size (RSS) in bytes; if physical memory runs short, the kernel reclaims from the process the part exceeding its RSS

7. RLIMIT_NPROC: Maximum number of processes
Maximum number of child processes per real user ID; changing this limit affects the value sysconf returns for _SC_CHILD_MAX

8. RLIMIT_NOFILE: Maximum number of open files
Maximum number of files a process can hold open; changing this limit affects the value sysconf returns for _SC_OPEN_MAX

9. RLIMIT_MEMLOCK: Maximum locked-in-memory address space
The maximum number of bytes of virtual memory that may be locked into RAM using mlock() and mlockall(), i.e. the maximum number of non-swappable pages

10. RLIMIT_AS: Maximum address space size in bytes
The maximum size of the process virtual memory (address space) in bytes. This limit affects calls to brk(2), mmap(2) and mremap(2), which fail with the error ENOMEM upon exceeding this limit. Also automatic stack expansion will fail (and generate a SIGSEGV that kills the process when no alternate stack has been made available). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.

11. RLIMIT_LOCKS: Maximum file locks held
Maximum number of file locks

12. RLIMIT_SIGPENDING: Maximum number of pending signals
Maximum number of pending signals

13. RLIMIT_MSGQUEUE: Maximum bytes in POSIX mqueues
Maximum number of bytes in POSIX message queues

14. RLIMIT_NICE: Maximum nice prio allowed to raise to
Priority (nice level) of non-realtime processes

15. RLIMIT_RTPRIO: Maximum realtime priority
Maximum realtime priority

Because the limits touch many different parts of the kernel, each subsystem must check that the corresponding limit is respected. Note that when a resource class is unlimited (the default for almost all resources), rlim_max is set to RLIM_INFINITY. The exceptions include:

1. The number of open files (RLIMIT_NOFILE): defaults to 1024
2. The maximum number of processes per user (RLIMIT_NPROC): defined as max_threads / 2, where max_threads is a global variable specifying how many processes can be created if one eighth of available memory is used for thread-management information; the calculation assumes a minimum footprint of 20 threads

The init process is special in Linux; init's limits take effect at system boot
\linux-2.6.32.63\include\asm-generic\resource.h

/*
 * boot-time rlimit defaults for the init task:
 */
#define INIT_RLIMITS                            \
{                                    \
    [RLIMIT_CPU]        = {  RLIM_INFINITY,  RLIM_INFINITY },    \
    [RLIMIT_FSIZE]        = {  RLIM_INFINITY,  RLIM_INFINITY },    \
    [RLIMIT_DATA]        = {  RLIM_INFINITY,  RLIM_INFINITY },    \
    [RLIMIT_STACK]        = {       _STK_LIM,   _STK_LIM_MAX },    \
    [RLIMIT_CORE]        = {              0,  RLIM_INFINITY },    \
    [RLIMIT_RSS]        = {  RLIM_INFINITY,  RLIM_INFINITY },    \
    [RLIMIT_NPROC]        = {              0,              0 },    \
    [RLIMIT_NOFILE]        = {       INR_OPEN,       INR_OPEN },    \
    [RLIMIT_MEMLOCK]    = {    MLOCK_LIMIT,    MLOCK_LIMIT },    \
    [RLIMIT_AS]        = {  RLIM_INFINITY,  RLIM_INFINITY },    \
    [RLIMIT_LOCKS]        = {  RLIM_INFINITY,  RLIM_INFINITY },    \
    [RLIMIT_SIGPENDING]    = {         0,           0 },    \
    [RLIMIT_MSGQUEUE]    = {   MQ_BYTES_MAX,   MQ_BYTES_MAX },    \
    [RLIMIT_NICE]        = { 0, 0 },                \
    [RLIMIT_RTPRIO]        = { 0, 0 },                \
    [RLIMIT_RTTIME]        = {  RLIM_INFINITY,  RLIM_INFINITY },    \
}

In the proc filesystem every process has a corresponding file through which its current rlimit values can be inspected:
cat /proc/self/limits

 

2. Queue/list objects in the kernel
Four different kinds of list data structures exist in the kernel:
1. singly-linked lists
2. singly-linked tail queues
3. doubly-linked lists
4. doubly-linked tail queues
Linux kernel lists have the following characteristics:
1. Code is reused as much as possible: a pile of special-purpose list designs is replaced by a single generic list.

2. As we will see later, most lists in the kernel are "circular doubly linked lists", because they are the most efficient: finding the head, the tail, the direct predecessor, or the direct successor of a node are all O(1) operations, which a singly linked list, a singly linked circular list, or any other form of list cannot achieve.

3. To build a list of objects of a given type, embed a member of type struct list_head in their structure:
linux-2.6.32.63\include\linux\list.h
struct list_head { struct list_head *next, *prev; };
This member links the objects into the desired list, which is then manipulated through the generic list functions (the embedded list_head works like a hook that strings the original objects together). With this design, kernel developers only have to write the generic list functions once to build and operate lists of any kind of object, instead of writing special-purpose functions for every list of every object type; the code is reused.

4. To make a type listable, embed a list_head variable in it and use list_head's members and the corresponding helper functions to traverse the list

Now that we know the basic element of kernel lists, their design principles, and how they are composed, the next question is: how does the kernel initialize and use these structures? How are all those familiar lists actually formed?

The Linux kernel pairs these list structures with a set of operation macros and inline functions

linux-2.6.32.63\include\linux\list.h
1. List initialization
    1.1 LIST_HEAD_INIT
    #define LIST_HEAD_INIT(name) { &(name), &(name) }
    This macro initializes a list node by pointing both its head and tail pointers at the node itself

    1.2 LIST_HEAD
    #define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)
    As the code shows, LIST_HEAD defines the head of a doubly linked list and uses LIST_HEAD_INIT to initialize it, pointing both pointers at itself; hence in Linux a list is empty exactly when the head's next pointer points back to the head

    1.3 INIT_LIST_HEAD(struct list_head *list)
    Besides static initialization at compile time via the LIST_HEAD macro, the inline function INIT_LIST_HEAD(struct list_head *list) can initialize a list at run time
    static inline void INIT_LIST_HEAD(struct list_head *list)
    {
        list->next = list;
        list->prev = list;
    }
    Either way, the new list head's next and prev pointers are initialized to point at itself
2. Testing whether a list is empty
    2.1 list_empty(const struct list_head *head) 
    static inline int list_empty(const struct list_head *head)
    {
        return head->next == head;
    }

    2.2 list_empty_careful(const struct list_head *head)
    The difference from list_empty() is that this function checks both the successor and the predecessor of the head, returning true only if both point back to the head itself.
    This guards against the case where another CPU is manipulating the same list and next/prev are momentarily inconsistent. As the code comment admits, the guarantee is limited: unless the only list operation the other CPU performs is list_del_init(), it is still not safe, i.e. locking is still required
    static inline int list_empty_careful(const struct list_head *head)
    {
        struct list_head *next = head->next;
        return (next == head) && (next == head->prev);
    }
3. Inserting into a list
    3.1 list_add(struct list_head *new, struct list_head *head)
    Inserts a new node between head and head->next, i.e. head insertion (last in, first out; can be used to implement a stack)
    static inline void list_add(struct list_head *new, struct list_head *head)
    {
        __list_add(new, head, head->next);
    }

    3.2 list_add_tail(struct list_head *new, struct list_head *head)
    Inserts a new node between head->prev (the last node of the circular list) and head, i.e. tail insertion (first in, first out; can be used to implement a queue)
    static inline void list_add_tail(struct list_head *new, struct list_head *head)
    {
        __list_add(new, head->prev, head);
    }

    #ifndef CONFIG_DEBUG_LIST
    static inline void __list_add(struct list_head *new, struct list_head *prev, struct list_head *next)
    {
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
    }
    #else
    extern void __list_add(struct list_head *new, struct list_head *prev, struct list_head *next);
    #endif
4. Deleting from a list
    4.1 list_del(struct list_head *entry)
    #ifndef CONFIG_DEBUG_LIST
    static inline void list_del(struct list_head *entry)
    {
        /* __list_del(entry->prev, entry->next) links entry's predecessor directly to its successor (bypassing the middle element) */
        __list_del(entry->prev, entry->next);
        /*
        list_del() then sets the removed node's next and prev pointers to the special values LIST_POISON1 and LIST_POISON2, so that a node no longer on a list cannot be accessed: dereferencing LIST_POISON1 or LIST_POISON2 triggers a page fault
        */
        entry->next = LIST_POISON1;
        entry->prev = LIST_POISON2;
    }
    #else
    extern void list_del(struct list_head *entry);
    #endif

    4.2 list_del_init(struct list_head *entry)
    /*
    list_del_init first removes entry from the doubly linked list and then reinitializes entry as an empty list.
    The only difference between list_del(entry) and list_del_init(entry) is how entry itself is left: the former poisons it (unusable), the latter turns it into the head of an empty list
    */
    static inline void list_del_init(struct list_head *entry)
    {
        __list_del(entry->prev, entry->next);
        INIT_LIST_HEAD(entry);
    }
5. Replacing a node
    Replacement substitutes the new node for the old one
    5.1 list_replace(struct list_head *old, struct list_head *new)
    list_replace() only rewires the pointer relationships between new and old; the old node itself is not released
    static inline void list_replace(struct list_head *old, struct list_head *new)
    {
        new->next = old->next;
        new->next->prev = new;
        new->prev = old->prev;
        new->prev->next = new;
    }

    5.2 list_replace_init(struct list_head *old, struct list_head *new)
    static inline void list_replace_init(struct list_head *old, struct list_head *new)
    {
        list_replace(old, new);
        INIT_LIST_HEAD(old);
    }
6. Splitting a list
    6.1 list_cut_position(struct list_head *list, struct list_head *head, struct list_head *entry)
    Cuts all nodes from head (exclusive) up to and including entry out of the list and moves them onto list; afterwards there are two lists, head and list
    static inline void list_cut_position(struct list_head *list, struct list_head *head, struct list_head *entry)
    {
        if (list_empty(head))
            return;
        if (list_is_singular(head) && (head->next != entry && head != entry))
            return;
        if (entry == head)
            INIT_LIST_HEAD(list);
        else
            __list_cut_position(list, head, entry);
    }

    static inline void __list_cut_position(struct list_head *list, struct list_head *head, struct list_head *entry)
    {
        struct list_head *new_first = entry->next;
        list->next = head->next;
        list->next->prev = list;
        list->prev = entry;
        entry->next = list;
        head->next = new_first;
        new_first->prev = head;
    }
7. Traversing a list (important)
    7.1 list_entry
    A kernel list only stores the address of the list_head member embedded in a data item. The list_entry macro recovers the base address of the containing structure from the address of its embedded list_head member (think of structure member offsets: only once the base address is known can the offsets be used to reach other members and keep traversing).
    Here ptr is a node of a list; the macro yields the base address of the structure containing that node (note: for the head node this is the list head's container, not the first element; to get the first element you still have to advance one step)
    #define list_entry(ptr, type, member) container_of(ptr, type, member)

    7.2 list_first_entry
    Here ptr is the head node of a list; this macro yields the base address of the structure containing the list's "first element"
    #define list_first_entry(ptr, type, member) list_entry((ptr)->next, type, member)

    7.3 list_for_each(pos, head)
    Once the first element is reachable, the elements can be traversed.
    prefetch() prefetches memory contents: the program tells the CPU which data will probably be needed soon, so the CPU pulls it into the cache ahead of time; it is an optimization that speeds up execution
    #define list_for_each(pos, head) \
        for (pos = (head)->next; prefetch(pos->next), pos != (head); \
                pos = pos->next)

    7.4 __list_for_each(pos, head)
    __list_for_each does not use prefetch
    #define __list_for_each(pos, head) \
        for (pos = (head)->next; pos != (head); pos = pos->next)

    7.5 list_for_each_prev(pos, head)
    Implemented like list_for_each, except that it walks the predecessor pointers of head, i.e. it traverses the list backwards
    #define list_for_each_prev(pos, head) \
        for (pos = (head)->prev; prefetch(pos->prev), pos != (head); \
                pos = pos->prev)

    7.6 list_for_each_entry(pos, head, member)
    Traverses using the address of the containing structure instead of the raw list_head pointer
    #define list_for_each_entry(pos, head, member)                \
        for (pos = list_entry((head)->next, typeof(*pos), member);    \
             prefetch(pos->member.next), &pos->member != (head);    \
             pos = list_entry(pos->member.next, typeof(*pos), member))
Let us now walk through the queue/list structures we will encounter while studying the Linux kernel
0x1: The kernel LKM module list
We know that typing lsmod on the command line lists the LKM kernel modules currently loaded; let us see how this functionality can be implemented in kernel code
mod_ls.c:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/version.h>
#include <linux/list.h>
 
MODULE_LICENSE("Dual BSD/GPL");
 
struct module *m = &__this_module;
 
static void list_module_test(void)
{
        struct module *mod;
        list_for_each_entry(mod, m->list.prev, list)
                printk ("%s\n", mod->name);
 
}
static int list_module_init (void)
{
        list_module_test();
        return 0;
}
 
static void list_module_exit (void)
{
        printk ("unload listmodule.ko\n");
}
 
module_init(list_module_init);
module_exit(list_module_exit);

Makefile

#
# Variables needed to build the kernel module
#
name      = mod_ls

obj-m += $(name).o

all: build

.PHONY: build install clean

build:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules CONFIG_DEBUG_SECTION_MISMATCH=y

install: build
    -mkdir -p /lib/modules/`uname -r`/kernel/arch/x86/kernel/
    cp $(name).ko /lib/modules/`uname -r`/kernel/arch/x86/kernel/
    depmod /lib/modules/`uname -r`/kernel/arch/x86/kernel/$(name).ko

clean:
    [ -d /lib/modules/$(shell uname -r)/build ] && \
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

Build and load the module; the dmesg tail command then shows that our kernel code used list_for_each_entry to walk the doubly linked list of LKM kernel modules in the current kernel

0x2: The process list
trave_process.c
#include <linux/module.h> 
#include <linux/init.h> 
#include <linux/list.h> 
#include <linux/sched.h> 
#include <linux/time.h> 
#include <linux/fs.h> 
#include <asm/uaccess.h> 
#include <linux/mm.h> 


MODULE_AUTHOR( "Along" ) ; 
MODULE_LICENSE( "GPL" ) ; 

struct task_struct * task = NULL , * p = NULL ; 
struct list_head * pos = NULL ; 
struct timeval start, end; 
int count = 0; 

/* function_use selects which traversal method to test:
 * 0: all three methods,
 * 1: list_for_each,
 * 2: list_for_each_entry,
 * 3: for_each_process
 */ 
int function_use = 0; 
char * method; 
char * filename= "testlog" ; 

void print_message( void ) ; 
void writefile( char * filename, char * data ) ; 
void traversal_list_for_each( void ) ; 
void traversal_list_for_each_entry( void ) ; 
void traversal_for_each_process( void ) ; 


static int init_module_list( void ) 
{ 
    switch ( function_use) { 
        case 1: 
            traversal_list_for_each( ) ; 
            break ; 
        case 2: 
            traversal_list_for_each_entry( ) ; 
            break ; 
        case 3: 
            traversal_for_each_process( ) ; 
            break ; 
        default : 
            traversal_list_for_each( ) ; 
            traversal_list_for_each_entry( ) ; 
            traversal_for_each_process( ) ; 
            break ; 
    } 
    return 0; 
} 
static void exit_module_list( void ) 
{ 
    printk( KERN_ALERT "GOOD BYE!!\n" ) ; 
} 

module_init( init_module_list ) ; 
module_exit( exit_module_list ) ; 
module_param( function_use, int , S_IRUGO) ; 

void print_message( void ) 
{ 
    char * str1 = "the method is: " ; 
    char * str2 = "the system currently has " ; 
    char * str3 = " processes\n" ; 
    char * str4 = "start time: " ; 
    char * str5 = "\nend time: " ; 
    char * str6 = "\ninterval: " ; 
    char * str7 = "." ; 
    char * str8 = "ms" ; 
    char data[ 1024] ; 
    char tmp[ 50] ; 
    int cost; 

    printk( "the system currently has %d processes!!\n" , count ) ; 
    printk( "the method is : %s\n" , method) ; 
    printk( "start time: %10i.%06i\n" , ( int ) start. tv_sec, ( int ) start. tv_usec) ; 
    printk( "end time: %10i.%06i\n" , ( int ) end. tv_sec, ( int ) end. tv_usec) ; 
    printk( "interval: %10i\n" , ( int ) end. tv_usec- ( int ) start. tv_usec) ; 

    memset ( data, 0, sizeof ( data) ) ; 
    memset ( tmp, 0, sizeof ( tmp) ) ; 

    strcat ( data, str1) ; 
    strcat ( data, method) ; 
    strcat ( data, str2) ; 
    snprintf( tmp, sizeof ( tmp) , "%d" , count ) ;  /* sizeof(count) would truncate to 3 digits; use the buffer size */ 
    strcat ( data, tmp) ; 
    strcat ( data, str3) ; 
    strcat ( data, str4) ; 


    memset ( tmp, 0, sizeof ( tmp) ) ; 
    /*
     * The following way of converting the seconds is wrong: sizeof yields
     * the size of the int type (4 bytes), while the actual seconds value has
     * 10 digits, so only three characters would end up in tmp:
     * snprintf(tmp, sizeof((int)start.tv_sec),"%d",(int)start.tv_usec );
    */ 
    
    /* fetch the seconds and microseconds of the start time */ 

    snprintf( tmp, 10, "%d" , ( int ) start. tv_sec ) ; 
    strcat ( data, tmp) ; 
    snprintf( tmp, sizeof ( tmp) , "%s" , str7 ) ; 
    strcat ( data, tmp) ; 
    snprintf( tmp, 6, "%d" , ( int ) start. tv_usec ) ; 
    strcat ( data, tmp) ; 

    strcat ( data, str5) ; 
    
    /* fetch the seconds and microseconds of the end time */ 

    snprintf( tmp, 10, "%d" , ( int ) end. tv_sec ) ; 
    strcat ( data, tmp) ; 
    snprintf( tmp, sizeof ( tmp) , "%s" , str7 ) ; 
    strcat ( data, tmp) ; 
    snprintf( tmp, 6, "%d" , ( int ) end. tv_usec ) ; 
    strcat ( data, tmp) ; 

    /* Compute the time difference. Since we know this program's cost is on
     * the microsecond scale, we ignore the seconds and only take the
     * difference of the microsecond fields
     */ 
    strcat ( data, str6) ; 
    cost = ( int ) end. tv_usec- ( int ) start. tv_usec; 
    snprintf( tmp, sizeof ( tmp) , "%d" , cost ) ; 

    strcat ( data, tmp) ; 
    strcat ( data, str8) ; 
    strcat ( data, "\n\n" ) ; 

    writefile( filename, data) ; 
    printk( "%zu\n" , sizeof ( data) ) ; 
} 

void writefile( char * filename, char * data ) 
{ 
    struct file * filp; 
    mm_segment_t fs; 

    filp = filp_open( filename, O_RDWR| O_APPEND| O_CREAT, 0644) ; 
    if ( IS_ERR( filp) ) { 
        printk( "open file error...\n" ) ; 
        return ; 
    } 
    fs = get_fs( ) ; 
    set_fs( KERNEL_DS) ; 
    filp->f_op->write(filp, data, strlen ( data) , &filp->f_pos); 
    set_fs( fs) ; 
    filp_close( filp, NULL ) ; 
} 
void traversal_list_for_each( void ) 
{ 

    task = & init_task; 
    count = 0; 
    method= "list_for_each\n" ; 

    do_gettimeofday( & start) ; 
    list_for_each( pos, &task->tasks ) { 
        p = list_entry( pos, struct task_struct, tasks ) ; 
        count++ ; 
        printk( KERN_ALERT "%d\t%s\n" , p->pid, p->comm ) ; 
    } 
    do_gettimeofday( & end) ; 
    
    print_message( ) ; 
    
} 

void traversal_list_for_each_entry( void ) 
{ 

    task = & init_task; 
    count = 0; 
    method= "list_for_each_entry\n" ; 

    do_gettimeofday( & start) ; 
    list_for_each_entry( p, & task->tasks, tasks ) { 
        count++ ; 
        printk( KERN_ALERT "%d\t%s\n" , p->pid, p->comm ) ; 
    } 
    do_gettimeofday( & end) ; 

    print_message( ) ; 
} 

void traversal_for_each_process( void ) 
{ 
    count = 0; 
    method= "for_each_process\n" ; 

    do_gettimeofday( & start) ; 
    for_each_process( task) { 
        count++; 
        printk( KERN_ALERT "%d\t%s\n" , task->pid, task->comm ) ; 
    } 
    do_gettimeofday( & end) ; 
            
    print_message( ) ; 
} 

Makefile

#
## Variables needed to build the kernel module
#
#
name      = trave_process

obj-m += $(name).o

all: build

.PHONY: build install clean

build:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules CONFIG_DEBUG_SECTION_MISMATCH=y

install: build
    -mkdir -p /lib/modules/`uname -r`/kernel/arch/x86/kernel/
    cp $(name).ko /lib/modules/`uname -r`/kernel/arch/x86/kernel/
    depmod /lib/modules/`uname -r`/kernel/arch/x86/kernel/$(name).ko

clean:
    [ -d /lib/modules/$(shell uname -r)/build ] && \
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

After building, loading, and running the module, the processes currently present in the kernel can be enumerated by walking the process list (the task_struct list)

Relevant Link:
http://blog.csdn.net/tigerjibo/article/details/8299599
http://www.cnblogs.com/chengxuyuancc/p/3376627.html
http://blog.csdn.net/tody_guo/article/details/5447402 

 

3. Kernel-module-related data structures

0x0: The THIS_MODULE macro

Much like the current macro, the THIS_MODULE macro yields a pointer to the module's own struct module

\linux-2.6.32.63\include\linux\module.h

#ifdef MODULE
    #define MODULE_GENERIC_TABLE(gtype,name)            \
    extern const struct gtype##_id __mod_##gtype##_table        \
      __attribute__ ((unused, alias(__stringify(name))))
    
    extern struct module __this_module;
    #define THIS_MODULE (&__this_module)
#else  /* !MODULE */
    #define MODULE_GENERIC_TABLE(gtype,name)
    #define THIS_MODULE ((struct module *)0)
#endif

The __this_module symbol only comes into existence once the module has been loaded into the kernel. When insmod runs, it invokes the system call sys_init_module in kernel/module.c, which calls load_module to turn the module file passed in from user space into a kernel module and return a struct module; from then on, that structure represents the module inside the kernel. THIS_MODULE is thus analogous to current for processes
For an analysis of the sys_init_module and load_module kernel code, see this article:

http://www.cnblogs.com/LittleHann/p/3920387.html

0x1: struct module

The structure struct module represents a kernel module. When a module of your own is inserted with insmod (which actually performs the init_module system call), it becomes associated with a struct module and becomes part of the kernel; in other words, inside the kernel a module is represented by this structure (very similar to the KPROCESS/KTHREAD concepts on Windows, which shows once again that Windows and Linux often arrive at similar designs in kernel land)

struct module
{
    /*
    1. enum module_state state
    enum module_state
    {
        MODULE_STATE_LIVE,    //the module is loaded and in normal use (live) 
        MODULE_STATE_COMING,    //the module is currently being loaded
        MODULE_STATE_GOING,    //the module is currently being unloaded
    };
    After load_module has done part of the module creation work, it sets the state to MODULE_STATE_COMING
    After sys_init_module has finished all initialization (adding the module to the global module list and calling the module's own init function), it sets the state to MODULE_STATE_LIVE
    When the module is unloaded with the rmmod tool, the delete_module system call sets the state to MODULE_STATE_GOING
    This is a state maintained internally by the module
    */
    enum module_state state;

    /*
    2. struct list_head list
    list is the membership link: all kernel modules are kept on a global list whose head is the global variable struct module *modules. Every newly created module is added at the head of this list
    struct list_head 
    {
        struct list_head *next, *prev;
    };
    It is worth pausing here: lists are an important kernel mechanism, and many things, processes and modules included, are organized as lists. Because the list is doubly linked and circular, any modules->next lets us traverse and reach every element on it, which is exploited in many enumeration, hiding, and anti-hiding techniques
    */
    struct list_head list;
    
    /*
    3. char name[MODULE_NAME_LEN]
    name is the module's name, usually taken from the module file's name; it identifies the module
    */
    char name[MODULE_NAME_LEN];

    /*
    4. struct module_kobject mkobj
    This member is a structure defined as follows:
    struct module_kobject
    {
        4.1 struct kobject kobj
        kobj is a struct kobject.
        kobject is the basic building block of the device model, a new concept introduced in the 2.6 kernel: as device topology grew more complex and new features such as power management appeared, the kernel needed a general abstraction of the system, and the device model provides that abstraction.
        kobject started out as a simple reference count, but it has since grown many members; the tasks it handles and the code it supports include: object reference counting; sysfs representation; structural relationships; hotplug event handling. The structure is defined as:
        struct kobject
        {
            //k_name and name both hold the object's name; for the kobject embedded in a kernel module, the name is the module name
            const char *k_name;
            char name[KOBJ_NAME_LEN];

            /*
            kref is the kobject's reference count: a newly created kobject gets its count incremented when it is added to a kset (kobject_init), and incremented again when it is linked to its parent, so a fresh kobject always has a reference count of 2
            */
            struct kref kref;

            //entry is the list node linking together all kobjects of the same type within one subsystem
            struct list_head entry;

            //parent points to the next level up in the hierarchy; the parent of all kernel modules is "module"
            struct kobject *parent;

            /*
            kset is the set of kobjects embedded in structures of the same type. struct kset is defined as:
            struct kset
            {
                struct subsystem *subsys;
                struct kobj_type *ktype;
                struct list_head list;
                spinlock_t list_lock;
                struct kobject kobj;
                struct kset_uevent_ops * uevent_ops;
            };
            */
            struct kset *kset;

            //ktype describes the module's attributes, all of which show up in the kobject's sysfs directory
            struct kobj_type *ktype;

            //dentry is the associated filesystem node
            struct dentry *dentry;
        };

        //mod points back to the containing struct module
        struct module *mod;
    };
    */
    struct module_kobject mkobj;

    struct module_param_attrs *param_attrs;

    const char *version;
    const char *srcversion;

    /* Exported symbols */
    const struct kernel_symbol *syms;
    unsigned int num_syms;
    const unsigned long *crcs;

    /* GPL-only exported symbols. */
    const struct kernel_symbol *gpl_syms;
    unsigned int num_gpl_syms;
    const unsigned long *gpl_crcs;

    /* Exception table */
    unsigned int num_exentries;
    const struct exception_table_entry *extable;

    /* Startup function. */
    int (*init)(void);

    /* init/core sections */
    void *module_init;
    void *module_core;
    unsigned long init_size, core_size;
    unsigned long init_text_size, core_text_size;

    struct mod_arch_specific arch;

    int unsafe;
    int license_gplok;

#ifdef CONFIG_MODULE_UNLOAD
    struct module_ref ref[NR_CPUS];
    struct list_head modules_which_use_me;
    struct task_struct *waiter;
    void (*exit)(void);
#endif

#ifdef CONFIG_KALLSYMS
    Elf_Sym *symtab;
    unsigned long num_symtab;
    char *strtab;
    struct module_sect_attrs *sect_attrs;
#endif

    void *percpu;
    char *args;
};

As struct module shows, there are several ways to enumerate the module list from kernel mode:

1. struct module->list
2. struct module->mkobj.kobj.entry
3. struct module->mkobj.kobj.kset
//all three lead to the linked list of kernel modules

Relevant Link:

http://lxr.free-electrons.com/source/include/linux/module.h
http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/include/linux/module.h
http://blog.chinaunix.net/uid-9525959-id-2001630.html
http://blog.csdn.net/linweig/article/details/5044722

0x2: struct module_use

source/include/linux/module.h

/* modules using other modules: kdb wants to see this. */
struct module_use 
{
    struct list_head source_list;
    struct list_head target_list;
    struct module *source, *target;
};

"struct module_use" and "struct module->modules_which_use_me" together record and enforce the dependency relationships between kernel modules.
If module B uses functions provided by module A, then a relationship exists between A and B, which can be viewed from two sides:

1. Module B depends on module A
Module B cannot be loaded unless module A is already resident in kernel memory

2. Module B references module A
Module A cannot be removed from the kernel unless module B has been removed first; in the kernel this relationship is described as "module B uses module A"

For each module B that uses functions from module A, a module_use instance is created and added to the modules_which_use_me list in the module instance of A (the module depended upon); the instance points to module B's module instance.
Once the data-structure representation of inter-module dependencies is understood, it is easy to enumerate the dependency relationships of all modules

 

4. Filesystem-related data structures

0x1: struct file

The file structure represents an open file: every open file in the system has an associated struct file in kernel space. It is created by the kernel when the file is opened and passed to every function that operates on the file; once all instances of the file have been closed, the kernel releases the structure

struct file 
{
    /*
     * fu_list becomes invalid after file_free is called and queued via
     * fu_rcuhead for RCU freeing
     */
    union 
    {
        /*
        Defined in linux/include/linux/list.h: 
        struct list_head 
        {
            struct list_head *next, *prev;
        };
        Pointer for the generic file-object list: all open files form a list
        */
        struct list_head    fu_list;
        /*
        Defined in linux/include/linux/rcupdate.h:  
        struct rcu_head 
        {
            struct rcu_head *next;
            void (*func)(struct rcu_head *head);
        };
        RCU (Read-Copy Update) is a locking mechanism introduced in the Linux 2.6 kernel
        */
        struct rcu_head     fu_rcuhead;
    } f_u;
    
    /*
    Defined in linux/include/linux/namei.h:
    struct path 
    {
        /*
        struct vfsmount *mnt identifies the mounted filesystem the file lives in, i.e. a pointer to the VFS mount point
        */
        struct vfsmount *mnt;
        /*
        struct dentry *dentry is the directory entry object associated with the file, i.e. a pointer to the related dentry
        */
        struct dentry *dentry;
    };
    */
    struct path        f_path;
#define f_dentry    f_path.dentry
#define f_vfsmnt    f_path.mnt

    /*
    Pointer to the file operations table, defined in linux/include/linux/fs.h; it holds the operations associated with the file:
    struct file_operations
    {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *, fl_owner_t id);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *, int datasync);
        int (*aio_fsync) (struct kiocb *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*flock) (struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
        int (*setlease)(struct file *, long, struct file_lock **);
    };
    When a file is opened, the kernel creates a struct file associated with it, whose f_op points to the functions that actually operate on that file.
    For example, when a user invokes the read system call to read the file's contents, the call traps into the kernel as sys_read, which ultimately calls f_op->read of the struct file associated with the file
    */
    const struct file_operations *f_op;

    spinlock_t f_lock;  /* f_ep_links, f_flags, no IRQ */

    /*
    typedef struct {
        volatile int counter;
    } atomic_t;
    The volatile qualifier tells gcc not to optimize accesses to this type: every access goes to memory, not to a register.
    f_count is the reference count on the file object, i.e. how many processes are currently using this file
    */
    atomic_long_t f_count;

    /*
    The flags given when the file was opened, corresponding to the int flags argument of the open system call. Drivers check this field to support non-blocking operation
    */
    unsigned int f_flags;

    /*
    The read/write mode of the file, corresponding to the mode_t mode argument of the open system call. A driver that needs this value can read the field directly.
    mode_t is defined as:
    typedef unsigned int __kernel_mode_t;
    typedef __kernel_mode_t mode_t;
    */
    fmode_t f_mode;

    /*
    The current file position, i.e. the read/write offset.
    loff_t is defined as:
    typedef long long __kernel_loff_t;
    typedef __kernel_loff_t loff_t;
    */
    loff_t f_pos;

    /*
    struct fown_struct is defined in linux/include/linux/fs.h:
    struct fown_struct
    {
        rwlock_t lock;          /* protects pid, uid, euid fields */
        struct pid *pid;        /* pid or -pgrp where SIGIO should be sent */
        enum pid_type pid_type; /* kind of process group SIGIO should be sent to */
        uid_t uid, euid;        /* uid/euid of process setting the owner */
        int signum;             /* posix.1b rt signal to be delivered on IO */
    };
    This structure holds the data used to notify a process of I/O events via signals
    */
    struct fown_struct f_owner;

    const struct cred *f_cred;

    /*
    struct file_ra_state is defined in /linux/include/linux/fs.h:
    struct file_ra_state
    {
        pgoff_t start;             /* where readahead started */
        unsigned long size;        /* # of readahead pages */
        unsigned long async_size;  /* do asynchronous readahead when there are only # of pages ahead */
        unsigned long ra_pages;    /* maximum readahead window */
        unsigned long mmap_hit;    /* cache hit stat for mmap accesses */
        unsigned long mmap_miss;   /* cache miss stat for mmap accesses */
        unsigned long prev_index;  /* cache last read() position */
        unsigned int prev_offset;  /* offset where last read() ended in a page */
    };
    This structure records the file readahead state and is the main data structure of the readahead algorithm. When a file is opened, all fields of f_ra are zeroed except prev_page (default -1) and ra_pages (the maximum readahead allowed for the file)
    */
    struct file_ra_state f_ra;

    /* The file version number, automatically incremented after each use */
    u64 f_version;

#ifdef CONFIG_SECURITY
    /*
    If the kernel is built with security support, struct file contains a void *f_security field describing the security measures or recording security-related information
    */
    void *f_security;
#endif

    /*
    The kernel sets this pointer to NULL before calling the driver's open method. The driver is free to use this field for any purpose or to ignore it; it can point it at allocated data, but must then clear it in the release method before the kernel frees the file structure
    */
    void *private_data;

#ifdef CONFIG_EPOLL
    /*
    Used by fs/eventpoll.c to link together all the hooks attached to this file:
    1) f_ep_links is the head of the list of event-poll waiters on this file
    2) f_ep_lock is the spinlock protecting the f_ep_links list
    */
    struct list_head f_ep_links;
    struct list_head f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */

    /*
    struct address_space is defined in /linux/include/linux/fs.h; this is a pointer to the file's address space
    */
    struct address_space *f_mapping;

#ifdef CONFIG_DEBUG_WRITECOUNT
    unsigned long f_mnt_write_state;
#endif
};

Every file object is always included in one of the following circular doubly linked lists:

1. The list of "unused" file objects
This list serves both as a memory cache of file objects and as a reserve for the superuser: it allows the superuser to open files even when the system runs out of dynamic memory. Since these objects are unused, their f_count field is NULL; the address of the first element is stored in the variable free_list. The kernel must ensure that this list always contains at least NR_RESERVED_FILES objects, usually 10
2. The list of "in use" file objects
Every element on this list is used by at least one process, so their f_count fields are non-NULL; the address of the first element is stored in the variable anon_list
When the VFS needs to allocate a new file object it calls get_empty_filp(). That function checks whether the "unused" list has more than NR_RESERVED_FILES elements; if so, one of them is used for the newly opened file, otherwise it falls back to normal memory allocation (in other words, this is a caching mechanism)
Relevant Link:

http://linux.chinaunix.net/techdoc/system/2008/07/24/1020195.shtml
http://blog.csdn.net/fantasyhujian/article/details/9166117

0x2: struct inode

As we know, the Linux kernel represents an open file descriptor with a file structure, and the file itself with an inode structure

struct inode 
{    
    /*
    哈希表 
    */
    struct hlist_node    i_hash;

    /*
    索引節點鏈表(backing dev IO list)
    */
    struct list_head    i_list;     
    struct list_head    i_sb_list;

    /*
    目錄項鏈表
    */
    struct list_head    i_dentry;

    /*
    節點號
    */
    unsigned long        i_ino;

    /*
    引用記數
    */
    atomic_t        i_count;

    /*
    硬鏈接數
    */
    unsigned int        i_nlink;

    /*
    使用者id
    */
    uid_t            i_uid;

    /*
    使用者所在組id
    */
    gid_t            i_gid;

    /*
    實設備標識符
    */
    dev_t            i_rdev;

    /*
    版本號
    */
    u64            i_version;

    /*
    以字節為單位的文件大小
    */
    loff_t            i_size;
#ifdef __NEED_I_SIZE_ORDERED
    seqcount_t        i_size_seqcount;
#endif
    /*
    最后訪問時間
    */
    struct timespec        i_atime;

    /*
    最后修改(modify)時間
    */
    struct timespec        i_mtime;

    /*
    最后改變(change)時間
    */
    struct timespec        i_ctime;

    /*
    文件的塊數
    */
    blkcnt_t        i_blocks;

    /*
    以位為單位的塊大小
    */ 
    unsigned int        i_blkbits;
    
    /*
    使用的字節數
    */
    unsigned short          i_bytes;

    /*
    訪問權限控制
    */
    umode_t            i_mode;
    
    /*
    自旋鎖 
    */
    spinlock_t        i_lock;     
    struct mutex        i_mutex;

    /*
    索引節點信號量
    */
    struct rw_semaphore    i_alloc_sem;

    /*
    索引節點操作表
    索引節點的操作inode_operations定義在linux/fs.h
    struct inode_operations 
    {
        /*
        1. VFS通過系統調用create()和open()來調用該函數,從而為dentry對象創建一個新的索引節點。在創建時使用mode指定初始模式
        */
        int (*create) (struct inode *, struct dentry *,int); 
        /*
        2. 該函數在特定目錄中尋找索引節點,該索引節點要對應於dentry中給出的文件名
        */
        struct dentry * (*lookup) (struct inode *, struct dentry *); 
        /*
        3. 該函數被系統調用link()調用,用來創建硬鏈接。硬鏈接名稱由dentry參數指定,鏈接對象是dir目錄中old_dentry目錄項所代表的文件
        */
        int (*link) (struct dentry *, struct inode *, struct dentry *); 
        /*
        4. 該函數被系統調用unlink()調用,從目錄dir中刪除由目錄項dentry指定的索引節點對象
        */
        int (*unlink) (struct inode *, struct dentry *); 
        /*
        5. 該函數被系統調用symlink()調用,創建符號鏈接,該符號鏈接名稱由symname指定,鏈接對象是dir目錄中的dentry目錄項
        */
        int (*symlink) (struct inode *, struct dentry *, const char *); 
        /*
        6. 該函數被mkdir()調用,創建一個新目錄。創建時使用mode指定的初始模式
        */
        int (*mkdir) (struct inode *, struct dentry *, int); 
        /*
        7. 該函數被系統調用rmdir()調用,刪除dir目錄中的dentry目錄項代表的文件
        */
        int (*rmdir) (struct inode *, struct dentry *); 
        /*
        8. 該函數被系統調用mknod()調用,創建特殊文件(設備文件、命名管道或套接字)。要創建的文件放在dir目錄中,其目錄項為dentry,關聯的設備為rdev,初始權限由mode指定
        */
        int (*mknod) (struct inode *, struct dentry *, int, dev_t); 
        /*
        9. VFS調用該函數來移動文件。文件源路徑在old_dir目錄中,源文件由old_dentry目錄項所指定,目標路徑在new_dir目錄中,目標文件由new_dentry指定
        */
        int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); 
        /*
        10. 該函數被系統調用readlink()調用,拷貝數據到特定的緩沖buffer中。拷貝的數據來自dentry指定的符號鏈接,最大拷貝大小可達到buflen字節
        */
        int (*readlink) (struct dentry *, char *, int); 
        /*
        11. 該函數由VFS調用,從一個符號連接查找他指向的索引節點,由dentry指向的連接被解析
        */
        int (*follow_link) (struct dentry *, struct nameidata *); 
        /*
        12. 在follow_link()調用之后,該函數由VFS調用進行清理工作
        */
        int (*put_link) (struct dentry *, struct nameidata *); 
        /*
        13. 該函數由VFS調用,修改文件的大小,在調用之前,索引節點的i_size項必須被設置成預期的大小
        */
        void (*truncate) (struct inode *);
        
        /*
        該函數用來檢查inode所代表的文件是否允許特定的訪問模式,如果允許特定的訪問模式,返回0,否則返回負值的錯誤碼。多數文件系統都將此項設置為NULL,使用VFS提供的通用方法進行檢查,這種檢查操作僅僅比較索引節點對象中的訪問模式位是否和mask一致,比較復雜的系統,比如支持訪問控制鏈(ACL)的文件系統,需要使用特殊的permission()方法
        */ 
        int (*permission) (struct inode *, int); 

        /* 
        該函數被notify_change()調用,在修改索引節點之后,通知發生了改變事件
        */ 
        int (*setattr) (struct dentry *, struct iattr *); 

        /* 
        在通知索引節點需要從磁盤中更新時,VFS會調用該函數
        */ 
        int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *); 

        /* 
        該函數由VFS調用,向dentry指定的文件設置擴展屬性,屬性名為name,值為value 
        */ 
        int (*setxattr) (struct dentry *, const char *, const void *, size_t, int); 

        /* 
        該函數被VFS調用,向value中拷貝給定文件的擴展屬性name對應的數值
        */ 
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); 

        /* 
        該函數將特定文件的所有屬性列表拷貝到一個緩沖列表中
        */ 
        ssize_t (*listxattr) (struct dentry *, char *, size_t); 

        /* 
        該函數從給定文件中刪除指定的屬性
        */ 
        int (*removexattr) (struct dentry *, const char *); 
    }; 
    */ 
    const struct inode_operations *i_op; 

    /* 
    默認的索引節點操作 former ->i_op->default_file_ops 
    */ 
    const struct file_operations *i_fop; 

    /* 
    相關的超級塊
    */ 
    struct super_block *i_sb; 

    /* 
    文件鎖鏈表
    */ 
    struct file_lock *i_flock; 

    /* 
    相關的地址映射
    */ 
    struct address_space *i_mapping; 

    /* 
    設備地址映射
  address_space結構與文件的對應:一個具體的文件在打開后,內核會在內存中為之建立一個struct inode結構,其中的i_mapping域指向一個address_space結構。這樣,一個文件就對應一個address_space結構,一個 address_space與一個偏移量能夠確定一個page cache 或swap cache中的一個頁面。因此,當要尋址某個數據時,很容易根據給定的文件及數據在文件內的偏移量而找到相應的頁面
    */ 
    struct address_space i_data; 

#ifdef CONFIG_QUOTA 
    /* 
    節點的磁盤限額
    */ 
    struct dquot *i_dquot[MAXQUOTAS]; 
#endif 

    /* 
    塊設備鏈表
    */ 
    struct list_head i_devices; 

    union 
    {
        //管道信息
        struct pipe_inode_info *i_pipe; 
        //塊設備驅動
        struct block_device *i_bdev; 
        struct cdev *i_cdev; 
    }; 

    /* 
    索引節點版本號
    */ 
    __u32 i_generation; 

#ifdef CONFIG_FSNOTIFY 
    /* 
    目錄通知掩碼 all events this inode cares about 
    */ 
    __u32 i_fsnotify_mask; 
    struct hlist_head i_fsnotify_mark_entries;    /* fsnotify mark entries */ 
#endif 

#ifdef CONFIG_INOTIFY 
    struct list_head inotify_watches;    /* watches on this inode */ 
    struct mutex inotify_mutex;    /* protects the watches list */ 
#endif 

    /* 
    狀態標志
    */ 
    unsigned long i_state; 

    /* 
    首次修改時間 jiffies of first dirtying 
    */ 
    unsigned long dirtied_when; 

    /* 
    文件系統標志
    */ 
    unsigned int i_flags; 

    /* 
    寫者記數
    */ 
    atomic_t i_writecount; 

#ifdef CONFIG_SECURITY 
    /* 
    安全模塊
    */ 
    void *i_security; 
#endif 

#ifdef CONFIG_FS_POSIX_ACL 
    struct posix_acl *i_acl; 
    struct posix_acl *i_default_acl; 
#endif 

    void *i_private;    /* fs or device private pointer */ 
};

0x3: struct stat

struct stat在我們進行文件、目錄屬性讀寫的時候、磁盤IO狀態監控的時候常常會用到的數據結構

/*
struct stat  
{   
    dev_t       st_dev;     // ID of device containing file -文件所在設備的ID  
    ino_t       st_ino;     // inode number -inode節點號  
    mode_t      st_mode;    // protection - 文件類型和訪問權限  
    nlink_t     st_nlink;   // number of hard links -鏈向此文件的連接數(硬連接)   
    uid_t       st_uid;     // user ID of owner -user id 
    gid_t       st_gid;     // group ID of owner - group id 
    dev_t       st_rdev;    // device ID (if special file) -設備號,針對設備文件  
    off_t       st_size;    // total size, in bytes -文件大小,字節為單位  
    blksize_t   st_blksize; // blocksize for filesystem I/O -系統塊的大小   
    blkcnt_t    st_blocks;  // number of blocks allocated -文件所占塊數
    
    time_t      st_atime;   // time of last access - 最近存取時間  
    time_t      st_mtime;   // time of last modification - 最近修改時間  
    time_t      st_ctime;   // time of last status change - 最近狀態(inode)改變時間 
};  
*/

Relevant Link:

http://blog.sina.com.cn/s/blog_7943319e01018m4h.html
http://www.cnblogs.com/QJohnson/archive/2011/06/24/2089414.html
http://blog.csdn.net/tianmohust/article/details/6609470

Each process on the system has its own list of open files, root filesystem, current working directory, mount points, and so on. Three data structures tie together the VFS layer and the processes on the system: the files_struct, fs_struct, and namespace structures.

The second process-related structure is fs_struct, which contains filesystem information related to a process and is pointed at by the fs field in the process descriptor. The structure is defined in <linux/fs_struct.h>. Here it is, with comments:

0x4: struct fs_struct

文件系統相關信息結構體

struct fs_struct 
{
    atomic_t count;            //共享這個表的進程個數
    rwlock_t lock;            //用於表中字段的讀/寫自旋鎖
    int umask;            //當打開文件設置文件權限時所使用的位掩碼
    
    struct dentry * root;        //根目錄的目錄項 
    struct dentry * pwd;        //當前工作目錄的目錄項
    struct dentry * altroot;    //模擬根目錄的目錄項(在80x86結構上始終為NULL)

    struct vfsmount * rootmnt;    //根目錄所安裝的文件系統對象
    struct vfsmount* pwdmnt;    //當前工作目錄所安裝的文件系統對象  
    struct vfsmount* altrootmnt;    //模擬根目錄所安裝的文件系統對象(在80x86結構上始終為NULL)
};

0x5: struct files_struct

The files_struct is defined in <linux/file.h>. This table's address is pointed to by the files entry in the process descriptor. All per-process information about open files and file descriptors is contained therein. Here it is, with comments:

表示進程當前打開的文件,表的地址存放於進程描述符task_struct的files字段,每個進程用一個files_struct結構來記錄文件描述符的使用情況,這個files_struct結構稱為用戶打開文件表,它是進程的私有數據

struct files_struct 
{
    atomic_t count;                    //共享該表的進程數

    struct fdtable *fdt;                //指向fdtable結構的指針
    struct fdtable fdtab;                //指向fdtable結構

    spinlock_t file_lock ____cacheline_aligned_in_smp;
    int next_fd;                    //已分配的文件描述符加1
    struct embedded_fd_set close_on_exec_init;    //指向執行exec()時需要關閉的文件描述符
    struct embedded_fd_set open_fds_init;        //文件描述符的初值集合
    struct file * fd_array[NR_OPEN_DEFAULT];        //文件對象指針的初始化數組
};

0x6: struct fdtable

struct fdtable 
{
    unsigned int max_fds;
    int max_fdset;

    /* 
    current fd array 
    指向文件對象的指針數組。通常,fd字段指向files_struct結構的fd_array字段,該字段包括32個文件對象指針。如果進程打開的文件數目多於32,內核就分配一個新的、更大的文件指針數組,並將其地址存放在fd字段中,內核同時也更新max_fds字段的值
    對於在fd數組中所有元素的每個文件來說,數組的索引就是文件描述符(file descriptor)。通常,數組的第一個元素(索引為0)是進程的標准輸入文件,數組的第二個元素(索引為1)是進程的標准輸出文件,數組的第三個元素(索引為2)是進程的標准錯誤文件
    */ 
    struct file ** fd; 

    fd_set *close_on_exec; 
    fd_set *open_fds; 
    struct rcu_head rcu; 
    struct files_struct *free_files; 
    struct fdtable *next; 
}; 

#define NR_OPEN_DEFAULT BITS_PER_LONG 
#define BITS_PER_LONG 32 /* asm-i386 */

用一張圖表示task_struct、fs_struct、files_struct、fdtable、file的關系

Relevant Link:

http://oss.org.cn/kernel-book/ch08/8.2.4.htm
http://www.makelinux.net/books/lkd2/ch12lev1sec10

0x7: struct dentry

struct dentry 
{
    //目錄項引用計數器 
    atomic_t d_count;

    /*
    目錄項標志 protected by d_lock 
    #define DCACHE_AUTOFS_PENDING 0x0001    // autofs: "under construction"  
    #define DCACHE_NFSFS_RENAMED  0x0002    // this dentry has been "silly renamed" and has to be eleted on the last dput() 
    #define    DCACHE_DISCONNECTED 0x0004        //指定了一個dentry當前沒有連接到超級塊的dentry樹
    #define DCACHE_REFERENCED    0x0008      //Recently used, don't discard.  
    #define DCACHE_UNHASHED        0x0010        //該dentry實例沒有包含在任何inode的散列表中
    #define DCACHE_INOTIFY_PARENT_WATCHED    0x0020 // Parent inode is watched by inotify 
    #define DCACHE_COOKIE        0x0040        // For use by dcookie subsystem 
    #define DCACHE_FSNOTIFY_PARENT_WATCHED    0x0080 // Parent inode is watched by some fsnotify listener 
    */
    unsigned int d_flags;    

    //per dentry lock    
    spinlock_t d_lock;        

    //當前dentry對象表示一個裝載點,那么d_mounted設置為1,否則為0
    int d_mounted;

    /*
    文件名所屬的inode,如果為NULL,則表示不存在的文件名
    如果dentry對象是一個不存在的文件名建立的,則d_inode為NULL指針,這有助於加速查找不存在的文件名,通常情況下,這與查找實際存在的文件名同樣耗時
    */
    struct inode *d_inode;         
    /*
    The next three fields are touched by __d_lookup.  Place them here so they all fit in a cache line.
    */
    //用於查找的散列表 lookup hash list 
    struct hlist_node d_hash;    

    /*
    指向當前dentry實例的父目錄的dentry實例 parent directory
    當前的dentry實例即位於父目錄的d_subdirs鏈表中,對於根目錄(沒有父目錄),d_parent指向其自身的dentry實例
    */  
    struct dentry *d_parent;

    /*
    d_name指定了文件的名稱,qstr是一個內核字符串的包裝器,它存儲了實際的char*字符串以及字符串長度和散列值,這使得更容易處理查找工作
    要注意的是,這里並不存儲絕對路徑,而是只有路徑的最后一個分量,例如對/usr/bin/emacs只存儲emacs,因為在linux中,路徑信息隱含在了dentry層次鏈表結構中了
    */    
    struct qstr d_name;

    //LRU list
    struct list_head d_lru;        
    /*
     * d_child and d_rcu can share memory
     */
    union 
    {
        /* child of parent list */
        struct list_head d_child;
        //RCU回調鏈表元素,用於延遲釋放dentry,與d_child共用內存    
         struct rcu_head d_rcu;
    } d_u;

    //our children 子目錄/文件的目錄項鏈表
    struct list_head d_subdirs;    

    /*
    inode alias list 鏈表元素,用於將dentry連接到inode的i_dentry鏈表中 
    d_alias用作鏈表元素,以連接表示相同文件的各個dentry對象,在利用硬鏈接用兩個不同名稱表示同一文件時,會發生這種情況,對應於文件的inode的i_dentry成員用作該鏈表的表頭,各個dentry對象通過d_alias連接到該鏈表中
    */
    struct list_head d_alias;    

    //used by d_revalidate 
    unsigned long d_time;

    /*
    d_op指向一個結構,其中包含了各種函數指針,提供對dentry對象的各種操作,這些操作必須由底層文件系統實現
    struct dentry_operations 
    {
        //在把目錄項對象轉換為一個文件路徑名之前,判定該目錄項對象是否依然有效
        int (*d_revalidate)(struct dentry *, struct nameidata *);

        //生成一個散列值,用於目錄項散列表
        int (*d_hash) (struct dentry *, struct qstr *);
        
        //比較兩個文件名
        int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);

        //當對目錄項對象的最后一個引用被刪除,調用該方法
        int (*d_delete)(struct dentry *);

        //當要釋放一個目錄項對象時,調用該方法
        void (*d_release)(struct dentry *);

        //當一個目錄項對象丟失其相關聯的索引節點(變為負狀態)時,調用該方法
        void (*d_iput)(struct dentry *, struct inode *);
        char *(*d_dname)(struct dentry *, char *, int);
    };
    */        
    const struct dentry_operations *d_op;

    //The root of the dentry tree 指向dentry所屬文件系統超級塊的指針
    struct super_block *d_sb;    

    //fs-specific data 特定文件系統的數據
    void *d_fsdata;            

    /*
    短文件名small names存儲在這里
    如果文件名由少量字符組成,則只保存在d_iname中,而不是d_name中,用於加速訪問
    */ 
    unsigned char d_iname[DNAME_INLINE_LEN_MIN];    
};

Relevant Link:

http://blog.csdn.net/fudan_abc/article/details/1775313

0x8: struct vfsmount

struct vfsmount
{
    struct list_head mnt_hash;

    //裝載點所在的父文件系統的vfsmount結構 fs we are mounted on,文件系統之間的父子關系就是這樣實現的
    struct vfsmount *mnt_parent;    

    //裝載點在父文件系統中的dentry(即裝載點自身對應的dentry) dentry of mountpoint 
    struct dentry *mnt_mountpoint;    

    //當前文件系統的相對根目錄的dentry root of the mounted tree 
    struct dentry *mnt_root;    

    /*
    指向超級塊的指針 pointer to superblock 
    mnt_sb指針建立了與相關的超級塊之間的關聯(對每個裝載的文件系統而言,都有且只有一個超級塊實例)
    */
    struct super_block *mnt_sb;    

    //子文件系統鏈表 
    struct list_head mnt_mounts;    
    //鏈表元素,用於父文件系統中的mnt_mounts鏈表
    struct list_head mnt_child;    

    /*
    #define MNT_NOSUID    0x01 (禁止setuid執行)
    #define MNT_NODEV    0x02 (裝載的文件系統是虛擬的,沒有物理后端設備)
    #define MNT_NOEXEC    0x04
    #define MNT_NOATIME    0x08
    #define MNT_NODIRATIME    0x10
    #define MNT_RELATIME    0x20
    #define MNT_READONLY    0x40    // does the user want this to be r/o?  
    #define MNT_STRICTATIME 0x80
    #define MNT_SHRINKABLE    0x100 (專用於NFS、AFS 用來標記子裝載,設置了該標記的裝載允許自動移除)
    #define MNT_WRITE_HOLD    0x200
    #define MNT_SHARED    0x1000        // if the vfsmount is a shared mount (共享裝載)
    #define MNT_UNBINDABLE    0x2000    // if the vfsmount is a unbindable mount (不可綁定裝載)
    #define MNT_PNODE_MASK    0x3000    // propagation flag mask (傳播標志掩碼) 
    */
    int mnt_flags;
    /* 4 bytes hole on 64bits arches */

    //設備名稱,例如/dev/dsk/hda1 Name of device e.g. /dev/dsk/hda1 
    const char *mnt_devname;    
    struct list_head mnt_list;

    //鏈表元素,用於特定於文件系統的到期鏈表中 link in fs-specific expiry list 
    struct list_head mnt_expire;

    //鏈表元素,用於共享裝載的循環鏈表 circular list of shared mounts     
    struct list_head mnt_share;    

    //從屬裝載的鏈表 list of slave mounts 
    struct list_head mnt_slave_list;
    //鏈表元素,用於從屬裝載的鏈表 slave list entry 
    struct list_head mnt_slave;    

    //指向主裝載,從屬裝載位於master->mnt_slave_list鏈表上 slave is on master->mnt_slave_list 
    struct vfsmount *mnt_master;    

    //所屬的命名空間 containing namespace 
    struct mnt_namespace *mnt_ns;    
    int mnt_id;            /* mount identifier */
    int mnt_group_id;        /* peer group identifier */
    /*
    mnt_count實現了一個使用計數器,每當一個vfsmount實例不再需要時,都必須用mntput將計數器減1.mntget與mntput相對
    We put mnt_count & mnt_expiry_mark at the end of struct vfsmount to let these frequently modified fields in a separate cache line (so that reads of mnt_flags wont ping-pong on SMP machines)
    把mnt_count和mnt_expiry_mark放置在struct vfsmount的末尾,以便讓這些頻繁修改的字段與結構的主體處於兩個不同的緩存行中(這樣在SMP機器上讀取mnt_flags不會造成高速緩存的顛簸)
    */
    atomic_t mnt_count;

    //如果標記為到期,則其值為true true if marked for expiry 
    int mnt_expiry_mark;        
    int mnt_pinned;
    int mnt_ghosts;
#ifdef CONFIG_SMP
    int *mnt_writers;
#else
    int mnt_writers;
#endif
};

Relevant Link: 

http://www.cnblogs.com/Wandererzj/archive/2012/04/12/2444888.html

0x9: struct nameidata

路徑查找是VFS的一個很重要的操作:給定一個文件名,獲取該文件名的inode。路徑查找是VFS中相當繁瑣的一部分,主要是因為

1. 符號鏈接
一個文件可能通過符號鏈接引用另一個文件,查找代碼必須考慮到這種可能性,能夠識別出鏈接,並在相應的處理后跳出循環

2. 文件系統裝載點
必須檢測裝載點,而后據此重定向查找操作

3. 在通向目標文件名的路徑上,必須檢查所有目錄的訪問權限,進程必須有適當的權限,否則操作將終止,並給出錯誤信息

4. "."、".."和"//"等特殊路徑引入了復雜性

路徑查找過程涉及到很多函數調用,在這些調用過程中,nameidata起到了很重要的作用:

1. 向查找函數傳遞參數
2. 保存查找結果 

inode是類Unix系統的文件系統的基本索引方法,每個文件都對應一個inode,再通過inode找到文件中的實際數據,因此根據文件路徑名找到具體的inode節點就是一個很重要的處理步驟。系統會緩存用過的每個文件或目錄對應的dentry結構, 從該結構可以指向相應的inode, 每次打開文件, 都會最終對應到文件的inode,中間查找過程稱為namei

結構體定義如下

struct nameidata 
{
    /*
    用於確定文件路徑
    struct path 
    {
        struct vfsmount *mnt;
        struct dentry *dentry;
    };
    */
    struct path    path;

    //需要查找的名稱,這是一個快速字符串,除了路徑字符串本身外,還包含字符串的長度和一個散列值
    struct qstr    last;

    //查找時使用的根目錄
    struct path    root;
    unsigned int    flags;
    int        last_type;

    //當前路徑深度
    unsigned    depth;

    //由於在符號鏈接處理時,nd的名字一直發生變化,這里用來保存符號鏈接處理中的路徑名
    char *saved_names[MAX_NESTED_LINKS + 1];

    /* Intent data */
    union 
    {
        struct open_intent open;
    } intent;
};

Relevant Link:

http://man7.org/linux/man-pages/man7/path_resolution.7.html
http://blog.sina.com.cn/s/blog_4a2f24830100l2h4.html
http://blog.csdn.net/kickxxx/article/details/9529961
http://blog.csdn.net/air_snake/article/details/2690554
http://losemyheaven.blog.163.com/blog/static/17071980920124593256317/

0x10: struct super_block

/source/include/linux/fs.h

struct super_block 
{
    /* 
    Keep this first 
    指向超級塊鏈表的指針,用於將系統中所有的超級塊聚集到一個鏈表中,該鏈表的表頭是全局變量super_blocks
    */
    struct list_head    s_list;

    /* 
    search index; _not_ kdev_t 
    設備標識符
    */        
    dev_t            s_dev;        

    //以字節為單位的塊大小
    unsigned long        s_blocksize;

    //以位為單位的塊大小
    unsigned char        s_blocksize_bits;

    //修改臟標志,如果以任何方式改變了超級塊,需要向磁盤回寫,都會將s_dirt設置為1,否則為0
    unsigned char        s_dirt;

    //文件大小上限 Max file size
    loff_t            s_maxbytes;     

    //文件系統類型
    struct file_system_type    *s_type; 

    /*
    struct super_operations 
    {
        //給定的超級塊下創建和初始化一個新的索引節點對象; 
        struct inode *(*alloc_inode)(struct super_block *sb);

        //用於釋放給定的索引節點; 
        void (*destroy_inode)(struct inode *);

        //VFS在索引節點臟(被修改)時會調用此函數,日志文件系統(如ext3,ext4)執行該函數進行日志更新; 
        void (*dirty_inode) (struct inode *);

        //用於將給定的索引節點寫入磁盤,wait參數指明寫操作是否需要同步; 
        int (*write_inode) (struct inode *, struct writeback_control *wbc);

        //在最后一個指向索引節點的引用被釋放后,VFS會調用該函數。普通Unix文件系統一般不定義這個函數,此時VFS只是簡單地刪除這個索引節點;
        void (*drop_inode) (struct inode *);

        //用於從磁盤上刪除給定的索引節點; 
        void (*delete_inode) (struct inode *);

        //在卸載文件系統時由VFS調用,用來釋放超級塊,調用者必須一直持有s_lock鎖;
        void (*put_super) (struct super_block *);

        //用給定的超級塊更新磁盤上的超級塊。VFS通過該函數對內存中的超級塊和磁盤中的超級塊進行同步。調用者必須一直持有s_lock鎖; 
        void (*write_super) (struct super_block *);

        //使文件系統的元數據與磁盤上的文件系統同步。wait參數指定操作是否同步; 
        int (*sync_fs)(struct super_block *sb, int wait);
        int (*freeze_fs) (struct super_block *);
        int (*unfreeze_fs) (struct super_block *);

         //VFS通過調用該函數獲取文件系統狀態。指定文件系統相關的統計信息將放置在statfs中; 
        int (*statfs) (struct dentry *, struct kstatfs *);

        //當指定新的安裝選項重新安裝文件系統時,VFS會調用該函數。調用者必須一直持有s_lock鎖; 
        int (*remount_fs) (struct super_block *, int *, char *);

        //VFS調用該函數釋放索引節點,並清空包含相關數據的所有頁面; 
        void (*clear_inode) (struct inode *);

        //VFS調用該函數中斷安裝操作。該函數被網絡文件系統使用,如NFS; 
        void (*umount_begin) (struct super_block *);

        int (*show_options)(struct seq_file *, struct vfsmount *);
        int (*show_stats)(struct seq_file *, struct vfsmount *);
        #ifdef CONFIG_QUOTA
        ssize_t (*quota_read)(struct super_block *,
        int, char *, size_t, loff_t);
        ssize_t (*quota_write)(struct super_block *,
        int, const char *, size_t, loff_t);
        #endif
        int (*bdev_try_to_free_page)(struct super_block*,
        struct page*, gfp_t);
    };
    */
    const struct super_operations    *s_op;

    //磁盤限額方法
    const struct dquot_operations    *dq_op;

    //磁盤限額方法
    const struct quotactl_ops    *s_qcop;

    //導出方法
    const struct export_operations *s_export_op;

    //掛載標志 
    unsigned long        s_flags;

    //文件系統魔數
    unsigned long        s_magic;

    //目錄掛載點,s_root將超級塊與全局根目錄的dentry項關聯起來,只有通常可見的文件系統的超級塊,才指向/(根)目錄的dentry實例。具有特殊功能、不出現在通常的目錄層次結構中的文件系統(例如管道或套接字文件系統),指向專門的項,不能通過普通的文件命令訪問。處理文件系統對象的代碼經常需要檢查文件系統是否已經裝載,而s_root可用於該目的,如果它為NULL,則該文件系統是一個偽文件系統,只在內核內部可見。否則,該文件系統在用戶空間中是可見的
    struct dentry        *s_root;

    //卸載信號量
    struct rw_semaphore    s_umount;

    //超級塊信號量
    struct mutex        s_lock;

    //引用計數
    int            s_count;

    //尚未同步標志
    int            s_need_sync;

    //活動引用計數
    atomic_t        s_active;
#ifdef CONFIG_SECURITY
    //安全模塊
    void                    *s_security;
#endif
    struct xattr_handler    **s_xattr;

    //all inodes 
    struct list_head    s_inodes;    

    //匿名目錄項 anonymous dentries for (nfs) exporting 
    struct hlist_head    s_anon;        

    //被分配文件鏈表,列出了該超級塊表示的文件系統上所有打開的文件。內核在卸載文件系統時將參考該列表,如果其中仍然包含為寫入而打開的文件,則文件系統仍然處於使用中,卸載操作失敗,並將返回適當的錯誤信息
    struct list_head    s_files;

    /* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
    struct list_head    s_dentry_lru; 

    //unused dentry lru of dentry on lru 
    int            s_nr_dentry_unused;

    //指向了底層文件系統的數據所在的相關塊設備
    struct block_device    *s_bdev;
    struct backing_dev_info *s_bdi;
    struct mtd_info        *s_mtd;

    //該類型文件系統
    struct list_head    s_instances;

    //限額相關選項 Diskquota specific options 
    struct quota_info    s_dquot;     

    int            s_frozen;
    wait_queue_head_t    s_wait_unfrozen;

    //文本名字 Informational name 
    char s_id[32];                 

    //Filesystem private info 
    void             *s_fs_info;
    fmode_t            s_mode;

    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex;    /* Kludge */

    /* Granularity of c/m/atime in ns. Cannot be worse than a second 指定了文件系統支持的各種時間戳的最大可能的粒度 */
    u32           s_time_gran;

    /*
     * Filesystem subtype.  If non-empty the filesystem type field
     * in /proc/mounts will be "type.subtype"
     */
    char *s_subtype;

    /*
     * Saved mount options for lazy filesystems using
     * generic_show_options()
     */
    char *s_options;
};

Relevant Link:

http://linux.chinaunix.net/techdoc/system/2008/09/06/1030468.shtml
http://lxr.free-electrons.com/source/include/linux/fs.h

0x11: struct file_system_type

struct file_system_type 
{
    //文件系統的類型名,以字符串的形式出現,保存了文件系統的名稱(例如reiserfs、ext3)
    const char *name;

    /*
    使用的標志,指明具體文件系統的一些特性,有關標志定義於fs.h中
    #define FS_REQUIRES_DEV 1 
    #define FS_BINARY_MOUNTDATA 2
    #define FS_HAS_SUBTYPE 4
    #define FS_REVAL_DOT    16384    // Check the paths ".", ".." for staleness  
    #define FS_RENAME_DOES_D_MOVE    32768    // FS will handle d_move() during rename() internally. 
    */
    int fs_flags;

    //用於從底層存儲介質讀取超級塊的函數,地址保存在get_sb中,這個函數對裝載過程很重要,邏輯上,該函數依賴具體的文件系統,不能實現為抽象,而且該函數也不能保存在super_operations結構中,因為超級塊對象和指向該結構的指針都是在調用get_sb之后創建的
    int (*get_sb) (struct file_system_type *, int, const char *, void *, struct vfsmount *);

    //kill_sb在不再需要某個文件系統類型時執行清理工作
    void (*kill_sb) (struct super_block *);

    /*
    1. 如果file_system_type所代表的文件系統是通過可安裝模塊(LKM)實現的,則該指針指向代表着具體模塊的module結構
    2. 如果文件系統是靜態地鏈接到內核,則這個域為NULL
    實際上,我們只需要把這個域置為THIS_MODLUE(宏),它就能自動地完成上述工作 
    */    
    struct module *owner;

    //把所有的file_system_type結構鏈接成單向鏈表的鏈接指針,變量file_systems指向這個鏈表。這個鏈表是一個臨界資源,受file_systems_lock自旋讀寫鎖的保護
    struct file_system_type * next;

    /*
    對於每個已經裝載的文件系統,在內存中都創建了一個超級塊結構,該結構保存了文件系統它本身和裝載點的有關信息。由於可以裝載幾個同一類型的文件系統(例如home、root分區,它們的文件系統類型通常相同),同一文件系統類型可能對應了多個超級塊結構,這些超級塊聚集在一個鏈表中。fs_supers是對應的表頭
    這個域是Linux2.4.10以后的內核版本中新增加的,這是一個雙向鏈表。鏈表中的元素是超級塊結構,每個文件系統都有一個超級塊,但有些文件系統可能被安裝在不同的設備上,而且每個具體的設備都有一個超級塊,這些超級塊就形成一個雙向鏈表
    */
    struct list_head fs_supers;

    struct lock_class_key s_lock_key;
    struct lock_class_key s_umount_key;

    struct lock_class_key i_lock_key;
    struct lock_class_key i_mutex_key;
    struct lock_class_key i_mutex_dir_key;
    struct lock_class_key i_alloc_sem_key;
};

Relevant Link:

http://oss.org.cn/kernel-book/ch08/8.4.1.htm

 

5. 內核安全相關數據結構

0x1: struct security_operations

這是一個由鈎子函數指針組成的結構體,其中每一個成員都是一個LSM(Linux Security Module)安全鈎子函數,SELINUX正是基於這套鈎子實現的。在2.6以上的內核中,大部分涉及安全控制的系統調用都會經過這個結構體中對應的鈎子函數項,從而使SELINUX能在代碼執行流這個層面實現安全訪問控制

這個結構中包含了按照內核對象或內核子系統分組的鈎子組成的子結構,以及一些用於系統操作的頂層鈎子。在內核源代碼中很容易找到對鈎子函數的調用: 其前綴是security_ops->xxxx

struct security_operations 
{
    char name[SECURITY_NAME_MAX + 1];

    int (*ptrace_access_check) (struct task_struct *child, unsigned int mode);
    int (*ptrace_traceme) (struct task_struct *parent);
    int (*capget) (struct task_struct *target,
               kernel_cap_t *effective,
               kernel_cap_t *inheritable, kernel_cap_t *permitted);
    int (*capset) (struct cred *new,
               const struct cred *old,
               const kernel_cap_t *effective,
               const kernel_cap_t *inheritable,
               const kernel_cap_t *permitted);
    int (*capable) (struct task_struct *tsk, const struct cred *cred,
            int cap, int audit);
    int (*acct) (struct file *file);
    int (*sysctl) (struct ctl_table *table, int op);
    int (*quotactl) (int cmds, int type, int id, struct super_block *sb);
    int (*quota_on) (struct dentry *dentry);
    int (*syslog) (int type);
    int (*settime) (struct timespec *ts, struct timezone *tz);
    int (*vm_enough_memory) (struct mm_struct *mm, long pages);

    int (*bprm_set_creds) (struct linux_binprm *bprm);
    int (*bprm_check_security) (struct linux_binprm *bprm);
    int (*bprm_secureexec) (struct linux_binprm *bprm);
    void (*bprm_committing_creds) (struct linux_binprm *bprm);
    void (*bprm_committed_creds) (struct linux_binprm *bprm);

    int (*sb_alloc_security) (struct super_block *sb);
    void (*sb_free_security) (struct super_block *sb);
    int (*sb_copy_data) (char *orig, char *copy);
    int (*sb_kern_mount) (struct super_block *sb, int flags, void *data);
    int (*sb_show_options) (struct seq_file *m, struct super_block *sb);
    int (*sb_statfs) (struct dentry *dentry);
    int (*sb_mount) (char *dev_name, struct path *path,
             char *type, unsigned long flags, void *data);
    int (*sb_check_sb) (struct vfsmount *mnt, struct path *path);
    int (*sb_umount) (struct vfsmount *mnt, int flags);
    void (*sb_umount_close) (struct vfsmount *mnt);
    void (*sb_umount_busy) (struct vfsmount *mnt);
    void (*sb_post_remount) (struct vfsmount *mnt,
                 unsigned long flags, void *data);
    void (*sb_post_addmount) (struct vfsmount *mnt,
                  struct path *mountpoint);
    int (*sb_pivotroot) (struct path *old_path,
                 struct path *new_path);
    void (*sb_post_pivotroot) (struct path *old_path,
                   struct path *new_path);
    int (*sb_set_mnt_opts) (struct super_block *sb,
                struct security_mnt_opts *opts);
    void (*sb_clone_mnt_opts) (const struct super_block *oldsb,
                   struct super_block *newsb);
    int (*sb_parse_opts_str) (char *options, struct security_mnt_opts *opts);

#ifdef CONFIG_SECURITY_PATH
    int (*path_unlink) (struct path *dir, struct dentry *dentry);
    int (*path_mkdir) (struct path *dir, struct dentry *dentry, int mode);
    int (*path_rmdir) (struct path *dir, struct dentry *dentry);
    int (*path_mknod) (struct path *dir, struct dentry *dentry, int mode,
               unsigned int dev);
    int (*path_truncate) (struct path *path, loff_t length,
                  unsigned int time_attrs);
    int (*path_symlink) (struct path *dir, struct dentry *dentry,
                 const char *old_name);
    int (*path_link) (struct dentry *old_dentry, struct path *new_dir,
              struct dentry *new_dentry);
    int (*path_rename) (struct path *old_dir, struct dentry *old_dentry,
                struct path *new_dir, struct dentry *new_dentry);
#endif

    int (*inode_alloc_security) (struct inode *inode);
    void (*inode_free_security) (struct inode *inode);
    int (*inode_init_security) (struct inode *inode, struct inode *dir,
                    char **name, void **value, size_t *len);
    int (*inode_create) (struct inode *dir,
                 struct dentry *dentry, int mode);
    int (*inode_link) (struct dentry *old_dentry,
               struct inode *dir, struct dentry *new_dentry);
    int (*inode_unlink) (struct inode *dir, struct dentry *dentry);
    int (*inode_symlink) (struct inode *dir,
                  struct dentry *dentry, const char *old_name);
    int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode);
    int (*inode_rmdir) (struct inode *dir, struct dentry *dentry);
    int (*inode_mknod) (struct inode *dir, struct dentry *dentry,
                int mode, dev_t dev);
    int (*inode_rename) (struct inode *old_dir, struct dentry *old_dentry,
                 struct inode *new_dir, struct dentry *new_dentry);
    int (*inode_readlink) (struct dentry *dentry);
    int (*inode_follow_link) (struct dentry *dentry, struct nameidata *nd);
    int (*inode_permission) (struct inode *inode, int mask);
    int (*inode_setattr)    (struct dentry *dentry, struct iattr *attr);
    int (*inode_getattr) (struct vfsmount *mnt, struct dentry *dentry);
    void (*inode_delete) (struct inode *inode);
    int (*inode_setxattr) (struct dentry *dentry, const char *name,
                   const void *value, size_t size, int flags);
    void (*inode_post_setxattr) (struct dentry *dentry, const char *name,
                     const void *value, size_t size, int flags);
    int (*inode_getxattr) (struct dentry *dentry, const char *name);
    int (*inode_listxattr) (struct dentry *dentry);
    int (*inode_removexattr) (struct dentry *dentry, const char *name);
    int (*inode_need_killpriv) (struct dentry *dentry);
    int (*inode_killpriv) (struct dentry *dentry);
    int (*inode_getsecurity) (const struct inode *inode, const char *name, void **buffer, bool alloc);
    int (*inode_setsecurity) (struct inode *inode, const char *name, const void *value, size_t size, int flags);
    int (*inode_listsecurity) (struct inode *inode, char *buffer, size_t buffer_size);
    void (*inode_getsecid) (const struct inode *inode, u32 *secid);

    int (*file_permission) (struct file *file, int mask);
    int (*file_alloc_security) (struct file *file);
    void (*file_free_security) (struct file *file);
    int (*file_ioctl) (struct file *file, unsigned int cmd,
               unsigned long arg);
    int (*file_mmap) (struct file *file,
              unsigned long reqprot, unsigned long prot,
              unsigned long flags, unsigned long addr,
              unsigned long addr_only);
    int (*file_mprotect) (struct vm_area_struct *vma,
                  unsigned long reqprot,
                  unsigned long prot);
    int (*file_lock) (struct file *file, unsigned int cmd);
    int (*file_fcntl) (struct file *file, unsigned int cmd,
               unsigned long arg);
    int (*file_set_fowner) (struct file *file);
    int (*file_send_sigiotask) (struct task_struct *tsk,
                    struct fown_struct *fown, int sig);
    int (*file_receive) (struct file *file);
    int (*dentry_open) (struct file *file, const struct cred *cred);

    int (*task_create) (unsigned long clone_flags);
    int (*cred_alloc_blank) (struct cred *cred, gfp_t gfp);
    void (*cred_free) (struct cred *cred);
    int (*cred_prepare)(struct cred *new, const struct cred *old,
                gfp_t gfp);
    void (*cred_commit)(struct cred *new, const struct cred *old);
    void (*cred_transfer)(struct cred *new, const struct cred *old);
    int (*kernel_act_as)(struct cred *new, u32 secid);
    int (*kernel_create_files_as)(struct cred *new, struct inode *inode);
    int (*kernel_module_request)(void);
    int (*task_setuid) (uid_t id0, uid_t id1, uid_t id2, int flags);
    int (*task_fix_setuid) (struct cred *new, const struct cred *old,
                int flags);
    int (*task_setgid) (gid_t id0, gid_t id1, gid_t id2, int flags);
    int (*task_setpgid) (struct task_struct *p, pid_t pgid);
    int (*task_getpgid) (struct task_struct *p);
    int (*task_getsid) (struct task_struct *p);
    void (*task_getsecid) (struct task_struct *p, u32 *secid);
    int (*task_setgroups) (struct group_info *group_info);
    int (*task_setnice) (struct task_struct *p, int nice);
    int (*task_setioprio) (struct task_struct *p, int ioprio);
    int (*task_getioprio) (struct task_struct *p);
    int (*task_setrlimit) (unsigned int resource, struct rlimit *new_rlim);
    int (*task_setscheduler) (struct task_struct *p, int policy,
                  struct sched_param *lp);
    int (*task_getscheduler) (struct task_struct *p);
    int (*task_movememory) (struct task_struct *p);
    int (*task_kill) (struct task_struct *p,
              struct siginfo *info, int sig, u32 secid);
    int (*task_wait) (struct task_struct *p);
    int (*task_prctl) (int option, unsigned long arg2,
               unsigned long arg3, unsigned long arg4,
               unsigned long arg5);
    void (*task_to_inode) (struct task_struct *p, struct inode *inode);

    int (*ipc_permission) (struct kern_ipc_perm *ipcp, short flag);
    void (*ipc_getsecid) (struct kern_ipc_perm *ipcp, u32 *secid);

    int (*msg_msg_alloc_security) (struct msg_msg *msg);
    void (*msg_msg_free_security) (struct msg_msg *msg);

    int (*msg_queue_alloc_security) (struct msg_queue *msq);
    void (*msg_queue_free_security) (struct msg_queue *msq);
    int (*msg_queue_associate) (struct msg_queue *msq, int msqflg);
    int (*msg_queue_msgctl) (struct msg_queue *msq, int cmd);
    int (*msg_queue_msgsnd) (struct msg_queue *msq,
                 struct msg_msg *msg, int msqflg);
    int (*msg_queue_msgrcv) (struct msg_queue *msq,
                 struct msg_msg *msg,
                 struct task_struct *target,
                 long type, int mode);

    int (*shm_alloc_security) (struct shmid_kernel *shp);
    void (*shm_free_security) (struct shmid_kernel *shp);
    int (*shm_associate) (struct shmid_kernel *shp, int shmflg);
    int (*shm_shmctl) (struct shmid_kernel *shp, int cmd);
    int (*shm_shmat) (struct shmid_kernel *shp,
              char __user *shmaddr, int shmflg);

    int (*sem_alloc_security) (struct sem_array *sma);
    void (*sem_free_security) (struct sem_array *sma);
    int (*sem_associate) (struct sem_array *sma, int semflg);
    int (*sem_semctl) (struct sem_array *sma, int cmd);
    int (*sem_semop) (struct sem_array *sma,
              struct sembuf *sops, unsigned nsops, int alter);

    int (*netlink_send) (struct sock *sk, struct sk_buff *skb);
    int (*netlink_recv) (struct sk_buff *skb, int cap);

    void (*d_instantiate) (struct dentry *dentry, struct inode *inode);

    int (*getprocattr) (struct task_struct *p, char *name, char **value);
    int (*setprocattr) (struct task_struct *p, char *name, void *value, size_t size);
    int (*secid_to_secctx) (u32 secid, char **secdata, u32 *seclen);
    int (*secctx_to_secid) (const char *secdata, u32 seclen, u32 *secid);
    void (*release_secctx) (char *secdata, u32 seclen);

    int (*inode_notifysecctx)(struct inode *inode, void *ctx, u32 ctxlen);
    int (*inode_setsecctx)(struct dentry *dentry, void *ctx, u32 ctxlen);
    int (*inode_getsecctx)(struct inode *inode, void **ctx, u32 *ctxlen);

#ifdef CONFIG_SECURITY_NETWORK
    int (*unix_stream_connect) (struct socket *sock,
                    struct socket *other, struct sock *newsk);
    int (*unix_may_send) (struct socket *sock, struct socket *other);

    int (*socket_create) (int family, int type, int protocol, int kern);
    int (*socket_post_create) (struct socket *sock, int family,
                   int type, int protocol, int kern);
    int (*socket_bind) (struct socket *sock,
                struct sockaddr *address, int addrlen);
    int (*socket_connect) (struct socket *sock,
                   struct sockaddr *address, int addrlen);
    int (*socket_listen) (struct socket *sock, int backlog);
    int (*socket_accept) (struct socket *sock, struct socket *newsock);
    int (*socket_sendmsg) (struct socket *sock,
                   struct msghdr *msg, int size);
    int (*socket_recvmsg) (struct socket *sock,
                   struct msghdr *msg, int size, int flags);
    int (*socket_getsockname) (struct socket *sock);
    int (*socket_getpeername) (struct socket *sock);
    int (*socket_getsockopt) (struct socket *sock, int level, int optname);
    int (*socket_setsockopt) (struct socket *sock, int level, int optname);
    int (*socket_shutdown) (struct socket *sock, int how);
    int (*socket_sock_rcv_skb) (struct sock *sk, struct sk_buff *skb);
    int (*socket_getpeersec_stream) (struct socket *sock, char __user *optval, int __user *optlen, unsigned len);
    int (*socket_getpeersec_dgram) (struct socket *sock, struct sk_buff *skb, u32 *secid);
    int (*sk_alloc_security) (struct sock *sk, int family, gfp_t priority);
    void (*sk_free_security) (struct sock *sk);
    void (*sk_clone_security) (const struct sock *sk, struct sock *newsk);
    void (*sk_getsecid) (struct sock *sk, u32 *secid);
    void (*sock_graft) (struct sock *sk, struct socket *parent);
    int (*inet_conn_request) (struct sock *sk, struct sk_buff *skb,
                  struct request_sock *req);
    void (*inet_csk_clone) (struct sock *newsk, const struct request_sock *req);
    void (*inet_conn_established) (struct sock *sk, struct sk_buff *skb);
    void (*req_classify_flow) (const struct request_sock *req, struct flowi *fl);
    int (*tun_dev_create)(void);
    void (*tun_dev_post_create)(struct sock *sk);
    int (*tun_dev_attach)(struct sock *sk);
#endif    /* CONFIG_SECURITY_NETWORK */

#ifdef CONFIG_SECURITY_NETWORK_XFRM
    int (*xfrm_policy_alloc_security) (struct xfrm_sec_ctx **ctxp,
            struct xfrm_user_sec_ctx *sec_ctx);
    int (*xfrm_policy_clone_security) (struct xfrm_sec_ctx *old_ctx, struct xfrm_sec_ctx **new_ctx);
    void (*xfrm_policy_free_security) (struct xfrm_sec_ctx *ctx);
    int (*xfrm_policy_delete_security) (struct xfrm_sec_ctx *ctx);
    int (*xfrm_state_alloc_security) (struct xfrm_state *x,
        struct xfrm_user_sec_ctx *sec_ctx,
        u32 secid);
    void (*xfrm_state_free_security) (struct xfrm_state *x);
    int (*xfrm_state_delete_security) (struct xfrm_state *x);
    int (*xfrm_policy_lookup) (struct xfrm_sec_ctx *ctx, u32 fl_secid, u8 dir);
    int (*xfrm_state_pol_flow_match) (struct xfrm_state *x,
                      struct xfrm_policy *xp,
                      struct flowi *fl);
    int (*xfrm_decode_session) (struct sk_buff *skb, u32 *secid, int ckall);
#endif    /* CONFIG_SECURITY_NETWORK_XFRM */

    /* key management security hooks */
#ifdef CONFIG_KEYS
    int (*key_alloc) (struct key *key, const struct cred *cred, unsigned long flags);
    void (*key_free) (struct key *key);
    int (*key_permission) (key_ref_t key_ref,
                   const struct cred *cred,
                   key_perm_t perm);
    int (*key_getsecurity)(struct key *key, char **_buffer);
    int (*key_session_to_parent)(const struct cred *cred,
                     const struct cred *parent_cred,
                     struct key *key);
#endif    /* CONFIG_KEYS */

#ifdef CONFIG_AUDIT
    int (*audit_rule_init) (u32 field, u32 op, char *rulestr, void **lsmrule);
    int (*audit_rule_known) (struct audit_krule *krule);
    int (*audit_rule_match) (u32 secid, u32 field, u32 op, void *lsmrule,
                 struct audit_context *actx);
    void (*audit_rule_free) (void *lsmrule);
#endif /* CONFIG_AUDIT */
};

Relevant Link:

http://www.hep.by/gnu/kernel/lsm/framework.html
http://blog.sina.com.cn/s/blog_858820890101eb3c.html
http://mirror.linux.org.au/linux-mandocs/2.6.4-cset-20040312_2111/security_operations.html

0x2: struct kprobe

The basic structure that holds the state of a single probe point

struct kprobe 
{
    /* entry in the global kprobe hash table, keyed by the probed address */
    struct hlist_node hlist;

    /* list of kprobes for multi-handler support */
    /* when several probes are registered at the same probe point, they are all chained on this list */
    struct list_head list;

    /* count the number of times this probe was temporarily disarmed */
    unsigned long nmissed;

    /* location of the probe point */
    /* the probed target address; note that only one of addr and symbol_name may be filled in.
       If both are set, registering the probe fails with error -21 (illegal symbol) */
    kprobe_opcode_t *addr;

    /* Allow user to indicate symbol name of the probe point */
    /* symbol_name lets the user give a function name instead of a concrete address;
       the kernel resolves it to an address internally (cf. kallsyms_lookup_name("xx")) */
    const char *symbol_name;

    /* Offset into the symbol */
    /*
    If the probe point is an instruction inside a function, it is addressed as addr + offset.
    This also shows that a kprobe can hook almost any location in the kernel.
    */
    unsigned int offset;

    /* Called before addr is executed. */
    kprobe_pre_handler_t pre_handler;

    /* Called after addr is executed, unless... */
    kprobe_post_handler_t post_handler;

    /*
    ...called if executing addr causes a fault (eg. page fault).
    Return 1 if it handled fault, otherwise kernel will see it.
    */
    kprobe_fault_handler_t fault_handler;

    /*
    called if breakpoint trap occurs in probe handler.
    Return 1 if it handled break, otherwise kernel will see it.
    */
    kprobe_break_handler_t break_handler;

    /* opcode and ainsn preserve the instruction that was replaced */
    /* Saved opcode (which has been replaced with breakpoint) */
    kprobe_opcode_t opcode;

    /* copy of the original instruction */
    struct arch_specific_insn ainsn;

    /*
    Indicates various status flags.
    Protected by kprobe_mutex after this kprobe is registered.
    */
    u32 flags;
};

0x3: struct jprobe

As noted earlier, jprobe is a functional wrapper layered on top of kprobes, which is evident from the data structure itself

struct jprobe 
{  
    struct kprobe kp;  

    /*
    the probe handler; note that
    1. the registered handler must have the same parameter list as the probed function
    2. when assigning the function pointer it must be cast with (kprobe_opcode_t *)
    */
    void *entry;  
};

0x4: struct kretprobe

This structure is passed to register_kretprobe() when a return probe is registered

struct kretprobe 
{
    struct kprobe kp;

    // the return handler, invoked when the probed function returns
    kretprobe_handler_t handler;

    // optional entry handler, analogous to pre_handler() in kprobes
    kretprobe_handler_t entry_handler;

    // maxactive is the maximum number of handler instances that may run concurrently;
    // it must be sized appropriately, otherwise some invocations of the probe point are missed
    int maxactive;
    int nmissed;

    // amount of per-instance private data to reserve for the handlers
    size_t data_size;
    struct hlist_head free_instances;
    raw_spinlock_t lock;
}; 

0x5: struct kretprobe_instance

This structure is handed to the registered return handler (.handler) of a kretprobe

struct kretprobe_instance 
{
    struct hlist_node hlist;
    
    // points back to the owning struct kretprobe (the one passed to register_kretprobe)
    struct kretprobe *rp;
    
    // the saved return address
    kprobe_opcode_t *ret_addr;

    // the task this instance belongs to
    struct task_struct *task;
    char data[0];
};

0x6: struct kretprobe_blackpoint, struct kprobe_blacklist_entry

struct kretprobe_blackpoint 
{
    const char *name;
    void *addr;
}; 

struct kprobe_blacklist_entry 
{
    struct list_head list;
    unsigned long start_addr;
    unsigned long end_addr;
};

0x7: struct linux_binprm

In the Linux kernel, each binary format is handled by a struct linux_binfmt handler, while struct linux_binprm holds the state of the binary being loaded. The binary formats Linux supports are:

1. flat_format: flat format
Used on embedded CPUs without a memory management unit (MMU); to save space, the data in the executable can additionally be compressed (if the kernel provides zlib support)

2. script_format: pseudo format
Used to run scripts via the #! mechanism; by inspecting the first line of the file the kernel knows which interpreter to use and simply starts the appropriate program (e.g. #!/usr/bin/perl starts perl)

3. misc_format: pseudo format
Used to start applications that require an external interpreter. Unlike the #! mechanism, the interpreter need not be named explicitly; it is selected by file characteristics (suffix, file header, ...). This format is used, for example, to run Java bytecode or to run Windows programs under Wine

4. elf_format:
A machine- and architecture-independent format usable for 32/64-bit systems; it is the standard format on Linux

5. elf_fdpic_format: ELF variant
Provides special features for systems without an MMU

6. irix_format: ELF variant
Provides Irix-specific features

7. som_format:
Used on PA-RISC machines; an HP-UX-specific format

8. aout_format:
a.out was the standard Linux format before ELF was introduced

/source/include/linux/binfmts.h

/*
 * This structure is used to hold the arguments that are used when loading binaries.
 */
struct linux_binprm
{
    //the first 128 bytes of the executable file
    char buf[BINPRM_BUF_SIZE];
#ifdef CONFIG_MMU
    struct vm_area_struct *vma;
    unsigned long vma_pages;
#else
# define MAX_ARG_PAGES    32
    struct page *page[MAX_ARG_PAGES];
#endif
    struct mm_struct *mm;
    /*
    current top of mem
    */
    unsigned long p; 
    unsigned int
        cred_prepared:1,/* true if creds already prepared (multiple
                 * preps happen for interpreters) */
        cap_effective:1;/* true if has elevated effective capabilities,
                 * false if not; except for init which inherits
                 * its parent's caps anyway */
#ifdef __alpha__
    unsigned int taso:1;
#endif
    unsigned int recursion_depth;
    //the file to be executed
    struct file * file;
    //new credentials  
    struct cred *cred;    
    int unsafe;        /* how unsafe this exec is (mask of LSM_UNSAFE_*) */
    unsigned int per_clear;    /* bits to clear in current->personality */
    //number of command-line arguments and environment variables
    int argc, envc;
    /*
    name of the file to execute
    Name of binary as seen by procps
    */
    char * filename;
    /*
    real name of the executed file, usually the same as filename
    Name of the binary really executed. Most of the time same as filename, but could be different for binfmt_{misc,script}
    */     
    char * interp;         
    unsigned interp_flags;
    unsigned interp_data;
    unsigned long loader, exec;
};

0x8: struct linux_binfmt

/source/include/linux/binfmts.h

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
*/
struct linux_binfmt 
{
    //list linkage
    struct list_head lh;
    struct module *module;
    //load the binary itself
    int (*load_binary)(struct linux_binprm *, struct  pt_regs * regs);

    //load a shared library
    int (*load_shlib)(struct file *);

    int (*core_dump)(long signr, struct pt_regs *regs, struct file *file, unsigned long limit);
    unsigned long min_coredump;    /* minimal dump size */
    int hasvdso;
};

 

6. Data Structures Related to the System's Network State

0x1: struct ifconf

\linux-2.6.32.63\include\linux\if.h

/* Structure used in SIOCGIFCONF request.  Used to retrieve interface
   configuration for machine (useful for programs which must know all
   networks accessible).  
*/ 
struct ifconf
{
    int ifc_len;        /* Size of buffer. */
    union
    {
    __caddr_t ifcu_buf;
    struct ifreq *ifcu_req;    /* array of ifreq structures, one per interface */
    } ifc_ifcu;
};
#define ifc_buf    ifc_ifcu.ifcu_buf   /* Buffer address.  */
#define ifc_req    ifc_ifcu.ifcu_req   /* Array of structures.  */
#define _IOT_ifconf _IOT(_IOTS(struct ifconf),1,0,0,0,0) /* not right */

0x2: struct ifreq

\linux-2.6.32.63\include\linux\if.h

/*
 * Interface request structure used for socket
 * ioctl's.  All interface ioctl's must have parameter
 * definitions which begin with ifr_name.  The
 * remainder may be interface specific.
*/
struct ifreq 
{
#define IFHWADDRLEN    6
    union
    {
        char    ifrn_name[IFNAMSIZ];        /* if name, e.g. "en0" */
    } ifr_ifrn;
    
    //address structures for this interface
    union 
    {
        struct    sockaddr ifru_addr;
        struct    sockaddr ifru_dstaddr;
        struct    sockaddr ifru_broadaddr;
        struct    sockaddr ifru_netmask;
        struct  sockaddr ifru_hwaddr;
        short    ifru_flags;
        int    ifru_ivalue;
        int    ifru_mtu;
        struct  ifmap ifru_map;
        char    ifru_slave[IFNAMSIZ];    /* Just fits the size */
        char    ifru_newname[IFNAMSIZ];
        void __user *    ifru_data;
        struct    if_settings ifru_settings;
    } ifr_ifru;
};
#define ifr_name    ifr_ifrn.ifrn_name    /* interface name     */
#define ifr_hwaddr    ifr_ifru.ifru_hwaddr    /* MAC address         */
#define    ifr_addr    ifr_ifru.ifru_addr    /* address        */
#define    ifr_dstaddr    ifr_ifru.ifru_dstaddr    /* other end of p-p lnk    */
#define    ifr_broadaddr    ifr_ifru.ifru_broadaddr    /* broadcast address    */
#define    ifr_netmask    ifr_ifru.ifru_netmask    /* interface net mask    */
#define    ifr_flags    ifr_ifru.ifru_flags    /* flags        */
#define    ifr_metric    ifr_ifru.ifru_ivalue    /* metric        */
#define    ifr_mtu        ifr_ifru.ifru_mtu    /* mtu            */
#define ifr_map        ifr_ifru.ifru_map    /* device map        */
#define ifr_slave    ifr_ifru.ifru_slave    /* slave device        */
#define    ifr_data    ifr_ifru.ifru_data    /* for use by interface    */
#define ifr_ifindex    ifr_ifru.ifru_ivalue    /* interface index    */
#define ifr_bandwidth    ifr_ifru.ifru_ivalue    /* link bandwidth    */
#define ifr_qlen    ifr_ifru.ifru_ivalue    /* Queue length     */
#define ifr_newname    ifr_ifru.ifru_newname    /* New name        */
#define ifr_settings    ifr_ifru.ifru_settings    /* Device/proto settings*/

Example code: enumerating the interfaces with SIOCGIFCONF and querying each one via ioctl()

#include <arpa/inet.h>
#include <net/if.h>
#include <net/if_arp.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
 
#define MAXINTERFACES 16    /* maximum number of interfaces */
 
int fd;         /* socket */
int if_len;     /* number of interfaces */
struct ifreq buf[MAXINTERFACES];    /* array of ifreq structures */
struct ifconf ifc;                  /* ifconf structure */
 
int main(void)
{
    /* create an IPv4 UDP socket fd */
    if ((fd = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
    {
        perror("socket(AF_INET, SOCK_DGRAM, 0)");
        return -1;
    }
 
    /* initialize the ifconf structure */
    ifc.ifc_len = sizeof(buf);
    ifc.ifc_buf = (caddr_t) buf;
 
    /* fetch the interface list */
    if (ioctl(fd, SIOCGIFCONF, (char *) &ifc) == -1)
    {
        perror("SIOCGIFCONF ioctl");
        return -1;
    }
 
    if_len = ifc.ifc_len / sizeof(struct ifreq); /* number of interfaces */
    printf("number of interfaces: %d\n\n", if_len);
 
    while (if_len-- > 0) /* walk each interface */
    {
        printf("interface: %s\n", buf[if_len].ifr_name); /* interface name */
 
        /* interface flags */
        if (!(ioctl(fd, SIOCGIFFLAGS, (char *) &buf[if_len])))
        {
            /* interface state */
            if (buf[if_len].ifr_flags & IFF_UP)
            {
                printf("state: UP\n");
            }
            else
            {
                printf("state: DOWN\n");
            }
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFFLAGS ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }
 
 
        /* IP address */
        if (!(ioctl(fd, SIOCGIFADDR, (char *) &buf[if_len])))
        {
            printf("IP address: %s\n",
                    inet_ntoa(((struct sockaddr_in *) (&buf[if_len].ifr_addr))->sin_addr));
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFADDR ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }
 
        /* netmask */
        if (!(ioctl(fd, SIOCGIFNETMASK, (char *) &buf[if_len])))
        {
            printf("netmask: %s\n",
                    inet_ntoa(((struct sockaddr_in *) (&buf[if_len].ifr_netmask))->sin_addr));
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFNETMASK ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }
 
        /* broadcast address */
        if (!(ioctl(fd, SIOCGIFBRDADDR, (char *) &buf[if_len])))
        {
            printf("broadcast address: %s\n",
                    inet_ntoa(((struct sockaddr_in *) (&buf[if_len].ifr_broadaddr))->sin_addr));
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFBRDADDR ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }
 
        /* MAC address */
        if (!(ioctl(fd, SIOCGIFHWADDR, (char *) &buf[if_len])))
        {
            printf("MAC address: %02x:%02x:%02x:%02x:%02x:%02x\n\n",
                    (unsigned char) buf[if_len].ifr_hwaddr.sa_data[0],
                    (unsigned char) buf[if_len].ifr_hwaddr.sa_data[1],
                    (unsigned char) buf[if_len].ifr_hwaddr.sa_data[2],
                    (unsigned char) buf[if_len].ifr_hwaddr.sa_data[3],
                    (unsigned char) buf[if_len].ifr_hwaddr.sa_data[4],
                    (unsigned char) buf[if_len].ifr_hwaddr.sa_data[5]);
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFHWADDR ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }
    } /* end while */
 
    /* close the socket */
    close(fd);
    return 0;
}

Relevant Link:

http://blog.csdn.net/jk110333/article/details/8832077
http://www.360doc.com/content/12/0314/15/5782959_194281431.shtml

0x3: struct socket

\linux-2.6.32.63\include\linux\net.h

struct socket 
{    
    /*
    1. state: socket state
    typedef enum 
    {
        SS_FREE = 0,            //not yet allocated
        SS_UNCONNECTED,         //not connected to any socket
        SS_CONNECTING,          //in the process of connecting
        SS_CONNECTED,           //connected to another socket
        SS_DISCONNECTING        //in the process of disconnecting
    } socket_state; 
    */
    socket_state        state;

    kmemcheck_bitfield_begin(type);
    /*
    2. type: socket type
    enum sock_type 
    {
        SOCK_STREAM    = 1,    //stream (connection) socket
        SOCK_DGRAM    = 2,    //datagram (conn.less) socket
        SOCK_RAW    = 3,    //raw socket
        SOCK_RDM    = 4,    //reliably-delivered message
        SOCK_SEQPACKET    = 5,//sequential packet socket
        SOCK_DCCP    = 6,    //Datagram Congestion Control Protocol socket
        SOCK_PACKET    = 10,    //linux specific way of getting packets at the dev level.
    };
    */
    short            type;
    kmemcheck_bitfield_end(type);

    /*
    3. flags: socket flags
        1) #define SOCK_ASYNC_NOSPACE 0
        2) #define SOCK_ASYNC_WAITDATA 1
        3) #define SOCK_NOSPACE 2
        4) #define SOCK_PASSCRED 3
        5) #define SOCK_PASSSEC 4
    */
    unsigned long        flags;

    //fasync_list is used when processes have chosen asynchronous handling of this 'file'
    struct fasync_struct    *fasync_list;
    //4. Not used by sockets in AF_INET
    wait_queue_head_t    wait;

    //5. file holds a reference to the primary file structure associated with this socket
    struct file        *file;

    /*
    6. sock
    This is very important, as it contains most of the useful state associated with a socket. 
    */
    struct sock        *sk;

    //7. ops: the protocol-specific operations of this socket
    const struct proto_ops    *ops;
};

0x4: struct sock

struct sock by itself does not expose the socket's IP and port; it must be converted with inet_sk() to a struct inet_sock before that information can be read. struct sock does, however, hold a large amount of metadata describing the socket

\linux-2.6.32.63\include\net\sock.h

struct sock 
{
    /*
     * Now struct inet_timewait_sock also uses sock_common, so please just
     * don't add nothing before this first member (__sk_common) --acme
     */
    //shared layout with inet_timewait_sock
    struct sock_common    __sk_common;
#define sk_node            __sk_common.skc_node
#define sk_nulls_node        __sk_common.skc_nulls_node
#define sk_refcnt        __sk_common.skc_refcnt

#define sk_copy_start        __sk_common.skc_hash
#define sk_hash            __sk_common.skc_hash
#define sk_family        __sk_common.skc_family
#define sk_state        __sk_common.skc_state
#define sk_reuse        __sk_common.skc_reuse
#define sk_bound_dev_if        __sk_common.skc_bound_dev_if
#define sk_bind_node        __sk_common.skc_bind_node
#define sk_prot            __sk_common.skc_prot
#define sk_net            __sk_common.skc_net

    kmemcheck_bitfield_begin(flags);
    //mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
    unsigned int        sk_shutdown  : 2,
                //%SO_NO_CHECK setting, whether or not to checksum packets
                sk_no_check  : 2,
                //%SO_SNDBUF and %SO_RCVBUF settings
                sk_userlocks : 4,
                //which protocol this socket belongs in this network family
                sk_protocol  : 8,
                //socket type (%SOCK_STREAM, etc)
                sk_type      : 16;
    kmemcheck_bitfield_end(flags);
    //size of receive buffer in bytes
    int            sk_rcvbuf;
    //synchronizer
    socket_lock_t        sk_lock;
    /*
     * The backlog queue is special, it is always used with
     * the per-socket spinlock held and requires low latency
     * access. Therefore we special case it's implementation.
     */
    struct 
    {
        struct sk_buff *head;
        struct sk_buff *tail;
    } sk_backlog;
    //sock wait queue
    wait_queue_head_t    *sk_sleep;
    //destination cache
    struct dst_entry    *sk_dst_cache;
#ifdef CONFIG_XFRM
    //flow policy
    struct xfrm_policy    *sk_policy[2];
#endif
    //destination cache lock
    rwlock_t        sk_dst_lock;
    //receive queue bytes committed
    atomic_t        sk_rmem_alloc;
    //transmit queue bytes committed
    atomic_t        sk_wmem_alloc;
    //"o" is "option" or "other"
    atomic_t        sk_omem_alloc;
    //size of send buffer in bytes
    int            sk_sndbuf;
    //incoming packets
    struct sk_buff_head    sk_receive_queue;
    //Packet sending queue
    struct sk_buff_head    sk_write_queue;
#ifdef CONFIG_NET_DMA
    //DMA copied packets
    struct sk_buff_head    sk_async_wait_queue;
#endif
    //persistent queue size
    int            sk_wmem_queued;
    //space allocated forward
    int            sk_forward_alloc;
    //allocation mode
    gfp_t            sk_allocation;
    //route capabilities (e.g. %NETIF_F_TSO)
    int            sk_route_caps;
    //GSO type (e.g. %SKB_GSO_TCPV4)
    int            sk_gso_type;
    //Maximum GSO segment size to build
    unsigned int        sk_gso_max_size;
    //%SO_RCVLOWAT setting
    int            sk_rcvlowat;
    /*
    1. %SO_LINGER (l_onoff)
    2. %SO_BROADCAST
    3. %SO_KEEPALIVE
    4. %SO_OOBINLINE settings
    5. %SO_TIMESTAMPING settings
    */
    unsigned long         sk_flags;
    //%SO_LINGER l_linger setting
    unsigned long            sk_lingertime;
    //rarely used
    struct sk_buff_head    sk_error_queue;
    //sk_prot of original sock creator (see ipv6_setsockopt, IPV6_ADDRFORM for instance)
    struct proto        *sk_prot_creator;
    //used with the callbacks in the end of this struct
    rwlock_t        sk_callback_lock;
    //last error
    int            sk_err,
                //errors that don't cause failure but are the cause of a persistent failure not just 'timed out'
                sk_err_soft;
                    //raw/udp drops counter
    atomic_t        sk_drops;
    //always used with the per-socket spinlock held
    //current listen backlog
    unsigned short        sk_ack_backlog;
    //listen backlog set in listen()
    unsigned short        sk_max_ack_backlog;
    //%SO_PRIORITY setting
    __u32            sk_priority;
    //%SO_PEERCRED setting
    struct ucred        sk_peercred;
    //%SO_RCVTIMEO setting
    long            sk_rcvtimeo;
    //%SO_SNDTIMEO setting
    long            sk_sndtimeo;
    //socket filtering instructions
    struct sk_filter          *sk_filter;
    //private area, net family specific, when not using slab
    void            *sk_protinfo;
    //sock cleanup timer
    struct timer_list    sk_timer;
    //time stamp of last packet received
    ktime_t            sk_stamp;
    //Identd and reporting IO signals
    struct socket        *sk_socket;
    //RPC layer private data
    void            *sk_user_data;
    //cached page for sendmsg
    struct page        *sk_sndmsg_page;
    //front of stuff to transmit
    struct sk_buff        *sk_send_head;
    //cached offset for sendmsg
    __u32            sk_sndmsg_off;
    //a write to stream socket waits to start
    int            sk_write_pending;
#ifdef CONFIG_SECURITY
    //used by security modules
    void            *sk_security;
#endif
    //generic packet mark
    __u32            sk_mark;
    /* XXX 4 bytes hole on 64 bit */
    //callback to indicate change in the state of the sock
    void            (*sk_state_change)(struct sock *sk);
    //callback to indicate there is data to be processed
    void            (*sk_data_ready)(struct sock *sk, int bytes);
    //callback to indicate there is bf sending space available
    void            (*sk_write_space)(struct sock *sk);
    //callback to indicate errors (e.g. %MSG_ERRQUEUE)
    void            (*sk_error_report)(struct sock *sk);
    //callback to process the backlog
      int            (*sk_backlog_rcv)(struct sock *sk, struct sk_buff *skb);  
      //called at sock freeing time, i.e. when all refcnt == 0
    void                    (*sk_destruct)(struct sock *sk);
};

0x5: struct proto_ops

\linux-2.6.32.63\include\linux\net.h

struct proto_ops 
{
    int        family;
    struct module    *owner;
    int        (*release)   (struct socket *sock);
    int        (*bind)         (struct socket *sock, struct sockaddr *myaddr, int sockaddr_len);
    int        (*connect)   (struct socket *sock, struct sockaddr *vaddr, int sockaddr_len, int flags);
    int        (*socketpair)(struct socket *sock1, struct socket *sock2);
    int        (*accept)    (struct socket *sock, struct socket *newsock, int flags);
    int        (*getname)   (struct socket *sock, struct sockaddr *addr, int *sockaddr_len, int peer);
    unsigned int    (*poll)         (struct file *file, struct socket *sock, struct poll_table_struct *wait);
    int        (*ioctl)     (struct socket *sock, unsigned int cmd, unsigned long arg);
    int         (*compat_ioctl) (struct socket *sock, unsigned int cmd, unsigned long arg);
    int        (*listen)    (struct socket *sock, int len);
    int        (*shutdown)  (struct socket *sock, int flags);
    int        (*setsockopt)(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen);
    int        (*getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen);
    int        (*compat_setsockopt)(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen);
    int        (*compat_getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen);
    int        (*sendmsg)   (struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t total_len);
    /* Notes for implementing recvmsg:
     * ===============================
     * msg->msg_namelen should get updated by the recvmsg handlers
     * iff msg_name != NULL. It is by default 0 to prevent
     * returning uninitialized memory to user space.  The recvfrom
     * handlers can assume that msg.msg_name is either NULL or has
     * a minimum size of sizeof(struct sockaddr_storage).
     */
    int        (*recvmsg)   (struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t total_len, int flags);
    int        (*mmap)         (struct file *file, struct socket *sock, struct vm_area_struct * vma);
    ssize_t        (*sendpage)  (struct socket *sock, struct page *page, int offset, size_t size, int flags);
    ssize_t     (*splice_read)(struct socket *sock,  loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags);
};

0x6: struct inet_sock

In practice we must use inet_sk() to cast a "struct sock" pointer to "struct inet_sock" before the IP, port and related fields can be read

\linux-2.6.32.63\include\net\inet_sock.h

static inline struct inet_sock *inet_sk(const struct sock *sk)
{
    return (struct inet_sock *)sk;
}

The inet_sock structure is defined as follows

struct inet_sock 
{
    /* sk and pinet6 has to be the first two members of inet_sock */
    //ancestor class
    struct sock        sk;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
    //pointer to IPv6 control block
    struct ipv6_pinfo    *pinet6;
#endif
    /* Socket demultiplex comparisons on incoming packets. */
    //Foreign IPv4 addr
    __be32            daddr;
    //Bound local IPv4 addr
    __be32            rcv_saddr;
    //Destination port
    __be16            dport;
    //Local port
    __u16            num;
    //Sending source
    __be32            saddr;
    //Unicast TTL
    __s16            uc_ttl;
    __u16            cmsg_flags;
    struct ip_options_rcu    *inet_opt;
    //Source port
    __be16            sport;
    //ID counter for DF pkts
    __u16            id;
    //TOS
    __u8            tos;
    //Multicasting TTL
    __u8            mc_ttl;
    __u8            pmtudisc;
    __u8            recverr:1,
                //is this an inet_connection_sock?
                is_icsk:1,
                freebind:1,
                hdrincl:1,
                mc_loop:1,
                transparent:1,
                mc_all:1;
                //Multicast device index
    int            mc_index;
    __be32            mc_addr;
    struct ip_mc_socklist    *mc_list;
    //info to build ip hdr on each ip frag while socket is corked
    struct 
    {
        unsigned int        flags;
        unsigned int        fragsize;
        struct ip_options    *opt;
        struct dst_entry    *dst;
        int            length; /* Total length of all frames */
        __be32            addr;
        struct flowi        fl;
    } cork;
};

0x7: struct sockaddr

struct sockaddr 
{
    // address family, AF_xxx
    unsigned short    sa_family;
    
    // 14 bytes of protocol address
    char              sa_data[14];  
};

/* Structure describing an Internet (IP) socket address. */
#define __SOCK_SIZE__    16        /* sizeof(struct sockaddr)    */
struct sockaddr_in 
{
    /* Address family */
    sa_family_t        sin_family;
    
    /* Port number */
    __be16        sin_port;
    
    /* Internet address */
    struct in_addr    sin_addr;    

    /* Pad to size of `struct sockaddr'. */
    unsigned char        __pad[__SOCK_SIZE__ - sizeof(short int) - sizeof(unsigned short int) - sizeof(struct in_addr)];
};
#define sin_zero    __pad        /* for BSD UNIX comp. -FvK    */

/* Internet address. */
struct in_addr 
{
    __be32    s_addr;
};

 

7. Memory-related data structures

0x1: struct mm_struct

The memory descriptor owned by a process; it stores the process's memory-management information

struct mm_struct 
{
    struct vm_area_struct * mmap;        /* list of VMAs */
    struct rb_root mm_rb;
    struct vm_area_struct * mmap_cache;    /* last find_vma result */
    unsigned long (*get_unmapped_area) (struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags);
    void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
    unsigned long mmap_base;        /* base of mmap area */
    unsigned long task_size;        /* size of task vm space */
    unsigned long cached_hole_size;     /* if non-zero, the largest hole below free_area_cache */
    unsigned long free_area_cache;        /* first hole of size cached_hole_size or larger */
    pgd_t * pgd;
    atomic_t mm_users;            /* How many users with user space? */
    atomic_t mm_count;            /* How many references to "struct mm_struct" (users count as 1) */
    int map_count;                /* number of VMAs */
    struct rw_semaphore mmap_sem;
    spinlock_t page_table_lock;        /* Protects page tables and some counters */

    /* List of maybe swapped mm's.    These are globally strung together off init_mm.mmlist, and are protected by mmlist_lock */
    struct list_head mmlist;        
    /* Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
    mm_counter_t _file_rss;
    mm_counter_t _anon_rss;

    unsigned long hiwater_rss;    /* High-watermark of RSS usage */
    unsigned long hiwater_vm;    /* High-water virtual memory usage */

    unsigned long total_vm, locked_vm, shared_vm, exec_vm;
    unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;

    unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

    struct linux_binfmt *binfmt;

    cpumask_t cpu_vm_mask;

    /* Architecture-specific MM context */
    mm_context_t context;

    /* Swap token stuff */
    /*
     * Last value of global fault stamp as seen by this process.
     * In other words, this value gives an indication of how long
     * it has been since this task got the token.
     * Look at mm/thrash.c
     */
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;

    unsigned long flags; /* Must use atomic bitops to access the bits */

    struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
    spinlock_t        ioctx_lock;
    struct hlist_head    ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct *owner;
#endif

#ifdef CONFIG_PROC_FS
    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file *exe_file;
    unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
    struct mmu_notifier_mm *mmu_notifier_mm;
#endif
};

0x2: struct vm_area_struct

Each region of a process's virtual memory is represented by an instance of struct vm_area_struct

struct vm_area_struct 
{
    /* 
    associated mm_struct 
    vm_mm is a back-pointer to the mm_struct instance this region belongs to
    */
    struct mm_struct             *vm_mm;   
    
    /* VMA start, inclusive: start address within vm_mm */
    unsigned long                vm_start; 
    /* VMA end, exclusive: address of the first byte after the end address within vm_mm */
    unsigned long                vm_end;    
    
    /* 
    list of VMAs 
    the list of all of a process's vm_area_struct instances is implemented via vm_next;
    each process's virtual memory regions are linked here, sorted by address 
    */
    struct vm_area_struct        *vm_next;     

    /* 
    access permissions of this virtual memory region:
    1) _PAGE_READ
    2) _PAGE_WRITE
    3) _PAGE_EXECUTE
    */
    pgprot_t                     vm_page_prot; 
    
    /* 
    vm_flags is a set of flags describing this region's properties;
    they are preprocessor constants declared in <mm.h> 
    */
    unsigned long                vm_flags;      
    struct rb_node               vm_rb;         /* VMA's node in the tree */

    /*
    For regions with an address space and backing store, shared links either into the
    address_space->i_mmap priority tree, or into a list of similar VMAs hanging off a
    priority-tree node, or into the address_space->i_mmap_nonlinear list
    */
    union 
    {         /* links to address_space->i_mmap or i_mmap_nonlinear */
        struct 
        {
            struct list_head        list;
            void                    *parent;
            struct vm_area_struct   *head;
        } vm_set;
        struct prio_tree_node prio_tree_node;
    } shared;

    /*
    After copy-on-write of one of its pages, a file's MAP_PRIVATE VMA can be on both the
    i_mmap tree and the anon_vma list; a MAP_SHARED VMA can only be on the i_mmap tree.
    Anonymous MAP_PRIVATE, stack or brk VMAs (file pointer is NULL) can only be on the
    anon_vma list
    */
    struct list_head             anon_vma_node;     /* anon_vma entry; access serialized by anon_vma->lock */
    struct anon_vma              *anon_vma;         /* anonymous VMA object; access serialized by page_table_lock */
    struct vm_operations_struct  *vm_ops;           /* associated ops: function pointers for operating on this structure */
    unsigned long                vm_pgoff;          /* offset within file: backing-store information */
    struct file                  *vm_file;          /* mapped file, if any (may be NULL) */
    void                         *vm_private_data;  /* private data, vm_pte (i.e. shared memory) */
};

vm_flags is a set of flags describing the region's properties; they are preprocessor constants declared in <mm.h>
\linux-2.6.32.63\include\linux\mm.h

#define VM_READ        0x00000001    /* currently active flags */
#define VM_WRITE    0x00000002
#define VM_EXEC        0x00000004
#define VM_SHARED    0x00000008

/* mprotect() hardcodes VM_MAYREAD >> 4 == VM_READ, and so for r/w/x bits. */
#define VM_MAYREAD    0x00000010    /* limits for mprotect() etc */
#define VM_MAYWRITE    0x00000020
#define VM_MAYEXEC    0x00000040
#define VM_MAYSHARE    0x00000080

/*
VM_GROWSDOWN and VM_GROWSUP indicate whether a region may be extended downward or upward:
1. since the heap grows from bottom to top, its region would need VM_GROWSUP set
2. the stack grows from top to bottom, so its region has VM_GROWSDOWN set
*/
#define VM_GROWSDOWN    0x00000100    /* general info on the segment */
#if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
#define VM_GROWSUP    0x00000200
#else
#define VM_GROWSUP    0x00000000
#endif
#define VM_PFNMAP    0x00000400    /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE    0x00000800    /* ETXTBSY on write attempts.. */

#define VM_EXECUTABLE    0x00001000
#define VM_LOCKED    0x00002000
#define VM_IO           0x00004000    /* Memory mapped I/O or similar */

/* 
Used by sys_madvise() 
VM_SEQ_READ is set when the region is likely to be read sequentially from start to end;
VM_RAND_READ specifies that reads are likely to be random.
Both flags are "hints" to the memory-management subsystem and the block layer for
performance tuning, e.g. enabling readahead when access is sequential
*/            
#define VM_SEQ_READ    0x00008000    /* App will access data sequentially */
#define VM_RAND_READ    0x00010000    /* App will not benefit from clustered reads */

#define VM_DONTCOPY    0x00020000      /* Do not copy this vma on fork */
#define VM_DONTEXPAND    0x00040000    /* Cannot expand with mremap() */
#define VM_RESERVED    0x00080000    /* Count as reserved_vm like IO */
#define VM_ACCOUNT    0x00100000    /* Is a VM accounted object: whether the region is included in the overcommit calculations */
#define VM_NORESERVE    0x00200000    /* should the VM suppress accounting */
#define VM_HUGETLB    0x00400000    /* Huge TLB Page VM: set if the region is based on the huge pages some architectures support */
#define VM_NONLINEAR    0x00800000    /* Is non-linear (remap_file_pages) */
#define VM_MAPPED_COPY    0x01000000    /* T if mapped copy of data (nommu mmap) */
#define VM_INSERTPAGE    0x02000000    /* The vma has had "vm_insert_page()" done on it */
#define VM_ALWAYSDUMP    0x04000000    /* Always include in core dumps */

#define VM_CAN_NONLINEAR 0x08000000    /* Has ->fault & does nonlinear pages */
#define VM_MIXEDMAP    0x10000000    /* Can contain "struct page" and pure PFN pages */
#define VM_SAO        0x20000000    /* Strong Access Ordering (powerpc) */
#define VM_PFN_AT_MMAP    0x40000000    /* PFNMAP vma that is fully mapped at mmap time */
#define VM_MERGEABLE    0x80000000    /* KSM may merge identical pages */

These properties constrain memory allocation in various ways

0x3: struct pg_data_t

\linux-2.6.32.63\include\linux\mmzone.h
In both NUMA and UMA, memory as a whole is divided into "nodes". Each node is associated with one of the system's processors and is represented in the kernel by an instance of pg_data_t; the memory nodes are kept on a singly linked list for the kernel to traverse

typedef struct pglist_data 
{
    //node_zones is an array containing the node's zones
    struct zone node_zones[MAX_NR_ZONES];

    //node_zonelists specifies lists of nodes and their zones; the order of zones in a zonelist is the order in which allocations are attempted, falling back to the next entry when allocation from the previous zone fails. The array has an independent entry, of type zonelist, for every possible zone type, each a fallback list
    struct zonelist node_zonelists[MAX_ZONELISTS];

    //nr_zones holds the number of different zones in the node
    int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP    /* means !SPARSEMEM */
    /*
    node_mem_map points to an array of struct page instances describing all of the node's physical pages, across all of its zones.
    Each node is further divided into "zones"; each zone has an array organizing the physical page frames belonging to it, and every page frame is assigned a struct page instance plus the required management data
    */
    struct page *node_mem_map;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
    struct page_cgroup *node_page_cgroup;
#endif
#endif
    //during boot, before the memory-management subsystem is initialized, the kernel itself needs memory (some must be reserved for initializing memory management). The "boot memory allocator" solves this; bdata points to an instance of its data structure
    struct bootmem_data *bdata;
#ifdef CONFIG_MEMORY_HOTPLUG
    /*
     * Must be held any time you expect node_start_pfn, node_present_pages
     * or node_spanned_pages stay constant.  Holding this will also
     * guarantee that any pfn_valid() stays that way.
     *
     * Nests above zone->lock and zone->size_seqlock.
     */
    spinlock_t node_size_lock;
#endif
    /*
    node_start_pfn is the logical number of this NUMA node's first page frame; page frames are numbered consecutively across all nodes, so each frame number is globally unique (not merely unique within a node).
    On UMA systems node_start_pfn is always 0, since there is only one node and its first frame is therefore frame 0
    */
    unsigned long node_start_pfn;
    /* 
    total number of physical pages 
    node_present_pages specifies the total number of page frames in the node
    */
    unsigned long node_present_pages; 
    /* 
    total size of physical page range, including holes 
    node_spanned_pages gives the node's length measured in page frames.

    node_present_pages and node_spanned_pages need not be equal, because the node may contain holes that do not correspond to real page frames
    */
    unsigned long node_spanned_pages;

    //node_id is the global node ID; NUMA nodes are numbered from 0
    int node_id;

    //kswapd_wait is the wait queue of the swap daemon, used when swapping out page frames
    wait_queue_head_t kswapd_wait;

    //kswapd points to the task_struct of the swap daemon responsible for this node
    struct task_struct *kswapd;

    //kswapd_max_order is used by the page-swap subsystem to define the size of the area to be freed
    int kswapd_max_order;
} pg_data_t;

0x4: struct zone

Memory is divided into "nodes", each associated with one of the system's processors; each node is in turn divided into "zones", a further subdivision of memory
\linux-2.6.32.63\include\linux\mmzone.h

struct zone 
{
    /* Fields commonly accessed by the page allocator */

    /* 
    zone watermarks, access with *_wmark_pages(zone) macros 
    pages_min, pages_high and pages_low are "watermarks" used when swapping pages out; if memory runs short the kernel can write pages to disk, and these three members steer the behaviour of the swap daemon:
    1. free pages above pages_high: the zone's state is ideal
    2. free pages below pages_low: the kernel starts swapping pages out to disk
    3. free pages below pages_min: page reclaim is under heavy pressure because the zone urgently needs free pages; the kernel has mechanisms for handling this emergency
    */
    unsigned long watermark[NR_WMARK];

    /*
     * When free pages are below this point, additional steps are taken
     * when reading the number of free pages to avoid per-cpu counter
     * drift allowing watermarks to be breached
     */
    unsigned long percpu_drift_mark;

    /*
    We don't know if the memory that we're going to allocate will be freeable or/and it will be released eventually, 
    so to avoid totally wasting several GB of ram we must reserve some of the lower zone memory (otherwise we risk to run OOM on the lower zones despite there's tons of freeable ram on the higher zones). 
    This array is recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl changes.
    lowmem_reserve reserves a number of pages in each zone for critical allocations that must not fail; each zone's share is determined by its importance.
    It is computed by setup_per_zone_lowmem_reserve: the kernel iterates over all nodes and, for each zone of each node, calculates the minimum reserve as
        total page frames in the zone / sysctl_lowmem_reserve_ratio[zone]
    The divisor defaults to 256 for lowmem zones and 32 for highmem zones
    */
    unsigned long lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
    int node;
    /*
     * zone reclaim becomes active if more unmapped pages exist.
     */
    unsigned long min_unmapped_pages;
    unsigned long min_slab_pages;
    struct per_cpu_pageset *pageset[NR_CPUS];
#else
    /*
    pageset is a per-CPU array implementing the hot/cold page-frame lists; the kernel uses these lists to hold "fresh" pages available to satisfy allocations. Hot and cold frames differ in their cache state:
    1. hot frame: already loaded into the CPU cache, so it can be accessed quickly compared with a page that is only in memory
    2. cold frame: not present in the CPU cache
    */
    struct per_cpu_pageset pageset[NR_CPUS];
#endif
    /*
     * free areas of different sizes
     */
    spinlock_t lock;
#ifdef CONFIG_MEMORY_HOTPLUG
    /* see spanned/present_pages for more description */
    seqlock_t span_seqlock;
#endif
    /*
    Free areas of different lengths: free_area is an array of the structure of the same name, used to implement the buddy system. Each array element represents contiguous memory regions of one fixed order; free_area is the starting point for managing the free pages contained in each region
    */
    struct free_area free_area[MAX_ORDER];

#ifndef CONFIG_SPARSEMEM
    /*
     * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
     * In SPARSEMEM, this map is stored in struct mem_section
     */
    unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

    ZONE_PADDING(_pad1_)

    /* Fields commonly accessed by the page reclaim scanner */
    spinlock_t lru_lock;
    struct zone_lru {
        struct list_head list;
    } lru[NR_LRU_LISTS];

    struct zone_reclaim_stat reclaim_stat;

    /* pages scanned since last reclaim */
    unsigned long pages_scanned;
    /* zone flags */
    unsigned long flags;

    /* Zone statistics: vm_stat maintains extensive statistics about the zone, updated from many places in the kernel */
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

    /*
    prev_priority holds the scanning priority for this zone: the priority at which the reclaim target was reached during the previous try_to_free_pages() or balance_pgdat() invocation. It is used as a measure of how much stress page reclaim is under and drives the swappiness decision (whether to unmap mapped pages). Access to this field is quite racy even on uniprocessor, but it is expected to average out OK
    */
    int prev_priority;

    /*
     * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
     * this zone's LRU. Maintained by the pageout code.
     */
    unsigned int inactive_ratio;

    ZONE_PADDING(_pad2_)

    /* Rarely used or read-mostly fields */

    /*
    1. wait_table: the array holding the hash table
    2. wait_table_hash_nr_entries: the size of the hash table array
    3. wait_table_bits: wait_table_size == (1 << wait_table_bits)
    The purpose of all these is to keep track of the people waiting for a page to become available and make them runnable again when possible. Per-page waitqueues would consume a lot of space, so a waitqueue hash table is used instead: colliding waiters sleep on the same queue and all are woken on removal, so a woken task must check that its page is truly available, a la thundering herd.
    wait_table, wait_table_hash_nr_entries and wait_table_bits thus implement a wait queue for processes waiting for a page to become available; the kernel wakes them once the condition becomes true
    */
    wait_queue_head_t * wait_table;
    unsigned long wait_table_hash_nr_entries;
    unsigned long wait_table_bits;

    /* Discontig memory support fields: the link between a zone and its parent node is established by zone_pgdat, which points to the corresponding pg_data_t instance (the memory node) */
    struct pglist_data *zone_pgdat;
    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT: index of the zone's first page frame */
    unsigned long zone_start_pfn;

    /*
    zone_start_pfn, spanned_pages and present_pages are all protected by span_seqlock. It is a seqlock because it has to be read outside of zone->lock on the main allocator path but is written quite infrequently; it is declared along with zone->lock so that both have a chance of sharing a cache line
    */
    unsigned long spanned_pages;    /* total size, including holes */
    unsigned long present_pages;    /* amount of memory (excluding holes) */

    /* rarely used fields: */
    /* name is a string holding the zone's conventional name; three values occur: 1. Normal 2. DMA 3. HighMem */
    const char *name;
} ____cacheline_internodealigned_in_smp;

A special aspect of this structure is that ZONE_PADDING splits it into several parts. The zone structure is accessed very frequently; on multiprocessor systems different CPUs commonly try to access its members at the same time, so locks (the frequently taken spinlocks zone->lock and zone->lru_lock) prevent them from interfering with each other and causing errors or inconsistencies.
Data held in the CPU cache is processed faster, and the cache is divided into lines, each covering a different memory region. The kernel uses the ZONE_PADDING macro to generate "padding" fields in the structure so that each spinlock sits in its own cache line, and additionally uses the compiler keyword ____cacheline_internodealigned_in_smp to achieve optimal cache alignment.

This optimization reflects a deep understanding of the underlying CPU hardware: through seemingly wasteful "redundancy" it raises the CPU's parallel efficiency and avoids the waiting costs the locks would otherwise cause.

0x5: struct page

\linux-2.6.32.63\include\linux\mm_types.h
The format of this structure is architecture-independent: it does not depend on the CPU type, and every page frame is described by one instance 

/*
Each physical page in the system has a struct page associated with it to keep track of whatever it is we are using the page for at the moment. 
Note that we have no way to track which tasks are using a page, though if it is a pagecache page, rmap structures can tell us who is mapping it.
*/
struct page 
{
    /* 
    Atomic flags, some possibly updated asynchronously 
    flags stores architecture-independent flags describing the page's state, one state per bit, so at least 32 different states can be represented simultaneously; they are defined in linux/page-flags.h:
    enum pageflags 
    {
        PG_locked,            //Page is locked. Don't touch. While set, other parts of the kernel may not access the page; this prevents races in memory management, e.g. while data is read from disk into the frame
        PG_error,            //set if an error occurred during an I/O operation involving this page
        PG_referenced,        //PG_referenced and PG_active track how actively the page is used; important when the swap subsystem selects pages to evict
        PG_uptodate,        //the page's data has been read from the block device without error
        PG_dirty,            //set when the page's contents have changed relative to the data on disk. For performance, pages are not written back immediately after every modification; the kernel uses this flag to note that the page has changed so it can be flushed later. Pages with this flag set are called dirty (the in-memory data is not synchronized with the backing store, e.g. the hard disk)
        PG_lru,                //helps implement page reclaim: the kernel uses two least-recently-used (LRU) lists to distinguish active and inactive pages; the bit is set while the page is on one of them
        PG_active,
        PG_slab,            //set if the page is part of the SLAB allocator
        PG_owner_priv_1,    //Owner use. If pagecache, fs may use 
        PG_arch_1,
        PG_reserved,
        PG_private,            //If pagecache, has fs-private data: must be set whenever page->private is non-empty; for pages used in I/O this field subdivides the page into several buffers
        PG_private_2,        //If pagecache, has fs aux data 
        PG_writeback,        //Page is under writeback: set while the page's contents are being written back to the block device
    #ifdef CONFIG_PAGEFLAGS_EXTENDED
        PG_head,            //A head page 
        PG_tail,            //A tail page  
    #else
        PG_compound,        //A compound page: the page belongs to a larger compound page built from several contiguous normal pages
    #endif
        PG_swapcache,        //Swap page: swp_entry_t in private: set if the page is in the swap cache, in which case private holds an entry of type swp_entry_t 
        PG_mappedtodisk,    //Has blocks allocated on-disk  
        PG_reclaim,            //To be reclaimed asap: when available memory runs low the kernel periodically tries to reclaim pages, i.e. evict inactive, unused ones; once it decides to reclaim a particular page it sets PG_reclaim to signal this
        PG_buddy,            //Page is free, on buddy lists: set if the page is free and on a buddy-system list; the buddy system is the core of the page allocator
        PG_swapbacked,        //Page is backed by RAM/swap 
        PG_unevictable,        //Page is "unevictable"  
    #ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
        PG_mlocked,            //Page is vma mlocked  
    #endif
    #ifdef CONFIG_ARCH_USES_PG_UNCACHED
        PG_uncached,        //Page has been mapped as uncached  
    #endif
    #ifdef CONFIG_MEMORY_FAILURE
        PG_hwpoison,        //hardware poisoned page. Don't touch  
    #endif
        __NR_PAGEFLAGS,
 
        PG_checked = PG_owner_priv_1,    //Filesystems   
        PG_fscache = PG_private_2,        //page backed by cache 

        //XEN  
        PG_pinned = PG_owner_priv_1,
        PG_savepinned = PG_dirty,

        //SLOB  
        PG_slob_free = PG_private,

        //SLUB  
        PG_slub_frozen = PG_active,
        PG_slub_debug = PG_error,
    };

    The kernel defines standard macros, following a naming pattern, to check whether a page has a particular bit set or to manipulate a bit; all of these operations are atomic:
    1. PageXXX(page): checks whether PG_XXX is set
    2. SetPageXXX: sets the bit if it was not already set, returning the old value
    3. ClearPageXXX: unconditionally clears the bit
    4. TestClearPageXXX: clears a set bit and returns the old value 
    */
    unsigned long flags;    

    /*
    Usage count, see below
    _count is a usage counter recording how many references to this page exist in the kernel:
    1. when it reaches 0, the kernel knows the page instance is currently unused and may delete it
    2. while it is greater than 0, the instance must never be removed from memory
    */    
    atomic_t _count;         
    union 
    {
        /* 
        Count of ptes mapped in mms, to show when page is mapped & limit reverse map searches.
        Number of page-table entries pointing at this page; also used to limit reverse-mapping searches.
        The atomic_t type allows the value to be modified atomically, i.e. unaffected by concurrent access
        */
        atomic_t _mapcount; 
        struct 
        {    /* 
            SLUB: number of objects, used by the SLUB allocator 
            */
            u16 inuse;
            u16 objects;
        };
    };
    union 
    {
        struct 
        {
            /* 
            Mapping-private opaque data:
            usually used for buffer_heads if PagePrivate set;
            used for swp_entry_t if PageSwapCache;
            indicates order in the buddy system if PG_buddy is set.
            private points to "private" data that virtual memory management ignores
            */
            unsigned long private;        

            /* 
            If low bit clear, points to inode address_space, or NULL.
            If page mapped as anonymous memory, low bit is set, and the pointer points to an anon_vma object.
            mapping specifies the address space the page frame belongs to
            */
            struct address_space *mapping;    
        };
#if USE_SPLIT_PTLOCKS
        spinlock_t ptl;
#endif
        /* 
        SLUB: Pointer to slab 
        used by the slab allocator: pointer to the slab
        */
        struct kmem_cache *slab;    
        /* 
        Compound tail pages 
        the kernel can combine several contiguous pages into a larger compound page; the first page of the group is called the head page, all others are tail pages, and in every tail page's page instance first_page points back to the head page
        */
        struct page *first_page;    
    };
    union 
    {
        /* 
        Our offset within mapping. 
        index is the page frame's offset within its mapping
        */
        pgoff_t index;        
        void *freelist;        /* SLUB: freelist req. slab lock */
    };

    /* 
    Pageout list, eg. active_list protected by zone->lru_lock 
    */
    struct list_head lru;        
    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(WANT_PAGE_VIRTUAL)
    /* 
    Kernel virtual address (NULL if not kmapped, ie. highmem) 
    */
    void *virtual;            
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
    unsigned long debug_flags;    /* Use atomic bitops on this */
#endif

#ifdef CONFIG_KMEMCHECK
    /*
     * kmemcheck wants to track the status of each byte in a page; this
     * is a pointer to such a status block. NULL if not tracked.
     */
    void *shadow;
#endif
};

Often part of the kernel must wait for a page's state to change before resuming work; the kernel provides two helper functions
\linux-2.6.32.63\include\linux\pagemap.h

static inline void wait_on_page_locked(struct page *page);
Suppose part of the kernel is waiting for a locked page to become unlocked: wait_on_page_locked provides this. Called while the page is locked, it puts the caller to sleep; once the page is unlocked the sleeping process is woken automatically and continues its work

static inline void wait_on_page_writeback(struct page *page);
wait_on_page_writeback works similarly: it waits until all pending writeback operations on the page have finished and the page's data has been synchronized to the block device (e.g. the hard disk)

 

8. Interrupt-related data structures

0x1: struct irq_desc

The structure representing an IRQ descriptor is defined in \linux-2.6.32.63\include\linux\irq.h

struct irq_desc {
    //1. interrupt number for this descriptor
    unsigned int            irq;
    //2. irq stats per cpu
    unsigned int            *kstat_irqs;
#ifdef CONFIG_INTR_REMAP
    //3. iommu with this irq
    struct irq_2_iommu      *irq_2_iommu;
#endif
    //4. highlevel irq-events handler [if NULL, __do_IRQ()]
    irq_flow_handler_t      handle_irq;
    //5. low level interrupt hardware access
    struct irq_chip         *chip;
    //6. MSI descriptor
    struct msi_desc         *msi_desc;
    //7. per-IRQ data for the irq_chip methods
    void                    *handler_data;
    //8. platform-specific per-chip private data for the chip methods, to allow shared chip implementations
    void                    *chip_data;
    //9. the irq action chain
    struct irqaction        *action;
    //10. status information
    unsigned int            status;
    //11. disable-depth, for nested irq_disable() calls
    unsigned int            depth;
    //12. enable depth, for multiple set_irq_wake() callers
    unsigned int            wake_depth;
    //13. stats field to detect stalled irqs
    unsigned int            irq_count;
    //14. aging timer for unhandled count
    unsigned long           last_unhandled;
    //15. stats field for spurious unhandled interrupts
    unsigned int            irqs_unhandled;
    //16. locking for SMP
    spinlock_t              lock;
#ifdef CONFIG_SMP
    //17. IRQ affinity on SMP
    cpumask_var_t           affinity;
    //18. node index useful for balancing
    unsigned int            node;
#ifdef CONFIG_GENERIC_PENDING_IRQ
    //19. pending rebalanced interrupts
    cpumask_var_t           pending_mask;
#endif
#endif
    //20. number of irqaction threads currently running
    atomic_t                threads_active;
    //21. wait queue for sync_irq to wait for threaded handlers
    wait_queue_head_t       wait_for_threads;
#ifdef CONFIG_PROC_FS
    //22. /proc/irq/ procfs entry
    struct proc_dir_entry   *dir;
#endif
    //23. flow handler name for /proc/interrupts output
    const char              *name;
} ____cacheline_internodealigned_in_smp;

status describes the IRQ's current state.
irq.h defines constants describing the current state of the IRQ line; each constant represents one flag bit in the bit string (several may be set at once)

/*
 * IRQ line status.
 *
 * Bits 0-7 are reserved for the IRQF_* bits in linux/interrupt.h
 *
 * IRQ types
 */
#define IRQ_TYPE_NONE        0x00000000    /* Default, unspecified type */
#define IRQ_TYPE_EDGE_RISING    0x00000001    /* Edge rising type */
#define IRQ_TYPE_EDGE_FALLING    0x00000002    /* Edge falling type */
#define IRQ_TYPE_EDGE_BOTH (IRQ_TYPE_EDGE_FALLING | IRQ_TYPE_EDGE_RISING)
#define IRQ_TYPE_LEVEL_HIGH    0x00000004    /* Level high type */
#define IRQ_TYPE_LEVEL_LOW    0x00000008    /* Level low type */
#define IRQ_TYPE_SENSE_MASK    0x0000000f    /* Mask of the above */
#define IRQ_TYPE_PROBE        0x00000010    /* Probing in progress */

/* 
IRQ handler active - do not enter! 
Similar to IRQ_DISABLED, IRQ_INPROGRESS keeps the rest of the kernel from entering the handler
*/
#define IRQ_INPROGRESS        0x00000100    

/* 
IRQ disabled - do not enter!  
Marks an IRQ line disabled by a device driver; the flag tells the kernel not to enter the handler
*/
#define IRQ_DISABLED        0x00000200    

/* 
IRQ pending - replay on enable 
Set when the CPU has raised an interrupt whose handler has not yet been executed
*/
#define IRQ_PENDING        0x00000400    

/* 
IRQ has been replayed but not acked yet 
IRQ_REPLAY means the IRQ has been disabled while an earlier interrupt is still unacknowledged
*/
#define IRQ_REPLAY        0x00000800    
#define IRQ_AUTODETECT        0x00001000    /* IRQ is being autodetected */
#define IRQ_WAITING        0x00002000    /* IRQ not yet seen - for autodetection */

/* 
IRQ level triggered 
Used on Alpha and PowerPC systems to distinguish level-triggered from edge-triggered IRQs
*/
#define IRQ_LEVEL        0x00004000    

/* 
IRQ masked - shouldn't be seen again 
IRQ_MASKED is needed to correctly handle interrupts that occur during interrupt handling
*/
#define IRQ_MASKED        0x00008000    

/* 
IRQ is per CPU 
Set when an IRQ can occur on only one CPU; on SMP systems this flag makes several protections against concurrent access redundant
*/
#define IRQ_PER_CPU        0x00010000    
#define IRQ_NOPROBE        0x00020000    /* IRQ is not valid for probing */
#define IRQ_NOREQUEST        0x00040000    /* IRQ cannot be requested */
#define IRQ_NOAUTOEN        0x00080000    /* IRQ will not be enabled on request irq */
#define IRQ_WAKEUP        0x00100000    /* IRQ triggers system wakeup */
#define IRQ_MOVE_PENDING    0x00200000    /* need to re-target IRQ destination */
#define IRQ_NO_BALANCING    0x00400000    /* IRQ is excluded from balancing */
#define IRQ_SPURIOUS_DISABLED    0x00800000    /* IRQ was disabled by the spurious trap */
#define IRQ_MOVE_PCNTXT        0x01000000    /* IRQ migration from process context */
#define IRQ_AFFINITY_SET    0x02000000    /* IRQ affinity was set from userspace*/
#define IRQ_SUSPENDED        0x04000000    /* IRQ has gone through suspend sequence */
#define IRQ_ONESHOT        0x08000000    /* IRQ is not unmasked after hardirq */
#define IRQ_NESTED_THREAD    0x10000000    /* IRQ is nested into another, no own handler thread */

#ifdef CONFIG_IRQ_PER_CPU
# define CHECK_IRQ_PER_CPU(var) ((var) & IRQ_PER_CPU)
# define IRQ_NO_BALANCING_MASK    (IRQ_PER_CPU | IRQ_NO_BALANCING)
#else
# define CHECK_IRQ_PER_CPU(var) 0
# define IRQ_NO_BALANCING_MASK    IRQ_NO_BALANCING
#endif

0x2: struct irq_chip

\linux-2.6.32.63\include\linux\irq.h

struct irq_chip 
{
    /*
    1. name for /proc/interrupts
    a short string identifying the hardware controller, e.g.
        1) IA-32: XT-PIC
        2) AMD64: IO-APIC
    */
    const char    *name;

    //2. start up the interrupt (defaults to ->enable if NULL); used to initialize an IRQ the first time, startup in effect just delegates the work to enable
    unsigned int    (*startup)(unsigned int irq);

    //3. shut down the interrupt (defaults to ->disable if NULL)
    void        (*shutdown)(unsigned int irq);

    //4. enable the interrupt (defaults to chip->unmask if NULL)
    void        (*enable)(unsigned int irq);

    //5. disable the interrupt (defaults to chip->mask if NULL)
    void        (*disable)(unsigned int irq);

    //6. start of a new interrupt
    void        (*ack)(unsigned int irq);

    //7. mask an interrupt source
    void        (*mask)(unsigned int irq);

    //8. ack and mask an interrupt source
    void        (*mask_ack)(unsigned int irq);

    //9. unmask an interrupt source
    void        (*unmask)(unsigned int irq);

    //10. end of interrupt - chip level
    void        (*eoi)(unsigned int irq);

    //11. end of interrupt - flow level
    void        (*end)(unsigned int irq);

    //12. set the CPU affinity on SMP machines
    int        (*set_affinity)(unsigned int irq, const struct cpumask *dest);

    //13. resend an IRQ to the CPU
    int        (*retrigger)(unsigned int irq);

    //14. set the flow type (IRQ_TYPE_LEVEL/etc.) of an IRQ
    int        (*set_type)(unsigned int irq, unsigned int flow_type);

    //15. enable/disable power-management wake-on of an IRQ
    int        (*set_wake)(unsigned int irq, unsigned int on);

    //16. function to lock access to slow bus (i2c) chips
    void        (*bus_lock)(unsigned int irq);

    //17. function to sync and unlock slow bus (i2c) chips
    void        (*bus_sync_unlock)(unsigned int irq);

    /* Currently used only by UML, might disappear one day.*/
#ifdef CONFIG_IRQ_RELEASE_METHOD
    //18. release function solely used by UML
    void        (*release)(unsigned int irq, void *dev_id);
#endif
    /*
     * For compatibility, ->typename is copied into ->name.
     * Will disappear.
     */
    //19. obsoleted by name, kept as migration helper
    const char    *typename;
};

The structure must account for every feature found in any IRQ implementation in the kernel. A particular instance of the structure therefore usually defines only a subset of the possible methods. Below, the IO-APIC and the i8259A standard interrupt controller serve as examples

\linux-2.6.32.63\arch\x86\kernel\io_apic.c

static struct irq_chip ioapic_chip __read_mostly = {
    .name        = "IO-APIC",
    .startup    = startup_ioapic_irq,
    .mask        = mask_IO_APIC_irq,
    .unmask        = unmask_IO_APIC_irq,
    .ack        = ack_apic_edge,
    .eoi        = ack_apic_level,
#ifdef CONFIG_SMP
    .set_affinity    = set_ioapic_affinity_irq,
#endif
    .retrigger    = ioapic_retrigger_irq,
};

linux-2.6.32.63\arch\alpha\kernel\irq_i8259.c

struct irq_chip i8259a_irq_type = {
    .name        = "XT-PIC",
    .startup    = i8259a_startup_irq,
    .shutdown    = i8259a_disable_irq,
    .enable        = i8259a_enable_irq,
    .disable    = i8259a_disable_irq,
    .ack        = i8259a_mask_and_ack_irq,
    .end        = i8259a_end_irq,
};

As the examples show, driving a given controller only requires defining a subset of all possible handler functions
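The default-substitution rules noted in the struct comments (startup falls back to enable, enable falls back to unmask) can be sketched in plain user-space C. The struct and function names below are hypothetical, simplified stand-ins, not kernel code:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct irq_chip with optional callbacks */
struct chip {
    unsigned int (*startup)(unsigned int irq);
    void (*enable)(unsigned int irq);
    void (*unmask)(unsigned int irq);
};

static int unmask_calls;
static void demo_unmask(unsigned int irq) { (void)irq; unmask_calls++; }

/* Like the i8259A/IO-APIC examples, this chip defines only a subset */
static struct chip demo_chip = { .unmask = demo_unmask };

/* enable defaults to ->unmask when the chip leaves it NULL */
static void chip_enable(struct chip *c, unsigned int irq)
{
    if (c->enable)
        c->enable(irq);
    else
        c->unmask(irq);
}

/* startup defaults to ->enable when the chip leaves it NULL */
static unsigned int chip_startup(struct chip *c, unsigned int irq)
{
    if (c->startup)
        return c->startup(irq);
    chip_enable(c, irq);
    return 0;
}
```

Calling chip_startup on demo_chip ends up invoking demo_unmask, illustrating why a sparse irq_chip instance still works.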

0x3: struct irqaction

struct irqaction is the member structure of struct irq_desc that holds the IRQ handler function

struct irqaction 
{
    //1. name and dev_id uniquely identify an interrupt handler
    irq_handler_t           handler;
    void                    *dev_id;

    void __percpu           *percpu_dev_id;

    //2. next is used to implement shared IRQ handlers
    struct irqaction        *next;
    irq_handler_t           thread_fn;
    struct task_struct      *thread;
    unsigned int            irq;

    //3. flags is a bitmap describing properties of the IRQ (and the associated interrupt); the individual bits can be accessed via predefined constants
    unsigned int            flags;
    unsigned long           thread_flags;
    unsigned long           thread_mask;

    //4. name is a short string identifying the device
    const char              *name;
    struct proc_dir_entry   *dir;
} ____cacheline_internodealigned_in_smp;

Several irqaction instances are collected into a linked list; all elements of the list must handle the same IRQ number. When a shared interrupt occurs, the kernel scans this list to find out which device actually raised the interrupt
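That scan can be sketched in ordinary C. The names below (struct action, handle_shared_irq, the return codes) are simplified stand-ins for the kernel's irqaction chain and its IRQ_NONE/IRQ_HANDLED return values, not real kernel API:

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED return codes */
enum irqreturn { IRQ_NONE, IRQ_HANDLED };

/* Simplified stand-in for struct irqaction: one entry per device sharing the line */
struct action {
    enum irqreturn (*handler)(unsigned int irq, void *dev_id);
    void          *dev_id;
    struct action *next;
};

/* Each handler inspects its own device and returns IRQ_NONE
 * if that device did not raise the interrupt */
static enum irqreturn dev_a_handler(unsigned int irq, void *dev_id)
{
    (void)irq; (void)dev_id;
    return IRQ_NONE;       /* device A: "not me" */
}

static enum irqreturn dev_b_handler(unsigned int irq, void *dev_id)
{
    (void)irq; (void)dev_id;
    return IRQ_HANDLED;    /* device B actually raised the interrupt */
}

/* Walk the chain of actions registered on one IRQ number */
static int handle_shared_irq(unsigned int irq, struct action *head)
{
    int handled = 0;
    for (struct action *a = head; a != NULL; a = a->next)
        if (a->handler(irq, a->dev_id) == IRQ_HANDLED)
            handled = 1;
    return handled;
}
```

The whole chain is always walked, because more than one of the sharing devices may have raised an interrupt at the same time.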

 

9. IPC (inter-process communication) data structures

0x1: struct ipc_namespace

IPC has been namespace-aware since kernel 2.6.19, but managing IPC namespaces is comparatively simple because there is no hierarchical relationship between them. A given process belongs to the namespace pointed to by task_struct->nsproxy->ipc_ns; the initial default namespace is provided by the static ipc_namespace instance init_ipc_ns. Each namespace contains the following structure
source/include/linux/ipc_namespace.h

struct ipc_namespace 
{
    atomic_t    count;
    /*
    One array element per IPC mechanism:
        1) ids[0]: semaphores
        2) ids[1]: message queues
        3) ids[2]: shared memory
    */
    struct ipc_ids    ids[3];

    int        sem_ctls[4];
    int        used_sems;

    int        msg_ctlmax;
    int        msg_ctlmnb;
    int        msg_ctlmni;
    atomic_t    msg_bytes;
    atomic_t    msg_hdrs;
    int        auto_msgmni;

    size_t        shm_ctlmax;
    size_t        shm_ctlall;
    int        shm_ctlmni;
    int        shm_tot;

    struct notifier_block ipcns_nb;

    /* The kern_mount of the mqueuefs sb.  We take a ref on it */
    struct vfsmount    *mq_mnt;

    /* # queues in this ns, protected by mq_lock */
    unsigned int    mq_queues_count;

    /* next fields are set through sysctl */
    unsigned int    mq_queues_max;   /* initialized to DFLT_QUEUESMAX */
    unsigned int    mq_msg_max;      /* initialized to DFLT_MSGMAX */
    unsigned int    mq_msgsize_max;  /* initialized to DFLT_MSGSIZEMAX */
};

Relevant Link:

http://blog.csdn.net/bullbat/article/details/7781027
http://book.51cto.com/art/201005/200882.htm

0x2: struct ipc_ids

This structure holds general state information about IPC objects; there is one struct ipc_ids instance per IPC mechanism: shared memory, semaphores, message queues. To avoid having to look up the correct array index for each category, the kernel provides the helper functions msg_ids, shm_ids and sem_ids
source/include/linux/ipc_namespace.h

struct ipc_ids 
{
    //1. number of IPC objects currently in use
    int in_use; 

    /*
    2. seq and seq_max are used to generate consecutive user-space IPC IDs. Note that an ID is not the same as a sequence number: the kernel identifies IPC objects by ID, and IDs are managed per resource type (one set for message queues, one for semaphores, one for shared memory objects).
    Each time a new IPC object is created, the sequence number is incremented (wrapping back to 0 automatically when the maximum is reached).
    The user-visible ID is computed as s * SEQ_MULTIPLIER + i, where s is the current sequence number, i is the kernel-internal ID, and SEQ_MULTIPLIER is set to the upper limit on the number of IPC objects.
    Even if an internal ID is reused, a different user-space ID results, because the sequence number is not reused; when user space passes in a stale ID, this minimizes the risk of accessing the wrong resource
    */
    unsigned short seq;
    unsigned short seq_max; 

    //3. a kernel semaphore used to serialize semaphore operations and avoid race conditions with user space; this mutex effectively protects the data structures containing the semaphore values
    struct rw_semaphore rw_mutex;

    //4. every IPC object is represented by an instance of kern_ipc_perm; ipcs_idr maps an ID to the pointer to the corresponding kern_ipc_perm instance
    struct idr ipcs_idr;
};

Every IPC object is represented by an instance of kern_ipc_perm and carries a kernel-internal ID; ipcs_idr maps this ID to the pointer to the corresponding kern_ipc_perm instance
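The user-visible ID computation described above can be illustrated with a small sketch. The value of SEQ_MULTIPLIER here (32768, the IPCMNI limit) matches mainline 2.6 kernels, but the function itself is only a model of the kernel's ipc_buildid(), not a copy of it:

```c
/* Per-type limit on IPC objects, used as the multiplier (IPCMNI in the kernel) */
#define SEQ_MULTIPLIER 32768

/* user-visible ID = s * SEQ_MULTIPLIER + i
 * (s = current sequence number, i = kernel-internal ID) */
static int ipc_buildid(int internal_id, int seq)
{
    return SEQ_MULTIPLIER * seq + internal_id;
}
```

Dividing a user-space ID by SEQ_MULTIPLIER recovers the sequence number, and the remainder recovers the internal ID, which is why a reused internal ID still yields a fresh user-visible ID.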

0x3: struct kern_ipc_perm

This structure holds the "owner" of an IPC object, its access permissions, and related information
/source/include/linux/ipc.h

/* Obsolete, used only for backwards compatibility and libc5 compiles */
struct ipc_perm
{
    //1. the magic number (key) user programs use to identify the semaphore
    __kernel_key_t    key;

    //2. UID of the owner of this IPC object
    __kernel_uid_t    uid;

    //3. group ID of the owner of this IPC object
    __kernel_gid_t    gid;

    //4. user ID of the process that created the semaphore
    __kernel_uid_t    cuid;

    //5. group ID of the process that created the semaphore
    __kernel_gid_t    cgid;

    //6. bitmask specifying the access permissions for owner, group and others
    __kernel_mode_t    mode; 

    //7. a sequence number, used when the IPC object is allocated
    unsigned short    seq;
};

This structure alone is not sufficient to hold all the information needed for semaphores. The task_struct instance of each process contains an IPC-related member

struct task_struct
{
    ...
    #ifdef CONFIG_SYSVIPC  
        struct sysv_sem sysvsem;
    #endif
    ...
}
//the SysV IPC code is only compiled into the kernel if the configuration option CONFIG_SYSVIPC is set

0x4: struct sysv_sem

The struct sysv_sem data structure wraps a single member

struct sysv_sem 
{
    //list used to undo semaphore operations
    struct sem_undo_list *undo_list;
};

This mechanism is useful when a crashed process has modified the semaphore state before dying, which could leave processes waiting on the semaphore unable to wake up (a pseudo-deadlock). By using the information in the undo list to revert those operations at the appropriate time, the semaphore can be restored to a consistent state and the deadlock avoided
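The effect is easy to observe from user space with the SysV semaphore API: an operation performed with the SEM_UNDO flag is recorded on the calling process's undo list and reverted automatically when the process exits. A minimal sketch (dec_with_undo is our own helper name, not a library function):

```c
#include <sys/ipc.h>
#include <sys/sem.h>

/* glibc requires the caller to define union semun itself */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

/* Decrement semaphore 0 of the set, asking the kernel to record the
 * operation on this process's sem_undo list via SEM_UNDO */
static int dec_with_undo(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
    return semop(semid, &op, 1);
}
```

If the calling process later exits without releasing the semaphore, the kernel replays the recorded +1 adjustment, so other waiters are not left deadlocked.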

0x5: struct sem_queue

The struct sem_queue data structure associates a semaphore with a sleeping process that wants to perform a semaphore operation but is currently not allowed to because of contention. In other words, each entry in a semaphore's list of pending operations is an instance of this structure

/* One queue for each sleeping process in the system. */
struct sem_queue 
{
    /* queue of pending operations: a doubly linked list chained via next and prev */
    struct list_head    list;

    /* this process: the sleeping task */
    struct task_struct    *sleeper; 

    /* undo structure: used to revert the operations */
    struct sem_undo        *undo;     

    /* process id of the requesting process */
    int                pid;    

    /* completion status of the operation */
    int                status;     

    /* array of pending operations */
    struct sembuf        *sops;     

    /* number of operations */
    int            nsops;

    /* does the operation alter the array? */
    int            alter;   
};

For each semaphore, there is a queue managing all sleeping (pending) processes associated with it. This queue is not implemented with the kernel's standard facilities but manually, via next and prev pointers

(figure: relationships among the semaphore data structures)

0x6: struct msg_queue

This is the central data structure for message queues: struct msg_queue serves as the list head of a message queue and describes the queue's current state and access permissions

/source/include/linux/msg.h

/* one msq_queue structure for each present queue on the system */
struct msg_queue 
{
    struct kern_ipc_perm q_perm;

    /* last msgsnd time: time of the last msgsnd call */
    time_t q_stime;
    /* last msgrcv time: time of the last msgrcv call */
    time_t q_rtime;
    /* last change time */
    time_t q_ctime;

    /* current number of bytes on queue */
    unsigned long q_cbytes;
    /* number of messages in queue */
    unsigned long q_qnum;
    /* max number of bytes on queue */
    unsigned long q_qbytes;

    /* pid of last msgsnd */
    pid_t q_lspid;
    /* last receive pid */
    pid_t q_lrpid;

    struct list_head q_messages;
    struct list_head q_receivers;
    struct list_head q_senders;
};

Note in particular the last three members of the structure, which are three standard kernel linked lists:

1. struct list_head q_messages : the messages themselves
2. struct list_head q_receivers : sleeping receivers
3. struct list_head q_senders : sleeping senders

Each message on q_messages is wrapped in a msg_msg instance

0x7: struct msg_msg

\linux-2.6.32.63\include\linux\msg.h

/* one msg_msg structure for each message */
struct msg_msg 
{
    //list element linking the individual messages
    struct list_head m_list; 

    //message type
    long  m_type;  

    /* message text size: length of the message */        
    int m_ts;    

    /*
    next is needed when a long message occupies more than one memory page.
    Each message is allocated (at least) one page; the msg_msg instance is stored at the start of that page, and the remaining space can hold the message text
    */       
    struct msg_msgseg* next;
    void *security;
    /* the actual message follows immediately */
};
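The page-based layout can be made concrete with a small calculation. The header sizes below are illustrative stand-ins (the real values are sizeof(struct msg_msg) and sizeof(struct msg_msgseg)), but the arithmetic mirrors the kernel's DATALEN_MSG / DATALEN_SEG logic:

```c
/* Illustrative only: assumes a 4 KiB page and made-up header sizes */
#define PAGE_SIZE_DEMO   4096u
#define MSG_HDR_DEMO     48u   /* stand-in for sizeof(struct msg_msg) */
#define SEG_HDR_DEMO     8u    /* stand-in for sizeof(struct msg_msgseg) */

/* Payload bytes that fit in the first page (after the msg_msg header)
 * and in each follow-up segment page (after the msg_msgseg header) */
#define DATALEN_MSG  (PAGE_SIZE_DEMO - MSG_HDR_DEMO)
#define DATALEN_SEG  (PAGE_SIZE_DEMO - SEG_HDR_DEMO)

/* Number of extra msg_msgseg pages a message body of `len` bytes needs */
static unsigned int extra_segments(unsigned int len)
{
    if (len <= DATALEN_MSG)
        return 0;                       /* everything fits in the first page */
    len -= DATALEN_MSG;
    return (len + DATALEN_SEG - 1) / DATALEN_SEG;   /* round up */
}
```

A short message thus needs no next pointer at all, while each additional DATALEN_SEG bytes of payload costs one more chained segment page.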

During message queue communication, both the sending and the receiving process can go to sleep:

1. a sender goes to sleep when it tries to write to a queue that has already reached its maximum capacity
2. a receiver goes to sleep while trying to fetch a message (i.e. when none is available)

In practice, to keep message senders (processes writing to the queue) from being forced to sleep when the queue limit is reached, we can take a couple of measures:

1. raise the queue limits via sysctl settings in /etc/sysctl.conf (e.g. kernel.msgmnb)
2. send messages in non-blocking mode with the msgsnd() API
/*
int ret = msgsnd(msgq_id, msg, msg_size, IPC_NOWAIT);
IPC_NOWAIT: if the message queue is full, msgsnd does not wait but returns immediately
*/
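A complete user-space sketch of the non-blocking path follows; the message layout and helper name are our own, and note that msgsnd's size argument counts only the mtext payload, not the leading long mtype:

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct demo_msg {
    long mtype;          /* must be > 0 */
    char mtext[32];
};

/* Send without blocking: returns 0 on success; on a full queue, IPC_NOWAIT
 * makes msgsnd fail immediately with errno == EAGAIN instead of sleeping */
static int send_nowait(int msqid, const char *text)
{
    struct demo_msg m = { .mtype = 1, .mtext = { 0 } };
    strncpy(m.mtext, text, sizeof(m.mtext) - 1);
    return msgsnd(msqid, &m, sizeof(m.mtext), IPC_NOWAIT);
}
```

The caller must then check for EAGAIN and decide whether to retry, drop the message, or back off, instead of relying on the kernel to park it on the q_senders list.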

Relevant Link:

http://blog.csdn.net/guoping16/article/details/6584024

0x8: struct msg_sender

For message queues, sleeping senders are placed on the q_senders list of msg_queue, with the following data structure as the list element

/* one msg_sender for each sleeping sender */
struct msg_sender 
{
    //list element
    struct list_head    list;

    //pointer to the task_struct of the corresponding process
    struct task_struct    *tsk;
};

No additional information is needed here, because the sending process goes to sleep during the sys_msgsnd system call, or possibly during the sys_ipc system call, and automatically retries the send operation after being woken up

0x9: struct msg_receiver

/*
 * one msg_receiver structure for each sleeping receiver:
 */
struct msg_receiver 
{
    struct list_head    r_list;
    struct task_struct    *r_tsk;

    int            r_mode;

    //description of the expected message
    long            r_msgtype;
    long            r_maxsize;

    //pointer to a msg_msg instance; if a message is available, this pointer specifies the target address for the copied data
    struct msg_msg        *volatile r_msg;
};

Each message queue has an associated msqid_ds structure

0x10: struct msqid_ds

\linux-2.6.32.63\include\linux\msg.h

/* Obsolete, used only for backwards compatibility and libc5 compiles */
struct msqid_ds 
{
    struct ipc_perm msg_perm;
    struct msg *msg_first;          /* first message on queue, unused */
    struct msg *msg_last;           /* last message in queue, unused */
    __kernel_time_t msg_stime;      /* last msgsnd time */
    __kernel_time_t msg_rtime;      /* last msgrcv time */
    __kernel_time_t msg_ctime;      /* last change time */
    unsigned long msg_lcbytes;      /* Reuse junk fields for 32 bit */
    unsigned long msg_lqbytes;      /* ditto */
    unsigned short msg_cbytes;      /* current number of bytes on queue */
    unsigned short msg_qnum;        /* number of messages in queue */
    unsigned short msg_qbytes;      /* max number of bytes on queue */
    __kernel_ipc_pid_t msg_lspid;   /* pid of last msgsnd */
    __kernel_ipc_pid_t msg_lrpid;   /* last receive pid */
};

(figure: relationships among the data structures involved in a message queue)

 

10. Namespace data structures

Through the interconnections between its data structures, the Linux kernel implements the virtualized concept of namespaces

0x1: struct pid_namespace

\linux-2.6.32.63\include\linux\pid_namespace.h

struct pid_namespace 
{
    struct kref kref;
    struct pidmap pidmap[PIDMAP_ENTRIES];
    int last_pid;

    /*
    1. every PID namespace has a process that plays the role the global init process plays. One of init's tasks is to call wait4 for orphaned processes, and the namespace-local init variant must do the same; child_reaper holds a pointer to the task_struct of that process
    */
    struct task_struct *child_reaper;

    struct kmem_cache *pid_cachep;

    /*
    2. level denotes the depth of this namespace in the namespace hierarchy. The initial namespace has level 0, the next layer level 1, and so on. IDs in namespaces with a higher level are visible in namespaces with a lower level (i.e. a child namespace is visible to its parent). From a given level, the kernel can infer how many IDs a process is associated with (a process in a child namespace needs an ID in every namespace from its own up to the top level)
    */
    unsigned int level;

    /*
    3. parent is a pointer to the parent namespace
    */
    struct pid_namespace *parent;

#ifdef CONFIG_PROC_FS
    struct vfsmount *proc_mnt;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
    struct bsd_acct_struct *bacct;
#endif
};

0x2: struct pid, struct upid

PID management revolves around two data structures: struct pid is the kernel-internal representation of a PID, while struct upid holds the information visible in a specific namespace

\linux-2.6.32.63\include\linux\pid.h

/*
struct upid is used to get the id of the struct pid, as it is seen in particular namespace. Later the struct pid is found with find_pid_ns() using the int nr and struct pid_namespace *ns.
*/
struct upid 
{
    /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
    //1. the numeric value of the ID
    int nr;

    //2. pointer to the namespace this ID belongs to
    struct pid_namespace *ns;

    /*
    3. all upid instances are kept in a hash table; pid_chain implements the hash overflow list using the kernel's standard method
    */
    struct hlist_node pid_chain;
};

struct pid 
{
    //1. reference counter
    atomic_t count;

    unsigned int level;

    /*
    lists of tasks that use this pid
    tasks is an array in which each element is a hash list head corresponding to one ID type. Because an ID may be used by several processes, all task_struct instances sharing a given ID are linked on this list. PIDTYPE_MAX denotes the number of ID types:
    enum pid_type
    {
        PIDTYPE_PID,
        PIDTYPE_PGID,
        PIDTYPE_SID,
        PIDTYPE_MAX
    };
    */
    struct hlist_head tasks[PIDTYPE_MAX];

    struct rcu_head rcu;
    struct upid numbers[1];
};

(figure: struct pid, struct upid and the namespace hierarchy)

As the figure shows, through this N:N relationship among the data structures, the kernel implements a virtual hierarchical namespace structure
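The level-based lookup this design enables (what the kernel does in pid_nr_ns()) can be sketched with simplified stand-in structures; none of the *_demo names below exist in the kernel:

```c
#include <stddef.h>

/* Simplified models of struct pid_namespace / struct upid / struct pid,
 * just enough to show the level-indexed lookup */
struct pid_ns { unsigned int level; };

struct upid_demo {
    int nr;                 /* numeric ID in one namespace */
    struct pid_ns *ns;      /* the namespace that ID belongs to */
};

struct pid_demo {
    unsigned int level;             /* deepest namespace the pid lives in */
    struct upid_demo numbers[3];    /* one entry per namespace level */
};

/* Return the numeric ID this pid has in namespace `ns`, or 0 if the pid
 * is not visible there (ns deeper than the pid, or an unrelated ns) */
static int pid_nr_ns_demo(const struct pid_demo *pid, const struct pid_ns *ns)
{
    if (ns->level <= pid->level) {
        const struct upid_demo *u = &pid->numbers[ns->level];
        if (u->ns == ns)
            return u->nr;
    }
    return 0;
}
```

Because numbers[] holds one upid per level, a process in a nested namespace carries a distinct numeric PID for each ancestor namespace, and the lookup is a single array index plus a namespace-identity check.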

0x3: struct nsproxy

\linux-2.6.32.63\include\linux\nsproxy.h

/*
A structure to contain pointers to all per-process namespaces
1. fs (mount)
2. uts
3. network
4. sysvipc
5. etc
'count' is the number of tasks holding a reference. The count for each namespace, then, will be the number of nsproxies pointing to it, not the number of tasks.
The nsproxy is shared by tasks which share all namespaces. As soon as a single namespace is cloned or unshared, the nsproxy is copied.
*/
struct nsproxy 
{
    atomic_t count;

    /*
    1. the UTS (UNIX Timesharing System) namespace contains the name, version and underlying architecture type of the running kernel
    */
    struct uts_namespace *uts_ns;

    /*
    2. all information related to inter-process communication (IPC), stored in struct ipc_namespace
    */
    struct ipc_namespace *ipc_ns;

    /*
    3. the view of mounted filesystems, given in struct mnt_namespace
    */
    struct mnt_namespace *mnt_ns;

    /*
    4. information on process IDs, provided by struct pid_namespace
    */
    struct pid_namespace *pid_ns;

    /*
    5. struct net contains all network-related namespace parameters
    */
    struct net          *net_ns;
};
extern struct nsproxy init_nsproxy;

0x4: struct mnt_namespace

\linux-2.6.32.63\include\linux\mnt_namespace.h

struct mnt_namespace 
{
    //usage counter, specifying the number of processes using this namespace
    atomic_t        count;

    //pointer to the vfsmount instance of the root directory
    struct vfsmount *    root;

    //head of a doubly linked list holding the vfsmount instances of all filesystems in the VFS namespace; the list elements are the mnt_list members of vfsmount
    struct list_head    list;
    wait_queue_head_t poll;
    int event;
};

 

Copyright (c) 2014 LittleHann All rights reserved

 

