Overview of the epoll Mechanism on Linux

Reposted article, originally published 2020-05-11 12:00. Topics: Android, Handler

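While studying Handler in depth, we learned that the infinite loop in Looper does not drive CPU usage up because it relies on the epoll mechanism on Linux.

Android's application layer implements the queue through Message.java and uses a pipe together with epoll to manage the thread's sleep and wake-up state. Combined, these form the message queue model of the Android main thread.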
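To make the pipe-plus-epoll idea concrete, here is a minimal C sketch of a wake-up channel in the style described above. It is not Android's actual implementation: the names toy_looper, toy_looper_init, toy_looper_wake, and toy_looper_poll_once are hypothetical, and the sketch assumes only the standard Linux pipe, fcntl, and epoll calls.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical toy looper: an epoll instance watching the read end of a pipe. */
struct toy_looper {
    int epfd;     /* epoll instance */
    int wake_rd;  /* read end of the wake-up pipe */
    int wake_wr;  /* write end of the wake-up pipe */
};

/* Switches fd to non-blocking mode; returns 0 on success, -1 on failure. */
static int toy_set_nonblock(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Creates the pipe and the epoll instance; returns 0 on success, -1 on failure. */
int toy_looper_init(struct toy_looper *lp) {
    int p[2];
    if (pipe(p) == -1)
        return -1;
    lp->wake_rd = p[0];
    lp->wake_wr = p[1];
    lp->epfd = epoll_create1(0);
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = lp->wake_rd;
    if (lp->epfd == -1 ||
        toy_set_nonblock(p[0]) == -1 || toy_set_nonblock(p[1]) == -1 ||
        epoll_ctl(lp->epfd, EPOLL_CTL_ADD, lp->wake_rd, &ev) == -1) {
        if (lp->epfd != -1)
            close(lp->epfd);
        close(p[0]);
        close(p[1]);
        return -1;
    }
    return 0;
}

/* Called by a producer after it has enqueued a message: one byte wakes the looper. */
int toy_looper_wake(struct toy_looper *lp) {
    return write(lp->wake_wr, "w", 1) == 1 ? 0 : -1;
}

/* Blocks in epoll_wait (no busy loop) until woken or until timeout_ms elapses.
 * Returns 1 if woken, 0 on timeout, -1 on error. */
int toy_looper_poll_once(struct toy_looper *lp, int timeout_ms) {
    struct epoll_event ev;
    int n = epoll_wait(lp->epfd, &ev, 1, timeout_ms);
    if (n <= 0)
        return n;
    char buf[64];
    while (read(lp->wake_rd, buf, sizeof buf) > 0) {
        /* drain pending wake-up bytes so the next wait can sleep again */
    }
    return 1;
}
```

A consumer thread would call toy_looper_poll_once in its loop and process queued messages whenever it returns 1; while the queue is empty the thread sleeps inside epoll_wait instead of spinning.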

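Related Handler articles:

Android Handler Mechanism (1): A Complete Walkthrough of How Handler Works

Android Handler Mechanism (2): Questions for a Deeper Look at the Handler Mechanism

Reference for this article:

Understanding Epoll in Depth: https://zhuanlan.zhihu.com/p/93609693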

1. Introduction to epoll

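epoll is a system call interface introduced in the Linux 2.6 kernel. It was designed from the start to replace select and its linear-complexity model. The cost of waiting on an epoll instance does not depend on how many file descriptors are being monitored, which is usually summarized as O(1), so epoll scales well in high-concurrency scenarios as the number of file descriptors grows.

select and poll: walk the list of monitored file descriptors with a linear search, O(n)
epoll: uses a per-file callback mechanism inside the kernel, O(1)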
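The linear cost of select is visible even in user code. The sketch below is only an illustration, and select_readable is a hypothetical helper name: on every call it rebuilds the descriptor set and then scans all n descriptors to find the ready ones, which is the O(n) pattern described above.

```c
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Checks, without blocking, which of the n descriptors in fds[] are readable.
 * Ready descriptors are written to out[] (which must hold n entries).
 * Returns the number of ready descriptors, or -1 on error. */
int select_readable(const int *fds, int n, int *out) {
    fd_set rset;
    int maxfd = -1;

    FD_ZERO(&rset);
    for (int i = 0; i < n; i++) {            /* O(n): the set is rebuilt on every call */
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            return -1;                       /* select cannot represent this descriptor */
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    struct timeval tv = { 0, 0 };            /* zero timeout: poll and return immediately */
    int r = select(maxfd + 1, &rset, NULL, NULL, &tv);
    if (r <= 0)
        return r;

    int count = 0;
    for (int i = 0; i < n; i++)              /* O(n): scan again to find the ready ones */
        if (FD_ISSET(fds[i], &rset))
            out[count++] = fds[i];
    return count;
}
```

With epoll, the registration step happens once through epoll_ctl, and epoll_wait hands back only the descriptors that are ready, so neither loop is needed per call.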

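[Figure: CPU time as a function of the number of file descriptors]

/proc/sys/fs/epoll/max_user_watches

This file gives the upper limit on the total number of file descriptors that a single user can register with epoll instances.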
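The limit can be read like any other text file. The following is a small sketch, with read_max_user_watches as a hypothetical helper name; it assumes the file contains a single decimal number and returns -1 when the file is missing or unreadable (for example on a non-Linux system).

```c
#include <stdio.h>

/* Reads the per-user epoll watch limit.
 * Returns the limit, or -1 if the file cannot be opened or parsed. */
long read_max_user_watches(void) {
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    if (f == NULL)
        return -1;
    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```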

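Key functions

epoll_create1: creates an epoll instance and returns a file descriptor that refers to it

epoll_ctl: adds a file descriptor to be monitored to the epoll instance (the example code later in this article adds a listening socket)

epoll_wait: waits for events to occur on the epoll instance and returns the events together with the corresponding file descriptors

The core data structures of epoll are as follows: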

typedef union epoll_data {
  void *ptr;
  int fd;
  uint32_t u32;
  uint64_t u64;
} epoll_data_t;

struct epoll_event {
  uint32_t events;  /* Epoll events */
  epoll_data_t data;    /* User data variable */
};
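Because epoll_data is a union, each registration carries exactly one piece of user data: either the fd or, for example, a pointer. The sketch below shows the pointer variant, which is a common way to get back to per-connection state when an event fires. The struct conn_ctx type and the helpers watch_with_ctx and wait_for_ctx are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection state that we want back when an event fires. */
struct conn_ctx {
    int fd;
    const char *name;
};

/* Registers ctx->fd for read events and stores ctx itself in data.ptr.
 * data is a union, so set either ptr or fd, never both.
 * Returns 0 on success, -1 on failure. */
int watch_with_ctx(int epfd, struct conn_ctx *ctx) {
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.ptr = ctx;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, ctx->fd, &ev);
}

/* Waits for one event and returns the context stored at registration time,
 * or NULL on timeout or error. */
struct conn_ctx *wait_for_ctx(int epfd, int timeout_ms) {
    struct epoll_event ev;
    if (epoll_wait(epfd, &ev, 1, timeout_ms) != 1)
        return NULL;
    return (struct conn_ctx *)ev.data.ptr;
}
```

The kernel never interprets the data field; it simply returns whatever was stored, so the context must stay alive for as long as the descriptor is registered.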

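Edge-triggered vs level-triggered

epoll events come in two models: edge-triggered (ET) and level-triggered (LT).

Level-triggered:

A read event keeps firing as long as the socket's receive buffer is not empty and there is data to read
A write event keeps firing as long as the socket's send buffer is not full and more data can be written

Edge-triggered:

A read event fires when the state of the socket's receive buffer changes, that is, when an empty receive buffer has just received data
A write event fires when the state of the socket's send buffer changes, that is, when a full send buffer has just freed up space

An edge-triggered event fires only once per state change, whereas a level-triggered event keeps firing for as long as the condition holds.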
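The difference can be observed with a pipe. The sketch below (count_wakeups is a hypothetical helper name) writes one byte, never reads it, and counts how many consecutive epoll_wait calls report the descriptor. Under these assumptions, level-triggered mode reports it on every call, while edge-triggered mode reports it only on the first.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a fresh pipe with EPOLLIN | extra_flags, writes one
 * byte that is never read, then calls epoll_wait `rounds` times with a zero timeout.
 * Returns how many of those calls reported the descriptor, or -1 on setup failure. */
int count_wakeups(uint32_t extra_flags, int rounds) {
    int p[2];
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev;
    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = p[0];

    int hits = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        hits = 0;
        for (int i = 0; i < rounds; i++) {
            struct epoll_event out;
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                hits++;               /* the unread byte was reported again */
        }
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return hits;
}
```

Calling count_wakeups(0, 3) exercises level-triggered mode and count_wakeups(EPOLLET, 3) exercises edge-triggered mode. This is why edge-triggered code has to read until the descriptor is drained: no further notification arrives for data that was already there.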

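Event macros

EPOLLIN: the file descriptor is readable (this includes an orderly shutdown by the peer socket);
EPOLLOUT: the file descriptor is writable;
EPOLLPRI: the file descriptor has urgent data to read (this should mean that out-of-band data has arrived);
EPOLLERR: an error occurred on the file descriptor;
EPOLLHUP: the file descriptor was hung up;
EPOLLET: switches the file descriptor to edge-triggered mode; the default is level-triggered.
EPOLLONESHOT: report only one event. After that event has been delivered, the socket must be re-armed on the epoll instance (with epoll_ctl and EPOLL_CTL_MOD) if it should continue to be monitored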
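As an illustration of EPOLLONESHOT, the sketch below registers a descriptor for a single notification and re-arms it with EPOLL_CTL_MOD. The helper names oneshot_add, oneshot_rearm, and poll_now are hypothetical.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Registers fd for exactly one read notification. Returns 0 or -1. */
int oneshot_add(int epfd, int fd) {
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Re-arms fd after its one-shot event has been delivered. The fd is still
 * registered (only disabled), so EPOLL_CTL_MOD is used, not EPOLL_CTL_ADD. */
int oneshot_rearm(int epfd, int fd) {
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Returns the number of events (0 or 1) available right now, or -1 on error. */
int poll_now(int epfd) {
    struct epoll_event out;
    return epoll_wait(epfd, &out, 1, 0);
}
```

One-shot mode is typically used when several threads wait on the same epoll instance, so that only one thread handles a given descriptor until that thread explicitly re-arms it.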

libevent uses level-triggered mode; nginx uses edge-triggered mode.

Code

           #include <stdio.h>
           #include <stdlib.h>
           #include <sys/epoll.h>
           #include <sys/socket.h>

           #define MAX_EVENTS 10

           struct epoll_event ev, events[MAX_EVENTS];
           int listen_sock, conn_sock, nfds, epollfd, n;
           struct sockaddr_storage addr;
           socklen_t addrlen;

           /* Code to set up listening socket, 'listen_sock',
              (socket(), bind(), listen()) omitted */

           // Create the epoll instance
           epollfd = epoll_create1(0);

           if (epollfd == -1) {
               perror("epoll_create1");
               exit(EXIT_FAILURE);
           }

           // Add the file descriptor of the listening socket to the epoll interest list
           ev.events = EPOLLIN;
           ev.data.fd = listen_sock;
           if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
               perror("epoll_ctl: listen_sock");
               exit(EXIT_FAILURE);
           }

           for (;;) {
               // epoll_wait blocks the thread until an event occurs
               nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
               if (nfds == -1) {
                   perror("epoll_wait");
                   exit(EXIT_FAILURE);
               }

               for (n = 0; n < nfds; ++n) {
                   if (events[n].data.fd == listen_sock) {
                       // A new incoming connection
                       addrlen = sizeof(addr);
                       conn_sock = accept(listen_sock,
                                          (struct sockaddr *) &addr, &addrlen);
                       // accept returns the file descriptor of the new connection
                       if (conn_sock == -1) {
                           perror("accept");
                           exit(EXIT_FAILURE);
                       }
                       // setnonblocking (not shown) puts the descriptor into non-blocking mode
                       setnonblocking(conn_sock);

                       ev.events = EPOLLIN | EPOLLET;
                       ev.data.fd = conn_sock;
                       // Add the new descriptor to the epoll interest list in ET mode
                       if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                                   &ev) == -1) {
                           perror("epoll_ctl: conn_sock");
                           exit(EXIT_FAILURE);
                       }
                   } else {
                       // Consume the data on an already monitored descriptor (not shown)
                       do_use_fd(events[n].data.fd);
                   }
               }
           }
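The listing above calls setnonblocking and do_use_fd without defining them. The following is one possible sketch of such helpers, not the original author's code: setnonblocking uses fcntl, and drain_fd (a hypothetical stand-in for the reading part of do_use_fd) reads until the descriptor reports EAGAIN, which is what edge-triggered mode requires.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Puts fd into non-blocking mode. Returns 0 on success, -1 on failure. */
int setnonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Reads from a non-blocking fd until no more data is available.
 * Returns the number of bytes consumed (also on end of file),
 * or -1 on a real error. The data itself is discarded in this sketch. */
long drain_fd(int fd) {
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;               /* a real handler would process buf here */
            continue;
        }
        if (n == 0)
            return total;             /* peer closed the connection */
        if (errno == EINTR)
            continue;                 /* interrupted by a signal, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;             /* fully drained, wait for the next edge */
        return -1;
    }
}
```

With EPOLLET, stopping before EAGAIN would leave data in the buffer with no further notification, so the loop is part of the contract rather than an optimization.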

Performance test

The wrk benchmarking tool was used to test a simple epoll-based, event-driven HTTP server.

2. Why epoll Is Efficient

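In the Linux kernel, the epoll source lives mainly in eventpoll.c and eventpoll.h, located at fs/eventpoll.c and include/linux/eventpoll.h; see Linux 3.16 for details. The excerpt below shows the key data structures: epitem, the red-black tree node, and eventpoll, the main entry-point structure, which holds the head of the ready list and the root of the red-black tree.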

/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 * Avoid increasing the size of this struct, there can be many thousands
 * of these on a server and we do not want this to take another cache line.
 */
struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };

    /* List header used to link this structure to the eventpoll ready list */
    struct list_head rdllink;

    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    struct epitem *next;

    /* The file descriptor information this item refers to */
    struct epoll_filefd ffd;

    /* Number of active wait queue attached to poll operations */
    int nwait;

    /* List containing poll wait queues */
    struct list_head pwqlist;

    /* The "container" of this item */
    struct eventpoll *ep;

    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;

    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;

    /* The structure that describe the interested events and the source fd */
    struct epoll_event event;
};

/*
 * This structure is stored inside the "private_data" member of the file
 * structure and represents the main data structure for the eventpoll
 * interface.
 */
struct eventpoll {
    /* Protect the access to this structure */
    spinlock_t lock;

    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    struct mutex mtx;

    /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    struct list_head rdllist;

    /* RB tree root used to store monitored fd structs */
    struct rb_root rbr;

    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    struct epitem *ovflist;

    /* wakeup_source used when ep_scan_ready_list is running */
    struct wakeup_source *ws;

    /* The user that created the eventpoll descriptor */
    struct user_struct *user;

    struct file *file;

    /* used to optimize loop detection check */
    int visited;
    struct list_head visited_list_link;
};

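epoll uses a red-black tree (RB-tree) to track all monitored file descriptors; struct eventpoll holds the root of that tree. When epoll_create is called, the kernel creates a file node in the epoll file system, sets up a red-black tree in the kernel cache to store the sockets later passed in through epoll_ctl, and also creates a linked list that stores the events that are ready. When epoll_wait is called, it only has to check whether this list contains anything. If it does, epoll_wait returns; if it does not, the caller sleeps, and once the timeout expires the call returns even if the list is still empty. That is why epoll_wait is so efficient. In addition, even when we monitor millions of handles, usually only a small number of them are ready at any one time, so epoll_wait only needs to copy a small number of handles from kernel space to user space.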
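The claim that only ready handles are returned is easy to check from user space. In the sketch below, ready_count is a hypothetical helper that monitors n pipes, makes k of them readable, and returns what a single epoll_wait call reports. MAX_PIPES is an arbitrary cap chosen so that the sketch stays within typical descriptor limits.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_PIPES 128

/* Monitors n pipes, writes one byte into the first k of them, and returns the
 * number of events reported by one non-blocking epoll_wait call (-1 on error). */
int ready_count(int n, int k) {
    int rd[MAX_PIPES], wr[MAX_PIPES];
    struct epoll_event out[MAX_PIPES];
    if (n < 1 || n > MAX_PIPES || k < 0 || k > n)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;

    int ok = 1, made = 0;
    while (ok && made < n) {
        int p[2];
        if (pipe(p) == -1) {
            ok = 0;
            break;
        }
        rd[made] = p[0];
        wr[made] = p[1];
        made++;
        struct epoll_event ev;
        ev.events = EPOLLIN;
        ev.data.fd = p[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)
            ok = 0;
    }

    for (int i = 0; ok && i < k; i++)
        if (write(wr[i], "x", 1) != 1)
            ok = 0;

    /* Only the k readable pipes are reported, although n are monitored. */
    int result = ok ? epoll_wait(epfd, out, n, 0) : -1;

    for (int i = 0; i < made; i++) {
        close(rd[i]);
        close(wr[i]);
    }
    close(epfd);
    return result;
}
```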

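So how is this ready list maintained?

When we call epoll_ctl, the kernel not only places the socket in the red-black tree that belongs to the file object in the epoll file system, it also registers a callback (in the kernel source, ep_poll_callback, attached to the wait queue of the monitored file). The callback tells the kernel: when the interrupt for this handle arrives, put the handle on the ready list. So when data arrives on a socket, the kernel copies the data from the network card into kernel memory and then inserts the socket into the ready list.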
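The interplay between the callback and the ready list can be modeled in a few lines of user-space C. This is a deliberately simplified toy, not kernel code: all names (toy_item, toy_eventpoll, toy_on_event, toy_wait) are hypothetical, there is no locking, and the red-black tree is left out because it matters only for registration, not for waiting.

```c
#include <stddef.h>

/* Toy counterpart of struct epitem: one monitored descriptor. */
struct toy_item {
    int fd;
    int on_ready_list;            /* 1 while linked into the ready list */
    struct toy_item *next_ready;
};

/* Toy counterpart of struct eventpoll: only the ready list is modeled. */
struct toy_eventpoll {
    struct toy_item *ready_head;
    struct toy_item *ready_tail;
};

/* Plays the role of the wake-up callback: appends the item to the ready
 * list in O(1), regardless of how many items are registered. */
void toy_on_event(struct toy_eventpoll *ep, struct toy_item *it) {
    if (it->on_ready_list)
        return;                   /* already queued, nothing to do */
    it->on_ready_list = 1;
    it->next_ready = NULL;
    if (ep->ready_tail != NULL)
        ep->ready_tail->next_ready = it;
    else
        ep->ready_head = it;
    ep->ready_tail = it;
}

/* Plays the role of epoll_wait without the sleeping: pops up to max ready
 * items. The cost depends on the number of ready items only. */
int toy_wait(struct toy_eventpoll *ep, int *out_fds, int max) {
    int n = 0;
    while (n < max && ep->ready_head != NULL) {
        struct toy_item *it = ep->ready_head;
        ep->ready_head = it->next_ready;
        if (ep->ready_head == NULL)
            ep->ready_tail = NULL;
        it->on_ready_list = 0;
        it->next_ready = NULL;
        out_fds[n++] = it->fd;
    }
    return n;
}
```

If 1000 items are registered and events fire for two of them, toy_wait touches exactly those two, which mirrors why the real epoll_wait does not slow down as the interest list grows.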

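epoll is not more efficient than select in every situation. For example, when fewer than 1024 file descriptors are monitored and most of the sockets are active and busy, select can be more efficient than epoll, because epoll needs more system calls and therefore switches between user mode and kernel mode more often.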

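The efficiency of epoll comes down to the following:

It reduces the copying of file handles between user space and kernel space
It reduces the traversal of readable and writable file handles
Only the ready events are copied to user space, into the array passed to epoll_wait. The widespread claim that epoll avoids copying by having the kernel and the user process mmap the same memory is not accurate: epoll does not use mmap for this
I/O performance does not degrade as the number of monitored file descriptors grows
It stores the fds and their callbacks in a red-black tree, whose insert, lookup, and delete operations perform well and which, unlike a hash table, does not need a large amount of space allocated in advance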
