Differences Between BIO and NIO, and How They Work


  When I first studied NIO I only learned its basic usage. I knew that Selector, Channel, and Buffer are its three important components, but I never looked into why it is called NIO or what its advantages are. This post records the details.

1. Basic concepts

 

  Kernel mode: runs kernel code. In kernel mode, code has full control over the hardware: it can execute any CPU instruction and access memory at any address. Kernel mode serves the lowest-level, most trusted functions of the operating system. Any crash in kernel mode is catastrophic and halts the whole machine.

  User mode: runs user programs. In user mode, code has no direct control over the hardware and can only access its own user-space addresses; a program reaches hardware and memory by calling system interfaces (system APIs). Under this protection, even if a program crashes, the system can recover. Most programs on your machine run in user mode.

  When a program needs to call into the kernel, a mode switch is involved: the CPU first saves the user thread's context, switches to kernel mode to execute the kernel code, and finally switches back to the application according to the saved context.

 

File descriptor (fd): a file descriptor is a computer-science term for an abstract handle that refers to a file. Formally it is a non-negative integer. In practice it is an index into a per-process table, maintained by the kernel, of the files that process has opened. When a program opens an existing file or creates a new one, the kernel returns a file descriptor to the process. Low-level programming often revolves around file descriptors, although the concept mostly applies to UNIX-like operating systems such as Linux.
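As a small illustration (my own sketch, not part of the original analysis), on Linux a Java process can list its currently open file descriptors by reading the /proc/self/fd directory:

import java.io.File;

public class ListFds {

    public static void main(String[] args) {
        // /proc/self/fd contains one entry per file descriptor currently open in this process (Linux only)
        File fdDir = new File("/proc/self/fd");
        String[] fds = fdDir.list();
        if (fds != null) {
            for (String fd : fds) {
                System.out.println("open fd: " + fd);
            }
        }
    }
}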

2. BIO test

BIO test code:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SocketServer {

    private static final ExecutorService executorService = Executors.newFixedThreadPool(5);

    public static void main(String[] args) throws Exception {
        ServerSocket serverSocket = new ServerSocket(8088);
        System.out.println("serverSocket 8088 start");
        while (true) {
            Socket socket = serverSocket.accept();
            System.out.println("socket.getInetAddress(): " + socket.getInetAddress());
            executorService.execute(new MyThread(socket));
        }
    }

    static class MyThread extends Thread {

        private Socket socket;

        public MyThread(Socket socket) {
            this.socket = socket;
        }

        @Override
        public void run() {
            try {
                InputStream inputStream = socket.getInputStream();
                BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
                while (true) {
                    String s = bufferedReader.readLine();
                    if (s == null) { // readLine returns null once the client closes the connection
                        break;
                    }
                    System.out.println(Thread.currentThread().getId() + " 收到的消息\t" + s);
                }
            } catch (Exception exception) {
                // ignore
            } finally {
            }
        }
    }
}

Compile with JDK 6 and then trace the system calls during execution:

[root@localhost jdk6]# ./jdk1.6.0_06/bin/javac SocketServer.java 
[root@localhost jdk6]# strace -ff -o out ./jdk1.6.0_06/bin/java SocketServer
serverSocket 8088 start

strace is a Linux user-space tracer for diagnostics, debugging, and teaching. Here we use it to monitor the interaction between a user-space process and the kernel: system calls, signal delivery, process state changes, and so on.

Several out files are generated, as shown below (one file per thread; by default the JVM creates some daemon threads at startup, e.g. for GC or for handling commands such as jmap):

[root@localhost jdk6]# ll
total 64092
drwxr-xr-x. 9   10  143      204 Jul 23  2008 jdk1.6.0_06
-rw-r--r--. 1 root root 64885867 Jul 20 03:33 jdk-6u6-p-linux-x64.tar.gz
-rw-r--r--. 1 root root    21049 Jul 20 07:04 out.10685
-rw-r--r--. 1 root root   139145 Jul 20 07:04 out.10686
-rw-r--r--. 1 root root    21470 Jul 20 07:06 out.10687
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10688
-rw-r--r--. 1 root root      906 Jul 20 07:04 out.10689
-rw-r--r--. 1 root root      985 Jul 20 07:04 out.10690
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10691
-rw-r--r--. 1 root root      906 Jul 20 07:04 out.10692
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10693
-rw-r--r--. 1 root root   388112 Jul 20 07:06 out.10694
-rw-r--r--. 1 root root     1433 Jul 20 07:04 SocketServer.class
-rw-r--r--. 1 root root     1626 Jul 20 07:03 SocketServer.java
-rw-r--r--. 1 root root     1297 Jul 20 07:04 SocketServer$MyThread.class
[root@localhost jdk6]# ll | grep out | wc -l
10

1. Find which file contains the socket keyword

[root@localhost jdk6]# grep socket out.*
out.10686:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
out.10686:connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
out.10686:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
out.10686:connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
out.10686:socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
out.10686:getsockname(0, 0x7f64c9083350, [28])    = -1 ENOTSOCK (Socket operation on non-socket)
out.10686:socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
out.10686:socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 5
out.10686:socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4

We can see that out.10686 contains the socket creation calls, so let's analyze that file.

(1) Jump to the end of the file:

 We can see the process is blocked in an accept call that has not returned yet.

(2) Trace back further to see the socket creation, bind, and listen steps.

 The main startup sequence of a SocketServer is as follows:

socket => 4 (file descriptor)
bind(4, 8088)
listen(4)

accept(4,   <- blocks here, no return value yet

The accept(2) man page is shown below (it accepts a connection on a socket; on success it returns a non-negative integer):

[root@localhost jdk6]# man 2 accept
ACCEPT(2)                                                    Linux Programmer's Manual                                                    ACCEPT(2)

NAME
       accept, accept4 - accept a connection on a socket

SYNOPSIS
       #include <sys/types.h>          /* See NOTES */
       #include <sys/socket.h>

       int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);

       #define _GNU_SOURCE             /* See feature_test_macros(7) */
       #include <sys/socket.h>

       int accept4(int sockfd, struct sockaddr *addr,
                   socklen_t *addrlen, int flags);

DESCRIPTION
       The accept() system call is used with connection-based socket types (SOCK_STREAM, SOCK_SEQPACKET).  It extracts the first connection request
       on the queue of pending connections for the listening socket, sockfd, creates a new connected socket, and  returns  a  new  file  descriptor
       referring to that socket.  The newly created socket is not in the listening state.  The original socket sockfd is unaffected by this call.

       The  argument  sockfd  is a socket that has been created with socket(2), bound to a local address with bind(2), and is listening for connec‐
       tions after a listen(2)
RETURN VALUE
       On success, these system calls return a nonnegative integer that is a descriptor for the accepted socket.  On error,  -1  is  returned,  and
       errno is set appropriately.

2. Use nc to simulate a client connection

[root@localhost jdk6]# nc localhost 8088
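If nc is not available, a minimal Java client can play the same role; the sketch below is only illustrative (it is not the Client.java that appears in a later directory listing):

import java.io.OutputStream;
import java.net.Socket;

public class SimpleClient {

    public static void main(String[] args) throws Exception {
        // connect to the BIO server and send one line of text
        Socket socket = new Socket("localhost", 8088);
        OutputStream outputStream = socket.getOutputStream();
        outputStream.write("test\n".getBytes());
        outputStream.flush();
        Thread.sleep(60000); // keep the connection open so the server thread stays blocked in recvfrom
        socket.close();
    }
}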

(1) On the server side, one more out file appears:

[root@localhost jdk6]# ll
total 72712
drwxr-xr-x. 9   10  143      204 Jul 23  2008 jdk1.6.0_06
-rw-r--r--. 1 root root 64885867 Jul 20 03:33 jdk-6u6-p-linux-x64.tar.gz
-rw-r--r--. 1 root root    21049 Jul 20 07:04 out.10685
-rw-r--r--. 1 root root   141155 Jul 20 07:32 out.10686
-rw-r--r--. 1 root root   369445 Jul 20 07:33 out.10687
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10688
-rw-r--r--. 1 root root      906 Jul 20 07:04 out.10689
-rw-r--r--. 1 root root      985 Jul 20 07:04 out.10690
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10691
-rw-r--r--. 1 root root      906 Jul 20 07:04 out.10692
-rw-r--r--. 1 root root      941 Jul 20 07:04 out.10693
-rw-r--r--. 1 root root  7103157 Jul 20 07:33 out.10694
-rw-r--r--. 1 root root     1266 Jul 20 07:32 out.10866
-rw-r--r--. 1 root root     1433 Jul 20 07:04 SocketServer.class
-rw-r--r--. 1 root root     1626 Jul 20 07:03 SocketServer.java
-rw-r--r--. 1 root root     1297 Jul 20 07:04 SocketServer$MyThread.class
[root@localhost jdk6]# ll | grep out | wc -l
11

(2) Look at the accept return in out.10686 (after the connection is accepted it returns file descriptor 6, which recvfrom will later use to read data).

 We can also see a clone call creating the thread that handles the task; at the kernel level, threads are created with clone. Linux has no threads in the strict sense: there is no dedicated kernel structure for them, and a thread is simulated by processes that share one address space.

The clone(2) man page is shown below (it is similar to fork, which creates a child process):

man 2 clone

CLONE(2)                                                     Linux Programmer's Manual                                                     CLONE(2)

NAME
       clone, __clone2 - create a child process

SYNOPSIS
       /* Prototype for the glibc wrapper function */

       #include <sched.h>

       int clone(int (*fn)(void *), void *child_stack,
                 int flags, void *arg, ...
                 /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

       /* Prototype for the raw system call */

       long clone(unsigned long flags, void *child_stack,
                 void *ptid, void *ctid,
                 struct pt_regs *regs);

   Feature Test Macro Requirements for glibc wrapper function (see feature_test_macros(7)):

       clone():
           Since glibc 2.14:
               _GNU_SOURCE
           Before glibc 2.14:
               _BSD_SOURCE || _SVID_SOURCE
                   /* _GNU_SOURCE also suffices */

DESCRIPTION
       clone() creates a new process, in a manner similar to fork(2).

(3) Look at out.10866:

 The current thread is blocked in recvfrom; the recvfrom(2) man page is shown below:

man 2 recvfrom

RECV(2)                                                      Linux Programmer's Manual                                                      RECV(2)

NAME
       recv, recvfrom, recvmsg - receive a message from a socket

SYNOPSIS
       #include <sys/types.h>
       #include <sys/socket.h>

       ssize_t recv(int sockfd, void *buf, size_t len, int flags);

       ssize_t recvfrom(int sockfd, void *buf, size_t len, int flags,
                        struct sockaddr *src_addr, socklen_t *addrlen);

       ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);

DESCRIPTION
       The recvfrom() and recvmsg() calls are used to receive messages from a socket, and may be used to receive data on a socket whether or not it
       is connection-oriented.

       If src_addr is not NULL, and the underlying protocol provides the source address, this source address is filled in.  When src_addr is  NULL,
       nothing  is  filled  in; in this case, addrlen is not used, and should also be NULL.  The argument addrlen is a value-result argument, which
       the caller should initialize before the call to the size of the buffer associated with src_addr, and modified  on  return  to  indicate  the
       actual size of the source address.  The returned address is truncated if the buffer provided is too small; in this case, addrlen will return
       a value greater than was supplied to the call.

       The recv() call is normally used only on a connected socket (see connect(2)) and is identical to recvfrom() with a NULL src_addr argument.
...

RETURN VALUE
       These  calls return the number of bytes received, or -1 if an error occurred.  In the event of an error, errno is set to indicate the error.
       The return value will be 0 when the peer has performed an orderly shutdown.

  So recvfrom reads data from the socket connection and blocks until data arrives.

(4) Send a message from the nc client

[root@localhost jdk6]# nc localhost 8088
test

1) The main window prints the following:

[root@localhost jdk6]# strace -ff -o out ./jdk1.6.0_06/bin/java SocketServer
serverSocket 8088 start
socket.getInetAddress(): /0:0:0:0:0:0:0:1
9 收到的消息    test

2) The calls recorded in out.10866:

 

   After the message is received, the thread calls recvfrom again and blocks.

 

Summary of BIO's problems:

1. One thread per connection, which costs thread memory and CPU scheduling overhead.

2. The root cause is blocking: the accept and recvfrom kernel calls block. The fix is the NONBLOCKING (non-blocking) option provided by the kernel.

3. The socket(2) call offers a SOCK_NONBLOCK flag to make the socket non-blocking (when nothing is available the call returns -1, which Java surfaces as null):

SOCK_NONBLOCK   Set the O_NONBLOCK file status flag on the new open file description.  Using this flag saves  extra  calls  to  fcntl(2)  to
                       achieve the same result.

 

3. NIO test

In Java, NIO is read as "new IO"; at the operating-system level it corresponds to non-blocking IO. The test below is based on JDK 8.

The code is as follows:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.LinkedList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class NIOSocket {


    private static final ExecutorService executorService = Executors.newFixedThreadPool(5);

    public static void main(String[] args) throws Exception {
        LinkedList<SocketChannel> clients = new LinkedList<>();
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        serverSocketChannel.bind(new InetSocketAddress(8088));
        serverSocketChannel.configureBlocking(false); // corresponds to the OS-level NONBLOCKING mode

        while (true) {
            Thread.sleep(500);
            /**
             *  accept triggers the kernel accept system call
             *  with BIO it blocks until a client connects and then returns that client's fd
             *  with NONBLOCKING it always returns; the return value is just -1 when there is no connection
             **/
            SocketChannel client = serverSocketChannel.accept(); // does not block: the OS returns -1, Java returns null
            if (client != null) {
                client.configureBlocking(false);
                int port = client.socket().getPort();
                System.out.println("client.socket().getPort(): " + port);
                clients.add(client);
            }

            ByteBuffer buffer = ByteBuffer.allocate(4096); // allocate a buffer, either on-heap or off-heap (direct memory)
            // iterate over the clients and read their messages
            for (SocketChannel c : clients) {
                int read = c.read(buffer); // returns >0, 0 or -1 and never blocks
                if (read > 0) {
                    buffer.flip();
                    byte[] bytes = new byte[buffer.limit()];
                    buffer.get(bytes);
                    String string = new String(bytes);
                    System.out.println("client.socket().getPort(): " + c.socket().getPort() + " 收到的消息: " + string);
                    buffer.clear();
                }
            }
        }
    }
}

0. The NIO startup sequence is as follows:

socket => 4 (file descriptor)
bind(4, 8088)
listen(4)
set fd 4 to non-blocking (the kernel-level non-blocking flag)

accept(4, xxx) => -1 when there is no connection, or a new fd (e.g. 6) when a client connects

  With non-blocking set, the kernel calls never block: they return a file descriptor when something is available and -1 otherwise, which inside the application maps to null (accept) or -1 (read).

1. Compile with JDK 8

2. Trace with strace

[root@localhost jdk8]# strace -ff -o out ./jdk1.8.0_291/bin/java NIOSocket

3. Check the generated out files

[root@localhost jdk8]# ll 
total 143724
drwxr-xr-x. 8 10143 10143       273 Apr  7 15:14 jdk1.8.0_291
-rw-r--r--. 1 root  root  144616467 Jul 20 03:42 jdk-8u291-linux-i586.tar.gz
-rw-r--r--. 1 root  root       2358 Jul 20 08:18 NIOSocket.class
-rw-r--r--. 1 root  root       2286 Jul 20 08:18 NIOSocket.java
-rw-r--r--. 1 root  root      12822 Jul 20 08:20 out.11117
-rw-r--r--. 1 root  root    1489453 Jul 20 08:20 out.11118
-rw-r--r--. 1 root  root      10315 Jul 20 08:20 out.11119
-rw-r--r--. 1 root  root       1445 Jul 20 08:20 out.11120
-rw-r--r--. 1 root  root       1424 Jul 20 08:20 out.11121
-rw-r--r--. 1 root  root        884 Jul 20 08:20 out.11122
-rw-r--r--. 1 root  root      11113 Jul 20 08:20 out.11123
-rw-r--r--. 1 root  root        884 Jul 20 08:20 out.11124
-rw-r--r--. 1 root  root     269113 Jul 20 08:20 out.11125
[root@localhost jdk8]# ll | grep out | wc -l
9

A server socket must go through the socket, bind, listen, and accept steps described above; let's look at each:

(1) socket

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4

(2) bind and listen

bind(4, {sa_family=AF_INET6, sin6_port=htons(8088), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(4, 50)

(3) Look at accept (it runs in non-blocking mode and by default returns -1 when there is no connection):

4. Create a connection with nc

nc localhost 8088

5. Check out.11118

There is now one accept call whose return value is not -1:

accept(4, {sa_family=AF_INET6, sin6_port=htons(59238), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 5

6. Send the message HELLO over the established connection

7. The main thread prints the message

[root@localhost jdk8]# strace -ff -o out ./jdk1.8.0_291/bin/java NIOSocket
client.socket().getPort(): 59238
client.socket().getPort(): 59238 收到的消息: HELLO

8. Check out.11118 again: the data is read with a read call, which returns the message.

 

Pros and cons of NIO:

Pros: avoids the problems caused by one thread per connection.

Cons: suppose there are 10,000 connections and only one of them sends data; every loop iteration still issues 10,000 read system calls to the kernel, 9,999 of which are pointless and waste time and resources (the loop runs from user space into kernel space, so the cost lies in the system calls).

Solution: the kernel evolved further and introduced multiplexers: select, poll, epoll.

 

 Addendum: the code above sets non-blocking mode, but blocking is the default. If we remove the configureBlocking(false) calls, the result looks like this:

1. Code:

public class NIOSocket {

    public static void main(String[] args) throws Exception {
        LinkedList<SocketChannel> clients = new LinkedList<>();
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        serverSocketChannel.bind(new InetSocketAddress(8088));
       //  serverSocketChannel.configureBlocking(false); // corresponds to the OS-level NONBLOCKING mode

        while (true) {
            Thread.sleep(500);
            /**
             *  accept triggers the kernel accept system call
             *  with BIO it blocks until a client connects and then returns that client's fd
             *  with NONBLOCKING it always returns; the return value is just -1 when there is no connection
             **/
            SocketChannel client = serverSocketChannel.accept(); // blocking mode is the default here, so this call now blocks
            if (client != null) {
          //      client.configureBlocking(false);
                int port = client.socket().getPort();
                System.out.println("client.socket().getPort(): " + port);
                clients.add(client);
            }

            ByteBuffer buffer = ByteBuffer.allocate(4096); // allocate a buffer, either on-heap or off-heap (direct memory)
            // iterate over the clients and read their messages
            for (SocketChannel c : clients) {
                int read = c.read(buffer); // in blocking mode this read blocks until data arrives
                if (read > 0) {
                    buffer.flip();
                    byte[] bytes = new byte[buffer.limit()];
                    buffer.get(bytes);
                    String string = new String(bytes);
                    System.out.println("client.socket().getPort(): " + c.socket().getPort() + " 收到的消息: " + string);
                    buffer.clear();
                }
            }
        }
    }
}

2. Check the blocking behavior with strace:

(1) The accept call blocks:

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0

bind(4, {sa_family=AF_INET6, sin6_port=htons(8088), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0

listen(4, 50) 

。。。

accept(4, 

(2) After nc connects, the read call blocks:

 

 

4. Introducing multiplexers

The NIO model above has a drawback: suppose there are 10,000 connections and only one of them sends data; every loop iteration still issues 10,000 read system calls to the kernel, 9,999 of which are pointless and waste time and resources (the loop runs from user space into kernel space, so the cost lies in the system calls).

Solution: the kernel evolved further and introduced multiplexers: select, poll, epoll.

socket => 4 (file descriptor)
bind(4, 8088)
listen(4)
set fd 4 to non-blocking (the kernel-level non-blocking flag)

while(true) {
    select(fds) // one system call checks many fds (select is capped at 1024 fds)
    read(fd)    // read data only from the ready fds
}

 

If the application reads the IO itself, then the IO model, whether BIO, NIO, or multiplexing, is a synchronous IO model: the multiplexer can only report the state of the fds, it cannot hand over the data, so the user program still has to call into the kernel to copy the data from kernel space into its own memory. Windows IOCP is different: the kernel has threads that copy the data into user space for you.
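For contrast, Java NIO.2 (JDK 7+) exposes asynchronous channels in which the completion callback runs only after the data has already been copied into your buffer. The sketch below is illustrative only (the class name and port 8089 are my own choices, not part of the original tests):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;

public class AIOSocket {

    public static void main(String[] args) throws Exception {
        AsynchronousServerSocketChannel server =
                AsynchronousServerSocketChannel.open().bind(new InetSocketAddress(8089));
        server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
            @Override
            public void completed(AsynchronousSocketChannel client, Void att) {
                server.accept(null, this); // keep accepting further connections
                ByteBuffer buffer = ByteBuffer.allocate(1024);
                client.read(buffer, buffer, new CompletionHandler<Integer, ByteBuffer>() {
                    @Override
                    public void completed(Integer read, ByteBuffer buf) {
                        // by the time this callback runs, the data is already in buf
                        buf.flip();
                        System.out.println("read " + read + " bytes");
                    }

                    @Override
                    public void failed(Throwable exc, ByteBuffer buf) {
                        exc.printStackTrace();
                    }
                });
            }

            @Override
            public void failed(Throwable exc, Void att) {
                exc.printStackTrace();
            }
        });
        Thread.sleep(Long.MAX_VALUE); // keep the demo process alive
    }
}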

 

The select and poll multiplexers

Advantage: one system call passes all the fds to the kernel, and the kernel does the traversal; this cuts down the number of system calls.

Disadvantages:

1. The fds are passed to the kernel again on every call. Fix: have the kernel keep the fds in its own space.

2. On every select/poll call the kernel still traverses the full set of fds. Fix: rely on lower-level mechanisms (interrupts and callbacks, computer-organization territory) so the kernel is notified instead of scanning.

3. select supports too few file descriptors: 1024 by default.

  Hence epoll was created.

5. Understanding epoll

Advantages:

1. No limit on the number of fds (poll had already removed this limit, admittedly).

2. Drops the bitmap array in favor of a new structure that can store multiple event types.

3. No repeated copying of fds: add them when needed, drop them when done.

4. Event-driven, so there is no polling to discover readable/writable events.

 

The Linux epoll man page looks like this:

man epoll

NAME
       epoll - I/O event notification facility

SYNOPSIS
       #include <sys/epoll.h>

DESCRIPTION
       The  epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them.  The epoll
       API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers  of  watched  file  descriptors.
       The following system calls are provided to create and manage an epoll instance:

       *  epoll_create(2)  creates  an  epoll instance and returns a file descriptor referring to that instance.  (The more recent epoll_create1(2)
          extends the functionality of epoll_create(2).)

       *  Interest in particular file descriptors is then registered via epoll_ctl(2).  The set of file  descriptors  currently  registered  on  an
          epoll instance is sometimes called an epoll set.

       *  epoll_wait(2) waits for I/O events, blocking the calling thread if no events are currently available.

 We can see that epoll itself consists of three system calls: epoll_create, epoll_ctl, and epoll_wait.

epoll provides three functions: epoll_create, epoll_ctl, and epoll_wait. epoll_create creates an epoll handle, i.e. an epoll instance, and initializes its data structures; epoll_ctl registers the event types to monitor; epoll_wait waits for events to occur.

For the first drawback, epoll's answer lies in epoll_ctl: when a new event is registered on the epoll handle (EPOLL_CTL_ADD), the fd is copied into the kernel at that point, instead of being copied again on every epoll_wait. epoll guarantees that each fd is copied only once during the whole process.

For the second drawback, epoll does not add the current process to every fd's device wait queue on every call the way select and poll do. It enqueues current only once, during epoll_ctl (this one pass is unavoidable), and registers a callback for each fd; when a device becomes ready and wakes up the waiters on its queue, the callback runs and appends the ready fd to a ready list. The job of epoll_wait is simply to check whether this ready list contains anything (it sleeps briefly via schedule_timeout() and checks again, similar to step 7 in the select implementation).

For the third drawback, epoll has no such limit: the maximum number of FDs it supports equals the maximum number of files that can be opened, which is generally far larger than 2048. For example, on a machine with 1 GB of memory it is roughly 100,000; the exact number can be checked with cat /proc/sys/fs/file-max and depends heavily on system memory.

 

Use man 2 to view the section-2 (system call) pages:

(1) epoll_create: creates an epoll instance and initializes its data structures

EPOLL_CREATE(2)                                              Linux Programmer's Manual                                              EPOLL_CREATE(2)

NAME
       epoll_create, epoll_create1 - open an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_create(int size);
       int epoll_create1(int flags);

DESCRIPTION
       epoll_create()  creates  an  epoll(7)  instance.   Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see NOTES
       below.

       epoll_create() returns a file descriptor referring to the new epoll instance.  This file descriptor is used for all the subsequent calls  to
       the  epoll interface.  When no longer required, the file descriptor returned by epoll_create() should be closed by using close(2).  When all
       file descriptors referring to an epoll instance have been closed, the kernel destroys the instance and releases the associated resources for
       reuse.

   epoll_create1()
       If  flags  is  0,  then, other than the fact that the obsolete size argument is dropped, epoll_create1() is the same as epoll_create().  The
       following value can be included in flags to obtain different behavior:

       EPOLL_CLOEXEC
              Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor.  See the description of the O_CLOEXEC flag in open(2) for reasons
              why this may be useful.

RETURN VALUE
       On success, these system calls return a nonnegative file descriptor.  On error, -1 is returned, and errno is set to indicate the error.

 (2) epoll_ctl: adds/removes an fd to/from the epfd returned by epoll_create

EPOLL_CTL(2)                                                 Linux Programmer's Manual                                                 EPOLL_CTL(2)

NAME
       epoll_ctl - control interface for an epoll descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

DESCRIPTION
       This  system call performs control operations on the epoll(7) instance referred to by the file descriptor epfd.  It requests that the opera‐
       tion op be performed for the target file descriptor, fd.

       Valid values for the op argument are :

       EPOLL_CTL_ADD
              Register the target file descriptor fd on the epoll instance referred to by the file descriptor epfd and associate  the  event  event
              with the internal file linked to fd.

       EPOLL_CTL_MOD
              Change the event event associated with the target file descriptor fd.

       EPOLL_CTL_DEL
              Remove  (deregister) the target file descriptor fd from the epoll instance referred to by epfd.  The event is ignored and can be NULL
              (but see BUGS below).

       The event argument describes the object linked to the file descriptor fd.  The struct epoll_event is defined as :

           typedef union epoll_data {
               void        *ptr;
               int          fd;
               uint32_t     u32;
               uint64_t     u64;
           } epoll_data_t;

           struct epoll_event {
               uint32_t     events;      /* Epoll events */
               epoll_data_t data;        /* User data variable */
           };

       The events member is a bit set composed using the following available event types:

       EPOLLIN
              The associated file is available for read(2) operations.

       EPOLLOUT
              The associated file is available for write(2) operations.

       EPOLLRDHUP (since Linux 2.6.17)
              Stream socket peer closed connection, or shut down writing half of connection.  (This flag is especially useful  for  writing  simple
              code to detect peer shutdown when using Edge Triggered monitoring.)

       EPOLLPRI
              There is urgent data available for read(2) operations.

       EPOLLERR
              Error  condition  happened  on the associated file descriptor.  epoll_wait(2) will always wait for this event; it is not necessary to
              set it in events.

       EPOLLHUP
              Hang up happened on the associated file descriptor.  epoll_wait(2) will always wait for this event; it is not necessary to set it  in
              events.

       EPOLLET
              Sets  the  Edge  Triggered  behavior  for  the  associated  file descriptor.  The default behavior for epoll is Level Triggered.  See
              epoll(7) for more detailed information about Edge and Level Triggered event distribution architectures.

       EPOLLONESHOT (since Linux 2.6.2)
              Sets the one-shot behavior for the associated file descriptor.  This means that after an event is pulled out with  epoll_wait(2)  the
              associated  file  descriptor  is internally disabled and no other events will be reported by the epoll interface.  The user must call
              epoll_ctl() with EPOLL_CTL_MOD to rearm the file descriptor with a new event mask.

RETURN VALUE
       When successful, epoll_ctl() returns zero.  When an error occurs, epoll_ctl() returns -1 and errno is set appropriately.

(3) epoll_wait: blocks waiting for readable/writable events returned by the kernel. epfd is again the value returned by epoll_create, and events is a pointer to an array of epoll_event structs, i.e. it stores all the pending epoll_event structures the kernel returns.

EPOLL_WAIT(2)                                                Linux Programmer's Manual                                                EPOLL_WAIT(2)

NAME
       epoll_wait, epoll_pwait - wait for an I/O event on an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_wait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout);
       int epoll_pwait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout,
                      const sigset_t *sigmask);

DESCRIPTION
       The  epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd.  The memory area pointed to
       by events will contain the events that will be available for the caller.  Up to maxevents are returned by epoll_wait().  The maxevents argu‐
       ment must be greater than zero.

       The  timeout  argument  specifies the minimum number of milliseconds that epoll_wait() will block.  (This interval will be rounded up to the
       system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.)  Specifying a timeout
       of  -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero cause epoll_wait() to return immediately, even if
       no events are available.

       The struct epoll_event is defined as :

           typedef union epoll_data {
               void    *ptr;
               int      fd;
               uint32_t u32;
               uint64_t u64;
           } epoll_data_t;

           struct epoll_event {
               uint32_t     events;    /* Epoll events */
               epoll_data_t data;      /* User data variable */
           };

       The data of each returned structure will contain the same data the user set with an  epoll_ctl(2)  (EPOLL_CTL_ADD,EPOLL_CTL_MOD)  while  the
       events member will contain the returned event bit field.

   epoll_pwait()
       The  relationship between epoll_wait() and epoll_pwait() is analogous to the relationship between select(2) and pselect(2): like pselect(2),
       epoll_pwait() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.

       The following epoll_pwait() call:

           ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask);

       is equivalent to atomically executing the following calls:

           sigset_t origmask;

           sigprocmask(SIG_SETMASK, &sigmask, &origmask);
           ready = epoll_wait(epfd, &events, maxevents, timeout);
           sigprocmask(SIG_SETMASK, &origmask, NULL);

       The sigmask argument may be specified as NULL, in which case epoll_pwait() is equivalent to epoll_wait().

RETURN VALUE
       When successful, epoll_wait() returns the number of file descriptors ready for the requested I/O, or zero if no file descriptor became ready
       during the requested timeout milliseconds.  When an error occurs, epoll_wait() returns -1 and errno is set appropriately.

  This shows the event structure that epoll defines.

 1. The official epoll demo

       #define MAX_EVENTS 10
           struct epoll_event ev, events[MAX_EVENTS];
           int listen_sock, conn_sock, nfds, epollfd;

           /* Set up listening socket, 'listen_sock' (socket(),
              bind(), listen()) */

           epollfd = epoll_create(10);
           if (epollfd == -1) {
               perror("epoll_create");
               exit(EXIT_FAILURE);
           }

           ev.events = EPOLLIN;
           ev.data.fd = listen_sock;
           if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
               perror("epoll_ctl: listen_sock");
               exit(EXIT_FAILURE);
           }

           for (;;) {
               nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
               if (nfds == -1) {
                   perror("epoll_pwait");
                   exit(EXIT_FAILURE);
               }

               for (n = 0; n < nfds; ++n) {
                   if (events[n].data.fd == listen_sock) {
                       conn_sock = accept(listen_sock,
                                       (struct sockaddr *) &local, &addrlen);
                       if (conn_sock == -1) {
                           perror("accept");
                           exit(EXIT_FAILURE);
                       }
                       setnonblocking(conn_sock);
                       ev.events = EPOLLIN | EPOLLET;
                       ev.data.fd = conn_sock;
                       if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                                   &ev) == -1) {
                           perror("epoll_ctl: conn_sock");
                           exit(EXIT_FAILURE);
                       }
                   } else {
                       do_use_fd(events[n].data.fd);
                   }
               }
           }

2. Event trigger modes

The epoll event distribution interface is able to behave both as edge-triggered (ET) and as level-triggered (LT); in other words, epoll can deliver events in either of these two modes.

1. Level-triggered (LT): the default mode

As long as the kernel read buffer associated with the fd is non-empty (there is data to read), a readable notification keeps being raised.

As long as the kernel write buffer associated with the fd is not full (there is room to write), a writable notification keeps being raised.

2. Edge-triggered (ET):

A readable notification is raised only when the kernel read buffer associated with the fd goes from empty to non-empty.

A writable notification is raised only when the kernel write buffer associated with the fd goes from full to not full.

The difference between the two:

Level triggering keeps raising the readable event as long as the read buffer contains data, whereas edge triggering notifies only once, when the buffer goes from empty to non-empty. For example:

1. The read buffer is initially empty.

2. 2 KB of data is written into the read buffer.

3. Both level triggering and edge triggering raise a readable event at this point.

4. After the notification, 1 KB is read; 1 KB remains in the read buffer.

5. Level triggering will notify again; edge triggering will not.

So edge triggering requires draining the buffer in one go, i.e. reading until EAGAIN is returned (EAGAIN means the buffer is now empty). For the same reason, edge triggering requires the file descriptor to be set to non-blocking.
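Java's Selector behaves like level triggering, but the same "drain it" idea applies to any non-blocking SocketChannel: keep reading until read() returns 0 (nothing left for now) or -1 (the peer closed). A small illustrative sketch (not from the original post):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class DrainExample {

    // Keep reading a non-blocking channel until read() returns 0 (drained) or -1 (peer closed).
    static void drain(SocketChannel channel) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        int read;
        while ((read = channel.read(buffer)) > 0) {
            buffer.flip();
            System.out.println("read " + read + " bytes: "
                    + new String(buffer.array(), 0, buffer.limit()));
            buffer.clear();
        }
        if (read == -1) { // EOF: the peer closed the connection
            channel.close();
        }
    }
}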

An interview question: with Linux epoll in LT (level-triggered) mode, the writable event keeps firing as long as the socket is writable. How do you handle that?

The straightforward approach:

  When you need to write to the socket, add it to epoll and wait for the writable event. When the writable event arrives, call write or send; once all the data has been written, remove the socket descriptor from the epoll set. This means repeatedly adding and removing the descriptor.

The improved approach:

  When you need to write, call send directly. Only when send returns EAGAIN do you add the socket to epoll, wait for the writable event, and then send the rest; once all the data has been sent, remove it from the epoll set again. The improvement assumes the socket is writable most of the time and only asks epoll to watch it when it is not.
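Translated into Java NIO terms, the improved approach means: write immediately, and only register interest in OP_WRITE when the socket buffer is full (write() returned before everything was written); once the remaining data has been flushed, remove OP_WRITE again. An illustrative sketch (the helper name is my own):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

public class WriteHelper {

    // Try to write immediately; only watch OP_WRITE while the kernel send buffer is full.
    static void write(SelectionKey key, ByteBuffer data) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        channel.write(data); // returns without writing everything once the send buffer is full
        if (data.hasRemaining()) {
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);  // wait for the writable event
        } else {
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE); // everything sent, stop watching
        }
    }
}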

3. epoll model diagram

  It can be roughly pictured as follows:

(1) Call epoll_create to create an epoll instance (initializing its data structures); it returns an fd.

(2) Call epoll_ctl to register events on that fd; in effect this adds an fd plus the events to monitor into kernel space (kept in a red-black tree).

When an event occurs, the kernel moves the corresponding event structure onto a separate ready queue.

(3) Call epoll_wait to fetch events from the ready queue (each event carries the fd, the event type, and so on).

 

 4. Test

The code is as follows:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.Set;

public class NIOSocket {

    public static void main(String[] args) throws Exception {
        // Create a ServerSocketChannel -> ServerSocket
        // A ServerSocketChannel in Java NIO is a channel that can listen for incoming TCP connections, just like a ServerSocket in classic IO. It lives in the java.nio.channels package.
        // It is opened by calling ServerSocketChannel.open(), e.g.:
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        serverSocketChannel.socket().bind(new InetSocketAddress(8088));
        serverSocketChannel.configureBlocking(false);

        // Obtain a Selector object (e.g. sun.nio.ch.WindowsSelectorImpl on Windows)
        Selector selector = Selector.open();

        // Register serverSocketChannel with the selector, interested in the OP_ACCEPT event
        // SelectionKey defines 4 kinds of events:
        // SelectionKey.OP_ACCEPT —— accept event: the server has detected an incoming client connection and can accept it
        // SelectionKey.OP_CONNECT —— connect event: the connection between client and server has been established
        // SelectionKey.OP_READ  —— read-ready event: the channel has data that can be read now
        // SelectionKey.OP_WRITE —— write-ready event: the channel can be written to now
        serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);

        System.out.println("注冊后的selectionkey 數量=" + selector.keys().size()); // 1

        // loop, waiting for client connections
        while (true) {
            // wait up to 1 second here; if no event occurs, select returns 0 and we continue
            if (selector.select(1000) == 0) { // no event occurred
//                System.out.println("服務器等待了1秒,無連接");
                continue;
            }

            // if the return value is > 0, fetch the related SelectionKey set
            // 1. a return value > 0 means events of interest have occurred
            // 2. selector.selectedKeys() returns the set of keys for those events
            //    from each SelectionKey we can get the channel back
            Set<SelectionKey> selectionKeys = selector.selectedKeys();
            System.out.println("selectionKeys 數量 = " + selectionKeys.size());

            // hasNext(): returns false once the last element has been reached
            // next(): advances the iterator and returns a reference to the next element
            // remove(): removes from the underlying collection the last element returned by the iterator
            // iterate over the Set<SelectionKey> using an iterator
            Iterator<SelectionKey> keyIterator = selectionKeys.iterator();

            while (keyIterator.hasNext()) {
                // get the SelectionKey
                SelectionKey key = keyIterator.next();
                // handle the event according to what happened on the channel behind this key
                if (key.isAcceptable()) { // OP_ACCEPT: a new client connection
                    // accept the client, which produces a SocketChannel
                    SocketChannel socketChannel = serverSocketChannel.accept();
                    System.out.println("客戶端連接成功 生成了一個 socketChannel " + socketChannel.hashCode());
                    // set the SocketChannel to non-blocking
                    socketChannel.configureBlocking(false);
                    // register the socketChannel with the selector for OP_READ and attach a Buffer to it
                    socketChannel.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(1024));

                    System.out.println("客戶端連接后 ,注冊的selectionkey 數量=" + selector.keys().size()); //2,3,4..
                }

                if (key.isReadable()) {  // OP_READ occurred
                    // get the channel back from the key
                    SocketChannel channel = (SocketChannel) key.channel();
                    // get the buffer attached to this channel
                    ByteBuffer buffer = (ByteBuffer) key.attachment();
                    channel.read(buffer);
                    System.out.println("from 客戶端: " + new String(buffer.array()));
                }

                // manually remove the current selectionKey from the set to avoid handling it again
                keyIterator.remove();
            }
        }
    }
}

(1) Start the program under strace

[root@localhost jdk8]# strace -ff -o out ./jdk1.8.0_291/bin/java NIOSocket

(2) Check the generated out files

[root@localhost jdk8]# ll
total 143780
-rw-r--r--. 1 root  root       1033 Jul 20 23:11 Client.class
-rw-r--r--. 1 root  root        206 Jul 20 23:10 Client.java
drwxr-xr-x. 8 10143 10143       273 Apr  7 15:14 jdk1.8.0_291
-rw-r--r--. 1 root  root  144616467 Jul 20 03:42 jdk-8u291-linux-i586.tar.gz
-rw-r--r--. 1 root  root       2705 Jul 21 05:54 NIOSocket.class
-rw-r--r--. 1 root  root       5004 Jul 21 05:44 NIOSocket.java
-rw-r--r--. 1 root  root      13093 Jul 21 05:54 out.29779
-rw-r--r--. 1 root  root    2305003 Jul 21 05:54 out.29780
-rw-r--r--. 1 root  root      12951 Jul 21 05:54 out.29781
-rw-r--r--. 1 root  root       2101 Jul 21 05:54 out.29782
-rw-r--r--. 1 root  root       1784 Jul 21 05:54 out.29783
-rw-r--r--. 1 root  root       5016 Jul 21 05:54 out.29784
-rw-r--r--. 1 root  root      99615 Jul 21 05:54 out.29785
-rw-r--r--. 1 root  root        914 Jul 21 05:54 out.29786
-rw-r--r--. 1 root  root     119854 Jul 21 05:54 out.29787
-rw-r--r--. 1 root  root       7308 Jul 21 05:54 out.29789

(3) Connect to port 8088 with nc and send the message "hello"

[root@localhost jdk8]# nc localhost 8088
hello

(4) Key information in out.29780

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 4
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
。。。
bind(4, {sa_family=AF_INET6, sin6_port=htons(8088), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(4, 50) 
。。。
epoll_create(256)                       = 7
。。。
epoll_ctl(7, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=5, u64=17757820874070687749}}) = 0
。。。
epoll_ctl(7, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=4, u64=17757820874070687748}}) = 0
gettimeofday({tv_sec=1626861254, tv_usec=513203}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861255, tv_usec=513652}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861256, tv_usec=515602}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861257, tv_usec=518045}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861258, tv_usec=520289}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
gettimeofday({tv_sec=1626861259, tv_usec=521552}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0

。。。

accept(4, {sa_family=AF_INET6, sin6_port=htons(59252), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 9
。。。
epoll_ctl(7, EPOLL_CTL_ADD, 9, {EPOLLIN, {u32=9, u64=17757980303256715273}}) = 0
gettimeofday({tv_sec=1626861260, tv_usec=952780}, NULL) = 0
epoll_wait(7, [], 4096, 1000)           = 0
。。。
epoll_wait(7, [{EPOLLIN, {u32=9, u64=17757980303256715273}}], 4096, 1000) = 1
write(1, "selectionKeys \346\225\260\351\207\217 = 1", 24) = 24
write(1, "\n", 1)                       = 1
。。。
read(9, "hello\n", 1024)                = 6

The rough sequence is:

1) create the socket

2) bind the port

3) listen on the port

4) epoll_create(256) = 7 creates the epoll instance

5) register events (the first fd is an internal one used by the JVM; the second registers the serverSocketChannel's fd with the epfd)

epoll_ctl(7, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=5, u64=17757820874070687749}}) = 0

epoll_ctl(7, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=4, u64=17757820874070687748}}) = 0

6) epoll_wait fetches events

7) a connection event is received

8) accept returns 9, the fd of the new client socket

9) fd 9 is registered with the epfd for read events

10) epoll_wait returns one event: a readable event on fd 9

11) read(9, ...) reads the data

 

  This confirms the sequence described above: epoll_create -> epoll_ctl -> epoll_wait.

5. Test 2

A simpler example to verify the same sequence

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class NIOSocket {

    public static void main(String[] args) throws Exception {
        ServerSocketChannel serverSocketChannel = ServerSocketChannel.open();
        serverSocketChannel.socket().bind(new InetSocketAddress(8088));
        serverSocketChannel.configureBlocking(false);
        System.out.println("serverSocketChannel init 8088");

        Selector selector = Selector.open();
        serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);
        System.out.println("Selector.open() = 8088");

        int select = selector.select(1000);
        System.out.println("select: " + select);
    }
}

The strace output is as follows (the socket/bind/listen part is skipped):

。。。
epoll_create(256)                       = 8
。。。
epoll_ctl(8, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=6, u64=17757820874070687750}}) = 0
。。。
epoll_ctl(8, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=4, u64=17757820874070687748}}) = 0
gettimeofday({tv_sec=1626858133, tv_usec=975699}, NULL) = 0
epoll_wait(8, [], 4096, 1000)           = 0

We can see:

(1) epoll_create creates an epoll instance and returns an fd

(2) epoll_ctl registers the fd and its events with the epfd just returned

(3) epoll_wait fetches the event list of the epfd

6. Test 3

Selector selector = Selector.open();

For the line above, the kernel calls it triggers are:

epoll_create(256)                       = 6
。。。
epoll_ctl(6, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=4, u64=17762324473698058244}}) = 0

 

Addendum: the differences between select, poll, and epoll

(1) select ==> time complexity O(n)

  It only knows that some I/O event happened, not on which streams (it could be one, several, or all of them), so we have to scan all the streams indiscriminately to find the ones that can be read or written and then operate on them. select therefore has O(n) undifferentiated polling complexity, and the more streams it handles, the longer each scan takes. The maximum number of file descriptors is 1024.

(2) poll ==> time complexity O(n)

  poll is essentially the same as select: it copies the array passed in by the user into kernel space and then queries the state of the device behind each fd. It has no limit on the maximum number of connections, though, because it stores the fds in a list rather than a fixed-size bitmap.

(3) epoll ==> time complexity O(1)

  epoll can be read as "event poll". Unlike busy polling and undifferentiated polling, epoll tells us which stream had which I/O event, so epoll really is event-driven (each event is associated with an fd), and every operation we perform on those streams is meaningful (the complexity drops to O(1)).

  select, poll, and epoll are all I/O multiplexing mechanisms. I/O multiplexing means monitoring several descriptors through one mechanism and notifying the program once a descriptor becomes ready (usually readable or writable) so it can perform the corresponding read or write. But select, poll, and epoll are all essentially synchronous I/O: after the readiness event, the application itself has to do the read or write, and that read/write blocks. Asynchronous I/O, in contrast, does not make the application do the read/write; the async I/O implementation copies the data from the kernel into user space.

  Both epoll and select provide multiplexed I/O solutions, and current Linux kernels support both. epoll is Linux-specific, while select is specified by POSIX and implemented by most operating systems.

  When our Java programs use a Selector, different operating systems may plug in different multiplexers; on CentOS 7 the one I get is epoll.
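To check which multiplexer implementation the current JVM picked, you can print the SelectorProvider and the concrete Selector class (a small sketch of my own; on Linux with JDK 8 this typically shows an EPoll-based implementation, on Windows a WindowsSelector variant, on macOS a KQueue-based one):

import java.nio.channels.Selector;
import java.nio.channels.spi.SelectorProvider;

public class ShowSelector {

    public static void main(String[] args) throws Exception {
        System.out.println(SelectorProvider.provider());   // the provider chosen for this platform
        System.out.println(Selector.open().getClass());    // the concrete Selector implementation
    }
}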

Addendum: if man cannot find a page, install the full set of man pages with yum install -y man-pages.

man 2 cmd shows the system-call (section 2) page for cmd.

[root@localhost jdk8]# man man

       1   Executable programs or shell commands
       2   System calls (functions provided by the kernel)
       3   Library calls (functions within program libraries)
       4   Special files (usually found in /dev)
       5   File formats and conventions eg /etc/passwd
       6   Games
       7   Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7)
       8   System administration commands (usually only for root)
       9   Kernel routines [Non standard]

Addendum: the C10K problem

  The earliest servers were based on the process/thread model: every new TCP connection was given its own process. With C10K (ten thousand concurrent connections) you would need 10,000 processes, which a single machine clearly cannot sustain. How to break through single-machine limits is a problem every high-performance network program has to face, and these limits and problems are collectively called the C10K problem.

  Because Linux is the operating system most used by internet companies, epoll has become synonymous with the "C10K killer", high concurrency, high performance, and asynchronous non-blocking. FreeBSD introduced kqueue, Linux introduced epoll, Windows introduced IOCP, and Solaris introduced /dev/poll; these OS facilities exist to solve the C10K problem. The epoll programming model is asynchronous non-blocking callbacks, also called Reactor, event-driven, or an event loop (EventLoop). Nginx, libevent, and node.js are all products of the epoll era.

 

Addendum: observing how Redis uses multiplexing

1. Download and install Redis

2. Start Redis under strace

[root@localhost test]# strace -ff -o redisout ../redis-5.0.4/src/redis-server 
34127:C 21 Jul 2021 21:57:26.281 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
34127:C 21 Jul 2021 21:57:26.281 # Redis version=5.0.4, bits=64, commit=00000000, modified=0, pid=34127, just started
34127:C 21 Jul 2021 21:57:26.282 # Warning: no config file specified, using the default config. In order to specify a config file use ../redis-5.0.4/src/redis-server /path/to/redis.conf
34127:M 21 Jul 2021 21:57:26.284 * Increased maximum number of open files to 10032 (it was originally set to 1024).
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 5.0.4 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 34127
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

34127:M 21 Jul 2021 21:57:26.294 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
34127:M 21 Jul 2021 21:57:26.294 # Server initialized
34127:M 21 Jul 2021 21:57:26.294 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
34127:M 21 Jul 2021 21:57:26.296 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
34127:M 21 Jul 2021 21:57:26.296 * Ready to accept connections

3. Check the out files

[root@localhost test]# ll
total 48
-rw-r--r--. 1 root root 34219 Jul 21 21:57 redisout.34127
-rw-r--r--. 1 root root   134 Jul 21 21:57 redisout.34128
-rw-r--r--. 1 root root   134 Jul 21 21:57 redisout.34129
-rw-r--r--. 1 root root   134 Jul 21 21:57 redisout.34130

4. We know a server needs socket, bind, and listen to start, so search for bind:

[root@localhost test]# grep bind ./*
./redisout.34127:bind(6, {sa_family=AF_INET6, sin6_port=htons(6379), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
./redisout.34127:bind(7, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

5. Check redisout.34127

...
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(7, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(7, 511)                          = 0
...

epoll_create(1024)                      = 5
...
epoll_ctl(5, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=6, u64=6}}) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
epoll_ctl(5, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=3, u64=3}}) = 0
...
epoll_wait(5, [], 10128, 0)             = 0
open("/proc/34127/stat", O_RDONLY)      = 8
read(8, "34127 (redis-server) R 34125 341"..., 4096) = 341
close(8)                                = 0
read(3, 0x7ffd0d4c055f, 1)              = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5, [], 10128, 100)           = 0
open("/proc/34127/stat", O_RDONLY)      = 8
read(8, "34127 (redis-server) R 34125 341"..., 4096) = 341
close(8)                                = 0
read(3, 0x7ffd0d4c055f, 1)              = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5, [], 10128, 100)           = 0
...

6. Start a client and set a value

[root@localhost test]# ../redis-5.0.4/src/redis-cli 
127.0.0.1:6379> set testkey testvalue
OK

7. Continue looking at redisout.34127

。。。
accept(7, {sa_family=AF_INET, sin_port=htons(48084), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 8
。。。
epoll_ctl(5, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=8}}) = 0
。。。
epoll_wait(5, [{EPOLLIN, {u32=8, u64=8}}], 10128, 6) = 1
read(8, "*1\r\n$7\r\nCOMMAND\r\n", 16384) = 17
。。。
read(8, "*3\r\n$3\r\nset\r\n$7\r\ntestkey\r\n$9\r\nte"..., 16384) = 41
read(3, 0x7ffd0d4c055f, 1)              = -1 EAGAIN (Resource temporarily unavailable)
write(8, "+OK\r\n", 5)                  = 5
。。。

We can see that read receives the data the client sent and write sends the response back to the client, both following the Redis protocol for requests and responses.

8. The client sends a get request

127.0.0.1:6379> get testkey
"testvalue"

9. Check the out file

。。。
epoll_wait(5, [{EPOLLIN, {u32=8, u64=8}}], 10128, 100) = 1
read(8, "*2\r\n$3\r\nget\r\n$7\r\ntestkey\r\n", 16384) = 26
read(3, 0x7ffd0d4c055f, 1)              = -1 EAGAIN (Resource temporarily unavailable)
write(8, "$9\r\ntestvalue\r\n", 15)     = 15
。。。

10. The above also shows that Redis starts 4 threads when it launches (judging by the generated out files); this can also be checked with top.

(1) Find the PID

[root@localhost test]# netstat -nltp | grep 6379
tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      34127/../redis-5.0. 
tcp6       0      0 :::6379                 :::*                    LISTEN      34127/../redis-5.0. 

(2) View the thread info

[root@localhost test]# top -Hp 34127

 

  Saying Redis is single-threaded means that the core operations of accepting requests, fetching data, and writing data are all done in one thread; the other threads handle things such as AOF and deleting expired keys.

 

For the Redis request and response protocol, see https://www.cnblogs.com/qlqwjy/p/8560052.html

 

 Addendum: observing nginx's single-threaded multiplexing and its epoll flow

1. Start nginx under strace

[root@localhost sbin]# strace -ff -o out ./nginx

2. Check the generated out files

[root@localhost sbin]# ll
total 3796
-rwxr-xr-x. 1 root root 3851552 Jul 22 01:02 nginx
-rw-r--r--. 1 root root   20027 Jul 22 03:56 out.47227
-rw-r--r--. 1 root root    1100 Jul 22 03:56 out.47228
-rw-r--r--. 1 root root    5512 Jul 22 03:56 out.47229

  Three out files are generated.

3. Check the processes with ps

[root@localhost sbin]# ps -ef | grep nginx | grep -v 'grep'
root      47225  38323  0 03:56 pts/1    00:00:00 strace -ff -o out ./nginx
root      47228      1  0 03:56 ?        00:00:00 nginx: master process ./nginx
nobody    47229  47228  0 03:56 ?        00:00:00 nginx: worker process

  There is one master process and one worker process. The master is responsible for restarts, configuration syntax checks and so on; the worker receives the requests.

4. Check the out files

(1) Look at the master's file, out.47228

set_robust_list(0x7fec5129da20, 24)     = 0
setsid()                                = 47228
umask(000)                              = 022
open("/dev/null", O_RDWR)               = 7
dup2(7, 0)                              = 0
dup2(7, 1)                              = 1
close(7)                                = 0
open("/usr/local/nginx/logs/nginx.pid", O_RDWR|O_CREAT|O_TRUNC, 0644) = 7
pwrite64(7, "47228\n", 6, 0)            = 6
close(7)                                = 0
dup2(5, 2)                              = 2
close(3)                                = 0
rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 ALRM TERM CHLD WINCH IO], NULL, 8) = 0
socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 7]) = 0
ioctl(3, FIONBIO, [1])                  = 0
ioctl(7, FIONBIO, [1])                  = 0
ioctl(3, FIOASYNC, [1])                 = 0
fcntl(3, F_SETOWN, 47228)               = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
fcntl(7, F_SETFD, FD_CLOEXEC)           = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fec5129da10) = 47229
rt_sigsuspend([], 8

  We can see that the master has no epoll-related calls; the master process mainly handles signals, hot reload, hot deployment, and monitoring the worker's state. At the end it creates the worker child process 47229 via clone.

(2) Look at the worker's file, out.47229

。。。
epoll_create(512)                       = 8
eventfd2(0, 0)                          = 9
epoll_ctl(8, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLET, {u32=7088384, u64=7088384}}) = 0
socketpair(AF_UNIX, SOCK_STREAM, 0, [10, 11]) = 0
epoll_ctl(8, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLRDHUP|EPOLLET, {u32=7088384, u64=7088384}}) = 0
close(11)                               = 0
epoll_wait(8, [{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=7088384, u64=7088384}}], 1, 5000) = 1
close(10)                               = 0
mmap(NULL, 225280, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fec51266000
brk(NULL)                               = 0x20ba000
brk(0x20f1000)                          = 0x20f1000
epoll_ctl(8, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLRDHUP, {u32=1361469456, u64=140652950478864}}) = 0
close(3)                                = 0
epoll_ctl(8, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLRDHUP, {u32=1361469672, u64=140652950479080}}) = 0
epoll_wait(8, 
。。。

(3) Test access with curl

[root@localhost test3]# curl http://localhost:80
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

(4) Look at out.47229 again

epoll_ctl(8, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLRDHUP, {u32=1361469672, u64=140652950479080}}) = 0
epoll_wait(8, [{EPOLLIN, {u32=1361469456, u64=140652950478864}}], 512, -1) = 1
accept4(6, {sa_family=AF_INET, sin_port=htons(40704), sin_addr=inet_addr("127.0.0.1")}, [112->16], SOCK_NONBLOCK) = 3
epoll_ctl(8, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLRDHUP|EPOLLET, {u32=1361469888, u64=140652950479296}}) = 0
epoll_wait(8, [{EPOLLIN, {u32=1361469888, u64=140652950479296}}], 512, 60000) = 1
recvfrom(3, "GET / HTTP/1.1\r\nUser-Agent: curl"..., 1024, 0, NULL, NULL) = 73
stat("/usr/local/nginx/html/index.html", {st_mode=S_IFREG|0644, st_size=612, ...}) = 0
open("/usr/local/nginx/html/index.html", O_RDONLY|O_NONBLOCK) = 10
fstat(10, {st_mode=S_IFREG|0644, st_size=612, ...}) = 0
writev(3, [{iov_base="HTTP/1.1 200 OK\r\nServer: nginx/1"..., iov_len=238}], 1) = 238
sendfile(3, 10, [0] => [612], 612)      = 612
write(4, "127.0.0.1 - - [22/Jul/2021:04:29"..., 86) = 86
close(10)                               = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
epoll_wait(8, [{EPOLLIN|EPOLLRDHUP, {u32=1361469888, u64=140652950479296}}], 512, 65000) = 1
recvfrom(3, "", 1024, 0, NULL, NULL)    = 0
close(3)                                = 0
epoll_wait(8, 

(5) Filter again for socket- and epoll-related calls

[root@localhost sbin]# grep socket ./*
Binary file ./nginx matches
./out.47227:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
./out.47227:connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
./out.47227:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
./out.47227:connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
./out.47227:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
./out.47227:connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
./out.47227:socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
./out.47227:connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
./out.47227:socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
./out.47228:socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 7]) = 0
./out.47229:socketpair(AF_UNIX, SOCK_STREAM, 0, [10, 11]) = 0
[root@localhost sbin]# grep bind ./*
Binary file ./nginx matches
./out.47227:bind(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
[root@localhost sbin]# grep listen ./*
Binary file ./nginx matches
./out.47227:listen(6, 511)                          = 0
./out.47227:listen(6, 511)                          = 0
[root@localhost sbin]# grep epoll_create ./*
Binary file ./nginx matches
./out.47227:epoll_create(100)                       = 5
./out.47229:epoll_create(512)                       = 8
[root@localhost sbin]# grep epoll_ctl ./*
Binary file ./nginx matches
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLET, {u32=7088384, u64=7088384}}) = 0
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLRDHUP|EPOLLET, {u32=7088384, u64=7088384}}) = 0
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLRDHUP, {u32=1361469456, u64=140652950478864}}) = 0
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLRDHUP, {u32=1361469672, u64=140652950479080}}) = 0
./out.47229:epoll_ctl(8, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLRDHUP|EPOLLET, {u32=1361469888, u64=140652950479296}}) = 0
[root@localhost sbin]# grep epoll_wait ./*
Binary file ./nginx matches
./out.47229:epoll_wait(8, [{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=7088384, u64=7088384}}], 1, 5000) = 1
./out.47229:epoll_wait(8, [{EPOLLIN, {u32=1361469456, u64=140652950478864}}], 512, -1) = 1
./out.47229:epoll_wait(8, [{EPOLLIN, {u32=1361469888, u64=140652950479296}}], 512, 60000) = 1
./out.47229:epoll_wait(8, [{EPOLLIN|EPOLLRDHUP, {u32=1361469888, u64=140652950479296}}], 512, 65000) = 1
./out.47229:epoll_wait(8, 

 

