Building Middleware with Netty: High-Concurrency Performance Optimization

Reposted article, 2016-02-02 15:26. Tag: netty

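I've recently been writing a prototype of a back-end middleware component whose main job is message dispatch and pass-through. Since it has to be implemented in Java, Netty was the obvious first choice for the network framework, and I'm using Netty 4. Netty really is efficient: without much effort you can reach a fairly high TPS. I did run into a few problems along the way, though. I think they are fairly classic problems for which material is hard to find online, so I'm summarizing them here.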

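1. Excessive Context Switches

During load testing I monitored the kernel with nmon and saw context switches exceed 300,000 per second. That is clearly abnormal, but what in a JVM could cause so many context switches? According to my earlier reading notes on process scheduling from the "dinosaur book" (Operating System Concepts) and the Wikipedia article on context switches, a process or thread is switched out for the following reasons:

I/O wait: in a multitasking system, a process issues an I/O request but the I/O device is not ready yet, so the process blocks on I/O and enters the Wait state.
Time slice exhausted: in a time-sharing multitasking system, the time slice the kernel gave the process has run out, so the process enters the Ready state and waits for the kernel to schedule it again.
Hardware interrupt: in a preemptive time-sharing multitasking system, an I/O device can raise an interrupt at any moment. The CPU stops the currently running process to handle the interrupt, so that process enters the Ready state.

Based on this analysis, I focused on the first and second causes.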

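Process vs. Thread Context Switches

My earlier reading notes covered the causes of process context switches, so how do thread context switches differ? Sure enough, StackOverflow has a question on exactly this, "thread context switch vs process context switch":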

“The main distinction between a thread switch and a process switch is that during a thread switch, the virtual memory space remains the same, while it does not during a process switch. Both types involve handing control over to the operating system kernel to perform the context switch. The process of switching in and out of the OS kernel along with the cost of switching out the registers is the largest fixed cost of performing a context switch.
A more fuzzy cost is that a context switch messes with the processors cacheing mechanisms. Basically, when you context switch, all of the memory addresses that the processor “remembers” in it’s cache effectively become useless. The one big distinction here is that when you change virtual memory spaces, the processor’s Translation Lookaside Buffer (TLB) or equivalent gets flushed making memory accesses much more expensive for a while. This does not happen during a thread switch.”

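From the top-ranked answer I learned that both process and thread context switches involve entering and leaving the kernel and saving and restoring registers, and that this is their largest fixed cost. Compared with a process switch, a thread switch is still lighter: the biggest difference is that the virtual address space stays the same, so CPU caches such as the TLB are not invalidated. Note, however, that another question, "What is the overhead of a context-switch?", mentions that technology introduced by Intel and AMD in 2008 may keep the TLB from being invalidated. Look into it yourself if you're interested.

1.1 Non-blocking I/O

For the first cause, I/O wait, the most direct fix is to use non-blocking I/O. In Netty, that means using NIO on both the server and the client.

A word here on how to write data to a Netty Channel proactively, because the material I found online is all the same: the server writes the response in a Handler after receiving a request, and even the client examples send data from a Handler once the Channel becomes active. I need message pass-through, and messages to downstream systems are sent asynchronously and without blocking, so those examples were of no use to me. Here is my approach.

On the server side, after a request arrives, get the Channel in channelRead0() via ctx.channel() and keep a reference to it, whether in a ThreadLocal variable or by some other means. When the response data is ready, write it to the held Channel. See Section 4 below for details.

The client works the same way: get hold of the Channel after the client starts, and write to it whenever you need to send data.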

EventLoopGroup group = new NioEventLoopGroup();
Bootstrap b = new Bootstrap();
b.group(group)
    .channel(NioSocketChannel.class)
    .remoteAddress(host, port)
    .handler(new ChannelInitializer<SocketChannel>() {
        @Override
        protected void initChannel(SocketChannel ch) throws Exception {
            ch.pipeline().addLast(...);
        }
    });

try {
    // Block until the connection is established, then keep the Channel for later writes.
    ChannelFuture future = b.connect().sync();
    this.channel = future.channel();
}
catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // preserve the interrupt status
    throw new IllegalStateException("Error when start netty client: addr=[" + host + ":" + port + "]", e);
}

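1.2 Reducing the Number of Threads

With too many threads, each thread gets a smaller share of CPU time, and the CPU has to switch among them to give each a chance to run, saving and restoring thread context every time. So I checked the EventLoopGroup used for Netty's I/O workers. As I analyzed earlier in "Netty 4 Source Code Analysis: Server Startup", the default number of threads in an EventLoopGroup is twice the number of CPU cores. I therefore configured the NioEventLoopGroup thread count by hand to cut down the number of I/O threads.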

private void doStartNettyServer(int port) throws InterruptedException {
    EventLoopGroup bossGroup = new NioEventLoopGroup();
    // Cap the I/O worker threads instead of using the default (2 x CPU cores).
    EventLoopGroup workerGroup = new NioEventLoopGroup(4);
    try {
        ServerBootstrap b = new ServerBootstrap()
                .group(bossGroup, workerGroup)
                .channel(NioServerSocketChannel.class)
                .localAddress(port)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    public void initChannel(SocketChannel ch) throws Exception {
                        ch.pipeline().addLast(...);
                    }
                });

        // Bind and start to accept incoming connections.
        ChannelFuture f = b.bind(port).sync();

        // Wait until the server socket is closed.
        f.channel().closeFuture().sync();
    } finally {
        bossGroup.shutdownGracefully();
        workerGroup.shutdownGracefully();
    }
}

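Since I was also using Akka as the business thread pool, I looked at how to change Akka's default configuration as well. Create a configuration file named application.conf, which is loaded automatically when the ActorSystem is created. The configuration below defines a custom dispatcher: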

my-dispatcher {
  # Dispatcher is the name of the event-based dispatcher
  type = Dispatcher
  mailbox-type = "akka.dispatch.SingleConsumerOnlyUnboundedMailbox"
  # What kind of ExecutionService to use
  executor = "fork-join-executor"
  # Configuration for the fork join pool
  fork-join-executor {
    # Min number of threads to cap factor-based parallelism number to
    parallelism-min = 2
    # Parallelism (threads) ... ceil(available processors * factor)
    parallelism-factor = 1.0
    # Max number of threads to cap factor-based parallelism number to
    parallelism-max = 16
  }
  # Throughput defines the maximum number of messages to be
  # processed per actor before the thread jumps to the next actor.
  # Set to 1 for as fair as possible.
  throughput = 100
}

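In short, the key settings are:

parallelism-factor: determines the size of the thread pool as ceil(available processors * factor), bounded below by parallelism-min and above by parallelism-max (so, surprisingly, it is the factor and not parallelism-max that sets the size).
throughput: determines how often a thread switches between actors, that is, the maximum number of messages one actor processes before the thread moves on to the next actor. 1 is the most frequent and also the fairest setting.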
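
To make the sizing rule concrete, here is a minimal Java sketch of the arithmetic described in the config comments: the factor-based value ceil(available processors * factor) is clamped between parallelism-min and parallelism-max. The class and method names (ForkJoinSizing, parallelism) are made up for illustration and are not Akka APIs.

```java
/** Sketch only: mirrors the sizing rule stated in the dispatcher config comments. */
final class ForkJoinSizing {
    private ForkJoinSizing() {}

    /** Returns ceil(cores * factor), clamped into the range [min, max]. */
    static int parallelism(int cores, double factor, int min, int max) {
        int scaled = (int) Math.ceil(cores * factor);
        return Math.min(Math.max(scaled, min), max);
    }

    public static void main(String[] args) {
        // Values from my-dispatcher: factor = 1.0, min = 2, max = 16.
        System.out.println(parallelism(8, 1.0, 2, 16));   // 8: the factor decides
        System.out.println(parallelism(32, 1.0, 2, 16));  // 16: capped by parallelism-max
        System.out.println(parallelism(1, 1.0, 2, 16));   // 2: raised to parallelism-min
    }
}
```

With the values above, parallelism-max only matters on machines with more than 16 cores, which is why lowering it alone did not shrink the pool on a smaller machine.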

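Since this article is mainly about Netty, I won't go into the details here; please refer to the official documentation on Dispatchers and Mailboxes. Creating an actor that uses a specific Dispatcher is simple. Here is how to specify the Dispatcher when creating a typed actor.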

TypedActor.get(system).typedActorOf(
        new TypedProps<MyActorImpl>(
                MyActor.class,
                new Creator<MyActorImpl>() {
                    @Override
                    public MyActorImpl create() throws Exception {
                        return new MyActorImpl(XXX);
                    }
                }
        ).withDispatcher("my-dispatcher")
);

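1.3 Removing the Business Thread Pool

Despite all these configuration changes, and although jstack confirmed that the thread settings had taken effect, the context switch numbers did not improve. So I simply removed the Akka-based business thread pool to cut thread context switches drastically, and the count dropped from 300,000+ to 160,000 in one go! After a lot of searching on the ever-helpful StackOverflow I found a post in which one sentence made it click for me: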

And if the recommendation is not to block in the event loop, then this can be done in an application thread. But that would imply an extra context switch. This extra context switch may not be acceptable to latency sensitive applaications.

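With that lead I went straight to the Netty source and found that operations such as channel.write() are indeed not necessarily executed on the calling thread. Internally, Netty consistently uses executor.inEventLoop() to check whether the current thread is the EventLoop thread that owns the Channel. If it is not, the operation is wrapped in a task and handed to that EventLoop to execute: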

private void write(Object msg, boolean flush, ChannelPromise promise) {

    AbstractChannelHandlerContext next = findContextOutbound();
    EventExecutor executor = next.executor();
    if (executor.inEventLoop()) {
        next.invokeWrite(msg, promise);
        if (flush) {
            next.invokeFlush();
        }
    } else {
        int size = channel.estimatorHandle().size(msg);
        if (size > 0) {
            ChannelOutboundBuffer buffer = channel.unsafe().outboundBuffer();
            // Check for null as it may be set to null if the channel is closed already
            if (buffer != null) {
                buffer.incrementPendingOutboundBytes(size);
            }
        }
        Runnable task;
        if (flush) {
            task = WriteAndFlushTask.newInstance(next, msg, size, promise);
        }  else {
            task = WriteTask.newInstance(next, msg, size, promise);
        }
        safeExecute(executor, task, promise, msg);
    }
}

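So a business thread pool is a double-edged sword. Handing tasks to it for asynchronous execution shortens the time Netty's I/O threads are occupied and relieves pressure on them, but it also increases the number of thread context switches. With the optimizations above I finally brought the context switches under load from 300,000+ per second down to about 80,000, which is a clear improvement!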
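
The hand-off behind that extra context switch can be reproduced without Netty. The following is a toy sketch in plain Java, not Netty's implementation: ToyEventLoop and its handOffs counter are hypothetical names, and a single-threaded executor stands in for the EventLoop. A task submitted from a foreign thread is queued (one thread hand-off), while a task submitted from the loop thread itself runs inline.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy single-threaded event loop that mimics the inEventLoop() dispatch pattern. */
final class ToyEventLoop {
    private volatile Thread loopThread;
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "toy-event-loop");
        t.setDaemon(true);
        loopThread = t; // remember which thread is "the event loop"
        return t;
    });
    /** Counts how many tasks had to be handed over from a foreign thread. */
    final AtomicInteger handOffs = new AtomicInteger();

    boolean inEventLoop() {
        return Thread.currentThread() == loopThread;
    }

    /** Runs the task inline on the loop thread, otherwise queues it for the loop thread. */
    void execute(Runnable task) {
        if (inEventLoop()) {
            task.run();
        } else {
            handOffs.incrementAndGet();
            executor.execute(task);
        }
    }

    void shutdown() {
        executor.shutdown();
    }

    public static void main(String[] args) {
        ToyEventLoop loop = new ToyEventLoop();
        CompletableFuture<String> ranOn = new CompletableFuture<>();
        // Called from main: not the loop thread, so the task is queued (hand-off #1).
        loop.execute(() ->
                // Called from the loop thread: runs inline, no further hand-off.
                loop.execute(() -> ranOn.complete(Thread.currentThread().getName())));
        System.out.println("ran on " + ranOn.join());             // ran on toy-event-loop
        System.out.println("hand-offs = " + loop.handOffs.get()); // hand-offs = 1
        loop.shutdown();
    }
}
```

Every write issued from a business thread takes the queued branch, so each response costs at least one wake-up of the I/O thread; writing from inside the handler takes the inline branch.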

2. System Call Overhead

A system call generally involves a mode transition (also called a mode switch) from user space to kernel space. This transition has a cost of its own.

Mode Switch vs. Context Switch

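StackOverflow really does have a question for everything. Having covered thread context switches, how do they relate to switching between kernel mode and user mode? Does a mode switch count as a kind of context switch? The question "Does there have to be a mode switch for something to qualify as a context switch?" answers this: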

“A mode switch happens inside one process. A context switch involves more than one process (or thread). Context switch happens only in kernel mode. If context switching happens between two user mode processes, first cpu has to change to kernel mode, perform context switch, return back to user mode and so on. So there has to be a mode switch associated with a context switch. But a context switch doesn’t imply a mode switch (could be done by the hardware alone). A mode switch does not require a context switch either.”

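A context switch has to be performed in the kernel. Put simply, it works by actively triggering a software interrupt (analogous to a hardware interrupt, which is triggered passively by a device), so a context switch is normally accompanied by a mode switch. Some hardware can perform the switch directly, however (I don't fully understand this part), and some CPUs don't even have the Ring 0 to Ring 3 privilege levels we usually talk about. A mode switch, in turn, does not require a context switch at all. If you insist on relating the two, then according to Wikipedia the most you can say is that on some systems a context switch may happen during a mode switch.

The system calls Netty makes most often are network I/O operations, so the most direct way to reduce their frequency is to buffer the output and flush the buffer only when a certain data size, number of writes, or time interval is reached.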
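
As a rough illustration of that flush policy, here is a minimal sketch in plain Java rather than Netty code. BatchingWriter, maxPendingBytes and maxPendingWrites are names invented for this example, an OutputStream stands in for the socket, and the time-interval trigger is left out to keep it short. In Netty the same idea amounts to calling write() several times and flush() once.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Buffers messages and only touches the underlying stream when a threshold is reached. */
final class BatchingWriter {
    private final OutputStream sink;
    private final int maxPendingBytes;
    private final int maxPendingWrites;
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private int pendingWrites;
    private int flushCount;

    BatchingWriter(OutputStream sink, int maxPendingBytes, int maxPendingWrites) {
        this.sink = sink;
        this.maxPendingBytes = maxPendingBytes;
        this.maxPendingWrites = maxPendingWrites;
    }

    /** Appends to the in-memory buffer; flushes once either threshold is hit. */
    void write(byte[] msg) {
        pending.write(msg, 0, msg.length);
        pendingWrites++;
        if (pending.size() >= maxPendingBytes || pendingWrites >= maxPendingWrites) {
            flush();
        }
    }

    /** Pushes everything buffered so far to the sink in a single write. */
    void flush() {
        if (pendingWrites == 0) {
            return;
        }
        try {
            pending.writeTo(sink);
            sink.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        pending.reset();
        pendingWrites = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BatchingWriter writer = new BatchingWriter(sink, 1024, 4);
        for (int i = 0; i < 10; i++) {
            writer.write(new byte[10]);
        }
        writer.flush(); // push out the 2 messages still pending
        // prints: 3 flushes for 10 writes, 100 bytes
        System.out.println(writer.flushCount() + " flushes for 10 writes, " + sink.size() + " bytes");
    }
}
```

The trade-off is latency: a message can sit in the buffer until the next threshold is reached, which is why a time-based trigger is usually added as well.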

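For cases where the buffer is too small or writes arrive too fast, Netty provides the writeBufferLowWaterMark and writeBufferHighWaterMark options. Once the pending outbound data exceeds the high water mark, the Channel reports itself as not writable (Channel.isWritable() returns false) until the data drains below the low water mark. The application is expected to check this and stop writing so that the buffer does not blow up. This feels somewhat similar to the Traffic Shaping feature Netty provides. I haven't studied it in depth yet, so look into it yourself if you're interested.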
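
The two water marks act as a hysteresis band. Here is a small plain-Java sketch of that behaviour as I understand it. WaterMarkTracker, onEnqueued and onFlushed are hypothetical names for illustration, not Netty classes or methods.

```java
/** Tracks pending outbound bytes and flips writability with hysteresis. */
final class WaterMarkTracker {
    private final long low;
    private final long high;
    private long pendingBytes;
    private boolean writable = true;

    WaterMarkTracker(long low, long high) {
        if (low < 0 || high < low) {
            throw new IllegalArgumentException("need 0 <= low <= high");
        }
        this.low = low;
        this.high = high;
    }

    /** Called when data is queued for sending. */
    void onEnqueued(long bytes) {
        pendingBytes += bytes;
        if (pendingBytes > high) {
            writable = false; // above the high mark: ask producers to stop
        }
    }

    /** Called when queued data has actually been written out. */
    void onFlushed(long bytes) {
        pendingBytes -= bytes;
        if (pendingBytes < low) {
            writable = true; // drained below the low mark: producers may resume
        }
    }

    boolean isWritable() {
        return writable;
    }

    public static void main(String[] args) {
        WaterMarkTracker t = new WaterMarkTracker(32, 64);
        t.onEnqueued(40);
        System.out.println(t.isWritable()); // true  (40 pending, not above 64)
        t.onEnqueued(40);
        System.out.println(t.isWritable()); // false (80 pending, above 64)
        t.onFlushed(40);
        System.out.println(t.isWritable()); // false (40 pending, not yet below 32)
        t.onFlushed(20);
        System.out.println(t.isWritable()); // true  (20 pending, below 32)
    }
}
```

The gap between the two marks keeps the writable flag from flapping on every single write when the queue hovers around one threshold.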

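3. Implementing Zero Copy

The book 《Netty權威指南（第二版）》 (Netty: The Definitive Guide, 2nd edition) has a section devoted to Netty's zero copy, but it covers the zero-copy features inside Netty. What I want to discuss here is how to achieve zero copy in application code, for which message pass-through is the most typical use case. Pass-through does not need to parse the whole message; it only needs to know which downstream system the message should be forwarded to. So we can parse just part of the message, leave the message as a whole untouched in a Direct Buffer, and finally write it directly to the Channel connected to the downstream system. Application-level zero copy therefore has two parts: Direct Buffer configuration and passing the buffer along without copying.

3.1 Memory Pool

Another benefit of using Netty is memory management. A single line of configuration gets you the benefits of a memory pool. Under the hood, Netty implements a Java version of the jemalloc memory allocator (remember the one bundled with Redis?) and does all the dirty work for us!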

ServerBootstrap b = new ServerBootstrap()
            .group(bossGroup, workerGroup)
            .channel(NioServerSocketChannel.class)
            .localAddress(port)
            .childOption(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT)
            .childHandler(new ChannelInitializer<SocketChannel>() {
                @Override
                public void initChannel(SocketChannel ch) throws Exception {
                    ch.pipeline().addLast(...);
                }
            });

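3.2 Application-Level Zero Copy

By default, Netty's SimpleChannelInboundHandler releases the ByteBuf automatically. That is, when our overridden channelRead0() returns, the ByteBuf has served its purpose and is released by Netty (a pooled buffer is returned to the memory pool).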

public abstract class SimpleChannelInboundHandler<I> extends ChannelInboundHandlerAdapter {

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        boolean release = true;
        try {
            if (acceptInboundMessage(msg)) {
                @SuppressWarnings("unchecked")
                I imsg = (I) msg;
                channelRead0(ctx, imsg);
            } else {
                release = false;
                ctx.fireChannelRead(msg);
            }
        } finally {
            if (autoRelease && release) {
                ReferenceCountUtil.release(msg);
            }
        }
    }
}

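Netty uses reference counting to decide when to reclaim a buffer, so to keep using a ByteBuf without Netty freeing it, we have to increase its reference count. As long as some Handler in the ChannelPipeline calls ByteBuf.retain() to increment the count by 1, Netty will not free it. We then release it in the Encoder of the client connected to the downstream system once the message has been written, which gives us zero-copy pass-through: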

public class RespEncoder extends MessageToByteEncoder<Msg> {

    @Override
    protected void encode(ChannelHandlerContext ctx, Msg msg, ByteBuf out) throws Exception {
        // Raw in Msg is retained ByteBuf
        ByteBuf raw = msg.getRaw();
        try {
            out.writeBytes(raw, 0, raw.readerIndex());
        } finally {
            // Balance the earlier retain(), even if the write above fails
            raw.release();
        }
    }

}
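
The retain/release bookkeeping is easier to follow in isolation. Below is a stripped-down sketch of reference counting in plain Java. RefCountedBuffer is a made-up class that only imitates the counting rules of a ByteBuf (start at 1, retain() adds 1, release() subtracts 1 and reports whether the buffer was freed); it is not Netty's implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal reference-counted resource: freed when the count drops to zero. */
final class RefCountedBuffer {
    private final AtomicInteger refCnt = new AtomicInteger(1); // creator holds one reference

    int refCnt() {
        return refCnt.get();
    }

    /** Adds one reference; fails if the buffer has already been freed. */
    RefCountedBuffer retain() {
        for (;;) {
            int cnt = refCnt.get();
            if (cnt == 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(cnt, cnt + 1)) {
                return this;
            }
        }
    }

    /** Drops one reference; returns true when this call freed the buffer. */
    boolean release() {
        for (;;) {
            int cnt = refCnt.get();
            if (cnt == 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(cnt, cnt - 1)) {
                return cnt == 1;
            }
        }
    }

    public static void main(String[] args) {
        RefCountedBuffer buf = new RefCountedBuffer();  // refCnt = 1 (inbound read)
        buf.retain();                                   // refCnt = 2 (handler keeps it for pass-through)
        System.out.println(buf.release());              // false: auto-release after channelRead0()
        System.out.println(buf.release());              // true: encoder releases after writing downstream
    }
}
```

Every retain() must be matched by exactly one release(): one too few leaks the pooled buffer, one too many frees it while it is still in use.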

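4. State Handling Under Concurrency

Writing asynchronously to a held Channel (Section 1.1) and flushing the buffer according to certain rules (Section 2) both involve keeping state. If that state is accessed concurrently, you have to guard against race conditions such as conflicting or lost updates.

4.1 Holding the Channel

How do you hold on to a Channel in a Handler on the Netty server side? My approach is to create a Session object that holds the Channel in channelActive() or on the first call to channelRead0(). This works because, as I analyzed earlier in "Netty 4 Source Code Analysis: Request Handling", in the Netty 4 thread model several clients may share one EventLoop thread, but a given client is served by exactly one EventLoop thread. Each client has its own Handler instance (as long as the ChannelInitializer creates a new one per connection), which is used until the connection closes.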

public class FrontendHandler extends SimpleChannelInboundHandler<Msg> {

    private Session session;

    @Override
    public void channelActive(ChannelHandlerContext ctx) throws Exception {
        session = factory.createSession(ctx.channel());
        super.channelActive(ctx);
    }

    @Override
    protected void channelRead0(final ChannelHandlerContext ctx, Msg msg) throws Exception {
        session.handleRequest(msg);
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        session = null;
        super.channelInactive(ctx);
    }

}

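4.2 Decoder State

Because of TCP packet coalescing and splitting, a Decoder inevitably has to keep some intermediate parsing state. Netty uses the same Decoder instance for the whole lifetime of a client connection, so the intermediate state must be reset once a message has been parsed, otherwise later messages will be parsed incorrectly.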

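public class RespDecoder extends ReplayingDecoder<Void> {

    public RespDecoder() {
        // Start from a clean parsing state
        doCleanUp();
    }

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out)
            throws Exception {
        // doParseMsg(), doSendToHandler() and doCleanUp() are helper methods not shown here
        if (doParseMsg(in)) {
            doSendToHandler(out);
            // Reset the intermediate state so the next message starts fresh
            doCleanUp();
        }
    }
}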
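
Since the helper methods are not shown, here is a self-contained sketch of the same idea in plain Java: a decoder that keeps intermediate state across reads and resets it after every complete message. FrameAssembler and its toy wire format (a one-byte length prefix followed by the body) are assumptions made for this example, not the author's protocol.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Reassembles length-prefixed frames from arbitrarily split chunks. */
final class FrameAssembler {
    private int expectedLength = -1; // -1 means the length byte has not been read yet
    private final ByteArrayOutputStream body = new ByteArrayOutputStream();

    /** Feeds one received chunk; returns every frame completed by it. */
    List<byte[]> feed(byte[] chunk) {
        List<byte[]> frames = new ArrayList<>();
        for (byte b : chunk) {
            if (expectedLength < 0) {
                expectedLength = b & 0xFF; // first byte of a frame is its length
            } else {
                body.write(b);
            }
            if (body.size() == expectedLength) {
                frames.add(body.toByteArray());
                reset(); // without this, the next frame would be parsed with stale state
            }
        }
        return frames;
    }

    private void reset() {
        expectedLength = -1;
        body.reset();
    }

    public static void main(String[] args) {
        FrameAssembler decoder = new FrameAssembler();
        // Two frames, "hi" and "abc", arrive split across three reads.
        System.out.println(decoder.feed(new byte[] {2, 'h'}).size());      // 0
        System.out.println(decoder.feed(new byte[] {'i', 3, 'a'}).size()); // 1
        System.out.println(decoder.feed(new byte[] {'b', 'c'}).size());    // 1
    }
}
```

The instance is only safe because one connection is decoded by one thread at a time, which matches the per-connection Decoder instance described above.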

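5. Summary

5.1 The Ever-Changing Netty

Before summarizing, let me grumble about Netty's release pace, which I both love and hate. From Netty 3 to Netty 4 the API went through a major "earthquake". Many sample programs online are based on Netty 3, so when learning Netty 4 I found that a lot of examples no longer ran. Beyond the API, the changes to Netty's internal thread model and so on go without saying. I thought I could relax once I was on Netty 4, but the thread model changed yet again in Netty 5! Here is what the official documentation says, so take note if you plan to upgrade.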

Even more flexible thread model

In Netty 4.x each EventLoop is tightly coupled with a fixed thread that executes all I/O events of its registered Channels and any tasks submitted to it. Starting with version 5.0 an EventLoop does no longer use threads directly but instead makes use of an Executor abstraction. That is, it takes an Executor object as a parameter in its constructor and instead of polling for I/O events in an endless loop each iteration is now a task that is submitted to this Executor. Netty 4.x would simply spawn its own threads and completely ignore the fact that it’s part of a larger system. Starting with Netty 5.0, developers can run Netty and the rest of the system in the same thread pool and potentially improve performance by applying better scheduling strategies and through less scheduling overhead (due to fewer threads). It shall be mentioned, that this change does not in any way affect the way ChannelHandlers are developed. From a developer’s point of view, the only thing that changes is that it’s no longer guaranteed that a ChannelHandler will always be executed by the same thread. It is, however, guaranteed that it will never be executed by two or more threads at the same time. Furthermore, Netty will also take care of any memory visibility issues that might occur. So there’s no need to worry about thread-safety and volatile variables within a ChannelHandler.

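According to the official documentation, Netty 5 no longer guarantees that a given Handler instance is always executed by the same thread, so using ThreadLocal in a Handler becomes a rather dangerous practice!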
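
The danger is easy to demonstrate with the JDK alone. In this small sketch (ThreadLocalHazard and SESSION are names invented for the example), a value stored in a ThreadLocal on one thread is simply absent when another thread reads it, which is what would happen to a Handler whose successive events run on different threads.

```java
/** Shows that a ThreadLocal value set on one thread is invisible to another. */
final class ThreadLocalHazard {
    private static final ThreadLocal<String> SESSION = new ThreadLocal<>();

    private ThreadLocalHazard() {}

    /** Sets and reads the value on the calling thread. */
    static String readFromSameThread(String value) {
        SESSION.set(value);
        try {
            return SESSION.get();
        } finally {
            SESSION.remove();
        }
    }

    /** Sets the value on thread A, then reads the same ThreadLocal on thread B. */
    static String readFromAnotherThread(String value) {
        String[] seen = new String[1];
        Thread a = new Thread(() -> SESSION.set(value), "thread-A");
        Thread b = new Thread(() -> seen[0] = SESSION.get(), "thread-B");
        try {
            a.start();
            a.join();
            b.start();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("same thread:  " + readFromSameThread("session-1"));    // session-1
        System.out.println("other thread: " + readFromAnotherThread("session-1")); // null
    }
}
```

Keeping per-connection state in a field of the Handler instance, as in Section 4.1, avoids the problem because the state travels with the Handler rather than with the thread.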

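5.2 High-Concurrency Programming Techniques

After all this tinkering and effort, TPS finally went from a few thousand to around 50,000. I learned a lot about network programming and performance tuning that I didn't know before, which is quite satisfying! To sum up, the optimization strategies for high-concurrency middleware are:

Thread count control: under high concurrency, context switching becomes very noticeable when there are many threads, and threads beyond the number of CPU cores bring no benefit. Unless the work is particularly time-consuming, a business thread pool does more harm than good. Netty 5 gives us the chance to specify the underlying thread pool, which allows better control over the thread count and scheduling policy of the whole middleware.
Non-blocking I/O: to do more with fewer threads, avoiding blocking is a must.
Fewer system calls: although a mode switch is much cheaper than a context switch, we should still keep frequent syscalls to a minimum.
Zero-copy data: copying data out of the Direct Buffer (off-heap memory) into heap memory on every pass-through adds up to a significant cost.
Shared state protection: how concurrency is handled inside the middleware is also key to performance.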
