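Recently, a lot of IOException: Connection reset by peer and ClosedChannelException: null errors have been showing up in our system.
After digging through the code and running a few tests, I found that Connection reset is thrown by the event loop's unsafe.read() call when the client does not yet know that the channel has been closed.
ClosedChannelException, by contrast, is usually raised by Netty itself. Code related to it can be found in AbstractChannel and in SslHandler.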
AbstractChannel
static final ClosedChannelException CLOSED_CHANNEL_EXCEPTION = new ClosedChannelException();
...
static {
    CLOSED_CHANNEL_EXCEPTION.setStackTrace(EmptyArrays.EMPTY_STACK_TRACE);
    NOT_YET_CONNECTED_EXCEPTION.setStackTrace(EmptyArrays.EMPTY_STACK_TRACE);
}
...
@Override
public void write(Object msg, ChannelPromise promise) {
    ChannelOutboundBuffer outboundBuffer = this.outboundBuffer;
    if (outboundBuffer == null) {
        // If the outboundBuffer is null we know the channel was closed and so
        // need to fail the future right away. If it is not null the handling of the rest
        // will be done in flush0()
        // See https://github.com/netty/netty/issues/2362
        safeSetFailure(promise, CLOSED_CHANNEL_EXCEPTION);
        // release message now to prevent resource-leak
        ReferenceCountUtil.release(msg);
        return;
    }
    outboundBuffer.addMessage(msg, promise);
}
This ClosedChannelException appears in many places in the code. The gist is that if write is called after the channel has been closed, the future of that write is marked as failed, with ClosedChannelException as its cause. SslHandler does something similar.
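The static block above also suggests why the log line reads ClosedChannelException: null with no stack frames: the JDK exception has no detail message, and Netty strips the stack trace from its shared instance. A minimal sketch using only JDK classes (SharedExceptionDemo and CLOSED are hypothetical names, not Netty code):

```java
import java.nio.channels.ClosedChannelException;

// Hypothetical demo class; it only imitates the pattern seen in AbstractChannel.
class SharedExceptionDemo {
    // One preallocated instance, reused for every failed write.
    static final ClosedChannelException CLOSED = new ClosedChannelException();

    static {
        // Same idea as setStackTrace(EmptyArrays.EMPTY_STACK_TRACE) in Netty.
        CLOSED.setStackTrace(new StackTraceElement[0]);
    }

    public static void main(String[] args) {
        // ClosedChannelException has no constructor that takes a message.
        System.out.println(CLOSED.getMessage());           // prints "null"
        System.out.println(CLOSED.getStackTrace().length); // prints "0"
    }
}
```

Because the instance is shared and has no stack trace, the log alone cannot show which write failed. The listener on the write future is the place to add that context.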
-----------------
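Back to Connection reset by peer. It is fairly easy to simulate: on the server side, install a handler that closes the channel as soon as channelActive fires, and on the client side, add a listener that sends request data immediately after the connect succeeds. The code is as follows.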
client
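import io.netty.bootstrap.Bootstrap;
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

import java.io.IOException;

public class SimpleClient {
    public static void main(String[] args) throws IOException, InterruptedException {
        Bootstrap b = new Bootstrap();
        b.group(new NioEventLoopGroup())
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<NioSocketChannel>() {
                    @Override
                    protected void initChannel(NioSocketChannel ch) throws Exception {
                    }
                });
        b.connect("localhost", 8090).addListener(new ChannelFutureListener() {
            @Override
            public void operationComplete(ChannelFuture future) throws Exception {
                if (future.isSuccess()) {
                    // Send data as soon as the connect succeeds.
                    future.channel().write(Unpooled.buffer().writeBytes("123".getBytes()));
                    future.channel().flush();
                }
            }
        });
    }
}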
server
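// SimpleServer.java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

public class SimpleServer {
    public static void main(String[] args) throws Exception {
        EventLoopGroup bossGroup = new NioEventLoopGroup(1);
        EventLoopGroup workerGroup = new NioEventLoopGroup();
        ServerBootstrap b = new ServerBootstrap();
        b.group(bossGroup, workerGroup)
                .channel(NioServerSocketChannel.class)
                .option(ChannelOption.SO_REUSEADDR, true)
                .childHandler(new ChannelInitializer<NioSocketChannel>() {
                    @Override
                    protected void initChannel(NioSocketChannel ch) throws Exception {
                        ch.pipeline().addLast(new SimpleServerHandler());
                    }
                });
        b.bind(8090).sync().channel().closeFuture().sync();
    }
}

// SimpleServerHandler.java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class SimpleServerHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void channelActive(ChannelHandlerContext ctx) throws Exception {
        // Close the connection as soon as it becomes active.
        ctx.channel().close().sync();
    }

    @Override
    public void channelRead(ChannelHandlerContext ctx, final Object msg) throws Exception {
        System.out.println(123);
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        System.out.println("inactive");
    }
}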
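This setup triggers the connection reset by peer exception for the following reason. Once the connect succeeds, the client side first runs the connect listener. By then the server side has already closed the channel, and that close has produced an event on the client too (it triggers a read event on the client whose read returns -1, where -1 means the channel is closed; the client's channelInactive callback and the change of the channel's active state both happen at that point). That event, however, is processed only after the connect listener has run. Inside the listener the channel therefore does not yet know it has been disconnected, and it goes ahead with write and flush. After flush is called, the event loop enters the OP_READ branch, and this time unsafe.read() throws the connection reset exception. The event loop code is shown below.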
NioEventLoop
private static void processSelectedKey(SelectionKey k, AbstractNioChannel ch) {
    final NioUnsafe unsafe = ch.unsafe();
    if (!k.isValid()) {
        // close the channel if the key is not valid anymore
        unsafe.close(unsafe.voidPromise());
        return;
    }
    try {
        int readyOps = k.readyOps();
        // Also check for readOps of 0 to workaround possible JDK bug which may otherwise lead
        // to a spin loop
        if ((readyOps & (SelectionKey.OP_READ | SelectionKey.OP_ACCEPT)) != 0 || readyOps == 0) {
            unsafe.read();
            if (!ch.isOpen()) {
                // Connection already closed - no need to handle write.
                return;
            }
        }
        if ((readyOps & SelectionKey.OP_WRITE) != 0) {
            // Call forceFlush which will also take care of clear the OP_WRITE once there is nothing left to write
            ch.unsafe().forceFlush();
        }
        if ((readyOps & SelectionKey.OP_CONNECT) != 0) {
            // remove OP_CONNECT as otherwise Selector.select(..) will always return without blocking
            // See https://github.com/netty/netty/issues/924
            int ops = k.interestOps();
            ops &= ~SelectionKey.OP_CONNECT;
            k.interestOps(ops);
            unsafe.finishConnect();
        }
    } catch (CancelledKeyException e) {
        unsafe.close(unsafe.voidPromise());
    }
}
That is how connection reset by peer comes about.
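To see a reset surface from a read without Netty, the sketch below uses plain JDK sockets. Note that it swaps in a different trigger from the one described above: instead of racing a write against the server's close, the server side sets SO_LINGER to 0, so that close() aborts the connection with an RST. ResetDemo and readAfterAbortiveClose are hypothetical names, and the exact exception message depends on the JDK version and operating system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class, JDK only.
class ResetDemo {
    // Returns the exception thrown by the client's read, or null if the read did not fail.
    static IOException readAfterAbortiveClose() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // Assumption for this demo: linger 0 makes close() send an RST instead of a FIN.
            accepted.setSoLinger(true, 0);
            accepted.close();
            try {
                client.getInputStream().read(); // the reset is reported here, on read
                return null;
            } catch (IOException e) {
                return e;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterAbortiveClose());
    }
}
```

As in the Netty case, the client learns about the reset only when it reads, which matches the exception coming out of unsafe.read() in the event loop.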
------------------
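Now for how ClosedChannelException arises. Reproducing it is just as easy. First, to be clear: it is not true that ClosedChannelException occurs only when the client closes the channel itself. Below are two client variants that both produce it.
client 1: the client closes the channel itself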
import io.netty.bootstrap.Bootstrap;
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;

public class SimpleClient {
    private static final Logger logger = LoggerFactory.getLogger(SimpleClient.class);

    public static void main(String[] args) throws IOException, InterruptedException {
        Bootstrap b = new Bootstrap();
        b.group(new NioEventLoopGroup())
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<NioSocketChannel>() {
                    @Override
                    protected void initChannel(NioSocketChannel ch) throws Exception {
                    }
                });
        b.connect("localhost", 8090).addListener(new ChannelFutureListener() {
            @Override
            public void operationComplete(ChannelFuture future) throws Exception {
                if (future.isSuccess()) {
                    // Close first, then write: the write is expected to fail.
                    future.channel().close();
                    future.channel().write(Unpooled.buffer().writeBytes("123".getBytes()))
                            .addListener(new ChannelFutureListener() {
                                @Override
                                public void operationComplete(ChannelFuture future) throws Exception {
                                    if (!future.isSuccess()) {
                                        logger.error("Error", future.cause());
                                    }
                                }
                            });
                    future.channel().flush();
                }
            }
        });
    }
}
As long as close is called before write, the write is bound to see that the channel is closed, so it fails, and the cause on the future is ClosedChannelException.
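The JDK's own SocketChannel behaves in a comparable way, which may help build intuition: writing to a channel that was closed locally fails with ClosedChannelException before anything reaches the network. A minimal sketch with JDK classes only (WriteAfterCloseDemo and writeAfterLocalCloseFails are hypothetical names); the difference is that the JDK throws the exception directly, while Netty reports it through the write future.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, JDK only.
class WriteAfterCloseDemo {
    // Returns true if writing to a locally closed channel fails with ClosedChannelException.
    static boolean writeAfterLocalCloseFails() {
        try {
            SocketChannel ch = SocketChannel.open();
            ch.close(); // close locally; no connection was ever made
            ch.write(ByteBuffer.wrap("123".getBytes(StandardCharsets.UTF_8)));
            return false;
        } catch (ClosedChannelException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAfterLocalCloseFails()); // prints "true"
    }
}
```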
--------------------
client 2: ClosedChannelException caused by the server
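import io.netty.bootstrap.Bootstrap;
import io.netty.buffer.Unpooled;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;

public class SimpleClient {
    private static final Logger logger = LoggerFactory.getLogger(SimpleClient.class);

    public static void main(String[] args) throws IOException, InterruptedException {
        Bootstrap b = new Bootstrap();
        b.group(new NioEventLoopGroup())
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<NioSocketChannel>() {
                    @Override
                    protected void initChannel(NioSocketChannel ch) throws Exception {
                    }
                });
        Channel channel = b.connect("localhost", 8090).sync().channel();
        // Give the event loop time to process the close coming from the server.
        Thread.sleep(3000);
        channel.writeAndFlush(Unpooled.buffer().writeBytes("123".getBytes()))
                .addListener(new ChannelFutureListener() {
                    @Override
                    public void operationComplete(ChannelFuture future) throws Exception {
                        if (!future.isSuccess()) {
                            logger.error("error", future.cause());
                        }
                    }
                });
    }
}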
server
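import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

public class SimpleServer {
    public static void main(String[] args) throws Exception {
        EventLoopGroup bossGroup = new NioEventLoopGroup(1);
        EventLoopGroup workerGroup = new NioEventLoopGroup();
        ServerBootstrap b = new ServerBootstrap();
        b.group(bossGroup, workerGroup)
                .channel(NioServerSocketChannel.class)
                .option(ChannelOption.SO_REUSEADDR, true)
                .childHandler(new ChannelInitializer<NioSocketChannel>() {
                    @Override
                    protected void initChannel(NioSocketChannel ch) throws Exception {
                        ch.pipeline().addLast(new SimpleServerHandler());
                    }
                });
        b.bind(8090).sync().channel().closeFuture().sync();
    }
}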
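In this case the server closes the channel while the client is still sleeping. During the sleep, the client's event loop processes the close event coming from the server: processSelectedKey enters the OP_READ branch, the read returns -1, and channelInactive is finally fired on the client. When the client wakes up and calls writeAndFlush, its channel is already inactive, so the write fails with ClosedChannelException as the cause.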
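The -1 mentioned above is the ordinary end-of-stream signal of JDK channels, and it can be observed without Netty. The sketch below assumes a loopback connection in which the server side closes the accepted channel right away, in the spirit of SimpleServerHandler. EofDemo and readAfterPeerClose are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class, JDK only.
class EofDemo {
    // Returns what the client's read reports after the peer closed the connection normally.
    static int readAfterPeerClose() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                server.accept().close(); // orderly close by the server side
                // Blocking read: returns -1 once the close has reached the client.
                return client.read(ByteBuffer.allocate(16));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterPeerClose()); // prints "-1"
    }
}
```

In the Netty client this read happens on the event loop thread during the sleep, which is why the channel is already inactive by the time writeAndFlush is called.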
