【原創】大叔問題定位分享(5)Kafka客戶端報錯SocketException: Too many open files 打開的文件過多


kafka0.8.1

一 問題

10月22號應用系統忽然報錯:

[2014/12/22 11:52:32.738]java.net.SocketException: 打開的文件過多

[2014/12/22 11:52:32.738]       at java.net.Socket.createImpl(Socket.java:447)

[2014/12/22 11:52:32.738]       at java.net.Socket.connect(Socket.java:577)

[2014/12/22 11:52:32.738]       at java.net.Socket.connect(Socket.java:528)

[2014/12/22 11:52:32.738]       at sun.net.NetworkClient.doConnect(NetworkClient.java:180)

[2014/12/22 11:52:32.738]       at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)

[2014/12/22 11:52:32.738]       at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)

[2014/12/22 11:52:32.738]       at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)

[2014/12/22 11:52:32.738]       at sun.net.www.http.HttpClient.New(HttpClient.java:308)

[2014/12/22 11:52:32.738]       at sun.net.www.http.HttpClient.New(HttpClient.java:326)

[2014/12/22 11:52:32.738]       at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)

[2014/12/22 11:52:32.738]       at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)

[2014/12/22 11:52:32.738]       at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)

[2014/12/22 11:52:32.738]       at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)

[2014/12/22 11:52:32.738]       at com.rbc.http.client.RbcHttpClient.callCptService(RbcHttpClient.java:50)

[2014/12/22 11:52:32.738]       at com.rbc.http.client.RbcHttpClient.callCptService(RbcHttpClient.java:144)

[2014/12/22 11:52:32.738]       at com.rbc.http.bean.BaseBean.call(BaseBean.java:54)

[2014/12/22 11:52:32.739]       at app.package.APPClass.send2Center(APPClass.java:852)

[2014/12/22 11:52:32.739]       at sun.reflect.GeneratedMethodAccessor4312.invoke(Unknown Source)

[2014/12/22 11:52:32.739]       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

[2014/12/22 11:52:32.739]       at java.lang.reflect.Method.invoke(Method.java:606)

[2014/12/22 11:52:32.739]       at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)

[2014/12/22 11:52:32.739]       at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:198)

[2014/12/22 11:52:32.739]       at com.sun.proxy.$Proxy157.send2Center(Unknown Source)

[2014/12/22 11:52:32.739]       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)

[2014/12/22 11:52:32.739]       at java.util.concurrent.FutureTask.run(FutureTask.java:262)

[2014/12/22 11:52:32.739]       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

[2014/12/22 11:52:32.739]       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

[2014/12/22 11:52:32.739]       at java.lang.Thread.run(Thread.java:745)

發現是因為上午打開一個開關,即應用系統將訪問日志發送到消息平台kafka,但是應用系統服務器沒有配置消息平台kafka的host,導致出現問題;

 

二 問題重現

將kafka客戶端的broker-list配置為一個不存在的host,並且不斷發送消息,執行:

lsof –p $pid|wc -l

會發現該java進程的文件描述符不斷增長,其中包含大量的socket連接,並最終超過限制,可以通過:

ulimit -a|grep 'open files'

來查看文件描述符數量限制,超出限制之后,會就不斷報錯:

java.net.SocketException: Too many open files

問題重現;

 

但是如果將broker-list配置為一個不存在的ip,則不能重現問題;

 

三 背景

1 File Descriptor

Generally, a file descriptor is an index for an entry in a kernel-resident array data structure containing the details of open files. In POSIX this data structure is called a file descriptor table, and each process has its own file descriptor table. The process passes the file descriptor to the kernel through a system call, and the kernel will access the file on behalf of the process. The process itself cannot read or write the file descriptor table directly.

On Linux, the set of file descriptors open in a process can be accessed under the path /proc/PID/fd/, where PID is the process identifier.

 

ulimit -a|grep 'open files'

查看進程文件描述符限制

cat /proc/sys/fs/file-max

查看系統總的文件描述符限制

lsof –p $pid

查看某個進程具體打開的文件描述符

ulimit –n $num

臨時修改文件描述符限制

 

2 Kafka Async Producer

為了避免kafka出現問題影響應用系統,使用kafka異步producer,發送消息后會將消息添加到內存隊列並立即返回,另外有一個線程不斷處理內存隊列中的消息並發送到kafka:

 

當kafka服務器出現問題,消息發送不出去的時候,只要配置queue.enqueue.timeout.ms=0,則內存隊列滿時消息直接拋棄掉,不會阻塞:

 

 

四 問題原因

Kafka客戶端發送消息前,首先會通過broker-list中的配置連接到任一台broker獲取metadata:

ERROR [2014-12-22 14:05:04,687] (Logging.scala:103)  - Producer connection to $kafka_server_1:9096 unsuccessful

java.nio.channels.UnresolvedAddressException

       at sun.nio.ch.Net.checkAddress(Net.java:127)

       at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:613)

       at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)

       at kafka.producer.SyncProducer.connect(SyncProducer.scala:141)

       at kafka.producer.SyncProducer.getOrMakeConnection(SyncProducer.scala:156)

       at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:68)

       at kafka.producer.SyncProducer.send(SyncProducer.scala:112)

       at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:53)     

 

在這個過程報錯UnresolvedAddressException,SyncProducer部分代碼如下:

 

BlockingChannel代碼如下:

 

發現問題原因:BlockingChannel的connect方法中,channel.connect這行拋異常,此時connected為false,但是SyncProducer的disconnect中只有當BlockingChannel的connected為true時才會調用BlockingChannel的disconnect方法,所以導致一直在調用Blockingchannel的connect方法而沒有調用disconnect方法,而connect方法中會調用SocketChannel.open方法:

 

->

 

->

 

->

 

可見每次調用SocketChannel.open方法都會創建一個文件描述符,這樣文件描述符會不斷增加,為什么只有配host時出問題,而配ip時沒有問題,SocketChannelImpl部分代碼:

 

當配host時,是Net.checkAddress時拋UnresolvedAddressException異常,這時直接退出;當配ip時,是在Net.connect時拋ConnectionTimeoutException異常,這時會有后續的finally和catch代碼執行,其中readerCleanup代碼如下:

 

->

 

可見雖然沒有調用disconnect,也會關閉文件描述符,所以配ip時不會出現文件描述符過多的問題;

 

五 修復

修改kafka.network.SocketChannel代碼如下:

 

增加try-catch,並在UnresolvedAddressException異常發生時調用disconnect,問題解決;

 

這是當時發現問題之后在kafka jira上提交的issues: https://issues.apache.org/jira/browse/KAFKA-1832

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM