記一次dubbo服務丟失的問題排查


主要環境與用到的(關鍵)組件:

Springboot2.3.2
其中,dubbo-spring-boot-starter版本為2.7.8
zookeeper3.5.9

首先是服務報錯:

No provider available from registry ... for ... on consumer ... use dubbo version 2.7.8, please check status of providers(disabled, not registered or in blacklist).

很多使用過dubbo的同學對這個報錯信息應該都不陌生,一般情況下,如果沒有配置好生產端、消費端或者注冊中心,啟動的時候就會報這個錯。但這里的情況是,消費端和生產端都已經正常運行了一段時間了,大概10天左右,突然報這個錯。

這里借助網上總結的一張圖,來梳理一下排查No Provider的情況:

首先就來看看zookeeper下,生產端的provider情況,找到zookeeper的安裝bin目錄下:

sh zkCli.sh
ls /dubbo/xxx/provider

查看生產端的信息,結果發現[],這時的生產端程序還是running狀態(ps -ef進程存在),大概率就是這個生產端出了問題。由於日志有滾動刪除,而且出問題的時間點也不確定,所以這個問題感覺沒辦法在服務本身的日志中查看。於是想到查看zookeeper下的事務日志:

cd lib/ java -cp .:zookeeper-3.5.9.jar:slf4j-api-1.7.25.jar:zookeeper-jute-3.5.9.jar org.apache.zookeeper.server.LogFormatter /tmp/zookeeper/version-2/log.7d1 > /tmp/zookeeper/version-2/log.log

但也看不出什么端倪。這里不太確定是服務本身的問題,還是服務於zookeeper之間鏈接的問題。

於是在消費服務中設置了一個定時任務,每秒鍾去調用生產端的一個接口,並在生產和消費端都打印日志,同時加個監控,這樣在除了問題之后,就能第一時間找到具體的錯誤信息。
重啟服務8天后,重現了這個錯誤,這一次直接看到了錯誤日志。

首先,消費端爆出了

2022-02-18 16:23:27.713  WARN 2386027 --- [scheduling-1] o.a.d.r.c.s.FailoverClusterInvoker       :  [DUBBO] Although retry the method heart in the service com.jf.zk.proxy.facade.BindFacade was successful by the provider 192.168.1.13:20010, but there have been failed providers [192.168.1.13:20010] (1/1) from the registry 192.168.1.13:2181 on the consumer 192.168.1.13 using the dubbo version 2.7.8. Last error is: Invoke remote method timeout. method: heart, provider: dubbo://192.168.1.13:20010/com.jf.zk.proxy.facade.BindFacade?anyhost=true&application=proxy-web&check=false&deprecated=false&dubbo=2.0.2&dynamic=true&generic=false&group=hcontrol-test&init=false&interface=com.jf.zk.proxy.facade.BindFacade&metadata-type=remote&methods=appUserBind,unbind,heart,ideUserBind&pid=2386027&qos.enable=false&register.ip=192.168.1.13&release=2.7.8&remote.application=proxy-service&revision=1.3.1&side=consumer&sticky=false&timeout=30000&timestamp=1644487367193&version=1.0, cause: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2022-02-18 16:22:39.000, end time: 2022-02-18 16:23:09.020, client elapsed: 0 ms, server elapsed: 30020 ms, timeout: 30000 ms, request: Request [id=707282, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /192.168.1.13:54056 -> /192.168.1.13:20010, dubbo version: 2.7.8, current host: 192.168.1.13
org.apache.dubbo.rpc.RpcException: Invoke remote method timeout. method: heart, provider: dubbo://192.168.1.13:20010/com.jf.zk.proxy.facade.BindFacade?anyhost=true&application=proxy-web&check=false&deprecated=false&dubbo=2.0.2&dynamic=true&generic=false&group=hcontrol-test&init=false&interface=com.jf.zk.proxy.facade.BindFacade&metadata-type=remote&methods=appUserBind,unbind,heart,ideUserBind&pid=2386027&qos.enable=false&register.ip=192.168.1.13&release=2.7.8&remote.application=proxy-service&revision=1.3.1&side=consumer&sticky=false&timeout=30000&timestamp=1644487367193&version=1.0, cause: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2022-02-18 16:22:39.000, end time: 2022-02-18 16:23:09.020, client elapsed: 0 ms, server elapsed: 30020 ms, timeout: 30000 ms, request: Request [id=707282, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /192.168.1.13:54056 -> /192.168.1.13:20010
......
Caused by: java.util.concurrent.ExecutionException: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2022-02-18 16:22:39.000, end time: 2022-02-18 16:23:09.020, client elapsed: 0 ms, server elapsed: 30020 ms, timeout: 30000 ms, request: Request [id=707282, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /192.168.1.13:54056 -> /192.168.1.13:20010
......
2022-02-18 16:23:28.009 ERROR 2386027 --- [scheduling-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task
org.apache.dubbo.rpc.RpcException: No provider available from registry ...

這里看到是一些請求超過了consumer端的timeout,后續就是No provider,猜想應該是生產端出現了什么問題,響應越來越慢,到后來服務長時間不可用,直至與其它服務斷開了鏈接。這里就要看看生產端到底發生了什么,關鍵日志如下:

2022-02-18 16:23:47.942  INFO 2385338 --- [main-SendThread(192.168.1.13:2181)] org.apache.zookeeper.ClientCnxn          : Unable to read additional data from server sessionid 0x100000040fc01ae, likely server has closed socket, closing socket connection and attempting reconnect
2022-02-18 16:23:32.144  WARN 2385338 --- [NettyServerWorker-6-2] o.a.d.remoting.transport.AbstractServer  :  [DUBBO] All clients has disconnected from /192.168.1.13:20010. You can graceful shutdown now., dubbo version: 2.7.8, current host: 192.168.1.13
2022-02-18 16:27:15.830  WARN 2385338 --- [DubboClientHandler-192.168.1.13:20004-thread-2] o.a.d.r.exchange.support.DefaultFuture   :  [DUBBO] The timeout response finally returned at 2022-02-18 16:27:13.882, response status is 31, channel: /192.168.1.13:35954 -> /192.168.1.13:20004, please check provider side for detailed result., dubbo version: 2.7.8, current host: 192.168.1.13
2022-02-18 16:27:38.590  WARN 2385338 --- [DubboClientHandler-192.168.1.13:20004-thread-3] o.a.d.r.exchange.support.DefaultFuture   :  [DUBBO] The timeout response finally returned at 2022-02-18 16:26:53.015, response status is 31, channel: /192.168.1.13:35954 -> /192.168.1.13:20004, please check provider side for detailed result., dubbo version: 2.7.8, current host: 192.168.1.13
......
2022-02-18 16:23:47.117  INFO 2385338 --- [main-SendThread(192.168.1.13:2181)] org.apache.zookeeper.ClientCnxn          : Unable to read additional data from server sessionid 0x100000040fc01ad, likely server has closed socket, closing socket connection and attempting reconnect
2022-02-18 16:35:32.357  INFO 2385338 --- [NettyClientWorker-1-2] o.a.d.r.t.netty4.NettyClientHandler      :  [DUBBO] The connection of /192.168.1.13:44900 -> /192.168.1.13:8003 is disconnected., dubbo version: 2.7.8, current host: 192.168.1.13
2022-02-18 16:36:13.950  WARN 2385338 --- [dubbo-client-idleCheck-thread-1] o.a.d.r.e.s.header.ReconnectTimerTask    :  [DUBBO] Reconnect to channel HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [/192.168.1.13:35954 -> /192.168.1.13:20004]], because heartbeat read idle time out: 180000ms, dubbo version: 2.7.8, current host: 192.168.1.13
......
2022-02-18 18:46:33.184  WARN 2385338 --- [nioEventLoopGroup-10-3] i.n.c.AbstractChannelHandlerContext      : An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: Java heap space

這個服務雖然出現了很多與另一個服務的鏈接問題,但后續出現了java.lang.OutOfMemoryError: Java heap space,大概率是由於這個內存問題導致的。

由於服務還是running(ps -ef進程存在),我們可以直接把它的內存dump出來。

jmap -dump:file=javaDump.hprof,format=b {該服務的pid}
而后使用jvisualvm.exe(jdk自帶工具,其實JDK自帶了很多有用的工具,有必要了解一下)將這個hprof文件打開,可以查看很多信息:

查看到了一個我們代碼里定義的SessionModel對象異常地多,最終發現是同事代碼里每次新建一個業務會話時,都會生成一個SessionModel對象,將其放在ConcurrentHashMap中,但沒有clear或者remove的邏輯,導致相關的對象越來越多,最終導致OOM。
猜測這個服務在內存資源消耗到一定程度時,已經無法處理其他的請求,發出去的請求也無法應答,所以與之相關的rpc鏈接會被斷開,最終這個服務從注冊中心被清理。

這里還可以看看這個生產端的服務A調用其它服務B的日志:

......
2022-02-18 16:32:07.642  INFO 348283 --- [NettyServerWorker-6-3] o.a.d.r.t.netty4.NettyServerHandler      :  [DUBBO] IdleStateEvent triggered, close channel NettyChannel [channel=[id: 0x0efa3ba8, L:/192.168.1.13:20004 - R:/192.168.1.13:35954]], dubbo version: 2.7.8, current host: 192.168.1.13
2022-02-18 16:32:07.642  INFO 348283 --- [NettyServerWorker-6-3] o.a.d.r.transport.netty4.NettyChannel    :  [DUBBO] Close netty channel [id: 0x0efa3ba8, L:/192.168.1.13:20004 - R:/192.168.1.13:35954], dubbo version: 2.7.8, current host: 192.168.1.13
2022-02-18 16:32:07.642  INFO 348283 --- [NettyServerWorker-6-3] o.a.d.r.t.netty4.NettyServerHandler      :  [DUBBO] The connection of /192.168.1.13:35954 -> /192.168.1.13:20004 is disconnected., dubbo version: 2.7.8, current host: 192.168.1.13

這里找到對應日志的代碼:

猜測是服務B由於長時間沒有接收到來自上文提到的服務A(此時應該是客戶端的角色)的心跳,持續了timeout這么長時間,於是服務B就把鏈接關閉了,此時對於A來說,應該就是收不到B服務的響應,所以也會在日志前期報訪問服務B的timeout。

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM