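Today, while checking the Sentinel console in production, I found that 3 of the 5 nodes of one Dubbo application had lost contact with it.
The application logs on the lost nodes showed that the service had not gone down: every Dubbo interface was still logging normally.
From the application nodes, ping/telnet to the Sentinel console node succeeded, so the IP and port were reachable.
Next I searched the sentinel-record logs on an application node:
grep "Heartbeat" sentinel-record.log.2019-01-1*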
sentinel-record.log.2019-01-14.0:2019-01-14 16:50:43 [Sentinel InitExecutor] Found init func: com.alibaba.csp.sentinel.transport.init.HeartbeatSenderInitFunc
sentinel-record.log.2019-01-14.0:2019-01-14 16:50:43 [SimpleHttpHeartbeatSender] Default console address list retrieved: [/xxx:xxx]
sentinel-record.log.2019-01-14.0:2019-01-14 16:50:43 [HeartbeatSenderInit] HeartbeatSender started: com.alibaba.csp.sentinel.transport.heartbeat.SimpleHttpHeartbeatSender
sentinel-record.log.2019-01-14.0:2019-01-14 16:50:43 [Sentinel InitExecutor] Initialized: com.alibaba.csp.sentinel.transport.init.HeartbeatSenderInitFunc with order 2147483647
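The only output was from January 14. The application had been changed at that time, and the nodes were restarted after the build and release; the log shows that heartbeat initialization completed normally.
I then used JMC to compare the state of the threads Sentinel uses to send its scheduled heartbeats on the lost nodes and on the healthy nodes.
On the lost nodes every thread was in the WAITING state, whereas on the healthy nodes one thread was in TIMED_WAITING.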
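The difference between the two states can be reproduced outside Sentinel. The following is a minimal sketch, not Sentinel's code: the class name HeartbeatThreadStateDemo, the thread name, and the single-thread pool are all assumptions made for illustration. It relies on how the JDK's scheduled thread pool behaves: an idle worker that still has a periodic task queued waits with a timeout (TIMED_WAITING), while an idle worker with nothing left in the queue waits indefinitely (WAITING).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; not part of Sentinel.
class HeartbeatThreadStateDemo {

    /**
     * Starts a one-thread scheduler with a single periodic task and returns
     * the worker thread's state once the first run has finished.
     *
     * @param failOnFirstRun if true, the task throws on its first run,
     *                       which ends the periodic schedule
     */
    static Thread.State idleWorkerState(boolean failOnFirstRun) {
        AtomicReference<Thread> worker = new AtomicReference<>();
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1, r -> {
            Thread t = new Thread(r, "demo-heartbeat");
            t.setDaemon(true);
            worker.set(t); // remember the worker so its state can be read later
            return t;
        });
        try {
            pool.scheduleAtFixedRate(() -> {
                if (failOnFirstRun) {
                    throw new IllegalStateException("simulated failure");
                }
            }, 0, 10, TimeUnit.SECONDS);
            // Give the first run time to finish so the worker is idle again.
            Thread.sleep(500);
            return worker.get().getState();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("task still scheduled: " + idleWorkerState(false));
        System.out.println("task no longer scheduled: " + idleWorkerState(true));
    }
}
```

If the sketch is right, a pool whose threads are all WAITING has no periodic task left in its queue, which is consistent with what JMC showed on the lost nodes.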
The heartbeat-sending code in the HeartbeatSenderInitFunc class:
private void scheduleHeartbeatTask(/*@NonNull*/ final HeartbeatSender sender, /*@Valid*/ long interval) {
    pool.scheduleAtFixedRate(new Runnable() {
        @Override
        public void run() {
            try {
                sender.sendHeartbeat();
            } catch (Throwable e) {
                RecordLog.warn("[HeartbeatSender] Send heartbeat error", e);
            }
        }
    }, 5000, interval, TimeUnit.MILLISECONDS);
    RecordLog.info("[HeartbeatSenderInit] HeartbeatSender started: "
        + sender.getClass().getCanonicalName());
}
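Inside the task, the call to sender.sendHeartbeat() is wrapped in a catch of Throwable that writes the exception to the log, yet no such exception could be found in the logs.
From this I inferred that the scheduled task's thread was no longer doing its work.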
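Some research suggested that an out-of-memory error in the application could be the cause, since it can kill the thread.
Searching the application logs of the last two weeks for OutOfMemoryError turned up: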
[ ERROR] [2019-01-14 10:25:29] [6beacd73653e7f50/6beacd73653e7f50] [DubboServerHandler-xxx:xxx-thread-397] com.alibaba.dubbo.rpc.filter.ExceptionFilter [91] - [DUBBO] Got unchecked and undeclared exception which called by xxx. service: com.winxuan.services.shopps.service.ShopItemService, method: getShopItemInfoId, exception: java.lang.OutOfMemoryError: GC overhead limit exceeded, dubbo version: 2.6.0, current host: xxx
Summary:
If a Java application hits an OutOfMemoryError, its ScheduledExecutorService may stop working.
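One documented way a periodic task can stop silently is described in the JDK's scheduleAtFixedRate contract: if any execution throws, subsequent executions are suppressed. The sketch below illustrates that behaviour under stated assumptions. The class SuppressedScheduleDemo and its Outcome type are hypothetical, a plain Error stands in for OutOfMemoryError, and the task deliberately has no try/catch. It does not show that this is what happened in Sentinel, whose task catches Throwable; for the same effect there, something would have to escape that catch block.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of Sentinel.
class SuppressedScheduleDemo {

    /** How many times the task ran and what ended the schedule. */
    static final class Outcome {
        final int runs;
        final Throwable cause;

        Outcome(int runs, Throwable cause) {
            this.runs = runs;
            this.cause = cause;
        }
    }

    static Outcome runUntilFailure(int failOnRun) {
        AtomicInteger runs = new AtomicInteger();
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<?> future = pool.scheduleAtFixedRate(() -> {
                if (runs.incrementAndGet() == failOnRun) {
                    // Stand-in for an Error such as OutOfMemoryError escaping the task.
                    throw new Error("simulated fatal error");
                }
            }, 0, 20, TimeUnit.MILLISECONDS);

            Throwable cause = null;
            try {
                // A periodic future only completes when the task fails or is cancelled.
                future.get(5, TimeUnit.SECONDS);
            } catch (ExecutionException e) {
                cause = e.getCause();
            } catch (TimeoutException e) {
                // The task did not fail within the timeout; cause stays null.
            }
            // Enough time for about ten more runs if the task were still scheduled.
            Thread.sleep(200);
            return new Outcome(runs.get(), cause);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Outcome o = runUntilFailure(3);
        System.out.println("runs=" + o.runs + ", cause=" + o.cause);
    }
}
```

The run count should stay at the failing run, and nothing is logged unless someone inspects the returned ScheduledFuture. Keeping that future and checking isDone() periodically is one possible way to notice a dead schedule.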
Reference:
ScheduledExecutorService is broken https://community.oracle.com/thread/1144316