dubbo在redis注冊中心下 ReconnectTimerTask 不停重連provider 問題


     問題描述 : 使用redis 注冊中心時, dubbo消費端一直不停重試reconnect dubbo provider, 並報錯;

[DUBBO] Fail to connect to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [10.1.1.12:0 -> /10.1.1.228:20888]], dubbo version: 2.7.3, current host: 10.1.1.12
2019-08-30 20:33:52.283 [/] httpwrapper [dubbo-client-idleCheck-thread-1] ERROR o.a.d.r.e.s.h.ReconnectTimerTask - [DUBBO] Fail to connect to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [10.1.1.12:0 -> /10.1.1.228:20888]], dubbo version: 2.7.3, current host: 10.1.1.12 
org.apache.dubbo.remoting.RemotingException: client(url: dubbo://10.1.1.228:20888/com.cxq56.service.GeoService?actives=0&anyhost=true&application=httpwrapper&async=false&bean.name=providers:dubbo:com.cxq56.service.GeoService&check=false&cluster=failover&codec=dubbo&default.deprecated=false&default.dynamic=false&default.register=true&default.retries=1&default.timeout=10000&deprecated=false&dubbo=2.0.2&dynamic=false&generic=false&heartbeat=60000&interface=com.cxq56.service.GeoService&lazy=false&loadbalance=random&methods=createForbiddenGeo,calculatedDistance,createSiteInfo,getSiteAndDistance,getAllGeoByCityId,searchForPOI,createGeo&pid=1&qos.enable=false&register=true&register.ip=10.1.1.12&release=2.7.1&remote.application=geo-provider&retries=0&revision=1.0-SNAPSHOT&shutwait=40000&side=consumer&sticky=false&timeout=3000&timestamp=1567049198218&validation=false) failed to connect to server /10.1.1.228:20888 client-side timeout 3000ms (elapsed: 3000ms) from netty client 10.1.1.12 using dubbo version 2.7.3
at org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:171)
at org.apache.dubbo.remoting.transport.AbstractClient.connect(AbstractClient.java:190)
at org.apache.dubbo.remoting.transport.AbstractClient.reconnect(AbstractClient.java:246)
at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeClient.reconnect(HeaderExchangeClient.java:155)
at org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask.doTask(ReconnectTimerTask.java:49)
at org.apache.dubbo.remoting.exchange.support.header.AbstractTimerTask.run(AbstractTimerTask.java:87)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:648)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:727)
at org.apache.dubbo.common.timer.HashedWheelTimer$Worker.run(HashedWheelTimer.java:449)
at java.lang.Thread.run(Thread.java:748)

先說結論,這種異常有兩種情況:

1. dubbo消費端啟動時,獲取到一些已經過期的provider注冊信息,並嘗試重新連接。
2, dubbo消費端,dubbo provider正常運行過程中,provider突然非正常停機,導致的不停嘗試重新連接。(當然一般情況下, 非正常停機是不能容忍的)

背景:

1. dubbo針對服務可用性的檢測有自己的實現機制, 主要通過ReconnectTimerTask來實現定時重連, 確保服務可用; 
2. 另一方面, dubbo在使用redis注冊中心注冊時, 會往redis寫入一個hash值, key為service接口名, 如"/dubbo/com.sample.configmgmt.api.CustomerServiceApi",field字段是方法接口的一些基本信息如:“dubbo://127.0.0.1:20881/com.sample.configmgmt.api.OrgServiceApi?accepts=0&accesslog=true&anyhost=true&application=configmgmt-service&bean.name=ServiceBean:com.cxq56.configmgmt.api.OrgServiceApi&cluster=failover&deprecated=false&dubbo=2.0.2&dump.directory=/tmp&dynamic=true&generic=false&interface=com.cxq56.configmgmt.api.OrgServiceApi&loadbalance=random&methods=createOrg,getOrg,updateOrg,listOrg,pageOrg&pid=1&register=true&release=2.7.3&retries=0&revision=1.0-SNAPSHOT&shutwait=40000&side=provider&timeout=3000&timestamp=1567416954354&token=true”,value是一個定時更新的時間戳,由一個單獨的線程來維護; 當有別的服務通過注冊中心獲取可用的dubbo服務時, 如果value的時間戳小於當前日期, 那么就會認定這個服務時過期的, 不可用的;

    針對第二種情況,當有新的provider注冊並發送notify消息是,Notify線程會執行一系列操作,RegistryDirectory 會更新最新可用的provider信息(refreshInvoker),這樣就能把對應過期dubbo服務的ReconnectTimerTask注銷掉。

    經過查看,本次我們出現的問題,是屬於第一種情況; 當dubbo消費端啟動時,獲取到的服務端注冊信息,即使有些接口的時間戳已經過期了,但是還是嘗試重新連接;者和我預期的完全不同;這是因為dubbo服務端有個關鍵的參數設置的有問題。

HeaderExchangeClient.java :心跳機制和重連機制的啟動器

    private final Client client;
    private final ExchangeChannel channel;

    private static final HashedWheelTimer IDLE_CHECK_TIMER = new HashedWheelTimer(
            new NamedThreadFactory("dubbo-client-idleCheck", true), 1, TimeUnit.SECONDS, TICKS_PER_WHEEL);
    private HeartbeatTimerTask heartBeatTimerTask;
    private ReconnectTimerTask reconnectTimerTask;

    public HeaderExchangeClient(Client client, boolean startTimer) {
        Assert.notNull(client, "Client can't be null");
        this.client = client;
        this.channel = new HeaderExchangeChannel(client);

        if (startTimer) {
            URL url = client.getUrl();
            startReconnectTask(url);
            startHeartBeatTask(url);
        }
    }

    可以看到使用的是HashedWheelTimer來定時輪詢的;這里的reConnectTask如果失敗,就會打印出我們的異常日志;而且失敗后不會停止重試,會一直嘗試下去;那么這里有一個問題,是否redis有的歷史注冊信息,consumer都會去嘗試reconnect呢?
    所以我們嘗試打個斷點嘗試分析一下;並往上追述可以發現當dubbo consumer啟動時會在redis中注冊本身的消費端信息,同時也會通過接口名獲取所有provider注冊信息,並在RedisRegistry.java(我用的是redis注冊中心)中進行過濾,代碼如下:
RedisRegistry.java:實現注冊,取消注冊,以及訂閱,取消訂閱

private void doNotify(Jedis jedis, Collection<String> keys, URL url, Collection<NotifyListener> listeners) {
        if (keys == null || keys.isEmpty()
                || listeners == null || listeners.isEmpty()) {
            return;
        }
        long now = System.currentTimeMillis();
        List<URL> result = new ArrayList<>();
        List<String> categories = Arrays.asList(url.getParameter(CATEGORY_KEY, new String[0]));
        String consumerService = url.getServiceInterface();
        for (String key : keys) {
            if (!ANY_VALUE.equals(consumerService)) {
                String providerService = toServiceName(key);
                if (!providerService.equals(consumerService)) {
                    continue;
                }
            }
            String category = toCategoryName(key);
            if (!categories.contains(ANY_VALUE) && !categories.contains(category)) {
                continue;
            }
            List<URL> urls = new ArrayList<>();
            Map<String, String> values = jedis.hgetAll(key);
            if (CollectionUtils.isNotEmptyMap(values)) {
                for (Map.Entry<String, String> entry : values.entrySet()) {
                    URL u = URL.valueOf(entry.getKey());
            //如果dynamic為false 或者 過期時間 大於 當前時間 就加入這個注冊url,后面進行reconnect
                    if (!u.getParameter(DYNAMIC_KEY, true)
                            || Long.parseLong(entry.getValue()) >= now) {
                        if (UrlUtils.isMatch(url, u)) {
                            urls.add(u);
                        }
                    }
                }
            }
            if (urls.isEmpty()) {
                urls.add(URLBuilder.from(url)
                        .setProtocol(EMPTY_PROTOCOL)
                        .setAddress(ANYHOST_VALUE)
                        .setPath(toServiceName(key))
                        .addParameter(CATEGORY_KEY, category)
                        .build());
            }
            result.addAll(urls);
            if (logger.isInfoEnabled()) {
                logger.info("redis notify: " + key + " = " + urls);
            }
        }
        if (CollectionUtils.isEmpty(result)) {
            return;
        }
        for (NotifyListener listener : listeners) {
            notify(url, listener, result);
        }
    }

    provider注冊信息的過濾條件是,dynamic為true且過期時間小於當前時間,一般舊的注冊數據的過期時間肯定都會小於當前時間(這種數據算是臟數據,優雅停機和dubbo monitor都可以移除),源頭就在這個dynamic上,由於這個provider使用的dubbo版本是2.7.1,有一個bug,默認的dynamic的值為false,直接導致現在的問題;另外這個dynamic的官方文檔解釋的意思是 "服務是否動態注冊,如果設為false,注冊后將顯示后disable狀態,需人工啟用,並且服務提供者停止時,也不會自動取消冊,需人工禁用。" 但是並沒有說,consumer會一直重連。

重連代碼如下



/**
 * ReconnectTimerTask
 */
public class ReconnectTimerTask extends AbstractTimerTask {

    private static final Logger logger = LoggerFactory.getLogger(ReconnectTimerTask.class);

    private final int idleTimeout;

    public ReconnectTimerTask(ChannelProvider channelProvider, Long heartbeatTimeoutTick, int idleTimeout) {
        super(channelProvider, heartbeatTimeoutTick);
        this.idleTimeout = idleTimeout;
    }
    
    //2.7.3版本默認每分鍾執行一次
    @Override
    protected void doTask(Channel channel) {
        try {
            Long lastRead = lastRead(channel);
            Long now = now();

            // Rely on reconnect timer to reconnect when AbstractClient.doConnect fails to init the connection
            //如果此時channel已經斷開了,那么立即重連
            if (!channel.isConnected()) {
                try {
                    logger.info("Initial connection to " + channel);
                    ((Client) channel).reconnect();
                } catch (Exception e) {
                    logger.error("Fail to connect to " + channel, e);
                }
            // check pong at client
            //如果此時channel沒有斷開,但是從上次
            } else if (lastRead != null && now - lastRead > idleTimeout) {
                logger.warn("Reconnect to channel " + channel + ", because heartbeat read idle time out: "
                        + idleTimeout + "ms");
                try {
                    ((Client) channel).reconnect();
                } catch (Exception e) {
                    logger.error(channel + "reconnect failed during idle time.", e);
                }
            }
        } catch (Throwable t) {
            logger.warn("Exception when reconnect to remote channel " + channel.getRemoteAddress(), t);
        }
    }
}

-----------------------------------------------------------------------------------------------------------end------------------------------------------------------------------------


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM