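When you develop a long-running service on YARN, fault tolerance needs attention. This article looks at one parameter that affects graceful restart of the ApplicationMaster (AM) and explains how to set it so that the AM can restart smoothly.
yarn-site.xml has the following parameter: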
    /**
     * The maximum number of application attempts.
     * It's a global setting for all application masters.
     */
    yarn.resourcemanager.am.max-attempts
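This is a global limit on the number of AM attempts. When submitting an application to YARN, you can also set a maximum number of attempts for that individual application: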
    /**
     * Set the number of max attempts of the application to be submitted. WARNING:
     * it should be no larger than the global number of max attempts in the Yarn
     * configuration.
     * @param maxAppAttempts the number of max attempts of the application
     * to be submitted.
     */
    @Public
    @Stable
    public abstract void setMaxAppAttempts(int maxAppAttempts);
When an attempt fails and keepContainersAcrossApplicationAttempts is set on the submission context, the ResourceManager (RM) decides whether the containers of the previous attempt are retained.
    boolean keepContainersAcrossAppAttempts = false;
    switch (finalAttemptState) {
      case FINISHED:
      {
        appEvent = new RMAppFinishedAttemptEvent(applicationId,
            appAttempt.getDiagnostics());
      }
        break;
      case KILLED:
      {
        // don't leave the tracking URL pointing to a non-existent AM
        appAttempt.setTrackingUrlToRMAppPage();
        appAttempt.invalidateAMHostAndPort();
        appEvent = new RMAppFailedAttemptEvent(applicationId,
            RMAppEventType.ATTEMPT_KILLED,
            "Application killed by user.", false);
      }
        break;
      case FAILED:
      {
        // don't leave the tracking URL pointing to a non-existent AM
        appAttempt.setTrackingUrlToRMAppPage();
        appAttempt.invalidateAMHostAndPort();
        if (appAttempt.submissionContext
            .getKeepContainersAcrossApplicationAttempts()
            && !appAttempt.submissionContext.getUnmanagedAM()) {
          // See if we should retain containers for non-unmanaged applications
          if (!appAttempt.shouldCountTowardsMaxAttemptRetry()) {
            // Premption, hardware failures, NM resync doesn't count towards
            // app-failures and so we should retain containers.
            keepContainersAcrossAppAttempts = true;
          } else if (!appAttempt.maybeLastAttempt) {
            // Not preemption, hardware failures or NM resync.
            // Not last-attempt too - keep containers.
            keepContainersAcrossAppAttempts = true;
          }
        }
        appEvent = new RMAppFailedAttemptEvent(applicationId,
            RMAppEventType.ATTEMPT_FAILED, appAttempt.getDiagnostics(),
            keepContainersAcrossAppAttempts);
      }
    }
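To make the FAILED branch easier to follow, here is a minimal sketch that restates its decision as a pure function. The class KeepContainersSketch, the method keepContainersAfterFailure and its four boolean parameters are hypothetical names introduced only for illustration; they stand in for the values the RM reads from the submission context and the attempt, and are not part of the Hadoop API.

```java
// Simplified model of the RM's decision in the FAILED case.
// All names here are illustrative assumptions, not real Hadoop classes.
final class KeepContainersSketch {
    private KeepContainersSketch() {}

    /**
     * @param keepFlagSet       stands in for getKeepContainersAcrossApplicationAttempts()
     * @param unmanagedAm       stands in for getUnmanagedAM()
     * @param countsTowardsMax  stands in for shouldCountTowardsMaxAttemptRetry()
     * @param maybeLastAttempt  stands in for appAttempt.maybeLastAttempt
     * @return true if the containers of the failed attempt would be retained
     */
    static boolean keepContainersAfterFailure(boolean keepFlagSet,
                                              boolean unmanagedAm,
                                              boolean countsTowardsMax,
                                              boolean maybeLastAttempt) {
        // Containers are only ever kept for managed AMs that asked for it.
        if (!keepFlagSet || unmanagedAm) {
            return false;
        }
        // Failures that do not count towards the limit always keep containers.
        if (!countsTowardsMax) {
            return true;
        }
        // A counted failure keeps containers unless this may be the last attempt.
        return !maybeLastAttempt;
    }
}
```

Under this model, keepContainersAfterFailure(true, false, true, true) is false: an ordinary failure on what may be the last attempt drops the containers, which is why the attempt limit matters for a long-running service.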
Note the variable appAttempt.maybeLastAttempt. How does the RM decide whether the current attempt is the last one?
    private void createNewAttempt() {
      ApplicationAttemptId appAttemptId =
          ApplicationAttemptId.newInstance(applicationId, attempts.size() + 1);
      RMAppAttempt attempt =
          new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
            submissionContext, conf,
            // The newly created attempt maybe last attempt if (number of
            // previously failed attempts(which should not include Preempted,
            // hardware error and NM resync) + 1) equal to the max-attempt
            // limit.
            maxAppAttempts == (getNumFailedAppAttempts() + 1), amReq);
      attempts.put(appAttemptId, attempt);
      currentAttempt = attempt;
    }
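Each time a new attempt is constructed, the expression maxAppAttempts == (getNumFailedAppAttempts() + 1) checks whether the number of previously failed attempts plus one has reached the maxAppAttempts limit. If it has, the new attempt is flagged as possibly the last one.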
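The check can be tried out in isolation with a small sketch. LastAttemptSketch and its method are hypothetical names used only for illustration; the parameter numFailedAppAttempts stands in for getNumFailedAppAttempts(), which per the source comment excludes preempted attempts, hardware errors and NM resync.

```java
// Stand-alone restatement of the "maybe last attempt" expression.
// Class and method names are illustrative assumptions, not Hadoop API.
final class LastAttemptSketch {
    private LastAttemptSketch() {}

    /**
     * @param maxAppAttempts        effective attempt limit for the application
     * @param numFailedAppAttempts  failed attempts that count towards the limit
     * @return true if the attempt being created may be the last one
     */
    static boolean maybeLastAttempt(int maxAppAttempts, int numFailedAppAttempts) {
        return maxAppAttempts == numFailedAppAttempts + 1;
    }
}
```

With a limit of 2, the first attempt (0 counted failures) is not flagged and the second attempt (1 counted failure) is. With a limit of 10000, the flag is only raised after 9999 counted failures.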
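The value of maxAppAttempts is effectively the smaller of the global and the per-application settings: a per-application value that is not positive or that exceeds the global value falls back to the global one.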
    int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
        YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS);
    int individualMaxAppAttempts = submissionContext.getMaxAppAttempts();
    if (individualMaxAppAttempts <= 0 ||
        individualMaxAppAttempts > globalMaxAppAttempts) {
      this.maxAppAttempts = globalMaxAppAttempts;
      LOG.warn("The specific max attempts: " + individualMaxAppAttempts
          + " for application: " + applicationId.getId()
          + " is invalid, because it is out of the range [1, "
          + globalMaxAppAttempts + "]. Use the global max attempts instead.");
    } else {
      this.maxAppAttempts = individualMaxAppAttempts;
    }
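The same resolution rule, written as a self-contained function that can be run without a cluster. MaxAttemptsSketch and resolveMaxAppAttempts are hypothetical names for illustration; the two parameters stand in for the value of yarn.resourcemanager.am.max-attempts and the value passed to setMaxAppAttempts.

```java
// Minimal sketch of how the effective attempt limit is chosen.
// Names are illustrative assumptions, not part of the Hadoop API.
final class MaxAttemptsSketch {
    private MaxAttemptsSketch() {}

    /**
     * @param globalMaxAppAttempts      cluster-wide limit from yarn-site.xml
     * @param individualMaxAppAttempts  limit requested by the application
     * @return the limit that would actually apply to the application
     */
    static int resolveMaxAppAttempts(int globalMaxAppAttempts,
                                     int individualMaxAppAttempts) {
        // Out of the range [1, global]: ignore the request, use the global limit.
        if (individualMaxAppAttempts <= 0
                || individualMaxAppAttempts > globalMaxAppAttempts) {
            return globalMaxAppAttempts;
        }
        return individualMaxAppAttempts;
    }
}
```

For example, resolveMaxAppAttempts(2, 10000) yields 2, so a large per-application value has no effect unless the global limit is raised as well.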
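Summary:
If you want the AM to be able to restart repeatedly and take over the containers of the previous attempt, set yarn.resourcemanager.am.max-attempts as high as practical, for example 10000. When submitting the application, also set the maximum number of attempts on the submission context, together with the window after which the failure count is refreshed. With these in place, and keepContainersAcrossApplicationAttempts enabled as described above, the needs of a long-running service on YARN are largely covered.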
