A deep dive into YARN's application max attempts: smooth ApplicationMaster restarts for long-running services


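When you develop a long-running service on YARN, fault tolerance needs attention. This article examines one parameter that affects how the ApplicationMaster (AM) is restarted, and explains how to set it so that AM restarts go smoothly.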

yarn-site.xml has the following parameter:

/**
   * The maximum number of application attempts.
   * It's a global setting for all application masters.
   */
yarn.resourcemanager.am.max-attempts

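This is a global limit on the number of ApplicationMaster attempts. When submitting an application to YARN, you can also set a maximum number of attempts for that individual application: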

/**
   * Set the number of max attempts of the application to be submitted. WARNING:
   * it should be no larger than the global number of max attempts in the Yarn
   * configuration.
   * @param maxAppAttempts the number of max attempts of the application
   * to be submitted.
   */
  @Public
  @Stable
  public abstract void setMaxAppAttempts(int maxAppAttempts);

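When an attempt fails and the application was submitted with keep-containers-across-application-attempts enabled (read via getKeepContainersAcrossApplicationAttempts() on the submission context), the ResourceManager decides whether the containers of the previous attempt are retained. The result ends up in the local variable keepContainersAcrossAppAttempts: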

boolean keepContainersAcrossAppAttempts = false;
switch (finalAttemptState) {
  case FINISHED:
  {
    appEvent = new RMAppFinishedAttemptEvent(applicationId,
        appAttempt.getDiagnostics());
  }
  break;
  case KILLED:
  {
    // don't leave the tracking URL pointing to a non-existent AM
    appAttempt.setTrackingUrlToRMAppPage();
    appAttempt.invalidateAMHostAndPort();
    appEvent =
        new RMAppFailedAttemptEvent(applicationId,
            RMAppEventType.ATTEMPT_KILLED,
            "Application killed by user.", false);
  }
  break;
  case FAILED:
  {
    // don't leave the tracking URL pointing to a non-existent AM
    appAttempt.setTrackingUrlToRMAppPage();
    appAttempt.invalidateAMHostAndPort();

    if (appAttempt.submissionContext
      .getKeepContainersAcrossApplicationAttempts()
        && !appAttempt.submissionContext.getUnmanagedAM()) {
      // See if we should retain containers for non-unmanaged applications
      if (!appAttempt.shouldCountTowardsMaxAttemptRetry()) {
        // Premption, hardware failures, NM resync doesn't count towards
        // app-failures and so we should retain containers.
        keepContainersAcrossAppAttempts = true;
      } else if (!appAttempt.maybeLastAttempt) {
        // Not preemption, hardware failures or NM resync.
        // Not last-attempt too - keep containers.
        keepContainersAcrossAppAttempts = true;
      }
    }
    appEvent =
        new RMAppFailedAttemptEvent(applicationId,
          RMAppEventType.ATTEMPT_FAILED, appAttempt.getDiagnostics(),
          keepContainersAcrossAppAttempts);

  }
}
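
The FAILED branch above boils down to a small predicate. The following is a minimal sketch of that decision, not the ResourceManager's actual code: the class KeepContainersSketch and its parameter names are hypothetical stand-ins for the four values the branch reads, and no Hadoop classes are involved.

```java
// Hypothetical, Hadoop-free model of the FAILED branch quoted above.
class KeepContainersSketch {

    /**
     * @param keepFlagSet      value of getKeepContainersAcrossApplicationAttempts()
     * @param unmanagedAm      value of getUnmanagedAM()
     * @param countsAsFailure  value of shouldCountTowardsMaxAttemptRetry()
     * @param maybeLastAttempt value of appAttempt.maybeLastAttempt
     * @return whether the containers of the failed attempt are retained
     */
    static boolean keepContainers(boolean keepFlagSet, boolean unmanagedAm,
                                  boolean countsAsFailure, boolean maybeLastAttempt) {
        if (!keepFlagSet || unmanagedAm) {
            return false; // feature not requested, or the AM is unmanaged
        }
        if (!countsAsFailure) {
            return true;  // preemption, hardware failure, NM resync: always keep
        }
        return !maybeLastAttempt; // a real failure: keep unless this may be the last attempt
    }

    public static void main(String[] args) {
        // Real failure, not the last attempt: containers are kept.
        System.out.println(keepContainers(true, false, true, false)); // true
        // Real failure on what may be the last attempt: containers are dropped.
        System.out.println(keepContainers(true, false, true, true));  // false
        // Preemption on the last attempt: containers are still kept.
        System.out.println(keepContainers(true, false, false, true)); // true
    }
}
```

The only case in which a real AM failure loses its containers is when the attempt may be the last one, which is why the max-attempts limit matters for a long-running service.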

Note the variable appAttempt.maybeLastAttempt. How does the ResourceManager determine whether the current attempt is the last one?

private void createNewAttempt() {
    ApplicationAttemptId appAttemptId =
        ApplicationAttemptId.newInstance(applicationId, attempts.size() + 1);
    RMAppAttempt attempt =
        new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
          submissionContext, conf,
          // The newly created attempt maybe last attempt if (number of
          // previously failed attempts(which should not include Preempted,
          // hardware error and NM resync) + 1) equal to the max-attempt
          // limit.
          maxAppAttempts == (getNumFailedAppAttempts() + 1), amReq);
    attempts.put(appAttemptId, attempt);
    currentAttempt = attempt;
  }

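Each time a new attempt is constructed, the expression maxAppAttempts == (getNumFailedAppAttempts() + 1) checks whether the number of attempts that have already failed, plus one, has reached the maxAppAttempts limit. As the code comment states, failures caused by preemption, hardware errors, or NM resync are not included in that count.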
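
To make the check concrete, here is a small Hadoop-free sketch of how the flag evolves over an application's history. The class LastAttemptSketch and its list-of-booleans representation are assumptions made for illustration; the real getNumFailedAppAttempts() walks RMAppAttempt objects and may apply further rules that this sketch ignores.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model: each entry says whether a finished attempt counts
// towards the limit (true) or was preemption/hardware/NM resync (false).
class LastAttemptSketch {

    /** Mirrors maxAppAttempts == (getNumFailedAppAttempts() + 1). */
    static boolean maybeLastAttempt(int maxAppAttempts, List<Boolean> finishedAttempts) {
        int numFailed = 0;
        for (boolean countsAsFailure : finishedAttempts) {
            if (countsAsFailure) {
                numFailed++;
            }
        }
        return maxAppAttempts == numFailed + 1;
    }

    public static void main(String[] args) {
        int maxAppAttempts = 3;
        // First attempt: nothing has failed yet.
        System.out.println(maybeLastAttempt(maxAppAttempts, Arrays.asList()));                  // false
        // One real failure and one preemption: only one failure is counted.
        System.out.println(maybeLastAttempt(maxAppAttempts, Arrays.asList(true, false)));       // false
        // Two real failures: the attempt being created may be the last.
        System.out.println(maybeLastAttempt(maxAppAttempts, Arrays.asList(true, false, true))); // true
    }
}
```

With a limit of 3, the fourth attempt in the last case is flagged even though three attempts have already run, because the preempted one is not counted.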

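The value of maxAppAttempts is derived from the global and the per-application settings. For a positive per-application value this amounts to taking the minimum of the two; a value that is non-positive or larger than the global limit is replaced by the global limit: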

int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
        YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS);
    int individualMaxAppAttempts = submissionContext.getMaxAppAttempts();
    if (individualMaxAppAttempts <= 0 ||
        individualMaxAppAttempts > globalMaxAppAttempts) {
      this.maxAppAttempts = globalMaxAppAttempts;
      LOG.warn("The specific max attempts: " + individualMaxAppAttempts
          + " for application: " + applicationId.getId()
          + " is invalid, because it is out of the range [1, "
          + globalMaxAppAttempts + "]. Use the global max attempts instead.");
    } else {
      this.maxAppAttempts = individualMaxAppAttempts;
    }
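
The effect of this range check is easiest to see with a few inputs. Below is a minimal sketch that reproduces only the comparison from the excerpt above; the class MaxAttemptsResolver and the method resolve are hypothetical names, and the logging is left out.

```java
// Hypothetical, Hadoop-free reproduction of the range check quoted above.
class MaxAttemptsResolver {

    /**
     * Returns the per-application value when it lies in [1, globalMax],
     * otherwise the global value.
     */
    static int resolve(int individualMax, int globalMax) {
        if (individualMax <= 0 || individualMax > globalMax) {
            return globalMax; // out of range: fall back to the global limit
        }
        return individualMax;
    }

    public static void main(String[] args) {
        System.out.println(resolve(0, 2));     // 2: no per-application value set
        System.out.println(resolve(10000, 2)); // 2: capped by the global limit
        System.out.println(resolve(5, 10000)); // 5: per-application value is in range
    }
}
```

The second case shows why a large per-application value alone is not enough: it is silently capped unless the global limit is raised as well.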

 

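Summary:

If you want the ApplicationMaster to be restarted again and again and to take over the containers of the previous attempt, raise yarn.resourcemanager.am.max-attempts as far as is practical, for example to 10000. When submitting the application, also set the maximum number of attempts on the submission context, together with the window after which the failure count is refreshed. Containers are dropped only when a failed attempt may have been the last one, so a high limit keeps ordinary AM failures from reaching that case. This is generally enough to meet the needs of a long-running service on YARN.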

 

