Notes on a production incident: the YARN ResourceManager kept failing over


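Early on a weekend morning an alert woke me up: the RM was switching between active and standby over and over.

A hurried investigation turned up two errors in the logs.

Error 1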

ervation <memory:0, vCores:0>
2019-12-21 11:51:57,781 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode.unreserveResource(FSSchedulerNode.java:88)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.unreserve(FSAppAttempt.java:589)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainerInternal(FairScheduler.java:899)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:564)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:846)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1479)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:804)
    at java.lang.Thread.run(Thread.java:748)

Error 2

2019-12-21 07:37:45,533 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainerInternal(FairScheduler.java:902)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:564)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:837)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1475)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:804)
        at java.lang.Thread.run(Thread.java:748)
2019-12-21 07:37:45,534 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..

Looking at the source, in FairScheduler:

 @Override
  protected void completedContainerInternal(
      RMContainer rmContainer, ContainerStatus containerStatus,
      RMContainerEventType event) {
    try {
      writeLock.lock();
      Container container = rmContainer.getContainer();

      // Get the application for the finished container
      FSAppAttempt application =
        getCurrentAttemptForContainer(container.getId());
      ApplicationId appId =
        container.getId().getApplicationAttemptId().getApplicationId();
      if (application == null) {
        LOG.info("Container " + container + " of" +
          " finished application " + appId +
          " completed with event " + event);
        return;
      }

      // Get the node on which the container was allocated
      FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());

      if (rmContainer.getState() == RMContainerState.RESERVED) {
        application.unreserve(rmContainer.getReservedPriority(), node); // releases this container's resources on the node
      } else {
        try {
          application.containerCompleted(rmContainer, containerStatus, event); 
          node.releaseContainer(rmContainer.getContainerId(), false);
          updateRootQueueMetrics();
          LOG.info("Application attempt " + application.getApplicationAttemptId()
                  + " released container " + container.getId() + " on node: " + node
                  + " with event: " + event);
        }catch (Exception e){
          LOG.error(e.getMessage(), e);
        }
      }
    } finally {
      writeLock.unlock();
    }
  }

Stepping into FSAppAttempt.unreserve:

  /**
   * Remove the reservation on {@code node} at the given {@link Priority}.
   * This dispatches SchedulerNode handlers as well.
   */
  public void unreserve(Priority priority, FSSchedulerNode node) {
    RMContainer rmContainer = node.getReservedContainer();
    unreserveInternal(priority, node);
    node.unreserveResource(this);
    clearReservation(node);
    getMetrics().unreserveResource(node.getPartition(),
        getUser(), rmContainer.getContainer().getResource());
  }

which in turn calls FSSchedulerNode.unreserveResource (the top frame of the first stack trace):

  @Override
  public synchronized void unreserveResource(
      SchedulerApplicationAttempt application) {
    // Cannot unreserve for wrong application...
    ApplicationAttemptId reservedApplication = 
        getReservedContainer().getContainer().getId().getApplicationAttemptId(); // the container's attemptId cannot be obtained here, hence the NullPointerException
    if (!reservedApplication.equals(
        application.getApplicationAttemptId())) {
      throw new IllegalStateException("Trying to unreserve " +  
          " for application " + application.getApplicationId() + 
          " when currently reserved " + 
          " for application " + reservedApplication.getApplicationId() + 
          " on node " + this);
    }
    
    setReservedContainer(null);
    this.reservedAppSchedulable = null;
  }
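
To make the failure mode concrete, here is a minimal sketch of the same pattern outside YARN. The classes and method names (ReservationSketch, Node, unreserveUnguarded, unreserveGuarded) are hypothetical stand-ins, not the real FSSchedulerNode API. It assumes the reservation has already been cleared by an earlier call, and shows that the unguarded path throws a NullPointerException while a null check turns the second call into a no-op.

```java
import java.util.Objects;

// Hypothetical stand-in for a scheduler node; not the real YARN classes.
class ReservationSketch {

    /** Minimal model of a node that holds at most one reservation. */
    static class Node {
        private String reservedAttemptId; // null means nothing is reserved

        void reserve(String attemptId) {
            this.reservedAttemptId = Objects.requireNonNull(attemptId);
        }

        /** Mirrors the unguarded logic: dereferences the reservation blindly. */
        void unreserveUnguarded(String attemptId) {
            // Throws NullPointerException when the reservation is already gone.
            if (!reservedAttemptId.equals(attemptId)) {
                throw new IllegalStateException("reserved for " + reservedAttemptId);
            }
            reservedAttemptId = null;
        }

        /** Guarded variant: returns false instead of failing on a second call. */
        boolean unreserveGuarded(String attemptId) {
            if (reservedAttemptId == null) {
                return false; // already released, nothing to do
            }
            if (!reservedAttemptId.equals(attemptId)) {
                throw new IllegalStateException("reserved for " + reservedAttemptId);
            }
            reservedAttemptId = null;
            return true;
        }
    }

    /** Returns true if unreserving twice through the unguarded path throws an NPE. */
    static boolean doubleUnreserveThrowsNpe() {
        Node node = new Node();
        node.reserve("attempt_1");
        node.unreserveUnguarded("attempt_1");
        try {
            node.unreserveUnguarded("attempt_1");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded double unreserve throws NPE: "
            + doubleUnreserveThrowsNpe());
        Node node = new Node();
        node.reserve("attempt_1");
        System.out.println("first guarded unreserve: " + node.unreserveGuarded("attempt_1"));
        System.out.println("second guarded unreserve: " + node.unreserveGuarded("attempt_1"));
    }
}
```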

 

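The second error:

rmContainer is null here. If removeApplicationAttempt and moveApplication are called for the same attempt in quick succession, the application attempt still holds a reference to the queue but has already been removed from the queue's application list.
Likewise, if two removeApplicationAttempt calls arrive back to back, the application still holds the queue reference but has already been removed from the queue's application list.
In both cases the second call has to arrive before the removeApplication call has been made.

In other words, the container is released a second time after it has already been released on that node, which leaves the state inconsistent.
A write lock is used here: once one thread has read the containerId and another thread releases the container, the second release fails with the exception.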
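
The double release described above can be sketched in isolation. The class below is a hypothetical model (IdempotentReleaseSketch and its methods are illustrative names, not YARN code). It assumes two duplicate removal events race for the same container, and shows one way to make the release idempotent: re-check the state while holding the lock, so whichever call arrives second finds nothing left to release.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model, not YARN code: tracks which containers are still live.
class IdempotentReleaseSketch {
    private final ReentrantReadWriteLock.WriteLock writeLock =
        new ReentrantReadWriteLock().writeLock();
    private final Set<String> liveContainers = new HashSet<>();
    private int releaseCount = 0;

    void allocate(String containerId) {
        writeLock.lock();
        try {
            liveContainers.add(containerId);
        } finally {
            writeLock.unlock();
        }
    }

    /** Releases the container once; later calls for the same id return false. */
    boolean release(String containerId) {
        writeLock.lock();
        try {
            // Check and update under the same lock, so a duplicate event
            // cannot act on state that another thread has already cleared.
            if (!liveContainers.remove(containerId)) {
                return false;
            }
            releaseCount++;
            return true;
        } finally {
            writeLock.unlock();
        }
    }

    /** Fires two duplicate release events from two threads and counts real releases. */
    static int releasesAfterDuplicateEvents() {
        IdempotentReleaseSketch scheduler = new IdempotentReleaseSketch();
        scheduler.allocate("container_1");
        Runnable duplicateEvent = () -> scheduler.release("container_1");
        Thread first = new Thread(duplicateEvent);
        Thread second = new Thread(duplicateEvent);
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return scheduler.releaseCount;
    }

    public static void main(String[] args) {
        System.out.println("effective releases: " + releasesAfterDuplicateEvents());
    }
}
```

Note that the lock alone only serializes the two calls; it is the state check inside the locked section that stops the second one from releasing again.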

Fix 1
 /**
   * Clean up a completed container.
   */
  @Override
  protected synchronized void completedContainerInternal(
      RMContainer rmContainer, ContainerStatus containerStatus,
      RMContainerEventType event) {
    try {
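      // writeLock.lock(); // write lock commented out; the method is synchronized (heavyweight lock) instead.
      // If the finally block still calls writeLock.unlock(), remove that too:
      // unlocking a write lock the thread does not hold throws IllegalMonitorStateException.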

      Container container = rmContainer.getContainer();

      // Get the application for the finished container
      FSAppAttempt application =
        getCurrentAttemptForContainer(container.getId());
      ApplicationId appId =
        container.getId().getApplicationAttemptId().getApplicationId();
      if (application == null) {
        LOG.info("Container " + container + " of" +
          " finished application " + appId +
          " completed with event " + event);
        return;
      }
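
One thing worth checking with this change: synchronized locks the scheduler object's monitor, which is independent of writeLock. If other scheduler methods still take writeLock, they are no longer mutually exclusive with this method. The sketch below uses a hypothetical class (MonitorVsWriteLockSketch, not YARN code) to show a synchronized method running to completion while another thread holds the write lock.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical class, not YARN code: one object with both kinds of lock.
class MonitorVsWriteLockSketch {
    private final ReentrantReadWriteLock.WriteLock writeLock =
        new ReentrantReadWriteLock().writeLock();
    private final AtomicBoolean ran = new AtomicBoolean(false);

    /** Guarded only by the object's monitor, not by writeLock. */
    synchronized void synchronizedCleanup() {
        ran.set(true);
    }

    /** Returns true if the synchronized method ran while writeLock was held. */
    static boolean synchronizedRunsWhileWriteLockHeld() {
        MonitorVsWriteLockSketch sketch = new MonitorVsWriteLockSketch();
        sketch.writeLock.lock(); // the calling thread now owns the write lock
        try {
            Thread worker = new Thread(sketch::synchronizedCleanup);
            worker.start();
            try {
                worker.join(5000); // the worker is not blocked by writeLock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return sketch.ran.get();
        } finally {
            sketch.writeLock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("synchronized method ran while write lock was held: "
            + synchronizedRunsWhileWriteLockHeld());
    }
}
```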

Fix 2

      // Get the node on which the container was allocated
      FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());
      try { // the try was moved up from the else branch so that it also covers unreserve()
        if (rmContainer.getState() == RMContainerState.RESERVED) {
          application.unreserve(rmContainer.getReservedPriority(), node);
        } else {
          application.containerCompleted(rmContainer, containerStatus, event);
          node.releaseContainer(rmContainer.getContainerId(), false);
          updateRootQueueMetrics();
          LOG.info("Application attempt " + application.getApplicationAttemptId()
              + " released container " + container.getId() + " on node: " + node
              + " with event: " + event);
        }
      } catch (Exception e) {
        LOG.error(e.getMessage(), e); // handle the exception here instead of letting it propagate
      }
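
The effect of widening the try block can be illustrated with a small event loop. This is a sketch under assumptions, not the ResourceManager's dispatcher: GuardedHandlerSketch, dispatch, guarded and cleanup are hypothetical names, and the "stale" event stands in for a removal whose reservation is already gone. An uncaught exception stops the loop (in the logs above the real RM logs FATAL and exits), while the guarded handler records the error and carries on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical event loop, not the real ResourceManager dispatcher.
class GuardedHandlerSketch {

    /** Feeds events to the handler and returns how many were handled
     *  before the first uncaught exception stopped the loop. */
    static int dispatch(List<String> events, Consumer<String> handler) {
        int handled = 0;
        try {
            for (String event : events) {
                handler.accept(event);
                handled++;
            }
        } catch (RuntimeException e) {
            // Stands in for "log FATAL and exit": nothing after this event runs.
        }
        return handled;
    }

    /** Wraps a handler so that failures are recorded instead of propagated. */
    static Consumer<String> guarded(Consumer<String> delegate, List<String> errors) {
        return event -> {
            try {
                delegate.accept(event);
            } catch (RuntimeException e) {
                errors.add(event + " failed: " + e);
            }
        };
    }

    /** Cleanup that fails on a stale event, like the duplicate removal above. */
    static void cleanup(String event) {
        if (event.startsWith("stale")) {
            throw new NullPointerException("reservation already released");
        }
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("remove_1", "stale_remove_1", "remove_2");
        List<String> errors = new ArrayList<>();
        System.out.println("unguarded handled: "
            + dispatch(events, GuardedHandlerSketch::cleanup));
        System.out.println("guarded handled: "
            + dispatch(events, guarded(GuardedHandlerSketch::cleanup, errors)));
        System.out.println("recorded errors: " + errors.size());
    }
}
```

Catching the exception keeps the process alive, but it does not repair whatever state was inconsistent, so the logged errors are still worth following up.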

 

 

