[YARN-11718] Provide config option to not shutdown NM if it is decommissioned - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: resourcemanager
Labels: None

Description

Currently, an NM cannot be started if it is marked as decommissioned on the RM (that is, if it is in the exclude list), because the RM responds with a SHUTDOWN signal when the NM tries to send a heartbeat after starting up:

https://github.com/apache/hadoop/blob/1655acc5e2d5fe27e01f46ea02bd5a7dea44fe12/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java#L455-L465

    // Check if this node is a 'valid' node
    if (!this.nodesListManager.isValidNode(host) &&
        !isNodeInDecommissioning(nodeId)) {
      String message =
          "Disallowed NodeManager from  " + host
              + ", Sending SHUTDOWN signal to the NodeManager.";
      LOG.info(message);
      response.setDiagnosticsMessage(message);
      response.setNodeAction(NodeAction.SHUTDOWN);
      return response;
    }

This couples the start/stop operations of the NM service very tightly to its state in the RM, making it difficult to manage large fleets of NMs independently of the RM.

For example, with such an option, after an NM OS upgrade we would be able to start the NM, recommission it, and then check its state without worrying about the order of the start and recommission operations. This matters especially when we don't control the start operation, which is the case in large companies where the start is part of the OS upgrade pipeline. The current behavior can also cause deployment failures on decommissioned nodes if the deployment pipeline checks for the running service before marking the deploy as succeeded.

The patch will look something like this:

    // Check if this node is a 'valid' node
    if (!this.nodesListManager.isValidNode(host) &&
        !isNodeInDecommissioning(nodeId) &&
+       !this.noNMShutdownForInvalidNodes) {
      String message =
          "Disallowed NodeManager from  " + host
              + ", Sending SHUTDOWN signal to the NodeManager.";
      LOG.info(message);
      response.setDiagnosticsMessage(message);
      response.setNodeAction(NodeAction.SHUTDOWN);
      return response;
    }
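
To make the intended behavior concrete, here is a minimal, self-contained sketch of the decision the proposed flag would change. It does not use any Hadoop classes: the class name `NodeAdmissionSketch`, the property key `example.rm.no-nm-shutdown-for-invalid-nodes`, and the use of `java.util.Properties` in place of the YARN configuration are all assumptions made for illustration. Only the field name `noNMShutdownForInvalidNodes` and the shape of the condition come from the patch above. What the RM should then do with an excluded node that is allowed to stay up is outside the scope of this sketch.

```java
import java.util.Properties;

// Illustrative sketch only; apart from the flag name, nothing here is taken
// from the Hadoop code base.
class NodeAdmissionSketch {
  enum NodeAction { NORMAL, SHUTDOWN }

  // Hypothetical property key; the issue does not name one.
  static final String NO_SHUTDOWN_KEY =
      "example.rm.no-nm-shutdown-for-invalid-nodes";

  private final boolean noNMShutdownForInvalidNodes;

  NodeAdmissionSketch(Properties conf) {
    // Default to false so the existing SHUTDOWN behavior is kept unless
    // an operator opts in.
    this.noNMShutdownForInvalidNodes =
        Boolean.parseBoolean(conf.getProperty(NO_SHUTDOWN_KEY, "false"));
  }

  // Mirrors the condition in the proposed patch: SHUTDOWN is sent only when
  // the node is invalid, is not decommissioning, and the flag is off.
  NodeAction decide(boolean isValidNode, boolean isDecommissioning) {
    if (!isValidNode && !isDecommissioning && !noNMShutdownForInvalidNodes) {
      return NodeAction.SHUTDOWN;
    }
    return NodeAction.NORMAL;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Flag unset: an excluded node is told to shut down.
    System.out.println(new NodeAdmissionSketch(conf).decide(false, false));
    // Flag set: the same node is no longer told to shut down.
    conf.setProperty(NO_SHUTDOWN_KEY, "true");
    System.out.println(new NodeAdmissionSketch(conf).decide(false, false));
  }
}
```

Defaulting the flag to `false` keeps today's behavior for clusters that do not set it, which seems the safer choice for an option that relaxes an existing check.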

People

Assignee: Unassigned

Reporter: Aswin M Prabhu

Votes: 0

Watchers: 1

Dates

Created: 31/Aug/24 18:27

Updated: 03/Sep/24 04:14