Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
Currently, an NM cannot be started if it is marked as decommissioned on the RM (that is, it is in the exclude list), because the RM sends it a SHUTDOWN signal when the NM tries to send a heartbeat after starting up:
    // Check if this node is a 'valid' node
    if (!this.nodesListManager.isValidNode(host) &&
        !isNodeInDecommissioning(nodeId)) {
      String message = "Disallowed NodeManager from " + host
          + ", Sending SHUTDOWN signal to the NodeManager.";
      LOG.info(message);
      response.setDiagnosticsMessage(message);
      response.setNodeAction(NodeAction.SHUTDOWN);
      return response;
    }
This couples the start/stop operations of the NM service very tightly to its state in the RM, which makes it difficult to manage large fleets of NMs independently of the RM.
For example, if the RM did not shut down such NMs, then after an NM OS upgrade we could start the NM, recommission it, and then check its state without worrying about the order of the start and recommission operations. This matters especially when we don't control the start operation, as is the case in large companies where the start operation is part of the OS upgrade pipeline. The current behavior can also cause deployment failures on decommissioned nodes if the deployment pipeline checks that the service is running before marking the deploy as succeeded.
The patch will look something like this:
    // Check if this node is a 'valid' node
    if (!this.nodesListManager.isValidNode(host) &&
        !isNodeInDecommissioning(nodeId) &&
+       !this.noNMShutdownForInvalidNodes) {
      String message = "Disallowed NodeManager from " + host
          + ", Sending SHUTDOWN signal to the NodeManager.";
      LOG.info(message);
      response.setDiagnosticsMessage(message);
      response.setNodeAction(NodeAction.SHUTDOWN);
      return response;
    }
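To make the intended behavior easier to reason about, here is a minimal standalone sketch of the gated check. It is not the actual RM code: the class `RegistrationCheckSketch`, its host sets, and the `onRegister` method are hypothetical stand-ins for the RM's node list handling, and only the flag name `noNMShutdownForInvalidNodes` comes from the proposed patch. How the flag would be configured, and what the RM does with an excluded node once it is allowed through, are left open here.

```java
import java.util.Set;

// Hypothetical, simplified model of the check; not the real RM classes.
public class RegistrationCheckSketch {

    // Stand-in for the RM's NodeAction; only the two values needed here.
    public enum NodeAction { NORMAL, SHUTDOWN }

    private final Set<String> excludedHosts;        // hosts in the exclude list
    private final Set<String> decommissioningHosts; // hosts still being decommissioned
    private final boolean noNMShutdownForInvalidNodes; // flag from the proposed patch

    public RegistrationCheckSketch(Set<String> excludedHosts,
                                   Set<String> decommissioningHosts,
                                   boolean noNMShutdownForInvalidNodes) {
        this.excludedHosts = excludedHosts;
        this.decommissioningHosts = decommissioningHosts;
        this.noNMShutdownForInvalidNodes = noNMShutdownForInvalidNodes;
    }

    // Mirrors the condition in the patch: SHUTDOWN is sent only when the node
    // is invalid, is not decommissioning, and the new flag is off.
    public NodeAction onRegister(String host) {
        boolean validNode = !excludedHosts.contains(host);
        boolean decommissioning = decommissioningHosts.contains(host);
        if (!validNode && !decommissioning && !noNMShutdownForInvalidNodes) {
            return NodeAction.SHUTDOWN;
        }
        // In this sketch, NORMAL means "fall through to the rest of the
        // registration logic"; the real handling is outside its scope.
        return NodeAction.NORMAL;
    }

    public static void main(String[] args) {
        Set<String> excluded = Set.of("nm1.example.com");
        RegistrationCheckSketch current =
            new RegistrationCheckSketch(excluded, Set.of(), false);
        RegistrationCheckSketch proposed =
            new RegistrationCheckSketch(excluded, Set.of(), true);
        System.out.println("flag off: " + current.onRegister("nm1.example.com"));
        System.out.println("flag on:  " + proposed.onRegister("nm1.example.com"));
    }
}
```

With the flag off, the sketch reproduces today's behavior for an excluded host; with the flag on, the same host is no longer told to shut down, so the NM service can be started before or after it is recommissioned.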