Details
- Type: Bug
- Status: Patch Available
- Priority: Major
- Resolution: Unresolved
Description
When NM is unable to connect to RM, NM shuts itself down.
    } catch (ConnectException e) {
      // catch and throw the exception if tried MAX wait time to connect RM
      dispatcher.getEventHandler().handle(
          new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
      throw new YarnRuntimeException(e);
In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring the NMs back up.
Setting yarn.resourcemanager.connect.max-wait.ms to -1 has other side effects: non-connection failures are then retried indefinitely by all YarnClients (via RMProxy).
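To illustrate the distinction drawn in the description, here is a minimal sketch of a retry loop that retries connection failures (optionally forever) but lets every other failure propagate on the first attempt. It is not the NodeManager or RMProxy code: the class name ConnectRetrySketch, the method callWithRetry, and the convention that a negative maxWaitMs means "retry forever" are all assumptions made for this example.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Hadoop: retries only on ConnectException.
public class ConnectRetrySketch {

    /**
     * Runs the action until it succeeds or the wait budget is used up.
     * A negative maxWaitMs is treated here as "retry connection failures forever"
     * (an assumed convention for this sketch).
     */
    static <T> T callWithRetry(Callable<T> action, long retryIntervalMs, long maxWaitMs)
            throws Exception {
        long deadline = maxWaitMs < 0
                ? Long.MAX_VALUE
                : System.currentTimeMillis() + maxWaitMs;
        while (true) {
            try {
                return action.call();
            } catch (ConnectException e) {
                // Connection failure: the remote side may be down for maintenance.
                // Give up only if the next retry would exceed the wait budget.
                if (System.currentTimeMillis() + retryIntervalMs > deadline) {
                    throw e; // the caller decides whether to shut down or keep waiting
                }
                Thread.sleep(retryIntervalMs);
            }
            // Any other exception is not caught and propagates immediately.
        }
    }
}
```

With this shape, a caller in the NM's position could pass a negative maxWaitMs to ride out a long RM outage, while a non-connection error (for example an authorization failure) would still surface at once instead of being retried indefinitely.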
Attachments
Issue Links
is related to
- YARN-196 Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started (Closed)
- YARN-479 NM retry behavior for connection to RM should be similar for lost heartbeats (Closed)
- YARN-3668 Long run service shouldn't be killed even if Yarn crashed (Open)