Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
2.7.1
None
Description
We see issues when the RM tries to launch a container while an NM is restarting: the launch fails with exceptions such as NMNotReadyException. YARN-3842 added retries for other clients of the NM (mainly AMs), but AMLauncher in the RM does not use them, so these intermittent errors cause job failures. This can manifest during a rolling restart of NMs.
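To illustrate the kind of retry behavior being asked for, here is a minimal, self-contained sketch. It does not use the Hadoop API and is not the actual fix: NotReadyException is a hypothetical stand-in for NMNotReadyException, and callWithRetry, maxAttempts, and sleepMillis are names made up for this example. The real retry policy in YARN may differ.

```java
import java.util.concurrent.Callable;

public class RetrySketch {

    /** Hypothetical stand-in for a transient "NM not ready" failure; not the Hadoop class. */
    static class NotReadyException extends Exception {
        NotReadyException(String message) {
            super(message);
        }
    }

    /**
     * Runs the action, retrying only when it throws NotReadyException.
     * Makes at most maxAttempts calls and sleeps sleepMillis between them.
     * Any other exception is propagated immediately.
     */
    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (NotReadyException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up: the node stayed unavailable too long
                }
                Thread.sleep(sleepMillis); // wait before trying the launch again
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated launch: fails twice as if the NM were restarting, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new NotReadyException("NM still restarting");
            }
            return "container launched";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With a wrapper like this around the launch call, a short NM restart window would be absorbed by the retries instead of surfacing as a failed AM launch, while a persistent failure would still be reported once the attempts run out.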