Details
- Sub-task
- Status: Resolved
- Critical
- Resolution: Fixed
- 3.4.0
- None
- Reviewed
Description
In the current implementation, NodesListManager event handling blocks the async dispatcher, which slows down event processing and can crash the RM.
- Cluster restart with 1K running apps: each node usable event creates 1K events, so overall a 5K-node cluster could generate 5K * 1K events.
- Event processing is blocked until the new events have been added to the queue.
Solution:
- Add another async event handler, similar to the one used for the scheduler.
- Instead of adding events to the dispatcher, call the RMApp event handler directly.
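The two points above can be illustrated with a minimal, self-contained Java sketch. It is not the actual YARN patch: `DedicatedEventHandler`, `FanOutDemo`, and the string-based events are hypothetical names chosen for illustration. The idea shown is that the central dispatcher thread only enqueues the node event and returns, while a separate worker thread performs the per-app fan-out by calling each app's handler directly rather than pushing one event per app back onto the shared queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: an event handler with its own queue and worker thread. */
final class DedicatedEventHandler<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Thread worker;
    private volatile boolean stopped = false;

    DedicatedEventHandler(String name, Consumer<E> processor) {
        worker = new Thread(() -> {
            // Keep draining until stop() was requested and the queue is empty.
            while (!stopped || !queue.isEmpty()) {
                try {
                    E event = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (event != null) {
                        processor.accept(event);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, name);
        worker.setDaemon(true);
        worker.start();
    }

    /** Called by the central dispatcher: enqueue only and return immediately. */
    void handle(E event) {
        queue.add(event);
    }

    /** Requests shutdown and waits for already queued events to be processed. */
    void stop() {
        stopped = true;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

final class FanOutDemo {
    /** Simulates `nodes` node events fanned out to `apps` running apps. */
    static List<String> run(int nodes, int apps) {
        List<String> delivered = Collections.synchronizedList(new ArrayList<>());
        DedicatedEventHandler<String> nodeEvents =
            new DedicatedEventHandler<>("node-list-events", node -> {
                // Direct call per app; nothing is added to the central queue.
                for (int a = 0; a < apps; a++) {
                    delivered.add("app_" + a + " <- " + node);
                }
            });
        for (int n = 0; n < nodes; n++) {
            nodeEvents.handle("node_" + n + " usable"); // cheap for the caller
        }
        nodeEvents.stop();
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(run(5, 3).size() + " app notifications delivered");
    }
}
```

In this sketch the cost seen by the caller of `handle` is a single queue insertion per node event, regardless of how many apps are running; the nodes-times-apps work happens on the dedicated thread.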
Attachments
Issue Links
- is related to
  - YARN-3990 AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected (Closed)
- relates to
  - YARN-10739 GenericEventHandler.printEventQueueDetails causes RM recovery to take too much time (Resolved)