Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0-alpha4, 3.1.1, 3.3.0
None
Description
Two exception cases:
The first case:
The exception desc:
14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:748)
- ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
- As shown in the following figure, during the toStandby process Thread_1 resets activeServices to null while reinitializing it. At this point, Thread_2 uses activeServices while executing the handleTransitionToStandByInNewThread method, which results in a NullPointerException and causes the ResourceManager server to exit.
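The interleaving described above can be reproduced deterministically in a few lines. The sketch below is only an illustration under stated assumptions: the class ActiveServicesRace, its nested Services type, and all method names are hypothetical stand-ins, not code from ResourceManager. The two "threads" are replayed sequentially in the problematic order so the outcome does not depend on timing.

```java
// Hypothetical stand-in for the shared state in the first case; not ResourceManager code.
final class ActiveServicesRace {
    // Placeholder for the active services object that both events touch.
    static final class Services {
        void stop() { /* nothing to do in this sketch */ }
    }

    private Services activeServices = new Services();

    // Step taken by Thread_1: the field is null until reinitialization finishes.
    void beginReinitialize() {
        activeServices = null;
    }

    void finishReinitialize() {
        activeServices = new Services();
    }

    // Step taken by Thread_2: dereferences the shared field without any coordination.
    void handleFatalEvent() {
        activeServices.stop();
    }

    // Replays the bad interleaving: Thread_2's step runs between Thread_1's two steps.
    static boolean interleavingThrowsNpe() {
        ActiveServicesRace rm = new ActiveServicesRace();
        rm.beginReinitialize();
        try {
            rm.handleFatalEvent();
            return false;
        } catch (NullPointerException e) {
            return true;
        } finally {
            rm.finishReinitialize();
        }
    }
}
```

Calling ActiveServicesRace.interleavingThrowsNpe() returns true, which mirrors the NullPointerException seen in the dispatcher thread.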
The second case:
The exception desc:
06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    ... 4 more
Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
    ... 5 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
    ... 6 more
06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
- ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a toStandby event at 06:17:35. The two asynchronous events are referred to as Thread_1 and Thread_2.
- While Thread_1 executes, CapacityScheduler.reinitialize is called to refresh the scheduler configuration. At this time, the csConfProvider property of the CapacityScheduler has not been initialized and its value is null. As a result, reinitialize dereferences csConfProvider, which triggers a NullPointerException and causes Thread_1's transition to active to fail.
Solution
The locks in ResourceManager's transitionToActive and transitionToStandby methods cover only a limited scope, so events triggered asynchronously outside that scope can interfere with each other and lead to unpredictable failures such as the two cases above. The proposed solution is to encapsulate each asynchronous task as a TransitionToActiveStandbyRunner and enqueue the runners so that a SingleThreadExecutor executes them in order. This removes the concurrency between transitions and makes switching between the active and standby states clearer and more controllable.
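A minimal sketch of the serialization idea, assuming nothing about the actual patch: the class SerializedTransitions, its submit method, and the string states are hypothetical names chosen for illustration. It relies only on the documented behavior of Executors.newSingleThreadExecutor, which runs submitted tasks sequentially on one worker thread, so one transition always finishes before the next begins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of queueing HA transitions; not the actual patch.
final class SerializedTransitions {
    // A single worker thread executes tasks one at a time, in submission order.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Both fields are only read and written by the worker thread while it is running.
    private final List<String> log = new ArrayList<>();
    private String haState = "STANDBY";

    // Any caller (elector, state store, admin RPC) enqueues its request instead of running it inline.
    Future<String> submit(String source, String targetState) {
        Callable<String> task = () -> {
            log.add(source + ":begin->" + targetState);
            haState = targetState;
            log.add(source + ":end->" + targetState);
            return haState;
        };
        return executor.submit(task);
    }

    // Stops accepting work, waits for queued transitions, then exposes the execution log.
    List<String> shutdownAndGetLog() throws InterruptedException {
        executor.shutdown();
        if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("queued transitions did not finish in time");
        }
        return log;
    }
}
```

With this shape, a toActive request from the elector and a toStandby request from the state store that arrive at the same moment are still applied one after the other, so neither observes the other's half-finished state.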
TransitionToActiveStandbyRunner and Subclasses
TransitionToActiveStandbyRunner
TransitionToActiveStandbyRunner is a template class where the logic for different scenarios is placed and executed within the doTransaction method.
public abstract class TransitionToActiveStandbyRunner
    implements Callable<TransitionToActiveStandbyResult> {

  @Override
  public TransitionToActiveStandbyResult call() throws Exception {
    // ... before log ...
    TransitionToActiveStandbyResult result = doTransaction();
    // ... after log ...
    return result;
  }

  public abstract TransitionToActiveStandbyResult doTransaction();
}
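To show how a concrete runner could plug into this template, here is a self-contained sketch. The shape of TransitionToActiveStandbyResult (a success flag and a message) and the DemoToStandbyRunner subclass are assumptions made for illustration; the issue does not show the result type's fields or any subclass bodies. The abstract class is repeated, with simple print statements standing in for the elided logging, only so the example compiles on its own.

```java
import java.util.concurrent.Callable;

// Hypothetical result type: the issue does not show its fields.
final class TransitionToActiveStandbyResult {
    final boolean success;
    final String message;

    TransitionToActiveStandbyResult(boolean success, String message) {
        this.success = success;
        this.message = message;
    }
}

// Template class: call() fixes the surrounding steps, subclasses supply doTransaction().
abstract class TransitionToActiveStandbyRunner
        implements Callable<TransitionToActiveStandbyResult> {

    @Override
    public TransitionToActiveStandbyResult call() throws Exception {
        // Stand-in for the "before" logging elided in the issue.
        System.out.println("begin " + getClass().getSimpleName());
        TransitionToActiveStandbyResult result = doTransaction();
        // Stand-in for the "after" logging elided in the issue.
        System.out.println("end " + getClass().getSimpleName());
        return result;
    }

    public abstract TransitionToActiveStandbyResult doTransaction();
}

// Hypothetical subclass: a real runner would invoke the transition logic here.
final class DemoToStandbyRunner extends TransitionToActiveStandbyRunner {
    @Override
    public TransitionToActiveStandbyResult doTransaction() {
        return new TransitionToActiveStandbyResult(true, "standby");
    }
}
```

Because each runner is a Callable, it can be handed directly to an ExecutorService, and the caller can read the outcome from the returned Future.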
Subclasses
AdminServiceToActiveRunner
AdminServiceToActiveRunner encapsulates the logic of the transitionToActive method in AdminService, handling the requests from clients and ActiveStandbyElector to transition to the active state.
AdminServiceToStandbyRunner
AdminServiceToStandbyRunner encapsulates the logic of the transitionToStandby method in AdminService, handling the requests from clients and ActiveStandbyElector to transition to the standby state.
RmStartAndStopToStandby
RmStartAndStopToStandby is used for transitioning the ResourceManager service to standby when it is starting or stopping.
RMStartToActiveRunner
RMStartToActiveRunner is used for transitioning the ResourceManager service to active when it is starting.
RMFatalToStandbyRunner
RMFatalToStandbyRunner is used to handle an RMFatalEvent by transitioning to standby when YARN HA mode is enabled.