Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.3
Fix Version/s: None
Component/s: None
Description
This problem occurred in our environment. The sequence of events was as follows:
- A network exception occurs between the ResourceManager and ZooKeeper.
- The ResourceManager reconnects to ZooKeeper successfully.
- The ZooKeeper session expires.
- The ResourceManager creates a new ZooKeeper client and reconnects with it.
- If the reconnect to ZooKeeper fails, an RMFatalEvent is triggered.
- A new thread is then started to keep reconnecting and rejoin the election. However, the variable hasAlreadyRun allows the runnable to run only once, so if the reconnect still fails, there is no further chance to reconnect:
private class StandByTransitionRunnable implements Runnable {
  // The atomic variable to make sure multiple threads with the same runnable
  // run only once.
  private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

  @Override
  public void run() {
    // Run this only once, even if multiple threads end up triggering
    // this simultaneously.
    if (hasAlreadyRun.getAndSet(true)) {
      return;
    }

    if (rmContext.isHAEnabled()) {
      try {
        // Transition to standby and reinit active services
        LOG.info("Transitioning RM to Standby mode");
        transitionToStandby(true);
        EmbeddedElector elector = rmContext.getLeaderElectorService();
        if (elector != null) {
          elector.rejoinElection();
        }
      } catch (Exception e) {
        LOG.error(FATAL, "Failed to transition RM to Standby mode.", e);
        ExitUtil.terminate(1, e);
      }
    }
  }
}
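The effect of the hasAlreadyRun guard can be reproduced in isolation. The following is only a minimal sketch, assuming the same runnable instance is triggered repeatedly as described above; OneShotGuardDemo, OneShotRunnable and countAttempts are hypothetical names for illustration and are not part of YARN.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class OneShotGuardDemo {

  /** Hypothetical stand-in for StandByTransitionRunnable (not the YARN class). */
  static class OneShotRunnable implements Runnable {
    // Same one-shot guard as in the reported code.
    private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
    // Counts how many times the body (the "reconnect attempt") really ran.
    private final AtomicInteger attempts;

    OneShotRunnable(AtomicInteger attempts) {
      this.attempts = attempts;
    }

    @Override
    public void run() {
      if (hasAlreadyRun.getAndSet(true)) {
        return; // every trigger after the first is silently dropped
      }
      attempts.incrementAndGet(); // placeholder for transition + rejoinElection
    }
  }

  /** Triggers the same runnable several times and returns the real attempt count. */
  static int countAttempts(int triggers) {
    AtomicInteger attempts = new AtomicInteger();
    Runnable runnable = new OneShotRunnable(attempts);
    for (int i = 0; i < triggers; i++) {
      runnable.run();
    }
    return attempts.get();
  }

  public static void main(String[] args) {
    // Three triggers, but only the first one does any work.
    System.out.println("attempts = " + countAttempts(3));
  }
}
```

However many times the runnable is triggered, the body executes once, so a failed first attempt is never retried by the same instance.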
So I think using a lock here, instead of the run-once hasAlreadyRun flag, would be better.
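One possible shape of the lock-based idea is sketched below. This is not a patch against ResourceManager: LockGuardedRetry, RetryableTransition, the rejoin supplier and attemptsUntilSuccess are hypothetical names, and the supplier merely stands in for "transition to standby and rejoin the election". The sketch assumes that simultaneous triggers should still be collapsed into one, while a later trigger must be allowed to try again.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BooleanSupplier;

public class LockGuardedRetry {

  /** Hypothetical replacement for the one-shot runnable (illustration only). */
  static class RetryableTransition implements Runnable {
    private final ReentrantLock lock = new ReentrantLock();
    // Assumption: returns true when the reconnect/rejoin succeeded.
    private final BooleanSupplier rejoin;
    private final AtomicInteger attempts = new AtomicInteger();
    private volatile boolean lastSucceeded;

    RetryableTransition(BooleanSupplier rejoin) {
      this.rejoin = rejoin;
    }

    @Override
    public void run() {
      // If another thread is already handling the transition, skip this
      // trigger; unlike the AtomicBoolean, the guard is released afterwards.
      if (!lock.tryLock()) {
        return;
      }
      try {
        attempts.incrementAndGet();
        lastSucceeded = rejoin.getAsBoolean();
      } finally {
        lock.unlock(); // a later trigger can retry
      }
    }

    int attempts() {
      return attempts.get();
    }

    boolean lastSucceeded() {
      return lastSucceeded;
    }
  }

  /** Triggers the runnable until it succeeds (at most 10 times); returns attempts made. */
  static int attemptsUntilSuccess(int failuresBeforeSuccess) {
    AtomicInteger remainingFailures = new AtomicInteger(failuresBeforeSuccess);
    RetryableTransition transition =
        new RetryableTransition(() -> remainingFailures.getAndDecrement() <= 0);
    for (int i = 0; i < 10 && !transition.lastSucceeded(); i++) {
      transition.run();
    }
    return transition.attempts();
  }

  public static void main(String[] args) {
    // Two simulated failures, then success on the third trigger.
    System.out.println("attempts = " + attemptsUntilSuccess(2));
  }
}
```

With tryLock, a trigger that arrives while a transition is in progress is dropped, as before, but the guard is released once the attempt finishes, so a failed reconnect no longer blocks every future attempt. Whether a real fix should also bound or delay the retries is left open here.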