[YARN-11292] resourcemanager no longer reconnects to zk - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.3
Fix Version/s: None
Component/s: resourcemanager
Labels: None

Description

This problem occurred in our environment. The sequence of events was as follows:

1. A network exception occurs between the ResourceManager and ZooKeeper.
2. The ResourceManager reconnects to ZooKeeper successfully.
3. The ZooKeeper session expires.
4. The ResourceManager creates a new ZooKeeper client and tries to reconnect.
5. If reconnecting to ZooKeeper fails, an RMFatalEvent is triggered.
6. A new thread is then started to keep reconnecting and to rejoin the election. However, the variable hasAlreadyRun ensures that the runnable runs only once, so if this reconnect also fails, there is no further chance to reconnect.

    private class StandByTransitionRunnable implements Runnable {
      // The atomic variable to make sure multiple threads with the same runnable
      // run only once.
      private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

      @Override
      public void run() {
        // Run this only once, even if multiple threads end up triggering
        // this simultaneously.
        if (hasAlreadyRun.getAndSet(true)) {
          return;
        }

        if (rmContext.isHAEnabled()) {
          try {
            // Transition to standby and reinit active services
            LOG.info("Transitioning RM to Standby mode");
            transitionToStandby(true);
            EmbeddedElector elector = rmContext.getLeaderElectorService();
            if (elector != null) {
              elector.rejoinElection();
            }
          } catch (Exception e) {
            LOG.error(FATAL, "Failed to transition RM to Standby mode.", e);
            ExitUtil.terminate(1, e);
          }
        }
      }
    }

So I think it would be better to use a lock here instead.
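To illustrate the suggestion, here is a minimal sketch of a lock-based guard. It is not the actual ResourceManager code: the class `StandbyTransitionSketch`, the `Transition` interface, and `RetryableTransitionRunnable` are hypothetical names, and `Transition.run()` only stands in for the `transitionToStandby(true)` plus `rejoinElection()` work. The idea is that `tryLock()` still prevents concurrent threads from running the transition at the same time, but the lock is released afterwards, so a later trigger can retry after a failure.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class StandbyTransitionSketch {

  /** Hypothetical stand-in for "transition to standby and rejoin election". */
  interface Transition {
    void run() throws Exception;
  }

  /** Runs the transition under a lock instead of a run-once flag. */
  static class RetryableTransitionRunnable implements Runnable {
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicInteger attempts = new AtomicInteger();
    private final Transition transition;

    RetryableTransitionRunnable(Transition transition) {
      this.transition = transition;
    }

    @Override
    public void run() {
      // Skip if another thread is currently performing the transition.
      if (!lock.tryLock()) {
        return;
      }
      try {
        attempts.incrementAndGet();
        transition.run();
      } catch (Exception e) {
        // Assumption for this sketch: log and let a later trigger retry.
        System.err.println("Transition failed, a later trigger may retry: " + e);
      } finally {
        // Releasing the lock is what allows the next attempt to run.
        lock.unlock();
      }
    }

    int attempts() {
      return attempts.get();
    }
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    // Simulated transition: the first call fails, the second succeeds.
    RetryableTransitionRunnable r = new RetryableTransitionRunnable(() -> {
      if (calls.incrementAndGet() == 1) {
        throw new IllegalStateException("zk unreachable");
      }
    });
    r.run(); // fails, but the guard is released
    r.run(); // runs again instead of returning early
    System.out.println("attempts=" + r.attempts());
  }
}
```

With a run-once `AtomicBoolean`, the second `run()` call would return immediately. In this sketch it executes again. Whether a failed transition should be retried or should terminate the process, as the original `catch` block does, is a design decision this example does not settle.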

People

Assignee: Unassigned
Reporter: chenwencan
Votes: 0
Watchers: 1

Dates

Created: 02/Sep/22 09:07
Updated: 02/Sep/22 09:07