Uploaded image for project: 'Apache Curator'
  1. Apache Curator
  2. CURATOR-561

Race condition preventing reconnection

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 4.2.0
    • 5.2.1
    • None
    • None

    Description

      Heavy GC caused several rounds of suspend, lost, reconnected - as you'd expect. But eventually it failed to reconnect, even though ZK was up the entire time. My thread waited on acquireLock until the retry period was up. I believe I've tracked this down to a race condition between the Curator thread which calls ZookKeeperTestable#injectSessionExpiration to post the expiry and death events and the Zookeeper SendThread which posts the same.

       

      The EventThread is meant to receive the expiry event first so it can set up its replacement, then it receives the death event making it shut down.

      The race condition allows these two events to be received in the wrong order so the EventThread shuts down before a replacement can be set up.

       

      The "Curator-ConnectionStateManager" thread calls injectSessionExpiration which calls EventThread#queueEvent. It goes as far as my comment in the next code snippit, sets the status and let's say it pauses here, for whatever reason.

       

      private void queueEvent(WatchedEvent event,
              Set<Watcher> materializedWatchers) {
          if (event.getType() == EventType.None
                  && sessionState == event.getState()) {
              return;
          }
          sessionState = event.getState();   // The curator thread sets the status here
        // It then goes onto send the expiry event, but for now it's here..the event has not yet been sent

       

      The SendThread then goes into the same method with its Expiry message, but the state has already been set to expired by the curator thread so it returns without sending the expiry message. It then sends the event of death:

      // from ClientCnxn#onConnected
      eventThread.queueEvent(new WatchedEvent(
              Watcher.Event.EventType.None,
              Watcher.Event.KeeperState.Expired, null));
      eventThread.queueEventOfDeath(); 

       

      The EventThread receives this event of death and shuts down. The curator thread then continues and sends the original expiry message but it's now too late. With no EventThread created it remains in suspended state with any operations just retrying until they give up.

       

      I find it difficult to explan race conditions, but hopefully that made sense. I can recreate the issue in Intellij using Suspend-All break points to cause the session expiry and Suspend-Thread break points to order the threads in the correct way:

       

      Run the following and put a suspend-all break point at the System.out

      package com.gebatech.curator;
      
      import org.apache.curator.framework.CuratorFramework;
      import org.apache.curator.framework.CuratorFrameworkFactory;
      import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
      import org.apache.curator.retry.BoundedExponentialBackoffRetry;
      
      /**
       *
       */
      public class GoCurator {
          public static void main(String[] args) throws Exception {
      
              CuratorFramework cf = CuratorFrameworkFactory.builder()
                      .connectString("localhost:2181")
                      .retryPolicy(new BoundedExponentialBackoffRetry(200, 10000, 20))
                      .sessionTimeoutMs(30 * 1000)
                      .build();
      
              cf.start();
      
              InterProcessReadWriteLock lcok = new InterProcessReadWriteLock(cf, "/grant/myLock");
      
              System.out.println("Set a Suspend-ALL break point here and leave it until the session expires, 30 seconds ish");
      
              lcok.readLock().acquire();
      
              System.out.println();
          }
      } 

      Whlst suspended here, put a Suspend-Thread break poing in EventThread#queueEvent, on the if statement from the first code snippet.

       

      When you see the following log line, you can let the program resume;

      2020-02-20 20:15:05,912 [myid:] - INFO [SessionTracker:ZooKeeperServer@398] - Expiring session 0x1006300f2390000, timeout of 30000ms exceeded 

      The SendThread will hit your break point first with a Disconnect event, you can let this run.

      Next you'll get the SendThread with the Expiry event, wait here until you get the popup telling you that the Curator thread has hit the same break point. Switch to this and step through until it has set the expiry status, then leave it held there and switch back to the SendThread. Let the sendThread run, then remove break points and let everything else run.

       

      The program will then be stuck and you'll see this stacktrace for the acquire lock method

      "main@1" prio=5 tid=0x1 nid=NA waiting"main@1" prio=5 tid=0x1 nid=NA waiting  java.lang.Thread.State: WAITING   at java.lang.Object.wait(Object.java:-1)   at java.lang.Object.wait(Object.java:502)   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1499)   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1487)   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1547)   at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)   at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)   at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)   at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)   at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)   at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)   at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)   at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)   at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)   at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)   at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225)   at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237)   at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)   at com.gebatech.curator.GoCurator.main(GoCurator.java:26) 

       

      I know this issue falls between Zookeeper and Curator but thought I'd report it in case you can think of a solution. I think ideally they'd send a single expiry event with a death flag set or, performance impact allowing, synchronize that method.

      Attachments

        Issue Links

          Activity

            People

              randgalt Jordan Zimmerman
              digbyg Grant Digby
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 10m
                  2h 10m