Uploaded image for project: 'Apache Curator'
  1. Apache Curator
  2. CURATOR-627

Curator remains in SUSPENDED state ignoring session timeout or network recovery

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 4.2.0
    • 5.2.1, 5.3.0
    • Framework
    • None
    • OS: Linux version 4.17.3-1.el7.elrepo.x86_64

    Description

      Description

      Our program encountered a problem where curator got stuck in the SUSPENDED state. It kept injecting "session expire" event, generating logs like this every 15s (same as session timeout):

      Session timeout has elapsed while SUSPENDED. Injecting a session expiration

      However, curator never transited to the LOST state after session timeout, nor transied to the RECONNECTED state upon network recovery.

       

      According to logs containing "zookeeper" or "curator" (see attachment "log4j.log"), related events are as follows:

      (All events happened in date 2021.12.05)

      • 23:34:07,122: ZooKeeper not heard from server, thus curator transited to the SUSPENDED state.
      • 23:34:09,128: ZooKeeper opening socket to zk server.
      • 23:34:22,223: curator injected a "session expiration" event, but had not transited to the LOST state at that time; never transited in the future.
      • 23:34:22,885: ZooKeeper opened socket to zk server, and received reponse indicating session expiration; thus, the current ZooKeeper object was closed. However, another ZooKeeper object was not created as expected.
      • 23:34:37,224 and later: curator kept injecting "session expiration" events every session timeout (15s). The log was filled with lines like this: "Session timeout has elapsed while SUSPENDED. Injecting a session expiration."

      In summary, curator stayed in the SUSPENED state and never transited to the LOST or RECONNECTED state. Besides, the underlying ZooKeeper object was never recreated according to logs and the "jstack" command. For a more detailed analysis, please refer to attachment "log analysis.md".

       

      The problem above was encountered only once in our testing environment within months, and has never occurred in our production enviromnent. We failed to find out how to reproduce, but suspect there is a racing condition when these events happen simultaneously: while curator is injecting a "session expire" event, the underlying "ClientCnxnSocketNIO" has just reconnected to a zk server and got a "session expire" response.

      Environment

      OS: Linux version 4.17.3-1.el7.elrepo.x86_64

       

      Project type: maven

      Curator version: 4.2.0

       

      Session timeout: 15s, which is equal to that negotiated with the server.

      Retry policy: no retry, where RetryPolicy#allowRetry() always returns false. That's because we handle retry at application level. Write operations should NOT be retried immediately after reconnected, until some extra validations pass.

       

      Our program creates exactly one "CuratorFramework" instance connecting to the only zk server in the testing environment. The server's connection string is "6x.xx.xx.27:2181", with some digits consealed for security. Ip addresses in logs are also such processed.

       

      ZkClient is imported for forward compatiblility with some legacy codes (still dependent on ZkClient), but it is bridged to curator instead of connecting to zk server itself. For more details about dependencies, please refer to file "pom.xml" in attachments.

      Attachments

      log4j.log: log of our program, with ip addresses and zk paths consealed.

      log analysis.md: a more detailed analysis about the logs.

      pom.xml: part of file "pom.xml", showing zookeeper-related dependencies.

      Attachments

        1. log4j.log
          239 kB
          Bai Yu
        2. log analysis.md
          7 kB
          Bai Yu
        3. pom.xml
          0.7 kB
          Bai Yu

        Issue Links

          Activity

            People

              Unassigned Unassigned
              Bai Yu Bai Yu
              Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: