Details
Description
Description
Our program encountered a problem where curator got stuck in the SUSPENDED state. It kept injecting "session expire" event, generating logs like this every 15s (same as session timeout):
Session timeout has elapsed while SUSPENDED. Injecting a session expiration
However, curator never transited to the LOST state after session timeout, nor transied to the RECONNECTED state upon network recovery.
According to logs containing "zookeeper" or "curator" (see attachment "log4j.log"), related events are as follows:
(All events happened in date 2021.12.05)
- 23:34:07,122: ZooKeeper not heard from server, thus curator transited to the SUSPENDED state.
- 23:34:09,128: ZooKeeper opening socket to zk server.
- 23:34:22,223: curator injected a "session expiration" event, but had not transited to the LOST state at that time; never transited in the future.
- 23:34:22,885: ZooKeeper opened socket to zk server, and received reponse indicating session expiration; thus, the current ZooKeeper object was closed. However, another ZooKeeper object was not created as expected.
- 23:34:37,224 and later: curator kept injecting "session expiration" events every session timeout (15s). The log was filled with lines like this: "Session timeout has elapsed while SUSPENDED. Injecting a session expiration."
In summary, curator stayed in the SUSPENED state and never transited to the LOST or RECONNECTED state. Besides, the underlying ZooKeeper object was never recreated according to logs and the "jstack" command. For a more detailed analysis, please refer to attachment "log analysis.md".
The problem above was encountered only once in our testing environment within months, and has never occurred in our production enviromnent. We failed to find out how to reproduce, but suspect there is a racing condition when these events happen simultaneously: while curator is injecting a "session expire" event, the underlying "ClientCnxnSocketNIO" has just reconnected to a zk server and got a "session expire" response.
Environment
OS: Linux version 4.17.3-1.el7.elrepo.x86_64
Project type: maven
Curator version: 4.2.0
Session timeout: 15s, which is equal to that negotiated with the server.
Retry policy: no retry, where RetryPolicy#allowRetry() always returns false. That's because we handle retry at application level. Write operations should NOT be retried immediately after reconnected, until some extra validations pass.
Our program creates exactly one "CuratorFramework" instance connecting to the only zk server in the testing environment. The server's connection string is "6x.xx.xx.27:2181", with some digits consealed for security. Ip addresses in logs are also such processed.
ZkClient is imported for forward compatiblility with some legacy codes (still dependent on ZkClient), but it is bridged to curator instead of connecting to zk server itself. For more details about dependencies, please refer to file "pom.xml" in attachments.
Attachments
log4j.log: log of our program, with ip addresses and zk paths consealed.
log analysis.md: a more detailed analysis about the logs.
pom.xml: part of file "pom.xml", showing zookeeper-related dependencies.
Attachments
Attachments
Issue Links
- is fixed by
-
CURATOR-561 Race condition preventing reconnection
- Closed