Description
The ZooKeeper ensemble starts up properly once quorum is established. A leader is elected and starts serving requests. After a while the leader gets stuck: it keeps accepting requests but does not process them. The same happens on the participants: they accept requests, but because the leader does not process them, the requests keep piling up.
This leads to a sudden increase in the number of CLOSE_WAIT connections on the ZooKeeper servers. When this happens, the ensemble becomes completely unresponsive, causing connection loss and timeouts. Once CLOSE_WAIT connections start to appear, the number of open connections on each server spikes from about 200 to as many as 100,000 within a few minutes.
The thread dumps show a consistent pattern: the NIOServerCxnFactory selector thread is always blocked waiting for a lock in org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer:
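For anyone trying to reproduce or monitor this, the CLOSE_WAIT growth can be tracked by tallying TCP states for the client port. The following is only a rough sketch, not part of ZooKeeper: it assumes Linux `netstat -tan` style output (state in the sixth column) and the default client port 2181, and the class and method names are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts TCP connections per state for one local port,
// given lines in the layout printed by `netstat -tan` on Linux:
//   Proto Recv-Q Send-Q Local-Address Foreign-Address State
class TcpStateCounter {

    static Map<String, Integer> countStates(List<String> lines, int localPort) {
        Map<String, Integer> counts = new TreeMap<>();
        String portSuffix = ":" + localPort;
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            // Skip headers and anything that is not a TCP row with a state column.
            if (fields.length < 6 || !fields[0].startsWith("tcp")) {
                continue;
            }
            // fields[3] is the local address, e.g. 10.0.0.1:2181 or :::2181.
            if (!fields[3].endsWith(portSuffix)) {
                continue;
            }
            counts.merge(fields[5], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative input only; addresses and ports are invented.
        List<String> sample = List.of(
            "Proto Recv-Q Send-Q Local Address   Foreign Address  State",
            "tcp   0      0      10.0.0.1:2181   10.0.0.2:40000   CLOSE_WAIT",
            "tcp   0      0      10.0.0.1:2181   10.0.0.3:40001   ESTABLISHED");
        System.out.println(countStates(sample, 2181));
    }
}
```

Sampling this every few seconds on each server would show how quickly CLOSE_WAIT overtakes ESTABLISHED once the ensemble stops responding.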
tdump_zkdev14.i.ia55.net_1694037623.logs-"NIOServerCxnFactory.SelectorThread-0" #16 daemon prio=5 os_prio=0 cpu=9126323.70ms elapsed=25935.16s tid=0x00007f9118702320 nid=0x20ed94 waiting for monitor entry [0x00007f907e635000]
tdump_zkdev14.i.ia55.net_1694037623.logs:   java.lang.Thread.State: BLOCKED (on object monitor)
tdump_zkdev14.i.ia55.net_1694037623.logs-       at org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:42)
tdump_zkdev14.i.ia55.net_1694037623.logs-       - waiting to lock <0x0000000700391098> (a org.apache.zookeeper.Login)
tdump_zkdev14.i.ia55.net_1694037623.logs-       at org.apache.zookeeper.server.ZooKeeperSaslServer.<init>(ZooKeeperSaslServer.java:38)
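To see which thread holds the monitor that the selector thread is waiting for, the full dump can be scanned for matching `- waiting to lock` and `- locked` entries. The sketch below is a hypothetical helper, not ZooKeeper code. It assumes the standard HotSpot `jstack` text layout and tolerates the grep filename prefixes shown above. Note that `- locked` also appears for threads parked in `Object.wait()`, so the reported holder is a hint rather than proof.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups threads in a jstack-style dump by the monitor
// address they are waiting to lock, and records which thread reports it locked.
class MonitorContention {
    private static final Pattern THREAD =
        Pattern.compile("\"([^\"]+)\" #\\d+");
    private static final Pattern WAITING =
        Pattern.compile("- waiting to lock <(0x[0-9a-fA-F]+)>");
    private static final Pattern LOCKED =
        Pattern.compile("- locked <(0x[0-9a-fA-F]+)>");

    // Monitor address -> names of threads blocked on it.
    final Map<String, List<String>> waiters = new LinkedHashMap<>();
    // Monitor address -> name of the last thread seen reporting "- locked".
    final Map<String, String> holders = new LinkedHashMap<>();

    static MonitorContention parse(List<String> dumpLines) {
        MonitorContention result = new MonitorContention();
        String current = null; // name of the thread whose stack is being read
        for (String line : dumpLines) {
            Matcher header = THREAD.matcher(line);
            if (header.find()) {
                current = header.group(1);
                continue;
            }
            if (current == null) {
                continue; // preamble before the first thread header
            }
            Matcher waiting = WAITING.matcher(line);
            if (waiting.find()) {
                result.waiters
                      .computeIfAbsent(waiting.group(1), k -> new ArrayList<>())
                      .add(current);
                continue;
            }
            Matcher locked = LOCKED.matcher(line);
            if (locked.find()) {
                result.holders.put(locked.group(1), current);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Lines taken from the dump excerpt in this report (prefix shortened).
        List<String> dump = List.of(
            "logs-\"NIOServerCxnFactory.SelectorThread-0\" #16 daemon prio=5 waiting for monitor entry",
            "logs:   java.lang.Thread.State: BLOCKED (on object monitor)",
            "logs-   - waiting to lock <0x0000000700391098> (a org.apache.zookeeper.Login)");
        MonitorContention result = parse(dump);
        System.out.println("waiters: " + result.waiters);
        System.out.println("holders: " + result.holders);
    }
}
```

Run against a complete dump, the `holders` entry for 0x0000000700391098 should point at the thread to inspect next.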
Seems to be related to https://issues.apache.org/jira/browse/ZOOKEEPER-2230
Thanks