Details
Bug
Status: Resolved
Major
Resolution: Fixed
3.4.12, 3.5.7
None
None
cat /etc/os-release
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"
docker version
Client:
Version: 20.10.6-ce
API version: 1.41
Go version: go1.13.15
Git commit: 8728dd246c3a
Built: Tue Apr 27 09:45:18 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.6-ce
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 8728dd246c3a
Built: Fri Apr 9 22:06:18 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.4.4
GitCommit: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
runc:
Version: 1.0.0-rc93
GitCommit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
docker-init:
Version: 0.1.3_catatonit
GitCommit:
zookeeper version - 3.5.7
Description
I have a 3-node ZooKeeper cluster deployed as a stack using Docker Swarm.
Deploying this stack causes ZooKeeper to fail with a SocketTimeoutException during leader election, with the following log:
2021-06-11 03:59:34,607 [myid:2] - WARN [QuorumPeer[myid=2]/0.0.0.0:2181:QuorumCnxManager@584] - Cannot open channel to 3 at election address zoo3/10.0.11.5:3888
java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:610)
    at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:838)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:957)
The docker overlay network itself appears to be sound. A netstat on one of the nodes outputs:
bash-4.4# netstat -tuln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:2181            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3888            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:42941           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.11:35453        0.0.0.0:*               LISTEN
udp        0      0 127.0.0.11:55009        0.0.0.0:*
This shows that port 3888 is open, but a tcpdump shows only sends and retransmissions; there are no responses on port 3888.
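For anyone trying to reproduce this, the connectivity symptom can be checked independently of ZooKeeper with a plain TCP connect from inside one container to a peer's election port. The following is only a rough sketch: the function name can_connect and the 5-second default timeout are my own choices, and it assumes Python is available in the container, which the official image may not provide.

```python
import socket
import sys


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) completes within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hosts are taken from the command line, e.g.: python probe.py zoo3
    for host in sys.argv[1:]:
        state = "reachable" if can_connect(host, 3888) else "unreachable"
        print(f"{host}:3888 {state}")
```

If this probe also times out against zoo3 while netstat inside zoo3 shows 3888 in LISTEN, that would point at the network path between the containers rather than at the ZooKeeper election settings.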
Suspecting that the issue may be due to a short timeout or a small number of retries, I tried increasing cnxTimeout to 300000 and setting electionPortBindRetry to 0 (infinite), but even after 13 hours of continuous running and repeated election attempts, the same error persists.
I have attached the stack.yml, the custom docker-entrypoint.sh that we override on top of the official container to enable running from a root host user, and the zoo.cfg file from inside the container.
Any help in identifying the underlying issue or mis-configuration, or any configuration parameter that may help solve the issue is deeply appreciated.