Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.10.0
Fix Version/s: None
Component/s: None
Description
Symptom: If a thread is writing to a file and gets stuck in writeLongToFile, that thread hangs. ZooKeeper should detect this kind of blocking via PING. However, if the QuorumPeer thread executes writeLongToFile and hits a fail-slow disk, the entire follower can get stuck. The leader will abandon this follower, but the follower still believes that it is a follower.
The call stack is as follows:
at org.apache.zookeeper.common.AtomicFileOutputStream.write(AtomicFileOutputStream.java:72)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:254)
at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:72)
at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:54)
at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:2233)
at org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:2262)
at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:510)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:91)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1556)
Root cause: The QuorumPeer thread is blocked in writeLongToFile and cannot execute readPacket, so no timeout exception is raised to trigger the error handler.
Moreover, this problem cannot be handled by adding "-Dlearner.asyncSending=true" (https://issues.apache.org/jira/browse/ZOOKEEPER-4074).
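To illustrate the root cause and one possible direction for a fix, here is a minimal sketch of bounding a blocking disk write with a timeout, so that the calling thread gets an exception instead of hanging forever. This is not ZooKeeper code and not a proposed patch: BoundedDiskWrite and runWithTimeout are hypothetical names, and the slow disk is simulated with Thread.sleep. Note that on a real fail-slow disk the worker thread may stay stuck in the kernel even after cancel(true); the sketch only frees the caller so it could run its error handler.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not part of ZooKeeper.
final class BoundedDiskWrite {
    // Daemon threads, so a write that never returns does not keep the JVM alive.
    private static final ExecutorService IO = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-disk-write");
        t.setDaemon(true);
        return t;
    });

    // Runs the write on a worker thread and waits at most timeoutMs for it.
    static void runWithTimeout(Callable<Void> write, long timeoutMs) throws IOException {
        Future<Void> future = IO.submit(write);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort: interrupts the worker, but a write stuck in the
            // kernel may not react. The caller is unblocked either way.
            future.cancel(true);
            throw new IOException("disk write exceeded " + timeoutMs + " ms", e);
        } catch (ExecutionException e) {
            throw new IOException("disk write failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for disk write", e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulated fail-slow disk: the "write" takes far longer than the timeout.
            runWithTimeout(() -> { Thread.sleep(5_000); return null; }, 100);
        } catch (IOException e) {
            System.out.println("write abandoned: " + e.getMessage());
        }
    }
}
```

With something along these lines, a caller such as setAcceptedEpoch would see an IOException after the timeout and could fall into the existing error handling path, instead of waiting on the disk indefinitely. Whether that is the right place to add a timeout is an open question for this issue.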