Description
When a follower processes the NEWLEADER message in the SYNC phase, its QuorumPeer thread calls logRequest(..) to submit the txn persistence task to the SyncThread. The SyncThread may persist the txns and reply with their ACKs before the QuorumPeer thread replies with ACK-LD (i.e. the ACK of NEWLEADER) to the leader. As a result, the leader may fail to collect enough ACK-LDs, which leads to the leader's shutdown and a new round of election. This introduces an unnecessary recovery procedure, delays the servers' entry into the BROADCAST phase, and reduces the service's availability.
The following trace can be reproduced on the latest version at the time of writing.
Trace
Start the ensemble with three nodes: S0, S1 & S2.
- S2 is elected leader.
- S2 logs a new txn <1, 1> and makes a broadcast.
- S0 restarts & S1 crashes before receiving the proposal of <1, 1>.
- S2 is elected leader again.
- S2 syncs with S0 using DIFF, and sends the proposal of <1, 1> during SYNC.
- After S0 receives NEWLEADER, S0's SyncThread may persist the txn <1, 1> and reply with the corresponding ACK to the leader S2 before S0's QuorumPeer thread replies with ACK-LD to S2. (This is possible because txn logging is processed asynchronously by the SyncThread.)
- The corresponding LearnerHandler on S2 does not recognize an ACK of a proposal that arrives before ACK-LD, and blocks at waitForStartup() until the leader changes its state to state.RUNNING.
- However, the QuorumPeer of the leader S2 does not receive enough ACK-LDs before the timeout, and throws InterruptedException during waitForNewLeaderAck(..).
- After that, the leader shuts down and a new round of election starts, which takes extra time to re-establish the quorum and reduces availability.
Analysis
Root Cause:
Similar to ZOOKEEPER-4646, the root cause lies in the asynchronous execution across multiple threads on the follower side. The implementation is multi-threaded for performance, but this can introduce subtle bugs.
When the follower receives NEWLEADER, it calls logRequest(..) to submit the logging requests to SyncRequestProcessor's queue before replying ACK-LD. The SyncThread may be scheduled to persist the txns and reply ACK(s) before the QuorumPeer thread replies ACK-LD.
On the leader side, the corresponding LearnerHandler does not recognize the ACK of PROPOSAL during waitForNewLeaderAck(..), and then blocks at waitForStartup() until the leader changes its state to state.RUNNING. However, the leader may not receive enough ACK-LDs before the timeout, in which case it throws InterruptedException during waitForNewLeaderAck(..). After that, the leader shuts down and the servers go through a new election and recovery process.
This recovery procedure takes extra time and increases the period during which the system is unavailable.
To some degree it is unnecessary: it can be avoided by guaranteeing the message order on the follower side.
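As an illustration of why the reply order matters, the following Java sketch models the leader-side check in a deliberately simplified way. The class AckOrderSketch, the Packet enum and both methods are hypothetical names introduced here and are not ZooKeeper code. The model assumes only what is described above: when an ACK of a PROPOSAL arrives before ACK-LD, that follower's ACK-LD is not counted before the timeout.

```java
import java.util.Arrays;
import java.util.List;

public class AckOrderSketch {
    // Hypothetical packet labels for this sketch; not ZooKeeper's real packet types.
    public enum Packet { ACK_LD, ACK_PROPOSAL }

    // Simplified model of one LearnerHandler: the follower's ACK-LD is counted
    // only if it is the first packet received after NEWLEADER was sent.
    public static boolean ackLdCounted(List<Packet> fromFollower) {
        return !fromFollower.isEmpty() && fromFollower.get(0) == Packet.ACK_LD;
    }

    // The leader counts itself, then needs a strict majority of the ensemble.
    public static boolean quorumAckedNewLeader(int ensembleSize,
                                               List<List<Packet>> followerStreams) {
        int acks = 1;
        for (List<Packet> stream : followerStreams) {
            if (ackLdCounted(stream)) {
                acks++;
            }
        }
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // S1 has crashed, so S0 is the only follower of leader S2.
        List<Packet> expectedOrder = Arrays.asList(Packet.ACK_LD, Packet.ACK_PROPOSAL);
        List<Packet> racyOrder = Arrays.asList(Packet.ACK_PROPOSAL, Packet.ACK_LD);

        // prints true: ACK-LD arrives first, quorum of 2 out of 3 is reached
        System.out.println(quorumAckedNewLeader(3, Arrays.asList(expectedOrder)));
        // prints false: the proposal ACK overtakes ACK-LD, the leader would time out
        System.out.println(quorumAckedNewLeader(3, Arrays.asList(racyOrder)));
    }
}
```

With S1 crashed, the leader needs S0's ACK-LD to reach a quorum of two out of three, so a single reordered reply is enough to make the leader give up in this model.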
Affected Versions:
The above trace has been generated by our testing tools in multiple versions, including 3.7.1 and 3.8.1 (the latest stable and current versions at the time of writing). More versions may be affected, since the critical partial order between the follower's ACK-LD reply and its ACK of PROPOSAL replies during SYNC remains non-deterministic across multiple versions.
Possible Fix
Considering this issue and ZOOKEEPER-4646, one possible fix is to guarantee that the following partial orders are satisfied:
- The follower replies ACK of PROPOSAL only after it replies ACK-LD (i.e. ACK of NEWLEADER) to the leader (so as to avoid this issue).
- The follower replies ACK-LD only after it has persisted the txns that might be applied to the leader's datatree before the leader gets into the BROADCAST phase (to avoid ZOOKEEPER-4646).
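The first partial order above could be enforced in several ways. The following Java sketch shows one hypothetical approach, in which ACKs of proposals are held back on the follower until ACK-LD has been sent. OrderedAckSender and its methods are names invented for this illustration and do not exist in ZooKeeper, and the zxids are placeholder values. The sketch does not model the second partial order (persisting txns before replying ACK-LD) and is not meant as the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

public class OrderedAckSender {
    private final List<String> sent = new ArrayList<>();   // packets in the order they go out
    private final List<Long> deferred = new ArrayList<>(); // proposal ACKs held back so far
    private boolean ackLdSent = false;

    // Would be called by the logging thread once a txn has been persisted.
    public synchronized void ackProposal(long zxid) {
        if (ackLdSent) {
            sent.add("ACK " + zxid);
        } else {
            deferred.add(zxid); // ACK-LD has not gone out yet, so hold this ACK
        }
    }

    // Would be called by the main thread when it replies to NEWLEADER.
    public synchronized void sendAckLd() {
        sent.add("ACK-LD");
        ackLdSent = true;
        for (long zxid : deferred) {
            sent.add("ACK " + zxid); // flush held ACKs in their original order
        }
        deferred.clear();
    }

    // Returns a copy of the packets sent so far, in send order.
    public synchronized List<String> sent() {
        return new ArrayList<>(sent);
    }

    public static void main(String[] args) {
        OrderedAckSender sender = new OrderedAckSender();
        sender.ackProposal(1L); // the logging thread finishes first
        sender.sendAckLd();     // the main thread replies to NEWLEADER afterwards
        sender.ackProposal(2L); // later ACKs pass straight through
        System.out.println(sender.sent()); // prints [ACK-LD, ACK 1, ACK 2]
    }
}
```

Both methods are synchronized so that the check of ackLdSent and the send happen atomically, which is what removes the race between the two threads in this sketch.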