[ZOOKEEPER-4646] Committed txns may still be lost if followers crash after replying ACK of NEWLEADER but before writing txns to disk - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Duplicate
Affects Version/s: 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.9.0, 3.8.1, 3.9.1
Fix Version/s: None
Component/s: quorum, server
Labels:
- pull-request-available

Flags:

Important

Description

When a follower is processing the NEWLEADER message in SYNC phase, its QuorumPeer thread will call logRequest(..) to submit the txn persistence task to the SyncThread. The SyncThread will persist txns asynchronously and does not promise to finish the task before the follower replies ACK-LD (i.e. ACK of NEWLEADER) to the leader, which may lead to committed data loss.

Actually, this problem had been first raised in ~~ZOOKEEPER-3911~~ . However, the fix of ~~ZOOKEEPER-3911~~ does not solve the problem at the root. The following trace can still be generated in the latest version nowadays.

Trace

Trace-ZK-4646.pdf

The trace is basically the same as the one in ~~ZOOKEEPER-3911~~ (See the first comment provided by hanm in that issue). For convenience we use the zxid to represent a txn here.

Start the ensemble with three nodes: S0, S1 & S2.

S2 is elected leader.
All of them have the same log with the last zxid <1, 3>.
S2 logs a new txn <1, 4> and makes a broadcast.
S0 & S1 crash before they receive the proposal of <1, 4>.
S0 & S1 restart.
S2 is elected leader again.
S0 & S1 DIFF sync with S2 .
S0 & S1 send ACK-LD to S2 before their SyncThreads log txns to disk. (This is possible because txn logging is processed asynchronously! )
Verify clients of S2 have the view of <1, 4>.
The followers S0 & S1 crash before their SyncThreads persist txns to disk. (This is extremely timing sensitive but possible! )
S0 & S1 restart, and S2 crashes.
Verify clients of S0 & S1 do NOT have the view of <1, 4>, a violation of ZAB.

Extra note: The trace can be constructed with quorum nodes alive at any moment with careful time tuning of node shutdown & restart, e.g., let S0 & S1 shutdown and restart one by one in a short time.

Analysis

Root Cause:

The root cause lies in the asynchronous executions by multi-threads.

When a follower replies ACK-LD, it should promise that it has already logged the initial history of the leader (according to ZAB). However, txn logging is executed by the SyncThread asynchronously, so the above promise can be violated. It is possible that, after the leader receives ACK-LD, believing that the responding follower has been in sync, and then gets into the BROADCAST phase, while in fact the history of the follower is not in sync yet. At this time, environment failures might prevent the follower from logging successfully. When that node with stale or incomplete committed history is elected leader later, it might lose txns that have been committed and applied on the former leader node.

The implementation adopts the multi-threading style for performance optimization. However, it may bring some underlying subtle bugs that will not occur at the protocol level. The fix of ~~ZOOKEEPER-3911~~ simply calls logRequest(..) to submit the logging requests to SyncRequestProcessor's queue before replying ACK-LD inside the NEWLEADER processing logic, without further considering the risk of asynchronous executions by multi-threads. When the follower replies ACK-LD and then crashes before its SyncThread writes txns to disk, the problem is triggered.

Property Violation:

From the server side, the committed log of the ensemble does not append monotonically; different nodes have inconsistent committed logs. From the client side, clients connected to different nodes may have inconsistent views. A client may read stale data after a newer version is obtained. That newer version can only be obtained from certain nodes of the ensemble rather than all nodes. What's worse, that newer version may also be removed later.

Affected Versions:

The above trace has been generated in multiple versions such as 3.7.1 & 3.8.1 (the latest stable & current version till now) by our testing tools. The affected versions might be more, since the critical partial order between the follower's replying ACK-LD and updating its history during SYNC stay non-deterministic in multiple versions.

Possible Fix

Considering this issue and ZOOKEEPER-4685 , one possible fix is to guarantee the following partial orders to be satisfied:

A follower replies ACK-LD (i.e. ACK of NEWLEADER) only after it has persisted the txns that might be applied to the leader's datatree before the leader gets into the BROADCAST phase (so as to avoid this issue).
The follower replies ACK of PROPOSAL only after it replies ACK-LD (i.e. ACK of NEWLEADER) to the leader (so as to avoid ZOOKEEPER-4685 ).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

Trace-ZK-4646.pdf
13/Dec/22 03:37
381 kB
Sirius

Issue Links

is related to

ZOOKEEPER-3023 Flaky test: org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalFollowerRunWithDiff

Resolved

ZOOKEEPER-4643 Committed txns may be improperly truncated if follower crashes right after updating currentEpoch but before persisting txns to disk

Resolved

is superceded by

ZOOKEEPER-4394 Learner.syncWithLeader got NullPointerException

Resolved

ZOOKEEPER-4785 Txn loss due to race condition in Learner.syncWithLeader() during DIFF sync

Resolved

relates to

ZOOKEEPER-3023 Flaky test: org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalFollowerRunWithDiff

Resolved

ZOOKEEPER-4643 Committed txns may be improperly truncated if follower crashes right after updating currentEpoch but before persisting txns to disk

Resolved

links to

GitHub Pull Request #1993

(1 relates to, 1 links to)

Committed txns may still be lost if followers crash after replying ACK of NEWLEADER but before writing txns to disk

Details

Description

Trace

Analysis

Possible Fix

Attachments

Attachments

Issue Links

Activity

People

Dates

Time Tracking