[SAMZA-1384] Race condition with async commit affects checkpoint correctness - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.13.1
Component/s: None
Labels:
None

Description

tl;dr if any in-flight request updates the offsets between producer.flush() and offsetmanager.checkpoint() we could write a checkpoint for a message that did not yet go out over the wire and could subsequently fail.

Consider two threads A and B. A is performing an async commit. B is an in-flight process(). The following sequence will cause data loss:
A: TaskInstance.commit() begins
A: producer.flush() is called // no new messages will go out in this batch

B: producer.send() is called
B: TaskCallback is invoked for the finished process()
B: OffsetManager records the offset for the completed process()

A: producer.flush() finishes
A: checkpoint is written using the latest offsets from the OffsetManager. This INCLUDES the offset for the latest send, which has not yet gone out over the wire.
A: TaskInstance.commit() finishes

B: producer.send()->callback is invoked with an error. Send was unsuccessful, but has been checkpointed already.
B: Exception is propagated and container fails
Result: Container is restarted and starts from the last checkpoint.

Note that this is only an issue when the commit() occurs concurrently with in-flight requests, so it doesn't affect the fully-synchronous mode or concurrent mode with synchronous commit().

Proposed solution:
Take a snapshot of the offsets in the OffsetManager at the beginning of commit(). Only checkpoint those offsets and nothing new that has been sent since the commit() started.

Attachments

Issue Links

links to

GitHub Pull Request #263

Activity

People

Assignee:: Jake Maes

Reporter:: Jake Maes

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 04/Aug/17 17:51

Updated:: 10/Aug/17 22:45

Resolved:: 10/Aug/17 22:45