Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
Private Beta
Description
Binglin is triggering a crash reasonably regularly under load:
- a tablet is flushed with a snapshot that has at least one txn in flight, but a txn with a later timestamp already committed. eg:
- txn 1 and 3 committed, 2 in flight. This gives a flush snapshot txn <= 1 or txn == 3.
- as of
KUDU-987, we don't wait for all in-flight transactions to commit during flush (necessary since the txn might be in flight for a while) - because txn 3 was committed, the UNDO delta has a ts range of [1, 3]
- we then select the newly-flushed rowset for compaction, and txn 2 is still not committed
- at this point, we hit a CHECK failure because we see an UNDO file which can't be fully ignored by a compaction (its time range overlaps with uncommitted ranges in the current snapshot)