Details
- Type: Bug
- Status: Resolved
- Priority: Normal
- Resolution: Fixed
- Fix Version/s: None
- Bug Category: Correctness - Consistency
- Severity: Normal
- Complexity: Challenging
- Discovered By: Code Inspection
- Platform: All
- Impacts: None
Description
As assessed in CASSANDRA-15363, we generate fake row deletions and fake tombstone markers under various circumstances:
- If we perform a clustering key query (or select a compact column):
  - serving from a memtable, we will generate fake row deletions;
  - serving from an sstable, we will generate fake row tombstone markers.
- If we perform a slice query, we will generate only fake row tombstone markers for any range tombstone that begins or ends outside the bounds of the requested slice.
- If we perform a multi-slice or IN query, this occurs for each slice/clustering.
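To make the slice case concrete, here is a minimal sketch (hypothetical code, not Cassandra's actual read path) of clamping a range tombstone to the bounds of a requested slice. When the tombstone extends past the slice, the reader emits markers at the slice limits rather than the tombstone's true bounds, and those substituted bounds are the "fake" markers described above:

```python
# Hypothetical sketch: clamping a range tombstone (RT) to a requested slice.
# The RangeTombstone type and integer clustering bounds are invented for
# illustration; they are not Cassandra's real representation.
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeTombstone:
    start: int          # clustering bound where the deletion opens
    end: int            # clustering bound where the deletion closes
    deletion_time: int  # timestamp of the deletion


def clamp_to_slice(rt, slice_start, slice_end):
    """Return rt restricted to [slice_start, slice_end], or None if disjoint."""
    if rt.end < slice_start or rt.start > slice_end:
        return None
    # Any bound replaced here is a "fake" marker: it did not exist in the
    # stored data, so a replica answering a wider (or narrower) slice will
    # emit different bounds for the same logical deletion.
    return RangeTombstone(max(rt.start, slice_start),
                          min(rt.end, slice_end),
                          rt.deletion_time)


rt = RangeTombstone(start=0, end=100, deletion_time=42)
print(clamp_to_slice(rt, 10, 20))  # RangeTombstone(start=10, end=20, deletion_time=42)
```

Note that the fake bounds depend entirely on the query, so two overlapping slice queries over the same deletion produce differently-bounded markers.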
Unfortunately, these differing behaviours can lead to very different data being stored in sstables until a full repair is run. When we read-repair, we send only these fake deletions or range tombstones. A fake row deletion, a clustering RT, and a slice RT each produce a different digest, so each single point lookup can produce a digest mismatch twice, and until a full repair is run we can encounter an unlimited number of digest mismatches across different overlapping queries.
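The digest divergence can be sketched as follows. The serialized forms below are invented for illustration (Cassandra's real digest is computed over its own serialized read response), but they show how one logical deletion, represented three different ways by different replicas or read paths, yields three distinct digests:

```python
# Hypothetical sketch: the same logical deletion of clustering key 5,
# represented as a row deletion, a clustering-bounded RT, and a
# slice-clamped RT. The byte encodings are made up for illustration.
import hashlib


def digest(serialized: bytes) -> str:
    # MD5 stands in for whatever hash the digest response uses.
    return hashlib.md5(serialized).hexdigest()


row_deletion  = b"row-deletion ck=5 ts=42"   # fake row deletion (memtable path)
clustering_rt = b"rt [5,5] ts=42"            # fake RT markers (sstable path)
slice_rt      = b"rt [5,10) ts=42"           # RT clamped to a slice bound

digests = {digest(row_deletion), digest(clustering_rt), digest(slice_rt)}
print(len(digests))  # 3 distinct digests for one logical deletion
```

Since each pair of representations mismatches, every comparison between replicas holding different forms triggers a digest mismatch and a read repair that does not converge the stored representations.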
Relatedly, this seems a more problematic variant of the atomicity failures caused by our monotonic reads: an RT can have an atomic effect across (up to) the entire partition, while its propagation may cover an arbitrarily small portion of that range. If the RT exists on only one node, this could plausibly lead to a fairly problematic scenario if that node fails before the range can be repaired.
At the very least, this behaviour can lead to an almost unlimited amount of extraneous data being stored until the range is repaired and compaction happens to overwrite the sub-range RTs and row deletions.
Issue Links
- relates to CASSANDRA-15640: digest may not match when single partition named queries skip older sstables (Resolved)