KAFKA-13773

Data loss after recovery from crash due to full hard disk


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.8.0, 3.1.0, 2.8.1
    • Fix Version/s: 3.3.0, 3.2.1
    • Component/s: log
    • Labels: None

    Description

      While doing some testing of Kafka on Kubernetes, the data disk for Kafka filled up, which led to all 3 nodes crashing. I increased the disk size for all three nodes and started Kafka again (one by one, waiting for the previous node to become available before starting the next one). After a little while, two out of three nodes had no data anymore.

      According to the logs, the log cleaner kicked in and decided that the latest timestamp on those partitions was '0' (i.e. 1970-01-01, the Unix epoch), which is older than the 2-week retention limit specified on the topic.

       

      2022-03-28 12:17:19,740 INFO [LocalLog partition=audit-trail-0, dir=/var/lib/kafka/data-0/kafka-log1] Deleting segment files LogSegment(baseOffset=0, size=249689733, lastModifiedTime=1648460888636, largestRecordTimestamp=Some(0)) (kafka.log.LocalLog$) [kafka-scheduler-0]
      2022-03-28 12:17:19,753 INFO Deleted log /var/lib/kafka/data-0/kafka-log1/audit-trail-0/00000000000000000000.log.deleted. (kafka.log.LogSegment) [kafka-scheduler-0]
      2022-03-28 12:17:19,754 INFO Deleted offset index /var/lib/kafka/data-0/kafka-log1/audit-trail-0/00000000000000000000.index.deleted. (kafka.log.LogSegment) [kafka-scheduler-0]
      2022-03-28 12:17:19,754 INFO Deleted time index /var/lib/kafka/data-0/kafka-log1/audit-trail-0/00000000000000000000.timeindex.deleted. (kafka.log.LogSegment) [kafka-scheduler-0]
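
      To illustrate the arithmetic behind that decision, here is a minimal Scala sketch (not Kafka's actual code; the names and the 'now' value are my own): with a largest timestamp of 0, any retention window looks exceeded.

      // Time-based retention deletes a segment when now - largestTimestamp
      // exceeds retention.ms, so a bogus timestamp of 0 (the Unix epoch)
      // makes every retention window appear breached.
      object RetentionSketch {
        val retentionMs: Long = 14L * 24 * 60 * 60 * 1000 // the topic's 2-week limit

        def breachesRetention(largestTimestampMs: Long, nowMs: Long): Boolean =
          nowMs - largestTimestampMs > retentionMs

        def main(args: Array[String]): Unit = {
          // Assume "now" is about two hours after the segment's last append.
          val now = 1648460888636L + 2L * 60 * 60 * 1000
          println(breachesRetention(1648460888636L, now)) // false: segment is ~2 hours old
          println(breachesRetention(0L, now))             // true: epoch looks ~52 years old
        }
      }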

      Using kafka-dump-log.sh I was able to determine that the greatest timestamp in that file (before deletion) was actually 1648460888636 (2022-03-28 09:48:08 UTC, i.e. the same day). However, since this segment was the latest/current segment, much of the file is empty. The code that determines the last entry (TimeIndex.lastEntryFromIndexFile) doesn't seem to account for this and just reads the last position in the file; because the file is mostly empty, it reads 0 for that position.
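
      To show why that happens, here is a sketch (an illustration, not the actual TimeIndex implementation) of reading the last slot of a preallocated, zero-filled time index. Kafka's time index stores fixed-size 12-byte entries: an 8-byte timestamp followed by a 4-byte relative offset.

      import java.io.RandomAccessFile

      // Reading the last 12-byte slot of a mostly-empty (zero-filled) index
      // file returns timestamp 0 and relative offset 0.
      object LastTimeIndexEntry {
        val EntrySize = 12

        def lastEntry(path: String): (Long, Int) = {
          val raf = new RandomAccessFile(path, "r")
          try {
            val slots = raf.length() / EntrySize
            if (slots == 0) (0L, 0)
            else {
              raf.seek((slots - 1) * EntrySize)
              (raf.readLong(), raf.readInt()) // (0, 0) if the slot was never written
            }
          } finally raf.close()
        }
      }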

      The cleaner code seems to take this into account, since UnifiedLog.deleteOldSegments is never supposed to delete the current segment, judging by the Scaladoc; however, in this case the check doesn't seem to do its job. Perhaps the detected highWatermark is wrong?
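
      For reference, here is a sketch (names are illustrative, not Kafka's actual UnifiedLog code) of the guard the Scaladoc describes: a segment should only be deletable when the retention predicate holds, it is not the active segment, and it lies entirely below the high watermark.

      final case class Segment(baseOffset: Long, nextOffset: Long, largestTimestampMs: Long)

      object DeletionGuard {
        def deletableSegments(segments: Seq[Segment],
                              activeSegment: Segment,
                              highWatermark: Long,
                              predicate: Segment => Boolean): Seq[Segment] =
          segments.filter { s =>
            predicate(s) &&
              (s ne activeSegment) &&         // never delete the active segment...
              s.nextOffset <= highWatermark   // ...or data at or above the high watermark
          }
      }

      If the recovered highWatermark is wrong, for example if it lands beyond the real end of the log, the offset check no longer protects freshly written data, which would match what I'm seeing.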

      I've attached the logs and the zipped data directories (the data files are over 3 GB in size when unzipped).

       

      I've encountered this problem with both Kafka 2.8.1 and 3.1.0.

      I've also tried changing min.insync.replicas to 2; the issue still occurs.

      Attachments

        1. kafka-.zip (54.88 MB, Tim Alkemade)
        2. kafka-logfiles.zip (119 kB, Tim Alkemade)
        3. kafka-2.7.0vs2.8.0.zip (195 kB, Tim Alkemade)
        4. kafka-2.8.0-crash.zip (85 kB, Tim Alkemade)
        5. DiskAndOffsets.png (170 kB, Tim Alkemade)
        6. kafka-start-to-finish.zip (4.62 MB, Tim Alkemade)


            People

              Assignee: Luke Chen (showuon)
              Reporter: Tim Alkemade (Timelad)
              Votes: 0
              Watchers: 5
