Details
- Bug
- Status: Resolved
- Blocker
- Resolution: Fixed
- 1.15.0
Description
Extracted from the FLINK-25185 discussion.
On checkpoint abortion, or on any failure in AsyncCheckpointRunnable, the runnable discards the state, in particular shared (incremental) state.
Since FLINK-24611, this creates a problem because shared state can be re-used for future checkpoints.
A similar case is in PeriodicMaterializationManager (uploaded SST files will be deleted on failure without notifying the wrapped RocksDB state backend).
The symptom of this failure is the following exception during recovery:
Caused by: java.io.FileNotFoundException: /tmp/junit3146957979516280339/junit1602669867129285236/d6a6dbdd-3fd7-4786-9dc1-9ccc161740da (No such file or directory)
    at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_292]
    at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_292]
    at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_292]
    at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87) ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.changelog.fs.StateChangeFormat.read(StateChangeFormat.java:92) ~[flink-dstl-dfs-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:85) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
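The failure mode described above can be illustrated with a small, self-contained sketch. It does not use any Flink classes: FakeStore, FakeIncrementalBackend and every method name below are hypothetical stand-ins, and the "notify the backend" variant is only one conceivable way to keep both sides consistent, not necessarily how the issue was fixed. The point is that deleting uploaded files on abort, while the backend still believes it can reuse them, leaves a later checkpoint with dangling references.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these types are Flink classes. */
class SharedStateDiscardSketch {

    /** Stand-in for the file system that holds uploaded state files. */
    static final class FakeStore {
        private final Set<String> files = new HashSet<>();
        void upload(String name) { files.add(name); }
        void delete(String name) { files.remove(name); }
        boolean exists(String name) { return files.contains(name); }
    }

    /** Stand-in for a backend that remembers uploaded files so it can reuse them. */
    static final class FakeIncrementalBackend {
        private final Set<String> reusable = new LinkedHashSet<>();
        void rememberUploaded(String name) { reusable.add(name); }
        void forget(String name) { reusable.remove(name); }
        Set<String> filesForNextCheckpoint() { return new LinkedHashSet<>(reusable); }
    }

    /** Problematic variant: files are deleted on abort, the backend is not told. */
    static void abortDiscardingEverything(FakeStore store, List<String> uploaded) {
        for (String f : uploaded) {
            store.delete(f);
        }
    }

    /** Assumed alternative: delete the files and also tell the backend to stop reusing them. */
    static void abortNotifyingBackend(
            FakeStore store, FakeIncrementalBackend backend, List<String> uploaded) {
        for (String f : uploaded) {
            store.delete(f);
            backend.forget(f);
        }
    }

    /** Files the next checkpoint would reference although they no longer exist. */
    static List<String> danglingReferences(FakeStore store, FakeIncrementalBackend backend) {
        List<String> dangling = new ArrayList<>();
        for (String f : backend.filesForNextCheckpoint()) {
            if (!store.exists(f)) {
                dangling.add(f);
            }
        }
        return dangling;
    }

    /** Uploads two files, aborts the checkpoint, and reports dangling references. */
    static List<String> runScenario(boolean notifyBackend) {
        FakeStore store = new FakeStore();
        FakeIncrementalBackend backend = new FakeIncrementalBackend();
        List<String> uploaded = Arrays.asList("sst-1", "sst-2");
        for (String f : uploaded) {
            store.upload(f);
            backend.rememberUploaded(f);
        }
        if (notifyBackend) {
            abortNotifyingBackend(store, backend, uploaded);
        } else {
            abortDiscardingEverything(store, uploaded);
        }
        return danglingReferences(store, backend);
    }

    public static void main(String[] args) {
        System.out.println(runScenario(false)); // prints [sst-1, sst-2]
        System.out.println(runScenario(true));  // prints []
    }
}
```

In the first scenario a recovery that tries to open sst-1 would fail in the same way as the FileNotFoundException above; in the second, the backend no longer references the deleted files.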
Issue Links
- causes: FLINK-24163 PartiallyFinishedSourcesITCase fails due to timeout (Closed)
- is caused by: FLINK-24611 Prevent JM from discarding state on checkpoint abortion (Resolved)
- is duplicated by: FLINK-25399 AZP fails with exit code 137 when running checkpointing test cases (Closed)
- is related to: FLINK-25185 StreamFaultToleranceTestBase hangs on AZP (Closed)
- links to