Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 1.13.5, 1.14.3
Description
After a checkpoint has been triggered, if the checkpoint fails, Flink executes the tolerable-failed-checkpoints logic. But if the triggerCheckpoint call itself fails, Flink does not execute the tolerable-failed-checkpoints logic.
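For context, the tolerance discussed here is user-configurable. A typical setting in flink-conf.yaml uses Flink's execution.checkpointing.tolerable-failed-checkpoints option:

```yaml
# Fail the job after this many consecutive checkpoint failures.
# 0 means no checkpoint failure is tolerated, which is the setting
# described in this report.
execution.checkpointing.tolerable-failed-checkpoints: 0
```

The bug is that this tolerance is only consulted for checkpoints that fail after being triggered, not for failures in the trigger phase itself.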
How to reproduce this issue?
In our online environment, an HDFS SRE deleted the Flink base directory by mistake, so the Flink job no longer had permission to create the checkpoint directory. This caused checkpoint triggering to fail.
Several behaviors did not meet expectations:
- The JM only logs "Failed to trigger checkpoint for job 6f09d4a15dad42b24d52c987f5471f18 since Trigger checkpoint failure", without showing the root cause or the underlying exception.
- The user set tolerable-failed-checkpoints=0, but when triggerCheckpoint fails, Flink does not execute the tolerable-failed-checkpoints logic.
- When triggerCheckpoint fails, numberOfFailedCheckpoints stays at 0.
- When triggerCheckpoint fails, no checkpoint info appears on the checkpoint history page.
Because all metrics looked normal, we only discovered the next day that checkpointing had been failing for a full day. This is not acceptable to Flink users.
I have some ideas:
- Should the tolerable-failed-checkpoints logic be executed when triggerCheckpoint fails?
- When triggerCheckpoint fails, should numberOfFailedCheckpoints be increased?
- When triggerCheckpoint fails, should the checkpoint info be shown on the checkpoint history page?
- The JM only logs "Failed to trigger checkpoint"; should we log the detailed exception so the root cause is easy to find?
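To make the first two ideas concrete, here is a minimal sketch (not Flink's actual implementation; all names are hypothetical) of a failure counter that treats trigger-phase failures the same as failures of an already-running checkpoint, so that tolerable-failed-checkpoints and numberOfFailedCheckpoints would cover both cases:

```java
// Hypothetical sketch of the proposed behavior: count a trigger-phase
// failure exactly like a runtime checkpoint failure.
public class CheckpointFailureTracker {
    private final int tolerableFailures;
    private int consecutiveFailures = 0;

    public CheckpointFailureTracker(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Called for BOTH trigger-phase failures and runtime checkpoint
     * failures. Returns true when the job should be failed because the
     * configured tolerance has been exceeded.
     */
    public boolean onCheckpointFailed() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    /** A successful checkpoint resets the consecutive-failure count. */
    public void onCheckpointSucceeded() {
        consecutiveFailures = 0;
    }

    /** Would back the numberOfFailedCheckpoints metric. */
    public int getNumberOfFailedCheckpoints() {
        return consecutiveFailures;
    }

    public static void main(String[] args) {
        // tolerable-failed-checkpoints = 0: the very first failure,
        // even in the trigger phase, should fail the job.
        CheckpointFailureTracker tracker = new CheckpointFailureTracker(0);
        boolean shouldFailJob = tracker.onCheckpointFailed();
        System.out.println("failedCheckpoints=" + tracker.getNumberOfFailedCheckpoints()
                + " shouldFailJob=" + shouldFailJob);
    }
}
```

With tolerance 0, the first trigger failure would immediately fail the job instead of failing silently for a day.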
Masters, could we make these changes? Please correct me if I'm wrong.
Attachments
Issue Links
- causes
  - FLINK-26550 Correct the information of checkpoint failure (Resolved)
  - FLINK-26993 CheckpointCoordinatorTest#testMinCheckpointPause (Closed)
- is duplicated by
  - FLINK-24384 Count checkpoints failed in trigger phase into numberOfFailedCheckpoints (Closed)
- links to