[FLINK-20033] Job fails when stopping JobMaster - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.10.2, 1.11.2
Fix Version/s: 1.10.3, 1.11.3, 1.12.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

When a JobMaster is stopped, we first disconnect all TaskExecutors. This disconnection causes any Executions that are still running to fail. The failures can in turn trigger a restart of the job or, in the worst case, a transition into the FAILED state if the allowed restarts are depleted. Reaching FAILED can then trigger the cleanup of the HA data.

Instead of being failed, the job should be suspended when the JobMaster is stopped, because the JobMaster is stopped when the Dispatcher loses its leadership. The problem has been fixed unintentionally by FLINK-19237 in the master branch.
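The difference between the two shutdown paths can be illustrated with a toy model. This is only a sketch under stated assumptions, not Flink code: the class `JobMasterStopSketch`, its methods, and the rule that only FAILED leads to HA data cleanup are hypothetical names and simplifications introduced here for illustration.

```java
public class JobMasterStopSketch {

    /** Simplified job states named after the states mentioned in the issue. */
    public enum JobStatus { RUNNING, RESTARTING, FAILED, SUSPENDED }

    /** Assumption of this sketch: only FAILED leads to cleanup of HA data. */
    public static boolean triggersHaCleanup(JobStatus status) {
        return status == JobStatus.FAILED;
    }

    /**
     * Problematic path: TaskExecutors are disconnected first, so running
     * executions fail and the outcome depends on the restarts that are left.
     */
    public static JobStatus stopByDisconnectingFirst(int remainingRestarts) {
        return remainingRestarts > 0 ? JobStatus.RESTARTING : JobStatus.FAILED;
    }

    /**
     * Desired path: the job is suspended when the JobMaster stops, so the
     * remaining restarts do not matter and the job never reaches FAILED.
     */
    public static JobStatus stopBySuspending(int remainingRestarts) {
        return JobStatus.SUSPENDED;
    }

    public static void main(String[] args) {
        for (int restarts : new int[] {1, 0}) {
            JobStatus buggy = stopByDisconnectingFirst(restarts);
            JobStatus fixed = stopBySuspending(restarts);
            System.out.println("remaining restarts=" + restarts
                    + ": disconnect first -> " + buggy
                    + " (HA cleanup: " + triggersHaCleanup(buggy) + ")"
                    + ", suspend -> " + fixed
                    + " (HA cleanup: " + triggersHaCleanup(fixed) + ")");
        }
    }
}
```

In this model the suspended job keeps its HA data, so a new leader could recover it, whereas the disconnect-first path with no restarts left ends in FAILED and loses that data.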

Issue Links

causes
- FLINK-20065 UnalignedCheckpointCompatibilityITCase.test failed with AskTimeoutException (Closed)

is related to
- FLINK-19237 LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with "NoResourceAvailableException: Could not allocate the required slot within slot request timeout" (Closed)
- FLINK-19816 Flink restored from a wrong checkpoint (a very old one and not the last completed one) (Closed)

links to
- GitHub Pull Request #13978
- GitHub Pull Request #14037

People

Assignee: Till Rohrmann
Reporter: Till Rohrmann
Votes: 0
Watchers: 5

Dates

Created: 06/Nov/20 15:32
Updated: 12/Nov/20 13:56
Resolved: 12/Nov/20 13:56