[FLINK-11813] Standby per job mode Dispatchers don't know job's JobSchedulingStatus - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.4, 1.7.2, 1.8.0, 1.9.3, 1.10.3, 1.11.3, 1.13.1, 1.12.4
Fix Version/s: 1.15.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Release Note:

Re-submission of a job in Application Mode after the job finished but failed during cleanup is fixed by introducing a new component, the JobResultStore, which enables Flink to persist the cleanup state of a job to the file system (see FLINK-25431).

Description

At the moment, standby Dispatchers in per-job mode can restart a terminated job after they gain leadership. The problem is that we currently clear the RunningJobsRegistry once a job has reached a globally terminal state. After the leading Dispatcher terminates, a standby Dispatcher gains leadership. Without the information from the RunningJobsRegistry, it cannot tell whether the job has already been executed or whether it needs to re-execute the job. At the moment, the Dispatcher assumes that there was a fault and hence re-executes the job. This can lead to duplicate results.

I think we need some way to tell standby Dispatchers that a certain job has been successfully executed. One trivial solution would be to not clean up the RunningJobsRegistry, but then we would clutter ZooKeeper.
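
To make the failure mode concrete, here is a minimal sketch of the decision a standby Dispatcher would face when it gains leadership, assuming some record of finished jobs outlives the previous leader. All names here (StandbyDispatcherSketch, InMemoryResultStore, Outcome, ResultState, onLeadershipGranted) are hypothetical and chosen for illustration only; this is not Flink's actual RunningJobsRegistry or JobResultStore API, and the in-memory map merely stands in for durable storage.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class StandbyDispatcherSketch {

    /** Hypothetical decisions a new leader can take for a job. */
    enum Outcome { RUN_JOB, SKIP_ALREADY_FINISHED, RESUME_CLEANUP }

    /** Assumed entry states: DIRTY = job finished, cleanup pending; CLEAN = cleanup done. */
    enum ResultState { DIRTY, CLEAN }

    /** Stand-in for a store that survives the loss of the leading Dispatcher. */
    static final class InMemoryResultStore {
        private final Map<String, ResultState> entries = new HashMap<>();

        void markDirty(String jobId) { entries.put(jobId, ResultState.DIRTY); }

        void markClean(String jobId) { entries.put(jobId, ResultState.CLEAN); }

        Optional<ResultState> lookup(String jobId) {
            return Optional.ofNullable(entries.get(jobId));
        }
    }

    /**
     * Without an entry the new leader has no evidence the job ever finished,
     * so it runs the job. With an entry it must not re-execute the job.
     */
    static Outcome onLeadershipGranted(InMemoryResultStore store, String jobId) {
        return store.lookup(jobId)
                .map(state -> state == ResultState.DIRTY
                        ? Outcome.RESUME_CLEANUP
                        : Outcome.SKIP_ALREADY_FINISHED)
                .orElse(Outcome.RUN_JOB);
    }

    public static void main(String[] args) {
        InMemoryResultStore store = new InMemoryResultStore();
        String jobId = "job-1"; // hypothetical job identifier

        // Nothing recorded: this is the situation described above, the job is (re-)executed.
        System.out.println(onLeadershipGranted(store, jobId));

        // Job reached a globally terminal state, cleanup not yet finished.
        store.markDirty(jobId);
        System.out.println(onLeadershipGranted(store, jobId));

        // Cleanup finished; the entry still tells a new leader not to re-run the job.
        store.markClean(jobId);
        System.out.println(onLeadershipGranted(store, jobId));
    }
}
```

The point of the sketch is only that the "job finished" marker has to be written before cleanup starts and has to live somewhere a new leader can read it; where exactly it is stored and when entries are eventually removed is a separate design decision.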

Issue Links

relates to

FLINK-19816	Flink restored from a wrong checkpoint (a very old one and not the last completed one)	Closed
FLINK-21928	DuplicateJobSubmissionException after JobManager failover	Closed
FLINK-21979	Job can be restarted from the beginning after it reached a terminal state	Closed
FLINK-21980	ZooKeeperRunningJobsRegistry creates an empty znode	Closed
FLINK-23874	JM did not store latest checkpoint id into Zookeeper, silently	Closed

links to

GitHub Pull Request #18910

Sub-Tasks

1.	Introduce JobResultStore	Resolved	Matthias Pohl
2.	Implement file-based JobResultStore	Resolved	Mika Naylor
3.	Introduce common interfaces for cleaning up local and global job data	Resolved	Matthias Pohl
4.	Integrate retry strategy for cleanup stage	Closed	Matthias Pohl
5.	Reorganizes tests around Dispatcher cleanup	Resolved	Matthias Pohl
6.	Add cleanup tests to BlobServerCleanupTest	Resolved	Matthias Pohl
7.	Add JobManagerRunner implementation that picks up dirty job results to be cleaned up	Resolved	Matthias Pohl
8.	Rename ArchivedExecutionGraph.createFromInitializingJob into more generic createSparseArchivedExecutionGraph	Resolved	Matthias Pohl
9.	Make cancellation of jobs depend on the JobResultStore	Resolved	Matthias Pohl
10.	DispatcherTest.testJobDataAreCleanedUpInCorrectOrderOn*Job can be removed	Resolved	Matthias Pohl
11.	e2e test covering the main functionality of the JobResultStore	Resolved	Unassigned
12.	FileSystemJobResultStore fails to access Minio	Resolved	Matthias Pohl
13.	Add debug log message when marking a job result as dirty	Resolved	Matthias Pohl
14.	JobManagerMetricGroup needs to implement GloballyCleanableResource as well	Resolved	Matthias Pohl
15.	Add missing documentation	Resolved	Mika Naylor
16.	Make max retries configurable	Resolved	Matthias Pohl

People

Assignee: Matthias Pohl

Reporter: Till Rohrmann

Votes: 0

Watchers: 18

Dates

Created: 04/Mar/19 15:44

Updated: 01/Mar/22 14:56

Resolved: 01/Mar/22 14:56