Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 1.6.4, 1.7.2, 1.8.0, 1.9.3, 1.10.3, 1.11.3, 1.13.1, 1.12.4
Description
Currently, a standby Dispatcher in per-job mode can restart an already terminated job after it gains leadership. The problem is that we clear the RunningJobsRegistry once a job has reached a globally terminal state. After the leading Dispatcher terminates, a standby Dispatcher gains leadership. Without the information from the RunningJobsRegistry, it cannot tell whether the job has already been executed or whether it needs to re-execute the job. The Dispatcher then assumes that there was a fault and re-executes the job, which can lead to duplicate results.
I think we need some way to tell standby Dispatchers that a certain job has been successfully executed. One trivial solution could be to not clean up the RunningJobsRegistry, but then we would clutter ZooKeeper.
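To make the failure mode concrete, here is a minimal sketch of the decision a newly elected Dispatcher has to make. It is not Flink's actual code: `JobRecoverySketch`, `InMemoryJobRegistry`, and `shouldExecute` are hypothetical names, and an in-memory map stands in for the ZooKeeper-backed registry. The assumption is that a job without a registry entry looks like a job that has never run.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the recovery decision; not Flink's actual classes. */
public class JobRecoverySketch {

    /** Scheduling status of a job as visible to every Dispatcher. */
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    /** In-memory stand-in for a registry that survives Dispatcher failover. */
    static final class InMemoryJobRegistry {
        private final Map<String, JobSchedulingStatus> statuses = new ConcurrentHashMap<>();

        void setRunning(String jobId) { statuses.put(jobId, JobSchedulingStatus.RUNNING); }

        void setFinished(String jobId) { statuses.put(jobId, JobSchedulingStatus.DONE); }

        /** Removes the entry; afterwards the job is indistinguishable from a new one. */
        void clear(String jobId) { statuses.remove(jobId); }

        /** Jobs without an entry are reported as PENDING (assumption of this sketch). */
        JobSchedulingStatus getStatus(String jobId) {
            return statuses.getOrDefault(jobId, JobSchedulingStatus.PENDING);
        }
    }

    /** A new leader executes the job unless the registry marks it as DONE. */
    static boolean shouldExecute(InMemoryJobRegistry registry, String jobId) {
        return registry.getStatus(jobId) != JobSchedulingStatus.DONE;
    }

    public static void main(String[] args) {
        InMemoryJobRegistry registry = new InMemoryJobRegistry();
        registry.setRunning("job-1");
        registry.setFinished("job-1");

        // Behavior described above: the entry is cleared on a globally terminal state.
        registry.clear("job-1");
        System.out.println(shouldExecute(registry, "job-1")); // true: the job would run again

        // Alternative: keep the DONE marker so the new leader can skip the job.
        registry.setFinished("job-1");
        System.out.println(shouldExecute(registry, "job-1")); // false
    }
}
```

Keeping the marker is the trivial solution mentioned above; its cost is that entries accumulate in the backing store unless something eventually removes them.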
Issue Links
- relates to
  - FLINK-19816 Flink restored from a wrong checkpoint (a very old one and not the last completed one) (Closed)
  - FLINK-21928 DuplicateJobSubmissionException after JobManager failover (Closed)
  - FLINK-21979 Job can be restarted from the beginning after it reached a terminal state (Closed)
  - FLINK-21980 ZooKeeperRunningJobsRegistry creates an empty znode (Closed)
  - FLINK-23874 JM did not store latest checkpiont id into Zookeeper, silently (Closed)
- links to