Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- 1.9.3, 1.10.3, 1.11.3, 1.12.2
Description
ZooKeeperRunningJobsRegistry#writeEnumToZooKeeper calls
this.client.newNamespaceAwareEnsurePath(zkPath).ensure(client.getZookeeperClient());
This creates an empty znode in ZooKeeper. If the job manager is interrupted at this point, before any status has been written to the znode, it cannot recover: when a restarted job manager tries to restore jobs, ZooKeeperRunningJobsRegistry#getJobSchedulingStatus throws an exception because the znode is empty.
This behavior was verified in a test environment where the job manager was interrupted at that point in execution, leaving ZooKeeper in the following state:
[zk: localhost:2181(CONNECTED) 2] ls /flink/default
[checkpoint-counter, checkpoints, jobgraphs, leader, leaderlatch, running_job_registry]
[zk: localhost:2181(CONNECTED) 3] ls /flink/default/running_job_registry
[c982053dd0b9100967e6a9d89202f2a5]
[zk: localhost:2181(CONNECTED) 4] get /flink/default/running_job_registry/c982053dd0b9100967e6a9d89202f2a5
[zk: localhost:2181(CONNECTED) 5]
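The failure mode described above can be reproduced without a ZooKeeper cluster. The following is a minimal sketch, not Flink's actual code: the class EmptyZnodeSketch, its in-memory map standing in for ZooKeeper, and the methods ensurePath, setStatus, readStrict and readTolerant are all hypothetical. It assumes the registry stores the enum name as the znode payload and parses it back with Enum.valueOf, which rejects an empty string. The tolerant read shows one possible mitigation (treating an empty znode as PENDING); it is not necessarily the fix that was applied for this issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the registry; a HashMap plays the role of ZooKeeper.
class EmptyZnodeSketch {

    // Assumed to mirror the registry's status enum.
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    // znode path -> payload bytes
    static final Map<String, byte[]> znodes = new HashMap<>();

    // Step 1 of the write: like ensuring a path, create the node with no data if absent.
    static void ensurePath(String path) {
        znodes.putIfAbsent(path, new byte[0]);
    }

    // Step 2 of the write: store the enum name as the payload.
    static void setStatus(String path, JobSchedulingStatus status) {
        znodes.put(path, status.name().getBytes(StandardCharsets.UTF_8));
    }

    // Strict read: assumes an existing znode always holds a valid enum name.
    static JobSchedulingStatus readStrict(String path) {
        byte[] data = znodes.get(path);
        if (data == null) {
            return JobSchedulingStatus.PENDING; // no znode: job not registered yet
        }
        // Throws IllegalArgumentException when the payload is empty.
        return JobSchedulingStatus.valueOf(new String(data, StandardCharsets.UTF_8));
    }

    // Tolerant read (assumed mitigation): an empty znode is treated like a missing one.
    static JobSchedulingStatus readTolerant(String path) {
        byte[] data = znodes.get(path);
        if (data == null || data.length == 0) {
            return JobSchedulingStatus.PENDING;
        }
        return JobSchedulingStatus.valueOf(new String(data, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String path = "/flink/default/running_job_registry/c982053dd0b9100967e6a9d89202f2a5";

        // Simulate an interruption between step 1 and step 2: only the empty node exists.
        ensurePath(path);

        try {
            readStrict(path);
        } catch (IllegalArgumentException e) {
            System.out.println("strict read failed: " + e.getMessage());
        }
        System.out.println("tolerant read: " + readTolerant(path));
    }
}
```

In this sketch the strict read fails only in the window between the two write steps, which matches the interrupted state shown in the zkCli session: the znode exists but `get` returns no data.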
Issue Links
- is related to
  - FLINK-21928 DuplicateJobSubmissionException after JobManager failover (Closed)
  - FLINK-11813 Standby per job mode Dispatchers don't know job's JobSchedulingStatus (Closed)