[FLINK-21980] ZooKeeperRunningJobsRegistry creates an empty znode - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.9.3, 1.10.3, 1.11.3, 1.12.2
Fix Version/s: 1.11.4, 1.13.0, 1.12.3
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

ZooKeeperRunningJobsRegistry#writeEnumToZooKeeper calls

this.client.newNamespaceAwareEnsurePath(zkPath).ensure(client.getZookeeperClient());

This creates an empty znode in ZooKeeper before the status is written to it. If the JobManager is interrupted at this point, it cannot recover: when a restarted JobManager tries to restore jobs, ZooKeeperRunningJobsRegistry#getJobSchedulingStatus throws an exception because of the empty znode.

This behavior was verified in a test environment where the JobManager was interrupted at that point in execution, leaving ZooKeeper in the following state:

[zk: localhost:2181(CONNECTED) 2] ls /flink/default
[checkpoint-counter, checkpoints, jobgraphs, leader, leaderlatch, running_job_registry]
[zk: localhost:2181(CONNECTED) 3] ls /flink/default/running_job_registry 
[c982053dd0b9100967e6a9d89202f2a5]
[zk: localhost:2181(CONNECTED) 4] get /flink/default/running_job_registry/c982053dd0b9100967e6a9d89202f2a5 

[zk: localhost:2181(CONNECTED) 5]
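
The failure mode described above can be illustrated with a small, self-contained sketch. It does not use Flink or Curator: an in-memory map stands in for ZooKeeper, and every name in it (EmptyZnodeSketch, Status, ensurePath, writeTwoStep, writeSingleStep, read) is hypothetical. It also assumes that a missing znode is read as PENDING and that an empty payload fails enum parsing. The single-step write is only one possible way to close the window and is not necessarily what the linked pull request does.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for ZooKeeper. This is NOT Flink's or Curator's API;
// all names here are hypothetical and exist only for illustration.
class EmptyZnodeSketch {

    // Hypothetical status values for this sketch.
    enum Status { PENDING, RUNNING, DONE }

    // Maps znode path -> payload. A zero-length array models an empty znode.
    static final Map<String, byte[]> znodes = new HashMap<>();

    // Models the "ensure path" step: create the node without data if absent.
    static void ensurePath(String path) {
        znodes.putIfAbsent(path, new byte[0]);
    }

    // Two-step write: ensure the path, then set the data.
    // An interruption between the steps leaves an empty znode behind.
    static void writeTwoStep(String path, Status status, boolean interruptAfterEnsure) {
        ensurePath(path);
        if (interruptAfterEnsure) {
            return; // simulated crash before the payload is written
        }
        znodes.put(path, status.name().getBytes(StandardCharsets.UTF_8));
    }

    // Single-step write (assumed mitigation): the node either appears with
    // its payload or does not appear at all.
    static void writeSingleStep(String path, Status status, boolean interruptBeforeWrite) {
        if (interruptBeforeWrite) {
            return; // simulated crash; nothing was created
        }
        znodes.put(path, status.name().getBytes(StandardCharsets.UTF_8));
    }

    // Assumption of this sketch: a missing node means PENDING, while an
    // empty payload cannot be parsed and raises IllegalArgumentException.
    static Status read(String path) {
        byte[] data = znodes.get(path);
        if (data == null) {
            return Status.PENDING;
        }
        return Status.valueOf(new String(data, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        writeTwoStep("/running_job_registry/job-a", Status.RUNNING, true);
        try {
            read("/running_job_registry/job-a");
        } catch (IllegalArgumentException e) {
            System.out.println("two-step write left an unreadable znode");
        }

        writeSingleStep("/running_job_registry/job-b", Status.RUNNING, true);
        System.out.println("single-step write reads as " + read("/running_job_registry/job-b"));
    }
}
```

In the two-step variant the interrupted write leaves a node that exists but cannot be parsed, which matches the empty result of the `get` command above. In the single-step variant the same interruption leaves no node, so a restarted process sees the job as not yet registered rather than hitting corrupt data.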

Issue Links

is related to:
- FLINK-21928 DuplicateJobSubmissionException after JobManager failover (Closed)
- FLINK-11813 Standby per job mode Dispatchers don't know job's JobSchedulingStatus (Closed)

links to:
- GitHub Pull Request #15393

People

Assignee: Ricky Burnett
Reporter: Ricky Burnett
Votes: 0
Watchers: 5

Dates

Created: 25/Mar/21 21:28
Updated: 28/May/21 09:00
Resolved: 07/Apr/21 17:00