[FLINK-24038] DispatcherResourceManagerComponent fails to deregister application if no leading ResourceManager - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.14.0
Fix Version/s: 1.15.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Release Note:

A new multiple-component leader election service was implemented that runs only a single leader election per Flink process. If this causes any problems, you can set `high-availability.use-old-ha-services: true` in `flink-conf.yaml` to fall back to the old high availability services.

Description

With FLINK-21667 we introduced a change that can cause the DispatcherResourceManagerComponent to fail when trying to stop the application. The problem is that the DispatcherResourceManagerComponent needs a leading ResourceManager in its own process to successfully execute the stop/deregister application call. If there is none, the call fails fatally. With multiple standby JobManager processes, the leading ResourceManager may well run in a different process.

I do see two possible solutions:

1. Run the leader election process for the whole JobManager process
2. Move the registration/deregistration of the application out of the ResourceManager so that it can be executed without a leader
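
To make the failure mode and the first option concrete, here is a minimal toy model, not Flink code. All names (`LeaderContender`, `ToyResourceManager`, `ToyDispatcher`, `ProcessWideLeaderElection`) are hypothetical and chosen for illustration only. The sketch also assumes that the process stopping the application is the one whose Dispatcher holds leadership. It only shows why separate per-component elections can leave a process without a leading ResourceManager, and how a single election per process avoids that.

```java
import java.util.ArrayList;
import java.util.List;

public class LeaderElectionSketch {

    /** Hypothetical callback for anything that takes part in leader election. */
    interface LeaderContender {
        void grantLeadership();
        void revokeLeadership();
    }

    /** Toy stand-in for a ResourceManager; not Flink's class. */
    static final class ToyResourceManager implements LeaderContender {
        private boolean leader;

        @Override public void grantLeadership() { leader = true; }
        @Override public void revokeLeadership() { leader = false; }

        /** Models the constraint from the issue: only a leading instance can deregister. */
        boolean deregisterApplication() { return leader; }
    }

    /** Toy stand-in for a Dispatcher; not Flink's class. */
    static final class ToyDispatcher implements LeaderContender {
        private boolean leader;

        @Override public void grantLeadership() { leader = true; }
        @Override public void revokeLeadership() { leader = false; }

        boolean isLeader() { return leader; }
    }

    /** Option 1: one election per process, fanned out to every registered component. */
    static final class ProcessWideLeaderElection {
        private final List<LeaderContender> contenders = new ArrayList<>();
        private boolean leader;

        void register(LeaderContender contender) {
            contenders.add(contender);
            if (leader) {
                contender.grantLeadership();
            }
        }

        void onProcessElected() {
            leader = true;
            contenders.forEach(LeaderContender::grantLeadership);
        }

        void onProcessRevoked() {
            leader = false;
            contenders.forEach(LeaderContender::revokeLeadership);
        }
    }

    /** Separate elections: the local Dispatcher wins, the ResourceManager leader is elsewhere. */
    static boolean stopWithPerComponentElection() {
        ToyDispatcher dispatcher = new ToyDispatcher();
        ToyResourceManager resourceManager = new ToyResourceManager();
        dispatcher.grantLeadership(); // the local ResourceManager is never granted leadership
        return dispatcher.isLeader() && resourceManager.deregisterApplication();
    }

    /** Single election: both components become leader together in the same process. */
    static boolean stopWithProcessWideElection() {
        ToyDispatcher dispatcher = new ToyDispatcher();
        ToyResourceManager resourceManager = new ToyResourceManager();
        ProcessWideLeaderElection election = new ProcessWideLeaderElection();
        election.register(dispatcher);
        election.register(resourceManager);
        election.onProcessElected();
        return dispatcher.isLeader() && resourceManager.deregisterApplication();
    }

    public static void main(String[] args) {
        System.out.println("per-component election, deregistration succeeds: "
                + stopWithPerComponentElection());
        System.out.println("process-wide election, deregistration succeeds: "
                + stopWithProcessWideElection());
    }
}
```

In this model the per-component case reports `false` and the process-wide case reports `true`. The second option from the list would instead remove the leadership check from the deregistration path altogether.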

Issue Links

blocks
- FLINK-23946 Application mode fails fatally when being shut down (Resolved)
- FLINK-25235 Re-enable ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange (Resolved)

causes
- FLINK-26630 EmbeddedHaServices is not made for recovery on a single instance (Open)
- FLINK-25981 ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers failed (Resolved)

is caused by
- FLINK-21667 Standby RM might remove resources from Kubernetes (Closed)

is related to
- FLINK-25432 Introduce common interfaces for cleaning up local and global job data (Resolved)
- FLINK-33598 Watch HA configmap via name instead of lables to reduce pressure on APIserver (Resolved)

relates to
- FLINK-25500 ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange failed on azure (Reopened)
- FLINK-25847 KubernetesHighAvailabilityRecoverFromSavepointITCase.testRecoverFromSavepoint failed on azure (Closed)
- FLINK-27358 Kubernetes operator throws NPE when testing with Flink 1.15 (Closed)
- FLINK-25393 Make ConfigMap Name for Leader Election Configurable (Open)
- FLINK-25806 Remove legacy high availability services (Closed)

links to
- GitHub Pull Request #17485

People

Assignee: Till Rohrmann

Reporter: Till Rohrmann

Votes: 0

Watchers: 11

Dates

Created: 28/Aug/21 17:49

Updated: 20/Nov/23 13:06

Resolved: 26/Jan/22 22:56