Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.14.0
Description
With FLINK-21667 we introduced a change that can cause the DispatcherResourceManagerComponent to fail when it tries to stop the application. The DispatcherResourceManagerComponent needs a leading ResourceManager to successfully execute the stop/deregister application call. If its ResourceManager is not the leader, the call fails fatally. With multiple standby JobManager processes, the leading ResourceManager can be running in a different process.
I see two possible solutions:
1. Run the leader election process for the whole JobManager process
2. Move the registration/deregistration of the application out of the ResourceManager so that it can be executed without a leader
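To make the failure mode and option 2 concrete, here is a minimal sketch in plain Java. It is not Flink code: ShutdownSketch, ClusterFramework, ResourceManagerStub and both stop methods are hypothetical names invented for illustration, and the real fix may look quite different.

```java
/** Simplified model of the shutdown problem. All names are hypothetical, none are Flink APIs. */
final class ShutdownSketch {

    /** Stand-in for the external cluster framework the application is registered with. */
    static final class ClusterFramework {
        private boolean registered = true;

        void deregisterApplication() {
            registered = false;
        }

        boolean isRegistered() {
            return registered;
        }
    }

    /** Stand-in for a ResourceManager that only acts while it holds leadership. */
    static final class ResourceManagerStub {
        private final boolean leader;
        private final ClusterFramework framework;

        ResourceManagerStub(boolean leader, ClusterFramework framework) {
            this.leader = leader;
            this.framework = framework;
        }

        void deregisterApplication() {
            if (!leader) {
                // Assumption: a non-leading ResourceManager rejects the call.
                throw new IllegalStateException("ResourceManager in this process is not the leader");
            }
            framework.deregisterApplication();
        }
    }

    /** Behaviour described in the issue: shutdown is routed through the local ResourceManager. */
    static boolean stopViaResourceManager(ResourceManagerStub resourceManager) {
        try {
            resourceManager.deregisterApplication();
            return true;
        } catch (IllegalStateException e) {
            // In the issue this surfaces as a fatal failure; the sketch only reports it.
            return false;
        }
    }

    /** Option 2: deregistration lives outside the ResourceManager and needs no leader. */
    static boolean stopWithoutLeader(ClusterFramework framework) {
        framework.deregisterApplication();
        return true;
    }

    public static void main(String[] args) {
        ClusterFramework framework = new ClusterFramework();
        ResourceManagerStub standby = new ResourceManagerStub(false, framework);

        System.out.println("stop via standby ResourceManager succeeded: "
                + stopViaResourceManager(standby));
        System.out.println("stop without leader succeeded: "
                + stopWithoutLeader(framework));
        System.out.println("application still registered: " + framework.isRegistered());
    }
}
```

In this sketch, option 1 would correspond to constructing the ResourceManagerStub with the same leadership flag as the rest of the process, so that the process that stops the application is always the one holding leadership. An actual implementation of option 2 would still need to decide which process is allowed to deregister, which the sketch leaves open.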
Issue Links
- blocks
  - FLINK-23946 Application mode fails fatally when being shut down (Resolved)
  - FLINK-25235 Re-enable ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange (Resolved)
- causes
  - FLINK-26630 EmbeddedHaServices is not made for recovery on a single instance (Open)
  - FLINK-25981 ZooKeeperMultipleComponentLeaderElectionDriverTest.testLeaderElectionWithMultipleDrivers failed (Resolved)
- is caused by
  - FLINK-21667 Standby RM might remove resources from Kubernetes (Closed)
- is related to
  - FLINK-25432 Introduce common interfaces for cleaning up local and global job data (Resolved)
  - FLINK-33598 Watch HA configmap via name instead of lables to reduce pressure on APIserver (Resolved)
- relates to
  - FLINK-25500 ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange failed on azure (Reopened)
  - FLINK-25847 KubernetesHighAvailabilityRecoverFromSavepointITCase.testRecoverFromSavepoint failed on azure (Closed)
  - FLINK-27358 Kubernetes operator throws NPE when testing with Flink 1.15 (Closed)
  - FLINK-25393 Make ConfigMap Name for Leader Election Configurable (Open)
  - FLINK-25806 Remove legacy high availability services (Closed)