[FLINK-21667] Standby RM might remove resources from Kubernetes - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.12.2
Fix Version/s: 1.14.0
Component/s: Deployment / Kubernetes, Runtime / Coordination
Labels:
- pull-request-available

Description

Currently, KubernetesResourceManagerDriver starts a watch for pod events on initialization. As a result, it may start receiving events before obtaining leadership, and a standby RM may remove terminated pods from Kubernetes while handling those events.

This is not very damaging at the moment, since the removed pods are already terminated anyway. However, it would still be good for a standby RM to strictly follow the contract and make no modifications before obtaining leadership. We might consider postponing the start of the watch until leadership is granted.
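
To illustrate the suggestion above, here is a minimal sketch of a watcher that only opens the pod watch once leadership is granted and closes it again on revocation. All names (`LeaderGatedPodWatcher`, `PodClient`, `PodEventHandler`, `Watch`) are hypothetical stand-ins and not Flink or Kubernetes client APIs; this is not necessarily how the linked pull request resolved the issue.

```java
/**
 * Hypothetical sketch: defer starting the pod watch until leadership is granted,
 * so that a standby instance never receives events and never modifies the cluster.
 */
final class LeaderGatedPodWatcher {

    /** Assumed stand-in for the client operations a driver would need. */
    public interface PodClient {
        /** Registers a handler for pod events and returns a handle to stop the watch. */
        Watch watchPods(PodEventHandler handler);

        /** Removes a pod from the cluster (a modification only a leader should make). */
        void deletePod(String podName);
    }

    /** Assumed callback for terminated-pod events. */
    public interface PodEventHandler {
        void onPodTerminated(String podName);
    }

    /** Assumed handle for an open watch. */
    public interface Watch {
        void close();
    }

    private final PodClient client;

    /** Null while this instance is a standby. */
    private Watch watch;

    LeaderGatedPodWatcher(PodClient client) {
        // Note: no watch is started here, unlike the behavior described in this issue.
        this.client = client;
    }

    /** Called when this instance becomes the leader: only now start the watch. */
    public synchronized void onGrantLeadership() {
        if (watch == null) {
            // Terminated pods are removed only while the watch is open, i.e. while leading.
            watch = client.watchPods(client::deletePod);
        }
    }

    /** Called when leadership is lost: stop receiving events. */
    public synchronized void onRevokeLeadership() {
        if (watch != null) {
            watch.close();
            watch = null;
        }
    }

    public synchronized boolean isWatching() {
        return watch != null;
    }
}
```

With this shape, constructing the watcher has no side effects on the cluster; events can only arrive between `onGrantLeadership()` and `onRevokeLeadership()`.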

Issue Links

blocks

FLINK-17707 Support configuring replica of Deployment based HA setups

Closed

causes

FLINK-23240 ResumeCheckpointManuallyITCase.testExternalizedFSCheckpointsWithLocalRecoveryZookeeper fails on azure

Closed

FLINK-24038 DispatcherResourceManagerComponent fails to deregister application if no leading ResourceManager

Closed

FLINK-25885 ClusterEntrypointTest.testWorkingDirectoryIsDeletedIfApplicationCompletes failed on azure

Closed

is related to

FLINK-22816 Investigate feasibility of supporting multiple RM leader sessions within JM process

Closed

links to

GitHub Pull Request #15524

People

Assignee: Xintong Song

Reporter: Xintong Song

Votes: 1

Watchers: 8

Dates

Created: 08/Mar/21 11:01

Updated: 08/Feb/22 08:38

Resolved: 03/Jun/21 11:02