[FLINK-32010] KubernetesLeaderRetrievalDriver always waits for lease update to resolve leadership - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.17.0, 1.16.1, 1.18.0
Fix Version/s: 1.16.2, 1.18.0, 1.17.1
Component/s: Deployment / Kubernetes, Runtime / Coordination
Labels:
- pull-request-available

Description

The k8s-based leader retrieval is based on ConfigMap watching. The config map lifecycle (from the consumer point of view) is handled as a series of events with the following types:

ADDED -> the first time the consumer has seen the CM
UPDATED -> any further changes to the CM
DELETED -> ... you get the idea

The implementation assumes that the ElectionDriver (the one that creates the CM) and the ElectionRetriever are started simultaneously, and it therefore ignores ADDED events, because the CM is always created empty and is updated with the leadership information later on.

This assumption is incorrect in the following cases (I might be missing some, but that's not important, the goal is to illustrate the problem):

TM joining the cluster later when the leaders are established to discover RM / JM
RM tries to discover JM when MultipleComponentLeaderElectionDriver is used

This leads, for example, to higher job submission latencies: a submission can be held back unnecessarily for up to the lease retry period [1].

[1] Configured by high-availability.kubernetes.leader-election.retry-period
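
The late-joiner case described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `LeaderRetrievalSketch`, `ConfigMapEvent`, and `firstResolvedAt` are hypothetical names invented for illustration, and the code does not use or mirror the actual Flink or Kubernetes client classes, nor the change made in the linked pull request. It replays a sequence of watch events and reports at which event the leader first becomes known, with and without handling ADDED.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model for illustration only.
// These are NOT Flink's or the Kubernetes client's actual classes.
final class LeaderRetrievalSketch {

    enum EventType { ADDED, UPDATED, DELETED }

    /** A watch event carrying the leader address stored in the CM, if any. */
    static final class ConfigMapEvent {
        final EventType type;
        final Optional<String> leaderAddress;

        ConfigMapEvent(EventType type, String leaderAddress) {
            this.type = type;
            this.leaderAddress = Optional.ofNullable(leaderAddress);
        }
    }

    /**
     * Replays the events in order and returns the index of the first event
     * at which the leader becomes known, or -1 if it never does.
     * With handleAdded == false, ADDED events are skipped entirely.
     */
    static int firstResolvedAt(List<ConfigMapEvent> events, boolean handleAdded) {
        for (int i = 0; i < events.size(); i++) {
            ConfigMapEvent e = events.get(i);
            boolean considered = e.type == EventType.UPDATED
                    || (handleAdded && e.type == EventType.ADDED);
            if (considered && e.leaderAddress.isPresent()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simultaneous start: the CM is first seen empty, the leader arrives later.
        List<ConfigMapEvent> simultaneous = Arrays.asList(
                new ConfigMapEvent(EventType.ADDED, null),
                new ConfigMapEvent(EventType.UPDATED, "leader-address"));

        // Late joiner: the CM already holds the leader when it is first seen;
        // the next UPDATED only arrives with a later lease update.
        List<ConfigMapEvent> lateJoiner = Arrays.asList(
                new ConfigMapEvent(EventType.ADDED, "leader-address"),
                new ConfigMapEvent(EventType.UPDATED, "leader-address"));

        System.out.println(firstResolvedAt(simultaneous, false)); // prints 1
        System.out.println(firstResolvedAt(simultaneous, true));  // prints 1
        System.out.println(firstResolvedAt(lateJoiner, false));   // prints 1
        System.out.println(firstResolvedAt(lateJoiner, true));    // prints 0
    }
}
```

In this model, ignoring ADDED makes no difference when both sides start together, but a late joiner only resolves the leader at the following UPDATED event, which corresponds to the wait for the next lease update described in the issue.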

Issue Links

relates to: FLINK-22054 Using a shared watcher for ConfigMap watching (Closed)

links to: GitHub Pull Request #22524

People

Assignee: David Morávek

Reporter: David Morávek

Votes: 0

Watchers: 2

Dates

Created: 05/May/23 09:50

Updated: 06/May/23 06:20

Resolved: 06/May/23 06:20