[FLINK-26630] EmbeddedHaServices is not made for recovery on a single instance - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 1.15.0
Fix Version/s: None
Component/s: Runtime / Coordination
Labels: None

Description

EmbeddedHaServices (and EmbeddedHaServicesWithLeadershipControl) provide leader election within a single JVM. In FLINK-25235 we changed TestingMiniCluster to re-instantiate the HighAvailabilityServices per JobManager (i.e. per DispatcherResourceManagerComponent). This makes it possible to close the HighAvailabilityServices when a JM shuts down, and not only when the HA cluster shuts down, which is closer to a production environment where each JM has its own HAServices instance. This became crucial as part of the work on FLINK-24038, which revokes the leadership when it closes the HAServices during a JM shutdown.

The EmbeddedHaServices, though, provide a no-op StandaloneJobGraphStore implementation, i.e. no real recovery is testable with the TestingMiniCluster (this was already the case before the change in FLINK-25235). We should still fix that so that users can use the TestingMiniCluster to test recovery. That means we should provide a JobGraphStore and a JobResultStore that are shared between the different HighAvailabilityServices instances, and probably also share the checkpoint-related HA components.
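
To illustrate the idea of a store that outlives a single HA services instance, here is a minimal, self-contained sketch. It does not use Flink's actual interfaces: `SharedInMemoryJobStore`, `PerJobManagerHaServices` and `SharedStoreDemo` are hypothetical names introduced only for this example, and job graphs are represented as plain strings. It only shows the intended lifecycle, under the assumption that closing one instance must not discard the shared state.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a job graph store; NOT Flink's JobGraphStore interface.
final class SharedInMemoryJobStore {
    private final Map<String, String> jobGraphs = new ConcurrentHashMap<>();

    void put(String jobId, String serializedGraph) {
        jobGraphs.put(jobId, serializedGraph);
    }

    Optional<String> recover(String jobId) {
        return Optional.ofNullable(jobGraphs.get(jobId));
    }
}

// Hypothetical per-JobManager HA services wrapper; NOT Flink's HighAvailabilityServices.
final class PerJobManagerHaServices implements AutoCloseable {
    private final SharedInMemoryJobStore sharedStore;
    private boolean closed;

    PerJobManagerHaServices(SharedInMemoryJobStore sharedStore) {
        this.sharedStore = sharedStore;
    }

    SharedInMemoryJobStore getJobStore() {
        if (closed) {
            throw new IllegalStateException("HA services already closed");
        }
        return sharedStore;
    }

    // Closing only invalidates this instance; the shared store keeps its content.
    @Override
    public void close() {
        closed = true;
    }
}

final class SharedStoreDemo {
    // Simulates a JM shutdown followed by a second JM recovering the job.
    static boolean recoveredAfterFailover() {
        SharedInMemoryJobStore shared = new SharedInMemoryJobStore();

        PerJobManagerHaServices first = new PerJobManagerHaServices(shared);
        first.getJobStore().put("job-1", "serialized-job-graph");
        first.close(); // first JM shuts down and closes its HA services

        PerJobManagerHaServices second = new PerJobManagerHaServices(shared);
        return second.getJobStore().recover("job-1").isPresent();
    }
}
```

In this sketch the second instance can still recover `job-1` because the store is owned outside of the per-JM instances. With a store created inside each instance (the analogue of the current no-op behavior) nothing would be recoverable after the first instance is closed.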

Right now, the multi-JM setup of the TestingMiniCluster is only used in ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange, where it's bound to the ZooKeeperHAServices. Therefore, it's not a pressing issue for 1.15, but we should fix it as a follow-up.

Issue Links

is caused by

FLINK-24038 DispatcherResourceManagerComponent fails to deregister application if no leading ResourceManager (Closed)

FLINK-25235 Re-enable ZooKeeperLeaderElectionITCase#testJobExecutionOnClusterWithLeaderChange (Resolved)

is related to

FLINK-26502 Multiple component leader election has different close/stop behavior (Closed)

FLINK-26556 Refactoring MiniCluster and TestingMiniCluster (Open)

relates to

FLINK-31816 Refactor EmbeddedLeaderElectionService (Open)

People

Assignee: Unassigned

Reporter: Matthias Pohl

Votes: 0

Watchers: 2

Dates

Created: 14/Mar/22 09:27

Updated: 10/Jul/23 07:24