Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.16.0
-
None
-
None
Description
After working with the MiniCluster & TestingMiniCluster on FLINK-25235, I saw some shortcomings that I patched up but did not fix. I would recommend that we change it for 1.16, because it is becoming more and more difficult to understand what is happening.
Background
- We recently changed the implementation of the leader election from a separate LeaderElectionDriver per JobManager component (e.g., dispatcher, rest server, etc.) to a combined LeaderElectionDriver for the JobManager as a whole. (For Zoo Keeper based leader election we previously used a ZooKeeperLeaderElectionDriver. Now we use MultipleComponentLeaderElectionDriverAdapter which de-multiplexes the leader election for all JobManager components via a single MultipleComponentLeaderElectionService to a single ZooKeeperMultipleComponentLeaderElectionDriver underneath.) This changed the mapping between HighAvailabilityServices and LeaderElectionDriver from a one-to-many to a one-to-one relationship.
- HighAvailabilityServices is the class responsible for high availability (e.g., leader fail-over). However, even though JobManager and TaskManager have a dependecy on this class, not all scenarios require high availability. The implementation AbstractNonHaServices and its EmeddedHaServices (single JVM setup) and StandaloneHaServices (no support for JobManager failures) are used for these scenarios.
- MiniCluster is the class responsible for managing a single-node Flink cluster. It contains a single HighAvailabilityServices object that is shared by the JobManager and multiple TaskManager.
- TestingMiniCluster extends the MiniCluster for test purposes. Among others, it is used for tests on leader election between multiple JobManager.
Issues with MiniCluster
- The MiniCluster is meant to "to execute Flink jobs locally". To me this means that the MiniCluster should only use EmbeddedHaServices as it does not need high availability on a single JVM. However, it does not put any constraints on the type of high availability services used.
- Although the MiniCluster is production code. it contains code that is meant exclusively for testing. MiniCluster#getHaLeadershipControl is used to give a test explicit control over the leader election of an EmbeddedHaService. Btw, this method depends on HighAvailabilityServices being an EmbeddedHaServicesWithLeadershipControl object.
- The code to create the HighAvailabilityServices object seems needlessly complex. In MiniCluster#createHighAvailabilityServices we differentiate between EmbeddedWithControlHighAvailabilityServices and everything else. In HighAvailabilityServicesUtils we distinguish between EmbeddedHaServices, ZooKeeperHaServices &ZooKeeperMultipleComponentLeaderElectionHaServices, and ones that are produced by HighAvailabilityServicesFactory. It would be nice, if we could flatten this into one method. Especially with the additional option to configure it via the TestingMiniCluster (see below). (I also don't understand why there is no Kubernetes option. Even though there is a KubernetesHaServicesFactory.)
Issues with TestingMiniCluster
- The name TestingSomething is usually used for mock objects. In contrast, the TestingMiniCluster does not mock anything.
- Its doc string says its used "to set a custom
{@link HighAvailabilityServices}
", but the MiniCluster already allows it.
- The use of the TestingMiniCluster seems to be to configure the number of JobManagers, configure the HighAvailabilityServices (which looks redundant see below), and to configure the TaskManager to use only local communication even when more than one TaskManager exists. Why can this not be optional settings on the MiniCluster itself?
- The TestingMiniCluster is used by tests that need multiple JobManager with separate leader election (e.g., ZooKeeperLeaderElectionITCase) and some that require that all parts including the TaskExecutor share the same HighAvailabilityServices (e.g., JobExecutionITCase).
- TestingMiniCluster allows overriding the method MiniCluster#createHighAvailabilityServices for creating a HighAvailabilityServices. However, that method already has a two step process of creating the HighAvailabilityServices. The existing process even includes the option of using a custom factory. Again, this redundancy makes the code hard to understand.
Conclusion
- The MiniCluster and TestingMiniCluster cover a large variety of different use cases that are related to what implementation of HighAvaibilityServices is used. Splitting these classes into one per use case would greatly improve the readability.
- The reason for using inheritance is not clear to me. First, the interaction between the subclass and parent class is hard to follow. Second, I don't see in what scenarios you would want to replace an instance of MiniCluster with TestingMiniCluster, so there should be no need to implement the same methods. (Could be wrong on this though.) Third, having a single non-abstract parent class with a single subclass on its own sounds like a degenerate inheritance tree.
Refactoring options
- We can strongly couple the MiniCluster with the type of HighlyAvailabilityServices and have one EmbeddedMiniCluster and one HighlyAvailableMiniCluster.
- Alternatively, we can split the generation of the HighlyAvailabilityServices object into a separate factory (and hopefully make the generation process less deep). With some additional settings, the MiniCluster could then take over the role of the TestingMiniCluster.
PS: To be exact, I should have used the term DispatcherResourceManagerComponent instead of JobManager, but this would've made the text even harder to read.
Attachments
Issue Links
- relates to
-
FLINK-26630 EmbeddedHaServices is not made for recovery on a single instance
- Open