Details
- Type: Improvement
- Status: Closed
- Priority: Critical
- Resolution: Fixed
Description
Startup rebuilds the full state of the cluster. This is currently called recovery. The name is misleading: nothing is being recovered, the current state is simply being loaded. State initialisation is a better term.
The current recovery code links the loading of applications and tasks (pods) to node loading. This makes the recovery code complex and therefore fragile. In a worst-case scenario, it could lead to a pod not being recovered correctly.
Recovery should be a step-by-step process with clear boundaries between the steps:
- load nodes
- register nodes with the core
- load pods
- create applications in core
- register running pods as allocations with the core
- register pending pods as asks with the core
- process changes for nodes and pods
- start scheduling
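The steps above can be sketched as a sequence of phases that each run to completion before the next one starts. This is only an illustration of the ordering, not the actual YuniKorn code: the `Pod` and `Core` types, the `InitialiseState` function, and the rule that a pod with an empty `Node` field counts as pending are all assumptions made for this example.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified view of a pod.
// An empty Node means the pod is still pending.
type Pod struct {
	Name  string
	AppID string
	Node  string
}

// Core is a hypothetical stand-in for the scheduler core.
// It only records what was registered and in which order.
type Core struct {
	Nodes       []string
	Apps        []string
	Allocations []string
	Asks        []string
	Log         []string
}

func (c *Core) step(name string) { c.Log = append(c.Log, name) }

// InitialiseState runs the phases strictly one after another, so node
// loading is never interleaved with application or pod loading.
func InitialiseState(core *Core, nodes []string, pods []Pod) {
	core.step("load nodes")
	loadedNodes := append([]string(nil), nodes...)

	core.step("register nodes")
	core.Nodes = append(core.Nodes, loadedNodes...)

	core.step("load pods")
	loadedPods := append([]Pod(nil), pods...)

	core.step("create applications")
	seen := map[string]bool{}
	for _, p := range loadedPods {
		if !seen[p.AppID] {
			seen[p.AppID] = true
			core.Apps = append(core.Apps, p.AppID)
		}
	}

	core.step("register allocations")
	for _, p := range loadedPods {
		if p.Node != "" { // running pod
			core.Allocations = append(core.Allocations, p.Name)
		}
	}

	core.step("register asks")
	for _, p := range loadedPods {
		if p.Node == "" { // pending pod
			core.Asks = append(core.Asks, p.Name)
		}
	}

	// Changes that arrived while loading would be replayed here.
	core.step("process changes")
	core.step("start scheduling")
}

func main() {
	core := &Core{}
	InitialiseState(core, []string{"node-1"}, []Pod{
		{Name: "pod-a", AppID: "app-1", Node: "node-1"},
		{Name: "pod-b", AppID: "app-1"},
	})
	fmt.Println(core.Log)
	fmt.Println(core.Apps, core.Allocations, core.Asks)
}
```

Keeping each phase in its own block makes the boundaries explicit: a failure or a late event in one phase cannot leave a node half-registered while pods are already being attached to it.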
No nodes, applications, or asks on existing applications should be declined. Even if the queue does not exist, a running application must be added and handled. The current behaviour, rejecting an application that cannot be placed in the queue, is incorrect.
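The issue does not say how an application without a queue should be handled. One possible approach, shown here purely as an assumption, is to park such an application in a placeholder queue instead of rejecting it. The `Core` type, the `AddAppDuringInit` function, and the `fallbackQueue` name are hypothetical and exist only for this sketch.

```go
package main

import "fmt"

// fallbackQueue is a hypothetical placeholder name, not taken from the issue.
const fallbackQueue = "hypothetical-fallback"

// Core is a hypothetical stand-in for the scheduler core.
type Core struct {
	Queues map[string]bool   // queues that currently exist
	Apps   map[string]string // application ID -> queue that holds it
}

// AddAppDuringInit never rejects an application. If the requested queue
// does not exist, the application is tracked under the fallback queue.
// It returns the queue the application ended up in.
func (c *Core) AddAppDuringInit(appID, queue string) string {
	if !c.Queues[queue] {
		queue = fallbackQueue
	}
	c.Apps[appID] = queue
	return queue
}

func main() {
	core := &Core{
		Queues: map[string]bool{"root.default": true},
		Apps:   map[string]string{},
	}
	fmt.Println(core.AddAppDuringInit("app-1", "root.default"))
	fmt.Println(core.AddAppDuringInit("app-2", "root.removed"))
	fmt.Println(len(core.Apps))
}
```

The point of the sketch is the missing error return: during state initialisation the function has no way to decline an application, so every running application remains visible to the core.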
Issue Links
- split to: YUNIKORN-2099 [Umbrella] State initialisation simplification (phase 2) (Closed)