Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
Description
This is a regression caused by YUNIKORN-677.
YUNIKORN-677 changed how we decide whether a pod needs recovery: the check is now based on whether the pod is assigned to a node (pod.Spec.NodeName is set). Occupied resources are handled in a similar way. However, the fix in YUNIKORN-677 changed the condition for occupied resource recovery but left the node coordinator code (where we handle pod updates) using the old condition. This causes the following issue:
- During recovery, the scheduler sees that the scheduler pod is already allocated (pod.Spec.NodeName is set), so its occupied resources are reported to the core. Code: https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/context_recovery.go#L113-L128
- Once the scheduler has recovered, the pod informers are started and the node coordinator starts to run. In some cases, the informer then notifies us of phase changes (from Pending to Running) for the scheduler pod and the admission-controller pod, which triggers a second occupied resource update for pods that were already counted during recovery. Code: https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/node_coordinator.go#L74-L101
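To make the mismatch concrete, here is a minimal Go sketch of the two conditions described above. It is not the k8shim code: `Pod`, `countedAtRecovery`, `countedOnUpdateOld`, `countedOnUpdateAligned` and `occupiedAfter` are hypothetical names introduced only for illustration, and the "aligned" predicate is just one assumed way to make the update path agree with the recovery path, not necessarily the fix that was merged.

```go
package main

// PodPhase is a simplified stand-in for the Kubernetes pod phase.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

// Pod is a hypothetical, stripped-down stand-in for v1.Pod.
type Pod struct {
	Name     string
	NodeName string   // stands in for pod.Spec.NodeName
	Phase    PodPhase // stands in for pod.Status.Phase
	MilliCPU int64    // resources requested by the pod
}

// countedAtRecovery models the recovery-side condition from the description:
// a pod counts as occupying resources once it is assigned to a node.
func countedAtRecovery(p Pod) bool {
	return p.NodeName != ""
}

// countedOnUpdateOld models the old, phase-based condition on pod updates:
// resources are reported when the pod moves from Pending to Running.
func countedOnUpdateOld(oldPod, newPod Pod) bool {
	return oldPod.Phase == PodPending && newPod.Phase == PodRunning
}

// countedOnUpdateAligned is an assumed alternative that uses the same notion
// as recovery: report only when the pod becomes assigned to a node.
func countedOnUpdateAligned(oldPod, newPod Pod) bool {
	return oldPod.NodeName == "" && newPod.NodeName != ""
}

// occupiedAfter replays recovery followed by a single pod update event and
// returns the occupied milli-CPU that would have been reported to the core.
func occupiedAfter(oldPod, newPod Pod, onUpdate func(oldPod, newPod Pod) bool) int64 {
	var occupied int64
	if countedAtRecovery(oldPod) {
		occupied += oldPod.MilliCPU
	}
	if onUpdate(oldPod, newPod) {
		occupied += newPod.MilliCPU
	}
	return occupied
}
```

For a pod that already has a node assigned but is still Pending at recovery time and requests 500 milli-CPU, `occupiedAfter` with `countedOnUpdateOld` returns 1000 (counted twice), while with `countedOnUpdateAligned` it returns 500.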
Issue Links
- is broken by: YUNIKORN-677 Potential resource leak when complete and allocate pod happens simultaneously (Closed)