[YUNIKORN-741] Regression: occupied resources miscalculated sometimes for yunikorn pods - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.11
Component/s: shim - kubernetes
Labels:
- pull-request-available

Description

This is a regression caused by YUNIKORN-677.

YUNIKORN-677 changed how we decide whether a pod needs recovery: the check is now based on whether the pod is allocated to a node (pod.Spec.NodeName is set). Occupied resources are handled in a similar way. However, the fix in YUNIKORN-677 changed the condition only for occupied resource recovery and left the node coordinator code (where we handle pod updates) using the old condition. This causes the following issue:

During recovery, the scheduler sees that the scheduler pod is already allocated (pod.Spec.NodeName is set), so its occupied resources are reported to the core. Code: https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/context_recovery.go#L113-L128
Once the scheduler has recovered, the pod informers are started and the node coordinator starts to run. In some cases, we are then notified of phase changes (from Pending to Running) for the scheduler pod and the admission-controller pod, and this triggers another occupied resource update for resources that were already reported during recovery. Code: https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/node_coordinator.go#L74-L101
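
The mismatch described above can be illustrated with a minimal Go sketch. This is not the k8shim code and not the change made in the linked pull request. The Pod type and the functions recoverOccupied, phaseBasedDelta and assignmentBasedDelta are hypothetical names, introduced only to show why using one condition during recovery and a different one on pod updates can count the same pod twice.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for a Kubernetes pod.
type Pod struct {
	Name     string
	NodeName string // mirrors pod.Spec.NodeName
	Phase    string // "Pending", "Running", ...
	MilliCPU int64  // requested CPU, in millicores
}

// isAssigned mirrors the recovery-time condition: the pod is bound to a node.
func isAssigned(p Pod) bool { return p.NodeName != "" }

// recoverOccupied sums the resources of every assigned pod,
// as the recovery step is described to do.
func recoverOccupied(pods []Pod) int64 {
	var total int64
	for _, p := range pods {
		if isAssigned(p) {
			total += p.MilliCPU
		}
	}
	return total
}

// phaseBasedDelta models an update handler (assumed) that reports
// occupied resources when a pod moves from Pending to Running.
func phaseBasedDelta(oldPod, newPod Pod) int64 {
	if oldPod.Phase == "Pending" && newPod.Phase == "Running" {
		return newPod.MilliCPU
	}
	return 0
}

// assignmentBasedDelta models an update handler (assumed) that uses the
// same condition as recovery: report only when the pod becomes assigned.
func assignmentBasedDelta(oldPod, newPod Pod) int64 {
	if !isAssigned(oldPod) && isAssigned(newPod) {
		return newPod.MilliCPU
	}
	return 0
}

func main() {
	// The scheduler pod is already bound to a node but still Pending at recovery time.
	before := Pod{Name: "yunikorn-scheduler", NodeName: "node-1", Phase: "Pending", MilliCPU: 500}
	after := before
	after.Phase = "Running"

	occupied := recoverOccupied([]Pod{before})
	fmt.Println("after recovery:", occupied)
	fmt.Println("phase-based update:", occupied+phaseBasedDelta(before, after))
	fmt.Println("assignment-based update:", occupied+assignmentBasedDelta(before, after))
}
```

In this sketch the phase-based handler adds the pod's 500 millicores a second time, while the assignment-based handler leaves the total unchanged because the pod was already assigned when recovery ran.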

Issue Links

is broken by

YUNIKORN-677 Potential resource leak when complete and allocate pod happens simultaneously (Closed)

links to

GitHub Pull Request #279

People

Assignee: Weiwei Yang

Reporter: Weiwei Yang

Votes: 0

Watchers: 2

Dates

Created: 07/Jul/21 21:33

Updated: 21/Jan/22 21:48

Resolved: 08/Jul/21 01:55