Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.1
Fix Version/s: None
Component/s: None
Description
Hi! Recently we've been running into an issue in our production systems where a higher priority job is starved by a lower priority job and preemption doesn't kick in to rebalance resources, even over a period of about an hour. Our understanding is that when higher priority jobs show up, resources should be preempted from the lower priority queue based on fair share allocations relatively quickly, once the fair share preemption timeout elapses.
This is for the higher priority queue (high):
Between 23:30 and 00:45, notice that the higher priority queue consistently demands a lot of memory and should be allocated at least half of it under fair sharing, but never gets its fair share.
This is for the lower priority queue (medium):
Notice that over the same time window, the medium subqueue is using far more than its fair share.
One interesting observation (possibly related to the issue) is that when this happens, the queue is at its maximum resources and we see a lot of these log messages:
diagnostics: [Mon May 16 06:29:28 +0000 2022] Application is added to the scheduler and is not yet activated. (Resource request: <memory:27136, vCores:4> exceeds current queue or its parents maximum resource allowed). Max share of queue: <memory:9223372036854775807, vCores:2147483647>
This particular application stays in that state for a while and only gets the resources it needs about an hour later, after the low priority job finishes. Note that the reported max share of the queue looks strangely off.
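As far as we can tell, that "Max share of queue" value is simply the JVM's maximum long/int, which the Fair Scheduler appears to report when no explicit maximum resource is configured for the queue, i.e. "unbounded" rather than a real cap (this is our reading of the message, not something we've confirmed in the source). A trivial sanity check in plain Java:

public class MaxShareValueCheck {
    public static void main(String[] args) {
        // Values copied from the diagnostic message above.
        System.out.println(9223372036854775807L == Long.MAX_VALUE);  // true
        System.out.println(2147483647 == Integer.MAX_VALUE);         // true
    }
}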
Our current preemption config for this cluster:
<fairSharePreemptionThreshold>1</fairSharePreemptionThreshold>
<fairSharePreemptionTimeout>900</fairSharePreemptionTimeout>
<minSharePreemptionTimeout>180</minSharePreemptionTimeout>

<queue name="low">
    <weight>1</weight>
</queue>
<queue name="medium">
    <weight>2</weight>
</queue>
<queue name="high">
    <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
    <weight>3</weight>
</queue>
</queue>
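For clarity, here is how we understand the fair share preemption condition these settings control. This is a simplified sketch in plain Java (illustrative names, not the actual FairScheduler code): a queue should be considered starved once it has unmet demand and its usage has stayed below fairSharePreemptionThreshold * fairShare for longer than fairSharePreemptionTimeout.

public class FairShareStarvationSketch {
    // Illustrative queue snapshot; field names are ours, not Hadoop's.
    static class QueueSnapshot {
        long usedMemory;          // memory currently allocated to the queue
        long fairShareMemory;     // instantaneous fair share derived from the weights
        long demandMemory;        // memory the queue is asking for
        long belowFairShareSince; // epoch ms when usage first dropped below the threshold
    }

    // For our "high" queue: threshold = 1.0, timeoutMs = 300_000 (300s).
    static boolean shouldPreemptFor(QueueSnapshot q, double threshold,
                                    long timeoutMs, long nowMs) {
        boolean hasUnmetDemand    = q.demandMemory > q.usedMemory;
        boolean belowThreshold    = q.usedMemory < threshold * q.fairShareMemory;
        boolean starvedLongEnough = nowMs - q.belowFairShareSince >= timeoutMs;
        return hasUnmetDemand && belowThreshold && starvedLongEnough;
    }
}

With weights 1/2/3, the high queue's fair share should come out to roughly half the cluster when all three queues are active, so by this reading preemption from medium should have triggered within 5 minutes of the high queue falling below that share.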
We've taken a heap dump and enabled debug logging, and one of our theories is that the preemption thread checks whether the starved queue can actually accept more resources before preempting, and since the queue is already at its max resources, that check fails and preemption never goes through.
However, nothing super conclusive yet. Would love any assistance/insight you could provide. Happy to give more details as well.
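To make the theory concrete, here is a purely hypothetical sketch (our own illustration, not code from the Hadoop source) of the kind of check that would explain what we observe: if the starved application's pending ask has to fit under its queue's remaining headroom (maxShare minus current usage) before any container is marked for preemption, then a queue that is already at its max resources would never get anything preempted on its behalf, even while a lower priority queue sits above its fair share.

public class PreemptionHeadroomHypothesis {
    static class Resource {
        long memory; int vcores;
        Resource(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
    }

    // Hypothetical headroom check: if the queue is already at its max share,
    // the pending request "does not fit" and preemption is skipped entirely.
    static boolean wouldPreempt(Resource pendingAsk, Resource queueUsage, Resource queueMaxShare) {
        long memHeadroom  = queueMaxShare.memory - queueUsage.memory;
        int  coreHeadroom = queueMaxShare.vcores - queueUsage.vcores;
        boolean fits = pendingAsk.memory <= memHeadroom && pendingAsk.vcores <= coreHeadroom;
        return fits; // if false, no containers would ever be marked for preemption
    }

    public static void main(String[] args) {
        // Numbers shaped like the diagnostic above; purely illustrative.
        Resource ask      = new Resource(27136, 4);
        Resource usage    = new Resource(100_000, 40);   // queue already at its cap
        Resource maxShare = new Resource(100_000, 40);
        System.out.println(wouldPreempt(ask, usage, maxShare)); // false -> no preemption
    }
}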