Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
A real deadlock can occur if a queue is removed during a config refresh while workloads are still running:
Goroutine #1
objects.(*Queue).DecAllocatedResource { sq.Lock() } <<<<<
objects.(*Queue).DecAllocatedResource { } }
objects.(*Queue).DecAllocatedResource { if sq.parent != nil { }
scheduler.(*PartitionContext).removeAllocation { if resources.StrictlyGreaterThanZero(total) { }
scheduler.(*ClusterContext).processAllocationReleases { // notify the RM of the exact released allocations }
scheduler.(*ClusterContext).handleRMUpdateAllocationEvent { } }
scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
Goroutine #2
objects.(*Queue).MarkQueueForRemoval { sq.Lock() } <<<<<
objects.(*Queue).MarkQueueForRemoval { // need to lock for write as we don't want to add a queue while marking for removal }
objects.(*Queue).MarkQueueForRemoval { if len(sq.children) > 0 { }
scheduler.(*partitionManager).remove { // remove applications: we do not care about return values or issues }
scheduler.(*partitionManager).Stop { manager.remove() }
scheduler.(*ClusterContext).Stop { part.partitionManager.Stop() }
scheduler.(*Scheduler).Stop { s.clusterContext.Stop() }
entrypoint.(*ServiceContext).StopAll { s.Scheduler.Stop() }
tests.TestBasicScheduler.deferwrap1 { ms := &mockScheduler{} }
tests.TestBasicScheduler { waitForPendingAppResource(t, app, 0, 1000) }
MarkQueueForRemoval() performs a top-down traversal (from the root towards the leaves), while DecAllocatedResource() traverses bottom-up (from a leaf towards the root). The two goroutines therefore acquire the queue locks in opposite order, which can deadlock.
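To make the ordering concrete, here is a minimal, self-contained sketch of that lock-ordering conflict (illustrative names only, not YuniKorn code); running it will very likely hang, which is the point:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var parent, leaf sync.Mutex
	var wg sync.WaitGroup
	wg.Add(2)

	// bottom-up, like DecAllocatedResource: leaf first, then parent
	go func() {
		defer wg.Done()
		leaf.Lock()
		time.Sleep(10 * time.Millisecond) // widen the race window
		parent.Lock()                     // blocks while the other goroutine holds parent
		parent.Unlock()
		leaf.Unlock()
	}()

	// top-down, like MarkQueueForRemoval: parent first, then leaf
	go func() {
		defer wg.Done()
		parent.Lock()
		time.Sleep(10 * time.Millisecond)
		leaf.Lock() // blocks while the other goroutine holds leaf
		leaf.Unlock()
		parent.Unlock()
	}()

	wg.Wait()
	fmt.Println("no deadlock this time")
}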
We must reduce the scope of the locks. In both cases the methods are written with this approach:
func (t *Type) topDown() {
	t.Lock()
	defer t.Unlock()
	for _, ch := range t.children {
		ch.topDown() // the lock must not be held here
	}
}

func (t *Type) bottomUp() {
	t.Lock()
	defer t.Unlock()
	if t.parent != nil {
		t.parent.bottomUp() // the lock must not be held here
	}
}
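One way to reduce the lock scope, sketched below with the same simplified Type (not the actual Queue implementation): copy the references needed for the traversal while holding the lock, release it, and only then recurse. Neither traversal then holds more than one lock at a time, so the opposite orderings can no longer deadlock.

func (t *Type) topDownReduced() {
	t.Lock()
	children := make([]*Type, len(t.children))
	copy(children, t.children)
	t.Unlock()
	// recurse without holding t's lock
	for _, ch := range children {
		ch.topDownReduced()
	}
}

func (t *Type) bottomUpReduced() {
	t.Lock()
	parent := t.parent
	t.Unlock()
	// walk up without holding t's lock
	if parent != nil {
		parent.bottomUpReduced()
	}
}

The trade-off is that the hierarchy can change between releasing the lock and recursing, so each method has to tolerate a slightly stale snapshot of the parent/children references.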
Attachments
Issue Links
- causes: YUNIKORN-2632 Data race in IncAllocatedResource (Resolved)
- links to