[YUNIKORN-2548] Potential deadlock during concurrent bottom-up/top-down queue traversal - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.1, 1.6.0
Component/s: core - scheduler
Labels:
- pull-request-available

Target Version:

1.5.1, 1.6.0

Description

A real deadlock can occur if a queue is removed during config refresh and workloads are still running:

Goroutine #1

objects.(*Queue).DecAllocatedResource { sq.Lock() } <<<<<
objects.(*Queue).DecAllocatedResource { } }
objects.(*Queue).DecAllocatedResource { if sq.parent != nil { }
scheduler.(*PartitionContext).removeAllocation { if resources.StrictlyGreaterThanZero(total) { }
scheduler.(*ClusterContext).processAllocationReleases { // notify the RM of the exact released allocations }
scheduler.(*ClusterContext).handleRMUpdateAllocationEvent { } }
scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }

Goroutine #2

objects.(*Queue).MarkQueueForRemoval { sq.Lock() } <<<<<
objects.(*Queue).MarkQueueForRemoval { // need to lock for write as we don't want to add a queue while marking for removal }
objects.(*Queue).MarkQueueForRemoval { if len(sq.children) > 0 { }
scheduler.(*partitionManager).remove { // remove applications: we do not care about return values or issues }
scheduler.(*partitionManager).Stop { manager.remove() }
scheduler.(*ClusterContext).Stop { part.partitionManager.Stop() }
scheduler.(*Scheduler).Stop { s.clusterContext.Stop() }
entrypoint.(*ServiceContext).StopAll { s.Scheduler.Stop() }
tests.TestBasicScheduler.deferwrap1 { ms := &mockScheduler{} }
tests.TestBasicScheduler { waitForPendingAppResource(t, app, 0, 1000) }

MarkQueueForRemoval() performs a top-down traversal, while DecAllocatedResource() does the opposite (from the leaf).

We must reduce the scope of the locks. In both cases the methods are written with this approach:

func (t *Type) topDown() {
  t.Lock()
  defer t.Unlock()

  for _, ch := range t.children {   
    ch.topDown()   // lock must not be hold here
  }
}

func (t *Type) bottomUp() {
  t.Lock()
  defer t.Unlock()

  if t.parent != nil {
    t.parent.bottumUp()  // lock must not be hold here
  }
 }

Attachments

Issue Links

causes

YUNIKORN-2632 Data race in IncAllocatedResource

Resolved

links to

GitHub Pull Request #839

Activity

People

Assignee:: Peter Bacsko

Reporter:: Peter Bacsko

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 08/Apr/24 22:25

Updated:: 17/May/24 13:58

Resolved:: 10/Apr/24 19:23