Uploaded image for project: 'Apache YuniKorn'
  1. Apache YuniKorn
  2. YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues
  3. YUNIKORN-2548

Potential deadlock during concurrent bottom-up/top-down queue traversal

    XMLWordPrintableJSON

Details

    Description

      A real deadlock can occur if a queue is removed during config refresh and workloads are still running:

      Goroutine #1

      objects.(*Queue).DecAllocatedResource { sq.Lock() } <<<<<
      objects.(*Queue).DecAllocatedResource { } }
      objects.(*Queue).DecAllocatedResource { if sq.parent != nil { }
      scheduler.(*PartitionContext).removeAllocation { if resources.StrictlyGreaterThanZero(total) { }
      scheduler.(*ClusterContext).processAllocationReleases { // notify the RM of the exact released allocations }
      scheduler.(*ClusterContext).handleRMUpdateAllocationEvent { } }
      scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
      

      Goroutine #2

      objects.(*Queue).MarkQueueForRemoval { sq.Lock() } <<<<<
      objects.(*Queue).MarkQueueForRemoval { // need to lock for write as we don't want to add a queue while marking for removal }
      objects.(*Queue).MarkQueueForRemoval { if len(sq.children) > 0 { }
      scheduler.(*partitionManager).remove { // remove applications: we do not care about return values or issues }
      scheduler.(*partitionManager).Stop { manager.remove() }
      scheduler.(*ClusterContext).Stop { part.partitionManager.Stop() }
      scheduler.(*Scheduler).Stop { s.clusterContext.Stop() }
      entrypoint.(*ServiceContext).StopAll { s.Scheduler.Stop() }
      tests.TestBasicScheduler.deferwrap1 { ms := &mockScheduler{} }
      tests.TestBasicScheduler { waitForPendingAppResource(t, app, 0, 1000) }
      

      MarkQueueForRemoval() performs a top-down traversal, while DecAllocatedResource() does the opposite (from the leaf).

      We must reduce the scope of the locks. In both cases the methods are written with this approach:

      func (t *Type) topDown() {
        t.Lock()
        defer t.Unlock()
      
        for _, ch := range t.children {   
          ch.topDown()   // lock must not be hold here
        }
      }
      
      func (t *Type) bottomUp() {
        t.Lock()
        defer t.Unlock()
      
        if t.parent != nil {
          t.parent.bottumUp()  // lock must not be hold here
        }
       }
      

      Attachments

        Issue Links

          Activity

            People

              pbacsko Peter Bacsko
              pbacsko Peter Bacsko
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: