Hadoop YARN / YARN-8771

CapacityScheduler fails to unreserve when cluster resource contains empty resource type

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0, 3.1.2
    • Component/s: capacityscheduler
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      We hit this problem when the cluster was almost, but not fully, exhausted (93% used): the scheduler kept allocating containers for one app but always failed to commit the allocation. This can block requests from other apps and leave part of the cluster's resources unusable.

      To reproduce the problem:
      (1) Use DominantResourceCalculator.
      (2) The cluster resource has an empty resource type, for example gpu=0.
      (3) The scheduler allocates a container for app1, which has reserved containers and whose queue limit or user limit has been reached (used + required > limit).

      Relevant code in RegularContainerAllocator#assignContainer:

          // How much need to unreserve equals to:
          // max(required - headroom, amountNeedUnreserve)
          Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
          Resource resourceNeedToUnReserve =
              Resources.max(rc, clusterResource,
                  Resources.subtract(capability, headRoom),
                  currentResoureLimits.getAmountNeededUnreserve());
      
          boolean needToUnreserve =
              Resources.greaterThan(rc, clusterResource,
                  resourceNeedToUnReserve, Resources.none());
      

      For example, with headRoom=<0GB, 8 vcores, 0 gpu> and capability=<8GB, 2 vcores, 0 gpu>, resourceNeedToUnReserve is <8GB, -6 vcores, 0 gpu>, and needToUnreserve, which is the result of Resources#greaterThan, is false. This is wrong: the required memory does exceed the headroom, so an unreserve is needed.
      Later in RegularContainerAllocator#assignContainer, the unreserve step is skipped when shouldAllocOrReserveNewContainer is true (required containers > reserved containers) and needToUnreserve has been wrongly calculated as false:
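
The arithmetic in this example can be checked with a small standalone model. This is only a sketch: the class and helper names below are hypothetical and are not Hadoop APIs, resources are modelled as plain arrays of <memory GB, vcores, gpu>, and amountNeededUnreserve is assumed to be empty so the max step is omitted. The comparison is assumed to behave component-wise when the cluster has a zero-valued resource type, which would explain why a mixed-sign difference does not compare as greater than zero. The last helper is one possible alternative check, not necessarily what the attached patches do.

```java
import java.util.Arrays;

// Illustrative model only; none of these are Hadoop classes.
final class UnreserveDemo {

    // Component-wise comparison (an assumption about the calculator's
    // behavior when the cluster has a zero-valued resource type).
    // Returns 1 if lhs is larger in some component and smaller in none,
    // -1 for the reverse, and 0 if equal or if the components disagree.
    static int componentWiseCompare(long[] lhs, long[] rhs) {
        boolean lhsGreater = false;
        boolean rhsGreater = false;
        for (int i = 0; i < lhs.length; i++) {
            if (lhs[i] > rhs[i]) {
                lhsGreater = true;
            } else if (lhs[i] < rhs[i]) {
                rhsGreater = true;
            }
        }
        if (lhsGreater && rhsGreater) {
            return 0;
        }
        if (lhsGreater) {
            return 1;
        }
        return rhsGreater ? -1 : 0;
    }

    // Element-wise a - b; negative components are kept, as in the report.
    static long[] subtract(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] - b[i];
        }
        return out;
    }

    // Hypothetical alternative: unreserve is needed if any component is positive.
    static boolean anyComponentAboveZero(long[] r) {
        return Arrays.stream(r).anyMatch(v -> v > 0);
    }

    public static void main(String[] args) {
        long[] none = {0, 0, 0};
        long[] headRoom = {0, 8, 0};    // <0GB, 8 vcores, 0 gpu>
        long[] capability = {8, 2, 0};  // <8GB, 2 vcores, 0 gpu>

        long[] need = subtract(capability, headRoom);
        System.out.println(Arrays.toString(need)); // [8, -6, 0]

        // greaterThan-style check: memory is above zero, vcores below,
        // so the comparison yields 0 and the flag stays false.
        System.out.println(componentWiseCompare(need, none) > 0); // false

        // Per-component check reports that memory exceeds the headroom.
        System.out.println(anyComponentAboveZero(need)); // true
    }
}
```

Under these assumptions the 8GB memory shortfall is masked by the surplus of 6 vcores, which matches the behavior described in this report.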

          if (availableContainers > 0) {
            if (rmContainer == null && reservationsContinueLooking
                && node.getLabels().isEmpty()) {
              // The unreserve process is wrongly skipped when
              // shouldAllocOrReserveNewContainer=true and needToUnreserve=false,
              // even though the required resource exceeds the headroom.
              if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
                ...
              }
            }
          }
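
To make the effect of the inner condition concrete, the sketch below enumerates its four input combinations. It is a standalone illustration: the class and method names are hypothetical, and only the boolean expression itself is taken from the excerpt above.

```java
// Illustrative only; not part of RegularContainerAllocator.
final class UnreserveGuardDemo {

    // Same boolean expression as the inner `if` in the excerpt.
    static boolean entersUnreserveBranch(boolean shouldAllocOrReserveNewContainer,
                                         boolean needToUnreserve) {
        return !shouldAllocOrReserveNewContainer || needToUnreserve;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean shouldAlloc : values) {
            for (boolean need : values) {
                System.out.println("shouldAllocOrReserveNewContainer=" + shouldAlloc
                    + ", needToUnreserve=" + need
                    + " -> enters unreserve branch: "
                    + entersUnreserveBranch(shouldAlloc, need));
            }
        }
    }
}
```

The branch is skipped only for shouldAllocOrReserveNewContainer=true with needToUnreserve=false, so a needToUnreserve value that is wrongly false leaves no path into the unreserve logic for an app that still wants more containers than it has reserved.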
      

      Attachments

        1. YARN-8771.001.patch
          7 kB
          Tao Yang
        2. YARN-8771.002.patch
          7 kB
          Tao Yang
        3. YARN-8771.003.patch
          7 kB
          Tao Yang
        4. YARN-8771.004.patch
          7 kB
          Tao Yang

          People

            Assignee: Tao Yang
            Reporter: Tao Yang