Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.2.0
Labels: None
Description
We found this problem when the cluster was almost, but not completely, exhausted (93% used): the scheduler kept allocating for an app but always failed to commit. This can block requests from other apps and leave part of the cluster's resources unusable.
To reproduce this problem:
(1) use DominantResourceCalculator (see the configuration snippet after this list);
(2) the cluster resource has an empty resource type, for example gpu=0;
(3) the scheduler allocates a container for app1, which has reserved containers and whose queue limit or user limit is reached (used + required > limit).
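For step (1), a typical way to select the DominantResourceCalculator is the standard CapacityScheduler setting in capacity-scheduler.xml, shown here only to make the setup concrete:

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>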
Reference code in RegularContainerAllocator#assignContainer:
// How much need to unreserve equals to:
// max(required - headroom, amountNeedUnreserve)
Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
Resource resourceNeedToUnReserve = Resources.max(rc, clusterResource,
    Resources.subtract(capability, headRoom),
    currentResoureLimits.getAmountNeededUnreserve());
boolean needToUnreserve = Resources.greaterThan(rc, clusterResource,
    resourceNeedToUnReserve, Resources.none());
For example, resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu> when headRoom = <0GB, 8 vcores, 0 gpu> and capability = <8GB, 2 vcores, 0 gpu>; needToUnreserve, the result of Resources#greaterThan, will then be false. This is not reasonable, because the required resource did exceed the headroom, so an unreserve is needed.
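To make the arithmetic concrete, here is a small standalone sketch (plain Java with long[] vectors and a hypothetical UnreserveMathSketch class, not the actual Hadoop Resource/Resources types). It reproduces the subtraction above and shows that a per-component view of the result does indicate the request exceeds the headroom, even though the single scalar comparison against Resources.none() came out false:

import java.util.Arrays;

public class UnreserveMathSketch {

  // Component-wise a - b, mirroring what Resources.subtract does per
  // resource type; vectors are [memoryMB, vcores, gpus].
  static long[] subtract(long[] a, long[] b) {
    long[] out = new long[a.length];
    for (int i = 0; i < a.length; i++) {
      out[i] = a[i] - b[i];
    }
    return out;
  }

  public static void main(String[] args) {
    long[] headRoom   = {0, 8, 0};        // <0GB, 8 vcores, 0 gpu>
    long[] capability = {8 * 1024, 2, 0}; // <8GB, 2 vcores, 0 gpu>

    long[] need = subtract(capability, headRoom);
    System.out.println(Arrays.toString(need)); // [8192, -6, 0]

    // Checking components directly: the memory component is positive, so
    // the request exceeds the headroom and an unreserve is needed, even
    // though the vcores component is negative and the gpu component is 0.
    boolean exceedsHeadroom = false;
    for (long v : need) {
      if (v > 0) {
        exceedsHeadroom = true;
      }
    }
    System.out.println(exceedsHeadroom); // true
  }
}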
After that, when execution reaches the unreserve branch in RegularContainerAllocator#assignContainer, the unreserve process is skipped because shouldAllocOrReserveNewContainer is true (required containers > reserved containers) and needToUnreserve was wrongly calculated to be false:
if (availableContainers > 0) {
  if (rmContainer == null && reservationsContinueLooking
      && node.getLabels().isEmpty()) {
    // The unreserve process can be wrongly skipped when
    // shouldAllocOrReserveNewContainer=true and needToUnreserve=false,
    // even though the required resource did exceed the headroom.
    if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
      ...
    }
  }
}
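Plugging this scenario's values into that guard makes the skip explicit (a minimal sketch, assuming the values derived above):

boolean shouldAllocOrReserveNewContainer = true;  // required containers > reserved containers
boolean needToUnreserve = false;                  // miscomputed, as shown above
// Guard from the snippet above:
boolean enterUnreserveBranch =
    !shouldAllocOrReserveNewContainer || needToUnreserve; // false
// The unreserve branch is never entered, so the reservation is kept and
// the allocation keeps failing to commit on each scheduling attempt.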