[YARN-9413] Queue resource leak after app fail for CapacityScheduler - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.2
Fix Version/s: 3.0.4, 3.3.0, 3.2.1, 3.1.3
Component/s: capacityscheduler
Labels: None
Hadoop Flags: Reviewed

Description

To reproduce this problem:

Submit an app that is configured to keep containers across app attempts and that fails as soon as its AM finishes for the first time (am-max-attempts=1).
The app starts with 2 containers running on node NM1.
Fail the AM of the application with the PREEMPTED exit status. This status should not count towards the max-attempt limit, but the app fails immediately.
The used resource of the queue leaks after the app fails.

The root cause is that RMAppAttemptImpl$BaseFinalTransition#transition and RMAppImpl$AttemptFailedTransition#transition handle app attempt failure inconsistently:

When the attempt fails, RMAppAttemptImpl$BaseFinalTransition#transition checks the exit status of the AM container. If it is PREEMPTED, ABORTED, DISKS_FAILED or KILLED_BY_RESOURCEMANAGER, the failure does not count towards the max-attempt limit, so the transition sends AppAttemptRemovedSchedulerEvent with keepContainersAcrossAppAttempts=true and RMAppFailedAttemptEvent with transferStateFromPreviousAttempt=true.
RMAppImpl$AttemptFailedTransition#transition handles RMAppFailedAttemptEvent and fails the app if its max app attempts is 1.
CapacityScheduler handles AppAttemptRemovedSchedulerEvent in CapacityScheduler#doneApplicationAttempt. Because keepContainersAcrossAppAttempts=true, it skips killing the containers that belong to this app and skips their completion processing, so the queue resource leaks.
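
The interaction described above can be illustrated with a small toy model. This is only a sketch: it does not use any YARN classes, and every name in it (QueueLeakSketch, Queue, usedAfterAmFailure, the alignWithAppDecision flag, the OTHER_FAILURE status) is hypothetical. The "aligned" branch shows one possible way to make the two decisions agree; it is not claimed to be what the attached patches do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueueLeakSketch {
    // The four exempt statuses come from the description; OTHER_FAILURE is a
    // hypothetical stand-in for any failure that does count towards the limit.
    enum ExitStatus { PREEMPTED, ABORTED, DISKS_FAILED, KILLED_BY_RESOURCEMANAGER, OTHER_FAILURE }

    /** Toy queue that only tracks how many containers it believes are in use. */
    static final class Queue {
        int usedContainers;
    }

    /** Attempt-side view: exempt statuses do not count towards max attempts. */
    static boolean countsTowardsMaxAttempts(ExitStatus status) {
        return status == ExitStatus.OTHER_FAILURE;
    }

    /** App-side view, as described: the app fails when max app attempts is 1. */
    static boolean appFails(int maxAppAttempts) {
        return maxAppAttempts == 1;
    }

    /** Scheduler-side view: containers are released only when they are not kept. */
    static void doneApplicationAttempt(Queue queue, List<String> liveContainers,
                                       boolean keepContainers) {
        if (keepContainers) {
            return; // containers are left for a next attempt that may never come
        }
        queue.usedContainers -= liveContainers.size();
        liveContainers.clear();
    }

    /**
     * Returns the queue's used-container count after the AM fails.
     * alignWithAppDecision=false models the reported behaviour;
     * alignWithAppDecision=true is a hypothetical consistent variant.
     */
    static int usedAfterAmFailure(ExitStatus status, int maxAppAttempts,
                                  boolean alignWithAppDecision) {
        Queue queue = new Queue();
        List<String> live = new ArrayList<>(Arrays.asList("container_1", "container_2"));
        queue.usedContainers = live.size();

        // The app is assumed to be configured to keep containers across attempts.
        boolean keep = !countsTowardsMaxAttempts(status);
        if (alignWithAppDecision && appFails(maxAppAttempts)) {
            keep = false; // no further attempt will run, so nothing should be kept
        }
        doneApplicationAttempt(queue, live, keep);
        return queue.usedContainers;
    }

    public static void main(String[] args) {
        System.out.println("reported behaviour, used containers: "
            + usedAfterAmFailure(ExitStatus.PREEMPTED, 1, false)); // prints 2
        System.out.println("aligned decision, used containers: "
            + usedAfterAmFailure(ExitStatus.PREEMPTED, 1, true));  // prints 0
    }
}
```

In this model the leak appears only when the attempt-side decision to keep containers is made without knowing that the app-side transition is about to fail the app.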

Attachments

YARN-9413.branch-3.0.001.patch (07/Apr/19 02:00, 10 kB, Tao Yang)
YARN-9413.003.patch (29/Mar/19 04:54, 12 kB, Tao Yang)
image-2019-03-29-10-47-47-953.png (29/Mar/19 02:47, 85 kB, Tao Yang)
YARN-9413.002.patch (28/Mar/19 13:06, 6 kB, Tao Yang)
YARN-9413.001.patch (27/Mar/19 04:12, 6 kB, Tao Yang)

People

Assignee: Tao Yang

Reporter: Tao Yang

Votes: 0

Watchers: 6

Dates

Created: 27/Mar/19 04:06

Updated: 08/Apr/19 05:46

Resolved: 08/Apr/19 05:45