[YARN-8513] CapacityScheduler infinite loop when queue is near fully utilized - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.1.0, 2.9.1
Fix Version/s: None
Component/s: capacity scheduler, yarn
Labels:
None
Environment:

Ubuntu 14.04.5 and 16.04.4

YARN is configured with one label and 5 queues.

Description

The ResourceManager (RM) sometimes stops responding to all requests when a queue is nearly fully utilized. Sending SIGTERM does not stop the RM; only SIGKILL does. After a restart, the RM recovers running jobs and starts accepting new ones.

The CapacityScheduler appears to be stuck in an infinite loop, printing the following log messages (more than 25,000 lines per second):

2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.99816763 absoluteUsedCapacity=0.99816763 used=<memory:16170624, vCores:1577> cluster=<memory:29441544, vCores:5792>
2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal
2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1530619767030_1652_000001 container=null queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 clusterResource=<memory:29441544, vCores:5792> type=NODE_LOCAL requestedPartition=
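One way to check whether an RM log shows the same symptom is to count log lines per second and look for bursts. The following is a minimal sketch, assuming the log uses the timestamp format shown in the excerpt above; the function names and the default threshold of 25,000 (taken from the rate mentioned in this report) are illustrative and not part of YARN.

```python
import re
import sys
from collections import Counter

# Matches the leading "YYYY-MM-DD HH:MM:SS,mmm " timestamp used in the excerpt above.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")


def lines_per_second(lines):
    """Count log lines per wall-clock second; lines without a timestamp are skipped."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def busiest_seconds(lines, threshold=25000):
    """Return (second, count) pairs with count >= threshold, busiest first.

    The default threshold is an assumption based on the rate quoted in this report.
    """
    counts = lines_per_second(lines)
    return [(second, n) for second, n in counts.most_common() if n >= threshold]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python log_rate.py yarn3-resourcemanager.log
    with open(sys.argv[1], errors="replace") as log:
        for second, count in busiest_seconds(log):
            print(second, count)
```

Any second reported by this script is a candidate window for taking jstack dumps or inspecting the surrounding scheduler messages.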

I have encountered this problem several times since upgrading to YARN 2.9.1; the same configuration works fine under version 2.7.3.

YARN-4477 is an infinite loop bug in FairScheduler; I am not sure whether this is a similar problem.

Attachments

- yarn3-resourcemanager.log (18/Aug/18 14:10, 798 kB, Chen Yufei)
- yarn3-top (18/Aug/18 14:05, 9 kB, Chen Yufei)
- yarn3-jstack5.log (18/Aug/18 14:04, 153 kB, Chen Yufei)
- yarn3-jstack3.log (18/Aug/18 14:04, 223 kB, Chen Yufei)
- yarn3-jstack2.log (18/Aug/18 14:04, 151 kB, Chen Yufei)
- yarn3-jstack1.log (18/Aug/18 14:04, 151 kB, Chen Yufei)
- yarn3-jstack4.log (18/Aug/18 14:04, 152 kB, Chen Yufei)
- top-when-normal.log (16/Jul/18 02:15, 2 kB, Chen Yufei)
- jstack-5.log (16/Jul/18 02:15, 173 kB, Chen Yufei)
- jstack-1.log (16/Jul/18 02:15, 173 kB, Chen Yufei)
- jstack-4.log (16/Jul/18 02:15, 170 kB, Chen Yufei)
- jstack-3.log (16/Jul/18 02:15, 169 kB, Chen Yufei)
- jstack-2.log (16/Jul/18 02:15, 171 kB, Chen Yufei)
- top-during-lock.log (16/Jul/18 02:15, 2 kB, Chen Yufei)

Issue Links

is caused by

YARN-8896 Limit the maximum number of container assignments per heartbeat

Resolved

relates to

YARN-8896 Limit the maximum number of container assignments per heartbeat

Resolved

People

Assignee: Unassigned

Reporter: Chen Yufei

Votes: 1

Watchers: 12

Dates

Created: 10/Jul/18 17:28

Updated: 24/Oct/18 22:13

Resolved: 24/Oct/18 22:13