Details
-
Type: Bug
-
Status: Open
-
Priority: Minor
-
Resolution: Unresolved
-
3.4.0
-
None
-
None
Description
Adding multiple queues in quick succession via the Mutation API appears to trigger a race condition when the partition metrics for those queues are registered, as noted by the unhandled exception in the following log:
2023-05-09 18:25:36,301 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m
2023-05-09 18:25:36,301 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager: Initialized queue: root.eca_m
2023-05-09 18:25:36,359 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: LeafQueue:root.eca_mupdate max app related, maxApplications=1000, maxApplicationsPerUser=1000, Abs Cap:0.0, Cap: 0.0, MaxCap : 1.0
2023-05-09 18:25:36,359 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: LeafQueue:root.eca_mupdate max app related, maxApplications=1000, maxApplicationsPerUser=1000, Abs Cap:NaN, Cap: NaN, MaxCap : NaN
2023-05-09 18:25:36,401 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m
2023-05-09 18:25:36,401 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager: Initialized queue: root.eca_m
2023-05-09 18:25:36,484 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-26,5,main] threw an Exception.
org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists!
2023-05-09 18:25:36,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m
2023-05-09 18:25:36,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: root: re-configured queue: root.eca_m: capacity=0.0, absoluteCapacity=0.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0, effectiveMinResource=<memory:1152000, vCores:359> , effectiveMaxResource=<memory:2304000, vCores:718>
Initializing the leaf queue root.eca_m should happen only once during a reinit (twice if the validation endpoint is used), but in this case it happened three times in under a quarter of a second. This results in an unhandled exception in the async scheduling thread, which then blocks new container allocation (existing containers can still transition to other states).
2023-05-09 18:25:36,484 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-26,5,main] threw an Exception.
org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists!
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:355)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToUser(QueueMetrics.java:614)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1545)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1198)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1109)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:927)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
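To illustrate the failure mode in isolation, here is a minimal sketch of a registry that rejects duplicate source names, next to a get-or-register variant that tolerates repeated registration. It is not the Hadoop implementation: SourceRegistry, registerStrict, and getOrRegister are hypothetical names, and the real root cause in QueueMetrics may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for a metrics registry keyed by source name.
// This is an illustration only, not the Hadoop metrics2 code.
final class SourceRegistry {
    private final Map<String, Object> sources = new ConcurrentHashMap<>();

    // Strict registration: a second registration under the same name fails,
    // similar in spirit to the "already exists!" error in the trace above.
    void registerStrict(String name, Object source) {
        if (sources.putIfAbsent(name, source) != null) {
            throw new IllegalStateException(
                "Metrics source " + name + " already exists!");
        }
    }

    // Tolerant registration: atomically returns the existing source, or
    // creates and stores a new one if the name is not registered yet.
    Object getOrRegister(String name, Supplier<Object> factory) {
        return sources.computeIfAbsent(name, n -> factory.get());
    }
}
```

With the strict variant, any caller that re-registers a name after a repeated initialization gets an exception; with the get-or-register variant, repeated callers simply receive the instance that is already registered.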
Even though the Mutation API wasn't designed for this usage, the scheduling thread shouldn't react like this to API calls.
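One way to picture the expected behavior is a scheduling loop that contains a failure to the iteration in which it occurs, instead of letting it end the thread. The sketch below is only an assumption about what such containment could look like; ResilientLoop and runSteps are hypothetical names and do not correspond to CapacityScheduler code.

```java
// Hypothetical sketch: a loop that survives a failing iteration.
// Not the CapacityScheduler implementation.
final class ResilientLoop {
    // Runs `step` exactly `iterations` times and returns how many
    // iterations threw a RuntimeException. A failure is counted and the
    // loop moves on, so later iterations still get a chance to run.
    static int runSteps(Runnable step, int iterations) {
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                step.run();
            } catch (RuntimeException e) {
                failures++; // a real loop would log the exception here
            }
        }
        return failures;
    }
}
```

In a real scheduler the trade-off is that swallowing exceptions can hide persistent faults, so a failure counter or log entry per caught exception would likely be needed.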