Hadoop YARN / YARN-11503

Adding queues separately in short succession with the Mutation API stops the Capacity Scheduler (CS) from allocating new containers

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: capacity scheduler
    • Labels: None

    Description

      Adding multiple queues in short succession via the Mutation API triggers a race condition when the partition metrics for those queues are registered, as shown by the unhandled exception in the following log:

      2023-05-09 18:25:36,301 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m
      2023-05-09 18:25:36,301 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager: Initialized queue: root.eca_m
      2023-05-09 18:25:36,359 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: LeafQueue:root.eca_mupdate max app related, maxApplications=1000, maxApplicationsPerUser=1000, Abs Cap:0.0, Cap: 0.0, MaxCap : 1.0
      2023-05-09 18:25:36,359 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: LeafQueue:root.eca_mupdate max app related, maxApplications=1000, maxApplicationsPerUser=1000, Abs Cap:NaN, Cap: NaN, MaxCap : NaN
      2023-05-09 18:25:36,401 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m
      2023-05-09 18:25:36,401 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager: Initialized queue: root.eca_m
      2023-05-09 18:25:36,484 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-26,5,main] threw an Exception.
      org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists!
      2023-05-09 18:25:36,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Initializing root.eca_m
      2023-05-09 18:25:36,531 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: root: re-configured queue: root.eca_m: capacity=0.0, absoluteCapacity=0.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0, effectiveMinResource=<memory:1152000, vCores:359> , effectiveMaxResource=<memory:2304000, vCores:718>
      

      Initializing the leaf queue root.eca_m should happen only once during a reinit (twice if the validation endpoint is used), but in this case it happened three times in under a quarter of a second. This leads to an unhandled exception in the async scheduling thread, which then blocks new container allocation (existing containers can still transition to other states).

      2023-05-09 18:25:36,484 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-26,5,main] threw an Exception.
      org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists!
      at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
      at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
      at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:355)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToUser(QueueMetrics.java:614)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1545)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1198)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1109)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:927)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
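
The trace above ends in a metrics registration that rejects a name it has already seen. The following is a minimal sketch of how a check-then-register sequence can hit that error when two code paths create metrics for the same queue, and of an atomic alternative. The class `MetricsRegistrySketch` and its methods are hypothetical stand-ins written for illustration; they are not the Hadoop `MetricsSystem` or `QueueMetrics` code, and the actual root cause in YARN may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a metrics registry; not the Hadoop MetricsSystem API. */
final class MetricsRegistrySketch {
    private final Map<String, Object> sources = new ConcurrentHashMap<>();

    /** Strict registration: a second registration under the same name throws. */
    Object registerStrict(String name, Object source) {
        if (sources.putIfAbsent(name, source) != null) {
            throw new IllegalStateException("Metrics source " + name + " already exists!");
        }
        return source;
    }

    /** Racy pattern: the lookup and the registration are two separate steps. */
    Object getOrCreateRacy(String name) {
        Object existing = sources.get(name);
        if (existing == null) {
            // Another thread (for example one re-initializing the same queue) can
            // register the name at this point, so the strict call below may throw.
            existing = registerStrict(name, new Object());
        }
        return existing;
    }

    /** Atomic alternative: lookup and creation happen as one step per key. */
    Object getOrCreateAtomic(String name) {
        return sources.computeIfAbsent(name, n -> new Object());
    }
}
```

With the atomic variant, repeated or concurrent requests for the same source name return the instance that was created first instead of raising an exception.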
      

      Even though the Mutation API wasn't designed for this usage, the scheduling thread shouldn't react like this to API calls.
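
One way to picture the expected behaviour is a scheduling loop that contains a failure within a single pass instead of letting it end the thread. The sketch below is an assumption-laden illustration only: `ResilientLoopSketch` and `runLoop` are hypothetical names, and this is not the `CapacityScheduler` code or a proposed patch. A real fix would also need to log the error and address the duplicate registration itself.

```java
/** Hypothetical scheduling loop; not the actual CapacityScheduler code. */
final class ResilientLoopSketch {
    /** Runs the pass the given number of times and returns how many passes failed. */
    static int runLoop(Runnable schedulePass, int iterations) {
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                schedulePass.run();
            } catch (RuntimeException e) {
                // Count the failure and continue, so one bad pass does not
                // terminate the thread that drives all later allocations.
                failures++;
            }
        }
        return failures;
    }
}
```

If the loop body were not wrapped this way, the first runtime exception would propagate out of the thread's run method and no further passes would execute, which matches the symptom described in this issue.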

          People

            Assignee: Unassigned
            Reporter: Benjamin Teke (bteke)