Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
Reviewed
Description
Replication Steps:
Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)
spark.executor.memory 1024M
spark.driver.memory 2048M
spark.executor.cores 1
spark.executor.instances 20
spark.dynamicAllocation.enabled false
Based on the setup, there should be 20 spark executors, but from the ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 of them were released in seconds. On analyzing the Spark ApplicationMaster (AM) logs, The following logs were observed.
4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom resources: <memory:1408, max memory:2147483647, vCores:1, max vCores:2147483647>
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, launching executors on 4 of them.
24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, launching executors on 0 of them.
It was clear for the logs that extra allocated 12 containers are being ignored from Spark side. Inorder to debug this further, additional log lines were added to AppSchedulingInfo class in increment and decrement of container request to expose additional information about the request.
2024-06-24 14:10:14,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,111 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 14 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 13 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,327 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,328 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,362 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,363 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,364 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 14:10:14,454 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 35 on default port 8030): Updates PendingContainers: 4 Decremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,454 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 35 on default port 8030): Updates PendingContainers: 0 Incremented by: 12 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,578 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 12 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,579 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 11 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,614 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 10 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,614 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 9 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,614 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 8 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,615 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 7 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,615 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 6 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,615 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 5 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,829 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,829 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,864 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:14,864 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_0000012024-06-24 14:10:14,874 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 Incremented by: 4 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,080 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 4 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,081 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 3 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,115 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 2 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:15,115 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 1 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_000001 2024-06-24 14:10:17,931 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 3 on default port 8030): checking for deactivate of application :application_1719234929152_0004 2024-06-24 14:10:20,743 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Application application_1719234929152_0004 requests cleared
RCA
It looks like the issue is happening due to asynchronous nature of AMRMClient response communication. The newly allocated containers from Yarn is send as part of the next request from the client causing additional additional. Please check the following example. As per the above mentioned logs
- At time 2024-06-24 14:10:14,075, AM sends 20 container request to Yarn, since this is the first request, yarn had not allocated any containers and send 0 containers as response.
- At time 2024-06-24 14:10:14,454, AM sends 12 container request to Yarn which suggests that before this time there was an empty request ask to Yarn which had returned 8 containers as allocated in reponse. For this request was well the Yarn returned 8 container as allocated in response.
- At time 2024-06-24 14:10:14,874, AM again send 4 container request to yarn for which it returns 12 containers allocated (based on previous request) as response.
- Since all the container request is exhausted, AM sends empty request to yarn which returns 4 containers allocated (based on previous request)
- So in total 32 containers were allocated for 20 container request. The problem here is AM (client) is being notified about the allocated container request in the next heartbeat/request causing inconsistency.
Solutions
1. Modify the AM request in the Yarn/RM side
Normalize/update the AM container request to consider the allocated containers. This can be done in the respective scheduler side i.e newContainerAsk = AMContainerAsk - AllocatedContainers.
2. Make AM aware of allocated container
Before client making allocate request, Check AM should check with RM about container allocated. This approach requires additional AM-RM client communication which can be expense when there are large number of allocate requests.
PS: I could see similar behavior with Apache Tez and could happen with the latest hadoop-3.4.0 version as well.
Attachments
Issue Links
- links to