Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
-
Reviewed
Description
When killing a app, RM scheduler releases app's resource as soon as possible, then it might allocate these resource for new requests. But NM have not released them at that time.
The problem was found when we supported GPU as a resource(YARN-4122). Test environment: a NM had 6 GPUs, app A used all 6 GPUs, app B was requesting 3 GPUs. Killed app A, then RM released A's 6 GPUs, and allocated 3 GPUs to B. But when B tried to start container on NM, NM found it didn't have 3 GPUs to allocate because it had not released A's GPUs.
I think the problem also exists for CPU/Memory. It might cause OOM when memory is overused.