Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
A growing variety of workloads is moving to YARN, including machine learning and deep learning workloads that can be sped up by leveraging GPU computation power. Workloads should be able to request GPUs from YARN as simply as they request CPU and memory.
To make a complete GPU story, we should support the following pieces:
1) GPU discovery/configuration: the admin can either configure GPU resources and architectures on each node or, in the more advanced case, the NodeManager can automatically discover GPU resources and architectures and report them to the ResourceManager.
2) GPU scheduling: the YARN scheduler should account for GPU as a resource type, just like CPU and memory.
3) GPU isolation/monitoring: once a task is launched with GPU resources, the NodeManager should properly isolate and monitor the task's resource usage.
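The first piece, choosing between admin configuration and automatic discovery, could look roughly like the following. This is only a minimal sketch under stated assumptions: the class `GpuDiscoverySketch`, the method `resolveGpus`, the `auto` keyword, and the `index:minor` list format are all hypothetical and are not taken from this issue or from any YARN API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: resolve a node's GPUs from admin config or auto-discovery. */
public class GpuDiscoverySketch {

    /** One GPU on the node: its logical index and its device minor number. */
    static final class GpuDevice {
        final int index;
        final int minorNumber;

        GpuDevice(int index, int minorNumber) {
            this.index = index;
            this.minorNumber = minorNumber;
        }

        @Override
        public String toString() {
            return "GPU" + index + "(minor " + minorNumber + ")";
        }
    }

    /**
     * Returns the GPUs this node should report.
     * A null, empty, or "auto" value defers to the discoverer; anything else is
     * parsed as a comma-separated list of index:minor pairs (assumed format).
     */
    static List<GpuDevice> resolveGpus(String configured,
                                       Supplier<List<GpuDevice>> discoverer) {
        if (configured == null || configured.trim().isEmpty()
                || configured.trim().equalsIgnoreCase("auto")) {
            return discoverer.get();
        }
        List<GpuDevice> devices = new ArrayList<>();
        for (String entry : configured.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected index:minor, got: " + entry);
            }
            devices.add(new GpuDevice(Integer.parseInt(parts[0].trim()),
                                      Integer.parseInt(parts[1].trim())));
        }
        return Collections.unmodifiableList(devices);
    }

    public static void main(String[] args) {
        // Stand-in for real discovery (for example, querying the driver); finds no GPUs.
        Supplier<List<GpuDevice>> noGpus = Collections::emptyList;
        System.out.println(resolveGpus("0:0,1:1,2:4", noGpus));
        System.out.println(resolveGpus("auto", noGpus).size());
    }
}
```

Passing the discoverer in as a `Supplier` keeps the admin-configured path testable on machines without GPUs; the node would then report whatever list comes back.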
For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an extensible framework to support isolation for different resource types and different runtimes.
Related JIRAs:
A couple of JIRAs (YARN-4122, YARN-5517) have been filed with similar goals but different solutions:
For scheduling:
YARN-4122 and YARN-5517 both add a new GPU resource type to the Resource protocol instead of leveraging YARN-3926.
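To illustrate why an extensible resource model avoids protocol changes, here is a minimal sketch of a node that accounts for any named resource type through one code path. It is not the YARN-3926 implementation: the class `ResourceAccountingSketch` and its methods are hypothetical, and the resource name `gpu` is an illustrative assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a node that accounts for any named resource type uniformly. */
public class ResourceAccountingSketch {

    private final Map<String, Long> capacity;
    private final Map<String, Long> used = new HashMap<>();

    ResourceAccountingSketch(Map<String, Long> capacity) {
        this.capacity = new HashMap<>(capacity);
    }

    /** True if every requested resource type fits in what is still free on the node. */
    boolean canFit(Map<String, Long> request) {
        for (Map.Entry<String, Long> e : request.entrySet()) {
            long free = capacity.getOrDefault(e.getKey(), 0L)
                      - used.getOrDefault(e.getKey(), 0L);
            if (e.getValue() > free) {
                return false;
            }
        }
        return true;
    }

    /** Reserves the request if it fits; returns whether the allocation succeeded. */
    boolean allocate(Map<String, Long> request) {
        if (!canFit(request)) {
            return false;
        }
        request.forEach((name, amount) -> used.merge(name, amount, Long::sum));
        return true;
    }

    /** Returns a previously allocated request to the free pool. */
    void release(Map<String, Long> request) {
        request.forEach((name, amount) -> used.merge(name, -amount, Long::sum));
    }

    public static void main(String[] args) {
        Map<String, Long> capacity = new HashMap<>();
        capacity.put("memory-mb", 8192L);
        capacity.put("vcores", 4L);
        capacity.put("gpu", 2L); // illustrative resource name, an assumption

        Map<String, Long> request = new HashMap<>();
        request.put("memory-mb", 1024L);
        request.put("vcores", 1L);
        request.put("gpu", 1L);

        ResourceAccountingSketch node = new ResourceAccountingSketch(capacity);
        System.out.println(node.allocate(request)); // first GPU
        System.out.println(node.allocate(request)); // second GPU
        System.out.println(node.allocate(request)); // no GPU left on this node
    }
}
```

Because GPU is just another map key here, adding it needs no new field in the resource type, which is the design point made above in favor of building on YARN-3926.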
For isolation:
YARN-4122 proposed using CGroups for isolation, which cannot solve the problems listed at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such as minor device number mapping, loading the nvidia_uvm module, and mismatched CUDA/driver versions.
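The minor device number mapping problem can be sketched as follows: the device file for a GPU is named after its minor number (`/dev/nvidia<minor>`), which need not equal the GPU's logical index, so an isolation layer has to translate explicitly. The class `GpuIsolationSketch` and its methods are hypothetical, and this sketch covers only the mapping step; it does not address the nvidia_uvm module or CUDA/driver version mismatches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: translate assigned GPU indexes into device files by minor number. */
public class GpuIsolationSketch {

    /** Device file for a GPU with the given minor number. */
    static String devicePath(int minorNumber) {
        return "/dev/nvidia" + minorNumber;
    }

    /** Device files a container may access, in ascending GPU index order. */
    static List<String> visibleDevices(Map<Integer, Integer> indexToMinor,
                                       Set<Integer> assignedIndexes) {
        List<String> paths = new ArrayList<>();
        for (Integer index : new TreeSet<>(assignedIndexes)) {
            Integer minor = indexToMinor.get(index);
            if (minor == null) {
                throw new IllegalArgumentException("Unknown GPU index: " + index);
            }
            paths.add(devicePath(minor));
        }
        return paths;
    }

    /** Device files that must be hidden from the container, in ascending GPU index order. */
    static List<String> hiddenDevices(Map<Integer, Integer> indexToMinor,
                                      Set<Integer> assignedIndexes) {
        List<String> paths = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(indexToMinor).entrySet()) {
            if (!assignedIndexes.contains(e.getKey())) {
                paths.add(devicePath(e.getValue()));
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Illustrative node where GPU index 2 has minor number 4.
        Map<Integer, Integer> indexToMinor = new TreeMap<>();
        indexToMinor.put(0, 0);
        indexToMinor.put(1, 1);
        indexToMinor.put(2, 4);

        Set<Integer> assigned = new TreeSet<>();
        assigned.add(2);

        System.out.println("visible: " + visibleDevices(indexToMinor, assigned));
        System.out.println("hidden:  " + hiddenDevices(indexToMinor, assigned));
    }
}
```

Assuming index equals minor number would expose `/dev/nvidia2` in this example, a device file that does not exist on the illustrative node.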
Attachments
Issue Links
- is related to
  - YARN-8200 Backport resource types/GPU features to branch-3.0/branch-2 (Resolved)
- relates to
  - YARN-4122 Add support for GPU as a resource (Resolved)
  - YARN-5983 [Umbrella] Support for FPGA as a Resource in YARN (Resolved)
- requires
  - YARN-3926 [Umbrella] Extend the YARN resource model for easier resource-type management and profiles (Resolved)