[YARN-6223] [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.1.0
Component/s: None
Labels: None

Description

A growing variety of workloads is moving to YARN, including machine learning and deep learning workloads that can be sped up by GPU computation. Workloads should be able to request GPUs from YARN as simply as they request CPU and memory.

To make a complete GPU story, we should support the following pieces:
1) GPU discovery/configuration: An admin can either configure GPU resources and architectures on each node or, in a more advanced setup, the NodeManager can automatically discover GPU resources and architectures and report them to the ResourceManager.

2) GPU scheduling: The YARN scheduler should account for GPU as a resource type, just like CPU and memory.

3) GPU isolation/monitoring: Once a task is launched with GPU resources, the NodeManager should properly isolate and monitor the task's resource usage.

For #2, ~~YARN-3926~~ can support it natively. For #3, ~~YARN-3611~~ has introduced an extensible framework to support isolation for different resource types and different runtimes.
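
As a rough illustration of point #2, the sketch below treats GPU as one more named, countable resource next to memory and vcores. It is not YARN code: the `Node` class and its methods are hypothetical, and only the resource names (`memory-mb`, `vcores`, `yarn.io/gpu`) follow the naming used by YARN resource types.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """Hypothetical node model: capacity and allocation keyed by resource name."""
    name: str
    capacity: Dict[str, int]
    allocated: Dict[str, int] = field(default_factory=dict)

    def available(self, resource: str) -> int:
        # A resource the node does not declare has zero capacity.
        return self.capacity.get(resource, 0) - self.allocated.get(resource, 0)

    def fits(self, request: Dict[str, int]) -> bool:
        # Every requested resource type, GPU included, must fit at once.
        return all(self.available(r) >= amount for r, amount in request.items())

    def allocate(self, request: Dict[str, int]) -> bool:
        # Allocate all-or-nothing; return False if any resource is short.
        if not self.fits(request):
            return False
        for r, amount in request.items():
            self.allocated[r] = self.allocated.get(r, 0) + amount
        return True


node = Node("n1", {"memory-mb": 8192, "vcores": 8, "yarn.io/gpu": 2})
first = node.allocate({"memory-mb": 1024, "vcores": 1, "yarn.io/gpu": 2})
second = node.allocate({"memory-mb": 1024, "vcores": 1, "yarn.io/gpu": 1})
print(first, second)
```

The point of the generic dictionary is that the scheduler needs no GPU-specific branch: a request is placed only if every resource type it names is available on the node.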

Related JIRAs:
There are a couple of JIRAs (~~YARN-4122~~/~~YARN-5517~~) filed with similar goals but different solutions:
For scheduling:

~~YARN-4122~~ and ~~YARN-5517~~ both add a new GPU resource type to the Resource protocol instead of leveraging ~~YARN-3926~~.

For isolation:

~~YARN-4122~~ proposed using CGroups for isolation, which cannot solve the problems listed at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such as minor device number mapping, loading the nvidia_uvm module, and mismatched CUDA/driver versions.
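
To make the minor device number concern concrete, here is a minimal sketch of how deny entries for a device cgroup might be computed. It assumes the cgroup v1 devices controller entry syntax (`type major:minor access`) and the conventional character-device major number 195 for `/dev/nvidiaN`. The function name and its inputs are hypothetical and do not come from any of the patches attached to this issue.

```python
from typing import Iterable, List

# Assumption: NVIDIA GPU device nodes (/dev/nvidiaN) use character major 195.
NVIDIA_MAJOR = 195


def deny_entries(all_minors: Iterable[int], allowed_minors: Iterable[int]) -> List[str]:
    """Return one devices.deny entry per GPU the container must not access.

    Entries are keyed by device minor number, which is not guaranteed to
    equal the GPU index shown by other tools, so the caller has to pass
    real minor numbers rather than indexes.
    """
    denied = sorted(set(all_minors) - set(allowed_minors))
    return [f"c {NVIDIA_MAJOR}:{minor} rwm" for minor in denied]


# Node with four GPUs; the container was granted only the GPU with minor 1.
for entry in deny_entries([0, 1, 2, 3], [1]):
    print(entry)
```

Passing GPU indexes where minor numbers are expected would block the wrong devices, which is one reason the mapping between the two has to be handled explicitly.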

Attachments

YARN-6223.Natively-support-GPU-on-YARN-v1.pdf	01/Apr/17 03:40	169 kB	Wangda Tan
YARN-6223.wip.1.patch	01/Apr/17 03:40	31 kB	Wangda Tan
YARN-6223.wip.2.patch	28/Jun/17 02:38	69 kB	Wangda Tan
YARN-6223.wip.3.patch	14/Jul/17 22:49	128 kB	Wangda Tan

Issue Links

is related to

YARN-8200 Backport resource types/GPU features to branch-3.0/branch-2	Resolved

relates to

YARN-4122 Add support for GPU as a resource	Resolved
YARN-5983 [Umbrella] Support for FPGA as a Resource in YARN	Resolved

requires

YARN-3926 [Umbrella] Extend the YARN resource model for easier resource-type management and profiles	Resolved

Sub-Tasks

1.	Add support for GPU as a resource	Resolved	Jun Gong
2.	Add support in NodeManager to isolate GPU devices by using CGroups	Resolved	Wangda Tan
3.	[YARN-6223] Native code changes to support isolate GPU devices by using CGroups	Resolved	Wangda Tan
4.	Document GPU isolation feature	Resolved	Wangda Tan
5.	Support GPU isolation for docker container	Resolved	Wangda Tan
6.	Add support to show GPU in UI including metrics	Resolved	Wangda Tan
7.	GPU Isolation: Incorrect minor device numbers written to devices.deny file	Resolved	Jonathan Hung
8.	Use "docker volume inspect" to make sure that volumes for GPU drivers/libs are properly mounted.	Resolved	Wangda Tan
9.	Ensure volume to include GPU base libraries after created by plugin	Resolved	Wangda Tan
10.	Gpu Information page could be empty for nodes without GPU	Resolved	Sunil G
11.	GPU volume creation command fails when work preserving is disabled at NM	Resolved	Zian Chen
12.	Document YARN Ambari Integration Guide for GPU	Resolved	Zian Chen

People

Assignee: Wangda Tan

Reporter: Wangda Tan

Votes: 4

Watchers: 55

Dates

Created: 23/Feb/17 00:46

Updated: 14/Nov/18 17:38

Resolved: 06/Apr/18 18:32