Description
We enhance Hadoop with GPU support for better AI job scheduling.
YARN-3926 already supports GPU scheduling, but it treats GPUs as a countable resource.
However, GPU placement also matters greatly for the efficiency of deep learning jobs.
For example, a 2-GPU job running on GPUs {0, 1} could be faster than one running on GPUs {0, 7} if GPUs 0 and 1 are under the same PCI-E switch while GPUs 0 and 7 are not.
We add support for GPU locality scheduling to Hadoop 2.7.2, which enables fine-grained GPU placement.
A 64-bit bitmap is added to the YARN Resource to represent both GPU usage and locality information on a node (up to 64 GPUs per node). Each bit corresponds to one GPU: '1' means the GPU is available and '0' means it is not.
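
To make the bitmap semantics concrete, here is a minimal Java sketch of how such a 64-bit availability mask could be queried and updated. It is an illustration only, not the code in the patch: the class GpuBitmapSketch and all of its method names are hypothetical, and it assumes that bit i (counting from the least significant bit) corresponds to GPU i.

```java
// Hypothetical sketch: names are illustrative and not part of the YARN API.
// Assumption: bit i (least significant bit = GPU 0) is 1 when GPU i is available.
final class GpuBitmapSketch {
    private GpuBitmapSketch() {}

    private static void checkIndex(int index) {
        if (index < 0 || index >= 64) {
            throw new IllegalArgumentException("GPU index out of range: " + index);
        }
    }

    /** Builds a mask with one bit set for each requested GPU index. */
    static long maskOf(int... gpuIndices) {
        long mask = 0L;
        for (int index : gpuIndices) {
            checkIndex(index);
            mask |= 1L << index;
        }
        return mask;
    }

    /** Returns true if the bit for the given GPU is 1 (available). */
    static boolean isAvailable(long bitmap, int index) {
        checkIndex(index);
        return ((bitmap >>> index) & 1L) == 1L;
    }

    /** Number of available GPUs, i.e. the countable view of the same resource. */
    static int countAvailable(long bitmap) {
        return Long.bitCount(bitmap);
    }

    /** Returns true if every GPU in requestMask is currently available on the node. */
    static boolean canAllocate(long nodeBitmap, long requestMask) {
        return (nodeBitmap & requestMask) == requestMask;
    }

    /** Clears the requested bits, marking those GPUs as in use. */
    static long allocate(long nodeBitmap, long requestMask) {
        if (!canAllocate(nodeBitmap, requestMask)) {
            throw new IllegalStateException("Requested GPUs are not all available");
        }
        return nodeBitmap & ~requestMask;
    }

    /** Sets the released bits back to 1, marking those GPUs as available again. */
    static long release(long nodeBitmap, long releasedMask) {
        return nodeBitmap | releasedMask;
    }
}
```

Under these assumptions, a node with eight free GPUs is represented as 0xFF. Allocating GPUs {0, 1} leaves 0xFC, after which a request for GPUs {0, 7} is rejected because GPU 0 is busy. The bit count still gives the countable view of the resource, while the bit positions carry the placement information that a locality-aware scheduler needs.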