[YARN-8423] GPU does not get released even though the application gets killed. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.2.0, 3.1.1
Component/s: yarn
Labels: None
Target Version/s: 3.1.1
Hadoop Flags: Reviewed

Description

Run a TensorFlow app requesting one GPU.
Kill the application once the GPU is allocated.
Query the NodeManager once the application is killed. The GPU is not released: the response below still lists GPU 0 under assignedGpuDevices.

 curl -i <NM>/ws/v1/node/resources/yarn.io%2Fgpu
{"gpuDeviceInformation":{"gpus":[{"productName":"<productName>","uuid":"GPU-<UID>","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"<productName>","uuid":"GPU-<UID>","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":"<version>"},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_<containerID>"}]}
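The check above can be automated. The following is a minimal sketch, not part of the patch: it parses a response shaped like the one above and reports GPUs still assigned to containers that are known to be killed. The function names and the `container_example` ID are hypothetical, and the sample payload is trimmed to the two fields the check reads.

```python
import json


def assigned_gpus(nm_response_text):
    """Return (minorNumber, containerId) pairs the NodeManager reports as assigned."""
    payload = json.loads(nm_response_text)
    return [
        (device.get("minorNumber"), device.get("containerId"))
        for device in payload.get("assignedGpuDevices", [])
    ]


def leaked_gpus(nm_response_text, killed_container_ids):
    """Return assignments that still point at containers known to be killed."""
    killed = set(killed_container_ids)
    return [pair for pair in assigned_gpus(nm_response_text) if pair[1] in killed]


# Trimmed sample mirroring the structure of the response above.
# "container_example" is a placeholder, not a real container ID.
sample = (
    '{"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],'
    '"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_example"}]}'
)

print(leaked_gpus(sample, ["container_example"]))  # [(0, 'container_example')]
```

An empty list after the application is killed would indicate the GPU was released; a non-empty list reproduces the behavior reported here.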

Attachments

YARN-8423.003.patch, 26/Jun/18 01:46, 10 kB, Sunil G
YARN-8423.002.patch, 22/Jun/18 18:43, 10 kB, Sunil G
YARN-8423.001.patch, 14/Jun/18 19:03, 5 kB, Sunil G
kill-container-nm.log, 14/Jun/18 06:15, 4 kB, Wangda Tan

Issue Links

is related to: YARN-8450 Blocking resources such as GPU/FPGA etc tend to release actual device slowly even after RM identifies it as COMPLETED (Open)

relates to: YARN-8463 Add more tests for GPU allocation and release scenarios (Open)

People

Assignee: Sunil G

Reporter: Sumana Sathish

Votes: 0

Watchers: 10

Dates

Created: 12/Jun/18 23:16

Updated: 27/Jun/18 02:57

Resolved: 27/Jun/18 02:49