Hadoop YARN / YARN-4011

Jobs fail since nm-local-dir not cleaned up when rogue job fills up disk

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: yarn
    • Labels: None

    Description

      We observed jobs failing because tasks couldn't launch on nodes due to "java.io.IOException: No space left on device".
      On digging in further, we found a rogue job which had filled up the disk.
      Specifically, it wrote a lot of map spills (like attempt_1432082376223_461647_m_000421_0_spill_10000.out) to nm-local-dir, causing the disk to fill up. It then failed/got killed, but didn't clean up these files in nm-local-dir.
      So the disk remained full, causing subsequent jobs to fail.

      This jira is created to address why files under nm-local-dir don't get cleaned up when a job fails after filling up the disk.

          Activity

            maysamyabandeh Maysam Yabandeh added a comment -

            Thanks jlowe. I created MAPREDUCE-6489 for this.

            jlowe Jason Darrell Lowe added a comment -

            The mapreduce task can check the BYTES_WRITTEN counter and fail fast if it is above the configured limit.

            I think having the MR framework provide an optional limit for local filesystem output is a reasonable request until a more sophisticated solution can be implemented by YARN directly.

            maysamyabandeh Maysam Yabandeh added a comment -

            We face this problem quite often on our ad hoc cluster and are thinking of implementing some basic checks to make such misbehaved jobs fail fast.

            Until we have a proper solution for yarn, could we have a mapreduce-specific solution in place to protect the cluster from rogue mapreduce tasks? The mapreduce task can check the BYTES_WRITTEN counter and fail fast if it is above the configured limit. It is true that bytes written is larger than the actual disk space used, but to detect a rogue task the exact value is not required; a very large number of bytes written to local disk is a good indication that the task is misbehaving.

            Thoughts?
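            A minimal sketch of the check proposed above, for illustration only: it sums the local file system write statistics visible to the task's JVM and throws once a configured limit is exceeded. The class, the method, and the example.task.local-fs.write-limit.bytes property name are hypothetical, not existing MapReduce API or configuration; how often and from where the task would call it is also an assumption.

            import java.io.IOException;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;

            // Hedged sketch: one way a task could fail fast on excessive local-FS writes.
            public class LocalWriteLimitCheck {

              // Hypothetical property naming the per-task local write limit in bytes.
              static final String LIMIT_KEY = "example.task.local-fs.write-limit.bytes";

              // Sums bytes written to the local ("file") file system by this JVM and
              // throws if the configured limit is exceeded. Intended to be called
              // periodically, e.g. alongside the task's counter/progress updates.
              public static void checkLocalBytesWritten(Configuration conf) throws IOException {
                long limit = conf.getLong(LIMIT_KEY, -1L);
                if (limit < 0) {
                  return; // no limit configured
                }
                long written = 0;
                for (FileSystem.Statistics stats : FileSystem.getAllStatistics()) {
                  if ("file".equals(stats.getScheme())) {
                    written += stats.getBytesWritten();
                  }
                }
                if (written > limit) {
                  throw new IOException("Local FS bytes written " + written
                      + " exceeded configured limit " + limit + "; failing fast");
                }
              }
            }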
            ashwinshankar77 Ashwin Shankar added a comment -

            Thanks Jason, this is exactly what I was looking for!

            jlowe Jason Darrell Lowe added a comment -

            Yes, we ran into this a while ago; see YARN-2473, which was ultimately fixed by YARN-90. There were a number of other fixes as well, such as YARN-3850 and YARN-3925. Closing this as a duplicate of YARN-90.

            As for the ability to limit disk space, this has been discussed many times before, going as far back as MAPREDUCE-1100. An unsophisticated solution is to have the container monitor that already looks for excessive memory usage also monitor disk usage and kill the container if its usage is too large. However, this doesn't solve the problem for containers that are writing to locations that aren't container-specific (e.g. where maps store their shuffle outputs, /tmp, etc.). I think it could be difficult to enforce limits on tasks that are filling the disk in arbitrary ways, but it could be straightforward to catch a task that is simply logging too much.
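            For illustration, a rough sketch of the per-container disk check such a monitor thread could run. The class and method names are hypothetical and this is not NodeManager code; it is just a directory walk over a container's local directory compared against a limit, which by construction misses the non-container-specific locations mentioned above.

            import java.io.File;

            // Hedged sketch: sums on-disk usage under a container's local directory.
            public class ContainerDiskUsageCheck {

              // Recursively sums file sizes under the given path.
              static long usageBytes(File path) {
                File[] children = path.listFiles();
                if (children == null) {
                  return path.length(); // plain file (or unreadable directory)
                }
                long total = 0;
                for (File f : children) {
                  total += f.isDirectory() ? usageBytes(f) : f.length();
                }
                return total;
              }

              // Returns true if the container's local directory exceeds the limit.
              // A real monitor would act on this (e.g. kill the container) rather
              // than just report it.
              static boolean overLimit(File containerLocalDir, long limitBytes) {
                return usageBytes(containerLocalDir) > limitBytes;
              }
            }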
            ashwinshankar77 Ashwin Shankar added a comment -

            Hey jlowe, have you encountered this issue before at Yahoo? Also, would it make sense to have a feature on the NM to limit the amount of data a user/app can write to nm-local-dir, to protect other users? I'm looking into related jiras like YARN-1781, which could be a band-aid for this problem.

            People

              Assignee:
              Unassigned
              Reporter:
              ashwinshankar77 Ashwin Shankar
              Votes:
              0
              Watchers:
              5

              Dates

                Created:
                Updated:
                Resolved: