[TEZ-3130] A bad NodeManager can end up occupying all container launcher threads, delaying new launches - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.7.0
Fix Version/s: None
Component/s: None
Labels: None
Target Version/s: 0.8.6

Description

If a single NodeManager is bad and many containers are allocated on that node, all container launcher threads can end up blocked on it, delaying subsequent launches.
This happens even though timeouts kick in.
1) We should not allow all threads to be used up by a single NM.
2) The retry policy could be enhanced to stop retrying on ConnectionTimeouts (e.g. node down).
3) Launch requests should be interrupted once Tez has detected that a container has timed out.
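
A minimal sketch of point 1, assuming a per-node cap on in-flight launches is an acceptable way to keep one NM from consuming the whole launcher pool. `PerNodeLaunchLimiter`, `maxPerNode` and the node id strings are hypothetical names for illustration only; none of them exist in Tez, and how a refused launch is requeued is left open.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * Hypothetical helper (not part of Tez): caps how many launcher threads
 * may be busy with a single NodeManager at the same time.
 */
final class PerNodeLaunchLimiter {
    private final int maxPerNode;
    // One semaphore per node id, created lazily on first use.
    private final ConcurrentHashMap<String, Semaphore> permits = new ConcurrentHashMap<>();

    PerNodeLaunchLimiter(int maxPerNode) {
        if (maxPerNode <= 0) {
            throw new IllegalArgumentException("maxPerNode must be positive");
        }
        this.maxPerNode = maxPerNode;
    }

    /**
     * Non-blocking: returns true if a launch on nodeId may start now.
     * On false, the caller should requeue the request instead of
     * tying up a launcher thread on this node.
     */
    boolean tryAcquire(String nodeId) {
        return permits
            .computeIfAbsent(nodeId, k -> new Semaphore(maxPerNode))
            .tryAcquire();
    }

    /** Must be called exactly once after each successful tryAcquire. */
    void release(String nodeId) {
        Semaphore s = permits.get(nodeId);
        if (s != null) {
            s.release();
        }
    }
}
```

A launcher thread would call `tryAcquire(nodeId)` before contacting the NM and `release(nodeId)` in a `finally` block afterwards. With a cap below the launcher pool size, launches for other nodes can still proceed while a bad node holds its permits. This does not by itself address points 2 and 3.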

Noticed by rajesh.balamohan: threads would lock up for 15 minutes in 0.7, and potentially indefinitely in 0.8. The 0.8 behavior is a separate bug that needs investigation.

People

Assignee: Unassigned

Reporter: Siddharth Seth

Votes: 0

Watchers: 3

Dates

Created: 19/Feb/16 23:31

Updated: 14/Mar/17 03:40