[TEZ-2738] ContainerLauncher tries to connect to unhealthy node for large number of times - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None
Target Version/s: 0.8.6

Description

Env: Ran a job with Tez (built from the master branch on Aug 24).

One of the nodes went down in the middle of the run, and DAGAppMaster had a container launch on that node. After some time, the node was declared unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster was unresponsive for more than 1.5 hours after DAG cleanup; it kept trying to connect to the unhealthy node. I will attach the logs to this JIRA.

ipc.client.connect.max.retries has been set to 50 in core-site.xml:

  <property>
    <name>ipc.client.connect.max.retries</name>
    <value>50</value>
    <description>Defines the maximum number of retries for IPC connections.</description>
  </property>
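
To get a feel for what a retry count of 50 can cost, here is a rough back-of-the-envelope sketch. It is not Hadoop or Tez code: the class RetryBudget and its method are hypothetical, and the model is deliberately simplified (every attempt blocks for the full connect timeout, and the client sleeps a fixed interval between attempts). The 1 s retry interval and 20 s connect timeout are assumed values for illustration; the actual settings on the cluster should be checked in core-site.xml and core-default.xml.

```java
import java.time.Duration;

/** Hypothetical helper for estimating retry cost; not part of Hadoop or Tez. */
final class RetryBudget {
    private RetryBudget() {}

    /**
     * Worst-case time spent on one connection before giving up, under a
     * simplified model: the initial attempt plus maxRetries retries each
     * block for connectTimeout, with retryInterval of sleep before each retry.
     */
    static Duration worstCase(int maxRetries, Duration retryInterval, Duration connectTimeout) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        long attempts = maxRetries + 1L; // initial attempt plus retries
        return connectTimeout.multipliedBy(attempts)
                .plus(retryInterval.multipliedBy(maxRetries));
    }

    public static void main(String[] args) {
        // 50 comes from the core-site.xml setting above; 1 s and 20 s are assumptions.
        Duration d = worstCase(50, Duration.ofSeconds(1), Duration.ofSeconds(20));
        System.out.println(d.getSeconds() + " seconds"); // prints "1070 seconds"
    }
}
```

Under these assumptions a single connection sequence could take close to 18 minutes before failing. That alone would not reach 1.5 hours, so the hang may involve several such sequences or an additional retry layer on top; the attached logs would be needed to confirm either way.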

Attachments

log.txt.gz, 25/Aug/15 07:26, 65 kB, Rajesh Balamohan
logs.tar.gz, 24/Aug/15 23:04, 64 kB, Rajesh Balamohan

People

Assignee: Unassigned

Reporter: Rajesh Balamohan

Votes: 0

Watchers: 2

Dates

Created: 24/Aug/15 23:01

Updated: 14/Mar/17 03:40