Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Env: Ran a job with Tez (built from the master branch on Aug 24).
One of the nodes went down in the middle of the run, and DAGAppMaster had a container launch on that node. After some time, the node was declared unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster was unresponsive for more than 1.5 hours after DAG cleanup. It kept trying to connect to the unhealthy node. I will attach the logs to this JIRA.
ipc.client.connect.max.retries has been set to 50 in core-site.xml:
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>50</value>
  <description>Defines the maximum number of retries for IPC connections.</description>
</property>
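For a rough sense of scale, the sketch below estimates how long a single connection could keep retrying against an unreachable node before giving up. It is a simplified model, not Hadoop's actual IPC retry logic: only the retry count of 50 comes from the configuration above, while the 1-second retry interval, the 20-second connect timeout, and the function name are assumptions made for illustration.

```python
def worst_case_connect_seconds(max_retries, retry_interval_s, connect_timeout_s):
    """Upper bound on the time one connection spends before giving up.

    Simplified model: every attempt runs into the full connect timeout,
    and the client sleeps for a fixed interval between attempts.
    """
    attempts = max_retries + 1  # the initial attempt plus every retry
    return attempts * connect_timeout_s + max_retries * retry_interval_s


MAX_RETRIES = 50          # ipc.client.connect.max.retries from core-site.xml above
RETRY_INTERVAL_S = 1.0    # assumed sleep between retries (not stated in this report)
CONNECT_TIMEOUT_S = 20.0  # assumed per-attempt connect timeout (not stated in this report)

total = worst_case_connect_seconds(MAX_RETRIES, RETRY_INTERVAL_S, CONNECT_TIMEOUT_S)
print(f"{total:.0f} s (~{total / 60:.1f} min) per unreachable endpoint")
```

Under these assumptions, one connection can block for roughly 18 minutes, so a handful of sequential connection attempts to the same dead node would be enough to add up to the delay observed here. The attached logs are needed to confirm which retry path was actually taken.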