Apache Tez / TEZ-2738

ContainerLauncher tries to connect to an unhealthy node a large number of times

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Env: Ran a job with Tez (built from the master branch on Aug 24).

      One of the nodes went down in the middle of the run while DAGAppMaster had a container launch on that node. After some time, the node was declared unhealthy. Even though the job lasted only 7 minutes, DAGAppMaster remained unresponsive for more than 1.5 hours after DAG cleanup. It kept trying to connect to the unhealthy node. I will attach the logs to this JIRA.

      ipc.client.connect.max.retries has been set to 50 in core-site.xml:

      <property>
        <name>ipc.client.connect.max.retries</name>
        <value>50</value>
        <description>Defines the maximum number of retries for IPC connections.</description>
      </property>
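
To illustrate how a retry count of 50 can turn into a long stall, here is a rough back-of-the-envelope sketch. It does not model Hadoop's actual IPC retry logic. The function name `worst_case_stall_seconds` is hypothetical, and the per-attempt cost (21 seconds, i.e. an assumed 20-second connect timeout plus an assumed 1-second pause between retries) is not taken from this report. It also assumes that "max retries" means one initial attempt plus that many retries, and that requests to the dead node are tried one after another.

```python
def worst_case_stall_seconds(max_retries, seconds_per_attempt, pending_requests=1):
    """Rough upper bound on time spent retrying a dead node.

    Assumes each request makes one initial attempt plus `max_retries`
    retries, every attempt costs `seconds_per_attempt`, and requests
    are handled sequentially. All of these are simplifying assumptions.
    """
    if max_retries < 0 or seconds_per_attempt < 0 or pending_requests < 0:
        raise ValueError("arguments must be non-negative")
    attempts_per_request = max_retries + 1
    return attempts_per_request * seconds_per_attempt * pending_requests


if __name__ == "__main__":
    # 50 retries as configured in the report; 21 s per attempt is an assumption.
    print(worst_case_stall_seconds(50, 21))                      # one request
    print(worst_case_stall_seconds(50, 21, pending_requests=5))  # five sequential requests
```

Under these assumptions a single request to the dead node could block for roughly 18 minutes, so a handful of sequential requests, or an additional retry policy layered on top, would be enough to reach a stall of the order described above. The real numbers depend on the timeout and interval settings in effect on the cluster.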
      

      Attachments

        1. log.txt.gz
          65 kB
          Rajesh Balamohan
        2. logs.tar.gz
          64 kB
          Rajesh Balamohan

          People

            Assignee: Unassigned
            Reporter: Rajesh Balamohan
            Votes: 0
            Watchers: 2