Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-2509

YarnTaskSchedulerService should not try to allocate containers if AM is shutting down

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.6.2, 0.8.0-alpha, 0.7.1
    • None
    • None

    Description

      Observed when doing some recovery testing:

      Failure as during dag shutdown, 4 attempts of the same task failed.

      2015-06-01 07:38:27,184 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_FINISHED]: vertexName=initialmap, taskId=task_1433141118424_0012_2_00_000003, startTime=1433144297281, finishTime=1433144307184, timeTaken=9903, status=FAILED, successfulAttemptID=null, diagnostics=TaskAttempt 0 failed, info=[Container container_e02_1433141118424_0012_01_000018 hit an invalid transition - C_NM_STOP_SENT at RUNNING]
      TaskAttempt 1 failed, info=[AttemptId: attempt_1433141118424_0012_2_00_000003_1 cannot be allocated to container: container_e02_1433141118424_0012_01_000011 in STOP_REQUESTED state]
      TaskAttempt 2 failed, info=[Container container_e02_1433141118424_0012_01_000012 hit an invalid transition - C_NM_STOP_SENT at RUNNING]
      TaskAttempt 3 failed, info=[Container container_e02_1433141118424_0012_01_000025 hit an invalid transition - C_NM_STOP_SENT at RUNNING], counters=Counters: 0
      

      DAG kill signal received.

      2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMasterShutdownHook invoked
      2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
      

      First attempt marked as failed as container was killed.

      2015-06-01 07:38:26,906 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_ATTEMPT_FINISHED]: vertexName=initialmap, taskAttemptId=attempt_1433141118424_0012_2_00_000003_0, startTime=1433144297281, finishTime=1433144306904, timeTaken=9623, status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Container container_e02_1433141118424_0012_01_000018 hit an invalid transition - C_NM_STOP_SENT at RUNNING, counters=Counters: 0
      

      Subsequent attempt scheduled, assigned and eventually fails.

      2015-06-01 07:38:26,919 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Assigning container to task, container=Container: [ContainerId: container_e02_1433141118424_0012_01_000011, NodeId: ip-172-31-18-41.ec2.internal:45454, NodeHttpAddress: ip-172-31-18-41.ec2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 2, Token: Token { kind: ContainerToken, service: 172.31.18.41:45454 }, ], task=attempt_1433141118424_0012_2_00_000003_1, containerHost=ip-172-31, localityMatchType=NodeLocal, matchedLocation=ip-172-31-18-41.ec2.internal, honorLocalityFlags=true, reusedContainer=true, delayedContainers=4, containerResourceMemory=1536, containerResourceVCores=1
      

      Scheduler stops too late.

      2015-06-01 07:38:27,403 DEBUG [Thread-3] service.AbstractService: Service: org.apache.tez.dag.app.rm.YarnTaskSchedulerService entered state STOPPED
      

      Attachments

        1. TEZ-2509.1.patch
          5 kB
          Hitesh Shah
        2. TEZ-2509.2.patch
          13 kB
          Hitesh Shah
        3. TEZ-2509.3.patch
          12 kB
          Hitesh Shah

        Activity

          People

            hitesh Hitesh Shah
            hitesh Hitesh Shah
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: