Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
Description
Observed when doing some recovery testing:
Failure as during dag shutdown, 4 attempts of the same task failed.
2015-06-01 07:38:27,184 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_FINISHED]: vertexName=initialmap, taskId=task_1433141118424_0012_2_00_000003, startTime=1433144297281, finishTime=1433144307184, timeTaken=9903, status=FAILED, successfulAttemptID=null, diagnostics=TaskAttempt 0 failed, info=[Container container_e02_1433141118424_0012_01_000018 hit an invalid transition - C_NM_STOP_SENT at RUNNING]
TaskAttempt 1 failed, info=[AttemptId: attempt_1433141118424_0012_2_00_000003_1 cannot be allocated to container: container_e02_1433141118424_0012_01_000011 in STOP_REQUESTED state]
TaskAttempt 2 failed, info=[Container container_e02_1433141118424_0012_01_000012 hit an invalid transition - C_NM_STOP_SENT at RUNNING]
TaskAttempt 3 failed, info=[Container container_e02_1433141118424_0012_01_000025 hit an invalid transition - C_NM_STOP_SENT at RUNNING], counters=Counters: 0
DAG kill signal received.
2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMasterShutdownHook invoked 2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
First attempt marked as failed as container was killed.
2015-06-01 07:38:26,906 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_ATTEMPT_FINISHED]: vertexName=initialmap, taskAttemptId=attempt_1433141118424_0012_2_00_000003_0, startTime=1433144297281, finishTime=1433144306904, timeTaken=9623, status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Container container_e02_1433141118424_0012_01_000018 hit an invalid transition - C_NM_STOP_SENT at RUNNING, counters=Counters: 0
Subsequent attempt scheduled, assigned and eventually fails.
2015-06-01 07:38:26,919 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Assigning container to task, container=Container: [ContainerId: container_e02_1433141118424_0012_01_000011, NodeId: ip-172-31-18-41.ec2.internal:45454, NodeHttpAddress: ip-172-31-18-41.ec2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 2, Token: Token { kind: ContainerToken, service: 172.31.18.41:45454 }, ], task=attempt_1433141118424_0012_2_00_000003_1, containerHost=ip-172-31, localityMatchType=NodeLocal, matchedLocation=ip-172-31-18-41.ec2.internal, honorLocalityFlags=true, reusedContainer=true, delayedContainers=4, containerResourceMemory=1536, containerResourceVCores=1
Scheduler stops too late.
2015-06-01 07:38:27,403 DEBUG [Thread-3] service.AbstractService: Service: org.apache.tez.dag.app.rm.YarnTaskSchedulerService entered state STOPPED