Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Description
I encountered this "job hung" situation during one of the sort runs. Two tasks assigned to a TaskTracker (TT) were never rescheduled after the TT was lost, and this left the job stuck forever. The TT had been assigned many tasks, and all of them were rescheduled except these two. Here are the relevant log messages for one of the tasks (below, the JT log has been split into two parts to bring out the sequence of events).
JT log:
---------
2007-01-24 10:53:09,564 INFO org.apache.hadoop.mapred.JobInProgress: Choosing normal task tip_0001_m_020699
2007-01-24 10:53:09,564 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_0001_m_020699_0' to tip tip_0001_m_020699, for tracker 'foo.com:7020'
TT log:
---------
2007-01-24 10:53:09,564 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0001_m_020699_0
2007-01-24 10:53:12,180 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_m_020699_0 0.0% hdfs://foo:50000/user/ddas/somedir/part002444:134217728+134217728
JT log:
---------
2007-01-24 11:05:32,409 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker 'foo.com:7020'
Looks like there is some race condition. Since only two out of the many tasks were never rescheduled, it could mean that the JT was somehow unaware of the state of these two tasks after it assigned them to the (soon-to-be-lost) TT (did they get added to the relevant tables properly?). A minimal sketch of the kind of window being suspected is given below.
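The following is a deliberately simplified, hypothetical Java sketch of the suspected ordering problem, not the actual JobTracker code: the class, method, and field names (LostTrackerRaceSketch, assignTask, lostTracker, trackerToTasks) are illustrative only. It models a task assignment and a lost-tracker pass touching the tracker-to-tasks table without a common lock, so an assignment that lands after the reschedule pass leaves the task orphaned.

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the suspected race: if the lost-tracker pass runs
// before the assignment is recorded in the table, the task is never seen
// by the reschedule logic and the job hangs waiting for it.
public class LostTrackerRaceSketch {

    // trackerName -> task ids the JT believes are running there
    // (intentionally unsynchronized to expose the ordering window)
    private final Map<String, Set<String>> trackerToTasks = new HashMap<>();
    private final Set<String> rescheduled = new HashSet<>();

    // Called when a heartbeat hands a new task to a tracker.
    void assignTask(String tracker, String taskId) {
        trackerToTasks.computeIfAbsent(tracker, t -> new HashSet<>()).add(taskId);
    }

    // Called when the tracker is declared lost: reschedule everything it ran.
    void lostTracker(String tracker) {
        Set<String> tasks = trackerToTasks.remove(tracker);
        if (tasks != null) {
            rescheduled.addAll(tasks);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LostTrackerRaceSketch jt = new LostTrackerRaceSketch();

        // Thread 1: assigns task_0001_m_020699_0 to foo.com:7020.
        Thread assign = new Thread(() ->
                jt.assignTask("foo.com:7020", "task_0001_m_020699_0"));

        // Thread 2: expires the same tracker at (almost) the same moment.
        Thread expire = new Thread(() -> jt.lostTracker("foo.com:7020"));

        expire.start();   // lost-tracker pass may run first ...
        assign.start();   // ... and the assignment then lands in the table
        expire.join();
        assign.join();

        // If the assignment slipped in after the reschedule pass, the task
        // sits in trackerToTasks but is never rescheduled.
        System.out.println("rescheduled = " + jt.rescheduled);
        System.out.println("orphaned    = " + jt.trackerToTasks);
    }
}
{code}

If the real code has a similar window between recording the assignment and expiring the tracker, that would explain why only the last couple of tasks handed to the TT were left behind while everything assigned earlier was rescheduled cleanly.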