Details
Improvement
Status: Resolved
Major
Resolution: Fixed
1-win, 1.3.0
None
Description
It looks like the JVM manager is reaching an inconsistent state, which causes the TaskTracker (TT) to crash:
2013-06-09 06:41:11,250 FATAL org.apache.hadoop.mapred.JvmManager: Inconsistent state!!! JVM Manager reached an unstable state while reaping a JVM for task: attempt_201306080400_104812_m_000001_0 Number of active JVMs:8
  JVMId jvm_201306080400_104517_m_1331138312 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104517_m_000001_0
  JVMId jvm_201306080400_104641_m_-1631395161 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104641_m_000000_0
  JVMId jvm_201306080400_104494_m_-1702464703 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104494_m_000000_0
  JVMId jvm_201306080400_104784_m_1407576088 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104784_m_000000_0
  JVMId jvm_201306080400_104530_m_186665365 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104530_m_000000_0
  JVMId jvm_201306080400_104589_m_-1080246077 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104589_m_000000_0
  JVMId jvm_201306080400_104674_m_830017814 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104674_m_000000_0
  JVMId jvm_201306080400_104719_m_-226910128 #Tasks ran: 0 Currently busy? true Currently running: attempt_201306080400_104719_m_000000_0. Aborting.
2013-06-09 06:41:11,250 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
Although this causes the TT to crash, the error is rare and recoverable, so the priority of this issue is not high.
However, this does look like a bug in the JVM manager state machine. I'm guessing there is some race condition that we're hitting.
(Logs attached)
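For anyone triaging the attached logs, the JVM table embedded in the FATAL message can be pulled apart with a small script. The following is only a rough sketch: the regular expression is inferred from the single log line quoted above, the function name `parse_jvm_dump` is made up for illustration, and the sample string is a shortened excerpt of that log (two JVMs instead of eight). It only matches entries of the shape shown here, so JVMs that are not running a task may need a different pattern.

```python
import re

# One JVM entry in the JvmManager dump. The field layout is an assumption
# inferred from the quoted FATAL line, not taken from the Hadoop source.
ENTRY = re.compile(
    r"JVMId\s+(?P<jvm>jvm_\d+_\d+_[mr]_-?\d+)\s+"
    r"#Tasks ran:\s+(?P<ran>\d+)\s+"
    r"Currently busy\?\s+(?P<busy>true|false)\s+"
    r"Currently running:\s+(?P<task>attempt_\d+_\d+_[mr]_\d+_\d+)"
)

def parse_jvm_dump(message):
    """Return one dict per JVM entry found in the message text."""
    return [
        {
            "jvm": m["jvm"],
            "tasks_ran": int(m["ran"]),
            "busy": m["busy"] == "true",
            "running": m["task"],
        }
        for m in ENTRY.finditer(message)
    ]

# Shortened excerpt of the log line from this report (two of the eight JVMs).
sample = (
    "JVMId jvm_201306080400_104517_m_1331138312 #Tasks ran: 0 "
    "Currently busy? true Currently running: attempt_201306080400_104517_m_000001_0 "
    "JVMId jvm_201306080400_104641_m_-1631395161 #Tasks ran: 0 "
    "Currently busy? true Currently running: attempt_201306080400_104641_m_000000_0. Aborting."
)

jvms = parse_jvm_dump(sample)

# Flag the pattern seen in this report: a JVM marked busy that has run no tasks yet.
busy_with_no_tasks = [j["jvm"] for j in jvms if j["busy"] and j["tasks_ran"] == 0]
print(len(jvms), "JVM entries,", len(busy_with_no_tasks), "busy with 0 tasks ran")
```

Running this over every "Inconsistent state" message in the TT logs would show whether the crash always coincides with all slots being busy and no JVM having completed a task, which might help narrow down the suspected race.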