Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
0.20.0
None
Sprint: Twitter Mesos Q4 Sprint 3
1
Description
RunState::recover() will return partial state if it cannot find or open the libprocess pid file. Specifically, it does not recover the 'completed' flag.
However, if the slave has removed the executor (because the launch failed or the executor failed to register), the completed sentinel file will have been written, and recover() should still recover that fact. Recovering the 'completed' flag ensures that container recovery is not attempted later.
This was discovered when the LinuxLauncher failed to recover because it was asked to recover two containers with the same forkedPid. Investigation showed that both executors OOM'ed before registering, so no libprocess pid file was present. However, the containerizer had detected the OOM, destroyed the container, and notified the slave, which cleaned everything up by failing the task and calling removeExecutor (which writes the completed sentinel file).
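The ordering problem described above can be illustrated with a small standalone sketch. This is not the Mesos implementation: the RunState struct, the recover() free function, and the file names forked.pid, libprocess.pid, and completed.sentinel are all hypothetical stand-ins chosen for illustration. The point it tries to show is that the sentinel is checked before any early return, so a missing libprocess pid file still yields completed == true.

```cpp
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>

namespace fs = std::filesystem;

// Hypothetical, simplified stand-in for the recovered run state.
struct RunState {
  std::optional<int> forkedPid;
  std::optional<std::string> libprocessPid;
  bool completed = false;
};

// Reads an integer pid from a file; nullopt if missing or unparsable.
std::optional<int> readPid(const fs::path& path) {
  std::ifstream in(path);
  int pid = 0;
  if (in >> pid) {
    return pid;
  }
  return std::nullopt;
}

// Reads the first line of a file; nullopt if the file cannot be opened.
std::optional<std::string> readLine(const fs::path& path) {
  std::ifstream in(path);
  if (!in) {
    return std::nullopt;
  }
  std::string line;
  std::getline(in, line);
  return line;
}

// Sketch of a recovery routine. File names are assumptions, not Mesos paths.
RunState recover(const fs::path& runDir) {
  RunState state;

  // Recover the sentinel first, so no early return below can skip it.
  state.completed = fs::exists(runDir / "completed.sentinel");

  state.forkedPid = readPid(runDir / "forked.pid");

  std::optional<std::string> libprocessPid =
      readLine(runDir / "libprocess.pid");
  if (!libprocessPid) {
    // The executor never registered. The state is partial, but
    // 'completed' has already been recovered above.
    return state;
  }

  state.libprocessPid = libprocessPid;
  return state;
}
```

With this ordering, a caller that skips runs whose completed flag is set would not hand two stale runs with the same forkedPid to a launcher for recovery.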