Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-2052

RunState::recover should always recover 'completed'

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.20.0
    • 0.21.0
    • agent, containerization
    • None
    • Twitter Mesos Q4 Sprint 3
    • 1

    Description

      RunState::recover() will return partial state if it cannot find or open the libprocess pid file. Specifically, it does not recover the 'completed' flag.

      However, if the slave has removed the executor (because launch failed or the executor failed to register) the sentinel flag will be set and this fact should be recovered. This ensures that container recovery is not attempted later.

      This was discovered when the LinuxLauncher failed to recover because it was asked to recover two containers with the same forkedPid. Investigation showed the executors both OOM'ed before registering, i.e., no libprocess pid file was present. However, the containerizer had detected the OOM, destroyed the container, and notified the slave which cleaned everything up: failing the task and calling removeExecutor (which writes the completed sentinel file.)

      Attachments

        Activity

          People

            idownes Ian Downes
            idownes Ian Downes
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: