MESOS-5332

TASK_LOST on slave restart potentially due to executor race condition


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 0.26.0
    • Fix Version/s: None
    • Component/s: agent, libprocess
    • Labels: None
    • Environment: Mesos 0.26, Aurora 0.13

    Description

      When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 20% to 50% of all tasks. They are killed by the Mesos agent via:

      I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered executors
      I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-28854-0-6a88d62e-6564-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:40541
      I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-23839-0-1d2cd0e6-6994-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:35757
      I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:51463
      ...
      I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
      

      We have verified that the tasks and their executors are killed by the agent during startup. When stopping the agent using supervisorctl stop, the executors keep running (verified via ps aux). They are only killed once the agent is restarted and they fail to re-register in time.

      The issue is hard to reproduce:

      • When restarting the agent binary multiple times, tasks are only lost for the first restart.
      • It is much more likely to occur if the agent binary has been running for a longer period of time (> 7 days).

      Mesos correctly honors the 2-second wait time before killing un-reregistered executors. The failed executors receive the reconnect request, but they apparently fail to send a re-registration reply.
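
      To make the mechanism concrete, here is a minimal sketch of the cleanup the agent performs once that window expires. It is illustrative only: the names (Executor, cleanupUnreregisteredExecutors) are made up and do not match the actual slave.cpp symbols, which rely on libprocess timers and per-framework executor state.

      // Simplified model of the agent's post-recovery cleanup, for illustration.
      #include <iostream>
      #include <map>
      #include <string>

      struct Executor {
        std::string id;
        bool reregistered;  // true once a re-registration reply was received
      };

      // Invoked once the 2-second window has expired: every executor that has
      // not re-registered is shut down and its tasks eventually become TASK_LOST.
      void cleanupUnreregisteredExecutors(
          const std::map<std::string, Executor>& executors) {
        std::cout << "Cleaning up un-reregistered executors\n";
        for (const auto& [id, executor] : executors) {
          if (!executor.reregistered) {
            std::cout << "Killing un-reregistered executor '" << id << "'\n";
            // ... destroy the executor's container here ...
          }
        }
      }

      int main() {
        // One executor replied within the window, one did not (as in our logs).
        std::map<std::string, Executor> executors = {
            {"executor-a", {"executor-a", true}},
            {"executor-b", {"executor-b", false}}};
        cleanupUnreregisteredExecutors(executors);
        return 0;
      }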

      A successful reregistration (not leading to LOST):

      I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
      I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
      I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 20160118-141153-92471562-5050-6270-S17
      I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
      

      A failed one:

      I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
      E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
      I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
      

      All tasks ending up LOST have output similar to the excerpt posted above, i.e. the log messages appear in the wrong order.
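
      Assuming the inverted ordering reflects a race in the executor driver between handling the agent's reconnect request and the teardown of the old agent socket, the following toy model reproduces the two orderings seen above. ExecutorDriver, onReconnect, and onAgentExited are made-up names for illustration and are not Mesos APIs.

      #include <iostream>
      #include <vector>

      enum class Event { RECONNECT_REQUEST, AGENT_EXITED };

      // Toy model: in the healthy case the "agent exited" notification arrives
      // before the reconnect request, so the re-registration reply goes out on
      // a freshly established connection.
      struct ExecutorDriver {
        bool oldConnectionGone = false;

        void onAgentExited() {
          oldConnectionGone = true;
          std::cout << "Slave exited, waiting to reconnect\n";
        }

        void onReconnect() {
          if (oldConnectionGone) {
            std::cout << "Reconnect request: re-registering over a new connection\n";
          } else {
            // The reply races with the teardown of the old socket and may be
            // lost, so the agent never sees the executor as re-registered.
            std::cout << "Reconnect request: replying over a stale connection!\n";
          }
        }
      };

      void run(const std::vector<Event>& events) {
        ExecutorDriver driver;
        for (Event event : events) {
          if (event == Event::AGENT_EXITED) {
            driver.onAgentExited();
          } else {
            driver.onReconnect();
          }
        }
        std::cout << "---\n";
      }

      int main() {
        // Ordering seen for executors that survive (successful excerpt above).
        run({Event::AGENT_EXITED, Event::RECONNECT_REQUEST});

        // Ordering seen for executors whose tasks end up LOST (failed excerpt above).
        run({Event::RECONNECT_REQUEST, Event::AGENT_EXITED});
        return 0;
      }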

      Does anyone have an idea what might be going on here?

      Attachments

        1. mesos-slave.log
          6.33 MB
          Stephan Erb
        2. executor-stderrV2.log
          4 kB
          Stephan Erb
        3. executor-stderr.log
          3 kB
          Stephan Erb
        4. executor-logs.tar.gz
          9 kB
          Stephan Erb


            People

              Assignee: Anand Mazumdar
              Reporter: Stephan Erb
              Votes: 0
              Watchers: 4
