Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Resolved
- Affects Version/s: 0.26.0
- Environment: Mesos 0.26, Aurora 0.13
Description
When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 20% to 50% of all tasks. They are killed by the Mesos agent via:
I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered executors
I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-28854-0-6a88d62e-6564-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:40541
I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-23839-0-1d2cd0e6-6994-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:35757
I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:51463
...
I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
We have verified that the tasks and their executors are killed by the agent during startup. When stopping the agent with supervisorctl stop, the executors keep running (verified via ps aux). They are only killed once the restarted agent asks them to reregister.
The issue is hard to reproduce:
- When restarting the agent binary multiple times, tasks are only lost on the first restart.
- It is much more likely to occur if the agent binary has been running for a longer period of time (> 7 days).
Mesos correctly honors the 2-second wait before killing un-reregistered executors. The failed executors do receive the reregistration request, but it seems they fail to send a reply.
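For context, this window presumably corresponds to the agent's --executor_reregistration_timeout flag, which defaults to 2secs. The following is a schematic sketch in plain C++ of the wait-then-kill behaviour visible in the agent log; it is illustrative only, not Mesos source, and the executor names are placeholders.

// Wait a fixed reregistration window, then treat every executor that has
// not answered as gone (its tasks become LOST).
#include <chrono>
#include <iostream>
#include <set>
#include <string>
#include <thread>

int main() {
  using namespace std::chrono_literals;

  // Executors recovered from checkpointed state on agent restart (placeholders).
  const std::set<std::string> recovered = {"executor-a", "executor-b"};

  // Executors that replied to the reconnect request within the window.
  std::set<std::string> reregistered;  // stays empty in this sketch

  // The reregistration window (2 seconds, as in the log above).
  std::this_thread::sleep_for(2s);

  // Anything still missing is cleaned up.
  for (const auto& executor : recovered) {
    if (reregistered.count(executor) == 0) {
      std::cout << "Killing un-reregistered executor '" << executor << "'\n";
    }
  }
  std::cout << "Finished recovery\n";
}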
A successful reregistration (not leading to LOST):
I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
A failed one:
I0505 08:42:04.779677 2389 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
E0505 08:42:05.481374 2408 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0505 08:42:05.481374 2395 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
All tasks ending up in LOST have output similar to the one posted above, i.e. the log messages appear in the wrong order.
Does anyone have an idea what might be going on here?
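Note: the "Transport endpoint is not connected" error hints that the executor replies over a half-open connection whose peer (the previous agent process) is already gone. The linked MESOS-5361 proposes introducing TCP keepalive for libprocess sockets so that such connections are detected; below is a minimal, illustrative POSIX/Linux sketch of that idea. It is not Mesos or libprocess code; the helper name and timeout values are made up.

// Enable TCP keepalive on a connected socket so a dead peer is eventually
// detected instead of the connection staying half-open indefinitely.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Returns 0 on success, -1 (with errno set) on failure.
int enableKeepalive(int fd) {
  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
    return -1;
  }

  // Linux-specific tuning: start probing after 30s of idle time,
  // probe every 10s, and give up after 3 failed probes.
  int idle = 30, interval = 10, count = 3;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0) {
    return -1;
  }

  return 0;
}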
Attachments
Issue Links
- is related to
  - MESOS-7569 Allow "old" executors with half-open connections to be preserved during agent upgrade / restart. (Resolved)
  - MESOS-5361 Consider introducing TCP KeepAlive for Libprocess sockets. (Accepted)
- relates to
  - MESOS-7057 Consider using the relink functionality of libprocess in the executor driver. (Resolved)
  - MESOS-5364 Consider adding `unlink` functionality to libprocess (Resolved)
  - MESOS-5361 Consider introducing TCP KeepAlive for Libprocess sockets. (Accepted)