Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Resolved
- Affects Version/s: 0.26.0
- Environment: Mesos 0.26, Aurora 0.13
Description
When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 20% to 50% of all tasks. They are killed by the Mesos agent via:
I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered executors
I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-28854-0-6a88d62e-6564-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:40541
I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-23839-0-1d2cd0e6-6994-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:35757
I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976-0000 at executor(1)@10.X.X.X:51463
...
I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
We have verified that the tasks and their executors are killed by the agent during startup. When stopping the agent with supervisorctl stop, the executors keep running (verified via ps aux). They are only killed once the restarted agent asks them to reregister.
The issue is hard to reproduce:
- When restarting the agent binary multiple times, tasks are only lost on the first restart.
- It is much more likely to occur if the agent binary has been running for a longer period of time (> 7 days).
Mesos correctly honors the 2-second wait before killing un-reregistered executors. The failed executors do receive the reregistration request, but it seems they fail to send a reply.
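For context, this window presumably corresponds to the agent's --executor_reregistration_timeout flag, which defaults to 2secs. The following is a schematic sketch in plain C++ of the wait-then-kill behaviour visible in the agent log; it is illustrative only, not Mesos source, and the executor names are placeholders.

// Wait a fixed reregistration window, then treat every executor that has
// not answered as gone (its tasks become LOST).
#include <chrono>
#include <iostream>
#include <set>
#include <string>
#include <thread>

int main() {
  using namespace std::chrono_literals;

  // Executors recovered from checkpointed state on agent restart (placeholders).
  const std::set<std::string> recovered = {"executor-a", "executor-b"};

  // Executors that replied to the reconnect request within the window.
  std::set<std::string> reregistered;  // stays empty in this sketch

  // The reregistration window (2 seconds, as in the log above).
  std::this_thread::sleep_for(2s);

  // Anything still missing is cleaned up.
  for (const auto& executor : recovered) {
    if (reregistered.count(executor) == 0) {
      std::cout << "Killing un-reregistered executor '" << executor << "'\n";
    }
  }
  std::cout << "Finished recovery\n";
}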
A successful reregistration (not leading to LOST):
I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
A failed one:
I0505 08:42:04.779677 2389 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17
E0505 08:42:05.481374 2408 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0505 08:42:05.481374 2395 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17
All tasks ending up in LOST have output similar to the one posted above, i.e. the log messages appear in the wrong order.
Does anyone have an idea what might be going on here?
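Note: the "Transport endpoint is not connected" error hints that the executor replies over a half-open connection whose peer (the previous agent process) is already gone. The linked MESOS-5361 proposes introducing TCP keepalive for libprocess sockets so that such connections are detected; below is a minimal, illustrative POSIX/Linux sketch of that idea. It is not Mesos or libprocess code; the helper name and timeout values are made up.

// Enable TCP keepalive on a connected socket so a dead peer is eventually
// detected instead of the connection staying half-open indefinitely.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Returns 0 on success, -1 (with errno set) on failure.
int enableKeepalive(int fd) {
  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
    return -1;
  }

  // Linux-specific tuning: start probing after 30s of idle time,
  // probe every 10s, and give up after 3 failed probes.
  int idle = 30, interval = 10, count = 3;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0) {
    return -1;
  }

  return 0;
}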
Attachments
Issue Links
- is related to
  - MESOS-7569 Allow "old" executors with half-open connections to be preserved during agent upgrade / restart. (Resolved)
  - MESOS-5361 Consider introducing TCP KeepAlive for Libprocess sockets. (Accepted)
- relates to
  - MESOS-7057 Consider using the relink functionality of libprocess in the executor driver. (Resolved)
  - MESOS-5364 Consider adding `unlink` functionality to libprocess (Resolved)
  - MESOS-5361 Consider introducing TCP KeepAlive for Libprocess sockets. (Accepted)