Details
- Bug
- Status: Accepted
- Major
- Resolution: Unresolved
- 0.24.0
- None
- 3
Description
The existing implementation of the scheduler driver does not re-register with the master under some network partition cases.
When a scheduler registers with the master:
1) master links to the framework
2) framework links to the master
It is possible for either of these links to break without the master changing. (Currently, the scheduler driver re-registers only if the master changes.)
If both links break or if just link (1) breaks, the master views the framework as inactive and disconnected. This means the framework will not receive any more events (such as offers) from the master until it re-registers. There is currently no way for the scheduler to detect a one-way link breakage.
If only link (2) breaks, it makes (almost) no difference to the scheduler. The scheduler usually uses this link to send messages to the master, but libprocess will create another socket if the persistent one is not available.
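The cases above can be summarized in a small self-contained model. This is only a sketch of the reasoning in this description: `Links`, `Outcome`, `classify` and `needsReregistration` are hypothetical names that do not exist in Mesos or libprocess, and the model assumes the master is otherwise reachable.

```cpp
#include <cassert>

// Hypothetical model of the two links; these names are illustrative only
// and do not exist in Mesos or libprocess.
struct Links {
  bool masterToFramework;  // true if link (1) is intact
  bool frameworkToMaster;  // true if link (2) is intact
};

struct Outcome {
  // false: the master treats the framework as inactive and disconnected,
  // so no events (such as offers) arrive until it re-registers.
  bool masterSeesFrameworkConnected;
  // true: messages from the scheduler to the master still go out.
  bool schedulerCanStillSend;
};

Outcome classify(const Links& links) {
  Outcome outcome;
  // The master's view of the framework depends only on link (1).
  outcome.masterSeesFrameworkConnected = links.masterToFramework;
  // Per the description, losing link (2) does not stop sending, because
  // libprocess opens another socket when the persistent one is unavailable.
  outcome.schedulerCanStillSend = true;
  return outcome;
}

// The framework has to re-register exactly when the master no longer
// considers it connected.
bool needsReregistration(const Links& links) {
  return !classify(links).masterSeesFrameworkConnected;
}

// Checks the three breakage cases discussed in the description.
void checkDescribedCases() {
  assert(needsReregistration(Links{false, false}));  // (1+2) broken
  assert(needsReregistration(Links{false, true}));   // only (1) broken
  assert(!needsReregistration(Links{true, false}));  // only (2) broken
}
```

Calling `checkDescribedCases()` exercises the three cases: only a break that includes link (1) leaves the framework cut off from events.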
To fix link breakages for (1+2) and (2), the scheduler driver should implement an `::exited` event handler for the master's pid and trigger a master (re-)detection upon disconnection. This in turn should make the driver (re-)register with the master. The scheduler library already does this: https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L395
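The proposed behavior could look roughly like the following self-contained sketch. It is not the actual driver or scheduler library code: `SchedulerDriverSketch`, `Pid`, the `detect` callback and the registration counter are hypothetical stand-ins, and the libprocess machinery (linking, message passing, the master detector) is replaced by plain function calls.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Stand-in for a libprocess PID (process::UPID in the real code base).
typedef std::string Pid;

// Hypothetical, simplified driver; not the actual scheduler driver code.
class SchedulerDriverSketch {
public:
  // `detect` stands in for the master detector and returns the current
  // leading master, or an empty string if none is known.
  explicit SchedulerDriverSketch(std::function<Pid()> detect)
    : detect_(std::move(detect)), connected_(false), registrations_(0) {}

  // Called whenever detection completes, even if the master is unchanged.
  void detected(const Pid& master) {
    master_ = master;
    connected_ = false;
    if (!master_.empty()) {
      reregister();
    }
  }

  // Analogue of an `exited` handler: called when the link to `pid` breaks.
  void exited(const Pid& pid) {
    if (master_.empty() || pid != master_) {
      return;  // Not the master we are linked to; nothing to do.
    }
    connected_ = false;
    detected(detect_());  // Force a (re-)detection, then (re-)register.
  }

  bool connected() const { return connected_; }
  int registrations() const { return registrations_; }

private:
  void reregister() {
    assert(!master_.empty());
    // A real driver would send a (re-)registration message and wait for
    // the master's acknowledgement; this sketch only records the attempt.
    ++registrations_;
    connected_ = true;
  }

  std::function<Pid()> detect_;
  Pid master_;
  bool connected_;
  int registrations_;
};
```

The point of the sketch is that `exited` triggers detection even when the detector returns the same master as before, which is the case the current driver misses because it re-registers only on a master change.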
See the related issue MESOS-5181 for link (1) breakage.
Issue Links
- is related to
  - MESOS-5181 Master should reject calls from the scheduler driver if the scheduler is not connected. (Resolved)
- relates to
  - MESOS-2352 Scheduler::disconnected() should be called when the single master fails (Open)
  - MESOS-6676 Always re-link with scheduler during re-registration. (Resolved)
  - MESOS-5361 Consider introducing TCP KeepAlive for Libprocess sockets. (Accepted)
- supersedes
  - MESOS-887 Scheduler driver should use exited() to detect disconnection with Master. (Open)