Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Version: 0.24.0
Sprint: Mesosphere Sprint 34
1
Description
When a scheduler registers, the master creates a link from the master to the scheduler. If this link breaks, the master considers the scheduler inactive and marks it as disconnected.
This causes a couple of problems:
1) The master does not send offers to inactive schedulers, but these schedulers might still consider themselves "registered" in a one-way network partition scenario.
2) Any calls from the inactive scheduler are still accepted, which leaves the scheduler in a starved but semi-functional state.
See the related issue for more context: MESOS-5180
There should be an additional guard for registered, but inactive schedulers here:
https://github.com/apache/mesos/blob/94f4f4ebb7d491ec6da1473b619600332981dd8e/src/master/master.cpp#L1977
The HTTP API already does this:
https://github.com/apache/mesos/blob/94f4f4ebb7d491ec6da1473b619600332981dd8e/src/master/http.cpp#L459
Since the scheduler driver cannot return a 403, it may be necessary to return an Event::ERROR and force the scheduler to abort.
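As a rough illustration of the proposed guard, the sketch below rejects calls from a framework that is registered but whose link has broken. It is not the actual master.cpp code: `Framework`, `ErrorEvent`, `validateCall`, and the error messages are hypothetical stand-ins, and the `connected` flag is an assumed name for the state the master tracks when the link breaks. It assumes C++17 for `std::optional`.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the master's view of a framework.
// These are NOT the real Mesos types.
struct Framework {
  bool registered = false;  // the master knows about this framework
  bool connected = false;   // the master-to-scheduler link is intact
};

// Hypothetical model of an Event::ERROR sent back to the scheduler.
struct ErrorEvent {
  std::string message;
};

// Returns an error event if the call must be rejected,
// or std::nullopt if the call may proceed.
std::optional<ErrorEvent> validateCall(const Framework& framework) {
  if (!framework.registered) {
    return ErrorEvent{"Framework is not registered"};
  }

  // The additional guard: a registered framework whose link has broken
  // gets an error instead of having its call silently accepted.
  if (!framework.connected) {
    return ErrorEvent{"Framework is disconnected"};
  }

  return std::nullopt;
}
```

With a guard along these lines, a scheduler on the wrong side of a one-way partition would receive an error on its next call and could abort or reregister, instead of continuing to run without offers.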
Issue Links
- relates to MESOS-5180 Scheduler driver does not detect disconnection with master and reregister. (Accepted)