[MESOS-1529] Handle a network partition between Master and Slave - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.20.0
Component/s: None
Labels: None

Sprint: Q3 Sprint 1, Q3 Sprint 2
Story Points: 5

Description

If a network partition occurs between a Master and a Slave, the Master removes the Slave (because it fails the health check) and marks the tasks running there as LOST. However, the Slave is not aware that it has been removed, so the tasks continue to run.

(To clarify: neither the Master nor the Slave receives an 'exited' event, which indicates that the connection between them has not been closed.)
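The divergence described above can be illustrated with a toy model. This is only a sketch, not Mesos code: `ToyMaster`, `ToySlave`, `simulate_partition`, and the `max_missed_pings` threshold are hypothetical names and values chosen for illustration. It shows the Master marking tasks LOST after repeated missed pings while the Slave, which has no equivalent check, keeps running them.

```python
from dataclasses import dataclass, field


@dataclass
class ToySlave:
    """Hypothetical slave: runs tasks and is never told it was removed."""
    tasks: set = field(default_factory=set)

    def running_tasks(self):
        return set(self.tasks)


@dataclass
class ToyMaster:
    """Hypothetical master: removes a slave after too many missed pings."""
    max_missed_pings: int = 3  # assumed threshold, not a real Mesos default
    missed: int = 0
    registered: bool = True
    task_states: dict = field(default_factory=dict)

    def on_ping_result(self, got_pong):
        if not self.registered:
            return
        # A successful pong resets the counter; a miss increments it.
        self.missed = 0 if got_pong else self.missed + 1
        if self.missed >= self.max_missed_pings:
            # Health check failed: remove the slave, mark its tasks LOST.
            self.registered = False
            for task in list(self.task_states):
                self.task_states[task] = "LOST"


def simulate_partition(pings_dropped):
    """Drop the given number of consecutive pings and return both views."""
    slave = ToySlave(tasks={"task-1", "task-2"})
    master = ToyMaster(task_states={t: "RUNNING" for t in slave.tasks})
    for _ in range(pings_dropped):
        master.on_ping_result(got_pong=False)
    return master, slave
```

After enough dropped pings, the Master's view (`task_states`) and the Slave's view (`running_tasks()`) disagree, which is the inconsistency this issue reports.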

There are at least two possible approaches to solving this issue:

1. Introduce a health check from Slave to Master so that both sides have a consistent view of a network partition. We may still see this issue if a one-way connection error occurs.

2. Be less aggressive about marking tasks and Slaves as lost. Wait until the Slave reappears and reconcile then. We would still need to mark Slaves and tasks as potentially lost (a zombie state), but the Scheduler might then be able to make a more intelligent decision.
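As a rough illustration of the second approach, the sketch below models a Master that parks tasks in an intermediate state instead of declaring them lost, then reconciles when the Slave reappears. It is a sketch under stated assumptions, not the fix that was committed: `LenientMaster`, the `POTENTIALLY_LOST` state name, and both handler methods are hypothetical.

```python
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    POTENTIALLY_LOST = "POTENTIALLY_LOST"  # the "zombie" state; hypothetical name
    LOST = "LOST"


class LenientMaster:
    """Hypothetical master that defers the LOST decision until reconciliation."""

    def __init__(self, tasks):
        self.slave_reachable = True
        self.tasks = {t: State.RUNNING for t in tasks}

    def on_health_check_failed(self):
        # Do not remove the slave; only flag its tasks as potentially lost.
        self.slave_reachable = False
        for task in list(self.tasks):
            if self.tasks[task] is State.RUNNING:
                self.tasks[task] = State.POTENTIALLY_LOST

    def on_slave_reappeared(self, reported_running):
        # Reconcile against what the slave says it is still running.
        self.slave_reachable = True
        for task in list(self.tasks):
            if task in reported_running:
                self.tasks[task] = State.RUNNING
            else:
                self.tasks[task] = State.LOST
```

While a task is `POTENTIALLY_LOST`, a Scheduler could choose to wait or to launch a replacement; only reconciliation settles whether the task actually survived the partition.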

Issue Links

is related to

MESOS-1668 Handle a temporary one-way master --> slave socket closure. (Resolved)

MESOS-1879 Handle a temporary one-way slave --> master socket closure. (Accepted)

MESOS-2110 Configurable Ping Timeouts (Resolved)

People

Assignee: Benjamin Mahler

Reporter: Dominic Hamon

Votes: 0

Watchers: 10

Dates

Created: 22/Jun/14 23:42

Updated: 08/Jan/15 07:06

Resolved: 04/Aug/14 22:04