[MESOS-1973] Slaves DoS master on re-registration - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.20.1
Fix Version/s: 0.21.0
Component/s: master
Labels: None
Target Version/s: 0.22.0

Description

Recently we've noticed a problem where, after the master fails over, the newly elected master gets DoS'd by the slaves during re-registration. This is caused by a flood of "Possibly orphaned completed task ..." log messages in the master.

After several hundred of these re-registrations, the master's memory usage balloons and the process gets OOM-killed by the OS.

The temporary workaround is to stop all the slaves, let a master get elected as leader, and then do a slow rolling restart of the slaves (i.e., start one slave every 500ms).
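A minimal sketch of the paced restart described above, assuming the operator already has some way to start a slave on a given host. The names `rolling_start` and `start_slave` are hypothetical and not part of Mesos; `start_slave` stands in for whatever mechanism (SSH, init system, orchestration tool) is actually used.

```python
import time


def rolling_start(slaves, start_slave, interval=0.5, sleep=time.sleep):
    """Start each slave in order, pausing `interval` seconds between starts.

    `start_slave` is a caller-supplied (hypothetical) callable that starts
    one slave; `interval` defaults to the 500ms mentioned in the report.
    """
    started = []
    for i, slave in enumerate(slaves):
        if i > 0:
            # Pace the starts so re-registrations trickle into the master.
            sleep(interval)
        start_slave(slave)
        started.append(slave)
    return started
```

Injecting `sleep` keeps the pacing logic easy to check without waiting in real time.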

A possible fix is to add exponential backoff to slave re-registration.
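To illustrate the idea, here is a rough sketch of exponential backoff with random jitter. It is not taken from the Mesos code base: the function names, the default parameters, and the use of "full jitter" (a uniform draw below a doubling ceiling) are all assumptions chosen for the example.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6, rng=None):
    """Yield one randomized delay (in seconds) per re-registration attempt.

    The ceiling starts at `base`, is multiplied by `factor` after each
    attempt, and never exceeds `cap`. All defaults are illustrative.
    """
    rng = rng or random.Random()
    ceiling = base
    for _ in range(attempts):
        # Jitter: draw uniformly in [0, ceiling] so slaves do not retry in lockstep.
        yield rng.uniform(0.0, ceiling)
        ceiling = min(cap, ceiling * factor)


def reregister_with_backoff(try_reregister, sleep, **kwargs):
    """Sleep, then call try_reregister(); repeat until it returns True.

    `try_reregister` is a hypothetical callable standing in for the slave's
    re-registration request. Returns False if every attempt fails.
    """
    for delay in backoff_delays(**kwargs):
        sleep(delay)
        if try_reregister():
            return True
    return False
```

Because each slave draws its own random delay, a cluster-wide failover spreads re-registrations over a growing window instead of delivering them to the new master all at once.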

Attachments

master-fail.log.gz (217 kB), attached by Brenden Matthews, 23/Oct/14 18:04

People

Assignee: Unassigned

Reporter: Brenden Matthews

Votes: 0

Watchers: 4

Dates

Created: 23/Oct/14 18:03

Updated: 21/Jan/15 22:07

Resolved: 21/Jan/15 22:07