Mesos / MESOS-1973

Slaves DoS master on re-registration

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: master
    • Labels: None

    Description

      Recently we noticed a problem: when the master fails over, it gets DoS'd by the slaves during re-registration. This is caused by a flood of "Possibly orphaned completed task ..." log messages in the master.

      After several hundred of these re-registrations, the master's memory usage balloons and the OS OOM-kills it.

      The temporary workaround is to stop all the slaves, let a master get elected as leader, and then do a slow rolling restart of the slaves (i.e., start one slave every 500 ms).
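
The rolling restart could be scripted roughly as follows. This is only a sketch: `start_slave` is a hypothetical callback standing in for however slaves are actually started in a given deployment (ssh, an init system, and so on) and is not part of Mesos. The 0.5 second default matches the interval mentioned above.

```python
import time
from typing import Callable, Iterable, List


def rolling_start(
    hosts: Iterable[str],
    start_slave: Callable[[str], None],
    interval_secs: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start slaves one at a time, pausing between consecutive starts.

    `start_slave` is a hypothetical hook supplied by the caller.
    Returns the hosts in the order they were started.
    """
    started: List[str] = []
    for i, host in enumerate(hosts):
        if i > 0:
            # Space out starts so re-registrations reach the master gradually.
            sleep(interval_secs)
        start_slave(host)
        started.append(host)
    return started
```

The `sleep` parameter is injectable only so the pacing can be exercised without real delays.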

      A possible fix is to add exponential backoff to slave re-registration.
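
To illustrate what exponential backoff could look like, here is a minimal sketch of a delay calculation with random jitter. It is not the change that was made in Mesos; the function name, the base and maximum delays, and the use of a uniformly random delay below the exponential ceiling are all assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base_secs: float = 1.0,
    max_secs: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a randomized delay (seconds) before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `max_secs`; the delay is
    drawn uniformly from [0, ceiling] so that slaves do not retry in lockstep.
    All defaults here are illustrative assumptions.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if rng is None:
        rng = random.Random()
    # Cap the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(max_secs, base_secs * (2 ** min(attempt, 32)))
    return rng.uniform(0.0, ceiling)
```

Randomizing the delay matters as much as growing it: if every slave backed off by the same fixed schedule, they would still hit the master at the same moments.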

      Attachments

        1. master-fail.log.gz
          217 kB
          Brenden Matthews

          People

            Assignee: Unassigned
            Reporter: Brenden Matthews (brenden)
            Votes: 0
            Watchers: 4