Description
Recently we've noticed a problem where, after a master failover, the newly elected master gets DoS'd by the slaves as they all re-register at once. This is caused by a flood of "Possibly orphaned completed task ..." log messages in the master.
After several hundred of these re-registrations, the master's memory usage balloons and the process gets OOM-killed by the OS.
The temporary fix is to stop all the slaves, let a master get elected as leader, and then do a slow rolling restart of the slaves (i.e., start one slave every 500ms).
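For illustration, the rolling restart could be scripted roughly as below. This is only a sketch: `rolling_start` and the `start_slave` callback are hypothetical names, and how a slave is actually started (ssh, init script, etc.) depends on the deployment and is left to the caller.

```python
import time


def rolling_start(slaves, start_slave, interval=0.5, sleep=time.sleep):
    """Start slaves one at a time, pausing `interval` seconds between starts.

    `start_slave` is a caller-supplied (hypothetical) function that starts
    a single slave; `sleep` is injectable so the pacing can be tested.
    """
    for index, slave in enumerate(slaves):
        if index > 0:
            sleep(interval)  # 500ms gap, as in the workaround above
        start_slave(slave)
```

The slaves should only be started this way once a master has been elected as leader.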
The fix might be to include an exponential backoff during slave re-registration.
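A minimal sketch of what such a backoff could look like, written in Python for brevity rather than as a patch against the slave code. All names and default values here (`backoff_delays`, `reregister`, `try_register`, the 0.5s base and 60s cap) are assumptions for illustration. The random jitter is an addition beyond plain exponential backoff: without it, slaves that lost the master at the same moment would also retry at the same moments.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=60.0, attempts=6, rng=random.random):
    """Yield a randomized delay (in seconds) for each re-registration attempt.

    The upper bound grows as base * factor**attempt, limited to `cap`.
    Each delay is drawn uniformly from [0, bound) so slaves spread out.
    """
    for attempt in range(attempts):
        bound = min(cap, base * factor ** attempt)
        yield rng() * bound


def reregister(try_register, sleep=time.sleep, **backoff_options):
    """Call `try_register` until it succeeds or the attempts run out.

    `try_register` is a hypothetical callable that returns True once the
    master has acknowledged the re-registration.
    """
    for delay in backoff_delays(**backoff_options):
        sleep(delay)  # wait before every attempt, including the first
        if try_register():
            return True
    return False
```

Waiting before the first attempt as well means that even the initial wave of re-registrations after a failover is spread over the base interval instead of arriving all at once.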