[MESOS-6285] Agents may OOM during recovery if there are too many tasks or executors - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Abandoned
Affects Version/s: 1.0.1
Fix Version/s: None
Component/s: None
Labels:
- foundations
- mesosphere

Description

On a test cluster, we encountered a degenerate case where running the example long-lived-framework for over a week would render the agent unrecoverable.

The long-lived-framework creates one custom long-lived-executor and launches a single task on that executor every time it receives an offer from that agent. Over the course of a week, the framework launches some 400k tasks (short sleeps) on that one executor. At runtime this is not a problem, because each completed task is quickly rotated out of the agent's memory (and checkpointed to disk).

During recovery, however, the agent reads every single task into memory. This makes recovery slow and often results in the agent being OOM-killed before it finishes recovering.
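To make the memory behavior concrete, here is a minimal sketch in C++. It is not the agent's actual recovery code: `CheckpointedTask`, `recoverAll`, `recoverLive`, and `makeCheckpoint` are hypothetical names invented for illustration, and the record layout is not the Mesos checkpoint format. It only contrasts holding every checkpointed task in memory with holding only tasks that have not yet finished.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one checkpointed task record.
// NOT the Mesos checkpoint format; it only models "a task with
// an ID and a flag saying whether it has already finished".
struct CheckpointedTask
{
  std::string id;
  bool terminal; // true if the task already completed
};

// Models the behavior described in this ticket: every checkpointed
// task ends up in memory, so the recovered state grows with the
// total number of tasks ever launched on the executor.
std::vector<CheckpointedTask> recoverAll(
    const std::vector<CheckpointedTask>& checkpoint)
{
  std::vector<CheckpointedTask> recovered;
  recovered.reserve(checkpoint.size());
  for (const CheckpointedTask& task : checkpoint) {
    recovered.push_back(task);
  }
  return recovered;
}

// Illustrative alternative (an assumption, not an existing agent
// feature): keep only tasks that are still live. In this sketch the
// input is already in memory; a real implementation would have to
// make this decision while reading from disk to save any memory.
std::vector<CheckpointedTask> recoverLive(
    const std::vector<CheckpointedTask>& checkpoint)
{
  std::vector<CheckpointedTask> recovered;
  for (const CheckpointedTask& task : checkpoint) {
    if (!task.terminal) {
      recovered.push_back(task);
    }
  }
  return recovered;
}

// Builds a synthetic checkpoint with `completed` finished tasks
// followed by `live` unfinished ones, with sequential numeric IDs.
std::vector<CheckpointedTask> makeCheckpoint(
    std::size_t completed, std::size_t live)
{
  std::vector<CheckpointedTask> checkpoint;
  checkpoint.reserve(completed + live);
  for (std::size_t i = 0; i < completed + live; ++i) {
    checkpoint.push_back({std::to_string(i), i < completed});
  }
  assert(checkpoint.size() == completed + live);
  return checkpoint;
}
```

With `makeCheckpoint(400000, 1)`, `recoverAll` returns 400001 records while `recoverLive` returns one, which is the gap between what the agent needs at runtime and what it loads during recovery in this scenario.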

To repro this condition quickly:
1) Apply this patch to the long-lived-framework:

diff --git a/src/examples/long_lived_framework.cpp b/src/examples/long_lived_framework.cpp
index 7c57eb5..1263d82 100644
--- a/src/examples/long_lived_framework.cpp
+++ b/src/examples/long_lived_framework.cpp
@@ -358,16 +358,6 @@ private:
   // Helper to launch a task using an offer.
   void launch(const Offer& offer)
   {
-    int taskId = tasksLaunched++;
-    ++metrics.tasks_launched;
-
-    TaskInfo task;
-    task.set_name("Task " + stringify(taskId));
-    task.mutable_task_id()->set_value(stringify(taskId));
-    task.mutable_agent_id()->MergeFrom(offer.agent_id());
-    task.mutable_resources()->CopyFrom(taskResources);
-    task.mutable_executor()->CopyFrom(executor);
-
     Call call;
     call.set_type(Call::ACCEPT);
 
@@ -380,7 +370,23 @@ private:
     Offer::Operation* operation = accept->add_operations();
     operation->set_type(Offer::Operation::LAUNCH);
 
-    operation->mutable_launch()->add_task_infos()->CopyFrom(task);
+    // Launch as many tasks as possible in the given offer.
+    Resources remaining = Resources(offer.resources()).flatten();
+    while (remaining.contains(taskResources)) {
+      int taskId = tasksLaunched++;
+      ++metrics.tasks_launched;
+
+      TaskInfo task;
+      task.set_name("Task " + stringify(taskId));
+      task.mutable_task_id()->set_value(stringify(taskId));
+      task.mutable_agent_id()->MergeFrom(offer.agent_id());
+      task.mutable_resources()->CopyFrom(taskResources);
+      task.mutable_executor()->CopyFrom(executor);
+
+      operation->mutable_launch()->add_task_infos()->CopyFrom(task);
+
+      remaining -= taskResources;
+    }
 
     mesos->send(call);
   }

2) Run a master, agent, and long-lived-framework. On a 1 CPU, 1 GB agent with this patch applied, it should take about 10 minutes to build up sufficient task launches.

3) Restart the agent and watch it flail during recovery.
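The patch in step 1 speeds up the repro by launching as many tasks as fit in each offer instead of one. The packing loop can be sketched on its own as follows. This is a simplified model, not Mesos code: the `Resources` struct here is a hypothetical two-field stand-in for the real `Resources` class, and the resource amounts used with it are made-up values, not the framework's actual task size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for Mesos' Resources class:
// just CPU (in thousandths of a core) and memory (in MB).
struct Resources
{
  std::int64_t milliCpus;
  std::int64_t memMb;

  // True if `other` fits entirely within these resources.
  bool contains(const Resources& other) const
  {
    return milliCpus >= other.milliCpus && memMb >= other.memMb;
  }

  // Subtracts `other`; the caller must check contains() first.
  Resources& operator-=(const Resources& other)
  {
    assert(contains(other));
    milliCpus -= other.milliCpus;
    memMb -= other.memMb;
    return *this;
  }
};

// Mirrors the loop added by the patch: keep carving task-sized
// slices out of the offer until the next one no longer fits.
// Returns how many tasks a single offer would yield.
int countLaunchableTasks(Resources remaining, const Resources& taskResources)
{
  // A task that needs nothing would never exhaust the offer.
  assert(taskResources.milliCpus > 0 || taskResources.memMb > 0);

  int launched = 0;
  while (remaining.contains(taskResources)) {
    ++launched;
    remaining -= taskResources;
  }
  return launched;
}
```

For example, under these assumed values, an offer of `{1000, 1024}` and a task size of `{1, 32}` yields 32 tasks per offer, since memory runs out first. The unpatched framework launches exactly one task per offer regardless of how much is offered.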

Issue Links

is related to
- MESOS-790 Make recovering frameworks in the Slave asynchronous. (Open)
- MESOS-8889 Support task.sentinel to avoid recovering finished tasks. (Open)

relates to
- MESOS-6286 Master does not remove an agent if it is responsive but not registered (Resolved)
- MESOS-7947 Add GC capability to nested containers (Resolved)

People

Assignee: Unassigned
Reporter: Joseph Wu
Votes: 0
Watchers: 6

Dates

Created: 29/Sep/16 20:18
Updated: 16/Aug/21 07:40
Resolved: 16/Aug/21 07:40