Description
In a Mesos cluster we observed the following:
$ jq '.orphan_tasks | length' state.json
1369
$ jq '.unregistered_frameworks | length' state.json
20162
Aside from unregistered_frameworks here being "the list of frameworkIDs for each orphan task" (described in MESOS-4973), the discrepancy between the two values above is surprising.
I think the problem is that we do this in the master:
From source:
foreachvalue (Slave* slave, slaves.registered) {
  foreachvalue (Task* task, slave->tasks[framework->id()]) {
    framework->addTask(task);
  }
  foreachvalue (const ExecutorInfo& executor,
                slave->executors[framework->id()]) {
    framework->addExecutor(slave->id, executor);
  }
}
Here operator[] is used whenever a framework subscribes, regardless of whether the agent has any tasks for that framework.
If the agent has no tasks for the framework, operator[] inserts a {frameworkID: empty hashmap} entry that stays in the map indefinitely! If frameworks are ephemeral and new ones keep coming in, the map grows without bound.
We should do tasks.contains(frameworkId) before using the [] operator.
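To illustrate the difference, here is a minimal standalone sketch, not the actual master code. It assumes the per-agent map behaves like std::unordered_map; Agent, TaskMap, countTasksUnguarded and countTasksGuarded are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the master's per-agent bookkeeping.
using FrameworkID = std::string;
using TaskMap = std::unordered_map<std::string, int>;  // task id -> dummy task

struct Agent {
  std::unordered_map<FrameworkID, TaskMap> tasks;
};

// Mirrors the pattern quoted above: operator[] default-constructs and
// inserts an empty TaskMap when the key is absent.
std::size_t countTasksUnguarded(Agent& agent, const FrameworkID& id) {
  std::size_t n = 0;
  for (const auto& entry : agent.tasks[id]) {
    (void)entry;
    ++n;
  }
  // Side effect: the key exists now, even if the framework had no tasks here.
  assert(agent.tasks.count(id) == 1);
  return n;
}

// Guarded variant: find() never inserts, so the map is left untouched
// when the agent has nothing for this framework.
std::size_t countTasksGuarded(const Agent& agent, const FrameworkID& id) {
  auto it = agent.tasks.find(id);
  if (it == agent.tasks.end()) {
    return 0;
  }
  return it->second.size();
}
```

With the guarded variant, subscribing any number of short-lived frameworks leaves the agent's map at its original size; with the unguarded one, every new framework ID adds one empty entry per agent.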
Issue Links
- breaks
  - MESOS-6482 Master check failure when marking an agent unreachable (Resolved)
- is related to
  - MESOS-4973 Duplicates in 'unregistered_frameworks' in /state (Resolved)