Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.6.0
Fix Version/s: None
Description
Some early task failures are not propagated to the framework. Here is an example of a Marathon pod (Mesos containerizer) definition with a non-existent image:
{ "id": "/fail", "containers": [ { "name": "container-1", "resources": { "cpus": 0.1, "mem": 128 }, "image": { "id": "non-existing-image-56789", "kind": "DOCKER" } } ], "scaling": { "instances": 1, "kind": "fixed" }, "networks": [ { "mode": "host" } ], "volumes": [], "fetch": [], "scheduling": { "placement": { "constraints": [] } } }
Here, the status update the framework receives is a generic TASK_FAILED (Executor terminated); nothing in it points at the missing image.
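For illustration, here is a minimal sketch, assuming the C++ protobuf accessors generated from mesos.proto, of everything a framework can inspect on such an update (the helper name is hypothetical, not part of Mesos):

#include <iostream>

#include <mesos/mesos.hpp>  // TaskStatus, TaskState (generated from mesos.proto)

// Hypothetical helper: print what a framework learns from a status update.
// For both pods in this report, `state` is TASK_FAILED and `message` is just
// "Executor terminated"; neither the missing image nor the failed fetch is
// mentioned anywhere.
void logStatusUpdate(const mesos::TaskStatus& status)
{
  std::cout << "task:    " << status.task_id().value() << std::endl
            << "state:   " << mesos::TaskState_Name(status.state()) << std::endl
            << "reason:  " << mesos::TaskStatus::Reason_Name(status.reason())
            << std::endl
            << "message: " << status.message() << std::endl;
}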
Here is another example, in which a non-existent artifact is fetched:
{ "id": "/fail2", "containers": [ { "name": "container-1", "resources": { "cpus": 0.1, "mem": 128 }, "image": { "id": "nginx", "kind": "DOCKER", "forcePull": false }, "artifacts": [ { "uri": "http://example.com/smth-non-existing-12345.tar.gz" } ] } ], "scaling": { "instances": 1, "kind": "fixed" }, "networks": [ { "mode": "host" } ], "volumes": [], "fetch": [], "scheduling": { "placement": { "constraints": [] } } }
This results in the same status update as above.
This is not an exhaustive list of such cases; I'm sure there are more failures along the fork chain that are not properly propagated.
Frameworks (and their users) should always receive meaningful task failure reasons, no matter where those failures occurred. Otherwise, the only way to find out what happened is to grep the agent logs.
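The TaskStatus protobuf already has expressive enough fields to carry the root cause. A sketch of the kind of update the fetcher failure could produce instead; the function and the message text are hypothetical, Mesos does not currently send this:

#include <mesos/mesos.hpp>

// Sketch only: the kind of status update the second example above should
// produce. REASON_CONTAINER_LAUNCH_FAILED already exists in the
// TaskStatus::Reason enum, so a machine-readable reason plus a descriptive
// message could replace the bare "Executor terminated".
mesos::TaskStatus meaningfulFetchFailure(const mesos::TaskID& taskId)
{
  mesos::TaskStatus status;
  status.mutable_task_id()->CopyFrom(taskId);
  status.set_state(mesos::TASK_FAILED);
  status.set_reason(mesos::TaskStatus::REASON_CONTAINER_LAUNCH_FAILED);
  status.set_message(
      "Failed to fetch 'http://example.com/smth-non-existing-12345.tar.gz'");
  return status;
}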