Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
There are several places where exceptions can reach the Dispatcher, which can cause a restart. Some of these may be valid and need to be evaluated.
For example, TaskCommunicatorManager tracks known containers and similar state. On an error it throws an unchecked exception, which I believe will reach the Dispatcher directly. (Something like this happening would indicate a bug in the framework.) Should this trigger a restart of the AM, or a shutdown with an internal error?
The TaskSchedulerManager handles exceptions while processing events and dispatches a generic INTERNAL_ERROR to the DAGAppMaster. This can be augmented with the reason for the error so that diagnostics are displayed correctly (or at least posted to the history service).
Also, what should be done when an exception does reach the Dispatcher?
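To make the TaskSchedulerManager suggestion concrete, here is a rough sketch of an internal-error event that carries the reason for the failure. It is not the actual Tez code: InternalErrorEvent, ErrorSink, describe and processSafely are hypothetical names made up for illustration, and the real event and handler types in the AM would differ.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

public class InternalErrorSketch {

    /** Hypothetical event: an internal error plus the diagnostics behind it. */
    public static final class InternalErrorEvent {
        public final String source;       // component that caught the exception
        public final String diagnostics;  // exception message and stack trace

        public InternalErrorEvent(String source, String diagnostics) {
            this.source = source;
            this.diagnostics = diagnostics;
        }
    }

    /** Hypothetical stand-in for whatever receives the error in the AM. */
    public interface ErrorSink {
        void handle(InternalErrorEvent event);
    }

    /** Renders the exception, including its stack trace, as a string. */
    public static String describe(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    /**
     * Runs the event-processing work. An unchecked exception is converted
     * into an error event with diagnostics instead of escaping to the caller.
     */
    public static void processSafely(String source, Runnable work, ErrorSink sink) {
        try {
            work.run();
        } catch (RuntimeException e) {
            sink.handle(new InternalErrorEvent(source, describe(e)));
        }
    }

    public static void main(String[] args) {
        List<InternalErrorEvent> received = new ArrayList<>();
        processSafely("TaskSchedulerManager",
            () -> { throw new IllegalStateException("unknown container"); },
            received::add);
        for (InternalErrorEvent event : received) {
            System.out.println(event.source + " failed: " + event.diagnostics);
        }
    }
}
```

With something along these lines, the receiver of the error has the originating component and the stack trace available, so it could surface them in diagnostics or post them to the history service before deciding between a restart and a shutdown.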