Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- 1.5.0
Description
The ResourceManager often learns about TaskManager failures or kills sooner than the JobManager does, because it communicates directly with the underlying resource management framework. Instead of relying only on the JobManager's heartbeat to detect that a TaskManager has died, the ResourceManager should additionally send a signal to the JobManager when a TaskManager dies. That way, we can react faster to TaskManager failures and recover the running job(s).
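
The proposed flow can be illustrated with a minimal sketch. It is not Flink's actual implementation: every class and method name below (`JobManagerListener`, `ToyJobManager`, `ToyResourceManager`, `notifyTaskManagerTerminated`, `onContainerCompleted`) is hypothetical and chosen only for illustration. The sketch assumes the JobManager treats the ResourceManager signal and the heartbeat timeout as two paths into the same failure handler, so whichever report arrives first triggers recovery and the later one is ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TaskManagerFailureNotification {

    /** Hypothetical callback exposed by the JobManager side; not Flink's real gateway interface. */
    interface JobManagerListener {
        void notifyTaskManagerTerminated(String taskManagerId, String cause);
    }

    /** Toy JobManager: tracks live TaskManagers and records which path reported each failure first. */
    static class ToyJobManager implements JobManagerListener {
        final Set<String> liveTaskManagers = new HashSet<>();
        final Map<String, String> failureCauses = new HashMap<>();

        void registerTaskManager(String taskManagerId) {
            liveTaskManagers.add(taskManagerId);
        }

        /** Existing path: the JobManager's own heartbeat to the TaskManager timed out. */
        void onHeartbeatTimeout(String taskManagerId) {
            handleFailure(taskManagerId, "heartbeat timeout");
        }

        /** Additional path: the ResourceManager signals that the TaskManager is gone. */
        @Override
        public void notifyTaskManagerTerminated(String taskManagerId, String cause) {
            handleFailure(taskManagerId, cause);
        }

        private void handleFailure(String taskManagerId, String cause) {
            // Only the first report for a TaskManager is recorded; a later report
            // for the same TaskManager finds it already removed and does nothing.
            if (liveTaskManagers.remove(taskManagerId)) {
                failureCauses.put(taskManagerId, cause);
                // Job recovery would be triggered here.
            }
        }
    }

    /** Toy ResourceManager: forwards termination events from the resource framework. */
    static class ToyResourceManager {
        private final List<JobManagerListener> jobManagers = new ArrayList<>();

        void registerJobManager(JobManagerListener jobManager) {
            jobManagers.add(jobManager);
        }

        /** Assumed to be called when the resource framework reports a lost container. */
        void onContainerCompleted(String taskManagerId, String cause) {
            for (JobManagerListener jobManager : jobManagers) {
                jobManager.notifyTaskManagerTerminated(taskManagerId, cause);
            }
        }
    }

    public static void main(String[] args) {
        ToyJobManager jobManager = new ToyJobManager();
        ToyResourceManager resourceManager = new ToyResourceManager();
        resourceManager.registerJobManager(jobManager);
        jobManager.registerTaskManager("tm-1");

        // The ResourceManager hears about the kill before the heartbeat times out.
        resourceManager.onContainerCompleted("tm-1", "container killed");
        jobManager.onHeartbeatTimeout("tm-1");

        System.out.println(jobManager.failureCauses);
    }
}
```

Routing both paths through one idempotent handler keeps the heartbeat as a fallback for failures the resource framework never reports, while avoiding a second recovery attempt when both signals arrive.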