Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
One of our design goals is fault tolerance: Tajo should handle a variety of node and task failures. Tajo already handles some types of failure, but there is still room for improvement. The objective of this issue is to improve Tajo's fault tolerance. It is an umbrella issue that tracks all the issues required for handling task and node failures.
Issue Links
- relates to
  - TAJO-1376 Add black list feature for unhealthy nodes (Open)
  - TAJO-1508 ResourceTracker does not update workers' resource capacities after the first join (Resolved)
  - TAJO-1563 Improve RPC error handling (Resolved)
- requires
  - TAJO-1215 ResourceTracker should notify node failure to QueryMaster (Open)
  - TAJO-1216 Output commit should be two phase commit (Open)
  - TAJO-1218 Implement straggler detector and the block list (Open)
  - TAJO-1217 Periodical task checker to detect ping timeout of tasks (In Progress)