Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
Currently, Impala propagates only a single overall_status from an executor to the coordinator. The overall_status is set in the QueryState, and "If multiple fragments have errors, the first fragment to hit an error is given preference."
The issue is that if multiple fragments fail, some of the errors may warrant a retry while others do not. For example, one fragment could fail due to faulty disks, while others fail due to mem limit exceptions. Such queries shouldn't be retried, because the query will likely just fail again.
This can only happen if the non-retryable error occurs within a specific time window: after the retryable error occurs, but before the query is cancelled. Any fragment failure causes the entire query to be cancelled, so a non-retryable error raised outside that window cannot be masked in this way.
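
The scenario can be illustrated with a small Python sketch. This is a toy model, not Impala code: FragmentError, QueryStateSketch, and both retry helpers are hypothetical names invented for illustration, and deciding retries from every recorded error is only one possible approach, not something this issue prescribes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class FragmentError:
    """Hypothetical error record; not an Impala type."""
    message: str
    retryable: bool


@dataclass
class QueryStateSketch:
    """Toy stand-in for the per-executor query state described above."""
    overall_status: Optional[FragmentError] = None  # first error only
    all_errors: List[FragmentError] = field(default_factory=list)
    cancelled: bool = False

    def report_error(self, err: FragmentError) -> None:
        # Errors that arrive after cancellation fall outside the window.
        if self.cancelled:
            return
        self.all_errors.append(err)
        # The first fragment to hit an error is given preference.
        if self.overall_status is None:
            self.overall_status = err

    def cancel(self) -> None:
        self.cancelled = True


def retry_from_overall_status(state: QueryStateSketch) -> bool:
    """Decision based only on the single propagated status."""
    return state.overall_status is not None and state.overall_status.retryable


def retry_from_all_errors(state: QueryStateSketch) -> bool:
    """Assumed alternative: retry only if every recorded error is retryable."""
    return bool(state.all_errors) and all(e.retryable for e in state.all_errors)


state = QueryStateSketch()
state.report_error(FragmentError("disk I/O error", retryable=True))
# Non-retryable error lands inside the window: after the retryable error,
# before the query is cancelled.
state.report_error(FragmentError("memory limit exceeded", retryable=False))
state.cancel()
# This error arrives after cancellation and is ignored by the sketch.
state.report_error(FragmentError("late error", retryable=False))

print(retry_from_overall_status(state))  # True: the mem limit error is masked
print(retry_from_all_errors(state))      # False: the mem limit error is seen
```

In this sketch the first-error-wins status alone suggests a retry, while the full error list shows that a non-retryable failure also occurred inside the window.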