[IMPALA-9254] Queries should only be retried if all fragments fail with retryable errors - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Distributed Exec
Labels:
None

Epic Color:
ghx-label-4

Description

Currently, Impala only propagates an overall_status from an executor to the coordinator. The overall_status is set in the QueryState and "If multiple fragments have errors, the first fragment to hit an error is given preference.".

The issue is that if multiple fragments fail, it is possible some of the errors should trigger a retry, while other errors shouldn't. For example, one fragment could fail due to faulty disks, but others could fail due to mem limit exceptions. These types of queries shouldn't be retried because it is likely the query will just fail again.

This can only happen if the non-retryable error occurs in a specific time window: [when the retryable error occurs, the query is cancelled]. Since any fragment failure causes the entire query to be cancelled, this can only occur if the non-retryable error occurs after the retryable error, but before the query is cancelled.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Sahil Takiar

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 16/Dec/19 21:28

Updated:: 21/Dec/20 20:44