Details
- Improvement
- Status: Resolved
- Minor
- Resolution: Duplicate
- Impala 1.0.1
- None
Description
The membership mechanism that tells Impala servers about failures does not always detect fast crash-restarts. If a server restarts and re-registers before the state-store recognises that it has failed, the failure is never reported to any other subscriber.
The right way to fix this, I think, is to assign a version number to every subscriber. Each time a subscriber reconnects, it gets a new, higher version number. For every query, we record the highest subscriber version number known when the query starts. If any backend executing the query later shows a higher version number than that, it has probably restarted since the query started. There might be a few false positives, since a node could conceivably restart between the scheduling assignment and actually receiving the query, but that is unlikely and is better than false negatives.
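A minimal sketch of the versioning idea, in Python for brevity rather than Impala's own code. It assumes one reading of the proposal: the state-store hands out versions from a single increasing counter, and a query snapshots the highest version issued so far. The names `Membership`, `Query`, `start_query` and `restarted_backends` are hypothetical and do not correspond to anything in the Impala code base.

```python
from dataclasses import dataclass


class Membership:
    """Hypothetical state-store view: subscriber id -> registration version."""

    def __init__(self):
        self._versions = {}
        self._last_version = 0

    def register(self, subscriber_id):
        # Every registration, including a re-registration after a crash,
        # receives a fresh, strictly increasing version number.
        self._last_version += 1
        self._versions[subscriber_id] = self._last_version
        return self._last_version

    def version_of(self, subscriber_id):
        return self._versions[subscriber_id]

    def highest_version(self):
        return self._last_version


@dataclass
class Query:
    backends: list
    # Highest subscriber version known when the query was started.
    max_version_at_start: int


def start_query(membership, backends):
    # Snapshot the highest version at scheduling time (assumed design).
    return Query(list(backends), membership.highest_version())


def restarted_backends(membership, query):
    # A backend whose current version exceeds the snapshot must have
    # re-registered after the query started, so it has probably restarted.
    return [b for b in query.backends
            if membership.version_of(b) > query.max_version_at_start]


membership = Membership()
membership.register("impalad-a")
membership.register("impalad-b")
query = start_query(membership, ["impalad-a", "impalad-b"])

membership.register("impalad-b")  # fast crash-restart and re-registration
print(restarted_backends(membership, query))
```

Because only the query's own backends are compared against the snapshot, an unrelated node joining the cluster later does not flag the query. A backend that restarts between scheduling and receiving the query would still be flagged, which is the false positive mentioned above.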
Issue Links
- duplicates
  - IMPALA-2990 Coordinator should timeout and cancel queries with unresponsive / stuck executors (Resolved)
- relates to
  - IMPALA-412 Impala might hang when an impalad die during query execution (Resolved)