Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- Impala 1.0
- None
- None
Description
Here's the detailed description of what could lead to a hang.
A query started and the client began fetching results. The fetch blocks because the Impala coordinator is waiting (in DataStreamManager) for data to arrive from its child fragments. While it waits, the fetch call holds the exec_state lock.
This wait has no way to detect whether the child fragment instances are still running and healthy. It returns only when the query is cancelled or some data arrives.
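To make the limitation concrete, here is a minimal Python sketch of a wait that also checks sender liveness. It is only an illustration under stated assumptions, not Impala's code and not necessarily how the bug was fixed: `StreamReceiver`, `get_batch`, `sender_alive`, and `poll_interval` are all hypothetical names.

```python
import threading


class StreamReceiver:
    """Toy stand-in for a receiver waiting on data from child fragments (hypothetical)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._batches = []
        self._cancelled = False

    def add_batch(self, batch):
        # Called when a child fragment delivers data.
        with self._cond:
            self._batches.append(batch)
            self._cond.notify_all()

    def cancel(self):
        # Called when the query is cancelled.
        with self._cond:
            self._cancelled = True
            self._cond.notify_all()

    def get_batch(self, sender_alive, poll_interval=0.05):
        """Return the next batch, or None if cancelled or the sender is dead.

        sender_alive is an assumed callback reporting whether the child
        fragment is still running; the wait described above has no such check.
        """
        with self._cond:
            while True:
                if self._cancelled:
                    return None
                if self._batches:
                    return self._batches.pop(0)
                if not sender_alive():
                    return None
                # Wake up periodically instead of waiting indefinitely.
                self._cond.wait(timeout=poll_interval)
```

With an unbounded wait and no liveness callback, the loop would return only on cancellation or data arrival, which is the situation described above.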
Now suppose all the child fragment instances are dead because their nodes died. The coordinator node is still running and waiting for data. The statestore detects the node failure and tries to cancel the query. However, the cancellation cannot go through because the fetch call (FetchInternal) is still holding the exec_state lock: CancelInternal() cannot proceed because GetQueryExecState() cannot acquire that lock.
GetQueryExecState() blocks on the exec_state lock while holding query_exec_state_map_lock_. This in turn hangs the webserver, which needs query_exec_state_map_lock_ to see which queries are still alive.
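The chain of blocked locks can be reproduced in a small Python model. This is a sketch, not Impala's C++ code: the function and variable names merely mirror the ones in the description, and the lock acquisitions use timeouts (an assumption made so the demo terminates) where the real calls would block indefinitely.

```python
import threading


def simulate():
    """Model the hang: fetch holds exec_state, cancel holds the map lock, webserver starves."""
    exec_state_lock = threading.Lock()            # per-query lock
    query_exec_state_map_lock = threading.Lock()  # server-wide map lock
    data_arrived = threading.Event()              # stays unset while the children are dead
    fetch_holding = threading.Event()
    cancel_holding_map = threading.Event()
    result = {}

    def fetch_internal():
        # The fetch takes the exec_state lock, then waits for data that never comes.
        with exec_state_lock:
            fetch_holding.set()
            data_arrived.wait()

    def cancel_internal():
        # Lookup takes the map lock first, then tries to take the exec_state lock.
        with query_exec_state_map_lock:
            cancel_holding_map.set()
            got = exec_state_lock.acquire(timeout=0.5)
            result["cancel_got_exec_state"] = got
            if got:
                exec_state_lock.release()

    fetcher = threading.Thread(target=fetch_internal, daemon=True)
    fetcher.start()
    fetch_holding.wait()

    canceller = threading.Thread(target=cancel_internal, daemon=True)
    canceller.start()
    cancel_holding_map.wait()

    # The webserver needs the map lock to list live queries.
    got_map = query_exec_state_map_lock.acquire(timeout=0.1)
    result["webserver_got_map_lock"] = got_map
    if got_map:
        query_exec_state_map_lock.release()

    canceller.join()
    data_arrived.set()  # unblock the fetch so the demo can exit cleanly
    fetcher.join()
    return result


if __name__ == "__main__":
    print(simulate())
```

In this model, neither the cancellation nor the webserver gets the lock it needs while the fetch is parked inside the data wait. Releasing the exec_state lock before blocking on data, or bounding the wait, would break the chain.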
Issue Links
- is related to: IMPALA-414 Impala server cannot detect crash-restart failures reliably (Resolved)