Details
Sub-task
Status: Resolved
Major
Resolution: Fixed
Impala 4.0.0
None
Description
Currently, it's possible for the admission service to have an incorrect view of which resources are in use in the cluster if there are RPC failures. For example, if the ReleaseQuery RPC fails, the coordinator will retry a few times and then give up. In this case, the query has completed but the admission service doesn't know, so it will not allow other queries to be scheduled with those resources.
We can solve this by adding a periodic heartbeat RPC from coordinators to the admission service. Each heartbeat will include the query IDs of all queries currently running at that coordinator, and the admission service can then clean up resources allocated to any of that coordinator's queries that are not in the list, on the assumption that those queries must have completed already.
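To illustrate the proposed reconciliation, here is a minimal Python sketch of the idea. It is a toy model, not Impala's actual admission controller: the class `AdmissionService`, its methods (`admit`, `release`, `heartbeat`, `reserved_bytes`), and the per-query memory figure are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class AdmissionService:
    """Toy model of the admission service's resource bookkeeping (hypothetical)."""

    # coordinator id -> {query id -> bytes reserved for that query}
    allocations: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def admit(self, coordinator_id: str, query_id: str, mem_bytes: int) -> None:
        """Record resources reserved for a newly admitted query."""
        self.allocations.setdefault(coordinator_id, {})[query_id] = mem_bytes

    def release(self, coordinator_id: str, query_id: str) -> None:
        """Normal path: the coordinator's release call arrives successfully."""
        self.allocations.get(coordinator_id, {}).pop(query_id, None)

    def heartbeat(self, coordinator_id: str,
                  running_query_ids: Iterable[str]) -> List[str]:
        """Reconcile against the coordinator's list of running queries.

        Any query still tracked for this coordinator but absent from the
        heartbeat is assumed to have completed, and its resources are freed.
        Returns the ids of the queries that were cleaned up.
        """
        held = self.allocations.get(coordinator_id, {})
        stale = sorted(set(held) - set(running_query_ids))
        for query_id in stale:
            del held[query_id]
        return stale

    def reserved_bytes(self) -> int:
        """Total bytes the service currently believes are in use."""
        return sum(sum(queries.values()) for queries in self.allocations.values())


svc = AdmissionService()
svc.admit("coord-1", "q1", 100)
svc.admit("coord-1", "q2", 200)
svc.admit("coord-2", "q3", 50)

# q1 finishes on coord-1, but its release call is lost, so the service
# still counts q1's memory. The next heartbeat from coord-1 lists only q2.
cleaned = svc.heartbeat("coord-1", ["q2"])
```

In this sketch the heartbeat only touches allocations belonging to the coordinator that sent it, so queries running on other coordinators (here `q3`) are unaffected.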