Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
2.0.0
Description
I'm not sure if I'll get to this, but wanted to log it as a known issue.
AsyncProcess breaks the batch into sub-batches by regionserver, then submits one callable per regionserver to a threadpool. In the main thread, it calls waitUntilDone() with an operation timeout. If the callables don't finish within the operation timeout, a SocketTimeoutException is thrown. This exception is not very useful because it gives no sense of how many calls were in progress, on which servers, or why they were delayed.
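To illustrate the pattern described above, here is a minimal sketch, not the actual AsyncProcess code. The class SubBatchSketch, the Action type, and both methods are hypothetical names invented for this example, and the RPC is simulated with a sleep. It shows why a timeout raised from the waiting thread carries no per-server detail.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SubBatchSketch {
    /** Hypothetical stand-in for one action and the server hosting its region. */
    static final class Action {
        final String server;
        final String row;
        Action(String server, String row) { this.server = server; this.row = row; }
    }

    /** Splits the batch into one sub-batch per server. */
    static Map<String, List<Action>> groupByServer(List<Action> batch) {
        Map<String, List<Action>> byServer = new TreeMap<>();
        for (Action a : batch) {
            byServer.computeIfAbsent(a.server, k -> new ArrayList<>()).add(a);
        }
        return byServer;
    }

    /** Submits one task per server, then waits for all of them on the calling thread. */
    static void submitAndWait(List<Action> batch, long operationTimeoutMs, long simulatedCallMs)
            throws SocketTimeoutException, InterruptedException {
        Map<String, List<Action>> byServer = groupByServer(batch);
        if (byServer.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(byServer.size());
        CountDownLatch done = new CountDownLatch(byServer.size());
        try {
            for (int i = 0; i < byServer.size(); i++) {
                pool.execute(() -> {
                    try {
                        Thread.sleep(simulatedCallMs); // simulated RPC to one server
                        done.countDown();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The waiter only knows that the latch did not reach zero in time;
            // it cannot say which server was slow or why.
            if (!done.await(operationTimeoutMs, TimeUnit.MILLISECONDS)) {
                throw new SocketTimeoutException("timed out after " + operationTimeoutMs + " ms");
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In this sketch the exception message can only report the timeout value, which mirrors the complaint in this ticket.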
Recently we've been improving the adherence to operation timeout within the callables themselves. The main driver here has been to ensure we don't erroneously clear the meta cache for operation timeout related errors. So we've added a new OperationTimeoutExceededException, which is thrown from within the callables and does not cause a meta cache clear. The added benefit is that if these bubble up to the caller, they are wrapped in RetriesExhaustedWithDetailsException which includes a lot more info about which server and which action is affected.
We've now covered most, but not all, of the cases where the operation timeout is exceeded. So when the operation timeout is exceeded, the caller may see either a SocketTimeoutException from waitUntilDone or an OperationTimeoutExceededException from the callables, depending on which one fails first. It would be nice to finish the job here and ensure that we always throw OperationTimeoutExceededException from the callables.
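A rough sketch of the alternative, where each callable checks the operation deadline itself so the failure can be attributed to a server. Everything here is hypothetical: DeadlineAwareCallables, its TimeoutExceeded exception, and the simulated work are stand-ins for illustration, not the HBase classes named in this ticket.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class DeadlineAwareCallables {
    /** Hypothetical local exception, not the HBase class with a similar name. */
    static final class TimeoutExceeded extends Exception {
        TimeoutExceeded(String message) { super(message); }
    }

    /** Builds a simulated per-server call that enforces the shared deadline itself. */
    static Callable<String> call(String server, long workMs, long deadlineNanos) {
        return () -> {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
            if (workMs > remainingMs) {
                // The work cannot finish in time: wait out the budget, then fail
                // with a message that names the affected server.
                if (remainingMs > 0) {
                    Thread.sleep(remainingMs);
                }
                throw new TimeoutExceeded("operation timeout exceeded on " + server);
            }
            Thread.sleep(workMs); // simulated RPC
            return server;
        };
    }

    /** Runs one call per server and returns server -> failure message for failed calls. */
    static Map<String, String> run(Map<String, Long> workMsByServer, long operationTimeoutMs)
            throws InterruptedException {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(operationTimeoutMs);
        ExecutorService pool = Executors.newCachedThreadPool();
        Map<String, Future<String>> futures = new LinkedHashMap<>();
        Map<String, String> failures = new LinkedHashMap<>();
        try {
            for (Map.Entry<String, Long> e : workMsByServer.entrySet()) {
                futures.put(e.getKey(), pool.submit(call(e.getKey(), e.getValue(), deadlineNanos)));
            }
            for (Map.Entry<String, Future<String>> e : futures.entrySet()) {
                try {
                    e.getValue().get();
                } catch (ExecutionException ex) {
                    // The cause is the exception thrown inside the callable.
                    failures.put(e.getKey(), ex.getCause().getMessage());
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return failures;
    }
}
```

Because the timeout originates inside the callable, the caller ends up with a per-server failure map instead of a single anonymous timeout.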
The main remaining case is the call to locateRegion, which hits meta and does not honor the call's operation timeout (it uses the meta operation timeout instead). Resolving this would require some refactoring of ConnectionImplementation.locateRegion to allow passing in an operation timeout that bounds both the wait on userRegionLock and the meta scan.
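One possible shape for that refactoring, sketched under assumptions: BoundedLocateSketch, effectiveTimeoutMs, and locate are hypothetical names, the lock is a plain ReentrantLock standing in for userRegionLock, and the meta scan is replaced by a placeholder return value. The idea is to cap the lock wait at whichever is smaller, the caller's remaining operation time or the meta operation timeout.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class BoundedLocateSketch {
    // Stand-in for the lock that serializes meta lookups.
    final ReentrantLock userRegionLock = new ReentrantLock();
    private final long metaOperationTimeoutMs;

    BoundedLocateSketch(long metaOperationTimeoutMs) {
        this.metaOperationTimeoutMs = metaOperationTimeoutMs;
    }

    /** Returns the smaller of the two budgets, never negative. */
    static long effectiveTimeoutMs(long remainingOperationMs, long metaOperationTimeoutMs) {
        return Math.max(0L, Math.min(remainingOperationMs, metaOperationTimeoutMs));
    }

    /** Hypothetical locate call that honors the caller's remaining operation time. */
    String locate(String row, long remainingOperationMs) throws IOException, InterruptedException {
        long waitMs = effectiveTimeoutMs(remainingOperationMs, metaOperationTimeoutMs);
        if (!userRegionLock.tryLock(waitMs, TimeUnit.MILLISECONDS)) {
            throw new IOException(
                "could not acquire region lock for row " + row + " within " + waitMs + " ms");
        }
        try {
            // Placeholder for the meta scan; a real version would also pass the
            // remaining budget down to the scan.
            return "region-for-" + row;
        } finally {
            userRegionLock.unlock();
        }
    }
}
```

With a 30 second meta operation timeout and 50 ms left on the call, this sketch waits at most 50 ms for the lock rather than the full 30 seconds.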