[HBASE-28358] AsyncProcess inconsistent exception thrown for operation timeout - ASF JIRA


Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.0
Fix Version/s: None
Component/s: None
Labels: None

Description

I'm not sure if I'll get to this, but wanted to log it as a known issue.

AsyncProcess splits a batch into sub-batches by regionserver, then submits one callable per regionserver to a thread pool. The main thread then calls waitUntilDone() with the operation timeout. If the callables don't finish within that timeout, a SocketTimeoutException is thrown. This exception is not very useful because it gives you no sense of how many calls were in progress, on which servers, or why they were delayed.
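For readers unfamiliar with the pattern, here is a minimal sketch of the fan-out-then-wait shape described above. It is not the HBase code: FanOutSketch, submitAndWait, and the exception message are made up for illustration, and a CountDownLatch stands in for whatever bookkeeping waitUntilDone() really uses.

```java
import java.net.SocketTimeoutException;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names and structure are assumptions, not HBase's AsyncProcess.
class FanOutSketch {
  /** Runs one task per server in a pool and waits up to timeoutMs for all of them. */
  static void submitAndWait(Map<String, Runnable> tasksByServer, long timeoutMs)
      throws SocketTimeoutException, InterruptedException {
    ExecutorService pool = Executors.newCachedThreadPool();
    CountDownLatch remaining = new CountDownLatch(tasksByServer.size());
    try {
      for (Runnable task : tasksByServer.values()) {
        pool.execute(() -> {
          try {
            task.run();
          } finally {
            remaining.countDown(); // count the task as done even if it failed
          }
        });
      }
      // The waiting thread only knows that time ran out. It has no view of
      // which server is slow or why, so the exception carries no such detail.
      if (!remaining.await(timeoutMs, TimeUnit.MILLISECONDS)) {
        throw new SocketTimeoutException("Operation timed out after " + timeoutMs + "ms");
      }
    } finally {
      pool.shutdownNow(); // interrupt anything still running
    }
  }
}
```

The point of the sketch is that the timeout is detected outside the tasks, which is why the resulting exception cannot say much about them.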

Recently we've been improving the adherence to operation timeout within the callables themselves. The main driver here has been to ensure we don't erroneously clear the meta cache for operation timeout related errors. So we've added a new OperationTimeoutExceededException, which is thrown from within the callables and does not cause a meta cache clear. The added benefit is that if these bubble up to the caller, they are wrapped in RetriesExhaustedWithDetailsException which includes a lot more info about which server and which action is affected.
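To make the idea concrete, here is a rough sketch of a deadline check done inside the per-server task, so the failure can name the server. DeadlineSketch and TimeoutWithDetails are hypothetical stand-ins; the real OperationTimeoutExceededException and RetriesExhaustedWithDetailsException have their own constructors and fields, which this does not try to reproduce.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; these are not the HBase exception classes.
class DeadlineSketch {
  /** Stand-in for an operation-timeout exception that records the affected server. */
  static class TimeoutWithDetails extends RuntimeException {
    final String server;

    TimeoutWithDetails(String server, long timeoutMs) {
      super("Operation timeout of " + timeoutMs + "ms exceeded on " + server);
      this.server = server;
    }
  }

  /** Milliseconds left before the deadline; zero or negative once it has passed. */
  static long remainingMs(long startNanos, long timeoutMs, long nowNanos) {
    return timeoutMs - TimeUnit.NANOSECONDS.toMillis(nowNanos - startNanos);
  }

  /** Meant to be called inside the per-server task before each attempt. */
  static void checkDeadline(String server, long startNanos, long timeoutMs, long nowNanos) {
    if (remainingMs(startNanos, timeoutMs, nowNanos) <= 0) {
      throw new TimeoutWithDetails(server, timeoutMs);
    }
  }
}
```

Because the check runs where the server is known, the error can carry that context, unlike a timeout raised by the waiting thread.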

We've now covered most, but not all, cases where the operation timeout is exceeded. So when the operation timeout is exceeded, you may sometimes see a SocketTimeoutException from waitUntilDone and sometimes an OperationTimeoutExceededException from the callables, depending on which one fails first. It would be nice to finish the job here by ensuring that we always throw OperationTimeoutExceededException from the callables.

The main remaining case is the call to locateRegion, which hits meta and does not honor the call's operation timeout (it uses the meta operation timeout instead). Resolving this would require some refactoring of ConnectionImplementation.locateRegion to allow passing in an operation timeout and having it apply to both the userRegionLock acquisition and the meta scan.
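One possible shape for that refactoring, sketched under assumptions: a single deadline derived from the caller's timeout bounds both the lock wait and the lookup that follows. LocateSketch, locate, and the metaLookup callback are hypothetical, and this assumes the lock can be treated as a plain ReentrantLock; it does not reflect the actual locateRegion signature.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongFunction;

// Hypothetical illustration only; not ConnectionImplementation.locateRegion.
class LocateSketch {
  /**
   * Acquires the lock within timeoutMs, then hands whatever time is left
   * to metaLookup so the scan can apply it as its own timeout.
   */
  static <T> T locate(ReentrantLock lock, long timeoutMs, LongFunction<T> metaLookup)
      throws TimeoutException, InterruptedException {
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
      throw new TimeoutException("Could not acquire lock within " + timeoutMs + "ms");
    }
    try {
      long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
      if (remainingMs <= 0) {
        throw new TimeoutException("No time left for lookup after waiting on lock");
      }
      return metaLookup.apply(remainingMs);
    } finally {
      lock.unlock();
    }
  }
}
```

Passing the remaining time down, rather than the full timeout, keeps time spent waiting on the lock from extending the overall operation.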


People

Assignee: Unassigned

Reporter: Bryan Beaudreault

Votes: 0

Watchers: 5

Dates

Created: 11/Feb/24 14:35

Updated: 18/Sep/24 09:22