[HBASE-27871] Meta replication stuck forever if wal it's still reading gets rolled and deleted - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.6.0, 2.4.16, 2.4.17, 2.5.4
Fix Version/s: 2.6.0, 2.4.18, 2.5.6
Component/s: meta replicas
Labels: None

Description

This affects branch-2 based releases only (in master, HBASE-26416 refactored region replication so that it no longer relies on the replication framework).

Per the original meta region replicas design, we use most of the replication framework for communicating changes in the primary replica to the secondary ones, but we skip storing the queue state in ZK. In the event of a region replication crash, we should let the related replication source thread be interrupted, so that RegionReplicaReplicationEndpoint sets up a new source from scratch and makes sure the secondary replicas get updated.

We ran into a situation on one of our customers' clusters where the region replica source fell far behind (probably because the RSes hosting the secondary replicas were busy and slow to process the region replication entries), so the current WAL got rolled and eventually deleted while the replication source reader was still referring to it. In such cases, ReplicationSourceReader only sees the IOException and keeps retrying the read indefinitely; since the file is now gone, the read can never succeed and the reader stays stuck forever. In the particular case of a FileNotFoundException (FNFE), which I believe would only happen for region replication, we should just raise an exception and let RegionReplicaReplicationEndpoint handle it to reset the region replication source.
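To illustrate the idea, here is a minimal, self-contained sketch of a read loop that keeps retrying on a generic IOException but gives up immediately on a FileNotFoundException. It is not the actual HBase code or the patch from the linked pull requests: `WalReadRetrySketch`, `WalRead`, `SourceResetRequired` and `readWithRetries` are hypothetical names, and the retry count is bounded here only so the example terminates (the reader described above retries indefinitely).

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class WalReadRetrySketch {

    /** Hypothetical stand-in for one attempt at reading a batch of WAL entries. */
    interface WalRead {
        String readBatch() throws IOException;
    }

    /** Hypothetical signal that the source cannot recover and must be recreated. */
    static class SourceResetRequired extends RuntimeException {
        SourceResetRequired(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Retries transient IOExceptions, but treats a missing file as fatal.
     * The attempt limit is an assumption of this sketch, not HBase behaviour.
     */
    static String readWithRetries(WalRead read, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.readBatch();
            } catch (FileNotFoundException e) {
                // The WAL was rolled and deleted: retrying can never succeed,
                // so escalate instead of looping.
                throw new SourceResetRequired("WAL no longer exists", e);
            } catch (IOException e) {
                // Possibly transient: remember the failure, back off, try again.
                last = e;
                Thread.sleep(sleepMillis);
            }
        }
        throw last;
    }
}
```

In this sketch the caller that owns the source would catch `SourceResetRequired` and build a fresh source, which mirrors the role the description assigns to RegionReplicaReplicationEndpoint.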

Issue Links

links to

GitHub Pull Request #5241

GitHub Pull Request #5271

People

Assignee: Wellington Chevreuil

Reporter: Wellington Chevreuil

Votes: 0

Watchers: 4

Dates

Created: 17/May/23 14:44

Updated: 21/Jun/23 03:30

Resolved: 20/Jun/23 14:18