Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
2.6.0, 2.4.16, 2.4.17, 2.5.4
None
Description
This affects branch-2 based releases only (in master, HBASE-26416 refactored region replication to not rely on the replication framework anymore).
Per the original meta region replicas design, we use most of the replication framework to communicate changes in the primary replica to the secondary ones, but we skip storing the queue state in ZK. In the event of a region replication crash, we should let the related replication source thread be interrupted, so that RegionReplicaReplicationEndpoint sets up a new source from scratch and makes sure the secondary replicas are updated.
We ran into a situation in one of our customers' clusters where the region replica source faced a long lag (probably because the RSes hosting the secondary replicas were busy and slow in processing the region replication entries), so the current WAL got rolled and eventually deleted while the replication source reader was still referring to it. In such cases, ReplicationSourceReader only sees the IOException and keeps retrying the read indefinitely; since the file is now gone, it gets stuck there forever. In the particular case of a FileNotFoundException (FNFE), which I believe would only happen for region replication, we should just raise an exception and let RegionReplicaReplicationEndpoint handle it to reset the region replication source.
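To illustrate the proposed behaviour, here is a minimal, self-contained sketch. It is not the HBase code: WalRead, SourceResetRequiredException and readWithRetry are hypothetical names made up for this example, and the attempt cap exists only so the demo terminates (the reader described above retries without a limit). The idea it shows is simply that a FileNotFoundException is treated as fatal and propagated, while any other IOException is retried.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class WalReadRetrySketch {

  /** Hypothetical stand-in for one read attempt against the current WAL file. */
  public interface WalRead {
    int readBatch() throws IOException;
  }

  /** Hypothetical signal telling the caller that the source must be recreated. */
  public static class SourceResetRequiredException extends RuntimeException {
    public SourceResetRequiredException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  /**
   * Retries transient IOExceptions, but gives up immediately on
   * FileNotFoundException, because a deleted WAL will never become readable.
   * maxAttempts is an assumption added for the demo only.
   */
  public static int readWithRetry(WalRead read, int maxAttempts) {
    int failures = 0;
    while (true) {
      try {
        return read.readBatch();
      } catch (FileNotFoundException e) {
        // Must be caught before IOException, since FNFE is a subclass of it.
        throw new SourceResetRequiredException("WAL file is gone, reset the source", e);
      } catch (IOException e) {
        failures++;
        if (failures >= maxAttempts) {
          throw new IllegalStateException("giving up after " + failures + " failures", e);
        }
        // A real reader would sleep/back off here before the next attempt.
      }
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fails twice with a transient error, then succeeds.
    int entries = readWithRetry(() -> {
      if (calls[0]++ < 2) {
        throw new IOException("transient read failure");
      }
      return 5;
    }, 10);
    System.out.println("entries read: " + entries);

    try {
      readWithRetry(() -> { throw new FileNotFoundException("WAL deleted"); }, 10);
    } catch (SourceResetRequiredException e) {
      System.out.println("reset requested: " + e.getMessage());
    }
  }
}
```

In this sketch the caller that catches SourceResetRequiredException plays the role the description assigns to RegionReplicaReplicationEndpoint: it is the place where the stuck source would be discarded and a new one created.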