Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
Description
We were doing a cluster restart the other day, and some regionservers did not shut down cleanly. Upon restart, our locality went from 99% to 5%. Looking at the code, AssignmentManager.joinCluster() calls AssignmentManager.processDeadServersAndRegionsInTransition().
If the failover flag gets set for any reason, it seems we never call assignAllUserRegions(). The balancer then does the work of assigning those regions; since we don't use a locality-aware balancer, we lost our region locality.
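For context, here is a condensed, self-contained Java sketch of the startup flow as I read it. The method names mirror the real ones discussed above, but every body is a simplified stand-in, not the actual HBase source:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;

    /**
     * Simplified sketch of the master startup flow described above.
     * NOT the actual HBase AssignmentManager; names mirror the real
     * methods, but the bodies are condensed stand-ins.
     */
    class AssignmentFlowSketch {
      /** Entry point when the master joins the cluster. */
      void joinCluster() {
        Set<String> deadServers = rebuildUserRegions();
        processDeadServersAndRegionsInTransition(deadServers);
      }

      void processDeadServersAndRegionsInTransition(Set<String> deadServers) {
        boolean failover = detectFailover(deadServers); // stand-in for the real checks
        if (failover) {
          // Failover path: dead servers' regions get reassigned piecemeal
          // (server-shutdown handling / balancer), ignoring old locality.
          processDeadServers(deadServers);
        } else {
          // Clean-startup path: bulk assignment with a retained-assignment
          // plan, which is what preserves locality.
          assignAllUserRegions(getAllRegions());
        }
      }

      // --- stand-ins so the sketch compiles on its own ---
      Set<String> rebuildUserRegions() { return Collections.emptySet(); }
      boolean detectFailover(Set<String> deadServers) { return !deadServers.isEmpty(); }
      void processDeadServers(Set<String> deadServers) { /* reassign regions */ }
      void assignAllUserRegions(Map<String, String> regions) { /* bulk assign */ }
      Map<String, String> getAllRegions() { return Collections.emptyMap(); }
    }

The key point is that the failover branch never goes through the bulk, locality-retaining assignment.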
I don't have a solid grasp on the reasoning for these checks, but there are some potential workarounds here.
1. After shutting down your cluster, move your WALs aside (replay later).
2. Clean up your znodes in ZooKeeper.
That seems to work, but it requires a lot of manual labor. Another solution, which I prefer, would be to have a flag for the startup script: ./start-hbase.sh --clean
If we start the master with that flag, we do a check in AssignmentManager.processDeadServersAndRegionsInTransition(): if the flag is set, we call assignAllUserRegions() regardless of the failover state.
I have a patch for the latter solution, assuming I am understanding the logic correctly.
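To illustrate the idea, here is a minimal sketch of that check. The property name hbase.master.startup.clean is invented for illustration (the real patch would wire the --clean flag through to the master configuration), and the stand-in methods only exist to make the sketch compile:

    import java.util.Properties;
    import java.util.Set;

    /**
     * Hypothetical sketch of the proposed clean-startup check.
     * "hbase.master.startup.clean" is an invented property, not an
     * existing HBase configuration key.
     */
    class CleanStartupSketch {
      private final Properties conf;

      CleanStartupSketch(Properties conf) {
        this.conf = conf;
      }

      void processDeadServersAndRegionsInTransition(Set<String> deadServers) {
        boolean failover = detectFailover(deadServers); // existing failover checks
        boolean cleanStartup = Boolean.parseBoolean(
            conf.getProperty("hbase.master.startup.clean", "false"));
        if (failover && !cleanStartup) {
          processDeadServers(deadServers); // failover path, locality lost
        } else {
          // Operator asserted a clean restart: always bulk-assign with the
          // retained assignment plan, regardless of the failover heuristics.
          assignAllUserRegions();
        }
      }

      // --- stand-ins so the sketch compiles on its own ---
      boolean detectFailover(Set<String> deadServers) { return !deadServers.isEmpty(); }
      void processDeadServers(Set<String> deadServers) { }
      void assignAllUserRegions() { }
    }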
Attachments
Issue Links
- is related to:
  - HBASE-18036 HBase 1.x : Data locality is not maintained after cluster restart or SSH (Resolved)
  - HBASE-15251 During a cluster restart, Hmaster thinks it is a failover by mistake (Resolved)
  - HBASE-17791 Locality should not be affected for non-faulty region servers at startup (Open)