Details
Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Fixed
Reviewed
Description
Shake-down of ModifyTableProcedure. I talked this one out with Stack: the "proper" fix is likely pending in HBASE-20682. Using MoveRegionProcedure is likely the wrong construct; we would want something specific to reopening (e.g. a ReopenProcedure).
However, we're in a really bad state right now. If a table that has a modify submitted against it has any non-open regions, the entire system locks up in a fast spin while holding the table's lock. This fills up HDFS with PV2 WALs and prevents you from doing anything in the hbase shell to fix those unassigned regions. You'll see spam in the master log like:
2018-06-07 03:21:29,448 WARN [PEWorker-1] procedure.ModifyTableProcedure: Retriable error trying to modify table=METRIC_RECORD_HOURLY_UUID (in state=MODIFY_TABLE_REOPEN_ALL_REGIONS)
org.apache.hadoop.hbase.client.DoNotRetryRegionException: a3dc333606d38aeb6e2ab4b94233cfbc is not OPEN
	at org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:193)
	at org.apache.hadoop.hbase.master.assignment.MoveRegionProcedure.<init>(MoveRegionProcedure.java:67)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:767)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createReopenProcedures(AssignmentManager.java:705)
	at org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.executeFromState(ModifyTableProcedure.java:128)
	at org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.executeFromState(ModifyTableProcedure.java:50)
	at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
	at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1472)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1240)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
We unstuck our internal test cluster by applying the following change on top of Sergey's HBASE-20657: when choosing the regions to reopen, we filter the table's regions down to only those which are currently OPEN. There may still be some transient failures here (a region can change state after the filter runs), but a subsequent retry of the reopen step should filter out any region whose state has changed.
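As a rough illustration of the filtering idea only (this is not the actual patch), the selection step might look like the sketch below. The `Region` and `State` types and the `regionsToReopen` method are hypothetical stand-ins invented for this example; they are not the HBase `AssignmentManager` or `RegionState` APIs.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration; these are NOT the HBase classes.
final class ReopenFilterSketch {

    // Simplified region states (assumed subset, for the example only).
    enum State { OPEN, OPENING, CLOSING, CLOSED, OFFLINE }

    // Minimal region descriptor: encoded name plus current state.
    static final class Region {
        final String encodedName;
        final State state;

        Region(String encodedName, State state) {
            this.encodedName = encodedName;
            this.state = state;
        }
    }

    // Keep only regions that are currently OPEN, so the reopen step never
    // tries to build a move/reopen procedure for a region that would fail
    // the "is not OPEN" check and trigger the retry spin.
    static List<Region> regionsToReopen(List<Region> tableRegions) {
        return tableRegions.stream()
                .filter(r -> r.state == State.OPEN)
                .collect(Collectors.toList());
    }
}
```

Because the filter is re-evaluated each time the reopen step runs, a region that leaves OPEN between the filter and the reopen attempt would simply be skipped on the next retry.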