Description
Filing this as a sub-issue of the parent issue, which fails in a way similar to what we see here.
Currently, the TestShutdownBackupMaster test usually passes, but the way it completes is warped. It runs through all of its retries just before the test times out at 13 minutes max. For example, you'll see this:
2020-12-02 22:07:34,200 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=44 of 46 failed; retrying after sleep of 46
... so we do all the retries and then complete. The test looks like it 'succeeded', but it actually ran for Total time: 12:41 min, and the log is full of thread dumps because the cluster won't go down (the time is spent in the test shutdown).
Often, though, we don't complete the retries in time and the test fails. It is on the flakey list.
Rather, we are supposed to fail out fast when we are shutting down. Below is the type of retry we see.
2020-12-02 10:53:35,540 INFO [Listener at localhost/61609] util.JVMClusterUtil(348): Shutdown of 2 master(s) and 2 regionserver(s) complete
2020-12-02 10:53:35,548 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=2 of 46 failed; retrying after sleep of 46
org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x1afa7f5b closed
at org.apache.hadoop.hbase.client.ConnectionImplementation.checkClosed(ConnectionImplementation.java:630)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:815)
at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:803)
at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.relocateRegion(ConnectionUtils.java:138)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:933)
at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:823)
at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:64)
at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:70)
at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:59)
at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:383)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:357)
at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:249)
at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1224)
at org.apache.hadoop.hbase.master.TestShutdownBackupMaster$MockHMaster.initClusterSchemaService(TestShutdownBackupMaster.java:68)
at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1021)
at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2082)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:506)
See how a master is trying to become active and won't relent, even though this cluster is shutting down? See how we keep retrying even though the check for a closed connection comes back with a DoNotRetryIOException? The exception is being swallowed, and we keep going.
Fix looks simple enough.
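To illustrate the fail-fast behavior described above, here is a minimal, self-contained sketch of a retry loop that rethrows a do-not-retry exception immediately instead of swallowing it. This is not the HBase patch: `FailFastRetry` and `callWithRetries` are hypothetical names, and the nested `DoNotRetryIOException` is a local stand-in for `org.apache.hadoop.hbase.DoNotRetryIOException` so the example compiles without HBase on the classpath.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; not the actual ConnectionImplementation code.
class FailFastRetry {

  // Local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException.
  static class DoNotRetryIOException extends IOException {
    DoNotRetryIOException(String message) {
      super(message);
    }
  }

  // Runs op up to maxAttempts times, sleeping sleepMs between attempts.
  // A DoNotRetryIOException aborts the loop on the spot; any other
  // IOException is treated as transient and retried.
  static <T> T callWithRetries(Callable<T> op, int maxAttempts, long sleepMs)
      throws IOException {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (DoNotRetryIOException e) {
        // Fail fast: e.g. the connection is closed because we are shutting down.
        throw e;
      } catch (IOException e) {
        // Possibly transient; remember it and try again.
        last = e;
      } catch (Exception e) {
        // Anything else is unexpected; surface it as an IOException.
        throw new IOException(e);
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(sleepMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while waiting to retry", ie);
        }
      }
    }
    throw new IOException("Failed after " + maxAttempts + " attempts", last);
  }
}
```

With a loop shaped like this, a caller that hits a closed connection would give up after the first attempt rather than working through all 46, which is the behavior we want during shutdown.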