Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.0.0-beta-1
-
None
-
Reviewed
Description
TestCoprocessorWhitelistMasterObserver is failing 33% of the time. In the logs it looks like the failure is related to Master initialization.
Following log is from https://builds.apache.org/job/HBase%20Nightly/job/branch-2/203
2018-01-26 02:36:36,686 WARN [M:0;1f0c4777c1ba:35049] master.TableNamespaceManager(307): Caught exception in initializing namespace table manager org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x18cd2ac8 closed at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:722) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:714) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:684) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.ConnectionImplementation.getRegionLocation(ConnectionImplementation.java:562) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getRegionLocation(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:73) at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223) at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:388) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:362) at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141) at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:281) at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103) at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62) at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226) at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1059) at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:921) at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2034) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:553) at java.lang.Thread.run(Thread.java:748) 2018-01-26 02:36:36,691 ERROR [M:0;1f0c4777c1ba:35049] helpers.MarkerIgnoringBase(159): Failed to become active master java.lang.IllegalStateException: Expected the service ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service has FAILED at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345) at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291) at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1061) at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:921) at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2034) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:553) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x18cd2ac8 closed at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:722) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:714) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:684) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.ConnectionImplementation.getRegionLocation(ConnectionImplementation.java:562) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getRegionLocation(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:73) at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223) at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:388) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:362) at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141) at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:281) at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103) at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62) at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226) at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1059) ... 4 more 2018-01-26 02:36:36,691 ERROR [M:0;1f0c4777c1ba:35049] helpers.MarkerIgnoringBase(143): Master server abort: loaded coprocessors are: [org.apache.hadoop.hbase.security.access.TestCoprocessorWhitelistMasterObserver$TestRegionObserver, org.apache.hadoop.hbase.security.access.CoprocessorWhitelistMasterObserver, org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint] 2018-01-26 02:36:36,691 ERROR [M:0;1f0c4777c1ba:35049] helpers.MarkerIgnoringBase(159): ***** ABORTING master 1f0c4777c1ba,35049,1516934184742: Unhandled exception. Starting shutdown. ***** java.lang.IllegalStateException: Expected the service ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service has FAILED at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345) at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291) at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1061) at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:921) at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2034) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:553) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x18cd2ac8 closed at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:722) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:714) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:684) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.ConnectionImplementation.getRegionLocation(ConnectionImplementation.java:562) at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getRegionLocation(ConnectionUtils.java:131) at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:73) at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223) at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:388) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:362) at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141) at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:281) at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103) at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62) at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226) at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1059) ... 4 more 2018-01-26 02:36:36,692 DEBUG [M:0;1f0c4777c1ba:35049] coprocessor.CoprocessorHost(289): Stop coprocessor org.apache.hadoop.hbase.security.access.CoprocessorWhitelistMasterObserver 2018-01-26 02:36:36,692 WARN [M:0;1f0c4777c1ba:35049] coprocessor.BaseEnvironment(99): Not stopping coprocessor org.apache.hadoop.hbase.security.access.CoprocessorWhitelistMasterObserver because not active (state=STOPPED) 2018-01-26 02:36:36,692 INFO [M:0;1f0c4777c1ba:35049] regionserver.HRegionServer(2142): ***** STOPPING region server '1f0c4777c1ba,35049,1516934184742' *****
Attachments
Attachments
Issue Links
- is a child of
-
HBASE-19878 Fix the timeout UTs introduced by HBASE-19873
- Resolved
psomogyi How'd you extract the exception? Going about nightlies, I was unable to get good log. The exception seems to come after the log run at
02:36:36,686 (Log here ends at https://builds.apache.org/job/HBase%20Nightly/job/branch-2/203/testReport/junit/org.apache.hadoop.hbase.security.access/TestCoprocessorWhitelistMasterObserver/org_apache_hadoop_hbase_security_access_TestCoprocessorWhitelistMasterObserver/ 2018-01-26 02:36:36,440)
I can't make the test fail locally which I'm guessing is what you are finding.
(I was looking to see why we are not archiving the surefire test run output seemingly. It looks like the flags are in place but i can't find the raw emissions. Need to dig more).
So, the odd thing about this test is conf.setInt("hbase.client.retries.number", 1) (I think). If stuff is slow around startup, we'll fail our one attempt. Its happening in
ClusterSchemaServiceImpl which is cast as a Guava Service. It is being started async. Are we stuck in
isTableAvailableAndInitialized failing our one attempt over and over. I can't tell. Or is it that Master fails to come up and then we are just stuck in mini cluster startup trying to scan a meta that is not coming? Let me try some debug and up the retries to 5.... Test should still fail fast....