Details
- Type: Sub-task
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Reviewed
Description
The error message is a bit strange:
[ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) Time elapsed: 20.627 s <<< ERROR!
org.apache.hadoop.hbase.TableNotFoundException: Region of 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the table of 'hbase:namespace'. hbase:meta might be damaged.
It fails for both FSHLog and AsyncFSWAL. Need to dig more.
Attachments
- HBASE-19554.patch (8 kB, Duo Zhang)
- HBASE-19554-thread-dump.patch (2 kB, Duo Zhang)
Issue Links
- is broken by:
  - HBASE-20037 Race when calling SequenceIdAccounting.resetHighest (Resolved)
  - HBASE-20020 Make sure we throw DoNotRetryIOException when ConnectionImplementation is closed (Resolved)
  - HBASE-19976 Dead lock if the worker threads in procedure executor are exhausted (Closed)
- is related to:
  - HBASE-19787 Fix or disable tests broken in branch-2 so can cut beta-1 (Resolved)
Activity
Will commit this patch first to see if there is any difference in later pre-commit results.
-1 overall
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 2m 21s | Docker mode activated. |
Prechecks | | | |
0 | findbugs | 0m 0s | Findbugs executables are not available. |
+1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
master Compile Tests | | | |
+1 | mvninstall | 4m 38s | master passed |
+1 | compile | 0m 50s | master passed |
+1 | checkstyle | 1m 10s | master passed |
+1 | shadedjars | 5m 49s | branch has no errors when building our shaded downstream artifacts. |
+1 | javadoc | 0m 28s | master passed |
Patch Compile Tests | | | |
+1 | mvninstall | 4m 42s | the patch passed |
+1 | compile | 0m 44s | the patch passed |
+1 | javac | 0m 44s | the patch passed |
-1 | checkstyle | 1m 5s | hbase-server: The patch generated 2 new + 269 unchanged - 2 fixed = 271 total (was 271) |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedjars | 4m 39s | patch has no errors when building our shaded downstream artifacts. |
+1 | hadoopcheck | 19m 20s | Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. |
+1 | javadoc | 0m 27s | the patch passed |
Other Tests | | | |
-1 | unit | 105m 24s | hbase-server in the patch failed. |
+1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | 146m 18s | |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hbase.regionserver.wal.TestLogRolling |
Subsystem | Report/Notes |
---|---|
Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12902859/HBASE-19554.patch |
Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
uname | Linux 091eb3fa7b75 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux |
Build tool | maven |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
git revision | master / 03e79b7994 |
maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
Default Java | 1.8.0_151 |
checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/10557/artifact/patchprocess/diff-checkstyle-hbase-server.txt |
unit | https://builds.apache.org/job/PreCommit-HBASE-Build/10557/artifact/patchprocess/patch-unit-hbase-server.txt |
Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/10557/testReport/ |
modules | C: hbase-server U: hbase-server |
Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/10557/console |
Powered by | Apache Yetus 0.6.0 http://yetus.apache.org |
This message was automatically generated.
I pushed these for you so you have debug info when you get up, Apache9.
The failure was TestLogRolling... but there are timeouts too. Random. We seem to be inching into fun test timeout territory again. Will have to dig in...
FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4262 (See https://builds.apache.org/job/HBase-Trunk_matrix/4262/)
HBASE-19554 For debug: Modify HTU.waitUntilAllRegionsAssigned to handle (stack: rev 084e324fb9f769f221236e31aa97382d2677f911)
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/AbstractTestDLS.java
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
Checked recent pre-commit builds; they seem much better. And here is a failure case.
The error message:
[ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 157.505 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL
[ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) Time elapsed: 49.462 s <<< ERROR!
java.lang.RuntimeException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=11, exceptions:
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:59 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:00 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:02 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:06 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:16 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:26 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:36 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=11, exceptions:
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:19:59 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:00 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:02 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:06 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:16 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:26 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Thu Dec 21 10:20:36 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Caused by: java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Caused by: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
Caused by: java.net.ConnectException: Connection refused
The test output XML cannot be generated, which makes it really hard to find out the real problem...
[ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 406.921 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL
[ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) Time elapsed: 301.077 s <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 300000 milliseconds
[ERROR] Errors:
[ERROR] TestDLSAsyncFSWAL>AbstractTestDLS.testThreeRSAbort:401->Object.wait:460->Object.wait:-2 » TestTimedOut
I disabled TestDLSAsyncFSWAL and TestDLSFSHLog so I can cut the beta-1. They fail 40% and 33% of the time, respectively, on branch-2. Disabled them in master too... since once fixed, we can re-enable them in both places. Will be back to help out. Just disabling so we can push out the beta. Thanks.
FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4528 (See https://builds.apache.org/job/HBase-Trunk_matrix/4528/)
HBASE-19554 reenable TestDLSAsyncFSWAL/TestDLSFSHLog for debugging (zhangduo: rev 88d6e06a1fe4130b96e0e9f0e2c2b189f0b6affd)
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDLSFSHLog.java
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDLSAsyncFSWAL.java
I guess the problem is related to meta reassignment. zghaobac has already found something strange in AMv2 in HBASE-19965, so let's wait for the result of his digging.
And I found something strange in the output:
java.lang.AssertionError
    at org.apache.hadoop.hbase.wal.WALKeyImpl.getWriteEntry(WALKeyImpl.java:82)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:159)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:132)
    at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:97)
    at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1103)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1615)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1437)
    at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
The assertion is:
public MultiVersionConcurrencyControl.WriteEntry getWriteEntry() throws InterruptedIOException {
  assert this.writeEntry != null;
  return this.writeEntry;
}
I think the problem was introduced by HBASE-19929. Since we have closed the WAL directly, it is possible that a WAL.append fails without assigning an mvcc writeEntry. Let me file an issue to fix this. (A sketch of the failure mode follows.)
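To illustrate the failure mode, here is a minimal hypothetical sketch (the class and its methods are made up for illustration and are not the actual WALKeyImpl/WAL code): if the append fails before a write entry has been attached to the key, any later call to getWriteEntry trips the assertion.

```java
// Hypothetical sketch of the failure mode; names are illustrative only.
class WalKeySketch {
  private Object writeEntry; // stands in for MultiVersionConcurrencyControl.WriteEntry

  Object getWriteEntry() {
    // Mirrors the assertion seen in the stack trace above.
    assert this.writeEntry != null;
    return this.writeEntry;
  }

  void append(boolean walAlreadyClosed) {
    if (walAlreadyClosed) {
      // The append fails before any mvcc write entry is assigned, so a caller
      // that later asks for the write entry (for example when writing the
      // region close marker) hits the AssertionError.
      throw new IllegalStateException("WAL is already closed");
    }
    this.writeEntry = new Object();
  }
}
```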
2018-02-12 04:56:20,895 DEBUG [PEWorker-16] procedure.ServerCrashProcedure(192): pid=132, state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure server=asf911.gq1.ygridcore.net,41715,1518411358686, splitWal=true, meta=true; Processing hbase:meta that was on asf911.gq1.ygridcore.net,41715,1518411358686
2018-02-12 04:56:20,895 INFO [PEWorker-16] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=135, ppid=132, state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure failedMetaServer=asf911.gq1.ygridcore.net,41715,1518411358686, splitWal=true}]
Then there is no progress, so eventually we time out. Let me add a thread dump when we are about to time out to see if we can find something (a rough sketch of such a dump is below).
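For reference, a rough sketch of what such a dump could look like (a hypothetical helper, not the attached patch): print every live thread's stack right before the wait gives up, so a hang leaves evidence in the test output.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Hypothetical helper: dump all thread stacks to stderr so a hung test leaves
// some evidence even when surefire later kills the JVM.
final class ThreadDumpSketch {
  static void dumpThreads(String reason) {
    System.err.println("Thread dump requested: " + reason);
    for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
      System.err.print(info.toString());
    }
  }
}
```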
After revisiting the log, I think I found a possible deadlock in recovery.
It is very simple: we only have 16 threads to run procedures. During a failover, each region leads to one AssignProcedure, and if the meta region is also offline, every AssignProcedure gets stuck at AssignProcedure.startTransition, where we need to read something from the meta region. So if the RecoverMetaProcedure cannot be picked up by one of the worker threads before all of them are stuck at AssignProcedure.startTransition, we deadlock...
So I think we should have a dedicated worker thread that only executes meta-related operations. If a TableProcedureInterface targets the meta table, we should insert it into a special queue and let the dedicated worker run it immediately (see the sketch below).
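A minimal sketch of that idea, assuming a simplified scheduler rather than the real ProcedureExecutor: meta-table procedures get their own worker, so they can always make progress even when the shared pool is exhausted.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified, hypothetical scheduler: ordinary procedures share a bounded pool,
// while meta-table procedures go to a dedicated single worker.
class MetaAwareProcedureScheduler {
  private final ExecutorService sharedWorkers = Executors.newFixedThreadPool(16);
  private final ExecutorService metaWorker = Executors.newSingleThreadExecutor();

  void submit(Runnable procedure, boolean targetsMetaTable) {
    if (targetsMetaTable) {
      // RecoverMetaProcedure-like work bypasses the shared pool, so it still
      // runs even when all 16 shared workers are blocked waiting on meta.
      metaWorker.execute(procedure);
    } else {
      sharedWorkers.execute(procedure);
    }
  }
}
```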
Let me open a new issue to address it.
Thanks.
FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4571 (See https://builds.apache.org/job/HBase-Trunk_matrix/4571/)
HBASE-19554 Dump thread info before failure (zhangduo: rev 8ff783f76d370199b81a24c48feaad1aa980285e)
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/AbstractTestDLS.java
A different problem.
Now we time out when running testRecoveredEdits.
"Time-limited test" daemon prio=5 tid=23 timed_waiting java.lang.Thread.State: TIMED_WAITING at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406) at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1378) at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:677) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.close(RecoverableZooKeeper.java:711) at org.apache.hadoop.hbase.zookeeper.ZKWatcher.close(ZKWatcher.java:596) at org.apache.hadoop.hbase.master.AbstractTestDLS.testRecoveredEdits(AbstractTestDLS.java:251) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748)
Let me dig more.
OK, the problem is testMasterStartsUpWithLogSplittingWork; it takes nearly 13 minutes to finish:
2018-02-20 02:29:56,586 INFO [Time-limited test] hbase.ResourceChecker(148): before: master.TestDLSAsyncFSWAL#testMasterStartsUpWithLogSplittingWork Thread=133, OpenFileDescriptor=893, MaxFileDescriptor=60000, SystemLoadAverage=819, ProcessCount=273, AvailableMemoryMB=3881
2018-02-20 02:42:38,021 INFO [Time-limited test] hbase.ResourceChecker(172): after: master.TestDLSAsyncFSWAL#testMasterStartsUpWithLogSplittingWork Thread=233 (was 133)
Let me dig more.
I think this is a problem we have already addressed? We have already closed the ConnectionImplementation from outside, so the HMaster initialization should quit immediately. But it seems not; it keeps retrying on the closed connection... Let me check what's wrong here...
2018-02-20 02:31:12,578 DEBUG [M:0;asf911:38379] client.RpcRetryingCallerImpl(132): Call exception, tries=11, retries=11, started=48515 ms ago, cancelled=false, msg=hconnection-0x195e9495 closed, details=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1519093804365.3ff104b43ef4ce0a491e1f26b598813e., hostname=asf911.gq1.ygridcore.net,51776,1519093799312, seqNum=2, exception=java.io.IOException: hconnection-0x195e9495 closed
    at org.apache.hadoop.hbase.client.ConnectionImplementation.getTableState(ConnectionImplementation.java:1959)
    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getTableState(ConnectionUtils.java:131)
    at org.apache.hadoop.hbase.client.ConnectionImplementation.isTableDisabled(ConnectionImplementation.java:573)
    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.isTableDisabled(ConnectionUtils.java:131)
    at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:219)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:388)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:362)
    at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
    at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
    at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62)
    at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
    at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1053)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:919)
    at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2017)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:559)
    at java.lang.Thread.run(Thread.java:748)
But it seems not, it keeps retrying on the closed connection... Let me check what's wrong here...
Yes. I thought this was addressed because we explicitly 'close' the Master Connection on shutdown as of about a week or so ago. But it looks like you figured out that we need more than that – the check that the Connection is closed. Good one.
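A rough sketch of the direction the linked HBASE-20020 describes (simplified; not the actual ConnectionImplementation code): once the connection is closed, fail with a non-retriable exception so RpcRetryingCaller gives up immediately instead of burning through all retry attempts.

```java
import org.apache.hadoop.hbase.DoNotRetryIOException;

// Simplified sketch: a closed connection should fail fast with a non-retriable
// exception rather than a plain IOException that the retrying caller keeps retrying.
class ConnectionStateSketch {
  private volatile boolean closed;

  void checkClosed() throws DoNotRetryIOException {
    if (closed) {
      throw new DoNotRetryIOException("connection is closed");
    }
  }
}
```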
java.lang.AssertionError: expected:<10000> but was:<9825>
A bit strange: it looks like we roll the WAL with the same name twice...
2018-02-20 16:33:41,128 INFO [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(690): Rolled WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421057 with entries=614, filesize=124.20 KB; new WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421109
2018-02-20 16:33:41,164 INFO [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(690): Rolled WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421109 with entries=789, filesize=124.20 KB; new WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
Will dig more.
Oh, they are not the same... But it is still a bit strange that the file sizes are the same while the entry counts are different...
OK, the numEntries is incorrect; I will file an issue to address it. Not critical, just logging.
And I think this is the problem:
2018-02-20 16:33:41,183 INFO [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(690): Rolled WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143 with entries=0, filesize=35.46 KB; new WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421171
2018-02-20 16:33:41,184 DEBUG [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(757): Create new FSHLog writer with pipeline: [DatanodeInfoWithStorage[127.0.0.1:60898,DS-fe8309dd-a316-4ef5-a474-7167557c5c76,DISK], DatanodeInfoWithStorage[127.0.0.1:53328,DS-0f9fbb3d-3cd1-47ea-9867-52f1ecac6f19,DISK], DatanodeInfoWithStorage[127.0.0.1:57523,DS-fc2097ed-4d98-40d7-a4a0-adfd31a5bf68,DISK]]
2018-02-20 16:33:41,184 INFO [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(652): Archiving hdfs://localhost:55625/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143 to hdfs://localhost:55625/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/oldWALs/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
Here we move the WAL file to oldWALs right after rolling. We write to the WAL directly in the test, so this should not happen. Let me dig more.
Oh, maybe it is the same problem as numEntries. We should also reset sequenceIdAccounting before processing new requests (a sketch of the ordering is below). Let me file an issue to address it.
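A rough sketch of the ordering concern (hypothetical names; not the real AbstractFSWAL or SequenceIdAccounting code): the per-file accounting has to be reset inside the roll, before any new request is accounted against the new file, otherwise the fresh WAL's bookkeeping can be wrong and the file can look immediately eligible for archiving.

```java
// Hypothetical sketch: the per-file accounting is reset in the same critical
// section that installs the new writer, so no append can race ahead of the reset.
class WalRollOrderingSketch {
  private final Object updateLock = new Object();
  private long highestUnsyncedSequenceId;
  private long numEntries;

  void rollWriter() {
    synchronized (updateLock) {
      // ... swap the old writer for a new one here ...
      // Reset accounting BEFORE new requests are processed against the new file.
      highestUnsyncedSequenceId = 0;
      numEntries = 0;
    }
  }

  void append(long sequenceId) {
    synchronized (updateLock) {
      numEntries++;
      highestUnsyncedSequenceId = Math.max(highestUnsyncedSequenceId, sequenceId);
      // ... write the entry ...
    }
  }
}
```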
All subtasks are resolved. This family of tests no longer shows up on the flakies list. Nice work, Apache9.
The message means that we do not have the records of the given table in the meta table. And I found that HTU.waitUntilAllRegionsAssigned will still return in this case. Let me modify HTU.waitUntilAllRegionsAssigned first so it does not return in this case (a sketch of the check follows).
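A rough sketch of the stricter wait condition, assuming a simplified helper rather than the real HBaseTestingUtility code: having no region rows for the table in meta should mean "keep waiting", not "done".

```java
import java.util.List;

// Hypothetical sketch of the stricter check: zero rows for the table in meta
// means "not assigned yet", so the wait keeps polling (and eventually times out)
// instead of returning early.
class AssignmentWaitSketch {
  static class MetaRegionRow {
    String tableName;
    String serverName; // null while the region is unassigned
  }

  static boolean allRegionsAssigned(String expectedTable, List<MetaRegionRow> rowsForTable) {
    if (rowsForTable.isEmpty()) {
      return false; // do not return early when meta has no records for the table
    }
    for (MetaRegionRow row : rowsForTable) {
      if (!expectedTable.equals(row.tableName) || row.serverName == null) {
        return false;
      }
    }
    return true;
  }
}
```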