  HBase / HBASE-19147 All branch-2 unit tests pass / HBASE-19554

AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0-beta-2, 2.0.0
    • Component/s: Recovery, wal
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt

      The error message is a bit strange:

      [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) Time elapsed: 20.627 s <<< ERROR!
      org.apache.hadoop.hbase.TableNotFoundException: Region of 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the table of 'hbase:namespace'. hbase:meta might be damaged.

      It fails for both FSHLog and AsyncFSWAL. Need to dig more.

      Attachments

        1. HBASE-19554.patch
          8 kB
          Duo Zhang
        2. HBASE-19554-thread-dump.patch
          2 kB
          Duo Zhang


          Activity

            zhangduo Duo Zhang added a comment -

            The message means that hbase:meta has no records for the given table. I also found that HTU.waitUntilAllRegionsAssigned will still return in this case. Let me modify HTU.waitUntilAllRegionsAssigned first so that it does not return in this case.
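
            For illustration only, a minimal, self-contained sketch of the wait semantics described above. It is not the actual HBaseTestingUtility code; MetaView and its methods are hypothetical stand-ins for a scan of hbase:meta. The point is that an empty result for the table no longer counts as "assigned".

            import java.util.concurrent.TimeoutException;

            final class WaitUntilAssignedSketch {

              /** Hypothetical read-only view of hbase:meta, used only for this illustration. */
              interface MetaView {
                int regionRowsFor(String table);      // region rows in meta that belong to the table
                int assignedRegionsFor(String table); // those rows that also carry a server location
              }

              static void waitUntilAllRegionsAssigned(MetaView meta, String table, long timeoutMs)
                  throws InterruptedException, TimeoutException {
                long deadline = System.currentTimeMillis() + timeoutMs;
                while (System.currentTimeMillis() < deadline) {
                  int rows = meta.regionRowsFor(table);
                  // Zero rows means meta does not know the table yet, so keep waiting
                  // instead of returning as if everything were already assigned.
                  if (rows > 0 && meta.assignedRegionsFor(table) == rows) {
                    return;
                  }
                  Thread.sleep(200);
                }
                throw new TimeoutException("Regions of " + table + " not assigned within " + timeoutMs + " ms");
              }
            }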

            zhangduo Duo Zhang added a comment -

            Will commit this patch first to see if there is any difference in later pre-commit results.

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 2m 21s Docker mode activated.
                  Prechecks
            0 findbugs 0m 0s Findbugs executables are not available.
            +1 hbaseanti 0m 0s Patch does not have any anti-patterns.
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
                  master Compile Tests
            +1 mvninstall 4m 38s master passed
            +1 compile 0m 50s master passed
            +1 checkstyle 1m 10s master passed
            +1 shadedjars 5m 49s branch has no errors when building our shaded downstream artifacts.
            +1 javadoc 0m 28s master passed
                  Patch Compile Tests
            +1 mvninstall 4m 42s the patch passed
            +1 compile 0m 44s the patch passed
            +1 javac 0m 44s the patch passed
            -1 checkstyle 1m 5s hbase-server: The patch generated 2 new + 269 unchanged - 2 fixed = 271 total (was 271)
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedjars 4m 39s patch has no errors when building our shaded downstream artifacts.
            +1 hadoopcheck 19m 20s Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0.
            +1 javadoc 0m 27s the patch passed
                  Other Tests
            -1 unit 105m 24s hbase-server in the patch failed.
            +1 asflicense 0m 19s The patch does not generate ASF License warnings.
            146m 18s



            Reason Tests
            Failed junit tests hadoop.hbase.regionserver.wal.TestLogRolling



            Subsystem Report/Notes
            Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01
            JIRA Issue HBASE-19554
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12902859/HBASE-19554.patch
            Optional Tests asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
            uname Linux 091eb3fa7b75 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux
            Build tool maven
            Personality /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
            git revision master / 03e79b7994
            maven version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z)
            Default Java 1.8.0_151
            checkstyle https://builds.apache.org/job/PreCommit-HBASE-Build/10557/artifact/patchprocess/diff-checkstyle-hbase-server.txt
            unit https://builds.apache.org/job/PreCommit-HBASE-Build/10557/artifact/patchprocess/patch-unit-hbase-server.txt
            Test Results https://builds.apache.org/job/PreCommit-HBASE-Build/10557/testReport/
            modules C: hbase-server U: hbase-server
            Console output https://builds.apache.org/job/PreCommit-HBASE-Build/10557/console
            Powered by Apache Yetus 0.6.0 http://yetus.apache.org

            This message was automatically generated.

            stack Michael Stack added a comment -

            I pushed these for you so you have debug output when you get up, Apache9.

            The failure was TestLogRolling... but there are timeouts too. Random. We seem to be inching into fun test timeout territory again. Will have to dig in...

            hudson Hudson added a comment -

            FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4262 (See https://builds.apache.org/job/HBase-Trunk_matrix/4262/)
            HBASE-19554 For debug: Modify HTU.waitUntilAllRegionsAssigned to handle (stack: rev 084e324fb9f769f221236e31aa97382d2677f911)

            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/AbstractTestDLS.java
            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
            zhangduo Duo Zhang added a comment - - edited

            Checked recent pre-commit builds; things seem much better. And here is a failure case:

            https://builds.apache.org/job/PreCommit-HBASE-Build/10609/artifact/patchprocess/patch-unit-hbase-server.txt

            The error message:

            [ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 157.505 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL
            [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL)  Time elapsed: 49.462 s  <<< ERROR!
            java.lang.RuntimeException: 
            org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=11, exceptions:
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:59 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:00 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:02 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:06 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:16 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:26 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:36 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            
            Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: 
            Failed after attempts=11, exceptions:
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:59 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:00 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:02 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:06 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:16 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:26 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:36 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            
            Caused by: java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Caused by: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Caused by: java.net.ConnectException: Connection refused
            

            The test output XML cannot be generated, which makes it really hard to find the real problem...

            zhangduo Duo Zhang added a comment -

            https://builds.apache.org/job/PreCommit-HBASE-Build/10676/artifact/patchprocess/patch-unit-hbase-server.txt

            [ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 406.921 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL
            [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL)  Time elapsed: 301.077 s  <<< ERROR!
            org.junit.runners.model.TestTimedOutException: test timed out after 300000 milliseconds
            
            [ERROR] Errors: 
            [ERROR]   TestDLSAsyncFSWAL>AbstractTestDLS.testThreeRSAbort:401->Object.wait:460->Object.wait:-2 » TestTimedOut
            
            zhangduo Duo Zhang added a comment -

            Still failed in countRows. Could there be a problem with the assignments?

            stack Michael Stack added a comment -

            I disabled TestDLSAsyncFSWAL and TestDLSFSHLog so I can cut the beta-1. They fail 40% and 33% respectively on branch-2. Disabled them in master too... since once fixed, we can re-enable them in both places. Will be back to help out. Just disabling so we can push out the beta. Thanks.

            zhangduo Duo Zhang added a comment -

            Reenable for debugging.

            hudson Hudson added a comment -

            FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4528 (See https://builds.apache.org/job/HBase-Trunk_matrix/4528/)
            HBASE-19554 reenable TestDLSAsyncFSWAL/TestDLSFSHLog for debugging (zhangduo: rev 88d6e06a1fe4130b96e0e9f0e2c2b189f0b6affd)

            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDLSFSHLog.java
            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDLSAsyncFSWAL.java
            zhangduo Duo Zhang added a comment -

            I guess the problem is related to meta reassignment. zghaobac has already found something strange in AMv2 in HBASE-19965, so let's wait for his digging results.

            And I found something strange in the output:

            java.lang.AssertionError
            	at org.apache.hadoop.hbase.wal.WALKeyImpl.getWriteEntry(WALKeyImpl.java:82)
            	at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:159)
            	at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:132)
            	at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:97)
            	at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1103)
            	at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1615)
            	at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1437)
            	at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
            	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at java.lang.Thread.run(Thread.java:748)
            

            The assertion is:

              public MultiVersionConcurrencyControl.WriteEntry getWriteEntry() throws InterruptedIOException {
                assert this.writeEntry != null;
                return this.writeEntry;
              }
            

            I think the problem was introduced by HBASE-19929. Since we close the WAL directly, it is possible that a WAL.append fails without assigning an mvcc writeEntry. Let me file an issue to fix this.
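
            As an illustration of the failure mode only (hypothetical Wal/WalKey/WriteEntry types, not the real WALKeyImpl or WALUtil), the sketch below only touches the write entry when the append actually succeeded; an append that fails because the WAL was closed underneath us never assigns one, and calling getWriteEntry() then is exactly the AssertionError in the stack above.

            import java.io.IOException;

            final class MarkerAppendSketch {

              static final class WriteEntry { }

              /** Hypothetical stand-in for WALKeyImpl. */
              static final class WalKey {
                private WriteEntry writeEntry; // assigned only once the append has been accepted

                WriteEntry getWriteEntry() {
                  assert writeEntry != null;   // the assertion seen in the stack trace above
                  return writeEntry;
                }

                void setWriteEntry(WriteEntry e) { this.writeEntry = e; }
                boolean hasWriteEntry() { return writeEntry != null; }
              }

              /** Hypothetical stand-in for the WAL; append fails if the WAL was closed directly. */
              interface Wal {
                void append(WalKey key) throws IOException;
              }

              static void writeMarker(Wal wal, WalKey key) throws IOException {
                boolean appended = false;
                try {
                  wal.append(key);
                  appended = true;
                } finally {
                  // Only read the write entry on the success path; a failed append may
                  // never have assigned one, so asserting on it would blow up here.
                  if (appended && key.hasWriteEntry()) {
                    key.getWriteEntry(); // ... complete the MVCC transaction with it here ...
                  }
                }
              }
            }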

            zhangduo Duo Zhang added a comment -

            https://builds.apache.org/job/HBASE-Flaky-Tests/25832/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestDLSFSHLog-output.txt/*view*/

            2018-02-12 04:56:20,895 DEBUG [PEWorker-16] procedure.ServerCrashProcedure(192): pid=132, state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure server=asf911.gq1.ygridcore.net,41715,1518411358686, splitWal=true, meta=true; Processing hbase:meta that was on asf911.gq1.ygridcore.net,41715,1518411358686
            2018-02-12 04:56:20,895 INFO  [PEWorker-16] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=135, ppid=132, state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure failedMetaServer=asf911.gq1.ygridcore.net,41715,1518411358686, splitWal=true}]
            

            Then there is no progress, so eventually we time out. Let me add a thread dump when we are about to time out to see if we can find something.
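
            A small, self-contained sketch (not the attached HBASE-19554-thread-dump.patch) of how a test can dump all thread stacks shortly before it would time out, using only standard JDK APIs, so the hung worker shows up in the console log even when no surefire XML is written:

            import java.lang.management.ManagementFactory;
            import java.lang.management.ThreadInfo;
            import java.lang.management.ThreadMXBean;

            final class ThreadDumpSketch {

              static void dumpAllThreads() {
                ThreadMXBean bean = ManagementFactory.getThreadMXBean();
                for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
                  System.err.print(info);
                }
              }

              /** Runs the body, printing a thread dump if it is still running near the deadline. */
              static void runWithDumpBeforeTimeout(Runnable body, long timeoutMs) throws InterruptedException {
                Thread worker = new Thread(body, "test-body");
                worker.start();
                // Leave a margin so the dump lands in the log before the JUnit timeout fires.
                worker.join(Math.max(0, timeoutMs - 5_000));
                if (worker.isAlive()) {
                  dumpAllThreads();
                }
                worker.join();
              }
            }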

            zhangduo Duo Zhang added a comment -

            What I pushed to master and branch-2.

            zhangduo Duo Zhang added a comment -

            After revisiting the log, I think I found a possible deadlock in recovery.

            It is very simple: we only have 16 threads to run the procedures. During a failover, each region leads to one AssignProcedure, and if the meta region is also offline, every AssignProcedure will be stuck at AssignProcedure.startTransition, where we need to read something from the meta region. So if the RecoverMetaProcedure cannot be picked up by one of the worker threads before all of them are stuck at AssignProcedure.startTransition, then deadlock...

            So I think we should have a special worker thread which only executes meta-related operations. If a TableProcedureInterface is for the meta table, we should insert it into a special queue and let the special worker run it immediately (a minimal sketch of this routing idea follows below).
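
            A minimal sketch of that routing idea (hypothetical Procedure type and isMetaProcedure flag; this is not the real ProcedureExecutor or scheduler): meta-related work goes to its own single-threaded queue so it can never be starved by the shared pool, whose 16 workers may all be blocked waiting for meta to come back online.

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;

            final class MetaAwareSchedulerSketch {

              interface Procedure extends Runnable {
                /** Hypothetical flag: does this procedure operate on hbase:meta? */
                boolean isMetaProcedure();
              }

              private final ExecutorService sharedWorkers = Executors.newFixedThreadPool(16);
              private final ExecutorService metaWorker = Executors.newSingleThreadExecutor();

              void submit(Procedure proc) {
                // Meta procedures (e.g. RecoverMetaProcedure) bypass the shared pool, so a
                // pile of AssignProcedures blocked on meta cannot deadlock recovery.
                if (proc.isMetaProcedure()) {
                  metaWorker.execute(proc);
                } else {
                  sharedWorkers.execute(proc);
                }
              }

              void shutdown() {
                sharedWorkers.shutdown();
                metaWorker.shutdown();
              }
            }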

            Let me open a new issue to address it.

            Thanks.

            hudson Hudson added a comment -

            FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4571 (See https://builds.apache.org/job/HBase-Trunk_matrix/4571/)
            HBASE-19554 Dump thread info before failure (zhangduo: rev 8ff783f76d370199b81a24c48feaad1aa980285e)

            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/AbstractTestDLS.java
            zhangduo Duo Zhang added a comment -

            A different problem:

            https://builds.apache.org/job/HBASE-Flaky-Tests/26171/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL-output.txt/*view*/

            Now we time out when running testRecoveredEdits.

            "Time-limited test" daemon prio=5 tid=23 timed_waiting
            java.lang.Thread.State: TIMED_WAITING
                    at java.lang.Object.wait(Native Method)
                    at java.lang.Object.wait(Object.java:502)
                    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406)
                    at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1378)
                    at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:677)
                    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.close(RecoverableZooKeeper.java:711)
                    at org.apache.hadoop.hbase.zookeeper.ZKWatcher.close(ZKWatcher.java:596)
                    at org.apache.hadoop.hbase.master.AbstractTestDLS.testRecoveredEdits(AbstractTestDLS.java:251)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                    at java.lang.reflect.Method.invoke(Method.java:498)
                    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
                    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
                    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
                    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
                    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
                    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
                    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
                    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
                    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
                    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
                    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
                    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
                    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
                    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
                    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
                    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
                    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
                    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
                    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
                    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
                    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                    at java.lang.Thread.run(Thread.java:748)
            

            Let me dig more.

            zhangduo Duo Zhang added a comment -

            OK, the problem is testMasterStartsUpWithLogSplittingWork; it takes nearly 13 minutes to finish:

            2018-02-20 02:29:56,586 INFO  [Time-limited test] hbase.ResourceChecker(148): before: master.TestDLSAsyncFSWAL#testMasterStartsUpWithLogSplittingWork Thread=133, OpenFileDescriptor=893, MaxFileDescriptor=60000, SystemLoadAverage=819, ProcessCount=273, AvailableMemoryMB=3881
            
            2018-02-20 02:42:38,021 INFO  [Time-limited test] hbase.ResourceChecker(172): after: master.TestDLSAsyncFSWAL#testMasterStartsUpWithLogSplittingWork Thread=233 (was 133)
            

            Let me dig more.

            zhangduo Duo Zhang added a comment -

            I think this is a problem we have already addressed? We have already closed the ConnectionImplementation from outside, so the initialization of HMaster should quit immediately. But it seems not, it keeps retrying on the closed connection... Let me check what's wrong here...

            2018-02-20 02:31:12,578 DEBUG [M:0;asf911:38379] client.RpcRetryingCallerImpl(132): Call exception, tries=11, retries=11, started=48515 ms ago, cancelled=false, msg=hconnection-0x195e9495 closed, details=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1519093804365.3ff104b43ef4ce0a491e1f26b598813e., hostname=asf911.gq1.ygridcore.net,51776,1519093799312, seqNum=2, exception=java.io.IOException: hconnection-0x195e9495 closed
                    at org.apache.hadoop.hbase.client.ConnectionImplementation.getTableState(ConnectionImplementation.java:1959)
                    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getTableState(ConnectionUtils.java:131)
                    at org.apache.hadoop.hbase.client.ConnectionImplementation.isTableDisabled(ConnectionImplementation.java:573)
                    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.isTableDisabled(ConnectionUtils.java:131)
                    at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:219)
                    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
                    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:388)
                    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:362)
                    at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
                    at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
                    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
                    at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62)
                    at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
                    at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1053)
                    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:919)
                    at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2017)
                    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:559)
                    at java.lang.Thread.run(Thread.java:748)
            
            stack Michael Stack added a comment -

            But it seems not, it keeps retrying on the closed connection... Let me check what's wrong here...

            Yes. I thought this was addressed because we explicitly 'close' the Master Connection on shutdown as of about a week or so ago. But it looks like you figured out that we need more than that: a check that the Connection is closed. Good one.
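
            For illustration, a minimal sketch of that "check the Connection is closed" idea (hypothetical ClusterConnection interface, not the real RpcRetryingCallerImpl): the retry loop checks the connection before every attempt and fails fast instead of burning through all eleven retries against a connection that was shut down from outside.

            import java.io.IOException;
            import java.util.concurrent.Callable;

            final class RetryOnLiveConnectionSketch {

              interface ClusterConnection {
                boolean isClosed();
              }

              static <T> T callWithRetries(ClusterConnection conn, Callable<T> call,
                  int maxAttempts, long pauseMs) throws Exception {
                IOException last = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                  if (conn.isClosed()) {
                    // The connection was closed from outside (e.g. master shutdown):
                    // retrying cannot succeed, so abort immediately.
                    throw new IOException("connection closed, aborting retries", last);
                  }
                  try {
                    return call.call();
                  } catch (IOException e) {
                    last = e;
                    Thread.sleep(pauseMs);
                  }
                }
                throw new IOException("Failed after attempts=" + maxAttempts, last);
              }
            }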

            zhangduo Duo Zhang added a comment - - edited

            https://builds.apache.org/job/HBASE-Flaky-Tests-branch2.0/2393/testReport/junit/org.apache.hadoop.hbase.master/TestDLSFSHLog/testRecoveredEdits/

            java.lang.AssertionError: expected:<10000> but was:<9825>
            zhangduo Duo Zhang added a comment -

            A bit strange: we roll the WAL with the same name twice...

            2018-02-20 16:33:41,128 INFO  [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(690): Rolled WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421057 with entries=614, filesize=124.20 KB; new WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421109
            2018-02-20 16:33:41,164 INFO  [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(690): Rolled WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421109 with entries=789, filesize=124.20 KB; new WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
            

            Will dig more.

            zhangduo Duo Zhang added a comment -

            Oh, they are not the same... But it is still a bit strange that the file sizes are the same while the entry counts are different...

            zhangduo Duo Zhang added a comment -

            OK, the numEntries is incorrect; I will file an issue to address it. Not critical, just logging.

            And I think this is the problem:

            2018-02-20 16:33:41,183 INFO  [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(690): Rolled WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143 with entries=0, filesize=35.46 KB; new WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421171
            2018-02-20 16:33:41,184 DEBUG [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(757): Create new FSHLog writer with pipeline: [DatanodeInfoWithStorage[127.0.0.1:60898,DS-fe8309dd-a316-4ef5-a474-7167557c5c76,DISK], DatanodeInfoWithStorage[127.0.0.1:53328,DS-0f9fbb3d-3cd1-47ea-9867-52f1ecac6f19,DISK], DatanodeInfoWithStorage[127.0.0.1:57523,DS-fc2097ed-4d98-40d7-a4a0-adfd31a5bf68,DISK]]
            2018-02-20 16:33:41,184 INFO  [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(652): Archiving hdfs://localhost:55625/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143 to hdfs://localhost:55625/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/oldWALs/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
            

            Here we move the WAL file to oldWALs soon after rolling. We write to the WAL directly in the test, so this should not happen. Let me dig more.

            zhangduo Duo Zhang added a comment -

            Oh, maybe it is the same problem as numEntries: we should also reset sequenceIdAccounting before processing new requests. Let me file an issue to address it. (A sketch of the accounting idea follows below.)
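
            As a loose illustration only (hypothetical structures, not the real SequenceIdAccounting), the sketch below shows the kind of snapshot-and-reset that is meant here: at roll time the per-region "lowest unflushed sequence id" map is snapshotted for the file being closed and reset before new appends are processed, so stale or missing accounting cannot make a just-rolled file look empty and get archived to oldWALs right away.

            import java.util.HashMap;
            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            final class SequenceIdAccountingSketch {

              /** region -> lowest unflushed sequence id written to the current WAL file. */
              private final Map<String, Long> lowestUnflushed = new ConcurrentHashMap<>();

              void onAppend(String region, long seqId) {
                lowestUnflushed.putIfAbsent(region, seqId);
              }

              void onFlush(String region) {
                lowestUnflushed.remove(region);
              }

              /**
               * Called at roll time: snapshot the accounting for the file being closed and
               * reset it before any new appends, so the closed file keeps its own (possibly
               * non-empty) set and is not treated as immediately archivable.
               */
              Map<String, Long> snapshotAndReset() {
                Map<String, Long> forClosedFile = new HashMap<>(lowestUnflushed);
                lowestUnflushed.clear();
                return forClosedFile;
              }
            }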

            stack Michael Stack added a comment -

            All subtasks resolved. This family of tests no longer shows on the flakies list. Nice work Apache9


            People

              Assignee: zhangduo Duo Zhang
              Reporter: zhangduo Duo Zhang
              Votes: 0
              Watchers: 4
