[HBASE-19927] TestFullLogReconstruction flakey - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0-beta-2, 2.0.0
Component/s: wal
Labels:
None

Hadoop Flags:

Reviewed

Description

Fails pretty frequently in hadoopqa builds.

There is a recent hang in org.apache.hadoop.hbase.TestFullLogReconstruction.tearDownAfterClass(TestFullLogReconstruction.java:68)

In here... https://builds.apache.org/job/PreCommit-HBASE-Build/11363/testReport/org.apache.hadoop.hbase/TestFullLogReconstruction/org_apache_hadoop_hbase_TestFullLogReconstruction/

... see here.

Thread 1250 (RS_CLOSE_META-edd281aedb18:59863-0):
State: TIMED_WAITING
Blocked count: 92
Waited count: 278
Stack:
java.lang.Object.wait(Native Method)
org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:133)
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:718)
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:605)
org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:154)
org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:81)
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2356)
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2328)
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2319)
org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1531)
org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1437)
org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

We missed a signal? We need to do an interrupt? The log is not all there in hadoopqa builds so hard to see all that is going on. This test is not in the flakey set either....

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HBASE-19927-addendum.patch
06/Feb/18 01:55
3 kB
Duo Zhang
HBASE-19927.patch
04/Feb/18 05:01
5 kB
Duo Zhang
js
04/Feb/18 04:02
171 kB
Duo Zhang
out
04/Feb/18 04:02
2.65 MB
Duo Zhang

Issue Links

relates to

HBASE-19929 Call RS.stop on a session expired RS may hang

Resolved

Activity

Ascending order - Click to sort in descending order

Duo Zhang added a comment - 04/Feb/18 00:28

IIRC this happened before. I used to introduce a close chore in AsyncFSWAL to solve it and later removed it. Let me revisit the commit history to find out what happened.

Duo Zhang added a comment - 04/Feb/18 00:28 IIRC this happened before. I used to introduce a close chore in AsyncFSWAL to solve it and later removed it. Let me revisit the commit history to find out what happened.

Duo Zhang added a comment - 04/Feb/18 00:42

And I think we should enable TestDLSFSHLog and TestDLSAsyncFSWAL again to fix them also?

Duo Zhang added a comment - 04/Feb/18 00:42 And I think we should enable TestDLSFSHLog and TestDLSAsyncFSWAL again to fix them also?

Duo Zhang added a comment - 04/Feb/18 00:46

~~HBASE-17053~~. Let me check again.

Duo Zhang added a comment - 04/Feb/18 00:46 HBASE-17053 . Let me check again.

Duo Zhang added a comment - 04/Feb/18 04:06

I can reproduce the problem locally, one out of four or five runs.

Upload the jstack and test output. Haven't found the root cause yet but obviously the test itself has a problem. In the output you can see that, we shut down the whole cluster before aborting a RS for session expired, so the test just test nothing...

First I plan to rewrite the test to make sure that we actually do a 'full log reconstruction'. This could solve the hang also. And then we could try to implement a test to reproduce the problem stably and see how to fix it.

Thanks.

Duo Zhang added a comment - 04/Feb/18 04:06 I can reproduce the problem locally, one out of four or five runs. Upload the jstack and test output. Haven't found the root cause yet but obviously the test itself has a problem. In the output you can see that, we shut down the whole cluster before aborting a RS for session expired, so the test just test nothing... First I plan to rewrite the test to make sure that we actually do a 'full log reconstruction'. This could solve the hang also. And then we could try to implement a test to reproduce the problem stably and see how to fix it. Thanks.

Michael Stack added a comment - 04/Feb/18 05:06

Thanks Apache9 Will take a look tomorrow probably at this stage. Always interested in problems in shutdown (smile).

Michael Stack added a comment - 04/Feb/18 05:06 Thanks Apache9 Will take a look tomorrow probably at this stage. Always interested in problems in shutdown (smile).

Duo Zhang added a comment - 04/Feb/18 05:20

Wait for the region server to crash before reading, tried several times, all passed.

And I think the problem is that, we enter the normal shutdown first, scheduled region closing, and then wait on all the regions to be closed. And during the closing, master moves the WAL directory so we will fail when writing flush marker, and AsyncFSWAL will trigger a log roll. And the LogRoller will get an exception when creating new log file and then call rs.abort.

I've already done a hack in AsyncFSWAL, that when we get 'Parent directory doesn't exist:' we will fail all the pending requests. But here it is not enough. We can only fail some of the flush maker writing, if there are still flush requests that write flush marker after the failure of log roll, it will be stuck for ever because we will not schedule a consumer to write it out, so we will not trigger another log roll and can not trigger the hack again...

A possible solution maybe that, when s log roller finds that it could not roll a WAL, before aborting the RS, shutdown the WAL directly first...

Duo Zhang added a comment - 04/Feb/18 05:20 Wait for the region server to crash before reading, tried several times, all passed. And I think the problem is that, we enter the normal shutdown first, scheduled region closing, and then wait on all the regions to be closed. And during the closing, master moves the WAL directory so we will fail when writing flush marker, and AsyncFSWAL will trigger a log roll. And the LogRoller will get an exception when creating new log file and then call rs.abort. I've already done a hack in AsyncFSWAL, that when we get 'Parent directory doesn't exist:' we will fail all the pending requests. But here it is not enough. We can only fail some of the flush maker writing, if there are still flush requests that write flush marker after the failure of log roll, it will be stuck for ever because we will not schedule a consumer to write it out, so we will not trigger another log roll and can not trigger the hack again... A possible solution maybe that, when s log roller finds that it could not roll a WAL, before aborting the RS, shutdown the WAL directly first...

Michael Stack added a comment - 04/Feb/18 05:28

Sounds fun. Sounds like issues we used have with the disruptor having to stuff 'signaling' edits into the pipe so stuff could get woken up to notice the shutdown state. Would be sweet if you took care of the async wal stuff. I'll take a look at the sequencing going down to see if stuff we can improve on. Thanks Apache9

Michael Stack added a comment - 04/Feb/18 05:28 Sounds fun. Sounds like issues we used have with the disruptor having to stuff 'signaling' edits into the pipe so stuff could get woken up to notice the shutdown state. Would be sweet if you took care of the async wal stuff. I'll take a look at the sequencing going down to see if stuff we can improve on. Thanks Apache9

Duo Zhang added a comment - 04/Feb/18 05:30

Let me open another issue to track this. Thanks stack.

Duo Zhang added a comment - 04/Feb/18 05:30 Let me open another issue to track this. Thanks stack .

Hadoop QA added a comment - 04/Feb/18 07:37

+1 overall

Vote	Subsystem	Runtime	Comment
0	reexec	0m 16s	Docker mode activated.
			Prechecks
0	findbugs	0m 0s	Findbugs executables are not available.
+1	hbaseanti	0m 0s	Patch does not have any anti-patterns.
+1	@author	0m 0s	The patch does not contain any @author tags.
+1	test4tests	0m 0s	The patch appears to include 2 new or modified test files.
			master Compile Tests
+1	mvninstall	7m 51s	master passed
+1	compile	1m 37s	master passed
+1	checkstyle	1m 54s	master passed
+1	shadedjars	9m 22s	branch has no errors when building our shaded downstream artifacts.
+1	javadoc	1m 5s	master passed
			Patch Compile Tests
+1	mvninstall	6m 24s	the patch passed
+1	compile	0m 43s	the patch passed
+1	javac	0m 43s	the patch passed
+1	checkstyle	1m 8s	hbase-server: The patch generated 0 new + 265 unchanged - 4 fixed = 265 total (was 269)
+1	whitespace	0m 0s	The patch has no whitespace issues.
+1	shadedjars	4m 35s	patch has no errors when building our shaded downstream artifacts.
+1	hadoopcheck	18m 12s	Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0.
+1	javadoc	0m 29s	the patch passed
			Other Tests
+1	unit	99m 16s	hbase-server in the patch passed.
+1	asflicense	0m 14s	The patch does not generate ASF License warnings.
		147m 0s

Subsystem	Report/Notes
Docker	Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01
JIRA Issue	~~HBASE-19927~~
JIRA Patch URL	https://issues.apache.org/jira/secure/attachment/12909128/HBASE-19927.patch
Optional Tests	asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname	Linux b7c85ae0a15f 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux
Build tool	maven
Personality	/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
git revision	master / bfd74686c7
maven	version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z)
Default Java	1.8.0_151
Test Results	https://builds.apache.org/job/PreCommit-HBASE-Build/11380/testReport/
Max. process+thread count	5367 (vs. ulimit of 10000)
modules	C: hbase-server U: hbase-server
Console output	https://builds.apache.org/job/PreCommit-HBASE-Build/11380/console
Powered by	Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

Hadoop QA added a comment - 04/Feb/18 07:37 +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 16s Docker mode activated. Prechecks 0 findbugs 0m 0s Findbugs executables are not available. +1 hbaseanti 0m 0s Patch does not have any anti-patterns. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. master Compile Tests +1 mvninstall 7m 51s master passed +1 compile 1m 37s master passed +1 checkstyle 1m 54s master passed +1 shadedjars 9m 22s branch has no errors when building our shaded downstream artifacts. +1 javadoc 1m 5s master passed Patch Compile Tests +1 mvninstall 6m 24s the patch passed +1 compile 0m 43s the patch passed +1 javac 0m 43s the patch passed +1 checkstyle 1m 8s hbase-server: The patch generated 0 new + 265 unchanged - 4 fixed = 265 total (was 269) +1 whitespace 0m 0s The patch has no whitespace issues. +1 shadedjars 4m 35s patch has no errors when building our shaded downstream artifacts. +1 hadoopcheck 18m 12s Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. +1 javadoc 0m 29s the patch passed Other Tests +1 unit 99m 16s hbase-server in the patch passed. +1 asflicense 0m 14s The patch does not generate ASF License warnings. 147m 0s Subsystem Report/Notes Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 JIRA Issue HBASE-19927 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12909128/HBASE-19927.patch Optional Tests asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile uname Linux b7c85ae0a15f 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux Build tool maven Personality /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh git revision master / bfd74686c7 maven version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) Default Java 1.8.0_151 Test Results https://builds.apache.org/job/PreCommit-HBASE-Build/11380/testReport/ Max. process+thread count 5367 (vs. ulimit of 10000) modules C: hbase-server U: hbase-server Console output https://builds.apache.org/job/PreCommit-HBASE-Build/11380/console Powered by Apache Yetus 0.7.0 http://yetus.apache.org This message was automatically generated.

Duo Zhang added a comment - 05/Feb/18 00:40

Let me commit this. Will use ~~HBASE-19929~~ to address the shutdown problem.

Duo Zhang added a comment - 05/Feb/18 00:40 Let me commit this. Will use HBASE-19929 to address the shutdown problem.

Michael Stack added a comment - 05/Feb/18 00:49

+1 go for it

Michael Stack added a comment - 05/Feb/18 00:49 +1 go for it

Duo Zhang added a comment - 05/Feb/18 00:53

Pushed to master and branch-2.

Thanks stack for reviewing.

Duo Zhang added a comment - 05/Feb/18 00:53 Pushed to master and branch-2. Thanks stack for reviewing.

Hudson added a comment - 05/Feb/18 04:55

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4528 (See https://builds.apache.org/job/HBase-Trunk_matrix/4528/)
~~HBASE-19927~~ TestFullLogReconstruction flakey (zhangduo: rev e1cd10b002a07a35aa7666fcfbd01b54cfcff1bf)

(edit) hbase-server/src/test/java/org/apache/hadoop/hbase/TestFullLogReconstruction.java
(edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java

Hudson added a comment - 05/Feb/18 04:55 FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4528 (See https://builds.apache.org/job/HBase-Trunk_matrix/4528/ ) HBASE-19927 TestFullLogReconstruction flakey (zhangduo: rev e1cd10b002a07a35aa7666fcfbd01b54cfcff1bf) (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/TestFullLogReconstruction.java (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java

Michael Stack added a comment - 05/Feb/18 16:22

It still flakey? https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests-branch2.0/lastSuccessfulBuild/artifact/dashboard.html

Michael Stack added a comment - 05/Feb/18 16:22 It still flakey? https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests-branch2.0/lastSuccessfulBuild/artifact/dashboard.html

Duo Zhang added a comment - 06/Feb/18 00:39

A bit strange

2018-02-05 19:05:43,537 INFO  [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,57911,1517857533524' *****
2018-02-05 19:05:43,537 INFO  [Time-limited test] regionserver.HRegionServer(2130): STOPPED: Shutdown requested
2018-02-05 19:05:43,538 INFO  [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,50054,1517857533606' *****
2018-02-05 19:05:43,538 INFO  [RS:0;asf903:57911] regionserver.SplitLogWorker(160): Sending interrupt to stop the worker thread
2018-02-05 19:05:43,538 INFO  [Time-limited test] regionserver.HRegionServer(2130): STOPPED: Shutdown requested
2018-02-05 19:05:43,538 INFO  [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,42069,1517857533678' *****

2018-02-05 19:05:43,974 ERROR [regionserver/asf903:0.logRoller] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf903.gq1.ygridcore.net,57911,1517857533524: IOE in log roller *****

The aborting still happens after the stopping in shutdown. Let me check.

Duo Zhang added a comment - 06/Feb/18 00:39 A bit strange 2018-02-05 19:05:43,537 INFO [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,57911,1517857533524' ***** 2018-02-05 19:05:43,537 INFO [Time-limited test] regionserver.HRegionServer(2130): STOPPED: Shutdown requested 2018-02-05 19:05:43,538 INFO [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,50054,1517857533606' ***** 2018-02-05 19:05:43,538 INFO [RS:0;asf903:57911] regionserver.SplitLogWorker(160): Sending interrupt to stop the worker thread 2018-02-05 19:05:43,538 INFO [Time-limited test] regionserver.HRegionServer(2130): STOPPED: Shutdown requested 2018-02-05 19:05:43,538 INFO [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,42069,1517857533678' ***** 2018-02-05 19:05:43,974 ERROR [regionserver/asf903:0.logRoller] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf903.gq1.ygridcore.net,57911,1517857533524: IOE in log roller ***** The aborting still happens after the stopping in shutdown. Let me check.

Duo Zhang added a comment - 06/Feb/18 00:51

Oh, we join on the wrong class, we need to join on RegionServerThread...

Will upload an addendum soon...

Duo Zhang added a comment - 06/Feb/18 00:51 Oh, we join on the wrong class, we need to join on RegionServerThread... Will upload an addendum soon...

Duo Zhang added a comment - 06/Feb/18 02:04

Pushed the addendum to master and branch-2(after removing the unused import). Let's wait a while to see if it works.

Thanks.

Duo Zhang added a comment - 06/Feb/18 02:04 Pushed the addendum to master and branch-2(after removing the unused import). Let's wait a while to see if it works. Thanks.

Hudson added a comment - 06/Feb/18 08:19

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4535 (See https://builds.apache.org/job/HBase-Trunk_matrix/4535/)
~~HBASE-19927~~ Addendum join on RegionServerThread instead of HRegionServer (zhangduo: rev f5197979aaac7e36b6af36b86ea8dc8d7774fabe)

(edit) hbase-server/src/test/java/org/apache/hadoop/hbase/TestFullLogReconstruction.java
(edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java

Hudson added a comment - 06/Feb/18 08:19 FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4535 (See https://builds.apache.org/job/HBase-Trunk_matrix/4535/ ) HBASE-19927 Addendum join on RegionServerThread instead of HRegionServer (zhangduo: rev f5197979aaac7e36b6af36b86ea8dc8d7774fabe) (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/TestFullLogReconstruction.java (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java

Michael Stack added a comment - 07/Feb/18 01:29

Reresolving. The addendum seems to have worked. No failures in last 24 hours (about 28 runs of the flakies suite). Good stuff.

Michael Stack added a comment - 07/Feb/18 01:29 Reresolving. The addendum seems to have worked. No failures in last 24 hours (about 28 runs of the flakies suite). Good stuff.

People

Assignee:: Duo Zhang

Reporter:: Michael Stack

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 03/Feb/18 17:51

Updated:: 21/Mar/18 22:23

Resolved:: 07/Feb/18 01:29