Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.0.0-beta-2, 2.0.0
    • wal
    • None
    • Reviewed

    Description

      Fails pretty frequently in hadoopqa builds.

      There is a recent hang in org.apache.hadoop.hbase.TestFullLogReconstruction.tearDownAfterClass(TestFullLogReconstruction.java:68)

      In here... https://builds.apache.org/job/PreCommit-HBASE-Build/11363/testReport/org.apache.hadoop.hbase/TestFullLogReconstruction/org_apache_hadoop_hbase_TestFullLogReconstruction/

      ... see here.

      Thread 1250 (RS_CLOSE_META-edd281aedb18:59863-0):
      State: TIMED_WAITING
      Blocked count: 92
      Waited count: 278
      Stack:
      java.lang.Object.wait(Native Method)
      org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:133)
      org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:718)
      org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:605)
      org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:154)
      org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:81)
      org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
      org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2356)
      org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2328)
      org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2319)
      org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1531)
      org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1437)
      org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
      org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      java.lang.Thread.run(Thread.java:748)

      We missed a signal? We need to do an interrupt? The log is not all there in hadoopqa builds so hard to see all that is going on. This test is not in the flakey set either....

      Attachments

        1. HBASE-19927-addendum.patch
          3 kB
          Duo Zhang
        2. HBASE-19927.patch
          5 kB
          Duo Zhang
        3. js
          171 kB
          Duo Zhang
        4. out
          2.65 MB
          Duo Zhang

        Issue Links

          Activity

            zhangduo Duo Zhang added a comment -

            IIRC this happened before. I used to introduce a close chore in AsyncFSWAL to solve it and later removed it. Let me revisit the commit history to find out what happened.

            zhangduo Duo Zhang added a comment - IIRC this happened before. I used to introduce a close chore in AsyncFSWAL to solve it and later removed it. Let me revisit the commit history to find out what happened.
            zhangduo Duo Zhang added a comment -

            And I think we should enable TestDLSFSHLog and TestDLSAsyncFSWAL again to fix them also?

            zhangduo Duo Zhang added a comment - And I think we should enable TestDLSFSHLog and TestDLSAsyncFSWAL again to fix them also?
            zhangduo Duo Zhang added a comment -

            HBASE-17053. Let me check again.

            zhangduo Duo Zhang added a comment - HBASE-17053 . Let me check again.
            zhangduo Duo Zhang added a comment -

            I can reproduce the problem locally, one out of four or five runs.

            Upload the jstack and test output. Haven't found the root cause yet but obviously the test itself has a problem. In the output you can see that, we shut down the whole cluster before aborting a RS for session expired, so the test just test nothing...

            First I plan to rewrite the test to make sure that we actually do a 'full log reconstruction'. This could solve the hang also. And then we could try to implement a test to reproduce the problem stably and see how to fix it.

            Thanks.

            zhangduo Duo Zhang added a comment - I can reproduce the problem locally, one out of four or five runs. Upload the jstack and test output. Haven't found the root cause yet but obviously the test itself has a problem. In the output you can see that, we shut down the whole cluster before aborting a RS for session expired, so the test just test nothing... First I plan to rewrite the test to make sure that we actually do a 'full log reconstruction'. This could solve the hang also. And then we could try to implement a test to reproduce the problem stably and see how to fix it. Thanks.
            stack Michael Stack added a comment -

            Thanks Apache9 Will take a look tomorrow probably at this stage. Always interested in problems in shutdown (smile).

            stack Michael Stack added a comment - Thanks Apache9 Will take a look tomorrow probably at this stage. Always interested in problems in shutdown (smile).
            zhangduo Duo Zhang added a comment -

            Wait for the region server to crash before reading, tried several times, all passed.

            And I think the problem is that, we enter the normal shutdown first, scheduled region closing, and then wait on all the regions to be closed. And during the closing, master moves the WAL directory so we will fail when writing flush marker, and AsyncFSWAL will trigger a log roll. And the LogRoller will get an exception when creating new log file and then call rs.abort.

            I've already done a hack in AsyncFSWAL, that when we get 'Parent directory doesn't exist:' we will fail all the pending requests. But here it is not enough. We can only fail some of the flush maker writing, if there are still flush requests that write flush marker after the failure of log roll, it will be stuck for ever because we will not schedule a consumer to write it out, so we will not trigger another log roll and can not trigger the hack again...

            A possible solution maybe that, when s log roller finds that it could not roll a WAL, before aborting the RS, shutdown the WAL directly first...

            zhangduo Duo Zhang added a comment - Wait for the region server to crash before reading, tried several times, all passed. And I think the problem is that, we enter the normal shutdown first, scheduled region closing, and then wait on all the regions to be closed. And during the closing, master moves the WAL directory so we will fail when writing flush marker, and AsyncFSWAL will trigger a log roll. And the LogRoller will get an exception when creating new log file and then call rs.abort. I've already done a hack in AsyncFSWAL, that when we get 'Parent directory doesn't exist:' we will fail all the pending requests. But here it is not enough. We can only fail some of the flush maker writing, if there are still flush requests that write flush marker after the failure of log roll, it will be stuck for ever because we will not schedule a consumer to write it out, so we will not trigger another log roll and can not trigger the hack again... A possible solution maybe that, when s log roller finds that it could not roll a WAL, before aborting the RS, shutdown the WAL directly first...
            stack Michael Stack added a comment -

            Sounds fun. Sounds like issues we used have with the disruptor having to stuff 'signaling' edits into the pipe so stuff could get woken up to notice the shutdown state. Would be sweet if you took care of the async wal stuff. I'll take a look at the sequencing going down to see if stuff we can improve on. Thanks Apache9

            stack Michael Stack added a comment - Sounds fun. Sounds like issues we used have with the disruptor having to stuff 'signaling' edits into the pipe so stuff could get woken up to notice the shutdown state. Would be sweet if you took care of the async wal stuff. I'll take a look at the sequencing going down to see if stuff we can improve on. Thanks Apache9
            zhangduo Duo Zhang added a comment -

            Let me open another issue to track this. Thanks stack.

            zhangduo Duo Zhang added a comment - Let me open another issue to track this. Thanks stack .
            hadoopqa Hadoop QA added a comment -
            +1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 16s Docker mode activated.
                  Prechecks
            0 findbugs 0m 0s Findbugs executables are not available.
            +1 hbaseanti 0m 0s Patch does not have any anti-patterns.
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
                  master Compile Tests
            +1 mvninstall 7m 51s master passed
            +1 compile 1m 37s master passed
            +1 checkstyle 1m 54s master passed
            +1 shadedjars 9m 22s branch has no errors when building our shaded downstream artifacts.
            +1 javadoc 1m 5s master passed
                  Patch Compile Tests
            +1 mvninstall 6m 24s the patch passed
            +1 compile 0m 43s the patch passed
            +1 javac 0m 43s the patch passed
            +1 checkstyle 1m 8s hbase-server: The patch generated 0 new + 265 unchanged - 4 fixed = 265 total (was 269)
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedjars 4m 35s patch has no errors when building our shaded downstream artifacts.
            +1 hadoopcheck 18m 12s Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0.
            +1 javadoc 0m 29s the patch passed
                  Other Tests
            +1 unit 99m 16s hbase-server in the patch passed.
            +1 asflicense 0m 14s The patch does not generate ASF License warnings.
            147m 0s



            Subsystem Report/Notes
            Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01
            JIRA Issue HBASE-19927
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12909128/HBASE-19927.patch
            Optional Tests asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
            uname Linux b7c85ae0a15f 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux
            Build tool maven
            Personality /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
            git revision master / bfd74686c7
            maven version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z)
            Default Java 1.8.0_151
            Test Results https://builds.apache.org/job/PreCommit-HBASE-Build/11380/testReport/
            Max. process+thread count 5367 (vs. ulimit of 10000)
            modules C: hbase-server U: hbase-server
            Console output https://builds.apache.org/job/PreCommit-HBASE-Build/11380/console
            Powered by Apache Yetus 0.7.0 http://yetus.apache.org

            This message was automatically generated.

            hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 16s Docker mode activated.       Prechecks 0 findbugs 0m 0s Findbugs executables are not available. +1 hbaseanti 0m 0s Patch does not have any anti-patterns. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.       master Compile Tests +1 mvninstall 7m 51s master passed +1 compile 1m 37s master passed +1 checkstyle 1m 54s master passed +1 shadedjars 9m 22s branch has no errors when building our shaded downstream artifacts. +1 javadoc 1m 5s master passed       Patch Compile Tests +1 mvninstall 6m 24s the patch passed +1 compile 0m 43s the patch passed +1 javac 0m 43s the patch passed +1 checkstyle 1m 8s hbase-server: The patch generated 0 new + 265 unchanged - 4 fixed = 265 total (was 269) +1 whitespace 0m 0s The patch has no whitespace issues. +1 shadedjars 4m 35s patch has no errors when building our shaded downstream artifacts. +1 hadoopcheck 18m 12s Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. +1 javadoc 0m 29s the patch passed       Other Tests +1 unit 99m 16s hbase-server in the patch passed. +1 asflicense 0m 14s The patch does not generate ASF License warnings. 147m 0s Subsystem Report/Notes Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 JIRA Issue HBASE-19927 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12909128/HBASE-19927.patch Optional Tests asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile uname Linux b7c85ae0a15f 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux Build tool maven Personality /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh git revision master / bfd74686c7 maven version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) Default Java 1.8.0_151 Test Results https://builds.apache.org/job/PreCommit-HBASE-Build/11380/testReport/ Max. process+thread count 5367 (vs. ulimit of 10000) modules C: hbase-server U: hbase-server Console output https://builds.apache.org/job/PreCommit-HBASE-Build/11380/console Powered by Apache Yetus 0.7.0 http://yetus.apache.org This message was automatically generated.
            zhangduo Duo Zhang added a comment -

            Let me commit this. Will use HBASE-19929 to address the shutdown problem.

            zhangduo Duo Zhang added a comment - Let me commit this. Will use HBASE-19929 to address the shutdown problem.
            stack Michael Stack added a comment -

            +1 go for it

            stack Michael Stack added a comment - +1 go for it
            zhangduo Duo Zhang added a comment -

            Pushed to master and branch-2.

            Thanks stack for reviewing.

            zhangduo Duo Zhang added a comment - Pushed to master and branch-2. Thanks stack for reviewing.
            hudson Hudson added a comment -

            FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4528 (See https://builds.apache.org/job/HBase-Trunk_matrix/4528/)
            HBASE-19927 TestFullLogReconstruction flakey (zhangduo: rev e1cd10b002a07a35aa7666fcfbd01b54cfcff1bf)

            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/TestFullLogReconstruction.java
            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
            hudson Hudson added a comment - FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4528 (See https://builds.apache.org/job/HBase-Trunk_matrix/4528/ ) HBASE-19927 TestFullLogReconstruction flakey (zhangduo: rev e1cd10b002a07a35aa7666fcfbd01b54cfcff1bf) (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/TestFullLogReconstruction.java (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
            stack Michael Stack added a comment - It still flakey? https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests-branch2.0/lastSuccessfulBuild/artifact/dashboard.html
            zhangduo Duo Zhang added a comment -

            A bit strange

            2018-02-05 19:05:43,537 INFO  [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,57911,1517857533524' *****
            2018-02-05 19:05:43,537 INFO  [Time-limited test] regionserver.HRegionServer(2130): STOPPED: Shutdown requested
            2018-02-05 19:05:43,538 INFO  [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,50054,1517857533606' *****
            2018-02-05 19:05:43,538 INFO  [RS:0;asf903:57911] regionserver.SplitLogWorker(160): Sending interrupt to stop the worker thread
            2018-02-05 19:05:43,538 INFO  [Time-limited test] regionserver.HRegionServer(2130): STOPPED: Shutdown requested
            2018-02-05 19:05:43,538 INFO  [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,42069,1517857533678' *****
            
            2018-02-05 19:05:43,974 ERROR [regionserver/asf903:0.logRoller] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf903.gq1.ygridcore.net,57911,1517857533524: IOE in log roller *****
            

            The aborting still happens after the stopping in shutdown. Let me check.

            zhangduo Duo Zhang added a comment - A bit strange 2018-02-05 19:05:43,537 INFO [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,57911,1517857533524' ***** 2018-02-05 19:05:43,537 INFO [Time-limited test] regionserver.HRegionServer(2130): STOPPED: Shutdown requested 2018-02-05 19:05:43,538 INFO [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,50054,1517857533606' ***** 2018-02-05 19:05:43,538 INFO [RS:0;asf903:57911] regionserver.SplitLogWorker(160): Sending interrupt to stop the worker thread 2018-02-05 19:05:43,538 INFO [Time-limited test] regionserver.HRegionServer(2130): STOPPED: Shutdown requested 2018-02-05 19:05:43,538 INFO [Time-limited test] regionserver.HRegionServer(2116): ***** STOPPING region server 'asf903.gq1.ygridcore.net,42069,1517857533678' ***** 2018-02-05 19:05:43,974 ERROR [regionserver/asf903:0.logRoller] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf903.gq1.ygridcore.net,57911,1517857533524: IOE in log roller ***** The aborting still happens after the stopping in shutdown. Let me check.
            zhangduo Duo Zhang added a comment -

            Oh, we join on the wrong class, we need to join on RegionServerThread...

            Will upload an addendum soon...

            zhangduo Duo Zhang added a comment - Oh, we join on the wrong class, we need to join on RegionServerThread... Will upload an addendum soon...
            zhangduo Duo Zhang added a comment -

            Pushed the addendum to master and branch-2(after removing the unused import). Let's wait a while to see if it works.

             

            Thanks.

            zhangduo Duo Zhang added a comment - Pushed the addendum to master and branch-2(after removing the unused import). Let's wait a while to see if it works.   Thanks.
            hudson Hudson added a comment -

            FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4535 (See https://builds.apache.org/job/HBase-Trunk_matrix/4535/)
            HBASE-19927 Addendum join on RegionServerThread instead of HRegionServer (zhangduo: rev f5197979aaac7e36b6af36b86ea8dc8d7774fabe)

            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/TestFullLogReconstruction.java
            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
            hudson Hudson added a comment - FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4535 (See https://builds.apache.org/job/HBase-Trunk_matrix/4535/ ) HBASE-19927 Addendum join on RegionServerThread instead of HRegionServer (zhangduo: rev f5197979aaac7e36b6af36b86ea8dc8d7774fabe) (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/TestFullLogReconstruction.java (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
            stack Michael Stack added a comment -

            Reresolving. The addendum seems to have worked. No failures in last 24 hours (about 28 runs of the flakies suite). Good stuff.

            stack Michael Stack added a comment - Reresolving. The addendum seems to have worked. No failures in last 24 hours (about 28 runs of the flakies suite). Good stuff.

            People

              zhangduo Duo Zhang
              stack Michael Stack
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: