  HBase / HBASE-19147 All branch-2 unit tests pass / HBASE-19554

AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0-beta-2, 2.0.0
    • Component/s: Recovery, wal
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt

      The error message is a bit strange:

      [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) Time elapsed: 20.627 s <<< ERROR!
      org.apache.hadoop.hbase.TableNotFoundException: Region of 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the table of 'hbase:namespace'. hbase:meta might be damaged.

      It fails for both FSHLog and AsyncFSWAL. Need to dig more.

      Attachments

        1. HBASE-19554.patch
          8 kB
          Duo Zhang
        2. HBASE-19554-thread-dump.patch
          2 kB
          Duo Zhang


          Activity

            zhangduo Duo Zhang added a comment -

            The message means that hbase:meta has no records for the given table. I also found that HTU.waitUntilAllRegionsAssigned will still return in this case. Let me modify HTU.waitUntilAllRegionsAssigned first so that it does not return in this case.
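
            For illustration only, a minimal, self-contained sketch of the wait semantics described above. It is not the actual HBaseTestingUtility code; MetaView and its methods are hypothetical stand-ins for a scan of hbase:meta. The point is that an empty result for the table no longer counts as "assigned".

            import java.util.concurrent.TimeoutException;

            final class WaitUntilAssignedSketch {

              /** Hypothetical read-only view of hbase:meta, used only for this illustration. */
              interface MetaView {
                int regionRowsFor(String table);      // region rows in meta that belong to the table
                int assignedRegionsFor(String table); // those rows that also carry a server location
              }

              static void waitUntilAllRegionsAssigned(MetaView meta, String table, long timeoutMs)
                  throws InterruptedException, TimeoutException {
                long deadline = System.currentTimeMillis() + timeoutMs;
                while (System.currentTimeMillis() < deadline) {
                  int rows = meta.regionRowsFor(table);
                  // Zero rows means meta does not know the table yet, so keep waiting
                  // instead of returning as if everything were already assigned.
                  if (rows > 0 && meta.assignedRegionsFor(table) == rows) {
                    return;
                  }
                  Thread.sleep(200);
                }
                throw new TimeoutException("Regions of " + table + " not assigned within " + timeoutMs + " ms");
              }
            }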

            zhangduo Duo Zhang added a comment -

            Will commit this patch first to see if there is any difference in later pre-commit results.

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 2m 21s Docker mode activated.
                  Prechecks
            0 findbugs 0m 0s Findbugs executables are not available.
            +1 hbaseanti 0m 0s Patch does not have any anti-patterns.
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
                  master Compile Tests
            +1 mvninstall 4m 38s master passed
            +1 compile 0m 50s master passed
            +1 checkstyle 1m 10s master passed
            +1 shadedjars 5m 49s branch has no errors when building our shaded downstream artifacts.
            +1 javadoc 0m 28s master passed
                  Patch Compile Tests
            +1 mvninstall 4m 42s the patch passed
            +1 compile 0m 44s the patch passed
            +1 javac 0m 44s the patch passed
            -1 checkstyle 1m 5s hbase-server: The patch generated 2 new + 269 unchanged - 2 fixed = 271 total (was 271)
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedjars 4m 39s patch has no errors when building our shaded downstream artifacts.
            +1 hadoopcheck 19m 20s Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0.
            +1 javadoc 0m 27s the patch passed
                  Other Tests
            -1 unit 105m 24s hbase-server in the patch failed.
            +1 asflicense 0m 19s The patch does not generate ASF License warnings.
            146m 18s



            Reason Tests
            Failed junit tests hadoop.hbase.regionserver.wal.TestLogRolling



            Subsystem Report/Notes
            Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01
            JIRA Issue HBASE-19554
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12902859/HBASE-19554.patch
            Optional Tests asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
            uname Linux 091eb3fa7b75 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux
            Build tool maven
            Personality /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
            git revision master / 03e79b7994
            maven version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z)
            Default Java 1.8.0_151
            checkstyle https://builds.apache.org/job/PreCommit-HBASE-Build/10557/artifact/patchprocess/diff-checkstyle-hbase-server.txt
            unit https://builds.apache.org/job/PreCommit-HBASE-Build/10557/artifact/patchprocess/patch-unit-hbase-server.txt
            Test Results https://builds.apache.org/job/PreCommit-HBASE-Build/10557/testReport/
            modules C: hbase-server U: hbase-server
            Console output https://builds.apache.org/job/PreCommit-HBASE-Build/10557/console
            Powered by Apache Yetus 0.6.0 http://yetus.apache.org

            This message was automatically generated.

            stack Michael Stack added a comment -

            I pushed these for you so you have debug output when you get up, Apache9.

            The failure was TestLogRolling... but there are timeouts too. Random. We seem to be inching into fun test timeout territory again. Will have to dig in...

            hudson Hudson added a comment -

            FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4262 (See https://builds.apache.org/job/HBase-Trunk_matrix/4262/)
            HBASE-19554 For debug: Modify HTU.waitUntilAllRegionsAssigned to handle (stack: rev 084e324fb9f769f221236e31aa97382d2677f911)

            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/AbstractTestDLS.java
            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
            zhangduo Duo Zhang added a comment - - edited

            Checked recent pre-commit builds; things seem much better. And here is a failure case:

            https://builds.apache.org/job/PreCommit-HBASE-Build/10609/artifact/patchprocess/patch-unit-hbase-server.txt

            The error message:

            [ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 157.505 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL
            [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL)  Time elapsed: 49.462 s  <<< ERROR!
            java.lang.RuntimeException: 
            org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=11, exceptions:
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:59 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:00 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:02 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:06 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:16 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:26 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:36 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            
            Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: 
            Failed after attempts=11, exceptions:
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:58 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:19:59 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.io.IOException: Call to 604d085d7ec5/172.17.0.2:57745 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:00 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:02 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:06 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:16 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:26 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Thu Dec 21 10:20:36 UTC 2017, RpcRetryingCaller{globalStartTime=1513851598131, pause=100, maxAttempts=11}, java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            
            Caused by: java.net.ConnectException: Call to 604d085d7ec5/172.17.0.2:57745 failed on connection exception: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Caused by: org.apache.hadoop.hbase.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 604d085d7ec5/172.17.0.2:57745
            Caused by: java.net.ConnectException: Connection refused
            

            The test output XML cannot be generated, which makes it really hard to find the real problem...

            zhangduo Duo Zhang added a comment -

            https://builds.apache.org/job/PreCommit-HBASE-Build/10676/artifact/patchprocess/patch-unit-hbase-server.txt

            [ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 406.921 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL
            [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL)  Time elapsed: 301.077 s  <<< ERROR!
            org.junit.runners.model.TestTimedOutException: test timed out after 300000 milliseconds
            
            [ERROR] Errors: 
            [ERROR]   TestDLSAsyncFSWAL>AbstractTestDLS.testThreeRSAbort:401->Object.wait:460->Object.wait:-2 » TestTimedOut
            
            zhangduo Duo Zhang added a comment -

            Still failed in countRows. Could there be a problem with the assignments?

            stack Michael Stack added a comment -

            I disabled TestDLSAsyncFSWAL and TestDLSFSHLog so I can cut the beta-1. They fail 40% and 33% respectively on branch-2. Disabled them in master too... since once fixed, we can re-enable them in both places. Will be back to help out. Just disabling so we can push out the beta. Thanks.

            zhangduo Duo Zhang added a comment -

            Reenable for debugging.

            hudson Hudson added a comment -

            FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4528 (See https://builds.apache.org/job/HBase-Trunk_matrix/4528/)
            HBASE-19554 reenable TestDLSAsyncFSWAL/TestDLSFSHLog for debugging (zhangduo: rev 88d6e06a1fe4130b96e0e9f0e2c2b189f0b6affd)

            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDLSFSHLog.java
            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestDLSAsyncFSWAL.java
            zhangduo Duo Zhang added a comment -

            I guess the problem is related to meta reassignment. zghaobac has already found something strange in AMv2 in HBASE-19965, so let's wait for his digging results.

            And I found something strange in the output:

            java.lang.AssertionError
            	at org.apache.hadoop.hbase.wal.WALKeyImpl.getWriteEntry(WALKeyImpl.java:82)
            	at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:159)
            	at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:132)
            	at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:97)
            	at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionCloseMarker(HRegion.java:1103)
            	at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1615)
            	at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1437)
            	at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
            	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	at java.lang.Thread.run(Thread.java:748)
            

            The assertion is:

              public MultiVersionConcurrencyControl.WriteEntry getWriteEntry() throws InterruptedIOException {
                assert this.writeEntry != null;
                return this.writeEntry;
              }
            

            I think the problem was introduced by HBASE-19929. Since we close the WAL directly, it is possible that a WAL.append fails without assigning an mvcc writeEntry. Let me file an issue to fix this.
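
            As an illustration of the failure mode only (hypothetical Wal/WalKey/WriteEntry types, not the real WALKeyImpl or WALUtil), the sketch below only touches the write entry when the append actually succeeded; an append that fails because the WAL was closed underneath us never assigns one, and calling getWriteEntry() then is exactly the AssertionError in the stack above.

            import java.io.IOException;

            final class MarkerAppendSketch {

              static final class WriteEntry { }

              /** Hypothetical stand-in for WALKeyImpl. */
              static final class WalKey {
                private WriteEntry writeEntry; // assigned only once the append has been accepted

                WriteEntry getWriteEntry() {
                  assert writeEntry != null;   // the assertion seen in the stack trace above
                  return writeEntry;
                }

                void setWriteEntry(WriteEntry e) { this.writeEntry = e; }
                boolean hasWriteEntry() { return writeEntry != null; }
              }

              /** Hypothetical stand-in for the WAL; append fails if the WAL was closed directly. */
              interface Wal {
                void append(WalKey key) throws IOException;
              }

              static void writeMarker(Wal wal, WalKey key) throws IOException {
                boolean appended = false;
                try {
                  wal.append(key);
                  appended = true;
                } finally {
                  // Only read the write entry on the success path; a failed append may
                  // never have assigned one, so asserting on it would blow up here.
                  if (appended && key.hasWriteEntry()) {
                    key.getWriteEntry(); // ... complete the MVCC transaction with it here ...
                  }
                }
              }
            }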

            zhangduo Duo Zhang added a comment -

            https://builds.apache.org/job/HBASE-Flaky-Tests/25832/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestDLSFSHLog-output.txt/*view*/

            2018-02-12 04:56:20,895 DEBUG [PEWorker-16] procedure.ServerCrashProcedure(192): pid=132, state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure server=asf911.gq1.ygridcore.net,41715,1518411358686, splitWal=true, meta=true; Processing hbase:meta that was on asf911.gq1.ygridcore.net,41715,1518411358686
            2018-02-12 04:56:20,895 INFO  [PEWorker-16] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=135, ppid=132, state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure failedMetaServer=asf911.gq1.ygridcore.net,41715,1518411358686, splitWal=true}]
            

            Then there is no progress, so eventually we time out. Let me add a thread dump when we are about to time out to see if we can find something.
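
            A small, self-contained sketch (not the attached HBASE-19554-thread-dump.patch) of how a test can dump all thread stacks shortly before it would time out, using only standard JDK APIs, so the hung worker shows up in the console log even when no surefire XML is written:

            import java.lang.management.ManagementFactory;
            import java.lang.management.ThreadInfo;
            import java.lang.management.ThreadMXBean;

            final class ThreadDumpSketch {

              static void dumpAllThreads() {
                ThreadMXBean bean = ManagementFactory.getThreadMXBean();
                for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
                  System.err.print(info);
                }
              }

              /** Runs the body, printing a thread dump if it is still running near the deadline. */
              static void runWithDumpBeforeTimeout(Runnable body, long timeoutMs) throws InterruptedException {
                Thread worker = new Thread(body, "test-body");
                worker.start();
                // Leave a margin so the dump lands in the log before the JUnit timeout fires.
                worker.join(Math.max(0, timeoutMs - 5_000));
                if (worker.isAlive()) {
                  dumpAllThreads();
                }
                worker.join();
              }
            }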

            zhangduo Duo Zhang added a comment -

            What I pushed to master and branch-2.

            zhangduo Duo Zhang added a comment -

            After revisiting the log, I think I found a possible deadlock in recovery.

            It is very simple: we only have 16 threads to run the procedures. During a failover, each region leads to one AssignProcedure, and if the meta region is also offline, every AssignProcedure will be stuck at AssignProcedure.startTransition, where we need to read something from the meta region. So if the RecoverMetaProcedure cannot be picked up by one of the worker threads before all of them are stuck at AssignProcedure.startTransition, then deadlock...

            So I think we should have a special worker thread which only executes meta-related operations. If a TableProcedureInterface is for the meta table, we should insert it into a special queue and let the special worker run it immediately (a minimal sketch of this routing idea follows below).
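
            A minimal sketch of that routing idea (hypothetical Procedure type and isMetaProcedure flag; this is not the real ProcedureExecutor or scheduler): meta-related work goes to its own single-threaded queue so it can never be starved by the shared pool, whose 16 workers may all be blocked waiting for meta to come back online.

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;

            final class MetaAwareSchedulerSketch {

              interface Procedure extends Runnable {
                /** Hypothetical flag: does this procedure operate on hbase:meta? */
                boolean isMetaProcedure();
              }

              private final ExecutorService sharedWorkers = Executors.newFixedThreadPool(16);
              private final ExecutorService metaWorker = Executors.newSingleThreadExecutor();

              void submit(Procedure proc) {
                // Meta procedures (e.g. RecoverMetaProcedure) bypass the shared pool, so a
                // pile of AssignProcedures blocked on meta cannot deadlock recovery.
                if (proc.isMetaProcedure()) {
                  metaWorker.execute(proc);
                } else {
                  sharedWorkers.execute(proc);
                }
              }

              void shutdown() {
                sharedWorkers.shutdown();
                metaWorker.shutdown();
              }
            }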

            Let me open a new issue to address it.

            Thanks.

            hudson Hudson added a comment -

            FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4571 (See https://builds.apache.org/job/HBase-Trunk_matrix/4571/)
            HBASE-19554 Dump thread info before failure (zhangduo: rev 8ff783f76d370199b81a24c48feaad1aa980285e)

            • (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/AbstractTestDLS.java
            zhangduo Duo Zhang added a comment -

            A different problem:

            https://builds.apache.org/job/HBASE-Flaky-Tests/26171/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL-output.txt/*view*/

            Now we time out when running testRecoveredEdits.

            "Time-limited test" daemon prio=5 tid=23 timed_waiting
            java.lang.Thread.State: TIMED_WAITING
                    at java.lang.Object.wait(Native Method)
                    at java.lang.Object.wait(Object.java:502)
                    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406)
                    at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1378)
                    at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:677)
                    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.close(RecoverableZooKeeper.java:711)
                    at org.apache.hadoop.hbase.zookeeper.ZKWatcher.close(ZKWatcher.java:596)
                    at org.apache.hadoop.hbase.master.AbstractTestDLS.testRecoveredEdits(AbstractTestDLS.java:251)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                    at java.lang.reflect.Method.invoke(Method.java:498)
                    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
                    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
                    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
                    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
                    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
                    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
                    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
                    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
                    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
                    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
                    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
                    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
                    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
                    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
                    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
                    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
                    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
                    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
                    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
                    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
                    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                    at java.lang.Thread.run(Thread.java:748)
            

            Let me dig more.

            zhangduo Duo Zhang added a comment -

            OK, the problem is testMasterStartsUpWithLogSplittingWork; it takes nearly 13 minutes to finish:

            2018-02-20 02:29:56,586 INFO  [Time-limited test] hbase.ResourceChecker(148): before: master.TestDLSAsyncFSWAL#testMasterStartsUpWithLogSplittingWork Thread=133, OpenFileDescriptor=893, MaxFileDescriptor=60000, SystemLoadAverage=819, ProcessCount=273, AvailableMemoryMB=3881
            
            2018-02-20 02:42:38,021 INFO  [Time-limited test] hbase.ResourceChecker(172): after: master.TestDLSAsyncFSWAL#testMasterStartsUpWithLogSplittingWork Thread=233 (was 133)
            

            Let me dig more.

            zhangduo Duo Zhang added a comment -

            I think this is a problem we have already addressed? We have already closed the ConnectionImplementation from outside, so the initialization of HMaster should quit immediately. But it seems not, it keeps retrying on the closed connection... Let me check what's wrong here...

            2018-02-20 02:31:12,578 DEBUG [M:0;asf911:38379] client.RpcRetryingCallerImpl(132): Call exception, tries=11, retries=11, started=48515 ms ago, cancelled=false, msg=hconnection-0x195e9495 closed, details=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1519093804365.3ff104b43ef4ce0a491e1f26b598813e., hostname=asf911.gq1.ygridcore.net,51776,1519093799312, seqNum=2, exception=java.io.IOException: hconnection-0x195e9495 closed
                    at org.apache.hadoop.hbase.client.ConnectionImplementation.getTableState(ConnectionImplementation.java:1959)
                    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getTableState(ConnectionUtils.java:131)
                    at org.apache.hadoop.hbase.client.ConnectionImplementation.isTableDisabled(ConnectionImplementation.java:573)
                    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.isTableDisabled(ConnectionUtils.java:131)
                    at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:219)
                    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
                    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:388)
                    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:362)
                    at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
                    at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
                    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
                    at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62)
                    at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
                    at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1053)
                    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:919)
                    at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2017)
                    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:559)
                    at java.lang.Thread.run(Thread.java:748)
            
            stack Michael Stack added a comment -

            But it seems not, it keeps retrying on the closed connection... Let me check what's wrong here...

            Yes. I thought this was addressed because we explicitly 'close' the Master Connection on shutdown as of about a week or so ago. But it looks like you figured out that we need more than that: a check that the Connection is closed. Good one.
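
            For illustration, a minimal sketch of that "check the Connection is closed" idea (hypothetical ClusterConnection interface, not the real RpcRetryingCallerImpl): the retry loop checks the connection before every attempt and fails fast instead of burning through all eleven retries against a connection that was shut down from outside.

            import java.io.IOException;
            import java.util.concurrent.Callable;

            final class RetryOnLiveConnectionSketch {

              interface ClusterConnection {
                boolean isClosed();
              }

              static <T> T callWithRetries(ClusterConnection conn, Callable<T> call,
                  int maxAttempts, long pauseMs) throws Exception {
                IOException last = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                  if (conn.isClosed()) {
                    // The connection was closed from outside (e.g. master shutdown):
                    // retrying cannot succeed, so abort immediately.
                    throw new IOException("connection closed, aborting retries", last);
                  }
                  try {
                    return call.call();
                  } catch (IOException e) {
                    last = e;
                    Thread.sleep(pauseMs);
                  }
                }
                throw new IOException("Failed after attempts=" + maxAttempts, last);
              }
            }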

            zhangduo Duo Zhang added a comment - - edited

            https://builds.apache.org/job/HBASE-Flaky-Tests-branch2.0/2393/testReport/junit/org.apache.hadoop.hbase.master/TestDLSFSHLog/testRecoveredEdits/

            java.lang.AssertionError: expected:<10000> but was:<9825>
            zhangduo Duo Zhang added a comment -

            A bit strange: we roll the WAL with the same name twice...

            2018-02-20 16:33:41,128 INFO  [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(690): Rolled WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421057 with entries=614, filesize=124.20 KB; new WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421109
            2018-02-20 16:33:41,164 INFO  [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(690): Rolled WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421109 with entries=789, filesize=124.20 KB; new WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
            

            Will dig more.

            zhangduo Duo Zhang added a comment -

            Oh, they are not the same... But it is still a bit strange that the file sizes are the same while the entry counts are different...

            zhangduo Duo Zhang added a comment -

            OK, the numEntries is incorrect; I will file an issue to address it. Not critical, just logging.

            And I think this is the problem:

            2018-02-20 16:33:41,183 INFO  [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(690): Rolled WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143 with entries=0, filesize=35.46 KB; new WAL /user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421171
            2018-02-20 16:33:41,184 DEBUG [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(757): Create new FSHLog writer with pipeline: [DatanodeInfoWithStorage[127.0.0.1:60898,DS-fe8309dd-a316-4ef5-a474-7167557c5c76,DISK], DatanodeInfoWithStorage[127.0.0.1:53328,DS-0f9fbb3d-3cd1-47ea-9867-52f1ecac6f19,DISK], DatanodeInfoWithStorage[127.0.0.1:57523,DS-fc2097ed-4d98-40d7-a4a0-adfd31a5bf68,DISK]]
            2018-02-20 16:33:41,184 INFO  [regionserver/asf903:0.logRoller] wal.AbstractFSWAL(652): Archiving hdfs://localhost:55625/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/WALs/asf903.gq1.ygridcore.net,52039,1519144411613/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143 to hdfs://localhost:55625/user/jenkins/test-data/46c46c01-4dc8-4506-a228-a4460f2a28e9/oldWALs/asf903.gq1.ygridcore.net%2C52039%2C1519144411613.1519144421143
            

            Here we move the WAL file to oldWALs soon after rolling. We write to the WAL directly in the test, so this should not happen. Let me dig more.

            zhangduo Duo Zhang added a comment -

            Oh, maybe it is the same problem as numEntries: we should also reset sequenceIdAccounting before processing new requests. Let me file an issue to address it. (A sketch of the accounting idea follows below.)
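
            As a loose illustration only (hypothetical structures, not the real SequenceIdAccounting), the sketch below shows the kind of snapshot-and-reset that is meant here: at roll time the per-region "lowest unflushed sequence id" map is snapshotted for the file being closed and reset before new appends are processed, so stale or missing accounting cannot make a just-rolled file look empty and get archived to oldWALs right away.

            import java.util.HashMap;
            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            final class SequenceIdAccountingSketch {

              /** region -> lowest unflushed sequence id written to the current WAL file. */
              private final Map<String, Long> lowestUnflushed = new ConcurrentHashMap<>();

              void onAppend(String region, long seqId) {
                lowestUnflushed.putIfAbsent(region, seqId);
              }

              void onFlush(String region) {
                lowestUnflushed.remove(region);
              }

              /**
               * Called at roll time: snapshot the accounting for the file being closed and
               * reset it before any new appends, so the closed file keeps its own (possibly
               * non-empty) set and is not treated as immediately archivable.
               */
              Map<String, Long> snapshotAndReset() {
                Map<String, Long> forClosedFile = new HashMap<>(lowestUnflushed);
                lowestUnflushed.clear();
                return forClosedFile;
              }
            }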

            stack Michael Stack added a comment -

            All subtasks resolved. This family of tests no longer shows on the flakies list. Nice work Apache9


            People

              Assignee: zhangduo Duo Zhang
              Reporter: zhangduo Duo Zhang
              Votes: 0
              Watchers: 4
