Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-12847

TestZKLessSplitOnCluster frequently times out in 0.98 builds

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 1.0.0, 0.98.10, 1.1.0
    • None
    • None

    Description

      Gets hung up in testSSHCleanupDaugtherRegionsOfAbortedSplit waiting on deleteTable

      "Thread-334" prio=10 tid=0x00007f15382da800 nid=0x40ae in Object.wait() [0x00007
      f1315f5d000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              - waiting on <0x00000007e1b525b8> (a org.apache.hadoop.hbase.ipc.RpcClie
      nt$Call)
              at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1452)
              - locked <0x00000007e1b525b8> (a org.apache.hadoop.hbase.ipc.RpcClient$C
      all)
              at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.ja
      va:1661)
              at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementatio
      n.callBlockingMethod(RpcClient.java:1719)
              at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService
      $BlockingStub.disableTable(MasterProtos.java:43749)
              at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplemen
      tation$5.disableTable(HConnectionManager.java:1995)
              at org.apache.hadoop.hbase.client.HBaseAdmin$5.call(HBaseAdmin.java:947)
              at org.apache.hadoop.hbase.client.HBaseAdmin$5.call(HBaseAdmin.java:942)
              at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR
      etryingCaller.java:117)
              - locked <0x00000007dfffe938> (a org.apache.hadoop.hbase.client.RpcRetry
      ingCaller)
              at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR
      etryingCaller.java:93)
              - locked <0x00000007dfffe938> (a org.apache.hadoop.hbase.client.RpcRetry
      ingCaller)
              at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.
      java:3398)
              at org.apache.hadoop.hbase.client.HBaseAdmin.disableTableAsync(HBaseAdmi
      n.java:942)
              at org.apache.hadoop.hbase.client.HBaseAdmin.disableTable(HBaseAdmin.jav
      a:974)
              at org.apache.hadoop.hbase.HBaseTestingUtility.deleteTable(HBaseTestingUtility.java:1532)
              at org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.testSSHCleanupDaugtherRegionsOfAbortedSplit(TestSplitTransactionOnCluster.java:1172)
      

      See attached test.log.good.gz and test.log.bad.gz.

      In test.log.bad at 2015-01-13 08:02:45,947 we acquire a lock on testSSHCleanupDaugtherRegionsOfAbortedSplit to start a disable but do nothing afterward. Nothing happens for one minute. Then looks like the client makes another request but it can't get the table lock. There's no progress until timeout.

      2015-01-13 08:02:45,947 INFO  [Thread-334] client.HBaseAdmin$5(945): Started disable of testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:02:45,948 INFO  [FifoRpcScheduler.handler1-thread-4] master.HMaster(2213): Client=apurtell//10.40.8.95 disable testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:02:45,950 DEBUG [FifoRpcScheduler.handler1-thread-4] lock.ZKInterProcessLockBase(226): Acquired a lock for /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:367740000000001
      2015-01-13 08:03:44,585 DEBUG [ip-10-40-8-95.us-west-2.compute.internal,36774,1421136162827-BalancerChore] master.HMaster(1553): Not running balancer because 3 region(s) in transition: {e4f79ac2b4711f7d906a291d94302d6e={e4f79ac2b4711f7d906a291d94302d6e state=SPLITTING, ts=1421136165837, server=ip-10-40-8-95.us-west-2.compute.internal,47769,1421136163047}, ebbcf66ec09960d42a2a49252599bb6d={ebbcf66ec09960d42a2a49252599bb6d state=SPLITTI...
      2015-01-13 08:03:46,210 INFO  [Thread-334] client.HBaseAdmin$5(945): Started disable of testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:03:46,211 INFO  [FifoRpcScheduler.handler1-thread-3] master.HMaster(2213): Client=apurtell//10.40.8.95 disable testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:03:46,214 DEBUG [FifoRpcScheduler.handler1-thread-3] master.TableLockManager$ZKTableLockManager$1(242): Table is locked by [tableName=
      default+testSSHCleanupDaugtherRegionsOfAbortedSplit, lockOwner=ip-10-40-8-95.us-west-2.compute.internal,36774,1421136162827, threadId=486, purpose=C_M_DISABLE_TABLE, isShared=false, createTime=1421136165948]
      

      If we look at test.log.good where the same sequence of events begin, the table is quickly disabled and deleted:

      2015-01-13 08:01:47,923 INFO  [Thread-333] client.HBaseAdmin$5(945): Started disable of testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:01:47,924 INFO  [FifoRpcScheduler.handler1-thread-5] master.HMaster(2213): Client=apurtell//10.40.8.95 disable testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:01:47,926 DEBUG [FifoRpcScheduler.handler1-thread-5] lock.ZKInterProcessLockBase(226): Acquired a lock for /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000001
      2015-01-13 08:01:47,932 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.DisableTableHandler(130): Attempting to disable table testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:01:47,933 INFO  [Thread-333] zookeeper.RecoverableZooKeeper(121): Process identifier=catalogtracker-on-hconnection-0x60cf9dfa connecting to ZooKeeper ensemble=localhost:59024
      2015-01-13 08:01:47,933 DEBUG [Thread-333] catalog.CatalogTracker(198): Starting catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker@610b218c
      2015-01-13 08:01:47,935 DEBUG [Thread-333-EventThread] zookeeper.ZooKeeperWatcher(313): catalogtracker-on-hconnection-0x60cf9dfa0x0, quorum=localhost:59024, baseZNode=/hbase Received ZooKeeper Event, type=None, state=SyncConnected, path=null
      2015-01-13 08:01:47,935 DEBUG [Thread-333] zookeeper.ZKUtil(430): catalogtracker-on-hconnection-0x60cf9dfa0x0, quorum=localhost:59024, baseZNode=/hbase Set watcher on existing znode=/hbase/meta-region-server
      2015-01-13 08:01:47,936 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.DisableTableHandler(190): Disabled table, testSSHCleanupDaugtherRegionsOfAbortedSplit, is done=true
      2015-01-13 08:01:47,936 DEBUG [Thread-333-EventThread] zookeeper.ZooKeeperWatcher(390): catalogtracker-on-hconnection-0x60cf9dfa-0x14ae2500c300017 connected
      2015-01-13 08:01:47,938 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] lock.ZKInterProcessLockBase(328): Released /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000001
      2015-01-13 08:01:47,941 DEBUG [Thread-333] catalog.CatalogTracker(222): Stopping catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker@610b218c
      2015-01-13 08:01:47,945 INFO  [Thread-333] client.HBaseAdmin(999): Disabled testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:01:47,946 INFO  [FifoRpcScheduler.handler1-thread-2] master.HMaster(2047): Client=apurtell//10.40.8.95 delete testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:01:47,948 DEBUG [FifoRpcScheduler.handler1-thread-2] lock.ZKInterProcessLockBase(226): Acquired a lock for /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000002
      2015-01-13 08:01:47,953 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.TableEventHandler(123): Handling table operation C_M_DELETE_TABLE on table testSSHCleanupDaugtherRegionsOfAbortedSplit
      2015-01-13 08:01:47,958 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.DeleteTableHandler(93): Deleting regions from META
      2015-01-13 08:01:47,962 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] catalog.MetaEditor(496): Deleted [{ENCODED => bc81e04ede70636e30fd1224df8bc70b, NAME => 'testSSHCleanupDaugtherRegionsOfAbortedSplit,,1421136107549.bc81e04ede70636e30fd1224df8bc70b.', STARTKEY => '', ENDKEY => ''}]
      2015-01-13 08:01:47,964 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.DeleteTableHandler(104): Archiving region testSSHCleanupDaugtherRegionsOfAbortedSplit,,1421136107549.bc81e04ede70636e30fd1224df8bc70b. from FS
      2015-01-13 08:01:47,965 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] backup.HFileArchiver(93): ARCHIVING hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b
      2015-01-13 08:01:47,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] backup.HFileArchiver(134): Archiving [class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, file:hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f]
      2015-01-13 08:01:47,978 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] backup.HFileArchiver(438): Finished archiving from class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, file:hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f/9b6aaa5d9dd44f3ab113a32c029d50ae, to hdfs://localhost:43493/user/apurtell/hbase/archive/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f/9b6aaa5d9dd44f3ab113a32c029d50ae
      2015-01-13 08:01:47,979 INFO  [IPC Server handler 6 on 43493] blockmanagement.BlockManager(1009): BLOCK* addToInvalidates: blk_1073741839_1015 127.0.0.1:49933 127.0.0.1:55797 127.0.0.1:57090 
      2015-01-13 08:01:47,979 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] backup.HFileArchiver(453): Deleted all region files in: hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b
      2015-01-13 08:01:47,980 INFO  [IPC Server handler 2 on 43493] blockmanagement.BlockManager(1009): BLOCK* addToInvalidates: blk_1073741838_1014 127.0.0.1:57090 127.0.0.1:49933 127.0.0.1:55797 
      2015-01-13 08:01:47,980 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.DeleteTableHandler(114): Table 'testSSHCleanupDaugtherRegionsOfAbortedSplit' archived!
      2015-01-13 08:01:47,980 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.DeleteTableHandler(117): Removing 'testSSHCleanupDaugtherRegionsOfAbortedSplit' descriptor.
      2015-01-13 08:01:47,981 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.DeleteTableHandler(121): Removing 'testSSHCleanupDaugtherRegionsOfAbortedSplit' from region states.
      2015-01-13 08:01:47,981 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.DeleteTableHandler(125): Marking 'testSSHCleanupDaugtherRegionsOfAbortedSplit' as deleted.
      2015-01-13 08:01:47,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] lock.ZKInterProcessLockBase(328): Released /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000002
      

      The first interesting line missing in test.log.bad is:

      2015-01-13 08:01:47,932 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] handler.DisableTableHandler(130): Attempting to disable table testSSHCleanupDaugtherRegionsOfAbortedSplit
      

      Nowhere in test.log.bad does "Attempting to disable table testSSHCleanupDaugtherRegionsOfAbortedSplit" appear.

      Attachments

        1. test.log.bad.gz
          25 kB
          Andrew Kyle Purtell
        2. test.log.good.gz
          118 kB
          Andrew Kyle Purtell
        3. HBASE-12847.patch
          1 kB
          Rajeshbabu Chintaguntla
        4. HBASE-12847_branch-1.patch
          1 kB
          Rajeshbabu Chintaguntla
        5. HBASE-12847_98.patch
          1 kB
          Rajeshbabu Chintaguntla

        Activity

          People

            rajeshbabu Rajeshbabu Chintaguntla
            apurtell Andrew Kyle Purtell
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: