HBase
HBASE-28428

Zookeeper ConnectionRegistry APIs should have timeout


Details

    • Reviewed

    Description

      Came across a couple of instances where an active master failover happens around the same time as a Zookeeper leader failover, leading to a stuck HBase client if one of its threads is blocked on one of the ConnectionRegistry rpc calls.
      ConnectionRegistry APIs are wrapped with CompletableFuture. However, their usages do not have any timeouts, which can potentially leave the entire client stuck indefinitely because we take some global locks. For instance, getKeepAliveMasterService() takes
      masterLock, hence if retrieving the active master from masterAddressZNode gets stuck, we can block any admin operation that needs getKeepAliveMasterService().
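
      To make the failure mode concrete, here is a minimal, self-contained Java sketch of the pattern described above, shown before the stacktrace below. The Registry interface, the masterLock field, and the resolveMaster* methods are simplified stand-ins for illustration, not the actual HBase client classes; only the use of CompletableFuture.get() with and without a timeout mirrors the real usage.

      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      // Simplified stand-in for the ConnectionRegistry pattern described above;
      // not the actual HBase interface.
      public class RegistryBlockingSketch {

        interface Registry {
          CompletableFuture<String> getActiveMaster();   // backed by a ZK read in the real client
        }

        private final Object masterLock = new Object();  // stands in for the global masterLock

        // Current pattern: an unbounded get() while holding the lock.
        // If the ZooKeeper read never completes, every caller that needs the
        // lock is parked behind this thread indefinitely.
        String resolveMasterUnbounded(Registry registry) throws Exception {
          synchronized (masterLock) {
            return registry.getActiveMaster().get();     // can block forever
          }
        }

        // Bounded variant: the same call, but the wait is capped so the lock
        // is released and the caller sees a timeout instead of hanging.
        String resolveMasterBounded(Registry registry, long timeoutMs) throws Exception {
          synchronized (masterLock) {
            try {
              return registry.getActiveMaster().get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
              throw new java.io.IOException(
                  "Registry lookup did not complete within " + timeoutMs + " ms", e);
            }
          }
        }
      }

      The unbounded variant is what produces stacktraces like the one below; the bounded variant would release masterLock after the timeout so other admin operations could proceed.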
       
      Sample stacktrace that blocked all client operations that required a table descriptor from Admin:

      jdk.internal.misc.Unsafe.park
      java.util.concurrent.locks.LockSupport.park
      java.util.concurrent.CompletableFuture$Signaller.block
      java.util.concurrent.ForkJoinPool.managedBlock
      java.util.concurrent.CompletableFuture.waitingGet
      java.util.concurrent.CompletableFuture.get
      org.apache.hadoop.hbase.client.ConnectionImplementation.get
      org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
      org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
      org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
      org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
      org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
      org.apache.hadoop.hbase.client.MasterCallable.prepare
      org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
      org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
      org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
      org.apache.hadoop.hbase.client.HTable.getDescriptor
      org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
      org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
      org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
      org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
      org.apache.phoenix.execute.MutationState.sendBatch
      org.apache.phoenix.execute.MutationState.send
      org.apache.phoenix.execute.MutationState.send
      org.apache.phoenix.execute.MutationState.commit
      org.apache.phoenix.jdbc.PhoenixConnection$?.call
      org.apache.phoenix.jdbc.PhoenixConnection$?.call
      org.apache.phoenix.call.CallRunner.run
      org.apache.phoenix.jdbc.PhoenixConnection.commit 

      Another similar incident is captured in PHOENIX-7233. In this case, retrieving the clusterId from the ZNode got stuck, which blocked the client from creating any more HBase Connections. Stacktrace for reference:

      jdk.internal.misc.Unsafe.park
      java.util.concurrent.locks.LockSupport.park
      java.util.concurrent.CompletableFuture$Signaller.block
      java.util.concurrent.ForkJoinPool.managedBlock
      java.util.concurrent.CompletableFuture.waitingGet
      java.util.concurrent.CompletableFuture.get
      org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
      org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
      jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
      jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
      jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
      java.lang.reflect.Constructor.newInstance
      org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
      org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
      java.security.AccessController.doPrivileged
      javax.security.auth.Subject.doAs
      org.apache.hadoop.security.UserGroupInformation.doAs
      org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
      org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
      org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
      org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
      org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
      org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
      org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
      org.apache.phoenix.util.PhoenixContextExecutor.call
      org.apache.phoenix.query.ConnectionQueryServicesImpl.init
      org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
      org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
      org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
      org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
      org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
      org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
      org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply  

      We should provide a configurable timeout for all ConnectionRegistry APIs.
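
      One possible shape for the fix, sketched under assumptions: the configuration key hbase.client.registry.call.timeout.ms, the 10-second default, and the RegistryFutureUtil helper name are all hypothetical and only illustrate the idea of funnelling every ConnectionRegistry future through a single bounded wait.

      import java.io.IOException;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      // Sketch of a single helper that every ConnectionRegistry call site could go
      // through, so the timeout is applied uniformly and controlled by one setting.
      public final class RegistryFutureUtil {

        // Hypothetical configuration key and default; the real names would be
        // decided in the patch.
        public static final String REGISTRY_CALL_TIMEOUT_KEY =
            "hbase.client.registry.call.timeout.ms";
        public static final long DEFAULT_REGISTRY_CALL_TIMEOUT_MS = 10_000L;

        private RegistryFutureUtil() {
        }

        /**
         * Waits on a registry future for at most timeoutMs milliseconds and converts
         * a timeout or interrupt into an IOException, so callers holding global locks
         * fail fast instead of parking forever.
         */
        public static <T> T get(CompletableFuture<T> future, long timeoutMs, String what)
            throws IOException {
          try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
          } catch (TimeoutException e) {
            future.cancel(true);
            throw new IOException(what + " did not complete within " + timeoutMs + " ms", e);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for " + what, e);
          } catch (ExecutionException e) {
            throw new IOException(what + " failed", e.getCause());
          }
        }
      }

      Call sites such as retrieveClusterId() and getKeepAliveMasterService() could then wait via something like RegistryFutureUtil.get(registry.getClusterId(), timeoutMs, "getClusterId"), with timeoutMs read from the client configuration, covering both incidents above.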

            People

              divneet18 Divneet Kaur
              vjasani Viraj Jasani