GEODE-9808

Client ops fail with NoAvailableLocatorsException when all servers leave the DS


Details

    Description

      When there are no cache servers (only locators) in a cluster, client operations will fail with a misleading exception:

      org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334]
          at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:174)
          at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:211)
          at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:196)
          at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:227)
          at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.exchangeConnection(ConnectionManagerImpl.java:365)
          at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:161)
          at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:120)
          at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:805)
          at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91)
      

      Even though the client is able to connect to a locator, we encounter a NoAvailableLocatorsException with the message "Unable to connect to any locators in the list".

      Investigating the product code we see:

      1. If there are no cache servers in the cluster, ServerLocator.pickServer() will construct a ClientConnectionResponse(null), so that object's hasResult() returns false at the loop-termination check in AutoConnectionSourceImpl.queryLocators().
      2. The exception wording is misleading not only in AutoConnectionSourceImpl.findServer() but also in at least two other callers in AutoConnectionSourceImpl: findReplacementServer() and findServersForQueue().
      3. In each of those cases, the calling method translates a null response from queryLocators() into a thrown NoAvailableLocatorsException.
      4. An appropriate exception, NoAvailableServersException, already exists for the case where we were able to contact a locator but the locator was not able to find any cache servers.
      5. According to my Git spelunking, queryLocators() has been obscuring the true cause of the failure since at least 2015.

      Without analyzing ServerLocator.pickServer() (LocatorLoadSnapshot.getServerForConnection()) to discern why two locators might disagree on how many cache servers are in the cluster, it seems to me that we should modify AutoConnectionSourceImpl.queryLocators() so that:

      • if it gets a ServerLocationResponse with hasResult() true, it immediately returns that as it does now
      • otherwise it keeps trying and it keeps track of the last (non-null) ServerLocationResponse it has received
      • if no response had a result, it returns the last non-null ServerLocationResponse it received, or null if no locator responded at all
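
      The loop proposed above could look roughly like the following. This is a minimal sketch, not the Geode source: LocatorResponse is a hypothetical stand-in for ServerLocationResponse that models only hasResult(), and the ask function is an assumed hook that returns null when a locator cannot be contacted.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for Geode's ServerLocationResponse; only hasResult() is modeled.
interface LocatorResponse {
    boolean hasResult();
}

final class QueryLocatorsSketch {
    /**
     * Asks each locator in turn. The `ask` hook is an assumption of this sketch:
     * it returns null when the locator could not be contacted.
     */
    static <L> LocatorResponse queryLocators(List<L> locators, Function<L, LocatorResponse> ask) {
        LocatorResponse lastResponse = null;
        for (L locator : locators) {
            LocatorResponse response = ask.apply(locator);
            if (response == null) {
                continue; // locator unreachable; try the next one
            }
            if (response.hasResult()) {
                return response; // a server was found; return immediately, as today
            }
            lastResponse = response; // locator answered, but it knows of no servers
        }
        // null only if no locator answered at all
        return lastResponse;
    }
}
```

      With this shape, a null return means "no locator responded", while a non-null response whose hasResult() is false means "a locator responded but has no servers", so callers can tell the two failures apart.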

      With that in hand, we can change the three call locations in AutoConnectionSourceImpl (findServer(), findReplacementServer(), and findServersForQueue()) so that each throws NoAvailableLocatorsException if no locator responded, or NoAvailableServersException if a locator responded with a ClientConnectionResponse for which hasResult() returns false.
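
      The caller-side translation might be sketched as below. Everything here is illustrative: the two exception classes are local stand-ins so the example compiles without Geode on the classpath, ResponseStub and requireServer() are hypothetical names, and the NoAvailableServersException message text is invented for the example.

```java
// Local stand-ins for the Geode exceptions of the same simple names.
class NoAvailableLocatorsException extends RuntimeException {
    NoAvailableLocatorsException(String message) { super(message); }
}

class NoAvailableServersException extends RuntimeException {
    NoAvailableServersException(String message) { super(message); }
}

// Hypothetical stand-in for a locator's reply; only hasResult() is modeled.
interface ResponseStub {
    boolean hasResult();
}

final class FindServerSketch {
    /**
     * Hypothetical helper that the three callers could share: it maps the
     * outcome of queryLocators() onto the exception that names the real cause.
     */
    static ResponseStub requireServer(ResponseStub response, Object locators) {
        if (response == null) {
            // No locator answered at all.
            throw new NoAvailableLocatorsException(
                "Unable to connect to any locators in the list " + locators);
        }
        if (!response.hasResult()) {
            // A locator answered, but it has no cache servers to offer.
            throw new NoAvailableServersException(
                "Locators responded but found no cache servers"); // illustrative wording
        }
        return response;
    }
}
```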


            People

              donalevans Donal Evans
              burcham Bill Burcham