Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Version: 1.15.0
Description
When there are no cache servers (only locators) in a cluster, client operations will fail with a misleading exception:
org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334]
    at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:174)
    at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:211)
    at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:196)
    at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:227)
    at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.exchangeConnection(ConnectionManagerImpl.java:365)
    at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:161)
    at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:120)
    at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:805)
    at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91)
Even though the client is able to connect to a locator, we encounter a NoAvailableLocatorsException with the message "Unable to connect to any locators in the list".
Investigating the product code, we see:
- If there are no cache servers in the cluster, ServerLocator.pickServer() always constructs a ClientConnectionResponse(null), which causes that object's hasResult() to return false in the loop termination condition in AutoConnectionSourceImpl.queryLocators().
- The exception wording is misleading not only in AutoConnectionSourceImpl.findServer() but also in at least two other call sites in AutoConnectionSourceImpl: findReplacementServer() and findServersForQueue().
- In each of those cases the calling method translates a null response from queryLocators() into a throw of a NoAvailableLocatorsException.
- An appropriate exception, NoAvailableServersException, already exists for the case where we were able to contact a locator but the locator was not able to find any cache servers.
- According to my Git spelunking, queryLocators() has been obfuscating the true cause of the failure since at least 2015.
Without analyzing ServerLocator.pickServer() (LocatorLoadSnapshot.getServerForConnection()) to discern why two locators might disagree on how many cache servers are in the cluster, it seems to me that we should modify AutoConnectionSourceImpl.queryLocators() so that:
- if it gets a ServerLocationResponse with hasResult() true, it immediately returns that as it does now
- otherwise it keeps trying and it keeps track of the last (non-null) ServerLocationResponse it has received
- if no response has a result, it returns the last non-null ServerLocationResponse it received (or null if no locator responded at all)
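The proposed loop can be sketched roughly as follows. This is a minimal illustration, not the actual Geode code: the Response class is a hypothetical stand-in for ServerLocationResponse, and the askLocator hook (which returns null for an unreachable locator) is an assumption made so the example is self-contained.

```java
import java.util.List;
import java.util.function.Function;

class QueryLocatorsSketch {
    /** Hypothetical stand-in for Geode's ServerLocationResponse. */
    static class Response {
        private final String server; // null when the locator knows of no cache servers

        Response(String server) {
            this.server = server;
        }

        boolean hasResult() {
            return server != null;
        }
    }

    /**
     * Asks each locator in turn. The askLocator hook is hypothetical: it returns
     * null when a locator cannot be reached.
     */
    static Response queryLocators(List<String> locators, Function<String, Response> askLocator) {
        Response lastResponse = null;
        for (String locator : locators) {
            Response response = askLocator.apply(locator);
            if (response == null) {
                continue; // unreachable locator, try the next one
            }
            if (response.hasResult()) {
                return response; // unchanged behavior: first usable answer wins
            }
            lastResponse = response; // locator answered but had no server to offer
        }
        return lastResponse; // null only when no locator responded at all
    }

    public static void main(String[] args) {
        // First locator is unreachable, the second answers with no servers.
        Response r = queryLocators(List.of("locator-0", "locator-1"),
                name -> name.equals("locator-0") ? null : new Response(null));
        System.out.println("locator responded: " + (r != null)
                + ", has server: " + (r != null && r.hasResult()));
    }
}
```

The point of keeping lastResponse is that the caller can then tell "no locator answered" (null) apart from "a locator answered but had no servers" (non-null, hasResult() false).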
With that in hand, we can change the three call locations in AutoConnectionSourceImpl (findServer(), findReplacementServer(), and findServersForQueue()) to each throw NoAvailableLocatorsException if no locator responded, or NoAvailableServersException if a locator responded with a ClientConnectionResponse for which hasResult() returns false.
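The caller-side translation might look something like the sketch below. The two exception classes here are local stand-ins that only borrow the names of the real Geode exceptions, the Response class is hypothetical, and the NoAvailableServersException message wording is my own assumption.

```java
class FindServerSketch {
    /** Local stand-in named after Geode's NoAvailableLocatorsException. */
    static class NoAvailableLocatorsException extends RuntimeException {
        NoAvailableLocatorsException(String message) {
            super(message);
        }
    }

    /** Local stand-in named after Geode's NoAvailableServersException. */
    static class NoAvailableServersException extends RuntimeException {
        NoAvailableServersException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for a ClientConnectionResponse. */
    static class Response {
        private final String server; // null when the locator found no cache servers

        Response(String server) {
            this.server = server;
        }

        boolean hasResult() {
            return server != null;
        }

        String getServer() {
            return server;
        }
    }

    /** Translates the outcome of queryLocators() into a server or the matching exception. */
    static String findServer(Response response) {
        if (response == null) {
            // No locator responded at all.
            throw new NoAvailableLocatorsException("Unable to connect to any locators in the list");
        }
        if (!response.hasResult()) {
            // A locator responded, but it knows of no cache servers (message text is assumed).
            throw new NoAvailableServersException("Locator responded but found no cache servers");
        }
        return response.getServer();
    }

    public static void main(String[] args) {
        try {
            findServer(new Response(null));
        } catch (NoAvailableServersException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

findReplacementServer() and findServersForQueue() would presumably apply the same two checks to their own response types.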