Uploaded image for project: 'Geode'
  1. Geode
  2. GEODE-4990

Race Condition When Using Cluster Configuration Service

    XMLWordPrintableJSON

Details

    Description

      There's a small window of time on which the locator (enable-cluster-configuration=true) is receiving TCP requests from starting up servers (use-cluster-configuration=true) but the cluster configuration service is not yet fully loaded, because of this the servers fail to startup and throw the following exception:

      Exception in thread "main" org.apache.geode.cache.execute.FunctionException: java.lang.NullPointerException
      	at org.apache.geode.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:232)
      	at org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:285)
      	at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1036)
      	at org.apache.geode.internal.cache.GemFireCacheImpl.<init>(GemFireCacheImpl.java:835)
      	at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:775)
      	at org.apache.geode.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:764)
      	at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:175)
      	at org.apache.geode.cache.CacheFactory.create(CacheFactory.java:222)
      	at org.apache.geode.distributed.internal.DefaultServerLauncherCacheProvider.createCache(DefaultServerLauncherCacheProvider.java:52)
      	at org.apache.geode.distributed.ServerLauncher.createCache(ServerLauncher.java:844)
      	at org.apache.geode.distributed.ServerLauncher.start(ServerLauncher.java:762)
      	at org.apache.geode.distributed.ServerLauncher.run(ServerLauncher.java:692)
      	at org.apache.geode.distributed.ServerLauncher.main(ServerLauncher.java:226)
      Caused by: java.lang.NullPointerException
      	at org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55)
      	at org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186)
      	at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
      	at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099)
      	at org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108)
      	at org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970)
      	at java.lang.Thread.run(Thread.java:748)
      

      The logs from the locator, when using log-level=fine show the following:

      [fine 2018/04/03 15:24:40.501 IST locator1 <Function Execution Processor1> tid=0x51] Exception occurred on remote member while executing Function: org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction
      java.lang.NullPointerException
      	at org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55)
      	at org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186)
      	at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
      	at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099)
      	at org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108)
      	at org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970)
      	at java.lang.Thread.run(Thread.java:748)
      
      [fine 2018/04/03 15:24:40.502 IST locator1 <Function Execution Processor1> tid=0x51] Replying with exception: ReplyMessage processorId=2 from null with exception org.apache.geode.distributed.internal.ReplyException: java.lang.NullPointerException
      org.apache.geode.distributed.internal.ReplyException: java.lang.NullPointerException
      	at org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:216)
      	at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
      	at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:1099)
      	at org.apache.geode.distributed.internal.DistributionManager.access$000(DistributionManager.java:108)
      	at org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:970)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: java.lang.NullPointerException
      	at org.apache.geode.management.internal.configuration.functions.GetClusterConfigurationFunction.execute(GetClusterConfigurationFunction.java:55)
      	at org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:186)
      	... 8 more
      

      It doesn't matter wether the user has configured a value for the parameter locator-wait-time or not, the value is only used when establishing the TCP connection and, since the locator is already running, the server continues the startup process and tries to get the cluster configuration service, failing afterwards due to the NullPointerException.
      The problematic code can be seen below:

      GetClusterConfigurationFunction.java
      public class GetClusterConfigurationFunction implements InternalFunction {
        private static final Logger logger = LogService.getLogger();
      
        @Override
        public void execute(FunctionContext context) {
          InternalClusterConfigurationService clusterConfigurationService =
              InternalLocator.getLocator().getSharedConfiguration();
      
          Set<String> groups = (Set<String>) context.getArguments();
      
          logger.info("Received request for configuration  : {}", groups);
      
          try {
          // [XXXXX]: clusterConfigurationService is null as it hasn't been loaded yet.
            ConfigurationResponse response =
                clusterConfigurationService.createConfigurationResponse(groups);
            context.getResultSender().lastResult(response);
          } catch (IOException e) {
            logger.error("Unable to retrieve the cluster configuration", e);
            context.getResultSender().lastResult(e);
          }
        }
      }
      

      The servers, on the other hand, already have the logic in place to wait and retry the retrieval of the cluster configuration service when the response is null:

      ClusterConfigurationLoader.java
      public ConfigurationResponse requestConfigurationFromLocators(String groupList,
            Set<InternalDistributedMember> locatorList)
            throws ClusterConfigurationNotAvailableException, UnknownHostException {
      
          Set<String> groups = getGroups(groupList);
          GetClusterConfigurationFunction function = new GetClusterConfigurationFunction();
      
          ConfigurationResponse response = null;
      
          int attempts = 6;
          OUTER: while (attempts > 0) {
            for (InternalDistributedMember locator : locatorList) {
              logger.info("Attempting to retrieve cluster configuration from {} - {} attempts remaining",
                  locator.getName(), attempts);
              response = requestConfigurationFromOneLocator(locator, groups);
              if (response != null) {
                break OUTER;
              }
            }
      
            try {
              Thread.sleep(10000);
            } catch (InterruptedException ex) {
              break;
            }
      
            attempts--;
          }
      
          // if the response is null
          if (response == null) {
            throw new ClusterConfigurationNotAvailableException(
                "Unable to retrieve cluster configuration from the locator.");
          }
      
          return response;
        }
      

      That said, the fix seems to be quite simple and it implies modifying GetClusterConfigurationFunction to return null whenever the cluster configuration is enabled but not fully loaded yet, and throw an exception whenever the cluster configuration service is not enabled on the locator (allowing us to fail fast when the server is requesting the cluster configuration from a locator that doesn't have it enabled).

      Attachments

        Issue Links

          Activity

            People

              jjramos Juan Ramos
              jjramos Juan Ramos
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 3h 50m
                  3h 50m