  1. Apache Ozone
  2. HDDS-5444 Ozone Upgrades vNext
  3. HDDS-5514

Skip check for UNHEALTHY containers for datanode finalize.


Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed

    Description

      Here is a log that we got from a non-rolling upgrade:

      local/master(0766d2cd23afb29f0eb42cf95b09d3d2984c14fa) -> upstream/master(57d42b12d3b6451e2ac8519780e82993ecce3611)

      2021-07-27 20:49:48,491 [Command processor thread] INFO org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: Finalization started.
      2021-07-27 20:49:48,502 [Command processor thread] WARN org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: FinalizeUpgrade : Waiting for container to close, current state is: UNHEALTHY
      2021-07-27 20:49:48,503 [Command processor thread] INFO org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: Pre Finalization checks failed on the DataNode.
      2021-07-27 20:49:48,503 [Command processor thread] WARN org.apache.hadoop.ozone.upgrade.DefaultUpgradeFinalizationExecutor: Upgrade Finalization failed with following Exception.
      PREFINALIZE_VALIDATION_FAILED org.apache.hadoop.ozone.upgrade.UpgradeException: Pre Finalization checks failed on the DataNode.
              at org.apache.hadoop.ozone.container.upgrade.DataNodeUpgradeFinalizer.preFinalizeUpgrade(DataNodeUpgradeFinalizer.java:55)
              at org.apache.hadoop.ozone.container.upgrade.DataNodeUpgradeFinalizer.preFinalizeUpgrade(DataNodeUpgradeFinalizer.java:39)
              at org.apache.hadoop.ozone.upgrade.DefaultUpgradeFinalizationExecutor.execute(DefaultUpgradeFinalizationExecutor.java:48)
              at org.apache.hadoop.ozone.upgrade.BasicUpgradeFinalizer.finalize(BasicUpgradeFinalizer.java:75)
              at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.finalizeUpgrade(DatanodeStateMachine.java:622)
              at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler.handle(FinalizeNewLayoutVersionCommandHandler.java:78)
              at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:99)
              at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$2(DatanodeStateMachine.java:551)
              at java.lang.Thread.run(Thread.java:748)
      2021-07-27 20:49:48,503 [Command processor thread] INFO org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler: Processing FinalizeNewLayoutVersionCommandHandler command.
      2021-07-27 20:49:48,503 [Command processor thread] INFO org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler: Finalize Upgrade called!
      

      Finalization on the datanode fails if any container is in the OPEN, CLOSING, or UNHEALTHY state:

      // DataNodeUpgradeFinalizer.java
      private boolean canFinalizeDataNode(DatanodeStateMachine dsm) {
        // Lets be sure that we do not have any open container before we return
        // from here. This function should be called in its own finalizer thread
        // context.
        Iterator<Container<?>> containerIt =
            dsm.getContainer().getController().getContainers();
        while (containerIt.hasNext()) {
          Container ctr = containerIt.next();
          ContainerProtos.ContainerDataProto.State state = ctr.getContainerState();
          switch (state) {
          case OPEN:
          case CLOSING:
          case UNHEALTHY:
            LOG.warn("FinalizeUpgrade : Waiting for container to close, current "
                + "state is: {}", state);
            return false;
          default:
            continue;
          }
        }
        return true;
      }
      

      In practice, however, a cluster can have many containers in the UNHEALTHY state. This is the case at least in our deployment, which has about 400,000 containers.

      Not all layout features require every container to be out of the UNHEALTHY state. SCM_HA, for example, and potential features such as merging RocksDB instances on the datanode, do not touch the container layout at all.

      We may also want to complete the non-rolling upgrade first and fix the UNHEALTHY containers afterwards. The replication manager may eventually handle them, but that can take a long time.

      So I suggest adding a flag that makes it possible to turn off the check for UNHEALTHY containers.
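
A minimal, standalone sketch of what such a flag could look like is given here. It is not the actual patch: the `State` enum is a simplified stand-in for `ContainerProtos.ContainerDataProto.State`, and the `skipUnhealthyCheck` parameter is a hypothetical name for the proposed flag, not an existing Ozone configuration key. OPEN and CLOSING containers still block finalization, while UNHEALTHY containers block it only when the flag is off.

```java
import java.util.Arrays;
import java.util.List;

public class FinalizeCheckSketch {

  // Simplified stand-in for ContainerProtos.ContainerDataProto.State;
  // only the states needed for this illustration are listed.
  enum State { OPEN, CLOSING, CLOSED, UNHEALTHY }

  /**
   * Returns true when no container state blocks finalization.
   * skipUnhealthyCheck is a hypothetical flag (assumption, not an
   * existing Ozone option): when true, UNHEALTHY containers are ignored.
   */
  static boolean canFinalize(Iterable<State> containerStates,
      boolean skipUnhealthyCheck) {
    for (State state : containerStates) {
      switch (state) {
      case OPEN:
      case CLOSING:
        // Containers that can still accept writes always block finalization.
        return false;
      case UNHEALTHY:
        if (!skipUnhealthyCheck) {
          return false;
        }
        break;
      default:
        break;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<State> states = Arrays.asList(State.CLOSED, State.UNHEALTHY);
    System.out.println(canFinalize(states, false)); // prints "false"
    System.out.println(canFinalize(states, true));  // prints "true"
  }
}
```

With the flag off, the behavior matches the existing check. With it on, a datanode that has only CLOSED and UNHEALTHY containers would pass the pre-finalization check.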

            People

              markgui Mark Gui