Apache Ozone > HDDS-6462 Phase II : Erasure Coding Offline Recovery & Read/Write Improvements > HDDS-6890

EC: Fix potential wrong replica read with over-replicated container.


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: None

    Description

      We've encountered a potential issue on the read path when we have over-replicated EC container replicas after Offline Recovery.

      Here is the scenario:

      1. 20 DNs, EC 10+4
      2. A background script keeps writing and reading keys (freon ockg & ockv).
      3. A background script randomly kills DNs (< 4 per round), waits for all containers to recover to the expected number of replicas (14), then goes on to the next round.
      4. After 8 DNs in total have been killed, restart all the previously killed DNs.
      5. At this point we potentially have over-replicated containers.
      6. Meanwhile, there are concurrent ockv and container delete commands running in the cluster.
      7. For a short period of several minutes, ockv reports a digest mismatch.
      8. But after all the deletions are done and all containers return to the expected number of replicas (14), the ockv command succeeds again.

      The output of the ockv command is shown below:

      2022-06-10 13:21:27,923 [main] INFO storage.BlockInputStream: Unable to read information for block conID: 12 locID: 109611004723202307 bcsId: 0 from pipeline PipelineID=35db70e6-f362-4b3a-ab98-a89509547455: ContainerID 12 does not exist
      2022-06-10 13:21:28,613 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0 out of 10)
      2022-06-10 13:21:29,614 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0 out of 10)
      2022-06-10 13:21:30,614 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0 out of 10)
      2022-06-10 13:21:31,922 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0 out of 10)
      2022-06-10 13:21:32,922 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0 out of 10)
      2022-06-10 13:21:33,922 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0 out of 10)
      2022-06-10 13:21:33,978 [pool-2-thread-1] ERROR scm.XceiverClientGrpc: Failed to execute command cmdType: GetBlock
      traceID: ""
      containerID: 12
      datanodeUuid: "5b542d9f-5e91-4cc2-af48-af3c33b8c472"
      getBlock {
        blockID {
          containerID: 12
          localID: 109611004723202307
          blockCommitSequenceId: 0
        }
      }
       on the pipeline Pipeline[ Id: 8151bbc7-672c-42d6-9c40-c80ce69d4b3f, Nodes: 5b542d9f-5e91-4cc2-af48-af3c33b8c472{ip: 9.27.142.97, host: 9.27.142.97, ports: [REPLICATION=3096, RATIS=3092, RATIS_ADMIN=3095, RATIS_SERVER=3094, STANDALONE=3093], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2022-06-10T13:21:33.921+08:00[Asia/Shanghai]].
      2022-06-10 13:21:34,014 [pool-2-thread-1] INFO storage.BlockInputStream: Unable to read information for block conID: 12 locID: 109611004723202307 bcsId: 0 from pipeline PipelineID=8151bbc7-672c-42d6-9c40-c80ce69d4b3f: ContainerID 12 does not exist
      2022-06-10 13:21:34,142 [pool-2-thread-5] ERROR scm.XceiverClientGrpc: Failed to execute command cmdType: GetBlock
      traceID: ""
      containerID: 12
      datanodeUuid: "5b542d9f-5e91-4cc2-af48-af3c33b8c472"
      getBlock {
        blockID {
          containerID: 12
          localID: 109611004723202316
          blockCommitSequenceId: 0
        }
      }
       on the pipeline Pipeline[ Id: f421fc76-eebb-4474-8c77-c57944144df6, Nodes: 5b542d9f-5e91-4cc2-af48-af3c33b8c472{ip: 9.27.142.97, host: 9.27.142.97, ports: [REPLICATION=3096, RATIS=3092, RATIS_ADMIN=3095, RATIS_SERVER=3094, STANDALONE=3093], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2022-06-10T13:21:34.074+08:00[Asia/Shanghai]].
      2022-06-10 13:21:34,143 [pool-2-thread-5] INFO storage.BlockInputStream: Unable to read information for block conID: 12 locID: 109611004723202316 bcsId: 0 from pipeline PipelineID=f421fc76-eebb-4474-8c77-c57944144df6: ContainerID 12 does not exist
      2022-06-10 13:21:34,315 [pool-2-thread-7] ERROR scm.XceiverClientGrpc: Failed to execute command cmdType: GetBlock
      traceID: ""
      containerID: 12
      datanodeUuid: "5b542d9f-5e91-4cc2-af48-af3c33b8c472"
      getBlock {
        blockID {
          containerID: 12
          localID: 109611004723202310
          blockCommitSequenceId: 0
        }
      }
       on the pipeline Pipeline[ Id: 21b59767-0ea9-4214-9f56-61fd31d795f2, Nodes: 5b542d9f-5e91-4cc2-af48-af3c33b8c472{ip: 9.27.142.97, host: 9.27.142.97, ports: [REPLICATION=3096, RATIS=3092, RATIS_ADMIN=3095, RATIS_SERVER=3094, STANDALONE=3093], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2022-06-10T13:21:34.215+08:00[Asia/Shanghai]].
      2022-06-10 13:21:34,315 [pool-2-thread-7] INFO storage.BlockInputStream: Unable to read information for block conID: 12 locID: 109611004723202310 bcsId: 0 from pipeline PipelineID=21b59767-0ea9-4214-9f56-61fd31d795f2: ContainerID 12 does not exist
      2022-06-10 13:21:35,402 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0 out of 10)
      2022-06-10 13:21:36,418 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0 out of 10)
      2022-06-10 13:21:37,419 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0 out of 10)
      2022-06-10 13:21:38,001 [pool-2-thread-5] ERROR freon.BaseFreonGenerator: Error on executing task 4
      java.lang.IllegalStateException: Reference (=first) message digest doesn't match with digest of stripe_align_1654804614/4
              at org.apache.hadoop.ozone.freon.OzoneClientKeyValidator.validateDigest(OzoneClientKeyValidator.java:169)
              at org.apache.hadoop.ozone.freon.OzoneClientKeyValidator.validateKey(OzoneClientKeyValidator.java:133)
              at org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:184)
              at org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:164)
              at org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:147)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      2022-06-10 13:21:38,419 [Thread-4] INFO freon.ProgressBar: Progress: 10.00 % (1 out of 10)
      2022-06-10 13:21:39,419 [Thread-4] INFO freon.ProgressBar: Progress: 100.00 % (10 out of 10)
      One ore more freon test is failed.
      2022-06-10 13:21:39,823 [shutdown-hook-0] INFO metrics: type=TIMER, name=key-validate, count=10, min=5918.240321, max=7710.826968, mean=6891.752193143025, stddev=481.39096683740087, median=6798.957973, p75=7302.544457, p95=7710.826968, p98=7710.826968, p99=7710.826968, p999=7710.826968, mean_rate=1.1586365637687632, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/second, duration_unit=milliseconds
      2022-06-10 13:21:39,824 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total execution time (sec): 22
      2022-06-10 13:21:39,824 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Failures: 1
      2022-06-10 13:21:39,824 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Successful executions: 9
       [check_object] size=52428800 prefix=stripe_align_1654804614 fail!!!! 

      Here is my analysis of the problem:

      The key clue is the following log line:

      2022-06-10 13:21:27,923 [main] INFO storage.BlockInputStream: Unable to read information for block conID: 12 locID: 109611004723202307 bcsId: 0 from pipeline PipelineID=35db70e6-f362-4b3a-ab98-a89509547455: ContainerID 12 does not exist 

      This should be caused by a container delete command that was executed on the DN before the ReadChunk request reached it.

      `BlockInputStream` will then try to refresh the pipeline and retry the read, see:

      // BlockInputStream.java
      private void refreshPipeline(IOException cause) throws IOException {
        LOG.info("Unable to read information for block {} from pipeline {}: {}",
            blockID, pipeline.getId(), cause.getMessage());
        if (refreshPipelineFunction != null) {
          LOG.debug("Re-fetching pipeline for block {}", blockID);
          Pipeline newPipeline = refreshPipelineFunction.apply(blockID);
          if (newPipeline == null || newPipeline.sameDatanodes(pipeline)) {
            LOG.warn("No new pipeline for block {}", blockID);
            throw cause;
          } else {
            LOG.debug("New pipeline got for block {}", blockID);
            this.pipeline = newPipeline;
          }
        } else {
          throw cause;
        }
      } 
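
      For context, the refresh function passed to `BlockInputStream` conceptually resolves to a container lookup against SCM. A hedged sketch of that wiring (not the actual Ozone client code; `scmLocationClient` is a hypothetical handle to the StorageContainerLocationProtocol):

      // Hedged sketch, names hypothetical: refreshing means asking SCM for
      // the container's current pipeline via getContainerWithPipelineBatch.
      Function<BlockID, Pipeline> refreshFunction = blockID -> {
        try {
          return scmLocationClient
              .getContainerWithPipelineBatch(
                  Collections.singletonList(blockID.getContainerID()))
              .get(0)
              .getPipeline();
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      };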

      But the pipeline we are trying to refresh here is a STANDALONE pipeline, so when SCM receives this request via the SCM interface `getContainerWithPipelineBatch`, it returns an EC pipeline containing all the DNs of this container, built by `ECPipelineProvider`, see:

      // ECPipelineProvider.java
      @Override
      public Pipeline createForRead(
          ECReplicationConfig replicationConfig,
          Set<ContainerReplica> replicas) {
        Map<DatanodeDetails, Integer> map = new HashMap<>();
        List<DatanodeDetails> dns = new ArrayList<>(replicas.size());
      
        for (ContainerReplica r : replicas) {
          map.put(r.getDatanodeDetails(), r.getReplicaIndex());
          dns.add(r.getDatanodeDetails());
        }
        return createPipelineInternal(replicationConfig, dns, map);
      }
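
      To make this concrete, below is a small self-contained toy model (plain Java, not Ozone code) of what the pipeline built for the retried read looks like once the container is over-replicated: two DNs can report the same replicaIndex (one of them a stale, resurrected copy awaiting deletion), and the order of the node list carries no replica-index meaning at all.

      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;

      public class OverReplicatedReadModel {
        public static void main(String[] args) {
          // datanode -> replicaIndex, as SCM might see it after offline
          // recovery plus restarted DNs: dn5 is a resurrected copy of index 1.
          Map<String, Integer> replicas = new LinkedHashMap<>();
          replicas.put("dn5", 1);   // stale copy, scheduled for deletion
          replicas.put("dn1", 1);
          replicas.put("dn2", 2);
          replicas.put("dn3", 3);

          // Mirrors the createForRead loop: an index map plus a node list.
          Map<String, Integer> indexMap = new HashMap<>(replicas);
          List<String> dns = new ArrayList<>(replicas.keySet());

          // The equivalent of pipeline.getFirstNode(): whichever DN happens to
          // come first; its replicaIndex has no relation to the index the
          // stream was created for, and it may even be the stale copy.
          String firstNode = dns.get(0);
          System.out.println("First node " + firstNode
              + " holds replicaIndex " + indexMap.get(firstNode));
        }
      }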

      So for the retried read, the client picks the first node of this pipeline regardless of the replicaIndex, and it can therefore read chunks belonging to the wrong replica index from that node, which causes the digest mismatch above, see:

      // XceiverClientGrpc.java
      public void connect() throws Exception {
        // connect to the closest node, if closest node doesn't exist, delegate to
        // first node, which is usually the leader in the pipeline.
        DatanodeDetails dn = topologyAwareRead ? this.pipeline.getClosestNode() :
            this.pipeline.getFirstNode();
        // just make a connection to the picked datanode at the beginning
        connectToDatanode(dn);
      }
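
      A hedged sketch of the direction a fix could take (hypothetical helper, not the actual patch): for the retried EC read, pick the datanode whose replicaIndex in the refreshed pipeline matches the index this stream was created for, instead of blindly taking the first node. This assumes the Pipeline exposes the per-node replica index that ECPipelineProvider fills in.

      // Hypothetical helper, for illustration only.
      static DatanodeDetails pickNodeForReplicaIndex(Pipeline pipeline,
          int expectedReplicaIndex) {
        for (DatanodeDetails dn : pipeline.getNodes()) {
          if (pipeline.getReplicaIndex(dn) == expectedReplicaIndex) {
            return dn;
          }
        }
        throw new IllegalStateException("No replica with index "
            + expectedReplicaIndex + " in pipeline " + pipeline.getId());
      }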

       

            People

              Assignee: Mark Gui (markgui)
              Reporter: Mark Gui (markgui)
