  1. Hadoop HDFS
  2. HDFS-17392

NameNode rolls frequently with "EC replicas to be deleted are not in the candidate" error


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.6
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

    Description

      I recently upgraded my clusters from Hadoop v3.3.4 to v3.3.6 and have since seen significant NameNode instability. After about 1 hour, the active NameNode shuts down and the "next" one takes over.

      Looking into the shutdown reasons, I'm seeing errors similar to the following:

      2024-02-20 12:05:37,352 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 8 msecs. 6639943 blocks are left. 1 blocks were removed.
      2024-02-20 12:05:37,352 ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor thread received Runtime exception.
      java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list
          at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
          at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancyStriped(BlockManager.java:4082)
          at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancies(BlockManager.java:3970)
          at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processExtraRedundancyBlock(BlockManager.java:3957)
          at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatedBlock(BlockManager.java:3898)
          at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.rescanPostponedMisreplicatedBlocks(BlockManager.java:2898)
          at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:5053)
          at java.lang.Thread.run(Thread.java:750)
      2024-02-20 12:05:37,357 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list 

      Looking through the code path itself, there is a `Preconditions.checkArgument()` check that ensures the replica chosen for deletion is actually in the candidate list. If it is not, the resulting `IllegalArgumentException` reaches the RedundancyMonitor thread and the NN shuts down with exit status 1.
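To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the actual `BlockManager` code: the class `CandidateCheckSketch`, the methods `chooseForDeletion` and `monitorStep`, and the `dn1`..`dn9` names are all hypothetical, and `checkArgument` is a local stand-in for the Guava method so the example runs without dependencies. It only shows how a failed argument check becomes a fatal condition for the monitoring loop.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the real BlockManager implementation. */
public class CandidateCheckSketch {

    // Local stand-in for Guava's Preconditions.checkArgument(boolean, Object):
    // throws IllegalArgumentException when the condition is false.
    static void checkArgument(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    // Hypothetical helper: the replica picked for deletion must be
    // one of the candidates, otherwise the check throws.
    static String chooseForDeletion(List<String> candidates, String chosen) {
        checkArgument(candidates.contains(chosen),
            "The EC replicas to be deleted are not in the candidate list");
        return chosen;
    }

    // Hypothetical monitor step: any RuntimeException is treated as fatal.
    // Returns the status the process would exit with (0 means keep running)
    // instead of calling System.exit, so the sketch stays testable.
    static int monitorStep(List<String> candidates, String chosen) {
        try {
            chooseForDeletion(candidates, chosen);
            return 0;
        } catch (RuntimeException e) {
            System.err.println("Monitor thread received Runtime exception. " + e);
            return 1;
        }
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("dn1", "dn2", "dn3");
        System.out.println(monitorStep(candidates, "dn2")); // chosen replica is a candidate
        System.out.println(monitorStep(candidates, "dn9")); // chosen replica is not a candidate
    }
}
```

In this sketch the second call returns 1, mirroring the "Exiting with status 1" line in the log. The open question in this report is why the real selection logic would ever produce a replica outside the candidate list in the first place.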

      This is likely a symptom of a larger issue: how is a block being chosen that is not in the candidate list?

      Services such as SPS and the Balancer are disabled on the cluster, so the only data movement should be whatever the NameNode chooses "organically".

          People

            Assignee: Unassigned
            Reporter: Rick Weber (tsetem)
            Votes: 1
            Watchers: 3
