Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.6
Fix Version/s: None
Component/s: None
Description
I recently upgraded my clusters from Hadoop v3.3.4 to Hadoop v3.3.6 and have since seen a lot of NameNode instability. After about 1 hour, the active NameNode shuts down and the "next" one takes over.
Looking into the shutdown reasons, I'm seeing errors similar to the following:
2024-02-20 12:05:37,352 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 8 msecs. 6639943 blocks are left. 1 blocks were removed.
2024-02-20 12:05:37,352 ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor thread received Runtime exception.
java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list
    at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancyStriped(BlockManager.java:4082)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancies(BlockManager.java:3970)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processExtraRedundancyBlock(BlockManager.java:3957)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatedBlock(BlockManager.java:3898)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.rescanPostponedMisreplicatedBlocks(BlockManager.java:2898)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:5053)
    at java.lang.Thread.run(Thread.java:750)
2024-02-20 12:05:37,357 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list
Looking through the code path itself, `chooseExcessRedundancyStriped` calls `Preconditions.checkArgument()` to ensure that a block chosen for deletion is actually one of the valid candidates. If it is not, the resulting `IllegalArgumentException` reaches the RedundancyMonitor thread and the NN shuts down.
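To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the actual `BlockManager` code: the class name, the `chooseForDeletion` and `runMonitorIteration` helpers, and the `dnN` identifiers are all hypothetical, and a local `checkArgument` stands in for the Guava method so the example runs without Hadoop on the classpath. It only shows how a chosen replica that is missing from the candidate list turns into the exception seen in the log.

```java
import java.util.List;

public class ExcessReplicaCheckSketch {

    // Local stand-in for Guava's Preconditions.checkArgument(boolean, Object).
    static void checkArgument(boolean expression, String message) {
        if (!expression) {
            throw new IllegalArgumentException(message);
        }
    }

    // Hypothetical helper: accept the chosen replica only if it is a candidate.
    static String chooseForDeletion(List<String> candidates, String chosen) {
        checkArgument(candidates.contains(chosen),
            "The EC replicas to be deleted are not in the candidate list");
        return chosen;
    }

    // Hypothetical helper: one monitor iteration. The real NameNode exits on
    // the runtime exception; here we only report what would have happened.
    static String runMonitorIteration(List<String> candidates, String chosen) {
        try {
            return "deleted " + chooseForDeletion(candidates, chosen);
        } catch (RuntimeException e) {
            return "would exit: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("dn1", "dn2", "dn3");
        // Chosen replica is a valid candidate: the check passes.
        System.out.println(runMonitorIteration(candidates, "dn2"));
        // Chosen replica is not a candidate: the check throws.
        System.out.println(runMonitorIteration(candidates, "dn9"));
    }
}
```

The sketch says nothing about why the chosen replica is missing from the candidate list; it only makes clear that a single inconsistent block is enough to take down the whole process once the exception is not handled per block.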
This is likely a symptom of a larger issue: how is a block being chosen for deletion when it is not in the candidate list?
The rest of the cluster has services such as SPS and the Balancer disabled, so the only data movement should be whatever the NameNode chooses "organically".