Description
Recently we found a postmortem case where the ANN appears to be stuck in an infinite loop. From the logs, it had just gone through a rolling restart, and DNs were being registered.
Later the NN became unresponsive, and the stacktrace shows it is inside a do-while loop in NetworkTopology#chooseRandom, part of the change made in HDFS-10320.
Going through the code and logs I have not been able to come up with a theory for why this is happening (I considered incorrect locking and the Node object being modified outside of NetworkTopology, but both seem impossible), but we should eliminate this loop regardless.
stacktrace:
Stack: java.util.HashMap.hash(HashMap.java:338)
java.util.HashMap.containsKey(HashMap.java:595)
java.util.HashSet.contains(HashSet.java:203)
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
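For context, here is a minimal sketch of the retry pattern the do-while loop follows. This is not the actual Hadoop source; the class and method names are hypothetical and only illustrate how such a loop can spin forever once every candidate leaf ends up in the excluded set, which matches the HashSet.contains frame at NetworkTopology.java:786 in the stack above.

import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/**
 * Simplified illustration (not the real NetworkTopology code) of a
 * "pick a random leaf, retry while it is excluded" loop. If the
 * topology's view of available nodes and the excluded set ever
 * disagree so that every remaining leaf is excluded, the loop never
 * exits and the caller hangs, as described in this report.
 */
public class ChooseRandomSketch {

  private static final Random RANDOM = new Random();

  // Hypothetical stand-in for the inner loop of chooseRandom.
  static String chooseRandom(List<String> leaves, Set<String> excludedNodes) {
    String ret;
    do {
      // Pick a random leaf from the scope.
      ret = leaves.get(RANDOM.nextInt(leaves.size()));
      // Retry while the chosen node is excluded; this contains() call is
      // the analogue of the frame at NetworkTopology.java:786.
    } while (excludedNodes.contains(ret));
    return ret;
  }

  public static void main(String[] args) {
    List<String> leaves = List.of("dn1", "dn2", "dn3");

    // Normal case: terminates quickly.
    System.out.println(chooseRandom(leaves, new HashSet<>(Set.of("dn1"))));

    // Pathological case: every leaf is excluded, so the do-while never
    // exits. Uncommenting this call reproduces the hang.
    // chooseRandom(leaves, new HashSet<>(Set.of("dn1", "dn2", "dn3")));
  }
}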
Attachments
Issue Links
- breaks
  - HADOOP-16385 Namenode crashes with "RedundancyMonitor thread received Runtime exception" (Resolved)
- is broken by
  - HDFS-10320 Rack failures may result in NN terminate (Resolved)
- is related to
  - HDFS-14999 Avoid Potential Infinite Loop in DFSNetworkTopology (Resolved)