Description
We have seen some occurrences of the Namenode getting very slow on delete operations, to the point where incremental block reports (IBRs) get blocked frequently and files fail to close. On one cluster in particular, after about 4 weeks of uptime, the Namenode started responding very poorly. Restarting it corrected the problem for another 4 weeks.
In that example, jstacks taken on the Namenode consistently pointed at slow HDFS delete operations working on the FoldedTreeSet structure; every sample had a FoldedTreeSet operation at the top of the stack:
"IPC Server handler 573 on 8020" #663 daemon prio=5 os_prio=0 tid=0x00007fe6a4087800 nid=0x97a6 runnable [0x00007fe67bdfd000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hdfs.util.FoldedTreeSet.removeAndGet(FoldedTreeSet.java:879) at org.apache.hadoop.hdfs.util.FoldedTreeSet.remove(FoldedTreeSet.java:911) at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.removeBlock(DatanodeStorageInfo.java:263) at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:194) at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.removeBlock(BlocksMap.java:108) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlockFromMap(BlockManager.java:3676) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.removeBlock(BlockManager.java:3507) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.removeBlocks(FSNamesystem.java:4158) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:4132) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4069) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4053) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:845) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:308) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:603) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422)
The observation in this case was that the Namenode worked fine after a restart, then at some point around 4 weeks of uptime the problem started and persisted until the Namenode was restarted, after which it did not return for roughly another 4 weeks.
On a completely different cluster and version, I recently came across a problem where files were again failing to close (last block does not have sufficient number of replicas) and the datanodes were logging a lot of messages like the following:
2019-11-27 09:00:49,678 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Took 21540ms to process 1 commands from NN
These messages had a range of durations and were fairly frequent. Focusing on the longer ones, at around 20 seconds, and checking a few different occurrences, the jstacks showed a very similar problem to the NN issue, in that FoldedTreeSet seemed to be at the root of the problem.
The heartbeat thread is waiting on a lock:
"Thread-26" #243 daemon prio=5 os_prio=0 tid=0x00007f575225a000 nid=0x1d8bae waiting for monitor entry [0x00007f571ed76000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2004) - waiting to lock <0x00000006b79baf00> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:699) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:645) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:730) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:539) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:696) at java.lang.Thread.run(Thread.java:748)
What seemed to be blocking it is:
"DataXceiver for client DFSClient_NONMAPREDUCE_672634602_52 at /10.71.224.51:37620 [Sending block BP-551684957-10.71.224.13-1505987946791:blk_1683565334_609907893]" #142018893 daemon prio=5 os_prio=0 tid=0x00007f573cd25000 nid=0x1912a runnable [0x00007f570acea000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hdfs.util.FoldedTreeSet.get(FoldedTreeSet.java:449) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaMap.get(ReplicaMap.java:111) - locked <0x00000006b79baf00> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getFile(FsDatasetImpl.java:2199) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.validateBlockFile(FsDatasetImpl.java:1954) - locked <0x00000006b79baf00> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:709) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:702) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getMetaDataInputStream(FsDatasetImpl.java:220) at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:301) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:537) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246) at java.lang.Thread.run(Thread.java:748)
Out of curiosity, we restarted 3 of the DNs, and in the 24 hours since the restart no messages like the above have been logged, whereas they were appearing at a fairly regular frequency before the restart.
I have quite a few jstacks taken at a 5-second interval for some of the DNs on this cluster, and many samples have a runnable thread with this line at the top of the stack, which seems too frequent for a simple get operation:
    at org.apache.hadoop.hdfs.util.FoldedTreeSet.get(FoldedTreeSet.java:449)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaMap.get(ReplicaMap.java:111)
    - locked <0x00000006b79baf00> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
After the restart, the jstacks never showed this line in any samples for at least the next week while we were monitoring the cluster.
Another interesting point is that the jstack always shows line 449. This is the method that is running, with that line marked below:
public E get(Object obj, Comparator<?> cmp) {
  Objects.requireNonNull(obj);

  Node<E> node = root;
  while (node != null) {
    E[] entries = node.entries;

    int leftIndex = node.leftIndex;
    int result = compare(obj, entries[leftIndex], cmp);
    if (result < 0) {
      node = node.left;
    } else if (result == 0) {
      return entries[leftIndex];
    } else {
      int rightIndex = node.rightIndex;
      if (leftIndex != rightIndex) {
        result = compare(obj, entries[rightIndex], cmp);
      }
      if (result == 0) {
        return entries[rightIndex];
      } else if (result > 0) {
        node = node.right;
      } else {
        int low = leftIndex + 1;
        int high = rightIndex - 1;
        while (low <= high) {
          int mid = (low + high) >>> 1;
          result = compare(obj, entries[mid], cmp);
          if (result > 0) {
            low = mid + 1;
          } else if (result < 0) {
            high = mid - 1;
          } else {
            return entries[mid];
          }
        }
        return null;
      }
    }
  }
  // *** This is line 449, which the jstack always has at the top of the stack.
  return null;
}
I cannot explain why that line is always the "running line", except that it may somehow be related to safepoints in the JVM.
Based on these two examples and a few other similar issues I have encountered in the Namenode, I believe we have an issue with FoldedTreeSet where it somehow degrades over time, to the point that a simple get operation takes a relatively long time. It does not seem to be able to recover from this without a process restart. A rough way to test this hypothesis in isolation is sketched below.
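For anyone who wants to try to reproduce this in isolation, a micro-benchmark along the following lines might show whether get latency grows under delete-heavy churn. This is only a sketch under some assumptions: it needs a Hadoop version that still ships org.apache.hadoop.hdfs.util.FoldedTreeSet on the classpath, it uses remove and get(Object, Comparator) as seen in the stacks and source above, and it assumes add and a no-argument constructor with the usual collection semantics; the workload is a crude stand-in for block churn, not the production load.

import java.util.Comparator;
import java.util.Random;
import org.apache.hadoop.hdfs.util.FoldedTreeSet;

// Sketch only: times batches of gets against a FoldedTreeSet that is being
// churned with removes and adds, to see whether lookup latency degrades.
public class FoldedTreeSetChurn {
  public static void main(String[] args) {
    FoldedTreeSet<Long> set = new FoldedTreeSet<>(); // assumed no-arg constructor
    Comparator<Long> cmp = Comparator.naturalOrder();
    Random rnd = new Random(42);
    long nextId = 0;

    // Initial population, loosely modelling a DN replica map after startup.
    for (int i = 0; i < 1_000_000; i++) {
      set.add(nextId++);
    }

    for (int round = 0; round < 200; round++) {
      // Churn: remove random existing ids and add new ones.
      for (int i = 0; i < 100_000; i++) {
        long victim = (long) (rnd.nextDouble() * nextId);
        set.remove(victim);
        set.add(nextId++);
      }
      // Time a batch of lookups; if the structure degrades, this should
      // trend upwards over the rounds rather than staying flat.
      long start = System.nanoTime();
      int hits = 0;
      for (int i = 0; i < 100_000; i++) {
        long key = (long) (rnd.nextDouble() * nextId);
        if (set.get(key, cmp) != null) {
          hits++;
        }
      }
      long micros = (System.nanoTime() - start) / 1000;
      System.out.printf("round %d: 100k gets took %d us (%d hits)%n",
          round, micros, hits);
    }
  }
}

Running the same loop against a plain java.util.TreeMap would show whether any upward trend is specific to FoldedTreeSet rather than to the churn pattern itself.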
HDFS-13671 talks about reverting FoldedTreeSet (at least in the Namenode) for performance reasons. However, I wonder whether the problems people have seen with it are due to this performance degradation over time rather than to its initial performance.
Issue Links
- is related to:
  - HDFS-13671 Namenode deletes large dir slowly caused by FoldedTreeSet#removeAndGet (Resolved)
  - HDFS-9260 Improve the performance and GC friendliness of NameNode startup and full block reports (Resolved)
- relates to:
  - HDFS-15140 Replace FoldedTreeSet in Datanode with SortedSet or TreeMap (Resolved)