Description
Environment: 3-node cluster with around 2M files and the same number of blocks.
All file operations are normal; only the directory scan consumes extra memory and causes long GC pauses. This scan runs every 6 hours (the default interval) and slows the response of any concurrent file operation. The delay is around 5-8 seconds (in production, with 8M blocks, the delay grew to 30+ seconds).
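As a possible mitigation while the locking issue is open, the scan interval and per-second throttle are configurable in hdfs-site.xml. The sketch below assumes the standard dfs.datanode.directoryscan.* properties from hdfs-default.xml; the chosen values (12h interval, 500 ms/s throttle) are illustrative, not a recommendation from this ticket:

```xml
<!-- Sketch only: tune DirectoryScanner pressure on the DataNode.
     Property names are from hdfs-default.xml; values are example choices. -->
<configuration>
  <!-- Run the scan every 12h instead of the 21600s (6h) default. -->
  <property>
    <name>dfs.datanode.directoryscan.interval</name>
    <value>43200</value>
  </property>
  <!-- Limit scanner lock/run time to ~500ms per second (default 1000 = no throttle). -->
  <property>
    <name>dfs.datanode.directoryscan.throttle.limit.ms.per.sec</name>
    <value>500</value>
  </property>
</configuration>
```

This does not remove the long lock-held time, it only spreads the scan's work out and makes the pause window less frequent.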
GC Configuration:
-Xms6144M
-Xmx12288M (also tried 8G)
-XX:NewSize=614M
-XX:MaxNewSize=1228M
-XX:MetaspaceSize=128M
-XX:MaxMetaspaceSize=128M
-XX:CMSFullGCsBeforeCompaction=1
-XX:MaxDirectMemorySize=1G
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+UseCMSCompactAtFullCollection
-XX:CMSInitiatingOccupancyFraction=80
We also tried G1 GC, but could not find much difference in the results:
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=45
-XX:G1ReservePercent=10
2021-05-07 16:32:23,508 INFO org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool BP-345634799-<IP>-1619695417333 Total blocks: 2767211, missing metadata files: 22, missing block files: 22, missing blocks in memory: 0, mismatched blocks: 0
2021-05-07 16:32:23,508 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Lock held time above threshold: lock identifier: FsDatasetRWLock lockHeldTimeMs=7061 ms. Suppressed 0 lock warnings. The stack trace is:
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
org.apache.hadoop.util.InstrumentedReadLock.unlock(InstrumentedReadLock.java:78)
org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:539)
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:416)
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:359)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Our code already includes the following JIRAs, but we are still facing long lock-held times:
- https://issues.apache.org/jira/browse/HDFS-15621
- https://issues.apache.org/jira/browse/HDFS-15150
- https://issues.apache.org/jira/browse/HDFS-15160
- https://issues.apache.org/jira/browse/HDFS-13947
Issue Links
- is fixed by HDFS-15415 Reduce locking in Datanode DirectoryScanner (Resolved)