Description
The main process flow of #BPServiceActor has two core functions: reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. If #processCommand takes a long time, it blocks the reporting flow. #processCommand can take a very long time (over 1000s in the worst case I have seen) when the IO load on the DataNode is very high. Some IO operations run under #datasetLock, so processing certain commands (such as #DNA_INVALIDATE) has to wait a long time to acquire #datasetLock. In that case, #heartbeat is not sent to the NameNode in time, which can trigger other failures.
I propose making #processCommand asynchronous so that it does not block #BPServiceActor from sending heartbeats to the NameNode under high IO load.
Notes:
1. Lifeline could be an effective solution; however, some old branches do not support this feature.
2. IO operations under #datasetLock are a separate issue; I think we should address it in another JIRA.
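To illustrate the proposal, here is a minimal Java sketch of the general idea: the heartbeat loop hands commands to a queue, and a separate worker thread drains it, so a slow command no longer delays the next heartbeat. All class and method names here (AsyncCommandSketch, CommandProcessor, enqueue, runDemo) are hypothetical and are not the actual HDFS implementation; the slow command is simulated with a sleep instead of a real wait on #datasetLock.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; names are hypothetical, not the real HDFS classes. */
final class AsyncCommandSketch {

    /** Worker thread that runs queued commands off the heartbeat thread. */
    static final class CommandProcessor extends Thread {
        private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        private volatile boolean running = true;

        CommandProcessor() {
            super("CommandProcessor");
            setDaemon(true);
        }

        /** Called from the heartbeat loop; returns immediately. */
        void enqueue(Runnable command) {
            queue.add(command);
        }

        @Override
        public void run() {
            while (running) {
                try {
                    Runnable command = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (command != null) {
                        command.run();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                } catch (RuntimeException e) {
                    // Log and continue so one failing command does not stop the worker.
                    System.err.println("command failed: " + e);
                }
            }
        }

        void shutdown() {
            running = false;
            interrupt();
        }
    }

    /**
     * Simulates a heartbeat loop while one slow command is being processed.
     * Returns how many heartbeats were "sent" before the slow command finished.
     */
    static int runDemo(int heartbeats, long slowCommandMillis) {
        CommandProcessor processor = new CommandProcessor();
        processor.start();
        CountDownLatch slowCommandDone = new CountDownLatch(1);

        // Stand-in for a command such as DNA_INVALIDATE waiting on a busy lock.
        processor.enqueue(() -> {
            try {
                Thread.sleep(slowCommandMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                slowCommandDone.countDown();
            }
        });

        int sentWhileBusy = 0;
        for (int i = 0; i < heartbeats; i++) {
            // A real loop would send the heartbeat RPC here; we only count it.
            if (slowCommandDone.getCount() > 0) {
                sentWhileBusy++;
            }
        }

        try {
            slowCommandDone.await(slowCommandMillis + 5000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        processor.shutdown();
        return sentWhileBusy;
    }

    public static void main(String[] args) {
        int sent = runDemo(5, 300);
        System.out.println("heartbeats sent while command was running: " + sent);
    }
}
```

With synchronous processing, the loop would have to wait for the slow command before sending any further heartbeat; with the queue, the loop only pays the cost of enqueue. A real implementation would also need to decide how the worker is restarted or reported if it exits unexpectedly.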
Attachments
Issue Links
- breaks
  - HBASE-26970 TestMetaFixed fails reliably with Hadoop 3.2.3 and Hadoop 3.3.2 (Resolved)
- is related to
  - HDFS-15651 Client could not obtain block when DN CommandProcessingThread exit (Resolved)
- relates to
  - HDFS-15113 Missing IBR when NameNode restart if open processCommand async feature (Resolved)
  - HDFS-15651 Client could not obtain block when DN CommandProcessingThread exit (Resolved)
  - HDFS-16586 Purge FsDatasetAsyncDiskService threadgroup; it causes BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal exception and exit' (Resolved)
  - HDFS-15075 Remove process command timing from BPServiceActor (Resolved)