Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Won't Fix
-
None
-
None
Description
Our Ozone cluster recently ran into problems with data deletion. The SCM was unable to automatically clean up the entries in its deletion queue, so the overall deletion process never completed. While analyzing the problem, we traced it to DeletedBlockLogImpl#onMessage: the datanode UUID sent by the DN over RPC was not recognized by the SCM, which produced the "Unknown Datanode" warnings shown in the SCM log. We attempted to fix this issue and have made some progress.
2024-07-08 12:08:19,606 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID: 1720041450931 report status PENDING
2024-07-08 12:08:19,606 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID: 1719241427194 report status PENDING
2024-07-08 12:08:19,606 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID: 1720041450931 report status PENDING
2024-07-08 12:08:19,606 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID: 1719241427194 report status PENDING
2024-07-08 12:08:19,617 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: efadefd7-4d25-42fd-a6ef-fabd64c97d7f Scm Command ID: 1720041450023 report status PENDING
2024-07-08 12:08:19,664 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID: 1720106401909 report status PENDING
2024-07-08 12:08:19,664 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID: 1719241427294 report status PENDING
2024-07-12 08:35:37,032 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = 888a550f-c59c-4dde-ba3e-3dcf8f9593e0, localDnId = 888a550f-c59c-4dde-ba3e-3dcf8f9593e0, remoteDnId == localDnId[false]
2024-07-12 08:35:37,032 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = c7919796-18fa-4f00-af94-9b7ebc21a572, localDnId = c7919796-18fa-4f00-af94-9b7ebc21a572, remoteDnId == localDnId[false]
2024-07-12 08:35:37,032 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = 596cd6c8-ecc7-48da-8039-75fe59d65846, localDnId = 596cd6c8-ecc7-48da-8039-75fe59d65846, remoteDnId == localDnId[false]
2024-07-12 08:35:37,033 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = de559349-fd76-4a5a-9acb-007432ba1876, localDnId = de559349-fd76-4a5a-9acb-007432ba1876, remoteDnId == localDnId[false]
2024-07-12 08:35:37,033 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = 6a750295-7e7c-4786-b28c-f78509c41a02, localDnId = 6a750295-7e7c-4786-b28c-f78509c41a02, remoteDnId == localDnId[false]
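In the DEBUG output above, remoteDnId and localDnId print as the same string, yet the comparison reports false. The report does not state how the two IDs are compared inside the SCM, so the following is only an illustration of one common way this pattern arises in Java: comparing two distinct UUID objects by reference (==) instead of by value (equals). The class and method names (DnIdComparison, sameByReference, sameByValue) are hypothetical and are not taken from the Ozone code base.

```java
import java.util.UUID;

public class DnIdComparison {

    // Reference comparison: true only when both variables point to the same object.
    public static boolean sameByReference(UUID a, UUID b) {
        return a == b;
    }

    // Value comparison: true when both UUIDs hold the same 128-bit value.
    public static boolean sameByValue(UUID a, UUID b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        // ID copied from the DEBUG log. It is parsed twice to mimic one copy
        // rebuilt from an RPC message and one copy held locally; this setup
        // is an assumption made for the illustration.
        String id = "888a550f-c59c-4dde-ba3e-3dcf8f9593e0";
        UUID remoteDnId = UUID.fromString(id);
        UUID localDnId = UUID.fromString(id);

        System.out.println("remoteDnId == localDnId["
            + sameByReference(remoteDnId, localDnId) + "]");
        System.out.println("remoteDnId.equals(localDnId)["
            + sameByValue(remoteDnId, localDnId) + "]");
    }
}
```

If the IDs in the SCM were compared this way, switching to a value comparison would make textually identical IDs match. Whether that is what the PR mentioned in this issue changes is not confirmed by the report.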
On July 8th, we applied this PR in the production environment. Currently, SCM deletion can proceed normally, as shown in the Grafana screenshot below.
Attachments
Issue Links
- links to