  1. Apache Ozone
  2. HDDS-11121

DeletedBlockLogImpl#onMessage Inter-process communication UUID inconsistency.

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SCM

    Description

      Our Ozone cluster recently ran into problems with data deletion. The SCM was unable to automatically clean up the data in its deletion queue, so the overall deletion process could not complete. While analyzing the problem, we found an issue in DeletedBlockLogImpl#onMessage: the Datanode (DN) UUID transmitted to the SCM via RPC was not recognized by the SCM, resulting in the "Unknown Datanode" warnings shown in the log below. We attempted to fix this issue and made some progress.

      2024-07-08 12:08:19,606 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID: 1720041450931 report status PENDING
      2024-07-08 12:08:19,606 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID: 1719241427194 report status PENDING
      2024-07-08 12:08:19,606 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID: 1720041450931 report status PENDING
      2024-07-08 12:08:19,606 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 9df75b64-d0e4-44ae-9bc0-9355371c8a5b Scm Command ID: 1719241427194 report status PENDING
      2024-07-08 12:08:19,617 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: efadefd7-4d25-42fd-a6ef-fabd64c97d7f Scm Command ID: 1720041450023 report status PENDING
      2024-07-08 12:08:19,664 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID: 1720106401909 report status PENDING
      2024-07-08 12:08:19,664 [scm2-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager$SCMDeleteBlocksCommandStatusManager: Unknown Datanode: 0c4b82eb-3856-4984-9b0d-d9670089921b Scm Command ID: 1719241427294 report status PENDING

      2024-07-12 08:35:37,032 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = 888a550f-c59c-4dde-ba3e-3dcf8f9593e0, localDnId = 888a550f-c59c-4dde-ba3e-3dcf8f9593e0, remoteDnId == localDnId[false]
      2024-07-12 08:35:37,032 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = c7919796-18fa-4f00-af94-9b7ebc21a572, localDnId = c7919796-18fa-4f00-af94-9b7ebc21a572, remoteDnId == localDnId[false]
      2024-07-12 08:35:37,032 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = 596cd6c8-ecc7-48da-8039-75fe59d65846, localDnId = 596cd6c8-ecc7-48da-8039-75fe59d65846, remoteDnId == localDnId[false]
      2024-07-12 08:35:37,033 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = de559349-fd76-4a5a-9acb-007432ba1876, localDnId = de559349-fd76-4a5a-9acb-007432ba1876, remoteDnId == localDnId[false]
      2024-07-12 08:35:37,033 [scm3-EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] DEBUG org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: remoteDnId = 6a750295-7e7c-4786-b28c-f78509c41a02, localDnId = 6a750295-7e7c-4786-b28c-f78509c41a02, remoteDnId == localDnId[false] 
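
      The DEBUG lines above print identical remoteDnId and localDnId values yet report "remoteDnId == localDnId[false]". One plausible reading, offered here as an assumption and not as a confirmed diagnosis, is that the two IDs are distinct java.util.UUID objects compared by reference. The minimal sketch below reproduces that behavior in isolation. The class UuidComparisonSketch and the helpers sameReference and sameValue are hypothetical names for illustration and are not part of the Ozone code base.

```java
import java.util.UUID;

public class UuidComparisonSketch {

    // Hypothetical helper: reference comparison, which is what "==" does for objects.
    static boolean sameReference(UUID remoteDnId, UUID localDnId) {
        return remoteDnId == localDnId;
    }

    // Hypothetical helper: value comparison via UUID#equals (null-safe on the left side).
    static boolean sameValue(UUID remoteDnId, UUID localDnId) {
        return remoteDnId != null && remoteDnId.equals(localDnId);
    }

    public static void main(String[] args) {
        // Same textual ID as in the first DEBUG line; each parse creates a new object,
        // much as an ID rebuilt from an RPC message would be a new object.
        String id = "888a550f-c59c-4dde-ba3e-3dcf8f9593e0";
        UUID remoteDnId = UUID.fromString(id);
        UUID localDnId = UUID.fromString(id);

        System.out.println("remoteDnId == localDnId      : " + sameReference(remoteDnId, localDnId));
        System.out.println("remoteDnId.equals(localDnId) : " + sameValue(remoteDnId, localDnId));
    }
}
```

      If the mismatch in the log does come from a reference comparison, comparing with equals (or comparing the string forms) would treat the two IDs as the same Datanode. Whether that matches the change actually applied in the PR is not stated in this issue.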

      On July 8th, we applied this PR in our production environment. Since then, SCM deletion has proceeded normally, as shown in the attached Grafana screenshot.

      Attachments

        1. screenshot-1.png
          408 kB
          Shilun Fan
        2. image-2024-07-12-09-37-23-618.png
          294 kB
          Shilun Fan

            People

              Assignee: slfan1989 Shilun Fan
              Reporter: slfan1989 Shilun Fan