Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-10680

Duplicate delete key blocks sent to SCM

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.4.0
    • 2.0.0, 1.4.1
    • None

    Description

      In one user's SCM audit log file, there are two DELETE_KEY_BLOCK records with almost the same blocks info, here are the top a few lines of each,

       

      024-04-11 08:42:37,811 | INFO  | SCMAudit | user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256', blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx', blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx', blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx', blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx', blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx', blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx'/, blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx', blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx', blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx', blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/33b7b03beade4ba28a565385987294ad.xlsx', blockIDs=[conID: 163887 locID: 111677748060148812 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/34131229d96d451788d012a23245d9fe.doc', blockIDs=[conID: 163923 locID: 111677748060046291 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx', blockIDs=[conID: 164102 locID: 111677748060156005 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx', blockIDs=[conID: 164091 locID: 111677748060156058 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/
      
      
      2024-04-11 08:43:35,862 | INFO  | SCMAudit | user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256', blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx', blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx', blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx', blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx', blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx', blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx', blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx', blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx', blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/ 

      Here is the OM logs(not the same time window because no OM logs available for that period, but the logic is the same)

      2024-04-10 14:31:03,015 WARN org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed
      java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key Name.
              at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:45)
              at org.apache.hadoop.ozone.om.service.KeyDeletingService.addToMap(KeyDeletingService.java:316)
              at org.apache.hadoop.ozone.om.service.KeyDeletingService.access$8(KeyDeletingService.java:312)
              at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.submitPurgeKeysRequest(KeyDeletingService.java:251)
              at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:180)
              at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
              at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      2024-04-10 14:31:50,170 WARN org.apache.hadoop.ipc.Server: Large response size 1051235 for call Call#118019916 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 10.59.2.1:37063
      2024-04-10 14:31:50,272 WARN org.apache.hadoop.ipc.Server: Large response size 1051989 for call Call#118019932 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 10.59.2.1:37063
      2024-04-10 14:32:05,664 INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolling segment log-204433475_204439100 to index:204439100
      2024-04-10 14:32:05,665 INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolled log segment from /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204433475 to /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_204433475-204439100
      2024-04-10 14:32:05,673 INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: om2@group-9F198C4C3682-SegmentedRaftLogWorker: created new log segment /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204439101
      2024-04-10 14:32:05,706 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20015, elapsed time: 62690ms
      2024-04-10 14:32:08,939 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for fdt.dfs@***.COM (auth:KERBEROS)
      2024-04-10 14:32:08,946 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for fdt.dfs@***.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol
      2024-04-10 14:32:11,483 INFO org.apache.hadoop.ozone.om.request.key.OMKeyRequest: Detect allocated but uncommitted blocks [{blockID={containerID=164049, localID=111677748060155900}, length=268435456, offset=0, token=null, pipeline=null, createVersion=0, partNumber=0}] in key c61d3394d32145e4a9f23c6a3046cfc4.docx.
      ....
      2024-04-10 14:33:08,236 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20019, elapsed time: 62530ms
      

      After a further investigation, I believe the key which caused failure "java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key Name." is "/s3v/gb-data/", the following info from SCM audit logs,

       BlockGroup[groupID='/s3v/gb-data/', blockIDs=[conID: 80813 locID: 111677748040965241 bcsId: 0]]
      

      It's a bucket. I'm not sure why a bucket is in the OM deletedTable, as I tested with "ozone fs -rm -r -f -skipTrash" with FSO, LEGACY and OBS, non of them will result in a bucket in deletedTable. So I guess it could be caused by history bugs in the past.

      Anyway, this "java.lang.IllegalStateException" will abort the current round of KeyDeletingTask#call, so the PurgeKeysRequest will not be sent to RATIS to delete the keys in deletedTable, so keys will be processed again in next KeyDeletingService iterator, and those blocks of those keys will be sent to SCM again and again. In a SCM log file, 256MB, holding all logs in 15s, there is 27 repeated log for one random picked block ID, similar to this.

      2024-03-12 18:28:55,727 WARN org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: Container: #34680 was deleted for the transaction: txID: 481479568
      containerID: 34680
      localID: 111677748037003018
      localID: 111677748037003295
      localID: 111677748037003639
      localID: 111677748037003811
      localID: 111677748037003880
      localID: 111677748037004088
      localID: 111677748037004282
      localID: 111677748037004521
      localID: 111677748037004658
      localID: 111677748037004818
      localID: 111677748037005071
      localID: 111677748037005223
      localID: 111677748037005365
      localID: 111677748037005641
      localID: 111677748037005752
      localID: 111677748037005934
      localID: 111677748037002486
      localID: 111677748037002740
      count: 0
      

      The consequence is duplicates key block delete requests send to SCM again and again. And SCM is busy to create new delete transactions for it, and process those transactions. The user's leader SCM terminated itself frequently during this period, caused by long GC exceeds 60s.

      One more issue observed in this case is it took almost 60s to finish one round of key delete. Currently the key delete task runs every "ozone.block.deleting.service.interval" 60s. Currently it's not clear that this 60s, whether because SCM is slow or OM ratis is slow. Add more logs to tell that.

      sammi@SAMMICHEN-MB0 master02 % grep "Number of keys deleted" ozone-om.log 
      2024-03-31 16:39:35,197 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 192, elapsed time: 200286ms
      2024-03-31 16:39:36,256 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 192, elapsed time: 1058ms
      2024-03-31 22:22:54,103 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 61224ms
      2024-03-31 22:23:57,816 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 63712ms
      2024-04-01 22:03:32,881 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 64292ms
      2024-04-01 22:04:36,903 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 64021ms
      2024-04-01 22:52:54,213 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 63099ms
      2024-04-01 22:53:56,579 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 62366ms
      2024-04-04 10:35:48,089 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20007, elapsed time: 59047ms
      2024-04-08 11:50:33,453 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20023, elapsed time: 61243ms
      2024-04-08 11:51:38,612 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20080, elapsed time: 65159ms
      2024-04-10 14:32:05,706 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20015, elapsed time: 62690ms
      2024-04-10 14:33:08,236 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20019, elapsed time: 62530ms
      

      and

      sammi@SAMMICHEN-MB0 master02 % grep "Background task execution took" ozone-om.log
      2024-03-31 16:26:01,690 WARN org.apache.hadoop.hdds.utils.BackgroundService: KeyDeletingService Background task execution took 613583811660ns > 300000000000ns(timeout)
      2024-03-31 16:36:14,911 WARN org.apache.hadoop.hdds.utils.BackgroundService: KeyDeletingService Background task execution took 613220813293ns > 300000000000ns(timeout)
      2024-04-11 05:57:22,754 WARN org.apache.hadoop.hdds.utils.BackgroundService: KeyDeletingService Background task execution took 3459626398958ns > 300000000000ns(timeout)
      

      Attachments

        Issue Links

          Activity

            People

              Sammi Sammi Chen
              Sammi Sammi Chen
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: