Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
1.4.0
-
None
Description
In one user's SCM audit log file, there are two DELETE_KEY_BLOCK records with almost the same blocks info, here are the top a few lines of each,
024-04-11 08:42:37,811 | INFO | SCMAudit | user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256', blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx', blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx', blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx', blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx', blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx', blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx'/, blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx', blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx', blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx', blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/33b7b03beade4ba28a565385987294ad.xlsx', blockIDs=[conID: 163887 locID: 111677748060148812 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/34131229d96d451788d012a23245d9fe.doc', blockIDs=[conID: 163923 locID: 111677748060046291 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx', blockIDs=[conID: 164102 locID: 111677748060156005 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/3cffc12d5a8c451fb46337911609bbba.xlsx', blockIDs=[conID: 164091 locID: 111677748060156058 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/ 2024-04-11 08:43:35,862 | INFO | SCMAudit | user=om/master02-cdpt5.***.com@***.COM | ip=10.59.2.2 | op=DELETE_KEY_BLOCK {KeyBlockToDelete=[BlockGroup[groupID='/icbigdata/wyg/1745367053961547777/2024-04-10/1777992236669792256', blockIDs=[conID: 164130 locID: 111677748060195553 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/03677746b84d4e14aa8c2a9cafd30227.xlsx', blockIDs=[conID: 164058 locID: 111677748060149956 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx', blockIDs=[conID: 163882 locID: 111677748060153718 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0a445479d3264d2c8244e3d52f5f4b24.xlsx', blockIDs=[conID: 164074 locID: 111677748060208039 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx', blockIDs=[conID: 163515 locID: 111677748059929337 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/0f9caf5ee0284d2382a2ae5e698ad428.xlsx', blockIDs=[conID: 164004 locID: 111677748060214180 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 162904 locID: 111677748059401411 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164071 locID: 111677748060162845 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164059 locID: 111677748060162941 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164072 locID: 111677748060163107 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 164074 locID: 111677748060163112 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/125e335029004f33ac73182afbb76900.xlsx', blockIDs=[conID: 163976 locID: 111677748060163154 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/2b308026f4a84f8ebbf2c669557532ab.xlsx', blockIDs=[conID: 163983 locID: 111677748060075978 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx', blockIDs=[conID: 163995 locID: 111677748060152877 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/31c07c5fe8c948c4b4a9bd5eb6423470.xlsx', blockIDs=[conID: 163907 locID: 111677748060159708 bcsId: 0]], BlockGroup[groupID='/s3v/0154431609b14992a6a3f743be10c3a0/
Here is the OM logs(not the same time window because no OM logs available for that period, but the logic is the same)
2024-04-10 14:31:03,015 WARN org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key Name. at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:45) at org.apache.hadoop.ozone.om.service.KeyDeletingService.addToMap(KeyDeletingService.java:316) at org.apache.hadoop.ozone.om.service.KeyDeletingService.access$8(KeyDeletingService.java:312) at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.submitPurgeKeysRequest(KeyDeletingService.java:251) at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:180) at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121) at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2024-04-10 14:31:50,170 WARN org.apache.hadoop.ipc.Server: Large response size 1051235 for call Call#118019916 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 10.59.2.1:37063 2024-04-10 14:31:50,272 WARN org.apache.hadoop.ipc.Server: Large response size 1051989 for call Call#118019932 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 10.59.2.1:37063 2024-04-10 14:32:05,664 INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolling segment log-204433475_204439100 to index:204439100 2024-04-10 14:32:05,665 INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: om2@group-9F198C4C3682-SegmentedRaftLogWorker: Rolled log segment from /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204433475 to /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_204433475-204439100 2024-04-10 14:32:05,673 INFO org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: om2@group-9F198C4C3682-SegmentedRaftLogWorker: created new log segment /metadata-disk/data06/om/ratis/d39ebeec-41e8-35f1-a92b-9f198c4c3682/current/log_inprogress_204439101 2024-04-10 14:32:05,706 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20015, elapsed time: 62690ms 2024-04-10 14:32:08,939 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for fdt.dfs@***.COM (auth:KERBEROS) 2024-04-10 14:32:08,946 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for fdt.dfs@***.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol 2024-04-10 14:32:11,483 INFO org.apache.hadoop.ozone.om.request.key.OMKeyRequest: Detect allocated but uncommitted blocks [{blockID={containerID=164049, localID=111677748060155900}, length=268435456, offset=0, token=null, pipeline=null, createVersion=0, partNumber=0}] in key c61d3394d32145e4a9f23c6a3046cfc4.docx. .... 2024-04-10 14:33:08,236 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20019, elapsed time: 62530ms
After a further investigation, I believe the key which caused failure "java.lang.IllegalStateException: Volume and/or Bucket Name missing from Key Name." is "/s3v/gb-data/", the following info from SCM audit logs,
BlockGroup[groupID='/s3v/gb-data/', blockIDs=[conID: 80813 locID: 111677748040965241 bcsId: 0]]
It's a bucket. I'm not sure why a bucket is in the OM deletedTable, as I tested with "ozone fs -rm -r -f -skipTrash" with FSO, LEGACY and OBS, non of them will result in a bucket in deletedTable. So I guess it could be caused by history bugs in the past.
Anyway, this "java.lang.IllegalStateException" will abort the current round of KeyDeletingTask#call, so the PurgeKeysRequest will not be sent to RATIS to delete the keys in deletedTable, so keys will be processed again in next KeyDeletingService iterator, and those blocks of those keys will be sent to SCM again and again. In a SCM log file, 256MB, holding all logs in 15s, there is 27 repeated log for one random picked block ID, similar to this.
2024-03-12 18:28:55,727 WARN org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl: Container: #34680 was deleted for the transaction: txID: 481479568
containerID: 34680
localID: 111677748037003018
localID: 111677748037003295
localID: 111677748037003639
localID: 111677748037003811
localID: 111677748037003880
localID: 111677748037004088
localID: 111677748037004282
localID: 111677748037004521
localID: 111677748037004658
localID: 111677748037004818
localID: 111677748037005071
localID: 111677748037005223
localID: 111677748037005365
localID: 111677748037005641
localID: 111677748037005752
localID: 111677748037005934
localID: 111677748037002486
localID: 111677748037002740
count: 0
The consequence is duplicates key block delete requests send to SCM again and again. And SCM is busy to create new delete transactions for it, and process those transactions. The user's leader SCM terminated itself frequently during this period, caused by long GC exceeds 60s.
One more issue observed in this case is it took almost 60s to finish one round of key delete. Currently the key delete task runs every "ozone.block.deleting.service.interval" 60s. Currently it's not clear that this 60s, whether because SCM is slow or OM ratis is slow. Add more logs to tell that.
sammi@SAMMICHEN-MB0 master02 % grep "Number of keys deleted" ozone-om.log 2024-03-31 16:39:35,197 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 192, elapsed time: 200286ms 2024-03-31 16:39:36,256 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 192, elapsed time: 1058ms 2024-03-31 22:22:54,103 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 61224ms 2024-03-31 22:23:57,816 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 63712ms 2024-04-01 22:03:32,881 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 64292ms 2024-04-01 22:04:36,903 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 64021ms 2024-04-01 22:52:54,213 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 63099ms 2024-04-01 22:53:56,579 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20000, elapsed time: 62366ms 2024-04-04 10:35:48,089 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20007, elapsed time: 59047ms 2024-04-08 11:50:33,453 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20023, elapsed time: 61243ms 2024-04-08 11:51:38,612 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20080, elapsed time: 65159ms 2024-04-10 14:32:05,706 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20015, elapsed time: 62690ms 2024-04-10 14:33:08,236 INFO org.apache.hadoop.ozone.om.service.KeyDeletingService: Number of keys deleted: 20019, elapsed time: 62530ms
and
sammi@SAMMICHEN-MB0 master02 % grep "Background task execution took" ozone-om.log
2024-03-31 16:26:01,690 WARN org.apache.hadoop.hdds.utils.BackgroundService: KeyDeletingService Background task execution took 613583811660ns > 300000000000ns(timeout)
2024-03-31 16:36:14,911 WARN org.apache.hadoop.hdds.utils.BackgroundService: KeyDeletingService Background task execution took 613220813293ns > 300000000000ns(timeout)
2024-04-11 05:57:22,754 WARN org.apache.hadoop.hdds.utils.BackgroundService: KeyDeletingService Background task execution took 3459626398958ns > 300000000000ns(timeout)
Attachments
Issue Links
- links to