Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Duplicate
-
None
-
None
Description
Seems like fix for deadlock (in HDDS-9871) caused cache inconsistency. Due to which SnapDiff and background services KeyDeletingService and SstFilteringService are failing.
KeyDeletingService logs:
2023-12-18 19:19:29,563 WARN [KeyDeletingService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null at com.google.common.base.Preconditions.checkState(Preconditions.java:512) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220) at org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625) at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getLatestActiveSnapshot(OmMetadataManagerImpl.java:1691) at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(OmMetadataManagerImpl.java:1544) at org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:607) at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:183) at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121) at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834)
2023-12-18 19:20:28,768 WARN [KeyDeletingService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@7656056 at com.google.common.base.Preconditions.checkState(Preconditions.java:512) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220) at org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625) at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getLatestActiveSnapshot(OmMetadataManagerImpl.java:1691) at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(OmMetadataManagerImpl.java:1544) at org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:607) at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:183) at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121) at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) 2023-12-18 19:20:28,768 WARN [SstFilteringService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null at com.google.common.base.Preconditions.checkState(Preconditions.java:512) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153) at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185) at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121) at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834)
SSTFilteringService logs:
2023-12-18 19:21:28,742 ERROR [SstFilteringService#0]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve snapshot: /vol-t2gj8/buck-07uux/snap-5griw java.io.IOException: Failed init RocksDB, db path : /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1702927228 acquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180) at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220) at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:598) at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406) at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357) at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:171) at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:167) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153) at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185) at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121) at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) Caused by: java.io.IOException: class org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b; status : IOError; message : lock hold by current process, acquire time 1702927228 a cquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available at org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:667) at org.apache.hadoop.hdds.utils.db.RocksDatabase.toIOException(RocksDatabase.java:91) at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:168) at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:113) ... 18 more Caused by: org.rocksdb.RocksDBException: lock hold by current process, acquire time 1702927228 acquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available at org.rocksdb.RocksDB.open(Native Method) at org.rocksdb.RocksDB.open(RocksDB.java:307) at org.apache.hadoop.hdds.utils.db.managed.ManagedRocksDB.open(ManagedRocksDB.java:75) at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:154) ... 19 more 2023-12-18 19:21:28,743 WARN [SstFilteringService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed java.lang.IllegalStateException: java.io.IOException: Failed init RocksDB, db path : /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1702927228 acqu iring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:179) at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:167) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153) at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185) at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121) at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) Caused by: java.io.IOException: Failed init RocksDB, db path : /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1702927228 acquiring thread 139777345 017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180) at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220) at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:598) at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406) at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357) at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:171) ... 12 more Caused by: java.io.IOException: class org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b; status : IOError; message : lock hold by current process, acquire time 1702927228 a cquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available at org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:667) at org.apache.hadoop.hdds.utils.db.RocksDatabase.toIOException(RocksDatabase.java:91) at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:168) at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:113) ... 18 more Caused by: org.rocksdb.RocksDBException: lock hold by current process, acquire time 1702927228 acquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available at org.rocksdb.RocksDB.open(Native Method) at org.rocksdb.RocksDB.open(RocksDB.java:307) at org.apache.hadoop.hdds.utils.db.managed.ManagedRocksDB.open(ManagedRocksDB.java:75) at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:154) ... 19 more
2023-12-18 19:19:28,741 ERROR [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: Exception encountered while filtering a snapshot java.io.IOException: Rocks Database is closed at org.apache.hadoop.hdds.utils.db.RocksDatabase.assertClose(RocksDatabase.java:445) at org.apache.hadoop.hdds.utils.db.RocksDatabase.deleteFilesNotMatchingPrefix(RocksDatabase.java:958) at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:193) at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121) at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834)
SnapDiff logs:
2023-12-18 19:33:12,222 ERROR [snapshot-diff-job-thread-id-8]-org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager: Caught unchecked exception during diff report generation for volume: voljngnm bucket: bucketjngnm, fromSnapshot: snap-t1gfn and toSnapshot: snap-h1xmg java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null at com.google.common.base.Preconditions.checkState(Preconditions.java:512) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220) at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153) at org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.generateSnapshotDiffReport(SnapshotDiffManager.java:841) at org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.lambda$2(SnapshotDiffManager.java:722) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834)
I suspect it might be a race condition between cleanupInternal() and invalidate(). It wasn't a problem earlier because they were synchronize by ConcurrentHashMap's lock. But it could be something else.
Attachments
Issue Links
- duplicates
-
HDDS-10076 SnapshotCache closes RocksDB instance with Reference
- Resolved
-
HDDS-10103 Snapshot read calls are failing due to SnapshotCache's inconsistency
- Resolved
- is a clone of
-
HDDS-9871 [snapshot] Deadlock in SnapshotCache
- Resolved
- is superceded by
-
HDDS-10103 Snapshot read calls are failing due to SnapshotCache's inconsistency
- Resolved