Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-9965

[Snapshot] SnapshotCache is in inconsistent state

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Duplicate
    • None
    • None
    • Snapshot

    Description

      Seems like fix for deadlock (in HDDS-9871) caused cache inconsistency. Due to which SnapDiff and background services KeyDeletingService and SstFilteringService are failing.

      KeyDeletingService logs:

      2023-12-18 19:19:29,563 WARN [KeyDeletingService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed
      java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null
              at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
              at org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
              at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getLatestActiveSnapshot(OmMetadataManagerImpl.java:1691)
              at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(OmMetadataManagerImpl.java:1544)
              at org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:607)
              at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:183)
              at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
              at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834) 
      2023-12-18 19:20:28,768 WARN [KeyDeletingService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed
      java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@7656056
              at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
              at org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
              at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getLatestActiveSnapshot(OmMetadataManagerImpl.java:1691)
              at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(OmMetadataManagerImpl.java:1544)
              at org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:607)
              at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:183)
              at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
              at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
      2023-12-18 19:20:28,768 WARN [SstFilteringService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed
      java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null
              at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
              at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
              at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
              at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
      

      SSTFilteringService logs:

      2023-12-18 19:21:28,742 ERROR [SstFilteringService#0]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve snapshot: /vol-t2gj8/buck-07uux/snap-5griw
      java.io.IOException: Failed init RocksDB, db path : /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1702927228 acquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
              at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180)
              at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220)
              at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:598)
              at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
              at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
              at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:171)
              at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:167)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
              at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
              at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
              at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.io.IOException: class org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b; status : IOError; message : lock hold by current process, acquire time 1702927228 a
      cquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
              at org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:667)
              at org.apache.hadoop.hdds.utils.db.RocksDatabase.toIOException(RocksDatabase.java:91)
              at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:168)
              at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:113)
              ... 18 more
      Caused by: org.rocksdb.RocksDBException: lock hold by current process, acquire time 1702927228 acquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
              at org.rocksdb.RocksDB.open(Native Method)
              at org.rocksdb.RocksDB.open(RocksDB.java:307)
              at org.apache.hadoop.hdds.utils.db.managed.ManagedRocksDB.open(ManagedRocksDB.java:75)
              at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:154)
              ... 19 more
      2023-12-18 19:21:28,743 WARN [SstFilteringService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed
      java.lang.IllegalStateException: java.io.IOException: Failed init RocksDB, db path : /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1702927228 acqu
      iring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:179)
              at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:167)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
              at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
              at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
              at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.io.IOException: Failed init RocksDB, db path : /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1702927228 acquiring thread 139777345
      017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
              at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180)
              at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220)
              at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:598)
              at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
              at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
              at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:171)
              ... 12 more
      Caused by: java.io.IOException: class org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b; status : IOError; message : lock hold by current process, acquire time 1702927228 a
      cquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
              at org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:667)
              at org.apache.hadoop.hdds.utils.db.RocksDatabase.toIOException(RocksDatabase.java:91)
              at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:168)
              at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:113)
              ... 18 more
      Caused by: org.rocksdb.RocksDBException: lock hold by current process, acquire time 1702927228 acquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
              at org.rocksdb.RocksDB.open(Native Method)
              at org.rocksdb.RocksDB.open(RocksDB.java:307)
              at org.apache.hadoop.hdds.utils.db.managed.ManagedRocksDB.open(ManagedRocksDB.java:75)
              at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:154)
              ... 19 more
      2023-12-18 19:19:28,741 ERROR [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: Exception encountered while filtering a snapshot
      java.io.IOException: Rocks Database is closed
              at org.apache.hadoop.hdds.utils.db.RocksDatabase.assertClose(RocksDatabase.java:445)
              at org.apache.hadoop.hdds.utils.db.RocksDatabase.deleteFilesNotMatchingPrefix(RocksDatabase.java:958)
              at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:193)
              at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
              at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
      

      SnapDiff logs:

      2023-12-18 19:33:12,222 ERROR [snapshot-diff-job-thread-id-8]-org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager: Caught unchecked exception during diff report generation for volume: voljngnm bucket: bucketjngnm, fromSnapshot: snap-t1gfn and toSnapshot: snap-h1xmg
      java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null
              at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.generateSnapshotDiffReport(SnapshotDiffManager.java:841)
              at org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.lambda$2(SnapshotDiffManager.java:722)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
      

      I suspect it might be a race condition between cleanupInternal() and invalidate(). It wasn't a problem earlier because they were synchronize by ConcurrentHashMap's lock. But it could be something else.

      Attachments

        Issue Links

          Activity

            People

              aswinshakil Aswin Shakil
              hemantk Hemant Kumar
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: