[HDDS-10799] SCMBlockDeletingService stuck in PAUSING state - ASF JIRA


Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.4.0
Fix Version/s: None
Component/s: SCM
Labels: None

Description

SCM has a number of internal services that implement the org.apache.hadoop.hdds.scm.ha.SCMService interface. The interface has a method that notifies each service about changes in Raft state or safe mode. While testing the block deletion service, we observed unexpected behavior:

transactions are flushed to the DB (i.e. snapshots were taken)
containers are closed
BUT transactions are not sent to DNs, leaving millions of unhandled block deletion transactions

Investigation showed that the SCM safe-mode exit event was triggered multiple times, which eventually moved the SCMBlockDeletingService to the PAUSING state:

org.apache.hadoop.hdds.scm.block.SCMBlockDeletingService#notifyStatusChanged

  public void notifyStatusChanged() {
    serviceLock.lock();
    try {
      if (scmContext.isLeaderReady() && !scmContext.isInSafeMode() &&
          serviceStatus != ServiceStatus.RUNNING) {
        safemodeExitMillis = clock.millis();
        serviceStatus = ServiceStatus.RUNNING;
      } else {
        serviceStatus = ServiceStatus.PAUSING;
      }
    } finally {
      serviceLock.unlock();
    }
  }

1st trigger: SCM is LEADER, SCM is NOT in safe mode, the service is NOT in RUNNING state -> the service transitions to RUNNING state
2nd trigger: SCM is LEADER, SCM is NOT in safe mode, the service IS in RUNNING state (as a result of the 1st trigger) -> the compound condition is false, so the else branch moves the service to PAUSING state
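
The two triggers above can be reproduced in isolation. The following is a minimal sketch, not the Ozone source: StatusToggleSketch, notifyAsReported and notifyIdempotent are hypothetical names, the leader and safe-mode flags are passed in as plain booleans instead of being read from scmContext, the safemodeExitMillis timestamp is omitted, and the initial PAUSING state is an assumption. The second method shows one possible way to make the notification idempotent; it is not necessarily the change made in HDDS-9962.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the service's status handling (not the real class).
class StatusToggleSketch {
  enum ServiceStatus { RUNNING, PAUSING }

  private final Lock serviceLock = new ReentrantLock();
  // Assumption: the service starts out paused.
  private ServiceStatus serviceStatus = ServiceStatus.PAUSING;

  // Same condition structure as the snippet in this report: the
  // "not already RUNNING" check is part of the compound condition,
  // so a repeated notification falls through to the else branch.
  ServiceStatus notifyAsReported(boolean leaderReady, boolean inSafeMode) {
    serviceLock.lock();
    try {
      if (leaderReady && !inSafeMode
          && serviceStatus != ServiceStatus.RUNNING) {
        serviceStatus = ServiceStatus.RUNNING;
      } else {
        serviceStatus = ServiceStatus.PAUSING;
      }
      return serviceStatus;
    } finally {
      serviceLock.unlock();
    }
  }

  // Possible idempotent variant: only leadership and safe mode decide
  // between RUNNING and PAUSING; the "already RUNNING" check is nested.
  ServiceStatus notifyIdempotent(boolean leaderReady, boolean inSafeMode) {
    serviceLock.lock();
    try {
      if (leaderReady && !inSafeMode) {
        if (serviceStatus != ServiceStatus.RUNNING) {
          // The real service would also record the safe-mode exit time here.
          serviceStatus = ServiceStatus.RUNNING;
        }
      } else {
        serviceStatus = ServiceStatus.PAUSING;
      }
      return serviceStatus;
    } finally {
      serviceLock.unlock();
    }
  }

  public static void main(String[] args) {
    StatusToggleSketch reported = new StatusToggleSketch();
    System.out.println(reported.notifyAsReported(true, false)); // RUNNING
    System.out.println(reported.notifyAsReported(true, false)); // PAUSING

    StatusToggleSketch idempotent = new StatusToggleSketch();
    System.out.println(idempotent.notifyIdempotent(true, false)); // RUNNING
    System.out.println(idempotent.notifyIdempotent(true, false)); // RUNNING
  }
}
```

With the reported structure, every repeated notification flips the state, so an even number of identical safe-mode exit events leaves the service paused. In the nested variant, repeated events are harmless and the service pauses only when the SCM loses leadership or re-enters safe mode.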

Issue Links

is fixed by

HDDS-9962 Intermittent timeout in TestBlockDeletion.testBlockDeletion

Resolved

People

Assignee: Vyacheslav Tutrinov

Reporter: Vyacheslav Tutrinov

Votes: 0

Watchers: 2

Dates

Created: 03/May/24 06:47

Updated: 06/May/24 18:03

Resolved: 06/May/24 18:03