Details
- Bug
- Status: Resolved
- Major
- Resolution: Duplicate
- 1.4.0
- None
- None
Description
SCM has a number of internal services, all of which implement the org.apache.hadoop.hdds.scm.ha.SCMService interface. The interface has a method for notifying the services about changes in Raft status or in safe mode. While testing the block deletion service, we observed unexpected behavior:
- transactions are flushed to the DB (i.e. snapshots were taken)
- containers are closed
- BUT transactions aren't sent to the DNs, leaving several million block deletion transactions unhandled
Investigation showed that the event for SCM exiting safe mode was triggered multiple times, which eventually moved the SCMBlockDeletingService to the PAUSING state:
org.apache.hadoop.hdds.scm.block.SCMBlockDeletingService#notifyStatusChanged
public void notifyStatusChanged() {
  serviceLock.lock();
  try {
    if (scmContext.isLeaderReady() && !scmContext.isInSafeMode()
        && serviceStatus != ServiceStatus.RUNNING) {
      safemodeExitMillis = clock.millis();
      serviceStatus = ServiceStatus.RUNNING;
    } else {
      serviceStatus = ServiceStatus.PAUSING;
    }
  } finally {
    serviceLock.unlock();
  }
}
- 1st trigger: SCM is LEADER, SCM is NOT in safe mode, the service is NOT in RUNNING state -> the service transitions to RUNNING state
- 2nd trigger: SCM is LEADER, SCM is NOT in safe mode, the service IS in RUNNING state (as a result of the 1st trigger) -> the condition is false, so the else branch moves the service to PAUSING state
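The two triggers above can be reproduced in isolation. The following is a minimal sketch, not the Ozone code and not necessarily the change made in HDDS-9962: the class StatusToggleSketch, its boolean fields (stand-ins for scmContext.isLeaderReady() and scmContext.isInSafeMode()), the initial PAUSING status, and the notifyIdempotent variant are all hypothetical names and assumptions introduced for illustration. It shows that the reported branching flips the status on a repeated notification, and how nesting the RUNNING check would make a repeated notification a no-op.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the service; not the actual Ozone class. */
class StatusToggleSketch {
  enum ServiceStatus { RUNNING, PAUSING }

  private final Lock serviceLock = new ReentrantLock();
  // Assumed stand-ins for scmContext.isLeaderReady() / scmContext.isInSafeMode().
  boolean leaderReady;
  boolean inSafeMode;
  // Assumption: the service starts out paused.
  ServiceStatus status = ServiceStatus.PAUSING;

  StatusToggleSketch(boolean leaderReady, boolean inSafeMode) {
    this.leaderReady = leaderReady;
    this.inSafeMode = inSafeMode;
  }

  /** Same branching as the method quoted in this issue. */
  void notifyAsReported() {
    serviceLock.lock();
    try {
      if (leaderReady && !inSafeMode && status != ServiceStatus.RUNNING) {
        status = ServiceStatus.RUNNING;
      } else {
        // Also reached when the service is already RUNNING.
        status = ServiceStatus.PAUSING;
      }
    } finally {
      serviceLock.unlock();
    }
  }

  /** Illustrative variant: a repeated notification leaves RUNNING untouched. */
  void notifyIdempotent() {
    serviceLock.lock();
    try {
      if (leaderReady && !inSafeMode) {
        if (status != ServiceStatus.RUNNING) {
          // The real method would record safemodeExitMillis here.
          status = ServiceStatus.RUNNING;
        }
      } else {
        status = ServiceStatus.PAUSING;
      }
    } finally {
      serviceLock.unlock();
    }
  }

  public static void main(String[] args) {
    StatusToggleSketch reported = new StatusToggleSketch(true, false);
    reported.notifyAsReported(); // 1st trigger
    reported.notifyAsReported(); // 2nd trigger
    System.out.println("as reported: " + reported.status); // PAUSING

    StatusToggleSketch idempotent = new StatusToggleSketch(true, false);
    idempotent.notifyIdempotent(); // 1st trigger
    idempotent.notifyIdempotent(); // 2nd trigger
    System.out.println("idempotent: " + idempotent.status); // RUNNING
  }
}
```

In the variant, the PAUSING branch is reached only when SCM is not a ready leader or is in safe mode, so the number of times the safe-mode exit event fires no longer affects the final state.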
Issue Links
- is fixed by HDDS-9962 Intermittent timeout in TestBlockDeletion.testBlockDeletion (Resolved)