Details
- Type: Sub-task
- Status: Closed
- Priority: Critical
- Resolution: Fixed
None
Description
When we clean up the application in timeoutPlaceholderProcessing() there are two cases:
- First case: we clean up all lingering placeholder allocations on the running app.
- Second case: the failure of the app, which cleans up lingering asks (no response needed from the shim) and all placeholders, after which we fail the app.
The cleanup of the placeholders in both cases is instigated by the core, and we need to wait for the cleanup to happen on the shim side before we proceed. This is unlike the removal of the app signalled by the RM: the request arrives unexpectedly for the shim, not when the app is deleted on the shim side.
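One way to picture "waiting for the cleanup to happen on the shim side" is a small tracker of releases that the core has requested but the shim has not yet confirmed. The sketch below is only an illustration: the type pendingReleases, its methods and the allocation IDs are hypothetical names and are not taken from the YuniKorn code base.

```go
package main

import "fmt"

// pendingReleases tracks placeholder allocations whose release the core has
// requested but the shim has not yet confirmed.
// Hypothetical type for illustration, not part of YuniKorn.
type pendingReleases struct {
	waiting map[string]bool // keyed by an assumed allocation ID
}

// newPendingReleases registers every placeholder the core asked the shim to remove.
func newPendingReleases(ids ...string) *pendingReleases {
	p := &pendingReleases{waiting: make(map[string]bool)}
	for _, id := range ids {
		p.waiting[id] = true
	}
	return p
}

// confirm records the shim's confirmation for one placeholder and reports
// whether every requested release has now been confirmed.
func (p *pendingReleases) confirm(id string) bool {
	delete(p.waiting, id)
	return len(p.waiting) == 0
}

func main() {
	p := newPendingReleases("ph-1", "ph-2")
	fmt.Println(p.confirm("ph-1")) // false: ph-2 is still outstanding
	fmt.Println(p.confirm("ph-2")) // true: the core may now proceed
}
```

Only once confirm reports true would the core continue with the next step for the application.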
For case 1 we do not have a problem. The placeholders are terminated, the app runs as normal and is not moved to Completed until everything has finished. We do NOT have an issue in the states leading to Completed as we have already handled it there (see below).
For the failure case we immediately unlink the queue as we move into the Failed state, because the move calls moveTerminatedApp() via the callback. That causes an issue: we should be waiting for the shim to respond back to the core with the confirmation of the removal.
This might require a new state so that it can be done in two steps: trigger the cleanup and move to the Failing state, then move to Failed when everything is cleaned up.
BTW: introducing a new Failing state should also include renaming Waiting to Completing, as that is in line with what the state does and lines up the two final states.
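The two-step idea could look roughly like the sketch below. It is a simplified model and not the actual core implementation: the app struct, the allowed transition table, failOnTimeout() and placeholderRemoved() are hypothetical, only a handful of states are modelled, and the queueLinked flag merely stands in for the unlinking that moveTerminatedApp() performs.

```go
package main

import "fmt"

type appState string

const (
	Running    appState = "Running"
	Completing appState = "Completing" // proposed rename of Waiting
	Completed  appState = "Completed"
	Failing    appState = "Failing" // proposed new intermediate state
	Failed     appState = "Failed"
)

// allowed is an assumed, heavily reduced transition table.
var allowed = map[appState][]appState{
	Running:    {Completing, Failing},
	Completing: {Completed},
	Failing:    {Failed},
}

// app is a hypothetical stand-in for the core's application object.
type app struct {
	state       appState
	pending     int  // placeholder releases not yet confirmed by the shim
	queueLinked bool // stands in for the link dropped by moveTerminatedApp()
}

// transition moves the app to a new state and unlinks the queue only when a
// final state is reached.
func (a *app) transition(to appState) error {
	for _, s := range allowed[a.state] {
		if s == to {
			a.state = to
			if to == Failed || to == Completed {
				a.queueLinked = false
			}
			return nil
		}
	}
	return fmt.Errorf("invalid transition %s -> %s", a.state, to)
}

// failOnTimeout is step one: trigger the cleanup and move to Failing.
func (a *app) failOnTimeout(placeholders int) error {
	if err := a.transition(Failing); err != nil {
		return err
	}
	a.pending = placeholders
	if a.pending == 0 {
		return a.transition(Failed)
	}
	return nil
}

// placeholderRemoved is step two: the shim confirmed one removal; move to
// Failed once nothing is outstanding.
func (a *app) placeholderRemoved() error {
	if a.state != Failing || a.pending == 0 {
		return nil
	}
	a.pending--
	if a.pending == 0 {
		return a.transition(Failed)
	}
	return nil
}

func main() {
	a := &app{state: Running, queueLinked: true}
	if err := a.failOnTimeout(2); err != nil {
		panic(err)
	}
	fmt.Println(a.state, a.queueLinked) // Failing true
	for i := 0; i < 2; i++ {
		if err := a.placeholderRemoved(); err != nil {
			panic(err)
		}
	}
	fmt.Println(a.state, a.queueLinked) // Failed false
}
```

In this model the queue stays linked for as long as the app is in Failing, which is the behaviour the description asks for.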
Issue Links
- causes: YUNIKORN-581 Update state machine doc (Closed)
- is related to: YUNIKORN-567 Queue resources are not cleaned up after placeholder cleanup (Closed)
- relates to: YUNIKORN-577 Correctly handle ask releases on TIMEOUT (Closed)
- links to