Details

    Description

      When a task fails after it has been allocated, we mark the task as failed but do not update the pod to a terminal state.

      For example, if binding the pod's volumes fails after allocation, the task fails but the pod never reaches a terminal state:

      Pod events:

      Events:
        Type     Reason                 Age   From      Message
        ----     ------                 ----  ----      -------
        Normal   Scheduling             30s   yunikorn  dev-nnjxy/pod-btv0y is queued and waiting for allocation
        Normal   Scheduled              30s   yunikorn  Successfully assigned dev-nnjxy/pod-btv0y to node yktest-worker
        Warning  PodVolumesBindFailure  20s   yunikorn  bind volumes to pod failed, name: dev-nnjxy/pod-btv0y, binding volumes: context deadline exceeded
        Normal   TaskFailed             20s   yunikorn  Task dev-nnjxy/pod-btv0y is failed

      The pod stays Pending and does not go to a terminal state. Scheduler log:

      2024-09-20T11:22:27.601Z    INFO    shim.fsm    cache/task_state.go:381    Task state transition    {"app": "yunikorn-dev-03c96-autogen", "task": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"}
      2024-09-20T11:22:37.606Z    DEBUG    shim.cache.task    cache/task.go:499    prepare to send release request    {"applicationID": "yunikorn-dev-03c96-autogen", "taskID": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", "allocationKey": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "task": "Allocated", "terminationType": ""}
      2024-09-20T11:22:37.606Z    DEBUG    core.scheduler    scheduler/scheduler.go:117    enqueued event    {"eventType": "*rmevent.RMUpdateAllocationEvent", "event": {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"task completed","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
      2024-09-20T11:22:37.606Z    ERROR    shim.cache.task    cache/task.go:475    task failed    {"appID": "yunikorn-dev-03c96-autogen", "taskID": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "reason": "bind volumes to pod failed, name: dev-03c96/pod-bgg9h, binding volumes: context deadline exceeded"}
      2024-09-20T11:22:37.606Z    INFO    shim.fsm    cache/task_state.go:381    Task state transition    {"app": "yunikorn-dev-03c96-autogen", "task": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h", "source": "Allocated", "destination": "Failed", "event": "TaskFail"}
      2024-09-20T11:22:37.606Z    INFO    core.scheduler.partition    scheduler/partition.go:1359    removing allocation from application    {"appID": "yunikorn-dev-03c96-autogen", "allocationKey": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "terminationType": "STOPPED_BY_RM"}
      2024-09-20T11:22:37.606Z    DEBUG    core.scheduler.ugm    ugm/manager.go:132    Decreasing resource usage    {"user": "kubernetes-admin", "queue path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
      2024-09-20T11:22:37.606Z    DEBUG    core.scheduler.ugm    ugm/manager.go:152    Decreasing resource usage for user    {"user": "kubernetes-admin", "queue path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", "group": "", "resource": "map[pods:1]", "removeApp": true}
      2024-09-20T11:22:37.606Z    DEBUG    core.scheduler.ugm    ugm/queue_tracker.go:132    Decreasing resource usage    {"queue path": "root", "hierarchy": ["root", "dev-03c96"], "application": "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
      2024-09-20T11:22:37.607Z    DEBUG    core.scheduler.ugm    ugm/queue_tracker.go:132    Decreasing resource usage    {"queue path": "root.dev-03c96", "hierarchy": ["dev-03c96"], "application": "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
      2024-09-20T11:22:37.607Z    DEBUG    core.scheduler.ugm    ugm/queue_tracker.go:159    Removed application from running applications    {"application": "yunikorn-dev-03c96-autogen", "queue path": "root.dev-03c96", "queue name": "dev-03c96"}
      2024-09-20T11:22:37.608Z    DEBUG    core.scheduler.ugm    ugm/queue_tracker.go:165    Successfully decreased resource usage    {"queue path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "total resource after decreasing": "map[]", "total applications after decreasing": 0}
      2024-09-20T11:22:37.608Z    DEBUG    core.scheduler.ugm    ugm/queue_tracker.go:159    Removed application from running applications    {"application": "yunikorn-dev-03c96-autogen", "queue path": "root", "queue name": "root"}
      2024-09-20T11:22:37.608Z    DEBUG    core.scheduler.ugm    ugm/queue_tracker.go:165    Successfully decreased resource usage    {"queue path": "root", "application": "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "total resource after decreasing": "map[]", "total applications after decreasing": 0}
      2024-09-20T11:22:37.608Z    DEBUG    core.scheduler.application    objects/application.go:336    Application state timer initiated    {"appID": "yunikorn-dev-03c96-autogen", "state": "Completing", "timeout": "30s"}
      2024-09-20T11:22:37.608Z    INFO    core.scheduler.fsm    objects/application_state.go:147    Application state transition    {"appID": "yunikorn-dev-03c96-autogen", "source": "Running", "destination": "Completing", "event": "completeApplication"}
      2024-09-20T11:22:37.608Z    DEBUG    core.rmproxy    rmproxy/rmproxy.go:60    enqueue event    {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completing","stateTransitionTimestamp":1726831357608331511,"message":"completeApplication"}]}, "currentQueueSize": 0}
      2024-09-20T11:22:37.608Z    INFO    core.scheduler.application    objects/application.go:615    ask removed successfully from application    {"appID": "yunikorn-dev-03c96-autogen", "ask": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "pendingDelta": "nil resource"}
      2024-09-20T11:22:37.608Z    DEBUG    core.rmproxy    rmproxy/rmproxy.go:60    enqueue event    {"eventType": "*rmevent.RMReleaseAllocationEvent", "event": {"RmID":"mycluster","ReleasedAllocations":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"allocation remove as per RM request","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]}, "currentQueueSize": 1}
      2024-09-20T11:22:37.608Z    DEBUG    shim.rmcallback    cache/scheduler_callback.go:108    UpdateApplication callback received    {"UpdateApplicationResponse": "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completing\" stateTransitionTimestamp:1726831357608331511 message:\"completeApplication\"}"}
      2024-09-20T11:22:37.608Z    DEBUG    shim.rmcallback    cache/scheduler_callback.go:137    status update callback received    {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completing"}
      2024-09-20T11:22:37.608Z    DEBUG    shim.rmcallback    cache/scheduler_callback.go:47    UpdateAllocation callback received    {"UpdateAllocationResponse": "released:{partitionName:\"[mycluster]default\" applicationID:\"yunikorn-dev-03c96-autogen\" terminationType:STOPPED_BY_RM message:\"allocation remove as per RM request\" allocationKey:\"6f3dd7fa-72b4-40cf-a700-43e51394a06b\"}"}
      2024-09-20T11:22:38.605Z    INFO    shim.cache.application    cache/application.go:239    task removed    {"appID": "yunikorn-dev-03c96-autogen", "taskID": "6f3dd7fa-72b4-40cf-a700-43e51394a06b"}
      2024-09-20T11:23:07.607Z    DEBUG    core.scheduler.application    objects/application.go:352    Application state: auto progress    {"applicationID": "yunikorn-dev-03c96-autogen", "state": "Completing"}
      2024-09-20T11:23:07.607Z    DEBUG    core.scheduler.application    objects/application.go:384    Application state timer cleared    {"appID": "yunikorn-dev-03c96-autogen", "state": "Completing"}
      2024-09-20T11:23:07.607Z    DEBUG    core.scheduler.application    objects/application.go:336    Application state timer initiated    {"appID": "yunikorn-dev-03c96-autogen", "state": "Completed", "timeout": "72h0m0s"}
      2024-09-20T11:23:07.607Z    INFO    core.scheduler.fsm    objects/application_state.go:147    Application state transition    {"appID": "yunikorn-dev-03c96-autogen", "source": "Completing", "destination": "Completed", "event": "completeApplication"}
      2024-09-20T11:23:07.607Z    DEBUG    core.rmproxy    rmproxy/rmproxy.go:60    enqueue event    {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completed","stateTransitionTimestamp":1726831387607633721,"message":"completeApplication"}]}, "currentQueueSize": 0}
      2024-09-20T11:23:07.607Z    DEBUG    shim.rmcallback    cache/scheduler_callback.go:108    UpdateApplication callback received    {"UpdateApplicationResponse": "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completed\" stateTransitionTimestamp:1726831387607633721 message:\"completeApplication\"}"}
      2024-09-20T11:23:07.607Z    DEBUG    shim.rmcallback    cache/scheduler_callback.go:137    status update callback received    {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completed"}
      2024-09-20T11:23:07.607Z    INFO    core.scheduler.queue    objects/queue.go:830    Application completed and removed from queue    {"queueName": "root.dev-03c96", "applicationID": "yunikorn-dev-03c96-autogen"}
      2024-09-20T11:23:07.607Z    INFO    core.scheduler.partition    scheduler/partition.go:1539    Removing terminated application from the application list    {"appID": "yunikorn-dev-03c96-autogen", "app status": "Completed"}
      2024-09-20T11:23:07.607Z    INFO    core.scheduler.application.usage    objects/application_summary.go:60    YK_APP_SUMMARY: {ApplicationID: yunikorn-dev-03c96-autogen, SubmissionTime: 1726831345581, StartTime: 1726831347597, FinishTime: 1726831387607, User: kubernetes-admin, Queue: root.dev-03c96, State: Completed, RmID: mycluster, ResourceUsage: TrackedResource{UNKNOWN:pods=10}, PreemptedResource: TrackedResource{}, PlaceholderResource: TrackedResource{}} 
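
      The log above shows the task moving from Allocated to Failed while nothing moves the pod out of Pending. One possible direction for a fix is sketched below, under stated assumptions: all types and names here (Pod, Task, Fail, the "TaskFailed" reason string) are hypothetical stand-ins defined locally, not the actual YuniKorn shim or Kubernetes client API. A real change would have to update the pod status through the API server rather than mutate a local struct.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod phase strings. It is defined locally
// so that this sketch has no external dependencies.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodFailed  PodPhase = "Failed"
)

// Pod is a hypothetical, minimal stand-in for the pod object.
type Pod struct {
	Name    string
	Phase   PodPhase
	Reason  string
	Message string
}

// TaskState uses the state names that appear in the log above.
type TaskState string

const (
	TaskScheduling TaskState = "Scheduling"
	TaskAllocated  TaskState = "Allocated"
	TaskFailed     TaskState = "Failed"
)

// Task is a hypothetical stand-in for the shim's task object.
type Task struct {
	Alias string
	State TaskState
	Pod   *Pod
}

// Fail moves the task to Failed. If the task had already been allocated and
// the pod is still Pending, it also moves the pod to a terminal phase, which
// is the step the issue reports as missing.
func (t *Task) Fail(reason string) {
	wasAllocated := t.State == TaskAllocated
	t.State = TaskFailed
	if wasAllocated && t.Pod != nil && t.Pod.Phase == PodPending {
		t.Pod.Phase = PodFailed
		t.Pod.Reason = "TaskFailed" // assumed reason string, for illustration only
		t.Pod.Message = reason
	}
}

func main() {
	pod := &Pod{Name: "dev-03c96/pod-bgg9h", Phase: PodPending}
	task := &Task{Alias: pod.Name, State: TaskAllocated, Pod: pod}

	// Simulate the volume binding failure seen in the log.
	task.Fail("bind volumes to pod failed: context deadline exceeded")
	fmt.Printf("task=%s pod=%s reason=%s\n", task.State, pod.Phase, pod.Reason)
}
```

      In this sketch a task that fails before allocation leaves the pod untouched, so only the post-allocation failure path changes behaviour.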

            People

              zhuqi Qi Zhu