  1. Apache YuniKorn
  2. YUNIKORN-1615

Node occupied resource is negative


Details

    Description

      After some tasks complete, the YuniKorn scheduler reports negative used (occupied) resources on nodes, which throws scheduling into chaos. I tried restarting the scheduler, but it eventually reports negative resources again once some tasks have completed. I found the following entries in the YuniKorn scheduler log:

      2023-03-01T18:10:40.038Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.234", "request": {"nodes":[{"nodeID":"172.18.45.234","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131376640},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-10126244160},"vcore":{"value":-9700}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:10:44.635Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.228", "request": {"nodes":[{"nodeID":"172.18.45.228","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-10314987840},"vcore":{"value":-9400}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:10:44.870Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.230", "request": {"nodes":[{"nodeID":"172.18.45.230","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-8829204224},"vcore":{"value":-8500}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:10:49.279Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.235", "request": {"nodes":[{"nodeID":"172.18.45.235","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131372544},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-8504048512},"vcore":{"value":-7800}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:15:42.686Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.230", "request": {"nodes":[{"nodeID":"172.18.45.230","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-9902946048},"vcore":{"value":-9500}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:15:43.857Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.234", "request": {"nodes":[{"nodeID":"172.18.45.234","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131376640},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-11199985984},"vcore":{"value":-10700}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:15:49.229Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.235", "request": {"nodes":[{"nodeID":"172.18.45.235","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131372544},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-9577790336},"vcore":{"value":-8800}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:15:54.457Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.228", "request": {"nodes":[{"nodeID":"172.18.45.228","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-11388729664},"vcore":{"value":-10400}}}}],"rmID":"k8s_dios"}}
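
      To pull the affected nodes out of a large scheduler log, something like the following sketch may help. It assumes each entry carries its JSON payload at the end of the line, as in the entries above; the function name `negative_occupied` is hypothetical, and the sample line is the first entry above with the `schedulableResource` field dropped for brevity.

```python
import json

# First log entry from above, shortened: the schedulableResource field is omitted.
SAMPLE_LINE = (
    '2023-03-01T18:10:40.038Z INFO cache/nodes.go:140 report occupied resources updates '
    '{"node": "172.18.45.234", "request": {"nodes":[{"nodeID":"172.18.45.234","action":2,'
    '"attributes":{"ready":"true"},"occupiedResource":{"resources":'
    '{"memory":{"value":-10126244160},"vcore":{"value":-9700}}}}],"rmID":"k8s_dios"}}'
)


def negative_occupied(log_line):
    """Return {nodeID: {resource: value}} for every negative occupied resource in one log line."""
    start = log_line.find("{")  # assumption: the JSON payload starts at the first "{"
    if start == -1:
        return {}
    payload = json.loads(log_line[start:])
    result = {}
    for node in payload.get("request", {}).get("nodes", []):
        occupied = node.get("occupiedResource", {}).get("resources", {})
        # A quantity without a "value" key (e.g. "hugepages-1Gi":{}) is treated as 0.
        negatives = {
            name: quantity.get("value", 0)
            for name, quantity in occupied.items()
            if quantity.get("value", 0) < 0
        }
        if negatives:
            result[node["nodeID"]] = negatives
    return result


if __name__ == "__main__":
    print(negative_occupied(SAMPLE_LINE))
```

      Running it over every line of the log and merging the results would give the latest negative values per node.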

      Yunikorn UI

       

      Health Check Result & Log

       

      2023-03-02T03:25:52.310Z        WARN    scheduler/health_checker.go:176 Scheduler is not healthy        {"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":false,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: [\"172.18.45.228\" \"172.18.45.235\" \"172.18.45.234\" \"172.18.45.230\"]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}
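
      Only one of the eleven checks in that entry fails, which is hard to see in the raw line. A minimal sketch for listing the failing checks, assuming the "health check values" array has already been parsed into a list of dicts; `failed_checks` is a hypothetical helper, and the sample holds just two of the entries from the log above:

```python
# Two entries copied from the health check log above; the other nine are omitted.
SAMPLE_CHECKS = [
    {
        "Name": "Negative resources",
        "Succeeded": True,
        "Description": "Check for negative resources in the partitions",
        "DiagnosisMessage": "Partitions with negative resources: []",
    },
    {
        "Name": "Negative resources",
        "Succeeded": False,
        "Description": "Check for negative resources in the nodes",
        "DiagnosisMessage": 'Nodes with negative resources: ["172.18.45.228" '
                            '"172.18.45.235" "172.18.45.234" "172.18.45.230"]',
    },
]


def failed_checks(checks):
    """Return (Name, Description) for every check whose Succeeded flag is false."""
    return [(c["Name"], c["Description"]) for c in checks if not c["Succeeded"]]


if __name__ == "__main__":
    for name, description in failed_checks(SAMPLE_CHECKS):
        print(f"{name}: {description}")
```

      Applied to the full entry, this should leave only the node-level "Negative resources" check; the "Consistency of data" checks all report success in the log above.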

       

       

      Kubernetes version
      Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8", GitCommit:"5575935422cc1cf5169dfc8847cb587aa47bac5a", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:07Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}

      Attachments

        1. fullstatedump.json
          1.86 MB
          Jie Ke
        2. image-2023-03-02-11-25-14-484.png
          31 kB
          Jie Ke
        3. image-2023-03-02-11-23-34-052.png
          34 kB
          Jie Ke


            People

              ccondit Craig Condit
              kej1 Jie Ke
              Votes: 2
              Watchers: 12
