Autoscaler: Bug: Disagreement between real scheduler and CA's simulation on whether a pod is schedulable

Created on 9 Dec 2020 · 5 Comments · Source: kubernetes/autoscaler

Which component are you using?:
Cluster Autoscaler

What version of the component are you using?:

v1.17.1 CA

v1.18.6 K8s

Component version:

What k8s version are you using (kubectl version)?:

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-13T18:06:54Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS

What did you expect to happen?:

An unschedulable pod triggers the creation of a new node.

What happened instead?:

The CA determines that an unschedulable pod can in fact be scheduled onto a particular node. This is incorrect: the pod cannot be scheduled there.

    cluster-autoscaler-aws-cluster-autoscaler-76c5554bc8-hdjhw aws-cluster-autoscaler I1209 17:27:34.557231       1 filter_out_schedulable.go:125] Pod worker-1234 marked as unschedulable can be scheduled on existing node ip-10-224-3-175.us-west-2.compute.internal. Ignoring in scale up.

    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2020-12-09T16:34:53Z"
        message: '0/18 nodes are available: 16 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/ingress:
          }, that the pod didn''t tolerate, 5 Insufficient memory.'

The pod in question has a CPU request of 4 cores and the node that the CA believes to be a good fit for the pod has exactly 4 cores of schedulable space available.

    resources:
      limits:
        cpu: "4"
        memory: 8Gi
      requests:
        cpu: "4"
        memory: 8Gi

(75% of the node's 16 cores are already requested, leaving 4 cores available)

The disagreement seems to be that the real scheduler believes that

    if pod_requests < node_available; then schedule

whereas the CA thinks

    if pod_requests <= node_available; then schedule

This causes the pod to get stuck pending since the pod is never scheduled and the scenario doesn't trigger the CA to create a new node.
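
To make the suspected boundary case concrete, here is a minimal Python sketch of the two comparisons. Both rules are only the hypothesis from this report, not confirmed scheduler or CA behavior. The names `Node`, `fits_strict`, and `fits_inclusive` are made up for illustration, and the node is assumed to have exactly 16000 millicores allocatable with 12000 already requested.

```python
from dataclasses import dataclass


@dataclass
class Node:
    allocatable_mcpu: int  # total schedulable CPU, in millicores (assumed value)
    requested_mcpu: int    # CPU already requested by pods on the node

    @property
    def available_mcpu(self) -> int:
        return self.allocatable_mcpu - self.requested_mcpu


def fits_strict(pod_mcpu: int, node: Node) -> bool:
    """Hypothesized rule A: the request must be strictly below what is free."""
    return pod_mcpu < node.available_mcpu


def fits_inclusive(pod_mcpu: int, node: Node) -> bool:
    """Hypothesized rule B: a request equal to what is free still fits."""
    return pod_mcpu <= node.available_mcpu


# Numbers taken from the report: a 16-core node with 75% requested, a 4-core pod.
node = Node(allocatable_mcpu=16_000, requested_mcpu=12_000)
pod_mcpu = 4_000

scheduler_verdict = fits_strict(pod_mcpu, node)      # the pod is not placed
autoscaler_verdict = fits_inclusive(pod_mcpu, node)  # the pod is "schedulable"

# The pod stays pending: it is not placed, yet no scale-up is triggered.
stuck = (not scheduler_verdict) and autoscaler_verdict
print(stuck)  # True
```

If the two components really did differ this way, the two verdicts would diverge only when the request equals the free capacity exactly, which matches the reproduction steps below.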

How to reproduce it (as minimally and precisely as possible):

Attempt to schedule a pod with a CPU request exactly equal to what is available on a node. Observe that the pod never schedules and the CA never creates a new node for the unscheduled pod.
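
One way to set up such a pod is to generate the manifest with the request pinned to the free capacity of the target node. The sketch below is illustrative only: the helper `boundary_pod`, the pod name `boundary-test`, and the image string are placeholders that do not appear in the original report, and the CPU and memory values are the ones from the resources block above.

```python
import json


def boundary_pod(name: str, cpu: str, memory: str, image: str) -> dict:
    """Build a Pod manifest whose requests equal its limits."""
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {"name": "main", "image": image, "resources": resources}
            ]
        },
    }


# The image is a placeholder; substitute any image available to the cluster.
manifest = boundary_pod(
    "boundary-test", cpu="4", memory="8Gi", image="registry.example/your-image:tag"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. The request must be adjusted to whatever is actually free on a node in the cluster under test.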

Anything else we need to know?:

After briefly looking into the issue myself, it's unclear where the discrepancy is coming from. It seems as though the main decision making is delegated to the real scheduler's library, which you would expect to make the same decision.

Thinking about it now, given there is a minor version difference between the CA and k8s, it's possible that the scheduler library itself changed between those two versions on whether requests must be <= or <.

kind/bug lifecycle/stale

bpinske

Most helpful comment

We only support running "matching" CA and Kubernetes versions (i.e. CA 1.16 on a 1.16 cluster, CA 1.17 on a 1.17 cluster, etc.), because CA specifically uses the scheduler code from the matching version. Scheduler code changes over time and its behavior can differ between minor versions, leading to exactly this type of issue.

To be honest, most of the time running different versions actually works fine (once again, no one is promising that; you're doing it at your own risk), but as per @umialpha's comment above, 1.18 is a huge change in how scheduling works (and a complete rewrite of how CA interacts with the scheduler), so using CA >= 1.18 on a cluster < 1.18 (or vice versa) is very likely to lead to CA and the scheduler disagreeing.
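
The "matching versions" rule can be checked mechanically before deploying. The following is a small sketch, not an official tool: `minor_version` and `versions_match` are hypothetical helpers, and the second call uses a made-up CA version purely for contrast.

```python
from typing import Tuple


def minor_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from strings like 'v1.18.6' or '1.17.1'."""
    major, minor, *_ = version.lstrip("v").split(".")
    return int(major), int(minor)


def versions_match(ca_version: str, k8s_version: str) -> bool:
    """True when CA and the cluster share the same major.minor version."""
    return minor_version(ca_version) == minor_version(k8s_version)


# The combination from this issue: CA 1.17.1 on a v1.18.6 cluster.
print(versions_match("1.17.1", "v1.18.6"))  # False
# A hypothetical matching pair, for comparison.
print(versions_match("1.18.3", "v1.18.6"))  # True
```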

MaciekPytel on 14 Dec 2020

👍2

All 5 comments

I've checked the code and compared CA 1.17.1 with K8s v1.17.3. CA uses the scheduler's own logic to make its predictions, so there shouldn't be any disagreement. I think this is caused by some other problem.

umialpha on 14 Dec 2020

This is mentioned in the OP, but it's somewhat hard to read, so I'll clarify it.

I am running K8s v1.18.6 and CA 1.17.1. I recognize the FAQ recommends that you keep the CA and k8s version in lockstep, presumably to avoid this exact situation of potentially using different scheduler algo versions.

bpinske on 14 Dec 2020

I think k8s v1.18 has a breaking change in the scheduler (it moves all the logic into scheduler framework plugins). CA 1.17 still uses the k8s v1.17 scheduler logic, so there is a mismatch between them. I suggest you upgrade CA to v1.18 and see whether the problem still occurs.

umialpha on 14 Dec 2020

MaciekPytel on 14 Dec 2020

👍2

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale