Autoscaler: Figure out and implement custom handling for MatchInterPodAffinity predicate

Created on 24 Aug 2017  路  18Comments  路  Source: kubernetes/autoscaler

As part of working on CA performance we run a large scale-up test with some additional logging including call count and total duration spent in each predicate. The results are as follows:

E0824 09:27:54.815425       8 predicates.go:192] Predicate statistics for MaxAzureDiskVolumeCount: called 59985 times, total time 25.328538ms, mean duration 422ns
E0824 09:27:54.815489       8 predicates.go:192] Predicate statistics for MatchInterPodAffinity: called 59985 times, total time 1m48.73252767s, mean duration 1.812661ms
E0824 09:27:54.815497       8 predicates.go:192] Predicate statistics for CheckNodeCondition: called 59985 times, total time 49.05121ms, mean duration 817ns
E0824 09:27:54.815502       8 predicates.go:192] Predicate statistics for MaxEBSVolumeCount: called 59985 times, total time 24.906838ms, mean duration 415ns
E0824 09:27:54.815508       8 predicates.go:192] Predicate statistics for GeneralPredicates: called 59985 times, total time 114.434325ms, mean duration 1.907碌s
E0824 09:27:54.815534       8 predicates.go:192] Predicate statistics for NoDiskConflict: called 59985 times, total time 26.067526ms, mean duration 434ns
E0824 09:27:54.815553       8 predicates.go:192] Predicate statistics for NoVolumeNodeConflict: called 59985 times, total time 38.554035ms, mean duration 642ns
E0824 09:27:54.815559       8 predicates.go:192] Predicate statistics for CheckNodeDiskPressure: called 59985 times, total time 19.062642ms, mean duration 317ns
E0824 09:27:54.815564       8 predicates.go:192] Predicate statistics for PodToleratesNodeTaints: called 59985 times, total time 22.448605ms, mean duration 374ns
E0824 09:27:54.815568       8 predicates.go:192] Predicate statistics for MaxGCEPDVolumeCount: called 59985 times, total time 61.944698ms, mean duration 1.032碌s
E0824 09:27:54.815572       8 predicates.go:192] Predicate statistics for NoVolumeZoneConflict: called 59985 times, total time 64.231254ms, mean duration 1.07碌s 
E0824 09:27:54.815578       8 predicates.go:192] Predicate statistics for PodFitsResources: called 357952 times, total time 514.838808ms, mean duration 1.438碌s
E0824 09:27:54.815584       8 predicates.go:192] Predicate statistics for ready: called 59985 times, total time 50.931691ms, mean duration 849ns
E0824 09:27:54.815588       8 predicates.go:192] Predicate statistics for CheckNodeMemoryPressure: called 59985 times, total time 285.070472ms, mean duration 4.752碌s

It turns out that MatchInterPodAffinity predicate is 3 orders of magnitude slower compared to other predicates. This is likely because contrary to scheduler we don't do any precomputation for it and we don't maintain predicateMeta object.

After a quick glance at predicate code it makes sense - it needs to iterate over all existing pods to check if any of them has pod antiaffinity on the pod we're running predicates for. This brings up another problem - how does it get all pods and nodes? We only provide NodeInfo for a single node, the rest comes out of informer. However, that means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level antiaffinity on a simulated node it won't prevent adding pods to other simulated nodes in the same zone.

Bottom line is that using zone-level antiaffinity can cause CA to "overshoot" creating some nodes for pods that won't be able to schedule on them anyway. Fortunately, this is a pretty unlikely edge case and we will scale-down the unnecessary nodes without any problem.

cluster-autoscaler kinbug kinfeature lifecyclfrozen

Most helpful comment

@aleksandra-malinowska pointed out to me that a deployment with self-pod-antiaffinity causes nodes to be added one by one. This is because the problem with affinity predicate taking data from informers mentioned in the description of this issue is still not fixed.

Other optimizations were easier to implement and got us to where we met our performance target and we actually never started using predicateMeta when binpacking. Note that this is trickier than in case of scale-down, because:

  1. The state of cluster is constantly changing and we need to update predicateMeta in a way that doesn't cripple our performance and introduce bugs (using stale predicateMeta would break binpacking, likely in subtle and hard to debug ways).
  2. Regardless of how smart we are about calculating predicateMeta it will likely still hurt performance if there are no pods using hard pod affinity / antiaffinity. We should make sure this doesn't happen.

All 18 comments

This will not be easy to deal with and it looks like we can work around the performance problem with #253, so I think this is best left to 1.9 timeframe.

@davidopp @bsalamat Do you have any recommendations?

cc: @shyamjvs @porridge @gmarek @wojtek-t

It turns out that MatchInterPodAffinity predicate is 3 orders of magnitude slower compared to other predicates. This is likely because contrary to scheduler we don't do any precomputation for it

@wojtek-t I remember discussing with you sometime ago about not precomputing stuff in scheduler. It changed in the meanwhile? :)

@wojtek-t I remember discussing with you sometime ago about not precomputing stuff in scheduler. It changed in the meanwhile? :)

I don't understand. Can you clarify?

Sorry - When we were discussing about for e.g zone-level soft inter-pod anti-affinity, fwiu the scheduler was doing computation of O(nodes*pods) when scheduling a new pod. Though it seemed to be doable in O(pods + nodes) with precomputation. And from the comment it seemed like it's precomputing now. But I think I misunderstood as soft anti-affinity uses priorities instead of hard anti-affinity which uses predicates. Correct me if I said something stupid.

Yes - we were discussing some potential optimizations in scheduler. And those are still not done. But we do some caching in scheduler, and we don't do anything in cluster autoscaler.

We only provide NodeInfo for a single node, the rest comes out of informer. However, that means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level antiaffinity on a simulated node it won't prevent adding pods to other simulated nodes in the same zone.

If I'm understanding your statement correctly, @bsalamat ran into almost exactly the same problem in the preemption work (need to run scheduler predicates against cluster state minus preempted pods on some node). He wrote a function FilteredList() that is like List() but subtracts out the pods that match a filter provided as an argument. You might be able to do the opposite, define a function that does a List() but also returns some extra pods that you supply as an argument?

I may be misunderstanding your issue completely though.

@bskiba ran some follow-up performance tests and I've spent time experimenting with predicates. Below are results of those investigations:

  1. Fixing this is by far the most important performance problem we have. The longest simulation in performance test took 2m36, out of which 2m31 was spent inside MatchInterPodAffinity (in a cluster without a single pod using affinity or antiaffinity).
  2. We already have NodeInfo objects including for simulated nodes and pods. What we need is to be able to pass all of this and make predicate use this as the current state of the cluster. Looking at the code it seems this can be actually done pretty easily by calculating predicateMetadata before running predicates. If my understanding is correct this should solve the correctness issue, though I haven't tested it yet.
    Incidentally calculating predicateMeta improves the performance significantly, reducing maximum simulation time from 2m36s to 16.5s. In our test cluster we only tend to run a single MatchInterPodAffinity call per pre-calculation (as we only run it after other predicates have passed and we only need to find a single node we can fit pod onto). The performance gain should we significantly bigger if someone was actually using antiaffinity. So pre-calculating predicateMeta at least for some of the predicates we run seems to be the way to go.
  3. Even after applying the above we still spend >50% time doing affinity / antiaffinity related calculations in a cluster without a single pod using those features.
    CA logic is executed for a given point in time and we have a list of all pods that we consider. We can check all of them and if none of them defines affinity or antiaffinity we can disable the MatchInterPodAffinity and predicateMetadata calculation for the rest of the loop. I haven't tested this yet but I expect significant performance gains.

My current plan is to have PRs implementing both points 2 and 3 ready by tomorrow noon.

@davidopp @bsalamat does this approach sound reasonable to you?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

/lifecycle frozen
/remove-lifecycle stale

We should look more deeply into https://github.com/kubernetes/kubernetes/issues/54189 and see how it impacts CA. We will get those changes with godeps update for 1.2.0 anyway, but we should spend some time to understand them. In particular we should make sure that:

  1. The changes will play nice with our handling of MatchInterPodAffinity predicate.
  2. Will they improve our performance out of the box? Can we do something to get more juice out of them? Our strategy of 'tell users to never use pod affinity in large clusters and disable the predicate' may become even less attractive if kube-dns ends up using it by default (https://github.com/kubernetes/kubernetes/pull/57683).

@mwielgus ^

@aleksandra-malinowska pointed out to me that a deployment with self-pod-antiaffinity causes nodes to be added one by one. This is because the problem with affinity predicate taking data from informers mentioned in the description of this issue is still not fixed.

Other optimizations were easier to implement and got us to where we met our performance target and we actually never started using predicateMeta when binpacking. Note that this is trickier than in case of scale-down, because:

  1. The state of cluster is constantly changing and we need to update predicateMeta in a way that doesn't cripple our performance and introduce bugs (using stale predicateMeta would break binpacking, likely in subtle and hard to debug ways).
  2. Regardless of how smart we are about calculating predicateMeta it will likely still hurt performance if there are no pods using hard pod affinity / antiaffinity. We should make sure this doesn't happen.

@MaciekPytel what do you mean self-pod-antiaffinity in
a deployment with self-pod-antiaffinity causes nodes to be added one by one.?

It is a soft AntiAffinity (preferredDuringSchedulingIgnoredDuringExecution)?

It's a hard Pod AntiAffinity (requiredDuringSchedulingIgnoredDuringExecution) against a Pod's own labels, e.g., spreading kube-dns load across nodes.

Cluster Autoscaler completely ignores "soft" scheduler requirements (aka scores). preferredDuringScheduling don't impact CA's decisions or performance in any way.

Thanks, so it is safe to use soft Anti Affinity since it completely ignores. Thanks

Was this page helpful?
0 / 5 - 0 ratings