As part of working on CA performance we run a large scale-up test with some additional logging including call count and total duration spent in each predicate. The results are as follows:
E0824 09:27:54.815425 8 predicates.go:192] Predicate statistics for MaxAzureDiskVolumeCount: called 59985 times, total time 25.328538ms, mean duration 422ns
E0824 09:27:54.815489 8 predicates.go:192] Predicate statistics for MatchInterPodAffinity: called 59985 times, total time 1m48.73252767s, mean duration 1.812661ms
E0824 09:27:54.815497 8 predicates.go:192] Predicate statistics for CheckNodeCondition: called 59985 times, total time 49.05121ms, mean duration 817ns
E0824 09:27:54.815502 8 predicates.go:192] Predicate statistics for MaxEBSVolumeCount: called 59985 times, total time 24.906838ms, mean duration 415ns
E0824 09:27:54.815508 8 predicates.go:192] Predicate statistics for GeneralPredicates: called 59985 times, total time 114.434325ms, mean duration 1.907碌s
E0824 09:27:54.815534 8 predicates.go:192] Predicate statistics for NoDiskConflict: called 59985 times, total time 26.067526ms, mean duration 434ns
E0824 09:27:54.815553 8 predicates.go:192] Predicate statistics for NoVolumeNodeConflict: called 59985 times, total time 38.554035ms, mean duration 642ns
E0824 09:27:54.815559 8 predicates.go:192] Predicate statistics for CheckNodeDiskPressure: called 59985 times, total time 19.062642ms, mean duration 317ns
E0824 09:27:54.815564 8 predicates.go:192] Predicate statistics for PodToleratesNodeTaints: called 59985 times, total time 22.448605ms, mean duration 374ns
E0824 09:27:54.815568 8 predicates.go:192] Predicate statistics for MaxGCEPDVolumeCount: called 59985 times, total time 61.944698ms, mean duration 1.032碌s
E0824 09:27:54.815572 8 predicates.go:192] Predicate statistics for NoVolumeZoneConflict: called 59985 times, total time 64.231254ms, mean duration 1.07碌s
E0824 09:27:54.815578 8 predicates.go:192] Predicate statistics for PodFitsResources: called 357952 times, total time 514.838808ms, mean duration 1.438碌s
E0824 09:27:54.815584 8 predicates.go:192] Predicate statistics for ready: called 59985 times, total time 50.931691ms, mean duration 849ns
E0824 09:27:54.815588 8 predicates.go:192] Predicate statistics for CheckNodeMemoryPressure: called 59985 times, total time 285.070472ms, mean duration 4.752碌s
It turns out that MatchInterPodAffinity predicate is 3 orders of magnitude slower compared to other predicates. This is likely because contrary to scheduler we don't do any precomputation for it and we don't maintain predicateMeta object.
After a quick glance at predicate code it makes sense - it needs to iterate over all existing pods to check if any of them has pod antiaffinity on the pod we're running predicates for. This brings up another problem - how does it get all pods and nodes? We only provide NodeInfo for a single node, the rest comes out of informer. However, that means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level antiaffinity on a simulated node it won't prevent adding pods to other simulated nodes in the same zone.
Bottom line is that using zone-level antiaffinity can cause CA to "overshoot" creating some nodes for pods that won't be able to schedule on them anyway. Fortunately, this is a pretty unlikely edge case and we will scale-down the unnecessary nodes without any problem.
This will not be easy to deal with and it looks like we can work around the performance problem with #253, so I think this is best left to 1.9 timeframe.
@davidopp @bsalamat Do you have any recommendations?
cc: @shyamjvs @porridge @gmarek @wojtek-t
It turns out that MatchInterPodAffinity predicate is 3 orders of magnitude slower compared to other predicates. This is likely because contrary to scheduler we don't do any precomputation for it
@wojtek-t I remember discussing with you sometime ago about not precomputing stuff in scheduler. It changed in the meanwhile? :)
@wojtek-t I remember discussing with you sometime ago about not precomputing stuff in scheduler. It changed in the meanwhile? :)
I don't understand. Can you clarify?
Sorry - When we were discussing about for e.g zone-level soft inter-pod anti-affinity, fwiu the scheduler was doing computation of O(nodes*pods) when scheduling a new pod. Though it seemed to be doable in O(pods + nodes) with precomputation. And from the comment it seemed like it's precomputing now. But I think I misunderstood as soft anti-affinity uses priorities instead of hard anti-affinity which uses predicates. Correct me if I said something stupid.
Yes - we were discussing some potential optimizations in scheduler. And those are still not done. But we do some caching in scheduler, and we don't do anything in cluster autoscaler.
We only provide NodeInfo for a single node, the rest comes out of informer. However, that means it reflects the real state of the cluster, not our simulated state. If we've already placed a pod with zone-level antiaffinity on a simulated node it won't prevent adding pods to other simulated nodes in the same zone.
If I'm understanding your statement correctly, @bsalamat ran into almost exactly the same problem in the preemption work (need to run scheduler predicates against cluster state minus preempted pods on some node). He wrote a function FilteredList() that is like List() but subtracts out the pods that match a filter provided as an argument. You might be able to do the opposite, define a function that does a List() but also returns some extra pods that you supply as an argument?
I may be misunderstanding your issue completely though.
@bskiba ran some follow-up performance tests and I've spent time experimenting with predicates. Below are results of those investigations:
My current plan is to have PRs implementing both points 2 and 3 ready by tomorrow noon.
@davidopp @bsalamat does this approach sound reasonable to you?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
/lifecycle frozen
/remove-lifecycle stale
We should look more deeply into https://github.com/kubernetes/kubernetes/issues/54189 and see how it impacts CA. We will get those changes with godeps update for 1.2.0 anyway, but we should spend some time to understand them. In particular we should make sure that:
@mwielgus ^
@aleksandra-malinowska pointed out to me that a deployment with self-pod-antiaffinity causes nodes to be added one by one. This is because the problem with affinity predicate taking data from informers mentioned in the description of this issue is still not fixed.
Other optimizations were easier to implement and got us to where we met our performance target and we actually never started using predicateMeta when binpacking. Note that this is trickier than in case of scale-down, because:
@MaciekPytel what do you mean self-pod-antiaffinity in
a deployment with self-pod-antiaffinity causes nodes to be added one by one.?
It is a soft AntiAffinity (preferredDuringSchedulingIgnoredDuringExecution)?
It's a hard Pod AntiAffinity (requiredDuringSchedulingIgnoredDuringExecution) against a Pod's own labels, e.g., spreading kube-dns load across nodes.
Cluster Autoscaler completely ignores "soft" scheduler requirements (aka scores). preferredDuringScheduling don't impact CA's decisions or performance in any way.
Thanks, so it is safe to use soft Anti Affinity since it completely ignores. Thanks
Most helpful comment
@aleksandra-malinowska pointed out to me that a deployment with self-pod-antiaffinity causes nodes to be added one by one. This is because the problem with affinity predicate taking data from informers mentioned in the description of this issue is still not fixed.
Other optimizations were easier to implement and got us to where we met our performance target and we actually never started using predicateMeta when binpacking. Note that this is trickier than in case of scale-down, because: