/area autoscale
/kind dev
/assign @josephburnett
The default Elafros autoscaling strategy is based on concurrent requests and the set point is hardcoded at 1.0. We need to connect scaling to actual resource usage (CPU and memory). One way we can do that is to use Vertical Pod Autoscaling to adjust the resource requests of a revision to just what's needed to serve 1 request concurrently.
I did some initial testing with vertical pod autoscaling (vpa) on an Elafros revision.
The test was run on a 3x 32-core node GKE cluster. The revision was a cpu-bound workload which calculates the largest prime less than 10 million (about 1/2 cpu second per request). The initial cpu request size was 2000m (2 cpus/pod).
I started with 100 qps and waited for the vpa to narrow in on its recommendation. Then I increased the qps to 200 for 5 minutes. And then dropped qps back to 100. The entire test duration was 20 minutes.


When qps was increased to 200, the desired pod count roughly doubled, which is normal behavior by the Elafros autoscaler. However the cluster did not have enough capacity to schedule all requested pods (17/22). As a result, each scheduled pod was doing slightly more work during the 200 qps period.
The vpa settled on an initial recommendation of 4151m (~4 cpus/pod). When qps increased to 200, the recommendation increased to 5563m (~5.5 cpus/pod). After the load returned to 100 qps, the recommendation dropped to 5046m (~5 cpus/pod). Some time later, when the revision was idle, the recommendation returned to 4151m.
Vertical pod autoscaling seems stable with horizontal pod autoscaling (the Elafros autoscaler). It is able to increase and decrease cpu recommendations to match the workload. Note: the Elafros autoscaler and the vpa use different signals for autoscaling (request concurrency and cpu usage respectively) which likely accounts for their playing nicely together.
However increasing recommendations and pod sizes causes fragmentation of schedulable cpu space (especially in smaller node sizes such as 8-cores). With fragmentation, fewer pods can be scheduled and the load on each scheduled pod increases. This may lead to increased recommendations again, in a loop.
(raw data)
@josephburnett great test setup!
I think it would also be very interesting to see the total execution time for each calculation and plot it -- are you tracking that data?
@mbehrendt, I can easily get that data by turning on Istio sidecar injection. Then per-revision qps, latency and error rates will show up in Grafana (where I got the autoscaler graphs).
There are a few other test scenarios I want to test out as well:
One thing I noticed is that the VPA will never recommmend less than 350m cpu per container (1 cpu/pod). Currently we start with 500m cpu/pod, so VPA is always going up. I had to increase the starting cpu request to 2000m to verify decreasing recommendations. I want to do some more testing around this.
/milestone Ice Box
I've opened a separate issue to track the beta implementation of this feature with some improvements: https://github.com/knative/serving/issues/2929
Most helpful comment
I did some initial testing with vertical pod autoscaling (vpa) on an Elafros revision.
Test Conditions
The test was run on a 3x 32-core node GKE cluster. The revision was a cpu-bound workload which calculates the largest prime less than 10 million (about 1/2 cpu second per request). The initial cpu request size was 2000m (2 cpus/pod).
I started with 100 qps and waited for the vpa to narrow in on its recommendation. Then I increased the qps to 200 for 5 minutes. And then dropped qps back to 100. The entire test duration was 20 minutes.
Results
When qps was increased to 200, the desired pod count roughly doubled, which is normal behavior by the Elafros autoscaler. However the cluster did not have enough capacity to schedule all requested pods (17/22). As a result, each scheduled pod was doing slightly more work during the 200 qps period.
The vpa settled on an initial recommendation of 4151m (~4 cpus/pod). When qps increased to 200, the recommendation increased to 5563m (~5.5 cpus/pod). After the load returned to 100 qps, the recommendation dropped to 5046m (~5 cpus/pod). Some time later, when the revision was idle, the recommendation returned to 4151m.
Conclusions
Vertical pod autoscaling seems stable with horizontal pod autoscaling (the Elafros autoscaler). It is able to increase and decrease cpu recommendations to match the workload. Note: the Elafros autoscaler and the vpa use different signals for autoscaling (request concurrency and cpu usage respectively) which likely accounts for their playing nicely together.
However increasing recommendations and pod sizes causes fragmentation of schedulable cpu space (especially in smaller node sizes such as 8-cores). With fragmentation, fewer pods can be scheduled and the load on each scheduled pod increases. This may lead to increased recommendations again, in a loop.
(raw data)