Serving: Autoscaler to scrape pods for metrics

Created on 10 Oct 2018  路  27Comments  路  Source: knative/serving

Current behavior

Currently the autoscaler gets it's metrics directly from the queue-proxy. (The queue pushes to the autoscaler).

Desired behavior

Instead we want to collect pod metrics for a Prometheus endpoint on the pods (https://github.com/knative/serving/issues/2202). Either by scraping with Prometheus or directly with the autoscaler.

Example: Prometheus scraping the activator

https://github.com/knative/serving/blob/c6d9dc9409114a831aacd6f59b727c90f5901064/config/monitoring/200-common/300-prometheus/100-scrape-config.yaml#L56

areautoscale kinfeature

Most helpful comment

Also, sampling will dramatically reduce the load on the autoscaler for large revisions. E.g. for a revision with 1000 pods, the autoscaler will only need to process 3 data points per second instead of 1000.

All 27 comments

I'm taking a first stab at this. But instead of having Prometheus scrape the pods, I'm having the Autoscaler scrape them directly. This is because having Prometheus as a hard dependency and critical-path runtime component is a concern.

Scraping pod is hard when sidecar injection / mTLS is enabled I believe. This is a limitation of Istio, and for Prometheus we may have to use the version that comes with Istio to overcome this limitation. Not sure if we can do this for our autoscaler.

/cc @ZhiminXiang

Istio is working on a feature to support scraping pod from the Prometheus pre-installed by Istio when mTLS is enabled.
Currently we use our own Prometheus, not the one installed by Istio, which could potentially block scraping pod after mTLS is enabled.
/cc @mdemirhan

Scraping pod is hard when sidecar injection / mTLS is enabled

@tcnghia What makes it hard?

@josephburnett if we want to enable mTLS, then Prometheus is required to have Istio sidecar injected. After injecting sidecar into Prometheus, Prometheus is not able to scrape pod directly as Istio sidecar does not allow accessing Pod by using Pod IP.

Istio sidecar does not allow accessing Pod by using Pod IP

What technical reason prevents the Istio sidecar from allowing access by Pod IP? E.g. can we fix this?

One limitation is that all the network graph is replicated in all the Envoy sidecars. Restricting the network nodes to K8s Services, VirtualServices, and similar helps them (a) reduce the graph size, and more than that (b) also reduce how much updates they need to handle.

There are pending work on Istio side to support incremental network update, but it is still in design phase.

What I want is to stop injecting knowledge about the autoscaler into the deployment from within the revision controller which is a layering violation (@mattmoor). I just want pods to expose their metrics in the standard Prometheus format and whatever autoscaling implementation is in charge of the PodAutoscaler will go get it.

The limitations I'm trying to work around are:

  1. The goodies are on individual pods.
  2. Requests from within the mesh cannot access individual Pods.
  3. The Autoscaler is across the network.
  4. Requests across the network must be within the mesh (so they can use mTLS).

How about this as a way to bootstrap metrics into the mesh? Create a daemonset pusher pair, one within the mesh and one without. The pusher outside (A) scrapes pod metrics and the pusher inside (B) forwards them to the autoscaler over mTLS.

image

cc @mattmoor @evankanderson @vaikas-google @mdemirhan @tcnghia

I want to make sure I understand the comment in https://github.com/knative/serving/issues/2203#issuecomment-435511406. Right now, the only way to access a pod across the service mesh is when there is a VirtualService entry which includes the Pod. For various reasons (scalability, sanity, latency), we don't want to create a VirtualService for each Pod to enable individual addressability.

A few thoughts:

  1. Can we turn around the pull from the Pods to a "Push", or a 2-way channel (e.g. POST /scalemetrics ==> streaming body request of metric values and/or gRPC equivalent of the same)?
  2. Can we add an Istio feature to enable direct access to the local Pod? This would make the Envoy config = [(cluster-wide) Istio mesh config] + [Pod-local direct access config].

Item 1 seems particularly attractive, if we can use something like GRPC or a streaming response to collect items at a fine granularity

@evankanderson we push today, which has some problems:

  1. We need to know where to push (hinders pluggability)
  2. We don't authenticate the metrics endpoint at all, which puts more trust in the user pod than I'd like (think caller ID).

Switching to a pull model (theoretically) enables a trusted process to ask for request metrics from untrusted pods with a stronger sense of identity.

I'll pitch a third (strawman) alternative: expose the metrics endpoint via a Service on the mesh, and have the endpoint return the name of the Pod serving the request, then sample over the Service according to the scale of the Revision (assuming reasonable load balancing, this should get us pretty close).

@mattmoor, your third strawman proposal is not so bad. One thing we have going for us is that we don't have to have all the data points to determine what the likely average pod request concurrency is. We can sample the population.

Because Istio is load-balancing, I would expect population variance to be pretty low. E.g. if traffic spikes, I would expect to see a spike in traffic across most of the pods. I think that 16 samples would be enough. And we can collect those over the 6-second panic window so querying a revision metric service at 3 qps should be enough. @glyn and @jchesterpivotal, what do you think?

I would propose that the PodAutoscaler Reconciler would own the metrics service. It can set it up for kpa class PodAutoscalers and get the labels from the scaleRef target.

image

Random Internet calculator says ...
image

Also, sampling will dramatically reduce the load on the autoscaler for large revisions. E.g. for a revision with 1000 pods, the autoscaler will only need to process 3 data points per second instead of 1000.

When a sample turns out to be inaccurate, which it is bound to be some of the time, the autoscaler will either under- or over-scale. Under-scaling will presumably impact response time; over-scaling will increase costs unnecessarily. A later sample may correct the problem although the overall effect will be some measure of instability, possibly even under a fairly constant load.

Perhaps decreasing the sample size as load increases, as I think @mattmoor may have been suggesting, would help?

Yes, I think reducing sample rate as the total load increases is fine.

The overall behaviour of the system should be favourably non-linear with scale, since small populations are noisier and pooling sources of variability can lower average variability. As load goes up, service levels should become more stable for any given amount of change in load.

I haven't thought through how this would affect a PID controller. I assume that one or more of the PID coefficients would change according to the total load situation -- for example, do we make the controller more sensitive to short term fluctuations (P, D) if load is lower?

I haven't thought through how this would affect a PID controller. I assume that one or more of the PID coefficients would change according to the total load situation [...]

I suspect that a PID controller with variable coefficients opens up a whole realm of possibilities, some good and some not so good.

Touch茅.

A later sample may correct the problem although the overall effect will be some measure of instability, possibly even under a fairly constant load.

The affects of a bad sample will be more pronounced in the 6-second panic window (rather than the 60-second stable window). But the panic window is only used to scale when the autoscaler is panicking. And then it never decreases the number of pod, precisely to prevent this kind of flapping (instability).

With 3 qps, a single bad sample will be averaged with 180 other samples. So the effect of a bad sample will be not so much.

Additionally, we can add some hysteresis to dampen the noise. This is something I wanted to do anyway because we do see a little bit of hopping between two values.

@glyn, @jchesterpivotal what do you think about the 3 qps value? That's 3 qps for the whole revision. So the 6-second sample size will always be 18. And the 60-second sample size will always be 180. Even at high scale.

I'm going to move forward with the proposal to sample a metrics service, owned by the KPA: https://github.com/knative/serving/issues/2203#issuecomment-436039743

We can continue to discuss what the appropriate rate per scale is.

I think 3 samples/sec is fine; we can tune it easily enough with further testing and simulation.

One interesting point @tcnghia brought up is the accuracy of our scale-to-zero signal when we are sampling metrics (instead of collecting them all). I think we are still okay because 1) requests will be routed through the KBuffer before we turn down the Deployment, and the KBuffer still pushes metrics to the Autoscaler. And 2) because before scale-to-zero, the Deployment will be at low scale (likely 1 or close to it) and we will sample that one pod 3 times per second, so we're unlikely to miss a request. That kind of race condition already existed anyway, that a request comes in after we decide to scale to zero and we miss it. Thus the grace period.

@glyn, @jchesterpivotal what do you think about the 3 qps value?

I have no idea whether this will be sufficient. I think you'll need to try it, or at least model it fairly realistically, to have any confidence. (This feels very much like the problem of performance optimisation and the same _dictum_ probably applies: "Measure, don't guess".)

/assign @yanweiguo
as I will be out for the next 2 weeks.

@k4leung4: GitHub didn't allow me to assign the following users: as, I, will, be.

Note that only knative members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @yanweiguo as I will be

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k4leung4: GitHub didn't allow me to assign the following users: Yanwei, Guo.

Note that only knative members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign Yanwei Guo

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/assign @yanweiguo

@yanweiguo you might wait until next week to start on this when I'm back in Seattle (Dec 3rd). As part of adding the autoscaling.knative.dev/target annotation (https://github.com/knative/serving/issues/2205), I'm relayering the KPA metrics interface to allow KPA autoscaler targets to be updated. I want to avoid any massive code merge conflicts. 馃檪

Also, it seems that Envoy ignores ports not mentioned in the K8s Service so you may be able to scrape metrics directly after all. It's worth reconsidering which approach to take (direct or service, sample or exhaustive).

I was discussing Prometheus doing the scraping for us with @k4leung4 yesterday and wanted to summarize a couple of cons of that approach in case it is being considered the path to move forward:

  • Prometheus does not scale horizontally and sharding needs to be controlled manually. This can get tricky as the scale of the cluster grows. We have a similar problem with Autoscaler itself, and this can be tied into that somehow.
  • If sampling will be utilized (and seems like that is a strong possibility), Prometheus will not give us the flexibility we want - for example, you might want to sample X pods for Service A and Y pods for Service B, and achieving that level of control via Prometheus can get quite tricky and might require frequent reprogramming Prometheus.
  • Reprogramming Prometheus (i.e. telling it what to scrape) requires updating a config file and restarting Prometheus (afaik) and that is going to be very disruptive if that is going to be done frequently. If we will do sampling based on the size of services, or if we will pick the pods to scrape ourselves, we will need to reprogram very frequently and that will probably cause a very poor experience.
Was this page helpful?
0 / 5 - 0 ratings

Related issues

VladimirSmogitel picture VladimirSmogitel  路  7Comments

mattmoor picture mattmoor  路  7Comments

vtereso picture vtereso  路  5Comments

josephburnett picture josephburnett  路  6Comments

greghaynes picture greghaynes  路  6Comments