Things that need sorting out:
cc @mrice32
Some broad context and a few points of discussion are here: https://github.com/christian-posta/envoy-microservices-patterns/issues/2 (probably not terribly relevant).
Why would Envoy need to implement this itself? The C++ lib for Prometheus that we promote on prometheus.io seems to have histogram support.
I would encourage improving the existing libraries instead of rolling a custom implementation for native Prometheus support.
@brancz Envoy has a very large amount of plumbing already in place for stats, export, etc. It also has a very specific threading model. We also need built-in native HDR histograms in Envoy for other features. We will definitely investigate the various libraries available and figure out the right path forward before anyone starts work on this feature. (We first need to find someone who wants to build this specific feature; having HDR histogram support in Envoy is orthogonal, and I will open a separate issue on that.)
I just synced up with the Prometheus team. Here is the update:
@lita @jmphilli the first part of this (counter/gauge output) is a really good beginner ticket so if you want to give it a go that would be great. Basically we need to do the following:
/stats handler for Prometheus format here: https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L303
More planning is required for histograms. We can deal with that later.
If at all possible, /metrics is the default for Prometheus. It's configurable on the Prometheus side, but usually when there is native Prometheus support, projects adopt this convention.
Description and type information about a metric are technically optional; however, it is best practice to include them, as they allow Prometheus to do certain optimizations based on the type. A description can also be helpful for display when querying (neither of these is implemented today, but both have been discussed and the consensus is that this information is good to have). However, if this is problematic to add, it can be done in follow-up improvements as well.
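For reference, a counter or gauge rendered in the Prometheus text exposition format, with the optional HELP/TYPE lines, looks roughly like this (the metric name, label, and value are purely illustrative, not Envoy's actual naming):

```
# HELP envoy_cluster_upstream_rq_total Total upstream requests for the cluster.
# TYPE envoy_cluster_upstream_rq_total counter
envoy_cluster_upstream_rq_total{envoy_cluster_name="service1"} 1234
```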
:+1: for having /metrics be the Prometheus output path.
Sure, doing /metrics for the Prometheus admin path is easy enough.
I can update the path to be /metrics then. Right now, I have implemented it as /status?format=prometheus.
@lita Cool, thanks.
Typically for Prometheus, we use Content-Type headers, rather than URL params to adjust transport formats. We use the text format as the fallback method.
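For reference, the text format is identified by its own Content-Type value rather than a URL parameter; a scrape response in the text format would carry a header roughly like:

```
Content-Type: text/plain; version=0.0.4; charset=utf-8
```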
@lita FYI you can use https://github.com/envoyproxy/envoy/pull/1919 to optionally compress the output. I would do this as a follow up.
Update here: Once https://github.com/envoyproxy/envoy/pull/3130#pullrequestreview-114472861 lands we will be able to trivially export histogram summaries for Prometheus, and then we can consider this issue fixed. If someone wants to sign up for doing the histogram export that would be awesome!!!
I'll do this eventually if nobody else has yet, but I won't have time to start on it for awhile. If/when I start working on it, I'll comment here and assign it to myself. If that hasn't happened yet, someone else can claim it.
Is the plan to enable export of histograms as well as summaries?
Summaries are nice for looking at a single instance, but our dashboards produce an aggregate from multiple instances together and to do that the raw histogram type is needed.
@JonathanO the way the histogram library works it should be pretty straightforward to export both summaries and the full histograms as desired. cc @ramaraochavali @postwait
There's no need to bother with summaries, they have almost no usefulness since they can't be aggregated.
Yeah, I'd argue against doing summaries at all. They can be generated by another (external) tool.
OK. Whatever makes people happy is fine with me. Either way the change should be pretty trivial.
We use one of two ways to do this:
The library has a serialize function that produces a binary encoding and then we pass around a base64 string of this. This is the common way we do it b/c it is a small representation and doesn't require floating-point parsing on the other end.
The second way is to JSON-ize it by providing an array of strings that look like "H[bin]=count", so:
"something": [ "H[0.1]=34", "H[0.41]=32", "H[140000]=1" ]
This is easier to read and parse if you don't have the library on the other end.
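A minimal sketch of producing that JSON-ish form, assuming the histogram has already been flattened into (bin boundary, count) pairs; the Bin struct and toJsonBins helper here are hypothetical, not the actual circllhist API:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical flattened view of a histogram: one entry per non-empty bin.
// (The real circllhist bin layout is different; this is only for illustration.)
struct Bin {
  double boundary;  // lower edge of the bin
  uint64_t count;   // samples recorded in it
};

// Render each non-empty bin as an "H[boundary]=count" string, as in the
// example above.
std::vector<std::string> toJsonBins(const std::vector<Bin>& bins) {
  std::vector<std::string> out;
  out.reserve(bins.size());
  for (const Bin& b : bins) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "H[%g]=%llu", b.boundary,
                  static_cast<unsigned long long>(b.count));
    out.emplace_back(buf);
  }
  return out;
}
```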
The HistogramStatistics class can be expanded to have the "histogram" representation; this can be a very trivial change using the methods suggested by @postwait.
I'm looking at https://github.com/envoyproxy/envoy/pull/3130, and I'm possibly confused, but it's not actually implementing histograms, rather it's implementing pre-computed summary quantiles.
This is not going to be useful to Envoy users, as the pre-computed quantiles are not statistically comparable between instances.
@SuperQ We have the complete histogram implementation, but what we are displaying in /admin/stats is the summary information of that histogram. As I mentioned in the comment above, you can change the HistogramStatistics class to also store the complete raw histogram output, using the approaches suggested by @postwait with the passed-in histogram_t*.
Ok, that makes more sense, sorry. For Prometheus, we need to translate things into our standard text format.
@SuperQ ok. In that case you could get the JSON and translate it to your format, maybe? Not sure if there is a better way here. @postwait can confirm.
We want to include this natively, so getting the data directly from whatever internal datastructure is best. JSON is slow to deal with, which is why we have Prometheus format in the first place. We used to deal with JSON, but dropped it due to CPU cost.
@SuperQ It would be great if Prom could be extended to support dense histograms natively. But a whole text line with floating-point parsing per bin is pretty crazy slow. How about just supporting the encoded histogram natively?
It would not be hard to convert each histogram bin into a separate prom text line, but that seems overkill and quite noisy. Could be thousands of lines (due to thousands of bins).
@postwait We would like to have higher density histogram transport and storage, but for now we have to keep it limited to about 10 buckets. This is typically more than enough for basic monitoring and alerting needs, since the goal here is "is my traffic in SLO/SLA". We also typically have users wanting to store months to years of data; this gets expensive quickly, even with good sample compression.
Also note that a typical user is going to be collecting bucket samples on the order of every 15 seconds. Advanced users may tune this down to single-digit seconds, but again, this is for service health monitoring, not performance analysis.
@SuperQ I wasn't commenting on what you should store on the other end; that's really application specific. But if the protocol were extended to carry the dense histogram, then the remote reader could decide. They could store a few arbitrary quantiles or rework the bins as they see fit for their application.
Sure, I agree, I've been pushing for a denser histogram option as part of the OpenMetrics protocol work. But I haven't made any progress there. In practice it's not a big deal because we gzip the transport anyway. This reduces the wire size down enough that we don't really care.
For the purposes of this ticket, I think that the aim is to support existing Prometheus 1.x and 2.x users. The implementation therefore needs to expose the data in the current Prometheus histogram wire format.
@SuperQ am I right in thinking that the definition of the buckets also needs to be static, and consistent across instances for Prometheus' histogram_quantile() to work correctly? Are they configurable in the implementation used for #3130 ?
Yes, histogram buckets need to be consistent across a deployment so that all instances can be compared together. Prometheus uses cumulative less-than-or-equal-to ("le") buckets. We have a "typical network request buckets" pattern. These values are reasonably useful for bucketing HTTP request durations for a wide variety of users. Of course they don't work for all use cases, so it should be a configurable list of durations in seconds.
.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10
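For reference, a histogram with those default buckets would appear in the Prometheus text format as cumulative "le" buckets plus _sum and _count lines, roughly like this (metric name and values are illustrative; some buckets are elided):

```
# TYPE envoy_cluster_upstream_rq_time histogram
envoy_cluster_upstream_rq_time_bucket{le="0.005"} 2
envoy_cluster_upstream_rq_time_bucket{le="0.01"} 10
envoy_cluster_upstream_rq_time_bucket{le="0.025"} 35
...
envoy_cluster_upstream_rq_time_bucket{le="10"} 99
envoy_cluster_upstream_rq_time_bucket{le="+Inf"} 100
envoy_cluster_upstream_rq_time_sum 53.7
envoy_cluster_upstream_rq_time_count 100
```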
You can produce exactly that by walking and summing the bins in the llhist. I warn that this erodes the long tail wickedly under common workloads.
This should be a simple for loop.
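A rough sketch of that loop, assuming the dense histogram has already been flattened into (upper bound, count) pairs sorted by upper bound; the Bin struct and function below are hypothetical, not the actual circllhist or Envoy API:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical flattened dense bin: upper edge plus the samples recorded in it.
struct Bin {
  double upper_bound;
  uint64_t count;
};

// Degrade a dense histogram into Prometheus-style cumulative "le" buckets by
// walking the bins once and summing everything whose bin lies at or below
// each boundary. Bins that straddle a boundary are attributed to the next
// higher bucket, which is where the long-tail erosion comes from.
std::vector<uint64_t> toCumulativeBuckets(const std::vector<Bin>& bins,
                                          const std::vector<double>& boundaries) {
  std::vector<uint64_t> cumulative(boundaries.size(), 0);
  uint64_t running = 0;
  size_t i = 0;
  for (const Bin& bin : bins) {
    while (i < boundaries.size() && bin.upper_bound > boundaries[i]) {
      cumulative[i++] = running;  // boundary i sees only the bins fully below it
    }
    running += bin.count;
  }
  for (; i < boundaries.size(); ++i) {
    cumulative[i] = running;  // remaining boundaries see every sample
  }
  return cumulative;  // the "+Inf" bucket is simply the total ("running")
}
```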
Right, but like I said, this is for "out of SLA" type monitoring. If my SLA is 99th under 1 second, the buckets over that, and the long tail, are just noise.
I think different histograms will need different buckets. For instance, connect time should nearly always be sub-second, but request-duration could be very long, depending on the use case.
@ggreenway the llhist handles this already. How this is communicated back to prom is the question. We could simply report back all used bin boundaries... I think the issue is that this turns into "too many" time series on the back end, but it would only be a couple hundred (on the upper end) in practice.
I'll stand on my soapbox briefly and say if the data is only being used for a boolean "out of SLA" value, then this is reasonable, but if the user has the ability to graph the 99th percentile (or some other estimated quantile) then (1) you know they will and (2) it is negligently misleading.
+1 for tailoring a small set of buckets to a boolean "in/out of SLA" argument, but -1 on providing a tragically misleading p99 calculation to users.
FWIW I agree very strongly with @postwait on this. With that said, I don't use Prometheus so I don't have much skin in this game right now.
Regardless of the merits of different numbers of buckets, for Prometheus support we need the ability to specify a fixed set of buckets to output (potentially several different sets of buckets as @ggreenway pointed out.) Perhaps with a set of defaults which match what typical Prometheus exporters do. This is just how things work in the world of Prometheus today, and what Prometheus users (should) be expecting.
While I agree that more buckets is clearly better for accuracy, in reality a 99th %ile that comes out at, say, 20s when it should be 90s, doesn't really matter all that much if your SLO (and therefore a higher density of your buckets) is around 2s. 20s and 90s are both sufficiently close to +Inf in this case that not having high value buckets isn't much of a problem. (Though it is annoying to have to choose good bucket boundaries for the expected value range in the first place.)
It would indeed be nice if one day Prometheus had a solution that scales to a large number of buckets (though presumably they'd still have to be consistently defined across instances for aggregation to work), but right now it doesn't, so I guess that's what we've got to work with.
(We're a Prometheus user, right now we use the statsd sink to feed locally running Prometheus exporters that calculate our histograms for Prometheus to consume. I'm really looking forward to being able to get rid of that hackery!)
But I think we're in agreement that the buckets should be configurable, probably both the number of buckets and the range. So reducing the accuracy at the tail is then a user/configuration decision.
histogram buckets need to be consistent across a deployment
I see a problem with that. For example, in the process of deploying changes to these settings, there will be instances with different bucket sets simultaneously.
@curlup Yes, that will happen, but it's not a big problem, or a blocker to this feature.
What's the update on this issue? We're using Prometheus right now (on Envoy version 1.9.0-dev-8629ef783) and it's not possible to graph latency, since cluster.<cluster>.upstream_cx_length_ms is not exported.
Current status is someone needs to work on exporting histograms in a format that Prometheus expects.
I'd be happy to help out on this and work on it but would need some tips/pointers of where to integrate (I don't know the internal code base of Envoy all that well, but happy to learn).
One must be able to specify the bins for the export so that we can degrade the internal histogram into that form. Once that configuration is in place, I'm happy to provide pointers on how to munge a circllhist object into those summaries.
Another option, which I've seen people do successfully, is to just expose all of it. It only ends up being a hundred or so bin boundaries on average. I know people doing that with Prom.
@postwait, aren't the histograms already calculated and exposed over /stats, so they just need the Prometheus endpoint? I see the histogram data there, which is why I thought maybe it was just a formatting thing to expose the interface (but I could be oversimplifying).
I don't think they are exposed; did I miss that commit?! I thought it was just summary information.
I think @redhotpenguin is looking at getting that part to work.
Is this it? https://github.com/envoyproxy/envoy/pull/3130
/stats/ exposes the calculated quantiles, but not the bin counts.
From my dev instance:
cluster.service1.external.upstream_rq_time: P0(nan,13) P25(nan,13.25) P50(nan,13.5) P75(nan,13.75) P90(nan,13.9) P95(nan,13.95) P99(nan,13.99) P99.5(nan,13.995) P99.9(nan,13.999) P100(nan,14)
I'm working on exposing the log linear histogram data serialized in a stats sink. I've been trying to wrangle Bazel into building a gRPC client to hit the existing endpoint; the default metrics service usage isn't that clear to me currently.
The existing Metrics Service currently exposes quantiles only via the gRPC endpoint. But it is very easy to add bin support, as it follows the Prometheus proto. You should change https://github.com/envoyproxy/envoy/blob/master/source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.cc#L64 to do that. ParentHistogram has the log linear histogram; you can expose it via the ParentHistogram interface.
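As a rough illustration of what "following the Prometheus proto" means here, filling in a histogram message from precomputed buckets could look something like the sketch below. The field names come from the Prometheus client-model proto (metrics.proto); the include path and the surrounding Envoy plumbing are assumptions, not the actual grpc_metrics_service_impl.cc code:

```cpp
#include <cstdint>
#include <vector>

#include "metrics.pb.h"  // io.prometheus.client protos (include path is illustrative)

// Populate a Prometheus-proto histogram from precomputed cumulative buckets.
// cumulative[i] is the number of samples <= boundaries[i].
void fillHistogram(io::prometheus::client::Histogram* histogram,
                   const std::vector<double>& boundaries,
                   const std::vector<uint64_t>& cumulative,
                   uint64_t total_count, double sample_sum) {
  histogram->set_sample_count(total_count);
  histogram->set_sample_sum(sample_sum);
  for (size_t i = 0; i < boundaries.size(); ++i) {
    auto* bucket = histogram->add_bucket();
    bucket->set_upper_bound(boundaries[i]);
    bucket->set_cumulative_count(cumulative[i]);
  }
}
```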
For whoever wants to work on this, here are some code references. I don't think this is too hard to finish, someone just needs to dig in:
1) https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L620
2) For an example of how histograms are written out for other things see the gRPC code that @ramaraochavali referenced above or also here: https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L676
Basically the work here, as @postwait said, is as follows:
1) Decide if we are going to output standard Prometheus fixed buckets or just output the entire histogram which is also possible.
2) If fixed buckets, we will need to degrade the internal histogram data to the format Prometheus expects. @postwait @ramaraochavali et al. can help here. This will probably involve some histogram interface additions, but nothing too complicated, as we will just be operating on the already latched histograms, so there are no threading issues to contend with. (A rough sketch of what steps 2 and 3 could look like follows after this list.)
3) Then just write them out. :)
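To make steps 2 and 3 concrete, here is a minimal sketch of emitting the text lines once the cumulative counts per configured bucket have been computed (for example with a loop like the one sketched earlier). The function and naming are hypothetical, not the actual admin.cc code:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Emit one histogram in the Prometheus text exposition format: cumulative
// "le" buckets followed by _sum and _count. cumulative[i] is the number of
// samples <= boundaries[i]; total and sum cover all samples.
std::string formatPrometheusHistogram(const std::string& name,
                                      const std::vector<double>& boundaries,
                                      const std::vector<uint64_t>& cumulative,
                                      uint64_t total, double sum) {
  std::string out = "# TYPE " + name + " histogram\n";
  char buf[512];
  for (size_t i = 0; i < boundaries.size(); ++i) {
    std::snprintf(buf, sizeof(buf), "%s_bucket{le=\"%g\"} %llu\n", name.c_str(),
                  boundaries[i], static_cast<unsigned long long>(cumulative[i]));
    out += buf;
  }
  std::snprintf(buf, sizeof(buf), "%s_bucket{le=\"+Inf\"} %llu\n%s_sum %g\n%s_count %llu\n",
                name.c_str(), static_cast<unsigned long long>(total), name.c_str(), sum,
                name.c_str(), static_cast<unsigned long long>(total));
  out += buf;
  return out;
}
```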
Hey @stevesloka, are you working on this?
No, I have not. If you have cycles, feel free to take it. If not, I can try soon; my week has gotten away from me.
I managed to pick this up and do most of the plumbing in https://github.com/envoyproxy/envoy/pull/5601 to expose buckets and plumb that through into the Prometheus output. Hopefully someone can get a chance to review it. I did initially want to get the configurable buckets in, but the PR was getting quite complex as is and I wanted to make sure it was along the right lines.
(this is my first real foray into proper C++ code, so please do review it with a fine-tooth comb, thanks!)
Hey,
What about support for the usedonly parameter (https://www.envoyproxy.io/docs/envoy/latest/operations/admin.html?highlight=usedonly#get--stats?usedonly)? It is supported now for the statsd and JSON formats, but not for the Prometheus format: https://github.com/envoyproxy/envoy/blob/master/source/server/http/admin.cc#L617
From my experience, it is also respected when Envoy publishes metrics to statsd. That is a huge advantage: with 1000+ clusters and connections to only a couple of them, we don't want to publish metrics for every cluster. It would be very helpful if the Prometheus format supported that too.
If you think this is easy enough, I could try to implement this (but I have very little C++ experience).
Exposing only the metrics that have been updated since the last scrape breaks Prometheus in various ways (time series are incorrectly marked as stale; two different Prometheus servers scraping would interfere with each other). Printing these metrics is not a performance concern, and neither is processing them in Prometheus. Besides that, it's off topic. If you want to only collect metrics from individual Envoy servers, that's just a matter of configuring Prometheus not to scrape those servers.
Agreed. usedonly doesn't make sense for Prometheus. See https://www.robustperception.io/existential-issues-with-metrics.
@brancz @pschultz just to clarify the usedonly semantics: usedonly indicates whether a metric has ever been updated since Envoy started. It does not indicate whether it has been updated since the last scrape. Even if a metric has not been updated since the last scrape, as long as it has ever been updated, it is treated as used.
See https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats?usedonly for more details
Thanks for the clarification. It still wouldn't work well with Prometheus, as when a process restarts and a counter resets, there is a difference between a counter being 0 and not being there at all. The first is the continuation of a time series; the latter means the time series has ended.
I am in favor of adding usedonly. It is off by default, so it only affects people that opt in to it. I think it's useful in some cases.
For Prometheus it's a violation of the format. For statsd I absolutely see the use.
@brancz I understand your point about why this could be problematic. But the lack of usedonly doesn't prevent this case from happening anyway.
For instance, something Envoy is working on is the ability to lazy-load some configuration. For example, when a connection comes in with an SNI value that hasn't been seen before, Envoy could contact the mgmt/config server and ask for config. This would cause new metrics to appear, and if Envoy were to be restarted, those metrics would disappear again until a connection caused Envoy to load that particular bit of config again.
Another related use case we hear about is configurations that are very very large, where only a small subset is expected to be used on any given Envoy instance. (Some would argue that a better control plane should be used to prevent this, but the world is complicated.) Only publishing metrics for the small portion of config that is used can hugely reduce the number of published metrics.
Given this context, do you still feel that it would always be inappropriate to have usedonly be available for prometheus metrics?
Given this context, do you still feel that it would always be inappropriate to have usedonly be available for prometheus metrics?
FWIW I think we should add it. This is a tremendous perf benefit in many cases for metrics backends.
I also believe we should add usedonly (it's actually a further addition I'm wrapping up in a separate branch after the histogram changes are merged). It drastically reduces the number of metrics (especially if you have hundreds of clusters).
FWIW: In our setup, when we scrape Envoy with Prometheus, we add a label to each metric to identify the Envoy pod it came from (so we can isolate metrics for a specific Envoy pod). If a pod restarts, it gets a new label combination. That increases the cardinality, but for this use case usedonly would be perfect.
There's a big difference between metrics being hidden because they were _never_ measured/incremented and metrics being hidden because they haven't been updated since the last scrape. The latter would always be problematic for use with Prometheus, but the way I read your description now @ggreenway, it sounds like the former. The former I think would be ok to have, as it would be a conscious decision by the user, who would know that a non-existent metric may mean the same as a 0 metric.
@brancz To clarify, metrics disappearing after the last scrape would be a violation and would indeed break a lot of Prometheus setups. This isn't what's happening here. Metrics accumulate over the lifetime of the Envoy process.
The purpose of usedonly is _only_ to exclude metrics which haven't been touched/emitted even once in the lifetime of the process.
Sorry for misunderstanding. That would be ok if it is an explicit action by a user, which a query parameter would be.