I've been using the Prometheus support in Caddy 1.x, and it has worked extremely well. I would like to upgrade to v2, but there don't seem to be any out-of-the-box monitoring options.
IMO, Prometheus monitoring support should really be a first-class core feature. I'd be willing to contribute this code if it would be accepted.
Pinging @hairyhenderson. He's working on a Prometheus module (here: github.com/hairyhenderson/caddyprom) and has brought up metrics in discussions before.
@SuperQ Hi! I'd be glad to have any help I can get with Prometheus support 😉
For now my plugin mostly does what the v1 version did, but there are some possibilities for gathering other metrics by hooking in as a log encoder.
I've been playing with some of it in the https://github.com/hairyhenderson/caddyprom/tree/log-encoder-metrics branch, but TBH it's really awkward.
I've also been trying to convince @mholt that Prometheus support needs to be in the core, but we (quite rightly) need to prove why it can't just be a module 😉
Providing it as a module may be fine. We do this for CoreDNS, which uses the Caddy v1 plugin system.
What I'm mostly looking for is for it to be available in the main repo, just like logging is.
To me, the main reason it's important in core is that it's just as useful as pprof. Arguably more useful than pprof.
To be able to get metrics from all parts of Caddy, we have to be able to get metrics from arbitrary modules: in other words, we need a standard interface that all modules use to export metrics in some consistent, expected manner, readable not only by Prometheus but by other metrics consumers as well.
This is basically what our logging facilities already do: modules emit logs when things happen, and they are structured, and then logging modules can read & encode those structures any way they want to.
So I bet there is a way to make our existing logging functions facilitate metrics, i.e. metrics modules like Prometheus and others can just read the logs that are emitted, aggregate counts, and export them however they need to.
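To make the idea concrete, here is a rough, hypothetical sketch of how a metrics module could count emitted log entries by wrapping a zap core; none of these names exist in Caddy, and the metric name is invented:

```go
// Hypothetical sketch: count log entries by logger and level so a metrics
// module could export them. Nothing here is real Caddy code.
package logmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.uber.org/zap/zapcore"
)

// Invented metric name, for illustration only.
var logEntries = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "caddy_log_entries_total",
	Help: "Log entries emitted, by logger and level.",
}, []string{"logger", "level"})

// countingCore wraps another zapcore.Core and increments a counter for
// every entry that gets written, before delegating to the wrapped core.
type countingCore struct{ zapcore.Core }

func (c countingCore) With(fields []zapcore.Field) zapcore.Core {
	return countingCore{c.Core.With(fields)}
}

func (c countingCore) Check(e zapcore.Entry, ce *zapcore.CheckedEntry) *zapcore.CheckedEntry {
	if c.Enabled(e.Level) {
		return ce.AddCore(e, c)
	}
	return ce
}

func (c countingCore) Write(e zapcore.Entry, fields []zapcore.Field) error {
	logEntries.WithLabelValues(e.LoggerName, e.Level.String()).Inc()
	return c.Core.Write(e, fields)
}
```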
Prometheus metrics are readable by many other metrics consumers. It's a de facto standard like StatsD. We're in the process of evolving the de facto standard into a real IETF standard. Everything from DataDog to Zabbix can read the Prometheus metrics exposition protocol.
There's no need to build a complicated interface. Modules can register their metrics just like you would use expvar. It's just more functional and better supported by monitoring systems than expvar.
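For illustration, here is roughly what that registration looks like with the standard Prometheus Go client, side by side with expvar; the metric names are made up:

```go
// Side-by-side illustration; "cache_hits" is an invented example metric.
package example

import (
	"expvar"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// expvar: a bare integer exposed as JSON on /debug/vars.
var hitsExpvar = expvar.NewInt("cache_hits")

// Prometheus: a typed counter with help text, exposed on /metrics in a
// format that most monitoring systems can scrape.
var hitsProm = promauto.NewCounter(prometheus.CounterOpts{
	Name: "cache_hits_total",
	Help: "Total number of cache hits.",
})

func recordHit() {
	hitsExpvar.Add(1)
	hitsProm.Inc()
}
```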
I'm not opposed to adding more facilities for modules to be able to export information, but if we didn't and instead could take advantage of logging, then we can centralize _all_ the metrics into one place, and individual modules wouldn't have to think "Oh, I should emit this metric" -- instead they just emit logs as usual and the metrics module could transform them into Prometheus (or whatever IETF-standard-ized) metrics in a single place.
... that was a long sentence. Basically:
I'm just saying we should explore that last option thoroughly before deciding to add what is almost the same thing as existing logging into the core of Caddy.
The problem is there is no generic way to turn log events into metrics without the module writer needing to write the transformation for the metric aggregator.
For example:
IMO, metrics is a separate developer interface from logs. I may want to put in a hook for a metric counter where I do not want a log line. For example, if Caddy were to have a cache module, I would want to count cache requests and hits. But I don't need or want to put a logging feature in the hot path of the cache. A metric counter for cache misses, however, is extremely important, and Prometheus makes this cheap (sub-15 CPU nanoseconds for Inc()).
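A rough sketch of that cache example, assuming a hypothetical cache module and the standard Prometheus Go client (all names invented):

```go
// Sketch of a hypothetical cache module instrumenting its hot path with
// counters only; no log line is ever emitted here.
package cache

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	lookups = promauto.NewCounter(prometheus.CounterOpts{
		Name: "caddy_cache_lookups_total", // invented name
		Help: "Total cache lookups.",
	})
	misses = promauto.NewCounter(prometheus.CounterOpts{
		Name: "caddy_cache_misses_total", // invented name
		Help: "Cache lookups that missed.",
	})
)

type Cache struct {
	entries map[string][]byte
}

func (c *Cache) Get(key string) ([]byte, bool) {
	lookups.Inc()
	val, ok := c.entries[key]
	if !ok {
		misses.Inc() // a handful of nanoseconds, no allocation
	}
	return val, ok
}
```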
I would want to count cache requests and hits. But I don't need or want to put a logging feature in the hot path of the cache.
Understandable, but because logs are structured, they can be configured to reject certain entries, or accept only certain entries.
So what do you propose for implementation?
The simplest possible implementation is to include `muxWrap.mux.Handle("/metrics", promhttp.Handler())` in admin.go.
Then, any module that wishes to have metrics can register them with promauto.
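As a standalone illustration of that idea (not Caddy's actual admin code): everything registered via promauto gets exposed once the default handler is mounted on a mux.

```go
// Standalone illustration: expose everything registered via promauto on
// /metrics. In Caddy the handler would be added to the admin mux instead
// of a separate server.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("localhost:2019", mux))
}
```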
Here's a quick-n-dirty implementation of a basic metrics API.
https://github.com/caddyserver/caddy/compare/master...SuperQ:metrics
This basic implementation doesn't stop the logging infra from also having some kind of auto-generated metrics. It just allows any part of the code to register metrics, either the standard way or via a simplified wrapper.
@hairyhenderson what do you think of that?
@mholt I'm aligned with @SuperQ. The quick & dirty implementation is pretty much spot-on, except that I don't think I would put the `Observe` call inside the `if s.shouldLogRequest(r) {` block (whether or not a request should be logged is orthogonal to whether its metrics should be observed).
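Something like this sketch (invented names, not the actual patch) is what I mean: the histogram observation happens unconditionally and only the log line stays behind the check.

```go
// Invented names throughout; this only illustrates the separation of concerns.
package httpserver

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.uber.org/zap"
)

var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name: "caddy_http_request_duration_seconds", // hypothetical
	Help: "Time taken to handle a request.",
})

type server struct {
	next         http.Handler
	accessLogger *zap.Logger
}

func (s *server) shouldLogRequest(r *http.Request) bool { return true } // stand-in

func (s *server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	s.next.ServeHTTP(w, r)
	duration := time.Since(start)

	// Metrics are observed for every request...
	requestDuration.Observe(duration.Seconds())

	// ...while logging stays behind its own, separate check.
	if s.shouldLogRequest(r) {
		s.accessLogger.Info("handled request", zap.Duration("duration", duration))
	}
}
```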
Also, while adding `/metrics` to the admin API seems like a good idea, I don't want to have to expose the whole admin API just to be able to gather metrics. That opens up potential security issues.
For that matter, I think having `/debug/pprof` available separately from the admin API would also be useful (we've used https://github.com/conprof/conprof on occasion with that endpoint for profiling various Go apps).
The admin server is actually extensible, by the way -- for example, technically the very important and central `/load` endpoint is a module: https://github.com/caddyserver/caddy/blob/d5341625561bf67c40241a4fb649b0a4991e71b2/caddyconfig/load.go#L34-L62
With that in mind - sounds good to me. Is that all you would need? Want to draft up a PR, with a few core metrics you want to emit as well? I guess we can just add more with time then.
Oh... what would the config surface for this look like?
The admin server is actually extensible, by the way...
Ok, but does this mean we'd have to choose between exposing metrics _or_ keeping the admin API private? To me, observability is orthogonal to administration.
Oh... what would the config surface for this look like?
IMO it should be enabled by default, though people upgrading from Caddy v1 to Caddy v2 may want to retain the v1 (and caddyprom/v2) behaviour of having a separate listener on `:9180` that serves `/metrics`.
As far as options, the v1 options are a useful guide: https://github.com/miekg/caddy-prometheus#use
@hairyhenderson Yes, adding the `Observe` inside `shouldLogRequest` was just a quick hack so I could easily get the response status and hook into the `defer` func.
I feel like the admin endpoint is probably fine for metrics/pprof as well. What if the admin API had a read-only mode?
But `:9180` also works for me.
For config, typically we allow changing the listening host and port, and optionally adjusting the route path in case users want to put the metrics endpoint behind some other reverse proxy and need something like `/caddy/metrics`.
@hairyhenderson
Ok, but does this mean we'd have to choose between exposing metrics or keeping the admin API private? To me, observability is orthogonal to administration.
I don't think so -- the endpoints are all the same, whether they are from modules or baked-in. Admin API privacy is also irrelevant/orthogonal.
IMO it should be enabled by default
I'd be OK with this if there's no perf overhead (and nanoseconds don't concern me -- milliseconds and allocations do).
though people upgrading from Caddy v1 to Caddy v2 may want to retain the v1 (and caddyprom/v2) behaviour of having a separate listener on :9180 that serves /metrics.
Perhaps the actual HTTP handler part of it could optionally be registered as an `http.handlers` module. No need to copy the code: just satisfy the caddyhttp.MiddlewareHandler interface; it could be done with a small wrapper, I think. That way, they can chain a metrics handler into any of their HTTP servers at any point.
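A very rough sketch of such a wrapper, assuming the standard Prometheus Go client; the module ID and struct name are just placeholders, not a final design:

```go
// Placeholder module ID and struct; only meant to show how small the
// wrapper around promhttp.Handler() could be.
package metrics

import (
	"net/http"

	"github.com/caddyserver/caddy/v2"
	"github.com/caddyserver/caddy/v2/modules/caddyhttp"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func init() {
	caddy.RegisterModule(Handler{})
}

// Handler serves the default Prometheus registry wherever it is placed
// in an HTTP route.
type Handler struct{}

// CaddyModule returns the Caddy module information.
func (Handler) CaddyModule() caddy.ModuleInfo {
	return caddy.ModuleInfo{
		ID:  "http.handlers.metrics", // placeholder name
		New: func() caddy.Module { return new(Handler) },
	}
}

// ServeHTTP satisfies caddyhttp.MiddlewareHandler.
func (Handler) ServeHTTP(w http.ResponseWriter, r *http.Request, _ caddyhttp.Handler) error {
	promhttp.Handler().ServeHTTP(w, r)
	return nil
}

// Interface guard
var _ caddyhttp.MiddlewareHandler = (*Handler)(nil)
```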
@SuperQ
What if the admin API had a read-only mode?
I've mulled over some thoughts for advanced IAM/permission controls on the admin endpoint, but that'd be separate from this project.
For config, typically we allow changing the listening host and port, and optionally adjusting the route path in case users want to put the metrics endpoint behind some other reverse proxy and need something like `/caddy/metrics`.
If the metrics handler doubles as a Caddy HTTP handler module like I described, then the listening port and path are all taken care of by the user's config, no need to separately configure that in the metrics module.
Other than those things, is there really no other configuration for metrics? (I mean, that's great if not -- just want to check!)
Nope, for a typical setup, there's not much configuration. This is an advantage of the Prometheus data model for metrics. The instrumented software doesn't need any configuration. As the Caddy project, we only need to think about what and how to instrument. Not how to configure a monitoring system.
For things like histogram buckets, that configuration would be defined in the module. We _might_ want to have default duration and size buckets that modules could inherit from. But that's not something we need in a first iteration.
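If we ever did want shared defaults, it could be as simple as something like this (purely hypothetical names, nothing in Caddy today):

```go
// Hypothetical shared defaults; neither variable exists in Caddy today.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// DefaultDurationBuckets: the client library's defaults, 5ms to 10s.
	DefaultDurationBuckets = prometheus.DefBuckets

	// DefaultSizeBuckets: exponential buckets from 256 B to ~4 MB.
	DefaultSizeBuckets = prometheus.ExponentialBuckets(256, 4, 8)
)
```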
Cool. I'd be happy to review a pull request that adds this, then.
@hairyhenderson Do you want to take a first pass at a PR to add the basic metrics endpoint? Maybe a better version of the simple patch I made?
@mholt Since this feature is still being worked on, I would suggest having a look at https://opentelemetry.io/ for exposing metrics in v2. I would be happy to help and discuss more if needed.
By the way:
We're in the process of evolving the defacto standard into a real IETF standard.
The IETF standardization process has a tendency to change a lot of things before a spec gets codified. If we implement this too early, do we risk having to make breaking changes to eventually support the future standard?
@ankur-anand I am actively working with the OTel metrics group, which committed to supporting OM as a first-class wire format.
@mholt: I have talked to the Ops-WG chair, and as we have an incredible amount of working code and native adoption, it should not be an issue. If it were, I would submit it as Independent Informational outside of any WG; I've done that in the past and it worked.
supporting OM as a first-class wire format
Just as a point of translation - OM == OpenMetrics, which more-or-less == the Prometheus format. So, I think it's safe to use the Prometheus Go client. (Right?)
@hairyhenderson Yes, OM == OpenMetrics, which is where Prometheus is going with its own format. OpenMetrics was founded by Prometheus developers. Basically, as we look to extend Prometheus functionality, we want to extend it via OpenMetrics, with the added benefit of even more cross-community work. Prometheus already supports the draft format.
The Go client doesn't yet output OpenMetrics format, but there are issues to implement it as the draft solidifies.
So yes, it's a safe bet.
@SuperQ
Do you want to take a first pass
Sure - I can try and fit this in over the next couple days.
If the metrics handler doubles as a Caddy HTTP handler module like I described, then the listening port and path are all taken care of by the user's config, no need to separately configure that in the metrics module.
@mholt If I'm interpreting this right, I think what you're envisioning is a JSON config something like this:
```json
{
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [":9180"],
          "routes": [
            {
              "match": [{ "path": ["/metrics"] }],
              "handle": [{ "handler": "prometheus" }],
              "terminal": true
            }
          ]
        }
      }
    }
  }
}
```
I'll use this as a starting point if it makes sense.
@hairyhenderson Yep! That's exactly right. Possible Caddyfile equivalent:
```
:9180/metrics
prometheus
```
Could also consider using the name "metrics" instead of "prometheus" if it's going to be the one-stop-shop for all metric emissions.
consider using the name "metrics" instead of "prometheus"
Sure!
Just FYI I haven't forgotten about this - just been swamped lately... Hoping to find time to tackle some of this next week!
Please, would it be possible to add an option to enable/disable per-host metrics resolution as well?
We use Caddy v2 to serve hundreds of hosts, and it is complicated to obtain stats such as requests per host without log parsing.
Please, would it be possible to add an option to enable/disable per-host metrics resolution as well?
Thanks for posting here, @viliampucik 😉. Yup, that makes a lot of sense. I'll see if I can fit it into my draft PR at #3709, otherwise it can be added separately.
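For reference, a hedged sketch of what per-host resolution could look like: a host label on the request counter behind a config switch, so high-cardinality deployments can turn it off. The names and the config knob are just assumptions.

```go
// Invented names and config knob; only meant to show the labelling approach.
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "caddy_http_requests_total", // hypothetical
	Help: "Requests served, by host.",
}, []string{"host"})

func observeRequest(r *http.Request, perHostEnabled bool) {
	host := "" // collapse everything into one series when per-host is off
	if perHostEnabled {
		host = r.Host
	}
	requestsTotal.WithLabelValues(host).Inc()
}
```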
Loosely related: as a stopgap you can automate metric generation from logs with Loki, in a nicer way than with e.g. mtail.
That being said, doing it natively is obviously better.