I've been using the Prometheus support in Caddy 1.x, and it has worked extremely well. I would like to upgrade to v2, but there don't seem to be any out-of-the-box monitoring options.
IMO, Prometheus monitoring support should really be a first-class core feature. I'd be willing to contribute this code if it would be accepted.
Pinging @hairyhenderson. He's working on a Prometheus module (here: github.com/hairyhenderson/caddyprom) and has brought up metrics in discussions before.
@SuperQ Hi! I'd be glad to have any help I can get with Prometheus support 😉
For now my plugin mostly does what the v1 version did, but there are some possibilities for gathering other metrics by hooking in as a log encoder.
I've been playing with some of it in the https://github.com/hairyhenderson/caddyprom/tree/log-encoder-metrics branch, but TBH it's really awkward.
I've also been trying to convince @mholt that Prometheus support needs to be in the core, but we (quite rightly) need to prove why it can't just be a module 😉
Providing it as a module may be fine. We do this for CoreDNS, which uses the Caddy v1 plugin system.
What I'm mostly looking for is for it to be available in the main repo, just like logging is.
To me, the main reason it's important in core is that it's just as useful as pprof. Arguably more useful than pprof.
To be able to get metrics from all parts of Caddy, we have to be able to get metrics from arbitrary modules: in other words, we need a standard interface that all modules use to export metrics in some consistent, expected manner, readable not only by Prometheus but by other metrics consumers as well.
This is basically what our logging facilities already do: modules emit logs when things happen, and they are structured, and then logging modules can read & encode those structures any way they want to.
So I bet there is a way to make our existing logging functions facilitate metrics, i.e. metrics modules like Prometheus and others can just read the logs that are emitted, aggregate counts, and export them however they need to.
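To make the idea concrete, here is a rough, hypothetical sketch of how a metrics module could count emitted log entries by wrapping a zap core; none of these names exist in Caddy, and the metric name is invented:

```go
// Hypothetical sketch: count log entries by logger and level so a metrics
// module could export them. Nothing here is real Caddy code.
package logmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.uber.org/zap/zapcore"
)

// Invented metric name, for illustration only.
var logEntries = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "caddy_log_entries_total",
	Help: "Log entries emitted, by logger and level.",
}, []string{"logger", "level"})

// countingCore wraps another zapcore.Core and increments a counter for
// every entry that gets written, before delegating to the wrapped core.
type countingCore struct{ zapcore.Core }

func (c countingCore) With(fields []zapcore.Field) zapcore.Core {
	return countingCore{c.Core.With(fields)}
}

func (c countingCore) Check(e zapcore.Entry, ce *zapcore.CheckedEntry) *zapcore.CheckedEntry {
	if c.Enabled(e.Level) {
		return ce.AddCore(e, c)
	}
	return ce
}

func (c countingCore) Write(e zapcore.Entry, fields []zapcore.Field) error {
	logEntries.WithLabelValues(e.LoggerName, e.Level.String()).Inc()
	return c.Core.Write(e, fields)
}
```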
Prometheus metrics are readable by many other metrics consumers. It's a de facto standard like StatsD. We're in the process of evolving the de facto standard into a real IETF standard. Everything from DataDog to Zabbix can read the Prometheus metrics exposition protocol.
There's no need to build a complicated interface. Modules can register their metrics just like you would use expvar. It's just more functional and better supported by monitoring systems than expvar.
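For illustration, here is roughly what that registration looks like with the standard Prometheus Go client, side by side with expvar; the metric names are made up:

```go
// Side-by-side illustration; "cache_hits" is an invented example metric.
package example

import (
	"expvar"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// expvar: a bare integer exposed as JSON on /debug/vars.
var hitsExpvar = expvar.NewInt("cache_hits")

// Prometheus: a typed counter with help text, exposed on /metrics in a
// format that most monitoring systems can scrape.
var hitsProm = promauto.NewCounter(prometheus.CounterOpts{
	Name: "cache_hits_total",
	Help: "Total number of cache hits.",
})

func recordHit() {
	hitsExpvar.Add(1)
	hitsProm.Inc()
}
```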
I'm not opposed to adding more facilities for modules to be able to export information, but if we didn't and instead could take advantage of logging, then we can centralize _all_ the metrics into one place, and individual modules wouldn't have to think "Oh, I should emit this metric" -- instead they just emit logs as usual and the metrics module could transform them into Prometheus (or whatever IETF-standard-ized) metrics in a single place.
... that was a long sentence. Basically:
I'm just saying we should explore that last option thoroughly before deciding to add what is almost the same thing as existing logging into the core of Caddy.
The problem is there is no generic way to turn log events into metrics without the module writer needing to write the transformation for the metric aggregator.
For example:
IMO, metrics is a separate developer interface from logs. I may want to put in a hook for a metric counter where I do not want a log line. For example, if Caddy were to have a cache module, I would want to count cache requests and hits. But I don't need or want to put a logging feature in the hot path of the cache. A metric counter for cache misses, however, is extremely important, and Prometheus makes this cheap (sub-15 CPU nanoseconds for Inc()).
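A rough sketch of that cache example, assuming a hypothetical cache module and the standard Prometheus Go client (all names invented):

```go
// Sketch of a hypothetical cache module instrumenting its hot path with
// counters only; no log line is ever emitted here.
package cache

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	lookups = promauto.NewCounter(prometheus.CounterOpts{
		Name: "caddy_cache_lookups_total", // invented name
		Help: "Total cache lookups.",
	})
	misses = promauto.NewCounter(prometheus.CounterOpts{
		Name: "caddy_cache_misses_total", // invented name
		Help: "Cache lookups that missed.",
	})
)

type Cache struct {
	entries map[string][]byte
}

func (c *Cache) Get(key string) ([]byte, bool) {
	lookups.Inc()
	val, ok := c.entries[key]
	if !ok {
		misses.Inc() // a handful of nanoseconds, no allocation
	}
	return val, ok
}
```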
I would want to count cache requests and hits. But I don't need or want to put a logging feature in the hot path of the cache.
Understandable, but because logs are structured, they can be configured to reject certain entries, or accept only certain entries.
So what do you propose for implementation?
The simplest possible implementation is to include `muxWrap.mux.Handle("/metrics", promhttp.Handler())` in admin.go.
Then, any module that wishes to have metrics can register them with promauto.
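As a standalone illustration of that idea (not Caddy's actual admin code): everything registered via promauto gets exposed once the default handler is mounted on a mux.

```go
// Standalone illustration: expose everything registered via promauto on
// /metrics. In Caddy the handler would be added to the admin mux instead
// of a separate server.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("localhost:2019", mux))
}
```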
Here's a quick-n-dirty implementation of a basic metrics API.
https://github.com/caddyserver/caddy/compare/master...SuperQ:metrics
This basic implementation doesn't stop the logging infra from also having some kind of auto-generated metrics. It just allows any part of the code to register metrics, either the standard way or via a simplified wrapper.
@hairyhenderson what do you think of that?
@mholt I'm aligned with @SuperQ. The quick & dirty implementation is pretty much spot-on, except that I don't think I would put the `Observe` call inside the `if s.shouldLogRequest(r) {` block (whether or not a request should be logged is orthogonal to whether its metrics should be observed).
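Something like this sketch (invented names, not the actual patch) is what I mean: the histogram observation happens unconditionally and only the log line stays behind the check.

```go
// Invented names throughout; this only illustrates the separation of concerns.
package httpserver

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.uber.org/zap"
)

var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name: "caddy_http_request_duration_seconds", // hypothetical
	Help: "Time taken to handle a request.",
})

type server struct {
	next         http.Handler
	accessLogger *zap.Logger
}

func (s *server) shouldLogRequest(r *http.Request) bool { return true } // stand-in

func (s *server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	s.next.ServeHTTP(w, r)
	duration := time.Since(start)

	// Metrics are observed for every request...
	requestDuration.Observe(duration.Seconds())

	// ...while logging stays behind its own, separate check.
	if s.shouldLogRequest(r) {
		s.accessLogger.Info("handled request", zap.Duration("duration", duration))
	}
}
```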
Also, while adding `/metrics` to the admin API seems like a good idea, I don't want to have to expose the whole admin API just to be able to gather metrics. That opens up potential security issues.
For that matter, I think having `/debug/pprof` available separately from the admin API would also be useful (we've used https://github.com/conprof/conprof on occasion with that endpoint for profiling various Go apps).
The admin server is actually extensible, by the way -- for example, technically the very important and central `/load` endpoint is a module: https://github.com/caddyserver/caddy/blob/d5341625561bf67c40241a4fb649b0a4991e71b2/caddyconfig/load.go#L34-L62
With that in mind - sounds good to me. Is that all you would need? Want to draft up a PR, with a few core metrics you want to emit as well? I guess we can just add more with time then.
Oh... what would the config surface for this look like?
The admin server is actually extensible, by the way...
Ok, but does this mean we'd have to choose between exposing metrics _or_ keeping the admin API private? To me, observability is orthogonal to administration.
Oh... what would the config surface for this look like?
IMO it should be enabled by default, though people upgrading from Caddy v1 to Caddy v2 may want to retain the v1 (and caddyprom/v2) behaviour of having a separate listener on `:9180` that serves `/metrics`.
As far as options, the v1 options are a useful guide: https://github.com/miekg/caddy-prometheus#use
@hairyhenderson Yes, adding the `Observe` inside `shouldLogRequest` was just a quick hack so I could easily get the response status and hook into the `defer` func.
I feel like the admin endpoint is probably fine for metrics/pprof as well. What if the admin API had a read-only mode?
But `:9180` also works for me.
For config, typically we allow changing the listening host and port, and optionally adjusting the route path in case users want to put the metrics endpoint behind some other reverse proxy and need something like `/caddy/metrics`.
@hairyhenderson
Ok, but does this mean we'd have to choose between exposing metrics or keeping the admin API private? To me, observability is orthogonal to administration.
I don't think so -- the endpoints are all the same, whether they are from modules or baked-in. Admin API privacy is also irrelevant/orthogonal.
IMO it should be enabled by default
I'd be OK with this if there's no perf overhead (and nanoseconds don't concern me -- milliseconds and allocations do).
though people upgrading from Caddy v1 to Caddy v2 may want to retain the v1 (and caddyprom/v2) behaviour of having a separate listener on :9180 that serves /metrics.
Perhaps the actual HTTP handler part of it could optionally be registered as an `http.handlers` module. No need to copy the code: just satisfy the caddyhttp.MiddlewareHandler interface; it could be done with a small wrapper, I think. That way, they can chain a metrics handler into any of their HTTP servers at any point.
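A very rough sketch of such a wrapper, assuming the standard Prometheus Go client; the module ID and struct name are just placeholders, not a final design:

```go
// Placeholder module ID and struct; only meant to show how small the
// wrapper around promhttp.Handler() could be.
package metrics

import (
	"net/http"

	"github.com/caddyserver/caddy/v2"
	"github.com/caddyserver/caddy/v2/modules/caddyhttp"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func init() {
	caddy.RegisterModule(Handler{})
}

// Handler serves the default Prometheus registry wherever it is placed
// in an HTTP route.
type Handler struct{}

// CaddyModule returns the Caddy module information.
func (Handler) CaddyModule() caddy.ModuleInfo {
	return caddy.ModuleInfo{
		ID:  "http.handlers.metrics", // placeholder name
		New: func() caddy.Module { return new(Handler) },
	}
}

// ServeHTTP satisfies caddyhttp.MiddlewareHandler.
func (Handler) ServeHTTP(w http.ResponseWriter, r *http.Request, _ caddyhttp.Handler) error {
	promhttp.Handler().ServeHTTP(w, r)
	return nil
}

// Interface guard
var _ caddyhttp.MiddlewareHandler = (*Handler)(nil)
```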
@SuperQ
What if the admin API had a read-only mode?
I've mulled over some thoughts for advanced IAM/permission controls on the admin endpoint, but that'd be separate from this project.
For config, typically we allow changing the listening host and port, and optionally adjusting the route path in case users want to put the metrics endpoint behind some other reverse proxy and need something like `/caddy/metrics`.
If the metrics handler doubles as a Caddy HTTP handler module like I described, then the listening port and path are all taken care of by the user's config, no need to separately configure that in the metrics module.
Other than those things, is there really no other configuration for metrics? (I mean, that's great if not -- just want to check!)
Nope, for a typical setup, there's not much configuration. This is an advantage of the Prometheus data model for metrics. The instrumented software doesn't need any configuration. As the Caddy project, we only need to think about what and how to instrument. Not how to configure a monitoring system.
For things like histogram buckets, that configuration would be defined in the module. We _might_ want to have default duration and size buckets that modules could inherit from. But that's not something we need in a first iteration.
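If we ever did want shared defaults, it could be as simple as something like this (purely hypothetical names, nothing in Caddy today):

```go
// Hypothetical shared defaults; neither variable exists in Caddy today.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// DefaultDurationBuckets: the client library's defaults, 5ms to 10s.
	DefaultDurationBuckets = prometheus.DefBuckets

	// DefaultSizeBuckets: exponential buckets from 256 B to ~4 MB.
	DefaultSizeBuckets = prometheus.ExponentialBuckets(256, 4, 8)
)
```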
Cool. I'd be happy to review a pull request that adds this, then.
@hairyhenderson Do you want to take a first pass at a PR to add the basic metrics endpoint? Maybe a better version of the simple patch I made?
@mholt Since this feature is still being worked on, I would suggest having a look at https://opentelemetry.io/ for exposing metrics in v2. I would be happy to help and discuss more if needed.
By the way:
We're in the process of evolving the defacto standard into a real IETF standard.
The IETF standardization process has a tendency to change a lot of things before a spec gets codified. If we implement this too early, do we risk having to make breaking changes to eventually support the future standard?
@ankur-anand I am actively working with the OTel metrics group, which committed to supporting OM as a first-class wire format.
@mholt: I have talked to the Ops-WG chair, and as we have an incredible amount of working code and native adoption, it should not be an issue. If it were, I would submit it as Independent Informational outside of any WG; I've done that in the past and it worked.
supporting OM as a first-class wire format
Just as a point of translation - OM == OpenMetrics, which more-or-less == the Prometheus format. So, I think it's safe to use the Prometheus Go client. (Right?)
@hairyhenderson Yes, OM == OpenMetrics, which is where Prometheus is going with its own format. OpenMetrics was founded by Prometheus developers. Basically, as we look to extend Prometheus functionality, we want to extend it via OpenMetrics, with the added benefit of even more cross-community work. Prometheus already supports the draft format.
The Go client doesn't yet output OpenMetrics format, but there are issues to implement it as the draft solidifies.
So yes, it's a safe bet.
@SuperQ
Do you want to take a first pass
Sure - I can try and fit this in over the next couple days.
If the metrics handler doubles as a Caddy HTTP handler module like I described, then the listening port and path are all taken care of by the user's config, no need to separately configure that in the metrics module.
@mholt If I'm interpreting this right, I think what you're envisioning is a JSON config something like this:
```json
{
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [":9180"],
          "routes": [
            {
              "match": [{ "path": ["/metrics"] }],
              "handle": [{ "handler": "prometheus" }],
              "terminal": true
            }
          ]
        }
      }
    }
  }
}
```
I'll use this as a starting point if it makes sense.
@hairyhenderson Yep! That's exactly right. Possible Caddyfile equivalent:
```
:9180/metrics
prometheus
```
Could also consider using the name "metrics" instead of "prometheus" if it's going to be the one-stop-shop for all metric emissions.
consider using the name "metrics" instead of "prometheus"
Sure!
Just FYI I haven't forgotten about this - just been swamped lately... Hoping to find time to tackle some of this next week!
Please, would it be possible to add an option to enable/disable per-host metrics resolution as well?
We use Caddy v2 to serve hundreds of hosts, and it is complicated to obtain stats such as requests per host without log parsing.
Please, would it be possible to add an option to enable/disable per-host metrics resolution as well?
Thanks for posting here, @viliampucik 😉. Yup, that makes a lot of sense. I'll see if I can fit it into my draft PR at #3709, otherwise it can be added separately.
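For reference, a hedged sketch of what per-host resolution could look like: a host label on the request counter behind a config switch, so high-cardinality deployments can turn it off. The names and the config knob are just assumptions.

```go
// Invented names and config knob; only meant to show the labelling approach.
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "caddy_http_requests_total", // hypothetical
	Help: "Requests served, by host.",
}, []string{"host"})

func observeRequest(r *http.Request, perHostEnabled bool) {
	host := "" // collapse everything into one series when per-host is off
	if perHostEnabled {
		host = r.Host
	}
	requestsTotal.WithLabelValues(host).Inc()
}
```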
Loosely related: as a stopgap you can automate metric generation from logs with Loki, in a nicer way than with e.g. mtail.
That being said, doing it natively is obviously better.