Go: net/http/pprof: add seconds= to mutex/block profile handler

Created on 10 Jan 2018  路  30Comments  路  Source: golang/go

Using the mutex and block profiles from net/http/pprof currently requires modifying the application to call runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction. Since it's not clear what the performance impact of enabling these profiles is, few people will actually do it.

I'd like to add the following parameters, to make using these two profiles easier to use:

/debug/pprof/block?seconds=1&rate=1
/debug/pprof/mutex?seconds=1&frac=1

If the seconds parameter isn't specified the endpoints will behave as they currently do. If it is specified the block profile rate or mutex profile fraction will be set (default 1) and then the handler sleeps for the specified duration. At the end of the duration the profile is written and the original profile rate / fraction is restored.

This proposal requires changing SetBlockProfileRate to return the previous rate, similar to SetMutexProfileFraction.

NeedsFix Proposal Proposal-Accepted help wanted

Most helpful comment

@hyangah do you have any updates? I am really looking forward to this change.

All 30 comments

The profiles are reporting things that happened earlier in the program. The rate change should really happen once, at program startup. Otherwise if you hit the URL twice with different rates the second will report the first one's samples but scaled badly. Or do we now subtract inside the handler?

The defaults are zero, which is admittedly not very useful. Maybe we should fix that instead.

/cc @hyangah @pjweinbgo

Yes, the current mutex/block profiling implementation doesn't work well with rate changes.

Inside google, we implemented custom handlers that take a 'seconds' parameter and return the delta, and they are served from separate endpoints -- the additional parameter changes the semantics of data so much that we thought they are better to be served from different endpoints.

BTW, I don't see code for scaling samples using the sampling rate and that is a bug.
(https://github.com/golang/go/blob/master/src/runtime/pprof/pprof.go#L357) Is it too late for 1.10?

It's not absolutely too late, since this is a new feature, but it's awfully late.

I might be mistaken, but it looks like at least the mutex profile does scale samples (but ignores rate changes)?

It seems there are two options:

  1. Enable block and mutex profiling by default. But what is the correct number?
  2. Make current code work with rate changes and implement new (?) endpoints.

The former is a simple change but takes more work, the latter vice versa. I'd like to have mutex and block profiling work out of the box, so I'm happy to work on both.

1) support for difference reporting (seconds parameter): I think this is reasonable, but first consider whether this can be implemented from the client side. I.e., capturing the profiles twice and then asking pprof to report difference (-base flag). Would it be too inconvenient?

2) support for changing the sampling rate on the fly: this is difficult. they are not meant to be changed.

3) default to non-zero values: can anyone add measurement and benchmark.

(fyi, the bug I mentioned in my previous comment is fixed now (for mutex profile) and is not relevant to this proposal.)

Maybe if they are off then second=XXX rate=XXX can turn them on and send a diff back?
We should probably add the diff code for Go 1.11 anyway.

/cc @hyangah.

No objection. Just once the handler is hit with rate=XXX param, we may want to make it clear that the output from the default /debug/pprof/mutex endpoint is no longer interpretable.

I probably misunderstand how the whole profile machinery works under the hood. My initial proposal wasn't to enable differential profiling, but to make the mutex and block profile useable out of the box. My idea was to just toggle the them on / off on an as-needed basis. That was naive, for all the points you have pointed out.

As an application developer, I have no idea what the cost of enabling these profiles is. Should I sample one in thousand events? One in a hundred? The documentation isn't helping in that case. That's what I would like to solve.

Re finding a default value, I tried following @hyangah's suggestion to run a bunch of benchmarks (from https://github.com/golang/go/wiki/Benchmarks) and see how they change, but didn't get far due to noisy results. Is there a better corpus I could use?

@hyangah what do you think we should do here to make these profiles easy to use and helpful to developers?

How about resetting the profile counters if the sampling period is changed?

Since we migrated all these profiles to the proto buf format, the output can include information like the collection time and the duration. For legacy format output, we can output the duration as a comment.

/debug/pprof/{mutex,block} endpoints keep reporting the counters since the current sampling rate is set.
The new rate=XXX param changes the sampling rate.

After talking with @pjweinbgo I became less optimistic about choosing the non-zero default values. Their impact varies a lot depending on applications. For example, benchmarks that don't involve many mutex will not be affected.

Sounds good to me. Would you add the seconds parameter as well?

P.S. As I said I'd like to take a stab at this, if you don't mind @hyangah ? I'd probably split this in two CLs:

  • Reset profile data on rate change
  • Changes to net/http/pprof to add new parameters

Change https://golang.org/cl/94335 mentions this issue: runtime: reset block, mutex profile counters after rate change

There is concern about exposing the rate change functionality in production. Especially when the change can affect the server performance and when the service owner already tuned the parameters best for the service. I agree that's a dangerous feature.

@aalexand proposed an alternative that adds the 'seconds' parameter.

  • If 'seconds' param is given, report diff.
  • If the rate is 0 and 'seconds' param is given, set the rate for a predefined value (maybe 100 for mutex profile), report the diff after the seconds, and reset the rate back to 0.
  • If the rate is 0 and 'seconds' param is not given, return zero output (the current behavior)

@lmb, does this still meet your use case?

There is concern about exposing the rate change functionality in production. Especially when the change can affect the server performance [...]

If I understand correctly, the worry is that an operator would set an value that slows down a production service, and causes degradation? In my mind accessing the pprof endpoints is a privileged operation, so I expect the operator to know what they are doing / take responsibility. The new proposal feels like we are trying to blunt the tool because we don't trust the developer.

If the rate is 0 and 'seconds' param is given, set the rate for a predefined value (maybe 100 for mutex profile)

I think this is a catch 22, since again we do not know what the right value for different services is. Which is why it makes sense to add that argument in the first place.

In my mind accessing the pprof endpoints is a privileged operation.
we do not know what the right value for different services is. Which is why it makes sense to add that argument in the first place.

Go team should decide, but: no existing debug/pprof handler allows changing the profiling configuration parameters, so introducing that capability needs serious justification IMO. If you want different mutex profiling rate for different services, you could enable it in main() with the rate that works best. Giving a way to experiment with different rate values on a production service sounds like a bad idea to me. You can also implement your own HTTP handler with such parameter and call SetMutexProfileFraction prior to delegating to the default handler.

Curious: do you have examples of when SetMutexProfileFraction(1) didn't work for you and why? I'd naively expect that if a given contention is not visible in the 1% sample, it's not worth looking at, but I might be missing something.

@lmb It is common to have multiple levels of privileged operations in production systems. One for the full privilege to modify the production system. Another for the subset of the privileged operations such as read-only access for monitoring/debugging or temporary mutation ops for probing or profiling. Maybe you want to allow the second group of privilege to the common monitoring service, or the semi-trusted group of people who help debugging production issue, but you won't still want to give out the full privilege of your server.

The current /debug/pprof endpoints are operating with the second privilege group, so exposing the rate change feature through the endpoints doesn't seem like a good idea.

FYI most Go servers in Google have been running with 1% mutex profile rate (i.e. SetMutexProfileFraction(100)) for a while. Low enough to avoid regression but high enough to capture interesting events. That seems to be a good candidate for the default rate when profiling is temporary enabled by the seconds param.

Go team should decide, but: no existing debug/pprof handler allows changing the profiling configuration parameters, so introducing that capability needs serious justification IMO.

Is your worry changing the parameters, or crippling the service? Tracing a sufficiently busy service via pprof/trace will right now degrade it. How is that different?

Curious: do you have examples of when SetMutexProfileFraction(1) didn't work for you and why?

No, because I haven't gotten to that stage. I need to enable mutex profiling to see its effect, but don't have a way of evaluating it without making a release.

The current /debug/pprof endpoints are operating with the second privilege group [...]

Similar argument: I don't think pprof endpoints were / are written with that in mind, but maybe I'm wrong.

FYI most Go servers in Google have been running with 1% mutex profile rate (i.e. SetMutexProfileFraction(100)) for a while.

Thats useful to know, thanks! I'll just bite the bullet, enable it at that level and see what the impact is.

@lmb I'm unclear on the current status of this proposal. Are there further changes that you think would be useful in 1.11? Thanks.

It looks like we have support for ?seconds= in /debug/pprof/profile and /debug/pprof/trace but not mutexprofile etc. Is the proposed support for seconds= in mutex (do two requests and diff them) planned to be added to the pprof tool itself or the HTTP endpoint? Either way, seconds= sounds fine. I'd rather avoid rate= if possible for the reasons given above.

Obviously we missed doing anything for Go 1.11.

@rsc I think it would make sense to have it added to the HTTP endpoint for symmetry with /debug/pprof/profile and /debug/pprof/trace.

@hyangah Do we already have code somewhere for profiles diffs for mutex profiles and other cumulative profiles? If so can we publish that into Go 1.12?

Note that we should still make the default behavior (seconds= empty string or unspecified) return the cumulative for-all-time counts.

@rsc yes, we do have a code internally used.

The version of implementation we use uses github.com/google/pprof/profile for diff computation. Currently, the github.com/google/pprof/profile is vendorred under src/cmd/vendor, and also an old, partial copy of it is also in src/runtime/pprof/internal. None of them is accessible from src/net/http/pprof. Can we move the vendored upstream pprof somewhere accessible by all these three places? (e.g. src/internal or src/vendor?)

Change https://golang.org/cl/147598 mentions this issue: net/http/pprof: mutex, block profiles support seconds param

@hyangah do you have any updates? I am really looking forward to this change.

Gentle ping @hyangah, I added some code review suggestions to your CL a couple of months ago, please take a look. Thank you!

Change https://golang.org/cl/229537 mentions this issue: net/http/pprof: allow "seconds" parameters to most profiles

It may be too late鈥攕orry鈥攂ut I wonder whether instead of seconds=5 it鈥檇 be better to do time=5s (or any other duration).

@josharian yes, I like your suggestion but 'seconds' preexisted for cpu profile and execution trace endpoints.

Was this page helpful?
0 / 5 - 0 ratings