tracing has some performance implications and also mixes profiling spans and with logging. Due to a discussion on the matrix, we've decided to move all spans to the profiling crate (with no features enabled) and all logging to the log crate.
tracing is indeed a poor profiling library, but an excellent logging library. It's a shame to abandon it for what it's good at because it doesn't solve a problem it's not really intended for. When used sparingly to provide context to log messages (rather than heavily as an approximation of profiling instrumentation), spans can make diagnostics far easier to follow, and the benefits of structured logging are well known.
Consider retaining tracing for actual logging purposes, just replacing heavy use of spans with profiling (and retaining/adding spans for logging context as needed).
I'm not sold on using tracing for logging only. log is simpler, and works fine for us, so I'd rather stick to it.
Also related to #1189
I'm curious about what performance issues you're having that are related to tracing. In particular, our benchmarks show that tracing performance is generally very similar to the log crate's, and in some cases significantly better. Regardless of whether or not tracing is the right fit for wgpu's use case, I'd really appreciate details on any tracing-related performance problems. We try to take the overhead of adding diagnostics very seriously, and if there's any issues I've missed, I'd really like to be aware of them.
Thanks!
@hawkw if it's useful, there are some details in https://github.com/gfx-rs/wgpu/issues/1189#issuecomment-772802402 Some downstream projects reported seeing significant overhead when using many (~thousands) of spans even when the spans aren't being logged, so in that issue we were considering making it an optional feature
I did some investigations of the approximate timing for suppressed events and spans in log and tracing https://github.com/Imberflur/instrument-bench
I was thinking about this and just realized replacing tracing spans with profiling scopes could be detrimental for profiling if done everywhere there is a span right now because some are very fine grained and profiling provides no method for selective deactivation.
@Imberflur this is very nice!
because some are very fine grained and profiling provides no method for selective deactivation.
I don't think we need profile everywhere. One of the benefits of separating log from profiling is that they can happen at different frequency. We can have very detailed logs under trace severity that are compiled out, but the profiling only needs to care about places that can actually take time.
Another thing I will note is that path based selective inactivation of logging events appears to be more expensive for log than tracing, in all cases. I don't know whether this can be improved by optimizing log and the frontend crates or is a fundamental limitation of log. The main lesson here though is probably not to use the Trace level globally when profiling since I assume in the crates that use log, only events at this level will be abundant enough to cause issues.
Another thing I will note is that path based selective inactivation of logging events appears to be more expensive for
logthantracing, in all cases. I don't know whether this can be improved by optimizinglogand the frontend crates or is a fundamental limitation oflog.
AFAIK (as the maintainer of tracing), this is a more or less fundamental limitation for log that can't easily be resolved without redesigning some fundamental aspects of how the crate works.
There's a bit of a tradeoff here: tracing generates a small static at every macro expansion that can be used to cache whether that individual event is enabled, so that very granular filtering only requires a load and branch to skip disabled events. However, this requires some part of the event to always live for the 'static lifetime. log, on the other hand, has Record and Metadata structs that are generic over a lifetime. This makes it much easier to provide builder-style APIs for dynamically constructing records outside of a macro, and makes it easier to consume log records emitted by foreign (non-Rust) code. However, it means that log can only implement a fast path for skipping disabled records when they're below a _global_ max level; there isn't an easy way to cache more granular filtering in this design.
IMO both designs are valid; when designing tracing, we chose to bias for the best possible performance for pure-Rust projects and to optimize for more granular filtering (since I've personally seen it used quite frequently). This was a choice we were able to make partially because log exists and makes the opposite choice, so users who really care about efficiently consuming logs from non-Rust code have other options. :)
Some downstream projects reported seeing significant overhead when using many (~thousands) of spans even when the spans aren't being logged, so in that issue we were considering making it an optional feature
Hmm, I think I know what's up here. When a tracing event is disabled, nothing happens --- the macros essentially expand to a big if statement, with all the code for actually observing the event inside the if. This means that if the event is disabled, the compiled code basically just does a single jump to skip over actually recording anything. On the other hand, when a span is disabled, it constructs a disabled instance of the Span struct whose fields are empty. This is because the Span struct might have methods (like Span::enter) called on it, or be passed to other functions, and so on. When it's disabled, these calls don't do anything, but we still have to have the struct to pass around. Actually constructing the Span generates a bit more assembly, even though it's empty, so disabled spans have a bit more overhead than disabled events (which are very cheap). The overhead of constructing the Span struct isn't that costly, but when you have a very large number of extremely granular spans, it probably starts to add up!
Some of that overhead is probably inherent to our design --- if we have to return the Span struct, we'll always have to construct it on the stack, and we can't really avoid that. There are some optimizations that reduce this overhead a little. For example, a recent tracing release added #[inline] attributes on some of the Span struct's methods, and changed them to call a second private method that actually records data in the case that the span is enabled. That way, when the span is disabled, Span::enter and friends shouldn't actually result in pushing a stack frame. However, there's only so much we can do there --- at the end of the day, we're constructing a struct and we can't really change that.
I think the profiling crate can probably avoid some of this overhead, because its scope! macro doesn't return anything; skipping them can be as cheap as skipping a tracing event. But, that would limit some of the features we want to provide, like being able to attach Spans to futures, or being able to enter and exit the same span multiple times. So, if your use case is _just_ to create a very large number of extremely granular profiling scopes, and to have them be totally free when not profiling, the profiling crate may very well be a better choice!
With all that in mind, here's what I'd personally recommend (with the caveat that I'm a complete outsider who's not intimately familiar with wgpu's requirements): I think wgpu should replace uses of tracing's span macros in hot paths with the profiling crate's scope! macro, but _continue using tracing's event macros for logging, rather than switching to the log crate_.
The reasoning behind this is that, while there are adapters in both directions that allow log users to consume tracing events and tracing users to consume log records, emitting events using tracing natively has several advantages for downstream projects using tracing. When a tracing user wants to consume log records as tracing events, they can do this using an adapter layer implementing a log logger that emits events. These events cannot participate in the per-event filter caching system I described above, so there's a significant performance impact when using more granular per-module filtering. Furthermore, there's some inherent overhead to going through an adapter layer instead of emitting events directly, and the log records are recorded as an unstructured textual message rather than as structured data.
On the other hand, when using tracing's event macros natively, there's a feature flag that causes the macro expansions to also include log's macros. This can even be switched on automatically based on whether or not a tracing subscriber is in use. The difference here is that the log records are emitted natively using log's macros, so log users don't have the additional overhead of going through an adapter layer that converts tracing events into log records, while tracing users can still benefit from tracing's additional filtering optimizations.
Of course, it's up to you, that's just my suggestion. Let me know if you have any further questions!
@hawkw Thanks for the in depth explanation!
I have one question.
When a tracing user wants to consume log records as tracing events, they can do this using an adapter layer implementing a log logger that emits events.
When testing things I was able to observe log record output from tracing-subscriber::fmt without using this adapter. I was actually a bit confused since the docs seemed to imply this adapter was necessary as well. Is there something I'm missing here?
When testing things I was able to observe
logrecord output fromtracing-subscriber::fmtwithout using this adapter. I was actually a bit confused since the docs seemed to imply this adapter was necessary as well. Is there something I'm missing here?
tracing-subscriber has an optional dependency on tracing-log which is enabled by default; when it's on, the subscriber initialization functions will try to set it up automatically so that everything works out of the box. The docs should mention this but it might be kind of hard to spot?
Thank you @hawkw for the analysis, and taking time to write this down! 鉂わ笍
I'm not worried about selective filtering performance. I just want release users to run fast and only be bothered with important things. So a global filter on Error seems just fine. If anybody is doing selective filtering, presumably they are knees deep in the project (i.e. developers, likely in Debug mode, etc), and they have already opted into the extra cost.
Same logic can be applied to log vs tracing choice for logging specifically. I'm not worried about the extra cost of the adapter, since presumably we are talking about scenarios of development, where some costs are expected, unlike running on the end-user machine.
The obvious upside of using log is that it's standard and lightweight.
the macros essentially expand to a big if statement
Aside from the Span construction overhead, wouldn't this if potentially hurt instruction cache efficiency, too? Not to mention the ability for the compiler to inline things - seeing this big if would make it do less inlining, affecting performance as well.
TL;DR: it seems to me that we should still proceed with the plan in the subject of this issue. Although, I'd love to work together with @hawkw in the future... Right now it seems that the design decisions of tracing aren't benefiting us this much.
I'm not worried about selective filtering performance. I just want release users to run fast and only be bothered with important things. So a global filter on
Errorseems just fine. If anybody is doing selective filtering, presumably they are knees deep in the project (i.e. developers, likely in Debug mode, etc), and they have already opted into the extra cost.
Just for clarity, the selective filtering overhead doesn't just apply to when users choose to enable logs _from wgpu by module path_, it applies any time logs at a certain level are enabled in some places. The fastest path for both tracing and log is "no logs at a given level are enabled anywhere". If a user application, for example, enables the Info level for the application crate only, without enabling it for wgpu, the log crate's filtering approach takes a big performance hit when skipping any Info-level logs in wgpu --- in this case, each time an info! macro is hit, there's at least one function call, which will do string prefix matching at least once (possibly several times if there are multiple different modules for which logs are enabled). With the tracing approach, the string matching only occurs a single time on application startup; any subsequent time an info! macro is hit, there's only a relaxed load and and a branch.
Whether or not you care about performance in this use case is, of course, up to you. :) I just wanted to make sure it was clear that this overhead occurs _any_ time more granular filtering is in use, rather than only introducing overhead if it is being used to enable logs from wgpu.
the macros essentially expand to a big if statement
Aside from the
Spanconstruction overhead, wouldn't thisifpotentially hurt instruction cache efficiency, too? Not to mention the ability for the compiler to inline things - seeing this bigifwould make it do less inlining, affecting performance as well.
This is a concern with both tracing and log's logging macros; they both essentially expand to an if-statement. Actually recording the event or log record occurs behind a function call that should never be inlined in both cases. The only way to avoid generating _any_ code for disabled instrumentation is to use compile-time filtering (which both tracing and log support, more or less identically).
When they aren't totally compiled out, though, there definitely is a cost to generating additional instructions even when jumping over them is cheap; the impact on the optimization is definitely worth being aware of in extremely hot code.
Thank you for correction @hawkw !
If a user application, for example, enables the Info level for the application crate only, without enabling it for wgpu
They would first have to enable the Info level globally, which implies consequences. So the extra overhead is fine for these cases, I think. We care most about performance-sensitive workloads for the end users, where nobody will tweak the compile time filter.
This is a concern with both tracing and log's logging macros; they both essentially expand to an if-statement
Right, but it's only a concern in the context of "logging" as opposed to profiling. And this is where tracing gets into an unfortunate position, since it combines both. I expect profiling markers to be more lightweight than logging, and thus we'd put log in less places than we currently use tracing for. The hot paths will only have profiling! but no log!.
And this is where tracing gets into an unfortunate position, since it combines both.
That's the thing--tracing spans aren't really intended to be used for profiling in the first place, though they can do in a pinch.
And this is where tracing gets into an unfortunate position, since it combines both.
That's the thing--tracing spans aren't really intended to be used for profiling in the first place, though they can do in a pinch.
I mean, I've definitely _hoped_ that people would be able to build profiling tools on top of tracing, and I've talked about that in the past. But, yeah, it turns out that it's still hard to have one abstraction that works for every use case.
They would first have to enable the Info level globally, which implies consequences.
None of those compile time filters are default features AFAICT, so users do need to explicitly opt-out. However, when setting up the actual logging consumer/frontend they do typically enable a specific global filter there which would still be on the relatively very fast path (as long as they don't opt-in to lower levels in specific areas).
FWIW our filter level that we currently distribute at in veloren is Info. However, I'm not that worried about this either way since as mentioned:
The hot paths will only have profiling! but no log!.
So I suspect that the abundance of log! will be such that it isn't a performance concern either way.
edit: Additionally, we use tracing for most of our internal logging so we could potentially disable log at specific levels at compile time without much impact. Although, if we have any libraries using log with useful information for debugging errors on user machines this means we would have to distribute specialized builds to access this information.
And this is where tracing gets into an unfortunate position, since it combines both.
That's the thing--tracing spans aren't really intended to be used for profiling in the first place, though they can do in a pinch.
I mean, I've definitely _hoped_ that people would be able to build profiling tools on top of
tracing, and I've talked about that in the past. But, yeah, it turns out that it's still hard to have one abstraction that works for every use case.
In my experience, the interior time of an empty tracy-client span is around 50 ns and the exterior time is around 200 ns. Using tracing-tracy, these times change to around 400 ns and 900 ns. So I don't think it is unreasonable to use tracing for profiling at sufficient granularity. But for stuff under ~1 渭s it can be really helpful to have only 50 ns of extra time added on to the measurements. For cases where we want usable timing from nested spans, we need to be sure the added overhead is a minimal portion. Picking 5% as an arbitrary cutoff that would make tracy-client spans usable for nesting spans that we expect to be over ~4 渭s long and tracing-tracy usable with nesting spans over ~20 渭s long.
So if I would draw a final conclusion from this we want to stay with tracing for logging, but transfer most of our profiling-related events to profiling? Is this a fair conclusion?
Most helpful comment
AFAIK (as the maintainer of
tracing), this is a more or less fundamental limitation forlogthat can't easily be resolved without redesigning some fundamental aspects of how the crate works.There's a bit of a tradeoff here:
tracinggenerates a small static at every macro expansion that can be used to cache whether that individual event is enabled, so that very granular filtering only requires a load and branch to skip disabled events. However, this requires some part of the event to always live for the'staticlifetime.log, on the other hand, hasRecordandMetadatastructs that are generic over a lifetime. This makes it much easier to provide builder-style APIs for dynamically constructing records outside of a macro, and makes it easier to consume log records emitted by foreign (non-Rust) code. However, it means thatlogcan only implement a fast path for skipping disabled records when they're below a _global_ max level; there isn't an easy way to cache more granular filtering in this design.IMO both designs are valid; when designing
tracing, we chose to bias for the best possible performance for pure-Rust projects and to optimize for more granular filtering (since I've personally seen it used quite frequently). This was a choice we were able to make partially becauselogexists and makes the opposite choice, so users who really care about efficiently consuming logs from non-Rust code have other options. :)Hmm, I think I know what's up here. When a
tracingevent is disabled, nothing happens --- the macros essentially expand to a bigifstatement, with all the code for actually observing the event inside theif. This means that if the event is disabled, the compiled code basically just does a single jump to skip over actually recording anything. On the other hand, when a span is disabled, it constructs a disabled instance of theSpanstruct whose fields are empty. This is because theSpanstruct might have methods (likeSpan::enter) called on it, or be passed to other functions, and so on. When it's disabled, these calls don't do anything, but we still have to have the struct to pass around. Actually constructing theSpangenerates a bit more assembly, even though it's empty, so disabled spans have a bit more overhead than disabled events (which are very cheap). The overhead of constructing theSpanstruct isn't that costly, but when you have a very large number of extremely granular spans, it probably starts to add up!Some of that overhead is probably inherent to our design --- if we have to return the
Spanstruct, we'll always have to construct it on the stack, and we can't really avoid that. There are some optimizations that reduce this overhead a little. For example, a recenttracingrelease added#[inline]attributes on some of theSpanstruct's methods, and changed them to call a second private method that actually records data in the case that the span is enabled. That way, when the span is disabled,Span::enterand friends shouldn't actually result in pushing a stack frame. However, there's only so much we can do there --- at the end of the day, we're constructing a struct and we can't really change that.I think the
profilingcrate can probably avoid some of this overhead, because itsscope!macro doesn't return anything; skipping them can be as cheap as skipping atracingevent. But, that would limit some of the features we want to provide, like being able to attachSpans to futures, or being able to enter and exit the same span multiple times. So, if your use case is _just_ to create a very large number of extremely granular profiling scopes, and to have them be totally free when not profiling, theprofilingcrate may very well be a better choice!With all that in mind, here's what I'd personally recommend (with the caveat that I'm a complete outsider who's not intimately familiar with
wgpu's requirements): I thinkwgpushould replace uses oftracing's span macros in hot paths with theprofilingcrate'sscope!macro, but _continue usingtracing's event macros for logging, rather than switching to thelogcrate_.The reasoning behind this is that, while there are adapters in both directions that allow
logusers to consumetracingevents andtracingusers to consumelogrecords, emitting events usingtracingnatively has several advantages for downstream projects usingtracing. When atracinguser wants to consumelogrecords astracingevents, they can do this using an adapter layer implementing aloglogger that emits events. These events cannot participate in the per-event filter caching system I described above, so there's a significant performance impact when using more granular per-module filtering. Furthermore, there's some inherent overhead to going through an adapter layer instead of emitting events directly, and thelogrecords are recorded as an unstructured textual message rather than as structured data.On the other hand, when using
tracing's event macros natively, there's a feature flag that causes the macro expansions to also includelog's macros. This can even be switched on automatically based on whether or not atracingsubscriber is in use. The difference here is that thelogrecords are emitted natively usinglog's macros, sologusers don't have the additional overhead of going through an adapter layer that convertstracingevents intologrecords, whiletracingusers can still benefit fromtracing's additional filtering optimizations.Of course, it's up to you, that's just my suggestion. Let me know if you have any further questions!