Jaeger: Cannot use rate limiting lower than 1 trace per second

Created on 8 Oct 2020 · 10 Comments · Source: jaegertracing/jaeger

Describe the bug
The rate limiting sampler strategy's param is an integer. Setting a decimal param value does not work, which forces a minimum rate of 1 trace per second.

To Reproduce
Steps to reproduce the behavior:

  1. Set a sampling strategy of "ratelimiting" with a param of "0.5" for a service
  2. Make a GET call to the sampling endpoint (e.g. using curl). It returns {"strategyType":"RATE_LIMITING","rateLimitingSampling":{"maxTracesPerSecond":0}}

Expected behavior
I would expect to be able to specify the rate as a decimal, for example 0.5 traces per second.

Version (please complete the following information):

  • OS: Linux/Docker
  • Jaeger version: 1.18.0
  • Deployment: Kubernetes Operator

Additional context
The problematic code seems to be here, where Param is cast to an int16.
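
The truncation described here is easy to reproduce in isolation. The Go sketch below is not the Jaeger code in question: `toThriftParam` is a hypothetical helper, and the only assumption is that a floating-point param gets converted to a 16-bit integer somewhere along the way.

```go
package main

import "fmt"

// toThriftParam is a hypothetical helper (not Jaeger's code). It mimics
// squeezing a floating-point sampling param into a 16-bit integer field.
func toThriftParam(param float64) int16 {
	// Converting a float to an integer in Go discards the fractional part,
	// so any param below 1.0 becomes 0.
	return int16(param)
}

func main() {
	for _, p := range []float64{0.5, 0.99, 1.0, 2.7} {
		fmt.Printf("param=%.2f -> maxTracesPerSecond=%d\n", p, toThriftParam(p))
	}
}
```

A param of 0.5 comes out as 0, which matches the `maxTracesPerSecond` of 0 returned by the sampling endpoint in the steps above.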

I have filed this as a bug report because the several clients I checked seem to support decimal values.

enhancement

All 10 comments

I see, I didn't know that Thrift repo existed. Could we convert this to a feature request or should I open a new one?

Sorry if this is obvious, but why do you need to sample less than 1 trace per second? Do you have a huge number of endpoints?

We used rates well below 1.0 quite a lot, because some API services have thousands of instances, especially with event-loop-based languages where it's common to run one process per core. However, we used them for defaultLowerBoundTracesPerSecond, which is already a double.

https://github.com/jaegertracing/jaeger-idl/blob/52fb4c944067f7661e6a5fa23ba4c44c6f9c2923/thrift/sampling.thrift#L45
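
To illustrate what a fractional rate means in practice, here is a minimal token-bucket sketch in Go. It is not the Jaeger client's limiter: `tokenBucket`, `newTokenBucket`, `allow`, and `countSampled` are hypothetical names, and the traffic pattern of one request per second is an assumption chosen to keep the example deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is an illustrative limiter that accepts a fractional rate.
type tokenBucket struct {
	ratePerSecond float64
	maxBalance    float64
	balance       float64
	last          time.Time
}

func newTokenBucket(ratePerSecond float64, now time.Time) *tokenBucket {
	maxBalance := ratePerSecond
	if maxBalance < 1 {
		maxBalance = 1 // the bucket must be able to hold one whole trace credit
	}
	return &tokenBucket{
		ratePerSecond: ratePerSecond,
		maxBalance:    maxBalance,
		balance:       maxBalance,
		last:          now,
	}
}

// allow accrues credit for the time elapsed since the previous call and
// spends one credit if a whole one is available.
func (b *tokenBucket) allow(now time.Time) bool {
	elapsed := now.Sub(b.last).Seconds()
	b.last = now
	b.balance += elapsed * b.ratePerSecond
	if b.balance > b.maxBalance {
		b.balance = b.maxBalance
	}
	if b.balance >= 1 {
		b.balance--
		return true
	}
	return false
}

// countSampled simulates one request per second and counts sampled ones.
func countSampled(ratePerSecond float64, seconds int) int {
	start := time.Unix(0, 0)
	b := newTokenBucket(ratePerSecond, start)
	sampled := 0
	for i := 0; i < seconds; i++ {
		if b.allow(start.Add(time.Duration(i) * time.Second)) {
			sampled++
		}
	}
	return sampled
}

func main() {
	fmt.Println("sampled", countSampled(0.5, 10), "of 10 requests at 0.5 traces/s")
}
```

With a rate of 0.5, the bucket needs two seconds to earn a whole credit, so roughly every other request in this simulation is sampled.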

We use Jaeger in almost all of our deployments, which comes to something like 500-800 containers running Jaeger, with the higher numbers typically on weekends when we see more traffic. I think (up to) 800 traces per second is too high. Even ignoring the scaling and storage costs, we just don't need that many, as we don't have the headcount to look at all that data.
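
The arithmetic behind that figure can be sketched in a few lines of Go. This assumes each container applies its own rate limit independently and receives enough traffic to saturate it; `fleetTracesPerSecond` is a hypothetical helper name.

```go
package main

import "fmt"

// fleetTracesPerSecond returns the worst-case total sampling rate when every
// instance independently samples up to perInstanceRate traces per second.
func fleetTracesPerSecond(instances int, perInstanceRate float64) float64 {
	return float64(instances) * perInstanceRate
}

func main() {
	for _, instances := range []int{500, 800} {
		for _, rate := range []float64{1.0, 0.5} {
			fmt.Printf("%d instances at %.1f traces/s each: up to %.0f traces/s\n",
				instances, rate, fleetTracesPerSecond(instances, rate))
		}
	}
}
```

With the integer minimum of 1 trace per second, 800 containers can produce up to 800 traces per second, whereas a per-instance rate of 0.5 would halve that.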

Why don't you use probabilistic sampling?

Sorry for the confusion, we are using probabilistic. These are REST APIs (ASP.NET), so a value of 0.6, for example, results in ~60% of incoming HTTP requests being sampled.

The per-operation sampler you linked sounds perfect! Being able to set upper/lower bounds would be great: 60% during peak times is a lot, while there's room to spare during lower-traffic periods.

An upper bound is not supported in the per-operation sampler because it breaks the ability to extrapolate full traffic measurements from the samples. But since Jaeger doesn't make use of those at the moment anyway, and if you don't care, then implementing the rate limiting upper bound sounds like a useful feature. For people who care about extrapolations, the upper bound can be an optional feature.
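
A rough Go sketch of why a cap interferes with extrapolation, using made-up numbers: with purely probabilistic sampling at probability p, total traffic can be estimated as sampled / p, but once a rate cap discards some of the sampled traces, that estimate undercounts. The helpers `sampledWithCap` and `extrapolate` are hypothetical and work on expected values rather than random draws.

```go
package main

import "fmt"

// sampledWithCap returns the expected number of traces kept per second when
// requests are sampled with the given probability and then limited to
// capPerSecond. A capPerSecond of 0 or less means "no cap".
func sampledWithCap(requestsPerSecond, probability, capPerSecond float64) float64 {
	sampled := requestsPerSecond * probability
	if capPerSecond > 0 && sampled > capPerSecond {
		return capPerSecond
	}
	return sampled
}

// extrapolate estimates total traffic from the sampled count, assuming every
// request had the same probability of being sampled.
func extrapolate(sampledPerSecond, probability float64) float64 {
	return sampledPerSecond / probability
}

func main() {
	const requests, probability = 1000.0, 0.6

	uncapped := sampledWithCap(requests, probability, 0)
	capped := sampledWithCap(requests, probability, 100)

	fmt.Printf("no cap:  sampled %.0f/s, estimated traffic %.0f/s\n",
		uncapped, extrapolate(uncapped, probability))
	fmt.Printf("cap 100: sampled %.0f/s, estimated traffic %.0f/s\n",
		capped, extrapolate(capped, probability))
}
```

In this example the uncapped estimate recovers the true 1000 requests per second, while the capped one comes out far lower, because the cap changes the effective sampling probability without the estimator knowing about it.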

Got it, that's pretty much the only use-case I could think of.

In any case, is networking an issue for you? You can always down-sample (--downsampling.ratio) at the collector if your concern is only storage.

I'd rather avoid dropping spans on the collector if possible. We log the Trace ID on each service if the request is sampled. Our devs know that if they see an error or log message in Loki with a Trace ID, they can go to Jaeger and look at the exact request. If we downsample, we lose the ability to do that accurately, as the client doesn't know the collector dropped the trace.
