Opentelemetry-specification: Limitation on number of attributes, links, events and behavior

Created on 13 Jul 2019 · 20Comments · Source: open-telemetry/opentelemetry-specification

In Java SDK implementation, there is a limit and behavior is to discard in FIFO.

The limitations are not part of the spec and behavior when there are more than max is not defined.

sdk needs discussion after-ga trace

Source

lmolkova

Most helpful comment

I'd have a slight preference for the third option:

If the limits are user-adjustable via API, aren't we mixing different personas? An application developer (who would set the limit) might not know the exporter and ultimatively the backend that uses the data. The backend might have difficulties with 10 or with 10k attributes. It should be a concern of the exporter/backend only.

discostu105 on 16 Jul 2019

👍4 👎1

All 20 comments

Seems like there’s a couple of options:

Spec defines a hard max limit, unconfigurable.
Spec defines a default max, user-adjustable.
Spec defines no limit, exporters can select a limit (config of this limit is an exporter-specific concern)

The third option is the most similar to OpenTracing imo. I think the second option might be the best because there’s always going to be some sort of maximum, might as well put it in a well-known place.

austinlparker on 13 Jul 2019

I'd have a slight preference for the third option:

discostu105 on 16 Jul 2019

👍4 👎1

That's a good point.

austinlparker on 16 Jul 2019

3 also seems like the most reasonable choice to me, though I'm not extremely opinionated about this topic

mtwo on 17 Jul 2019

3 is not a good option because the intent for this is to not grow memory unbounded, this is not to protect the exporter or the backend, is to help Application owners to control how much memory their instrumentation uses.

I think the best compromise is 2, the limit is adjustable via the SDK not via the API (so an instrumentation dev, as you pointed does not know the backend), but the Application owner knows the backend in almost all the situations.

bogdandrutu on 30 Sep 2019

2 and 3 are almost identical in a sense that if exporter can configure this SDK setting, than it is the same. Multiple exporters will just need to set the max of options and then do additional trimming on their side. There is no way to prevent exporters to configure SDK as long as this setting is available for end user.

SergeyKanzhelev on 30 Sep 2019

Removed this from the 0.3 milestone as it's not a blocker. We can add this in the 0.4 milestone.

jmacd on 22 Jan 2020

Do we also want to impose/allow to configure some limits on the _size_ of attributes values?

iNikem on 27 Jul 2020

For attribute and label keys, there is discussion about a maximum length (among other things) on #504.

Oberon00 on 27 Jul 2020

👍1

2 and 3 are almost identical in a sense that if exporter can configure this SDK setting, than it is the same. Multiple exporters will just need to set the max of options and then do additional trimming on their side. There is no way to prevent exporters to configure SDK as long as this setting is available for end user.

I would argue that 2 provides a significant advantage by also enabling limiting of events enforceable by the SDK itself, which ensures that one cannot accidentally flood memory with hundreds of thousands of attributes before the trace reaches the exporter.

For that reason alone I would argue for #2. Even if exporters end up configuring it, it also covers the don't-cause-memory-issues use case.

Other idea: ulimit-style limits.

As another idea too, you could make the maximum count work similar to ulimits in Linux: the value starts at infinite, you can set it to be lower, but can't set it higher. This ensures that the most attributes that will ever be passed to an exporter is a count that all exporters support.

But reading it through, it sounds like it's just overcomplicating things significantly. Most likely someone can configure their SDK appropriately with complete knowledge of their backends.

toumorokoshi on 11 Aug 2020

I suggest to remove release:required-for-ga label since this is an additive, non-breaking change.

tigrannajaryan on 8 Sep 2020

👍2

+1 on making this release:after-ga .

carlosalberto on 9 Sep 2020

From the issue triage mtg today, i'm changing the label to release:after-ga since it looks like from the comments this can be punted.

andrewhsu on 9 Sep 2020

👍1

@tigrannajaryan
It depends on the outcome of this discussion if this is an additive, non-breaking change.
If some users are used to recording 10k events on a span and after updating the SDK suddenly 9k of them are dropped because a "sanity limit" of 1k is introduced to the SDK (to limit memory usage in case of usage errors as @bogdandrutu pointed out, for example), this would be a breaking change in my opinion.
If the spec would clearly state, that at least n number of attributes, links and events (each) are supported to be recorded and exported, this would already give users some kind of guarantee and consumers know what they will have to expect at least.
Having a user-configurable limit to go beyond (or above) the sanity limit imposed by the SDK would be additive and non-breaking.

arminru on 10 Sep 2020

Would setting some relatively arbitrary limit be acceptable, in that case? I do agree that having a sane limit will save some real production headaches, and loss of telemetry (e.g. span attributes) is more desirable than an app crashing due to OOM.

Based on what I've seen 128 would be a good threshold (i've never seen someone go beyond that), but it's very anecdotal.

toumorokoshi on 10 Sep 2020

@toumorokoshi Exactly. I have no idea on what number can be deemed safe for "all" use cases since I fear we know our users not well enough, however. I've seen requests on our Gitter that people intend to use spans for representing long-running actions that span over multiple hours (days?) and I could expect a higher number of events on these. We should probably aim high to not prevent "valid" (who's to decide what's valid anyway) edge cases and only have it as a fallback for error conditions to prevent production crashes. If the limit we decide on turns out to be too low for certain use cases we haven't thought of, we can always go higher or allow overriding by users in a truly non-breaking manner.

@tigrannajaryan @carlosalberto @SergeyKanzhelev since you voted for moving this out of GA - what do you think?

arminru on 10 Sep 2020

since you voted for moving this out of GA - what do you think?

@arminru I am certainly not against having this capability and believe it is useful but we need to make some sacrifices to cut the scope and deliver the GA.

If some users are used to recording 10k events on a span and after updating the SDK suddenly 9k of them are dropped because a "sanity limit" of 1k is introduced to the SDK (to limit memory usage in case of usage errors as @bogdandrutu pointed out, for example), this would be a breaking change in my opinion.

Some behavioral changes are likely inevitable when we release new versions. Formally, one can call this a breaking change but it is not the sort of the breaking change I think we should worry the most. If we can avoid hard breaking changes that make it impossible to recompile your app with a new version of OpenTelemetry because function signatures are different I think that will likely be good enough.

If we want to absolutely guarantee no changes in the behavior we can set the default to infinity when we introduce the Span data limiting and trimming capability.

[UPDATE] Just to make it clear: I don't advocate for weak compatibility guarantees, but I think we need to strike the right balance between being careful and being able to postpone functionality to after GA. It is very difficult to be completely certain that we will not need to change the behavior after the GA. If we set the absolute certainty of non-breaking behavioral changes after GA as the criteria for the GA then we will slow down a lot and I don't think we can meet our GA timelines.

tigrannajaryan on 10 Sep 2020

I would say if we can't come to a fast consensus, then postponing makes sense.

Although I think this could be summed up into two changes to the spec:

set a common limit across events, links, attributes
add the need for logging if any of the above are added and discarded.

And to @arminru's point,we just choose a number that's well beyond 99.99% of use cases. Don't want to throw numbers out without data, but 1000 seems well above the needed amount.

toumorokoshi on 11 Sep 2020

👍1

If that would be acceptable, I'm happy to file a PR.

toumorokoshi on 11 Sep 2020

👍2

If that would be acceptable, I'm happy to file a PR.

Sounds good, please do it.

tigrannajaryan on 11 Sep 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Properly communicate the stability for OTLP

carlosalberto · 4Comments

Define the list of standard exporters that SDK should support

bogdandrutu · 6Comments

Interface stability for implementers and new methods in the spec

Oberon00 · 4Comments

Sampling based on URL

pavolloffay · 4Comments

Where to put instrumentation-specific semantic conventions.

anuraaga · 5Comments