What are you trying to achieve?
The Hypertrace agents collect all request and response data (headers and payloads). This data is not needed for every request, hence I would like to implement a sampler that would make a decision based on the URL/endpoint pattern when this data should be collected (e.g. one in 100 requests). The decision should be propagated to downstream services (e.g. via tracestate) to make the decision coherent across the transaction.
The custom sampler would make a decision whether to collect payload and headers, the span sampling would still go through the normal sampling process (e.g. my custom sampler would delegate to it).
To fully implement this use-case these parts are needed:
tracestate (supported by the API) or drop the header and payload attributes in SpanProcessor#onEnd (not supported by the API).What is missing in the spec to solve my use-case
The sampler does not have access to the URL. The sampler interface (at least in Java) accepts attributes, however the instrumentation (e.g. servlet in otel-java-instrumentation) is not passing URL attribute there nor spec mandates it to do it.
I think this was fixed recently in Java: https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/388.
The spec currently says in https://github.com/open-telemetry/opentelemetry-specification/blob/v1.1.0/specification/trace/api.md#span-creation:
Whenever possible, users SHOULD set any already known attributes at span creation instead of calling SetAttribute later.
which would include HTTP headers too (makes sense, what if you have an "X-My-Company-Loadtest" header?). Of course this means that sampling will not prevent initial attribute collection, for this problem see #620.
Thanks @Oberon00,
could we extend the spec and explicitly mention that the URL should be present at the sampling time (if it is known)?
The spec does leave room for the sampler to access attributes that are provided during span creation (not after the span creation), so it is not an API issue.
One possible approach is to encourage or enforce the URL to be provided as the attribute that is available at span creation, as part of the semantic convention.
The quoted spec already says that:
Whenever possible, users SHOULD set any already known attributes at span creation instead of calling SetAttribute later.
Of course this also applies to the URL. If we need to discriminate attributes that are more or less important to sampling (original semantic conventions generator PR had a sampling_relevant: bool attribute), we need to somehow update this statement, otherwise saying that URL is required before StartSpan is redundant.
Most helpful comment
I think this was fixed recently in Java: https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/388.
The spec currently says in https://github.com/open-telemetry/opentelemetry-specification/blob/v1.1.0/specification/trace/api.md#span-creation:
which would include HTTP headers too (makes sense, what if you have an "X-My-Company-Loadtest" header?). Of course this means that sampling will not prevent initial attribute collection, for this problem see #620.