Memcached operations such as get can specify multiple keys. If we want to add an attribute keys with a list of the keys included in the operation, we can do so with an arbitrary separator (e.g. span.set_attribute('keys', keys.join('|'))), however this requires coordination between instrumentation and analysis backends. Can we either define semantic conventions for representation of multi-valued attributes, or allow for multiple attributes on a span with the same key?
cc @tigrannajaryan
Other use case for multi-valued attributes: HTTP headers (would require a multimap), list of HTTP proxies used.
See also #76, a call for structured attribute values.
This has been discussed a while ago and we didn't have a specific use-case to demonstrate the need. I think this shows the need clearly.
The relevant part of the spec [1] that needs to be changed is:
Setting an attribute with the same key as an existing attribute SHOULD overwrite the existing attribute's value.
Note that this also requires a support in exporting protocols. Proposed OTLP [2] can support this, so can Jaeger Protobuf [3]. OpenCensus proto cannot support this directly because attributes are a map there [4], so we will need to come up with a reasonable way to translate to formats that require attribute keys to be unique.
[1] - https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#set-attributes
[2] - https://github.com/open-telemetry/oteps/pull/59
[3] - https://github.com/jaegertracing/jaeger/blob/4c7be1a10438590738779b54a151a0055b387132/model/proto/model.proto#L126
[4] - https://github.com/census-instrumentation/opencensus-proto/blob/5cec5ea58c3efa81fa808f2bd38ce182da9ee731/src/opencensus/proto/trace/v1/trace.proto#L158
Why not support list-valued attributes? I would prefer not to change the specification that states:
Setting an attribute with the same key as an existing attribute SHOULD overwrite the existing attribute's value.
Why not support list-valued attributes?
List of what thought? Is it a list of primitives only? Can it be a list of lists (and recursively so)? That makes it equivalent to https://github.com/open-telemetry/opentelemetry-specification/issues/76
This is also difficult to support in exporters. How do we translate a list-valued attribute to Jaeger tags for example? Do we flatten the list? If most of exporters are going to flatten the list then why don't we start with a flat list in the first place?
I suggest this because it would allow future exporters to enrich their data structure and support any JSON-type value. In a structured logging system, it will be very natural to have structured key-values, recursively so, and I'd like to maintain consistency between span event key-values and span attributes.
Also to answer your question, I would format these values as JSON and set them as strings.
I suggest this because it would allow future exporters to enrich their data structure and support any JSON-type value. In a structured logging system, it will be very natural to have structured key-values, recursively so, and I'd like to maintain consistency between span event key-values and span attributes.
I am not against structured values but I think that's a separate topic. Even with structured values I would still suggest to support multi-values (essentially creating a tree of multimaps).
Let me keep this debate going though since I am not convinced yet one way or another. I want to see more arguments in favor of the approaches.
One thing I don't like about list-valued attributes proposal is that it results in multi- and single-value attributes to have a different data structure, e.g. given the usecase in this issue I can have one of the following 2 structures depending on whether I am getting a single or multiple keys from Memchached:
{
"keys": "my-first-key"
}
{
"keys": ["my-first-key", "my-second-key"]
}
If I now want to search for all Spans representing Memcached get of "my-first-key" I need to have a complicated (possibly recursive) tree search.
If it was a multi-valued list my search approach would most likely be the same regardless of how many values of the same key I have. Assuming I am storing the data in a RDBMS the flat version is more natural and likely faster unless my RDBMS supports JSON indexing (and even if it does my search is likely still more complicated than the flat version).
For the record, the LightStep tracer supports JSON-valued inputs. The OpenTracing APIs support this to some extent, so for example the Golang Span.SetTag API takes an interface, and if you pass a non-scalar value the tracer will use json.Marshal to encode a string value.
To me, multimap semantics raise all sorts of new complicated questions about the API. Should we support a separate API to overwrite? Should we support deletion by key:value?
I see your concerns about RDBMS representation as out of place at the API level. If you received a JSON-encoded list, you could certainly choose to index it by flattening the list. If you receive a JSON-encoded dictionary, you could recursively expand it and index flat key:values. These are questions about the schema of the data.
That leaves only one real question, to me. It's convenient to say there are multi-map semantics, because it lets a user call "set" multiple times without remembering any state on their own. If we only support unique map semantics, then the user is required to remember all their values and assemble a list on their own. This would be my preference because it simplifies the API considerably. Set has overwrite semantics, and if you want to set multiple times with a list, you can. But it doesn't require any new API surface to replace an already-set value.
I see your concerns about RDBMS representation as out of place at the API level.
This is true.
I think I agree with what you are saying. As soon as we are able to support multiple values unambiguously and there is a clear convention on how it must be done I am OK with using array values for that. If we have a convention the backend is free to flatten or store however it wants.
@fbogsany thoughts on this? Does it work for you?
馃憤 I'm fine with a clear convention.
Here is a proposal to support multi-value attributes. Please comment.
In Tracing API specification, Set Attributes section replace this:
(Required) The attribute value, which must be either a string, a boolean value, or a numeric type.
By this:
- (Required) The attribute value, which is either:
- A primitive type: string, boolean or numeric.
- An array of primitive type values. The array MUST be homogeneous, i.e. it MUST NOT contain values of different types.
Add support for array types to AttributeKeyValue message:
ARRAY=4 to ValueType enum.repeated AttributeKeyValue array_value=7 to contain child elements of the array.ValueType==ARRAY then key field of all child elements must be empty.To store an array of values encode it as a JSON and store as a string value.
For example to store an array of 3 elements 10, 15 and 25 for key "redis" create the following key/value pair:
redis = [10, 15, 25]
Add the following paragraph to Semantic Conventions document:
Resource and Span attributes may be one of the following types:
- A primitive type: string, boolean or numeric.
- An array of primitive type values. The array MUST be homogeneous, i.e. it MUST NOT contain values of different types.
The type of the attribute SHOULD be specified in the semantic convention for that attribute. For protocols that do not natively support array values such values MUST be represented as JSON strings.
I would not require support for non-homogenous arrays and I would even go as far as forbidding semantic conventions to use them. From the top of my head, I can't think of any use cases for them and I think that they will lead to complicated or ugly APIs in some languages.
I would not require support for non-homogenous arrays and I would even go as far as forbidding semantic conventions to use them. From the top of my head, I can't think of any use cases for them and I think that they will lead to complicated or ugly APIs in some languages.
Agreed. My use case only requires homogenous arrays. I'm looking for an array of strings, but I could also see a use for arrays of booleans (which keys in a memcached multi get were hits) and arrays of numbers (size of each value returned).
Updated the proposal to require homogeneous arrays.
Note that the requirement to have homogeneous arrays also places limits in arbitrary nested structures that we may want to support in the future. If such support is added most likely there will be a desire to allow non-homogeneous array elements (so that you can e.g. directly represent JSON data).
I don't think the json representation is useful for Metrics. Hanse one more reason to not have the same representation for attributes in tracing and labels in metrics
The homogeneous requirement is fine, thanks!
simple joining on comma is cheaper to index split and join and more supportable vs json arrays. I know folks are already doing this and it requires no more coordination on backend than making a json array. the only thing you gain with json is a an escaping syntax. I know ship has sailed, but this one needs an alternative opinion mentioned for posterity
also the use case of tagging the input to multi-get is generalizable as request tagging. it is very common to have json strings when serializing requests into tags. A concrete non-request tagging use case for array values are things like experiment IDs. Usually these are bounded to much lower cardinality, yet need to be indexed for lookup. It is possible that someone wants to index all redis keys, also, but yeah I think this is more a request logging thing vs anything else.
hope the context helps for the next similar discussions
Most helpful comment
Why not support list-valued attributes? I would prefer not to change the specification that states: