Opentelemetry-specification: Representation of multi-valued attributes

Created on 2 Nov 2019 · 20Comments · Source: open-telemetry/opentelemetry-specification

Memcached operations such as get can specify multiple keys. If we want to add an attribute keys with a list of the keys included in the operation, we can do so with an arbitrary separator (e.g. span.set_attribute('keys', keys.join('|'))), however this requires coordination between instrumentation and analysis backends. Can we either define semantic conventions for representation of multi-valued attributes, or allow for multiple attributes on a span with the same key?

cc @tigrannajaryan

Source

fbogsany

👍1

Most helpful comment

Why not support list-valued attributes? I would prefer not to change the specification that states:

Setting an attribute with the same key as an existing attribute SHOULD overwrite the existing attribute's value.

jmacd on 4 Nov 2019

👍2

All 20 comments

Other use case for multi-valued attributes: HTTP headers (would require a multimap), list of HTTP proxies used.

Oberon00 on 3 Nov 2019

See also #76, a call for structured attribute values.

jmacd on 4 Nov 2019

This has been discussed a while ago and we didn't have a specific use-case to demonstrate the need. I think this shows the need clearly.

The relevant part of the spec [1] that needs to be changed is:

Setting an attribute with the same key as an existing attribute SHOULD overwrite the existing attribute's value.

Note that this also requires a support in exporting protocols. Proposed OTLP [2] can support this, so can Jaeger Protobuf [3]. OpenCensus proto cannot support this directly because attributes are a map there [4], so we will need to come up with a reasonable way to translate to formats that require attribute keys to be unique.

[1] - https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#set-attributes
[2] - https://github.com/open-telemetry/oteps/pull/59
[3] - https://github.com/jaegertracing/jaeger/blob/4c7be1a10438590738779b54a151a0055b387132/model/proto/model.proto#L126
[4] - https://github.com/census-instrumentation/opencensus-proto/blob/5cec5ea58c3efa81fa808f2bd38ce182da9ee731/src/opencensus/proto/trace/v1/trace.proto#L158

tigrannajaryan on 4 Nov 2019

Why not support list-valued attributes? I would prefer not to change the specification that states:

Setting an attribute with the same key as an existing attribute SHOULD overwrite the existing attribute's value.

jmacd on 4 Nov 2019

👍2

Why not support list-valued attributes?

List of what thought? Is it a list of primitives only? Can it be a list of lists (and recursively so)? That makes it equivalent to https://github.com/open-telemetry/opentelemetry-specification/issues/76

This is also difficult to support in exporters. How do we translate a list-valued attribute to Jaeger tags for example? Do we flatten the list? If most of exporters are going to flatten the list then why don't we start with a flat list in the first place?

tigrannajaryan on 5 Nov 2019

I suggest this because it would allow future exporters to enrich their data structure and support any JSON-type value. In a structured logging system, it will be very natural to have structured key-values, recursively so, and I'd like to maintain consistency between span event key-values and span attributes.

jmacd on 5 Nov 2019

Also to answer your question, I would format these values as JSON and set them as strings.

jmacd on 5 Nov 2019

I suggest this because it would allow future exporters to enrich their data structure and support any JSON-type value. In a structured logging system, it will be very natural to have structured key-values, recursively so, and I'd like to maintain consistency between span event key-values and span attributes.

I am not against structured values but I think that's a separate topic. Even with structured values I would still suggest to support multi-values (essentially creating a tree of multimaps).

Let me keep this debate going though since I am not convinced yet one way or another. I want to see more arguments in favor of the approaches.

One thing I don't like about list-valued attributes proposal is that it results in multi- and single-value attributes to have a different data structure, e.g. given the usecase in this issue I can have one of the following 2 structures depending on whether I am getting a single or multiple keys from Memchached:

{
  "keys": "my-first-key"
}

{
  "keys": ["my-first-key", "my-second-key"]
}

If I now want to search for all Spans representing Memcached get of "my-first-key" I need to have a complicated (possibly recursive) tree search.

If it was a multi-valued list my search approach would most likely be the same regardless of how many values of the same key I have. Assuming I am storing the data in a RDBMS the flat version is more natural and likely faster unless my RDBMS supports JSON indexing (and even if it does my search is likely still more complicated than the flat version).

tigrannajaryan on 5 Nov 2019

For the record, the LightStep tracer supports JSON-valued inputs. The OpenTracing APIs support this to some extent, so for example the Golang Span.SetTag API takes an interface, and if you pass a non-scalar value the tracer will use json.Marshal to encode a string value.

To me, multimap semantics raise all sorts of new complicated questions about the API. Should we support a separate API to overwrite? Should we support deletion by key:value?

I see your concerns about RDBMS representation as out of place at the API level. If you received a JSON-encoded list, you could certainly choose to index it by flattening the list. If you receive a JSON-encoded dictionary, you could recursively expand it and index flat key:values. These are questions about the schema of the data.

That leaves only one real question, to me. It's convenient to say there are multi-map semantics, because it lets a user call "set" multiple times without remembering any state on their own. If we only support unique map semantics, then the user is required to remember all their values and assemble a list on their own. This would be my preference because it simplifies the API considerably. Set has overwrite semantics, and if you want to set multiple times with a list, you can. But it doesn't require any new API surface to replace an already-set value.

jmacd on 5 Nov 2019

I see your concerns about RDBMS representation as out of place at the API level.

This is true.

I think I agree with what you are saying. As soon as we are able to support multiple values unambiguously and there is a clear convention on how it must be done I am OK with using array values for that. If we have a convention the backend is free to flatten or store however it wants.

@fbogsany thoughts on this? Does it work for you?

tigrannajaryan on 6 Nov 2019

👍 I'm fine with a clear convention.

fbogsany on 7 Nov 2019

Here is a proposal to support multi-value attributes. Please comment.

API Changes

In Tracing API specification, Set Attributes section replace this:

(Required) The attribute value, which must be either a string, a boolean value, or a numeric type.

By this:

- (Required) The attribute value, which is either:
  - A primitive type: string, boolean or numeric.
  - An array of primitive type values. The array MUST be homogeneous, i.e. it MUST NOT contain values of different types.

Protocol Support

OTLP

Add support for array types to AttributeKeyValue message:

Add ARRAY=4 to ValueType enum.
Add new field repeated AttributeKeyValue array_value=7 to contain child elements of the array.
Require that if ValueType==ARRAY then key field of all child elements must be empty.

Protocols with no built-in support for arrays

To store an array of values encode it as a JSON and store as a string value.

For example to store an array of 3 elements 10, 15 and 25 for key "redis" create the following key/value pair:

redis = [10, 15, 25]

Semantic Conventions

Add the following paragraph to Semantic Conventions document:

Resource and Span attributes may be one of the following types:
  - A primitive type: string, boolean or numeric.
  - An array of primitive type values. The array MUST be homogeneous, i.e. it MUST NOT contain values of different types.

The type of the attribute SHOULD be specified in the semantic convention for that attribute. For protocols that do not natively support array values such values MUST be represented as JSON strings.

tigrannajaryan on 28 Nov 2019

I would not require support for non-homogenous arrays and I would even go as far as forbidding semantic conventions to use them. From the top of my head, I can't think of any use cases for them and I think that they will lead to complicated or ugly APIs in some languages.

Oberon00 on 29 Nov 2019

👍1

I would not require support for non-homogenous arrays and I would even go as far as forbidding semantic conventions to use them. From the top of my head, I can't think of any use cases for them and I think that they will lead to complicated or ugly APIs in some languages.

Agreed. My use case only requires homogenous arrays. I'm looking for an array of strings, but I could also see a use for arrays of booleans (which keys in a memcached multi get were hits) and arrays of numbers (size of each value returned).

fbogsany on 30 Nov 2019

Updated the proposal to require homogeneous arrays.

tigrannajaryan on 2 Dec 2019

Note that the requirement to have homogeneous arrays also places limits in arbitrary nested structures that we may want to support in the future. If such support is added most likely there will be a desire to allow non-homogeneous array elements (so that you can e.g. directly represent JSON data).

tigrannajaryan on 2 Dec 2019

I don't think the json representation is useful for Metrics. Hanse one more reason to not have the same representation for attributes in tracing and labels in metrics

bogdandrutu on 2 Dec 2019

The homogeneous requirement is fine, thanks!

jmacd on 3 Dec 2019

simple joining on comma is cheaper to index split and join and more supportable vs json arrays. I know folks are already doing this and it requires no more coordination on backend than making a json array. the only thing you gain with json is a an escaping syntax. I know ship has sailed, but this one needs an alternative opinion mentioned for posterity

codefromthecrypt on 30 Jan 2020

also the use case of tagging the input to multi-get is generalizable as request tagging. it is very common to have json strings when serializing requests into tags. A concrete non-request tagging use case for array values are things like experiment IDs. Usually these are bounded to much lower cardinality, yet need to be indexed for lookup. It is possible that someone wants to index all redis keys, also, but yeah I think this is more a request logging thing vs anything else.

hope the context helps for the next similar discussions

codefromthecrypt on 30 Jan 2020

Was this page helpful?

0 / 5 - 0 ratings