Jaeger: Span format is not well suited for ES

Created on 3 Jul 2018 · 39 Comments · Source: jaegertracing/jaeger

Jaeger spans, when put in elasticsearch, have the following structure:

{
  "_index": "jaeger-span-2018-07-02",
  "_type": "span",
  "_id": "a3RsXWQBjGb8h888uVuL",
  "_version": 1,
  "_score": null,
  "_source": {
    "traceID": "f229796cda43df60",
    "spanID": "274c0d9d4fc76848",
    "parentSpanID": "f229796cda43df60",
    "flags": 1,
    "operationName": "/",
    "references": [
      {
        "refType": "CHILD_OF",
        "traceID": "f229796cda43df60",
        "spanID": "f229796cda43df60"
      }
    ],
    "startTime": 1530575763255726,
    "duration": 1798,
    "tags": [
      {
        "key": "component",
        "type": "string",
        "value": "nginx"
      },
      {
        "key": "nginx.worker_pid",
        "type": "string",
        "value": "10767"
      },
      {
        "key": "peer.address",
        "type": "string",
        "value": "10.244.4.100:37542"
      },
      {
        "key": "http.method",
        "type": "string",
        "value": "POST"
      },
      {
        "key": "http.url",
        "type": "string",
        "value": "YYY"
      },
      {
        "key": "http.host",
        "type": "string",
        "value": "XXX"
      },
      {
        "key": "http.status_code",
        "type": "int64",
        "value": "204"
      },
      {
        "key": "http.status_line",
        "type": "string",
        "value": "204 No Content"
      }
    ],
    "logs": [],
    "processID": "",
    "process": {
      "serviceName": "ingress-controller",
      "tags": [
        {
          "key": "jaeger.version",
          "type": "string",
          "value": "C++-0.2.0"
        },
        {
          "key": "hostname",
          "type": "string",
          "value": "vega"
        },
        {
          "key": "ip",
          "type": "string",
          "value": "127.0.0.1"
        }
      ]
    },
    "warnings": null,
    "startTimeMillis": 1530575763255
  },
  "fields": {
    "startTimeMillis": [
      "2018-07-02T23:56:03.255Z"
    ]
  },
  "sort": [
    1530575763255
  ]
}

Notice the arrays here. We were actually thinking about completely replacing debug logs with debug traces, but because everything is in arrays we can't index these spans in ES and thus can't reliably search them. Jaeger is nice, but ES has much richer search capabilities, and it would be great if we could treat spans as regular structured documents that we can put in ES and index properly.

Are there any plans to support this use case?

storagelasticsearch

All 39 comments

  • (a) as far as I know, we already index everything in the spans in ES, despite having arrays
  • (b) what alternative format do you propose?

How interesting. When I create the "index pattern" from the spans in Kibana it indeed picks up all the fields in arrays. However, I can't search in Kibana by tag, for example. Is there any specific ES/Kibana configuration I need to apply? I didn't see anything in the docs.

Maybe having two maps, tags and tagTypes:

    "tags": {
        "component": "nginx",
        "nginx.worker_pid": "10767",
        "peer.address": "10.244.4.100:37542",
        "http.method": "POST",
        ......
     },
     "tagTypes": {
        "component": "string",
        "nginx.worker_pid": "string",
        "peer.address": "string",
        "http.method": "string",
        ......
     }
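To make the two-map idea concrete, here is a minimal sketch of the conversion from the current tag array to separate `tags` and `tagTypes` maps. The function name `split_tags` is hypothetical, and the sketch assumes every tag has the `key`, `type` and `value` fields shown in the span document above; it is not Jaeger's implementation.

```python
def split_tags(tag_list):
    """Turn Jaeger's [{key, type, value}, ...] array into two maps.

    If the same key occurs more than once, the last occurrence wins,
    which is one of the trade-offs of a map-based layout.
    """
    tags, tag_types = {}, {}
    for tag in tag_list:
        tags[tag["key"]] = tag["value"]
        tag_types[tag["key"]] = tag["type"]
    return tags, tag_types


example = [
    {"key": "component", "type": "string", "value": "nginx"},
    {"key": "http.status_code", "type": "int64", "value": "204"},
]
tags, tag_types = split_tags(example)
print(tags)
print(tag_types)
```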

I'm not sure discussing the actual format is appropriate at this stage, given the "Jaeger indexes everything" response. @yurishkuro do you have any pointers on how I might get Kibana to search spans by tags/logs/etc.? Let's investigate this and only start thinking about changing the schema after we conclude that it's indeed impossible today.

I am not an expert in Kibana. You can look at the code for ES span storage to see how it executes queries.

https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/es/spanstore/reader.go

@yurishkuro Kibana is just a nice UI to render logs, it doesn't do anything interesting.

The issue is that Kibana cannot search on nested objects, so this is not necessarily Jaeger's problem. However, from the product perspective, this issue means Jaeger cannot completely replace debug logs, which it clearly should be able to do, as traces basically ARE logs with some additional metadata for the tree structure. So this implementation detail does break a very reasonable scenario, one that could halve the storage needed for debug logs for teams who already use ES. I say it's definitely worth implementing.
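For context, a field mapped as nested has to be searched with an ES `nested` query, which is the kind of request one would otherwise build by hand. The sketch below builds such a request body for a single tag. It assumes `tags` is mapped as a nested field with `key` and `value` subfields, as in the span document at the top of this issue; the function name is hypothetical and the exact clauses Jaeger's reader uses may differ.

```python
import json


def nested_tag_query(key, value):
    """Build an ES request body matching spans that carry tag key=value.

    Both conditions sit inside one `nested` clause so that they must
    hold for the same element of the `tags` array.
    """
    return {
        "query": {
            "nested": {
                "path": "tags",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"tags.key": key}},
                            {"match": {"tags.value": value}},
                        ]
                    }
                },
            }
        }
    }


print(json.dumps(nested_tag_query("http.method", "POST"), indent=2))
```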

I am still not clear what your proposal is that is "worth implementing". Traces are not regular logs, they are more structured. You can forward logs into the current span, which creates nested, one to many structure. You can use Jaeger UI to view those logs in the context of the span that they belong to, arguably a lot more useful experience than looking at a flat dump of logs across all requests.

I propose a flag to enable a different ES schema that will have plain objects instead of arrays with nested subobjects. One such proposal was presented above, but one possible alternative would be

{
    "tags": {
        "component": {
            "type": "string",
            "value": "nginx",
        },
        ......
     },
}

Basically, with the tag name in the key. Same with process.tags and logs.

afaik this creates problems when tags have dots in them, which a lot of standard OT tags do, because ES treats dots as hierarchy.

Also, how does this solve the problem with logs? The logs are still an array within the span.

The logs would be stored in exactly the same way, with something unique as a key. Possibly the timestamp, but I haven't put much thought in it yet.

The dots can be converted to something else like _ for storage, can't they?

> You can use Jaeger UI to view those logs in the context of the span that they belong to, arguably a lot more useful experience than looking at a flat dump of logs across all requests.

Yes, Jaeger UI is a better tool for looking at individual spans. But a native ES log search UI (i.e. Kibana as the most popular one) is MUCH better at searching for logs with complicated search requests that might, for example, contain regexes on log messages or whatnot. Kibana can also nicely visualize logs and you can build neat dashboards with graphs and such like. Jaeger UI is not there yet and making storage format Kibana-compatible does make a lot of sense to me.

> The logs would be stored in exactly the same way, with something unique as a key. Possibly the timestamp, but I haven't put much thought in it yet.

That means you would always need to use wildcard search expressions to skip that unique key?

> The dots can be converted to something else like _ for storage, can't they?

Similarly, how would the user define search expressions? In Jaeger UI the tags will be "span.kind", but in ES it would be "span_kind", and the user would need to know that somehow.

BTW, I think Zipkin's ES implementation does something like what you're describing, but I haven't looked in detail how they deal with these issues.

> Similarly, how would the user define search expressions? In Jaeger UI the tags will be "span.kind", but in ES it would be "span_kind", and the user would need to know that somehow.

It's certainly better than not being able to search at all, if you're not a programmer using curl and building search requests manually :)

In any case, I'm certainly not suggesting that this is something that must be done, I just think that this is a very useful feature that can solve a real problem of having to store logs twice: in spans for Jaeger and raw logs for searching and dashboarding AND having a separate infra for collecting those raw logs (though one probably would have it already, so this is not a major point).

Just something to consider.

Just FYI, I came across the same problem. As of now the only workaround is to use scripted fields with a flattened tag list; searches on scripted fields are supported in Kibana 6.0+. This, however, has rather terrible performance.

@kacper-jackiewicz Can you please share the field definitions? This was exactly my idea too, but I must admit that I have failed to write correct scripts quickly myself. And this can probably be universally useful for "Jaeger over ES" users.

@Monnoroch In my implementation just simple concat. Using Painless.
params._source['tags'].stream().map(item->item['key']).collect(Collectors.joining())

//edit: forgot to mention you
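For readers unfamiliar with Painless, the following Python sketch mirrors what the script above computes: it concatenates the keys of all tags into one string. The function name is hypothetical, and `source` stands in for the span's `_source` document.

```python
def flatten_tag_keys(source):
    """Concatenate all tag keys of a span document, with no separator.

    This mirrors Collectors.joining() without arguments, which joins
    the mapped elements directly.
    """
    return "".join(item["key"] for item in source["tags"])


span_source = {
    "tags": [
        {"key": "component", "type": "string", "value": "nginx"},
        {"key": "http.method", "type": "string", "value": "POST"},
    ]
}
print(flatten_tag_keys(span_source))
```

Note that only keys are joined, not values, and that without a delimiter a substring search can match across the boundary between two keys.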

I also had this issue with Kibana. Additionally, switching from nested documents to a flat schema should help with query performance.
Anyway, the main constraint is the limit on the number of fields in ES: index.mapping.total_fields.limit defaults to 1000 and can be increased, but I think I've read somewhere that 10k is too much.

My idea would be to create structure like this:

"tags": {
   "component_string": "nginx",
   "nginx_worker_pid_long": 10767,
   "peer_address_string": "10.244.4.100:37542",
   "http_method_string": "POST",
   ...
}
...
"other_fields": [
   "randomtag=1234",
   "otherrandomtag=ajhsdj"
]

1) There is no need to index the field type; it's sufficient to put it into the field name.
2) "other_fields" allows storing an arbitrary number of key-values and allows querying similar to what is available in Cassandra:

  • equals: use an ES terms query on other_fields with the string otherrandomtag=ajhsdj
  • prefix search: use an ES prefix query on other_fields with the string otherrandomtag=aj

3) This allows using the long ES type for tags with long values and the boolean ES type for tags with boolean values, leading to appropriate and smaller indices.

The keys in tags would be supported nicely in Kibana, while the remaining ones in other_fields would still be query-able if needed.
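The proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `indexed_keys` parameter and the `bool` entry in the suffix table are hypothetical, and dots are replaced with `_` simply because the example field names in the proposal do so.

```python
# Jaeger tag type -> field-name suffix. "string" and "int64" appear in
# this thread; the "bool" entry is an assumption for illustration.
SUFFIX_BY_TYPE = {"string": "string", "int64": "long", "bool": "boolean"}


def convert(value, tag_type):
    """Convert a stored tag value to the Python type matching its tag type."""
    if tag_type == "int64":
        return int(value)
    if tag_type == "bool":
        return value if isinstance(value, bool) else str(value).lower() == "true"
    return str(value)


def to_flat_schema(tag_list, indexed_keys):
    """Split tags into typed object fields and "key=value" strings."""
    tags, other_fields = {}, []
    for tag in tag_list:
        key, tag_type, value = tag["key"], tag["type"], tag["value"]
        if key in indexed_keys and tag_type in SUFFIX_BY_TYPE:
            field = key.replace(".", "_") + "_" + SUFFIX_BY_TYPE[tag_type]
            tags[field] = convert(value, tag_type)
        else:
            other_fields.append(f"{key}={value}")
    return {"tags": tags, "other_fields": other_fields}


def equals_query(key, value):
    """ES terms query for an exact key=value entry in other_fields."""
    return {"query": {"terms": {"other_fields": [f"{key}={value}"]}}}


def prefix_query(key, value_prefix):
    """ES prefix query for entries starting with key=value_prefix."""
    return {"query": {"prefix": {"other_fields": f"{key}={value_prefix}"}}}


doc = to_flat_schema(
    [
        {"key": "http.status_code", "type": "int64", "value": "204"},
        {"key": "otherrandomtag", "type": "string", "value": "ajhsdj"},
    ],
    indexed_keys={"http.status_code"},
)
print(doc)
print(prefix_query("otherrandomtag", "aj"))
```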

Will ES blow up if the same tag name in different spans is set to different value type? E.g.

"tags": {
   "error": "true"
}

"tags": {
   "error": true
}

> Will ES blow up if the same tag name in different spans is set to different value type? E.g.

It should be defined either as string or boolean in the mapping. The value with incorrect type will either be rejected or coerced, depending on the settings.

I like @mabn's idea https://github.com/jaegertracing/jaeger/issues/906#issuecomment-403040934 but:

  • Kibana users have to supply the type suffix
  • there is still a problem with dots in tag keys

Here is an example of zipkin index and data

{
        "_index" : "zipkin:span-2018-07-24",
        "_type" : "span",
        "_id" : "AWTMwYfRE_JqQdsPFSM5",
        "_score" : 1.0,
        "_source" : {
          "traceId" : "81d7ee7cd45c831a",
          "duration" : 361483,
          "remoteEndpoint" : {
            "ipv4" : "127.0.0.1",
            "port" : 55230
          },
          "shared" : true,
          "localEndpoint" : {
            "serviceName" : "testsleuthzipkin",
            "ipv4" : "10.33.144.152"
          },
          "timestamp_millis" : 1532443591469,
          "kind" : "SERVER",
          "name" : "get",
          "id" : "3b96174448804d8a",
          "parentId" : "81d7ee7cd45c831a",
          "timestamp" : 1532443591469579,
          "tags" : {
            "http.method" : "GET",
            "http.path" : "/hi2",
            "mvc.controller.class" : "SampleController",
            "mvc.controller.method" : "hi2",
            "random-sleep-millis" : "353"
          }
        }
}

{
  "zipkin:span-2018-07-24" : {
    "aliases" : { },
    "mappings" : {
      "span" : {
        "_source" : {
          "excludes" : [
            "_q"
          ]
        },
        "dynamic_templates" : [
          {
            "strings" : {
              "match" : "*",
              "match_mapping_type" : "string",
              "mapping" : {
                "ignore_above" : 256,
                "norms" : false,
                "type" : "keyword"
              }
            }
          }
        ],
        "properties" : {
          "_q" : {
            "type" : "keyword"
          },
          "annotations" : {
            "type" : "object",
            "enabled" : false
          },
          "duration" : {
            "type" : "long"
          },
          "id" : {
            "type" : "keyword",
            "ignore_above" : 256
          },
          "kind" : {
            "type" : "keyword",
            "ignore_above" : 256
          },
          "localEndpoint" : {
            "dynamic" : "false",
            "properties" : {
              "serviceName" : {
                "type" : "keyword"
              }
            }
          },
          "name" : {
            "type" : "keyword"
          },
          "parentId" : {
            "type" : "keyword",
            "ignore_above" : 256
          },
          "remoteEndpoint" : {
            "dynamic" : "false",
            "properties" : {
              "serviceName" : {
                "type" : "keyword"
              }
            }
          },
          "shared" : {
            "type" : "boolean"
          },
          "tags" : {
            "type" : "object",
            "enabled" : false
          },
          "timestamp" : {
            "type" : "long"
          },
          "timestamp_millis" : {
            "type" : "date",
            "format" : "epoch_millis"
          },
          "traceId" : {
            "type" : "keyword"
          }
        }
      },
      "_default_" : {
        "dynamic_templates" : [
          {
            "strings" : {
              "match" : "*",
              "match_mapping_type" : "string",
              "mapping" : {
                "ignore_above" : 256,
                "norms" : false,
                "type" : "keyword"
              }
            }
          }
        ]
      }
    },
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "zipkin:span-2018-07-24",
        "mapper" : {
          "dynamic" : "false"
        },
        "creation_date" : "1532440048888",
        "requests" : {
          "cache" : {
            "enable" : "true"
          }
        },
        "analysis" : {
          "filter" : {
            "traceId_filter" : {
              "type" : "pattern_capture",
              "preserve_original" : "true",
              "patterns" : [
                "([0-9a-f]{1,16})$"
              ]
            }
          },
          "analyzer" : {
            "traceId_analyzer" : {
              "filter" : "traceId_filter",
              "type" : "custom",
              "tokenizer" : "keyword"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "aOhYlv9lTCCAtMqwhDKWvQ",
        "version" : {
          "created" : "5061099"
        }
      }
    }
  }
}

I have been digging into Zipkin impl. From the above output we can see that tags are modeled as object but indexing is disabled (enabled: false). The query works on _q field (keyword):

The search in Kibana does not work either. It only allows to choose specific field e.g. tags.http.path. Whereas when Jaeger index is used it's possible to select the whole tags.

(Two Kibana screenshots were attached here.)

I have also tried https://github.com/ppadovani/KibanaNestedSupportPlugin. For more details see https://github.com/pavolloffay/jaeger-kibana. The issue is that it defines its own query language. But the search worked.

I think we cannot change . to _. There are standard tags containing _, e.g. http.status_code https://github.com/opentracing/specification/blob/master/semantic_conventions.yaml#L17. We could only use a suffix (_string) to infer the type.

Maybe we could use the array datatype https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html to store all tags in:

["http.url=/foo", "error=true"]

The type could be added as a suffix to the key, or a second array could contain the types.

I don't think the array approach is a good idea, as it would affect any processing (e.g. aggregations) of values in ES.

As far as I can tell, the only issue with the ideas mentioned by @Monnoroch and @mabn is the dots in the key. This can simply be resolved by selecting a different character that isn't used by OT standard tags, e.g. colon.

My preference would be @Monnoroch's approach, as it avoids the type suffix.
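The concern about the replacement character can be illustrated with a small round-trip sketch. The helper names are hypothetical; the point is only that a replacement character which already occurs in tag keys cannot be mapped back unambiguously, whereas one that does not occur can.

```python
def encode_key(key, replacement):
    """Replace dots in a tag key so ES does not treat them as hierarchy."""
    return key.replace(".", replacement)


def decode_key(stored, replacement):
    """Map a stored key back to the original dotted form."""
    return stored.replace(replacement, ".")


original = "http.status_code"

# "_" already occurs in the key, so decoding does not restore it.
print(decode_key(encode_key(original, "_"), "_"))  # http.status.code

# ":" does not occur in the key, so the round trip is lossless.
print(decode_key(encode_key(original, ":"), ":"))  # http.status_code
```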

@objectiser the second issue is the data types that need to be consistent for all spans. However, this is a good practice anyway, so this limitation is not necessarily critical. One can still store heterogeneous data in payload, it's just that it won't be indexed very well.
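One way to see whether existing data already violates this constraint is to scan spans for tag keys that occur with more than one type. The sketch below is a hypothetical helper that works on span documents in the current array format; it is not part of Jaeger.

```python
from collections import defaultdict


def find_type_conflicts(spans):
    """Return {tag key: sorted list of types} for keys seen with >1 type."""
    seen = defaultdict(set)
    for span in spans:
        for tag in span.get("tags", []):
            seen[tag["key"]].add(tag["type"])
    return {key: sorted(types) for key, types in seen.items() if len(types) > 1}


spans = [
    {"tags": [{"key": "error", "type": "string", "value": "true"}]},
    {"tags": [{"key": "error", "type": "bool", "value": True}]},
    {"tags": [{"key": "component", "type": "string", "value": "nginx"}]},
]
print(find_type_conflicts(spans))
```

Keys reported by such a check would be the ones at risk of being rejected or coerced once tags are stored as object fields.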

Just for documentation:

> Will ES blow up if the same tag name in different spans is set to different value type? E.g.

With the mapping "tags":{"type":"object"}, it does blow up if

"tags":{
    "a": "true",
    "a": true
    },    

is present in the first span. ES returns
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [tags.a] of different type, current_type [text], merged_type [boolean]"}],"type":"illegal_argument_exception","reason":"mapper [tags.a] of different type, current_type [text], merged_type [boolean]"},"status":400}

However, it passes on the second span. The type of the property is set based on the value of the first field in the first span. E.g. if the first span contains a: "true", then the type is set to text.

hi All,

I have submitted https://github.com/jaegertracing/jaeger/pull/980

tags: {
  "http:method": "GET",
}

Performance test with 300k spans and query
http://localhost:16686/api/traces?service=perf-test-thread-0&limit=50000&lookback=1h&tags={"fooo.bar*?%http.d6cconald":"hehuhoh$?ij","fooo.ba2sar":"true","fooo.ba4342r":"1","fooo.bar1":"fobarhax*+??","fooo.bar*?%http.do**2nald":"goobarRAXbaz","fooo.bar*?%http.don(a44ld":"goobarRAXbaz","fooo.ba24r*?%":"hehe"}. More results can be found in linked PRs.

limit 50000: mean = 13363.90 milliseconds
limit 1500: mean = 441.96 milliseconds
limit 20: mean = 17.88 milliseconds

Report time with multiple queries
191265.00 milliseconds = 3.18 min

and https://github.com/jaegertracing/jaeger/pull/982

tags: {
  "http:method": {
     "value": "GET",
     "type": "string"
  }
}

limit 50000: mean = 13533.82 milliseconds
limit 1500: mean = 681.32 milliseconds
limit 20: mean = 41.09 milliseconds

Report time with multiple queries
260223.00 milliseconds = 4.3 min

Results for master, tags as nested datatype:

tags: {
  {},{},{}
 }

limit 50000: mean = 12683.73 milliseconds
limit 1500: mean = 405.27 milliseconds
limit 20: mean = 26.40 milliseconds

Report time:
206348.00 milliseconds = 3.4 min
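To compare the three sets of numbers above side by side, the following sketch computes the relative change of each PR's mean query time against master. The figures are copied from the results in this comment; the dictionary labels and the function name are only illustrative.

```python
# Mean query times in milliseconds, keyed by the `limit` query parameter.
MEANS_MS = {
    "master (nested)": {50000: 12683.73, 1500: 405.27, 20: 26.40},
    "#980 (value as field)": {50000: 13363.90, 1500: 441.96, 20: 17.88},
    "#982 (value/type object)": {50000: 13533.82, 1500: 681.32, 20: 41.09},
}


def relative_change(candidate, baseline):
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


baseline = MEANS_MS["master (nested)"]
for name, means in MEANS_MS.items():
    if name == "master (nested)":
        continue
    for limit in (50000, 1500, 20):
        change = relative_change(means[limit], baseline[limit])
        print(f"{name:26s} limit {limit:>5}: {change:+.1f}% vs master")
```

These are single benchmark runs, so small differences should be read with caution.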

The biggest limitation when using the tag key as the object key is index.mapping.total_fields.limit. With the default index settings I was able to store 180 (#980) or 480 (#982) unique tags. Maybe it could be increased by overriding the default mapping and disabling the .raw field.
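For reference, the limit itself is an index setting and can be raised per index. The sketch below only builds the JSON body that would be sent to the index `_settings` endpoint; the helper name and the value 2000 are illustrative assumptions, not a recommendation.

```python
import json


def field_limit_settings(limit):
    """Build a settings body raising index.mapping.total_fields.limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return {"index.mapping.total_fields.limit": limit}


# Body for a request such as: PUT /jaeger-span-2018-07-02/_settings
print(json.dumps(field_limit_settings(2000)))
```

Raising the limit trades one constraint for larger mappings, which is the concern raised earlier in this thread about very high field counts.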

Given this limitation I would like to use a combined mapping. OT standard tags (or configured ones) would be stored as the object datatype to enable querying in Kibana. Is there a consensus to move in this direction?

@Monnoroch @mabn @kacper-jackiewicz @yurishkuro could you please have a look at ^^^ I would like to get it done quickly.

The most important piece of my proposal is that only specified tags (the standard OpenTracing tags https://github.com/opentracing/specification/blob/master/semantic_conventions.yaml#L9 and configured tags) would be stored as a direct object. Other tags would still be stored as nested objects, as they are right now.
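A rough sketch of the combined mapping described here: tags on a whitelist go into an object keyed by tag name, and everything else stays in the nested array. The function name, the `whitelist` parameter, the `"*"` convention for "store everything as object fields" and the `:` dot replacement are assumptions made for illustration, not the behaviour of the linked PRs.

```python
def split_span_tags(tag_list, whitelist, dot_replacement=":"):
    """Split tags into (object fields, nested tag array).

    Whitelisted tags become object fields, with dots in the key replaced
    so ES does not interpret them as hierarchy. All other tags keep the
    existing {key, type, value} representation.
    """
    store_all = "*" in whitelist
    tag_object, nested_tags = {}, []
    for tag in tag_list:
        if store_all or tag["key"] in whitelist:
            tag_object[tag["key"].replace(".", dot_replacement)] = tag["value"]
        else:
            nested_tags.append(tag)
    return tag_object, nested_tags


tags = [
    {"key": "http.method", "type": "string", "value": "GET"},
    {"key": "custom.tag", "type": "string", "value": "x"},
]
print(split_span_tags(tags, whitelist={"http.method"}))
print(split_span_tags(tags, whitelist={"*"}))
```

A blacklist variant would invert the membership test; either way the number of distinct object keys stays bounded by the field limit discussed above.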

@pavolloffay can we have an option "I am a good engineer and my tags are typed, please store them all in an object"? :)

@Monnoroch the type is not problematic. There is a limit on unique field keys, please read https://github.com/jaegertracing/jaeger/issues/906#issuecomment-413582617.

@pavolloffay ah! Missed that bit, thanks. Still, would be nice to have a flag for reversing the logic: blacklisting tags instead of whitelisting.

Made some comments in the PR.

I expect that for most people 180 tags is a theoretical limitation rather than a practical one, really.

We don't know how people are using them. If somebody is over the limit they would not be able to upgrade. The backup logic for figuring out what tag should be stored in different mapping would be very ugly...

Yeah, but nowadays people use more and more gRPC services instead of raw HTTP, and there are other RPC frameworks; the standard will not be able to keep up, and the feature will become much less useful. Not to mention that bigger companies have their own mini frameworks with custom tag names.

My feeling is that having a reasonable number of tags is a more reasonable limitation than not being able to introduce your own tag names because you won't be able to search by them. With a whitelist I can only search by 20 pre-defined tags, while with a blacklist it's 180.

@pavolloffay I agree with @Monnoroch: there must be a way to extend the standard list of tags (the whitelist) to cover business-related tags and existing conventions in frameworks (e.g. guid:x-request-id from Istio). From that perspective, however, 'index.mapping.total_fields.limit' is problematic for large multi-tenant installations, where even a few custom tags per project will eventually add up to that limit. There are people reporting indices running with 10k+ fields; ultimately, however, Jaeger should make ES storage multi-tenant aware too (e.g. an index per day per application / namespace on K8s). Until then it would be up to the administrator to track custom tags using the whitelist, where the most generic approach would be to just allow all tags using a '*' regex (to address @Monnoroch's remark about flexibility). I prefer #980 over #982, since I do not see a reason to store the original type for tags. Can you elaborate on the original reasons / use case, given that #980 works without it?

The model follows the OpenTracing API, which allows different types for tag values. IIRC the tag type is only used in storage integration tests. The other consumer could be a post-processing job. @yurishkuro do we, or does Uber, use tag types?

For the reference, Zipkin and OpenCensus API only support string tags.

OpenTracing supports typed tags, and Jaeger stores them as typed values, but so far we have not built any indexing capability that would make use of the types, such as supporting range queries for http.status_code as integer. However, the big data jobs can indeed use typed values.
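As an illustration of what typed indexing could enable, the sketch below builds an ES `range` query for server errors by status code. It assumes a hypothetical layout where the tag is stored as a numeric object field named `tags.http:status_code`; with string-typed values a numeric range query would not behave this way.

```python
def status_code_range_query(field, low, high):
    """ES range query matching low <= field < high on a numeric field."""
    return {"query": {"range": {field: {"gte": low, "lt": high}}}}


# Hypothetical field name; depends on the chosen schema and dot replacement.
print(status_code_range_query("tags.http:status_code", 500, 600))
```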

I have submitted https://github.com/jaegertracing/jaeger/pull/1018 as a final PR.
