Jaeger: Add the ability to store/retrieve partial spans

Created on 6 Mar 2018  路  24Comments  路  Source: jaegertracing/jaeger

For the first version of this feature, we shall assume the following

  • clients can flush and report partial spans; but not span deltas
  • duration is monotonically increasing
  • the final span has the longest duration

For e.g., a client can do the following
time = 1s
Report Span[traceID=1, spanID=2, duration=10s, operationName="someOperation"]

time = 2s
Report Span[traceID=1, spanID=2, duration=20s, operationName="someOperation"]

time = 3s
Report Span[traceID=1, spanID=2, duration=40s, operationName="someOperation", tags=...]

To support this, the backend would need the ability to resolve merge conflicts on spans. For V1, this means simply selecting the longest span.

On Cassandra, the jaeger-collector uses the model spanhash, guaranteeing that all partial spans are stored:
https://github.com/jaegertracing/jaeger/blob/412baf60db217d6f41d7fdca04358069362b29ca/plugin/storage/cassandra/spanstore/dbmodel/converter.go#L45

https://github.com/jaegertracing/jaeger/blob/fcbd21078ab7ed03682964c187452778242d3423/model/span.go#L63-L66

On ElasticSearch, we use the index api, which performs upserts.
https://github.com/jaegertracing/jaeger/blob/e52ecffbf69a572593504ea9acc4ff65854a3e9a/plugin/storage/es/spanstore/writer.go#L179-L182

To make matters more interesting, note that Jaeger supports storing Zipkin spans that share the same spanID for client and server spans. Jaeger adjusts these spanIDs during query time as seen here:
https://github.com/jaegertracing/jaeger/blob/412baf60db217d6f41d7fdca04358069362b29ca/model/adjuster/span_id_deduper.go#L23-L31

Another point to note is that jaeger-collector makes no guarantee that spans are stored in the order that they are received.

Enhancements

  1. Communicate that jaeger-query is serving incomplete spans to the user. One approach is to do something similar to https://github.com/jaegertracing/jaeger-ui/issues/132. We might have to enhance the Span model to store this information
  2. Allow for jaeger-clients to report that the span it is sending is the final span. This allows us to remove the assumption about monotonically increasing duration.
  3. Add support for span deltas, which might allow for clients to reduce state maintained. (Might be at odds with the previous point)
  4. Enhance Span model to store the lineage of a span (whether it was generated from a Zipkin or Jaeger span), this allows for more robust merging behavior

For more context on use cases and prior discussion see https://github.com/jaegertracing/jaeger-client-java/issues/231

Most helpful comment

Ok, that makes sense. Would have been good to know that before :)

So with this approach, the question is, how to model a final span, right? In https://github.com/jaegertracing/jaeger/pull/728 @black-adder already suggested to use a flag. In my understanding adding a new finalFlag constant like (https://github.com/jaegertracing/jaeger/blob/master/model/span.go#L27)
would allow to detect final spans without altering the db schema of cassandra, because it's a uint32
https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/cassandra/spanstore/dbmodel/model.go#L28
That removes the necessity for any other incomplete attribute and aligns quite well with the current data model (at least if i understood the purpose of flags correctly). The clients could easily implement setting the flag in their span.finish() methods.

With a isFinal flag my suggested approach with a duration of 0 to distinguish between final and partial spans wouldn't be necessary aswell.

All 24 comments

I am not sure why we need phase 1-2 approach and restrictions. Could we not just merge all spans into one? Rules would be:

  • tags and lots are merged as sets
  • operationName and flags are taken from span with largest duration
  • start time is the earliest of all
  • duration can be either the largest or calculated from the latest endTime(i)=startTime(i)+duration(i)

Enhance Span model to store the lineage of a span (whether it was generated from a Zipkin or Jaeger span), this allows for more robust merging behavior

I don't think lineage is really relevant. Zipkin style spans should be deduped using span.kind = client/server distinction.

I am not sure why we need phase 1-2 approach and restrictions.

We don't really, if we agree on the merge rules, it may be simpler to implement the rules in one go.

We might still need the restriction that the final span has the longest duration, (or some other method of denoting the last span sent by a client for a specific span id)

By the proposal made in https://github.com/jaegertracing/jaeger-client-java/issues/231 where you've already referenced this discussion, the last span would be either denoted by a specific flag such as beeing complete or by having a duration at all. I don't know if that's possible but all in between spans would probably still report 0 duration because while still being running the duration information is redundant after all or isn't it? At least that's what i've understood form that discussion.

I was discussing with @black-adder about the approach described in https://github.com/jaegertracing/jaeger-client-java/issues/231 and want to summarize our findings here so far:

One important requirement is not to change the cassandra schema in order to keep updates of jaeger simple. Therefore I want to split the enhancements necessary in "before persistence" and "after persistence".

To allow for easier distinction of partial and complete spans after persistence, I would suggest to change the assumptions to

  • a partial span always has a duration of 0
  • the final span always has a duration > 0

Before persistence

  • Clients can optionally send an "incomplete" flag to indicate if this span is partial or not
  • Clients should report a duration of 0 for partial spans and the correct duration for the final span
  • Thrift SpanModel gets attribute "incomplete"
  • DomainSpan gets the attributes "incomplete" and "type" (which can be 'jaeger' or 'zipkin')
  • JsonSpan gets attribute "incomplete", but it will be populated only during incoming spans requests, not when reading from storage

The type attribute for the domain model is necessary to enable cassandra and es writers to distinguish between replacing partial jaeger spans logic and zipkin server and client have same spanId logic. The type will not be persisted.

The writer needs to distinguish between both aforementioned logics, cause UPSERTs will only work with a stable identifier (cassandra primary key and es document id). Current writer logic would not allow for this, because

  • the es indexer will generate a random document id for every call
  • the cassandra spanhash does not take into account, that a final span may have additional logs/tags/changes

In my approach we enhanced both writers as follows:

  • es writer in writeSpan function
if jaegerSpanType == model.JaegerSpanType {
    _, err = s.client.Index().Index(indexName).Type(spanType).Id(string(jsonSpan.TraceID) + string(jsonSpan.SpanID)).BodyJson(&elasticSpan).Do(s.ctx)
} else {
    _, err = s.client.Index().Index(indexName).Type(spanType).BodyJson(&elasticSpan).Do(s.ctx)
}

The Id(stable identifier) call is necessary to allow for UPSERT, but we don't want to replace zipkin spans with the same spanId, because of client/server spans having the same spanId. Therefore the distinction.

  • cassandra writer, enhancing DomainSpan.Hash function
//with supporting incomplete spans the hash needs to be unique for a combination
//of spanId and traceId, so the final representation of the same span
//will overwrite the incomplete version in cassandra
if s.Type == JaegerSpanType {
    return enc.Encode(JaegerSpanHash{s.TraceID, s.SpanID})
} else {
    return enc.Encode(s)
}

The primary key comprising of spanId, traceId and a hashed representation of the spans remains, we just switch to only hashing traceId and spanId for spans received from jaeger clients, thus having a stable identifier regardless of additonal logs/tags or whatsoever.

After persistence

Since we don't want to change the schema, we won't have any additional information if a span is partial or not. We could stick to the assumption, that a partial span always has a duration of 0 and a completed span always has a duration > 0.

-Jaeger-UI could distinguish and mark spans as partial, when duration is 0

  • In this approach no changes would be necessary in the deduper, since partial spans will always be replaced with the next incoming incarnation of this particular span
  • The clock adjuster (https://github.com/jaegertracing/jaeger/blob/412baf60db217d6f41d7fdca04358069362b29ca/model/adjuster/clockskew.go#L148) needs to check for spans with a duration of 0 in the following cases
  1. parentSpan is partial and childSpan is complete
    -> parentDuration < childDuration, therefore no adjustment will be done with the current logic
  2. parentSpan and childSpan are partial
    -> this would basically lead to a delta parentStartTime - childStartTime and therefore adjusting the child span to the same startTime the parent has
  3. parentSpan is complete and childSpan is partial
    -> childSpan will be adjusted to parentDuration/2 if the childStartTime is before parentStartTime
    -> if childStartTime is between parentStartTime and parentEndTime no adjustments will be made

So we probably need to incorporate changes for 2. and 3.
-> for 2 we should just check that childStartTime is adjusted to be >= parentStartTime, checking for endTimes makes no sense, since they are not known at this point
-> for 3 we should check that childStartTime is between parentStartTime and parentEndTime, but what would we want to adjust it too? Maybe parentDuration/2 is already a good approach?

I can provide a Pull Request with the changes described here (except the ClockAdjuster), so you guys can get a more detailled impression of the changes, I just need to fix one last failing test and can reference it here then :)

Why is it necessary to ensure upsert? I think a simpler solution is to merge spans on read.

I took this approach, because only the last span is relevant (at least this is my assumption) and it is much friendlier for storage not to keep all intermediate representations. Do you think storage size is not an issue?

Hasn't been in the cases we encountered. But in some cases it's not even an option, it's a requirement to save all partial spans. E.g. we have a system that executes long running (on the order of days) workflows, and it persists the intermediate state in the database. The start and finish events of a span may not even be in the same process.

Ok, that makes sense. Would have been good to know that before :)

So with this approach, the question is, how to model a final span, right? In https://github.com/jaegertracing/jaeger/pull/728 @black-adder already suggested to use a flag. In my understanding adding a new finalFlag constant like (https://github.com/jaegertracing/jaeger/blob/master/model/span.go#L27)
would allow to detect final spans without altering the db schema of cassandra, because it's a uint32
https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/cassandra/spanstore/dbmodel/model.go#L28
That removes the necessity for any other incomplete attribute and aligns quite well with the current data model (at least if i understood the purpose of flags correctly). The clients could easily implement setting the flag in their span.finish() methods.

With a isFinal flag my suggested approach with a duration of 0 to distinguish between final and partial spans wouldn't be necessary aswell.

My concern with using the flags field is in changing its semantic meaning and scope. Right now we're using it merely to store the value of spanContext.flags, i.e. the thing that is propagated over the wire. @phal0r 's proposal would change it to also carry information that is only meaningful for the backend, and meaningless in the propagation format.

I don't have strong objections to using the duration field. It does have a few drawbacks:

  1. it is theoretically possible, on system with low precision clocks (like milliseconds) for the duration to be 0 even on the finished span.
  2. it's a poor man's substitution for a proper field in the Span model, mixing different concerns
  3. it does not allow full ordering of the span fragments

The downsides of introducing a dedicated field on the Span is mostly the implementation churn where we need to add it to the data model, update DB schemas everywhere, etc, I think internally we went with the approach based on duration - @vprithvi can you confirm?

@yurishkuro @vprithvi
I agree with you. As mentioned before, using 0 as indicator is not necessary and all suggestions I made are more or less a workaround for not changing the data model for cassandra.

Actually I do not favor any implementation or implementation detail. I want to help with my suggestions to push the feature of persisting incomplete/running spans. I care for semantical correctness and do understand, that the flags are not meant for that. So how can we go on from here?

The issue I see in the current concept is, that I don't see any way to detect if a span is complete or still running or am I missing something?

My preference is to add a new field to the model, semantically it's the cleanest way.

Yeah, we now have traversed every property, that is flexible enough for enhancements, but came to the point, that there is no semantically acceptable place for this. So did you guys already put some work internally to this or should we go on here with the discussion?

What we did internally was not aimed at capturing partial spans, so I'd ignore that.

Faced an issue when long span that is not finished is not writing logs at all until its finished.
Is this the issue relevant for this behaviour which is planned to take care of?

The issue is still relevant for our use case for long running jobs, is there any plan for this?

@mustafaakin
Yes, this is relevant for use, too. I had a first running implementation with Jaeger + ES and the node-client, but we upserted the traces, what is in conflict with this approach. So my next simpler iteration will be to introduce a incomplete flag for spans and a changed cassandra schema based on merger logic described in this issue. My rough plan is to finish this until the end of june. I don't know if anybody else is actively working on this.

I would like to help if needed @phal0r

@mustafaakin thx, I will send you a message, once I engage this issue.

I have implemented the changes and checked it with cassandra and elasticsearch storage.

Being able to inspect long running requests before they are complete is a really important functionality for us as well! We have tasks that take anywhere from sub-second up to many hours.

Jaeger/opentracing would be the perfect tool if we could trace progress during the execution of the tasks tasks.

I agree. I am working on finishing this implementation. There is one more change for the current pull request required and the discussion in https://github.com/jaegertracing/jaeger/issues/1119 needs to be resolved aswell, since this feature changes the cassandra schema.

My approach to the long-running process traces issue right now is that the span the starts a new trace should indeed be finished shortly after created. New spans for tasks spawned by the long-running process can be a child of the main span.
Maybe the UI rendering is not that clean this way, but at least the span is persisted.

Storing pending spans would be, of course, better.

Do you guys need a helping hand with closing this issue?

Szia @istvanszoboszlai ,

any help is appreciated. We are working in #979 to integrate this feature. There is some discussion going on about the approach and the impacts. @de-robat and me will look for the next time window to move this forward. So, if you want to help, we could discuss this together.

Hallo @phal0r,

Sorry for the slow response. Yes, I am happy to help, I hope I can. I am not that familiar with cassandra, I am more into elastic. Let me know when we can discuss it further.

MfG :)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

vprithvi picture vprithvi  路  3Comments

Siddhesh-Ghadi picture Siddhesh-Ghadi  路  4Comments

black-adder picture black-adder  路  4Comments

albertteoh picture albertteoh  路  3Comments

devoxel picture devoxel  路  5Comments