Envoy: Zipkin HTTP_JSON v2 spans using strings for timestamp/duration

Created on 13 Dec 2019  路  21Comments  路  Source: envoyproxy/envoy

Title: Zipkin HTTP_JSON v2 payload using strings for timestamp/duration

Description:
According to the Zipkin v2 API (https://zipkin.io/zipkin-api/#/default/post_spans) spans should contain timestamp and duration fields as int64, not strings, in the JSON payload. This makes collectors not being able to parse the JSON payload.

When configuring Zipkin to use the HTTP_JSON endpoint version, the resulting payload is:

[
  {
    "traceId": "83713138fa14c726",
    "id": "83713138fa14c726",
    "kind": "SERVER",
    "name": "www.hello",
    "timestamp": "1572849988488551",  <-------------------------------------------
    "duration": "7131",        <-------------------------------------------
    "localEndpoint": {
      "serviceName": "envoy-cluster",
      "ipv4": "172.31.0.5"
    },
    "tags": {
      "response_flags": "-",
      "request_size": "0",
      "response_size": "292",
      "node_id": "envoy",
      "http.status_code": "200",
      "http.method": "GET",
      "upstream_cluster": "testcluster",
      "guid:x-request-id": "3ab0fc56-f92e-9c03-a087-1f23662699cb",
      "zone": "us-east-1",
      "http.protocol": "HTTP/1.1",
      "component": "proxy",
      "http.url": "https://www.hello/headers",
      "downstream_cluster": "-",
      "user_agent": "HTTPie/0.9.2"
    }
  }
]

When it should be

[
  {
    "traceId": "83713138fa14c726",
    "id": "83713138fa14c726",
    "kind": "SERVER",
    "name": "www.hello",
    "timestamp": 1572849988488551,  <-------------------------------------------
    "duration": 7131,        <-------------------------------------------
    "localEndpoint": {
      "serviceName": "envoy-cluster",
      "ipv4": "172.31.0.5"
    },
    "tags": {
      "response_flags": "-",
      "request_size": "0",
      "response_size": "292",
      "node_id": "envoy",
      "http.status_code": "200",
      "http.method": "GET",
      "upstream_cluster": "testcluster",
      "guid:x-request-id": "3ab0fc56-f92e-9c03-a087-1f23662699cb",
      "zone": "us-east-1",
      "http.protocol": "HTTP/1.1",
      "component": "proxy",
      "http.url": "https://www.hello/headers",
      "downstream_cluster": "-",
      "user_agent": "HTTPie/0.9.2"
    }
  }
]

Repro steps:
Enable Zipkin tracing with HTTP_JSON as follows:

tracing:
  http:
    name: envoy.zipkin
    config:
      collector_cluster: tracing-sidecar
      collector_endpoint: /api/v2/spans
      collector_endpoint_version: HTTP_JSON
      shared_span_context: false
aretracing bug help wanted

Most helpful comment

using E notation intentionally, when the spec doesn't say to, and most clones will break if we do, and that there's no numeric overflow concern.. seems a recipe for disaster.

Why are we not getting the cpp lib to emit plain numbers? This seems easier than trying to change all libraries and clones to follow the change that happened here.

This could have been caught I think with an integration test of the json output.

All 21 comments

This makes collectors not being able to parse the JSON payload.

To clarify, this is OK for Zipkin server right?

I also came accross the same issue.
Specifically I came accross this using the library: https://github.com/openzipkin/zipkin-go.
Relevant code: https://github.com/openzipkin/zipkin-go/blob/master/model/span.go#L120

I'm not entirely sure if the bug is here or in the library.
I realize big numbers in json are problematic and the workaround seems to be we encode them as strings.
However from the RFC:

Note that when such software is used, numbers that are integers and
are in the range [-(2*53)+1, (2*53)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.

Javascript MAX_SAFE_INTEGER also respects that: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/MAX_SAFE_INTEGER

So from that I would say that the number: 1576596709663067 is safe to be unqouted.
(2^53 is: 9007199254740991)

I realize the Go json decoder could also handle that, but according to the RFC an unqouted microsecond thimestamp should be fine for another ~235 years.

My 2 cents.

@dio I'm not sure if Zipkin server accepts those two int64 properties as strings. As, @erikbozic points it sounds like it wouldn't. Their API schema certainly has them as numbers, not strings (see https://zipkin.io/zipkin-api/#/default/post_spans)

I was trying to use the Zipkin plugin to send spans to Lightstep (Zipkin compatible collectors), and Lightstep fails to ingest spans complaining about those string fields.

@danielgblanco I tested with the setup under examples (https://github.com/envoyproxy/envoy/tree/master/examples/zipkin-tracing) changed it to HTTP_JSON, it works. While with jaeger it doesn't. I'll take a look at this closer.

Thanks. It may be that Zipkin server is able to cast those strings into integers and Lightstep collectors are not doing it, simply following the Zipkin API... Thanks for having a look!

@dio You're correct. If we send these spans directly to Zipkin it works fine. While with Jaeger it won't work, because of the same problem as in the library I pointed out. (https://github.com/jaegertracing/jaeger/blob/master/thrift-gen/zipkincore/ttypes.go#L766)

I imagine there might be similar issues in libraries/consumers written in other languages. (e.g. the Zipkin plugin for Lightstep apparently)

Now for our case we forked the library and changed it so it accepts strings. But now it won't work with numbers, which isn't ideal. But as long as we only accept spans from envoy, it will work.

@erikbozic @danielgblanco I鈥檒l go ahead to fix this.

@cetanu are you also OK with this? Or should we have an option to spit json where uint64 is rendered as string?

We changed some code to expect a json.Number type for these fields and then we cast it to the correct type for our use-case 馃憤 so either way we're OK here.

I think if envoy were to opt for not changing this it wouldn't be an absolutely incorrect decision but it would solve a problem in some cases. One example I can think of is where a user of envoy does not have control over the trace server/collector. I personally just had to send a PR to another team.

If clients that accept these traces could also be changed to accept a generic JSON "number", whatever that means to the respective client library, then that would cover both sides, but is probably/definitely not in scope for this issue.

I tried out the new 1.13.0 release of envoy and saw these changes.
However now, _sometimes_, I get the timestamp in scientific notation (e.g.: 1.57967750526223e+15).

The Go client I'm using is expecting something it can treat as uint64, however this value can only (safely) be interpreted as a float. (an issue about it in the Go repo: https://github.com/golang/go/issues/5562).

I haven't figured out why sometimes it's in scientific notation and sometimes in regular real number .

I'm not sure what the correct behavior is here. Should the producer (e.g. envoy) make sure this is "a proper int", or should the consumer (e.g. Go client library) handle it "because json".

Semantically though, I think timestmap shouldn't be a float.
But as @cetanu said, this is mostly a problem if you're not in control over the collector, otherwise it's easy enough to solve. (btw, thanks for the json.Number tip. I'll try that in our fork of the library :smile: )

Thanks @erikbozic for reporting. We should track this one. I鈥檒l try to take a look.

I was wondering if someone if currently working on a fix for that? Currently causing me trouble. Native jeager tracing is not working anymore with Envoy 13 because of cpp lib changed in jeager lib and zipkin v2 is not working properly + zipkin v1 no more available... Running out of option ;)

Oops. Putting this back to our radar.

using E notation intentionally, when the spec doesn't say to, and most clones will break if we do, and that there's no numeric overflow concern.. seems a recipe for disaster.

Why are we not getting the cpp lib to emit plain numbers? This seems easier than trying to change all libraries and clones to follow the change that happened here.

This could have been caught I think with an integration test of the json output.

@dio do you have an ETA on this issue? If it is not near to be fixed, I'll use Envoy 12.x with zipkin v1 :(

@marcantoine-bibeau sorry. I'll try to look at this as soon as possible.

I know this is not ideal, but here's an example of workaround if you want to "enable" zipkin v1 for the time being. https://github.com/envoyproxy/envoy/blob/1d751a53e1a8160be918beb42599c5c863977673/configs/using_deprecated_config.v2.yaml

@dio Just tried and unfortunately even zipkin v1 sometimes send timestamp using scientific notation

POST /api/v1/spans HTTP/1.1
host: tracing_zipkin_cluster
content-type: application/json
x-envoy-internal: true
x-forwarded-for: 10.1.12.170
x-envoy-expected-rq-timeout-ms: 5000
transfer-encoding: chunked

[{"traceId":"1db0586e994ff950","name":"10.1.12.170:9090","timestamp":1.58432429547687e+15,"duration":9137,"id":"1db0586e994ff950","annotations":[{"timestamp":1.58432429547687e+15,"endpoint":{"port":0,"serviceName":"echocall1","ipv4":"10.1.12.170"},"value":"sr"},{"value":"ss","timestamp":1584324295486163,"endpoint":{"port":0,"serviceName":"echocall1","ipv4":"10.1.12.170"}}],"binaryAnnotations":[{"value":"9fd656dc-c0a5-926f-a1bf-d1f03f76006f","key":"guid:x-request-id"},{"value":"http://10.1.12.170:9090/health","key":"http.url"},{"value":"GET","key":"http.method"},{"value":"-","key":"downstream_cluster"},{"value":"kube-probe/1.15","key":"user_agent"},{"value":"HTTP/1.1","key":"http.protocol"},{"value":"0","key":"request_size"},{"value":"15","key":"response_size"},{"value":"proxy","key":"component"},{"value":"h1_ingress_cluster","key":"upstream_cluster"},{"value":"200","key":"http.status_code"},{"value":"-","key":"response_flags"}]}]

Here's my config

...
tracing:
  http:
    name: envoy.zipkin
    typed_config:
      '@type': type.googleapis.com/envoy.config.trace.v2.ZipkinConfig
      collector_cluster: tracing_zipkin_cluster
      collector_endpoint: /api/v1/spans
      collector_endpoint_version: HTTP_JSON_V1
layered_runtime:
  layers:
  - name: static_layer
    static_layer:
      envoy.deprecated_features:envoy.config.trace.v2.ZipkinConfig.HTTP_JSON_V1: true

@marcantoine-bibeau if you don't mind, could you help to validate the fix in #10400? By testing this image: dio123/envoy-zipkin:9341-fix. Thank you!

@dio Yo hey :) Will do very soon, I'll ping you in #10400 once I'm done. Thanks a lot!

@erikbozic @marcantoine-bibeau I'm rethinking this, so we can keep rendering timestamp as a string. What Zipkin compliant collectors do you use? I'll probably go out there and try to fix that (probably also to fix the way openapi gen handles uint64_t from a string).

@dio I'm using open-telemetry-collector (https://github.com/open-telemetry/opentelemetry-collector) than uses zipkin-go liv (https://github.com/openzipkin/zipkin-go). I proposed to that repo a pr (https://github.com/openzipkin/zipkin-go/pull/161) to support scientific notation but got refused. And frankly, I kinda agree with their reason based on the fact that protocol (https://zipkin.io/zipkin-api/#/default/post_spans) is clear ;)

I don't think it should be a string if we follow their api...

@dio I'm also using https://github.com/openzipkin/zipkin-go. I made basically the same changes that @marcantoine-bibeau did in his PR and have been using my fork, since we're using the library with our own collector.

What's seems to be issue so we can't change this in Envoy? I'd dig in myself, but my cpp skills are non-existent...

Was this page helpful?
0 / 5 - 0 ratings