Title: Span being marked as a child of a service it is not a child of (should be peers)
Description:
We have enabled tracing in an Ambassador instance that has authentication enabled. In this configuration, Ambassador requires every call to pass through the authentication service before it is forwarded to the configured downstream service. In the screenshot below, external is the Ambassador instance. When we call a downstream service, the trace shows the downstream call (which of course goes through Ambassador) as a child of the auth service. We expected the auth service and downstream service calls to be peers. It is unclear whether this is a bug in Ambassador or in Envoy, but the Ambassador folks believe it is an Envoy bug, so I am opening it here.

@flands is this Zipkin? cc @objectiser
Not sure how this would be possible, unless somehow the ambassador-auth-service was creating a span and then passing back the trace context which was then used to create the subsequent child span.
Is it possible to get the envoy proxy configuration, and a copy of the trace instance json?
@mattklein123 @objectiser apologies on the delayed response. I am attaching the Envoy configuration as requested. While sanitizing the data, I noticed the following configuration for the auth service:
"filters": [
  {
    "type": "decoder",
    "name": "extauth",
    "config": {
      "allowed_headers": ["X-User", "X-Tenant-ID", "X-User-ID"],
      "cluster": "cluster_ext_auth",
      "path_prefix": "/auth/api/check",
      "timeout_ms": 5000
    }
  },
  {
    "name": "cors",
    "config": {}
  },
  {
    "type": "decoder",
    "name": "router",
    "config": {"start_child_span": true}
  }
]
After digging into start_child_span I found https://www.envoyproxy.io/docs/envoy/latest/api-v1/http_filters/router_filter, but it is unclear to me whether it is related. The documentation does not make clear what this option does.
Next, I found https://github.com/datawire/ambassador/pull/655. I reached out to Alex and he said the start_child_span option was added because spans from the auth service were not being captured without this option being turned on.
Below are the items requested. Please let me know if you need anything else.
To add to @flands's comment about the need for start_child_span:
{"start_child_span": true} was added so Zipkin "Client Send" and "Client Receive" events are generated in child spans by Envoy, even if the remote downstream service does not implement tracing.
Such is the case with our dummy "echo" egress route:

JSON traces with start_child_span and without start_child_span (using a custom build of Ambassador) are available @ https://gist.github.com/alexgervais/21e3492f0b50bf2804b32f6db0a6a8f5
@flands Would it be possible to provide an updated version of the JaegerTracing sandbox example to include the authentication service call?
The example config you provided does not include any of the X-B3-* headers in the allowed headers, so I don't understand how the authentication service would be able to participate in the trace - let alone inserted in the span hierarchy at that particular point.
In the example trace data you provided, the span reported by the svc-auth (which I assume is the ambassador auth service?) has created a span id of "1" - is this accurate or did you change it from a valid 64 bit id? Just wondering where the code is that created this span.
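To make the allowed-headers point above concrete, here is a minimal sketch of a case-insensitive header allow-list. It assumes the list simply decides which headers get passed along; the thread does not confirm that the extauth filter treats trace headers this way, and `filter_allowed` and the sample header values are hypothetical.

```python
from typing import Dict, Iterable


def filter_allowed(headers: Dict[str, str], allowed: Iterable[str]) -> Dict[str, str]:
    """Keep only headers whose names appear in the allow-list (case-insensitive)."""
    allowed_lower = {name.lower() for name in allowed}
    return {k: v for k, v in headers.items() if k.lower() in allowed_lower}


# Allow-list copied from the extauth config posted earlier in this thread.
ALLOWED = ["X-User", "X-Tenant-ID", "X-User-ID"]

# Hypothetical headers, including B3 trace context (values are made up).
incoming = {
    "X-User": "alice",
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "X-B3-Sampled": "1",
}

forwarded = filter_allowed(incoming, ALLOWED)
```

Under that assumption, none of the X-B3-* headers survive the filter, which is why it is surprising to see the auth service in the trace at all.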
@objectiser the original trace was not in raw format so the IDs were merged (i.e. I did not change to 1 that is the behavior of converting from zipkin to jaeger) -- here is an example raw trace:
As for implementation:
@flands You are right - it is Jaeger changing to 1 when it de-duplicates the shared span ids. I knew it was doing this, but didn't realise it was just using a numeric value for the span id - but it only happens on the trace returned to the UI so I guess it is ok.
The shared span context is actually the problem - when Jaeger modifies the span ids to create unique ids for the client and server spans, it assumes that any child span off that shared id is a child of the server span - which in this case is the auth service.
So if you set the shared span context to false, it should fix the problem.
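The re-parenting described above can be sketched as a toy model. This is not Jaeger's actual code, just a simplified illustration under assumptions: the `Span` type, the span names, the ids, and the span kinds are made up, and the new ids are plain counters that are assumed not to collide with existing ids.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    kind: str  # "client" or "server"


def dedupe_shared_ids(spans: List[Span]) -> List[Span]:
    """Give a server span that shares its id with a client span a fresh id,
    then treat every child of the shared id as a child of that server span."""
    client_ids = {s.span_id for s in spans if s.kind == "client"}
    remapped: Dict[str, str] = {}  # shared id -> new server span id
    deduped = []
    for s in spans:
        if s.kind == "server" and s.span_id in client_ids:
            new_id = str(len(remapped) + 1)  # simple numeric id, e.g. "1"
            remapped[s.span_id] = new_id
            deduped.append(replace(s, span_id=new_id, parent_id=s.span_id))
        else:
            deduped.append(s)
    result = []
    for s in deduped:
        new_parent = remapped.get(s.parent_id)
        if new_parent is not None and s.span_id != new_parent:
            result.append(replace(s, parent_id=new_parent))
        else:
            result.append(s)
    return result


def parent_of(spans: List[Span], name: str) -> Optional[str]:
    return next(s.parent_id for s in spans if s.name == name)


# Shared span context: the auth server span reuses the caller's span id "a".
shared = [
    Span("a", None, "ambassador", "client"),
    Span("a", None, "svc-auth", "server"),
    Span("b", "a", "downstream call", "client"),
]

# No shared span context: the auth server span gets its own id "c".
unshared = [
    Span("a", None, "ambassador", "client"),
    Span("c", "a", "svc-auth", "server"),
    Span("b", "a", "downstream call", "client"),
]

shared_result = dedupe_shared_ids(shared)
unshared_result = dedupe_shared_ids(unshared)
```

In the shared case the downstream call ends up parented to the renamed auth span, while in the unshared case nothing is renamed and the auth and downstream spans stay peers under the same parent.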
@objectiser looking at @alexgervais gist above (https://gist.github.com/alexgervais/21e3492f0b50bf2804b32f6db0a6a8f5) it appears turning this option off will result in the auth service being put directly into the chain (e.g. Ambassador > Auth > Downstream) instead of being shown as children (e.g. Ambassador > Auth and Ambassador > Downstream).
There is a Gitter conversation going on in jaegertracing, and Yuri is suggesting that Ambassador needs to have an entry span, and a bunch of exit spans for every RPC call it makes. Given that Ambassador is based on Envoy, it is unclear whether this is an Ambassador issue or an Envoy issue.
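As a rough illustration of the entry/exit span layout suggested on Gitter, here is a minimal sketch. The `Span` type, `trace_request`, and the span names are hypothetical, and this is not how Envoy or Ambassador actually create spans.

```python
import secrets
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    span_id: str
    parent_id: Optional[str]


def new_span_id() -> str:
    # Random 64-bit id rendered as 16 hex characters.
    return secrets.token_hex(8)


def trace_request(outbound_calls: List[str]) -> List[Span]:
    """One entry span for the proxy, plus one exit span per outbound call.
    Every exit span has its own id and the entry span as its parent."""
    entry = Span("ambassador (entry)", new_span_id(), None)
    exits = [
        Span(f"ambassador -> {target}", new_span_id(), entry.span_id)
        for target in outbound_calls
    ]
    return [entry] + exits


spans = trace_request(["svc-auth", "svc-default"])
```

With this layout the auth call and the downstream call are siblings under the entry span, which is the peer relationship expected in the original report.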
@flands Have you tried setting the shared_span_context to false as shown in the JaegerTracing sandbox example?
If that didn't fix the problem, could you attach the raw trace based on this config?
@objectiser I manually built Ambassador with start-child-span: false. As @alexgervais reported above, when you do this, you no longer see the authentication call in the trace. I have included the raw trace below. Note, you will still see an authentication call because svc-default (a downstream service) validates calls via svc-auth (it is also interesting that this shows up with start-child-span: false, but does not with start-child-span: true). I am still very confused as to how Envoy/Zipkin/Jaeger is handling this situation.
@flands I think there is a misunderstanding here - I believe all of the problems you are experiencing are related to zipkin's shared span context approach. Nothing to do with start-child-span.
If you can please set the shared_span_context to false, as shown in JaegerTracing sandbox example, then this will confirm one way or the other.
@objectiser indeed, thanks; the misunderstanding was clearly on my end. When was shared_span_context introduced on the Envoy side? I see the changes landed about a month back. Ambassador is on Envoy 1.7, though they have an EA on Envoy 1.8. Unfortunately, my configuration does not currently work on the Ambassador EA. If Envoy 1.8 is required, it will take me a while to test and confirm.
@flands, I started testing the Ambassador EA build last Friday. I'll make sure to validate the shared_span_context setting and let you know!
I have this PR open on Ambassador EA https://github.com/datawire/ambassador/pull/955 allowing us to tweak shared_span_context.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.