It would be nice to expose the bootstrap config a bit more with things like tracing and stats configs. We could even offer something like a file path that gets merged into the bootstrap YAML on boot.
For context, we're building our own control plane for service -> service mesh for Envoy. We're including stats and Zipkin traces (via the jaeger endpoint). We lose the trace context between Contour and our sidecar'd envoys because we can't configure Contour to include trace information.
Hi @bobbytables. To be very transparent, merging pieces of abstract configuration is not a feature I plan to add.
However, the contour bootstrap initContainer is not required to use Contour, it's just a convenience. You could replace the output of contour bootstrap with a ConfigMap into a volume. See #1
Roger that, I'll probably end up doing the mounted configmap then. Thanks!
That’s the approach I’d recommend. If you check out the heptio/gimbal repo, they’re probably already doing that.
On 18 May 2018, at 23:33, Robert Ross notifications@github.com wrote:
@bobbytables @davecheney
I loved this suggestion (replacing contour bootstrap with a ConfigMap-mounted volume); it's fantastically simple. I'm also trying to enable tracing (span injection) at the ingress controller; here are the results of my experiment.
TL;DR
First, deploy contour without bootstrap and mount an updated envoy configuration from a ConfigMap.
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: contour
    role: ingress
  name: contour-config
  namespace: ingress-system
data:
  contour.yaml: |
    dynamic_resources:
      lds_config:
        api_config_source:
          api_type: GRPC
          cluster_names: [contour]
          grpc_services:
          - envoy_grpc:
              cluster_name: contour
      cds_config:
        api_config_source:
          api_type: GRPC
          cluster_names: [contour]
          grpc_services:
          - envoy_grpc:
              cluster_name: contour
    static_resources:
      clusters:
      - name: contour
        connect_timeout: { seconds: 5 }
        type: STRICT_DNS
        hosts:
        - socket_address:
            address: 127.0.0.1
            port_value: 8001
        lb_policy: ROUND_ROBIN
        http2_protocol_options: {}
        circuit_breakers:
          thresholds:
          - priority: high
            max_connections: 100000
            max_pending_requests: 100000
            max_requests: 60000000
            max_retries: 50
          - priority: default
            max_connections: 100000
            max_pending_requests: 100000
            max_requests: 60000000
            max_retries: 50
      - name: service_stats
        connect_timeout: 0.250s
        type: LOGICAL_DNS
        lb_policy: ROUND_ROBIN
        hosts:
        - socket_address:
            protocol: TCP
            address: 127.0.0.1
            port_value: 9001
      - name: zipkin
        connect_timeout: { seconds: 5 }
        type: LOGICAL_DNS
        lb_policy: ROUND_ROBIN
        hosts:
        - socket_address:
            protocol: TCP
            address: zipkin-collector.tracing-system
            port_value: 9411
    tracing:
      http:
        name: envoy.zipkin
        config:
          collector_cluster: zipkin
          collector_endpoint: /api/v1/spans
    admin:
      access_log_path: /dev/null
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 9001
---
```
```bash
kubectl create ns ingress-system
kubectl apply -f issue-399.yaml -l "app=contour"
kubectl -n ingress-system logs $POD_NAME -c envoy
```

```
[2018-07-19 17:52:21.931][1][info][main] source/server/server.cc:178] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2018-07-19 17:52:21.935][1][info][config] source/server/configuration_impl.cc:52] loading 0 listener(s)
[2018-07-19 17:52:21.937][1][info][config] source/server/configuration_impl.cc:92] loading tracing configuration
[2018-07-19 17:52:21.938][1][info][config] source/server/configuration_impl.cc:101] loading tracing driver: envoy.zipkin
[2018-07-19 17:52:21.938][1][info][config] source/server/configuration_impl.cc:119] loading stats sink configuration
[2018-07-19 17:52:21.939][1][info][main] source/server/server.cc:353] starting main dispatch loop
```
Next, deploy a service to help inspect the upstream request.
```bash
kubectl apply -f issue-399.yaml -l "app=echo"
curl -iv http://echo.127.0.0.1.xip.io/
```

```
Request Headers:
accept=*/*
content-length=0
host=echo.127.0.0.1.xip.io
user-agent=curl/7.60.0
x-envoy-expected-rq-timeout-ms=15000
x-envoy-internal=true
x-forwarded-for=192.168.65.3
x-forwarded-proto=http
x-request-id=b645e1b6-2634-43f5-bd95-b6bac6b61c26
```
If I update the listener creation code (specifically, the function httpFilter) to conditionally add a tracing definition, spans are emitted correctly to my configured opentracing backend.
```diff
+	if enabletracing {
+		filter.Config.Fields["tracing"] = st(map[string]*types.Value{
+			"operation_name": sv("egress"),
+		})
+	}
+	return filter
```
Swapping out the container definition with this:

```yaml
- name: contour
  image: docker.io/mattalberts/contour:0.5.0-test
  imagePullPolicy: Always
  command: ["contour"]
  args:
  - serve
  - --incluster
  - --enable-tracing
  - --ingress-class-name=contour
```

and repeating the request, the upstream now receives B3 trace headers:

```bash
curl -iv http://echo.127.0.0.1.xip.io/
```

```
Request Headers:
accept=*/*
content-length=0
host=echo.127.0.0.1.xip.io
user-agent=curl/7.60.0
x-b3-sampled=1
x-b3-spanid=31bd1c8e3db02128
x-b3-traceid=31bd1c8e3db02128
x-envoy-expected-rq-timeout-ms=15000
x-envoy-internal=true
x-forwarded-for=192.168.65.3
x-forwarded-proto=http
x-request-id=45b03ab1-2271-92e0-b926-c8946aa9c0d3
```
My patch currently relies on a flag --enable-tracing. Looking through the envoy code base, the overhead when tracing is enabled feels fairly minimal, but it does cause extra work to be done.
(Note: remove the .txt extension from the attached manifest before applying it.)
My initial work used a command-line option --enable-tracing to globally add the tracing definitions to all listeners. Looking through the annotations, it might make more sense to expose this as an annotation. Which would be preferred?
I'm leaning towards 1 or 2.
Some relevant Envoy APIs:
@rosskukulinski I probably missed something. Setting up the static cluster/trace resource was easy; I used the zipkin trace config:

```yaml
# Entry under static_resources.clusters
- name: zipkin
  connect_timeout: { seconds: 5 }
  type: LOGICAL_DNS
  lb_policy: ROUND_ROBIN
  hosts:
  - socket_address:
      protocol: TCP
      address: zipkin-collector.tracing-system
      port_value: 9411
```

```yaml
# Top-level bootstrap key
tracing:
  http:
    name: envoy.zipkin
    config:
      collector_cluster: zipkin
      collector_endpoint: /api/v1/spans
```
The more difficult part was enabling it as part of the Envoy filter. The code has moved around quite a bit since then :); I should pull 0.8.1 and play with it (I'm still living back on 0.5.0).
Moving to the unplanned milestone. We don't plan on looking at this until after Contour 1.0.
Blocked:
The approach we have taken for structured logs (#624) without having to fork Contour is to introduce a middleware gRPC proxy between Envoy and Contour and activate the gRPC Access Log Service (ALS).
This effectively augments the xDS responses "in flight" to enable certain features (such as ALS) for which Contour does not have a design yet (or may never implement for various reasons). We can also remove/modify certain switches that Contour configures (for example, disable the stdout access log).
This is probably going to be our strategy to get tracing (this ticket) working. Unlike Contour, this gRPC proxy does not have to be very fancy with caching, as it can effectively pass most requests through verbatim. With its limited scope, it can also be implemented in any language that supports gRPC. The extra latency is out of the data path, so it is a reasonable compromise.
Another potential option is to reuse or modify Google's FluentD plugin, which is covered in "Envoy, Nginx, Apache HTTP Structured Logs with Google Cloud Logging".
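To make the "augment in flight" idea concrete, here is a minimal sketch using simplified stand-in types. It assumes hypothetical `Listener`/`HTTPConnectionManager` structs and a hypothetical `augment` hook rather than the real go-control-plane protobufs; a real proxy would unmarshal the `Any`-typed resources in each discovery response, rewrite them, and re-marshal them before forwarding to Envoy.

```go
package main

import "fmt"

// Tracing stands in for the HttpConnectionManager tracing stanza.
// An empty, non-nil value is what switches tracing on.
type Tracing struct{}

// HTTPConnectionManager is a hypothetical, simplified stand-in for Envoy's HCM config.
type HTTPConnectionManager struct {
	StatPrefix      string
	Tracing         *Tracing
	StdoutAccessLog bool
}

// Listener is a hypothetical, simplified stand-in for an LDS resource.
type Listener struct {
	Name string
	HCMs []*HTTPConnectionManager
}

// augment rewrites listeners on their way from Contour to Envoy: it enables
// tracing on every HCM and can turn off the stdout access log.
// Everything else passes through unchanged.
func augment(listeners []Listener, disableStdoutLog bool) []Listener {
	for _, l := range listeners {
		for _, hcm := range l.HCMs {
			if hcm.Tracing == nil {
				hcm.Tracing = &Tracing{}
			}
			if disableStdoutLog {
				hcm.StdoutAccessLog = false
			}
		}
	}
	return listeners
}

func main() {
	in := []Listener{{
		Name: "ingress_http",
		HCMs: []*HTTPConnectionManager{{StatPrefix: "ingress_http", StdoutAccessLog: true}},
	}}
	hcm := augment(in, true)[0].HCMs[0]
	fmt.Println(hcm.Tracing != nil, hcm.StdoutAccessLog) // true false
}
```

Because the proxy only touches the fields it cares about, it needs no caching logic of its own.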
How's the situation with this issue?
Should I replace the output of contour bootstrap with a ConfigMap mounted into a volume? Or is there another recommended approach to enable Envoy tracing?
@inigohu the best way to do this at the moment would be to provide your own configuration to envoy. contour bootstrap is a convenience to provide the parameters we expect for Envoy and Contour to work together but it is not required. You can replace it with your own mechanism for providing bootstrap configuration to Envoy.
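I found altering the Envoy config to be insufficient to get tracing working. I had to patch the setup of the HTTPConnectionManager:

```go
import (
	http "github.com/envoyproxy/go-control-plane/envoy/config/filter/network/http_connection_manager/v2"
)

// tracing returns the HttpConnectionManager tracing stanza when tracing
// is enabled, or nil to leave tracing off.
func tracing(enableTracing bool) *http.HttpConnectionManager_Tracing {
	if enableTracing {
		return &http.HttpConnectionManager_Tracing{
			OperationName: http.EGRESS,
		}
	}
	return nil
}
```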
Faced this problem too.
Adding HttpConnectionManager_Tracing enables traces, but nobody has mentioned how to configure sampling (with the defaults, 100% of requests are traced).
According to the Envoy HttpConnectionManager.Tracing configuration docs (https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto.html#extensions-filters-network-http-connection-manager-v3-httpconnectionmanager-tracing), it is possible to insert a default (empty) tracing config and tune it with runtime variables (https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/runtime#config-http-conn-man-runtime).
I tried this, and it works.

Patch listener.go to add tracing to the HttpConnectionManager:

```go
Tracing: &http.HttpConnectionManager_Tracing{},
```

Configure Envoy with a bootstrap config from a ConfigMap.

Add a zipkin cluster:

```yaml
clusters:
...
- name: zipkin
  ...
...
```

Add the tracing configuration:

```yaml
tracing:
  http:
    name: envoy.zipkin
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v2.ZipkinConfig
      collector_cluster: zipkin
      collector_endpoint: "/api/v2/spans"
      collector_endpoint_version: HTTP_JSON
```

Add a layered_runtime configuration to set sampling rates:

```yaml
layered_runtime:
  layers:
  - name: static_layer
    static_layer:
      tracing:
        client_enabled: 100
        global_enabled: 100
        random_sampling: 0
```
How it works now:

A simple request with random_sampling set to 0:

```bash
curl helloworld-go.default.127.0.0.1.xip.io
```

The container receives the header `X-B3-Sampled: 0`, so the request is not sampled.

Now trying with the x-envoy-force-trace header:

```bash
curl http://helloworld-go.default.127.0.0.1.xip.io -H 'x-envoy-force-trace: 1'
```

The container receives `X-B3-Sampled: 1`, so the request is sampled.
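The two results above follow from how Envoy's tracing runtime keys combine. Below is a simplified model, assuming the decision order described in Envoy's HTTP connection manager runtime docs: a client trace ID (gated by `tracing.client_enabled`), then `x-envoy-force-trace`, then `tracing.random_sampling`, with `tracing.global_enabled` applied last as an overall gate. The `Runtime` and `Request` types and the `Roll` stand-in for Envoy's per-request random value are hypothetical; note also that real Envoy only honors the force-trace header on internal requests.

```go
package main

import "fmt"

// Runtime mirrors the three tracing keys from the static_layer above
// (percentages, 0-100). Hypothetical type for illustration.
type Runtime struct {
	ClientEnabled  uint64 // tracing.client_enabled
	GlobalEnabled  uint64 // tracing.global_enabled
	RandomSampling uint64 // tracing.random_sampling
}

// Request holds only the inputs this simplified model looks at.
type Request struct {
	HasClientTraceID bool   // x-client-trace-id present
	ForceTrace       bool   // x-envoy-force-trace present on an internal request
	Roll             uint64 // stand-in for Envoy's per-request random value
}

// enabled reports whether a percentage-based runtime feature fires for roll.
func enabled(percent, roll uint64) bool {
	return roll%100 < percent
}

// sampled checks client trace ID, then force trace, then random sampling,
// and finally applies global_enabled as an overall gate.
func sampled(rt Runtime, req Request) bool {
	traced := (req.HasClientTraceID && enabled(rt.ClientEnabled, req.Roll)) ||
		req.ForceTrace ||
		enabled(rt.RandomSampling, req.Roll)
	return traced && enabled(rt.GlobalEnabled, req.Roll)
}

func main() {
	rt := Runtime{ClientEnabled: 100, GlobalEnabled: 100, RandomSampling: 0}
	fmt.Println(sampled(rt, Request{Roll: 42}))                   // false: like X-B3-Sampled: 0
	fmt.Println(sampled(rt, Request{ForceTrace: true, Roll: 42})) // true: like X-B3-Sampled: 1
}
```

With `random_sampling: 0`, only forced (or client-requested) traces get through, which matches the two curl results.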
If someone can propose a much better solution, that would be great!
Another option might be to trace 100% of requests and do sampling on the collector side...
Sorry to revive an old issue, but as discussed with @youngnick recently, to enable tracing as described by @mattalberts we had to fork Contour.
To summarize our current implementation:
1. Skip contour bootstrap and provide Envoy configuration through a volume mounted with the content of a ConfigMap. This requires no changes to Contour itself.
2. Add an --enable-tracing flag to contour serve and patch the HttpConnectionManager with pretty much verbatim code described in https://github.com/projectcontour/contour/issues/399#issuecomment-534623703

It'd be great to be able to do this natively in Contour, without having to maintain our own fork.
@pims can you send as a PR?
If (2) were merged into Contour then the only custom part would be the need to skip contour bootstrap and create an Envoy configuration manually, correct?
That would be a good first step, but I assume it would be better to also make this work with contour bootstrap. Would this be done as a "merge" operation, allowing you to provide an Envoy config that Contour then merges during bootstrap, or would we need to add a new Contour configuration option that changes bootstrap to include trace configuration for Envoy?
I have a few questions for those of you who've made this change:
What happens if you enable tracing, but don't set up the cluster in the bootstrap?
I'd really like Contour to be able to tell you something about if the config is working, in general. Do any of you who are or would use this feature want that here? What information would you want Contour to provide, in an ideal world? Please see #2495 for some other discussion around surfacing information to HTTPProxy users, and #2325 for more closely related ideas.
If we did add something to contour bootstrap to add tracing configuration, seems like it would need at least three parameters - enable-tracing, tracing-address, tracing-endpoint, and maybe whatever endpoint_version does. Does that seem right?
To be clear, I think that Contour needs to have the ability to tell Envoy to do tracing, and I want the feature in. I just want to make sure that we've made the feature useful, and operable, even for people who don't know much about Envoy.
Edit: No problems with reviving an old issue @pims. Thanks for restarting the discussion here.
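As a rough illustration of the bootstrap parameters proposed above, here is a sketch using Go's standard `flag` package. The flag names follow the comment (`--enable-tracing`, `--tracing-address`, `--tracing-endpoint`) plus a hypothetical `--tracing-endpoint-version`; Contour does not parse its flags this way, and the defaults are assumptions. Defaulting the version to `HTTP_JSON` sidesteps the deprecated `HTTP_JSON_V1` default reported further down this thread.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// TracingFlags holds the hypothetical bootstrap tracing parameters.
type TracingFlags struct {
	Enabled         bool
	Address         string // host:port of the Zipkin-compatible collector
	Endpoint        string // collector path, e.g. /api/v2/spans
	EndpointVersion string // value for Zipkin's collector_endpoint_version
}

// parseTracingFlags parses args into TracingFlags and validates them.
func parseTracingFlags(args []string) (TracingFlags, error) {
	var tf TracingFlags
	fs := flag.NewFlagSet("bootstrap", flag.ContinueOnError)
	fs.BoolVar(&tf.Enabled, "enable-tracing", false, "add tracing config to the bootstrap")
	fs.StringVar(&tf.Address, "tracing-address", "", "collector address (host:port)")
	fs.StringVar(&tf.Endpoint, "tracing-endpoint", "/api/v2/spans", "collector endpoint path")
	fs.StringVar(&tf.EndpointVersion, "tracing-endpoint-version", "HTTP_JSON", "Zipkin collector_endpoint_version")
	if err := fs.Parse(args); err != nil {
		return tf, err
	}
	if tf.Enabled && tf.Address == "" {
		return tf, errors.New("--tracing-address is required when --enable-tracing is set")
	}
	return tf, nil
}

func main() {
	tf, err := parseTracingFlags([]string{"--enable-tracing", "--tracing-address=zipkin-collector.tracing-system:9411"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", tf)
}
```

Failing fast when tracing is enabled without an address is one way to keep the feature operable for people who don't know much about Envoy.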
@youngnick nothing happens if tracing is enabled but not configured in bootstrap, nor if the tracing cluster is unavailable. I can't really test those two cases with regard to memory/cpu consumption on a busy cluster at this time though.
The only error we've experienced during upgrades is:

```
Proto constraint validation failed (Using the default now-deprecated value HTTP_JSON_V1 for enum 'envoy.config.trace.v2.ZipkinConfig.collector_endpoint_version' from file trace.proto. This enum value will be removed from Envoy soon so a non-default value must now be explicitly set. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.): collector_cluster: "zipkin"
collector_endpoint: "/api/v2/spans"
```

when we didn't specify the collector_endpoint_version.
Thanks for that @pims! So, do you think that Contour (that is, contour serve) needs to tell you anything other than "Hey, you enabled tracing"? Feels like contour serve would need at least a cluster name (that we could default if necessary), and we could get away with just logging "Traces will be sent to this cluster, if it's up".
As an aside, I'd really like a way to put some of this info on some object's status, somewhere. Not sure where that would be, yet.
What do you think about my suggestion for contour bootstrap as well? Do you think that it captures the info the bootstrap config would need, in a general case?
Also, everyone watching this issue, but particularly @v0id3r: what do you think about configuring sampling? Is it worth bringing this feature in with everything set to 100% sampling, and adding configurability later, or do we need to discuss sampling now too?
@youngnick this is a test configuration that should highlight the moving parts:
```yaml
static_resources:
  listeners:
  - address:
      socket_address:
        address: 127.0.0.1
        port_value: 8002
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
          tracing: {} # empty config is enough, no need for cluster info here.
          codec_type: auto
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: my-service
          http_filters:
          - name: envoy.filters.http.router
            typed_config: {}
          use_remote_address: true
  clusters:
  - name: my-service
    connect_timeout: 0.250s
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: my-service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8003
  - name: zipkin # this is a typical cluster, nothing tracing specific
    connect_timeout: 1s
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: zipkin
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 9411
# this is the key part
tracing:
  http:
    name: envoy.tracers.zipkin
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v2.ZipkinConfig
      collector_cluster: zipkin # matches the cluster name defined above
      collector_endpoint: /api/v2/spans
      collector_endpoint_version: HTTP_JSON
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 8001
```
So something along the lines of --tracing-cluster=zipkin should be enough, given an Envoy configuration with a static cluster of the same name and tracing.http configured as required.
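On the `contour serve` side, the check discussed earlier (a cluster name, plus a log line saying where traces will go) might look roughly like this. `--tracing-cluster` is the flag name floated above; the `checkTracingCluster` helper and passing the bootstrap's static cluster names in as a slice are assumptions for illustration, and how Contour would learn those names is left open.

```go
package main

import (
	"fmt"
	"log"
)

// checkTracingCluster (hypothetical) returns the message contour serve could
// log when tracing is enabled, or an error if the named cluster is not among
// the static clusters defined in the bootstrap.
func checkTracingCluster(tracingCluster string, staticClusters []string) (string, error) {
	for _, name := range staticClusters {
		if name == tracingCluster {
			return fmt.Sprintf("tracing enabled: traces will be sent to cluster %q, if it is up", tracingCluster), nil
		}
	}
	return "", fmt.Errorf("tracing cluster %q is not defined in the bootstrap static_resources", tracingCluster)
}

func main() {
	msg, err := checkTracingCluster("zipkin", []string{"my-service", "zipkin"})
	if err != nil {
		log.Fatal(err)
	}
	log.Println(msg)
}
```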
For bootstrap, I believe it's a bit more complicated given the different tracers:
// - *envoy.tracers.lightstep*
// - *envoy.tracers.zipkin*
// - *envoy.tracers.dynamic_ot*
// - *envoy.tracers.datadog*
// - *envoy.tracers.opencensus*
// - *envoy.tracers.xray*
the config is provider specific, which makes it quite difficult to generalize in the bootstrap process.
I'm not quite up-to-date on opencensus and when/if we can expect it to be the standard. I'd say it's easier to punt and let the static configuration deal with it.
So, for your use case @pims, the only code changes required to contour serve is that the HCM should have the empty tracing config? Then you would configure the tracing clusters and the rest using a custom bootstrap config you generate yourself? In that case, a config setting (enableTracing: true or similar) would seem to be sufficient.
Other people watching this issue, would this meet your bar for an MVP solution? Another important thing to note is that we have not changed to the v3 API yet, so we will need to check v2 docs for many xDS things.
@youngnick for our use-case, yes, all that's needed on contour's side is enabling the empty tracing config for HCM, everything else is taking care of by our bootstrap config.
I'm happy to submit a PR with our changes to get the ball rolling on this if it helps.
Thanks @pims, I'd like if possible to hear from anyone else who's watching this issue. Would requiring that you generate your own bootstrap config be a good first step for now?
I'll have a think about if then we do a first cut of a contour bootstrap update assuming a zipkin tracing type.
@youngnick asked:
> Is it worth bringing this feature in with everything set to 100% sampling
Seems like both OpenTelemetry and Honeycomb are saying (a) the new hotness is tail-based sampling and (b) to do that well right now you need to centralize ALL your traces and do that sampling in one spot (where the feature is alpha), or else use a custom-built version of something or other. Perhaps later the tools will improve on that point.
So FWIW I, personally, am about to venture down that path and try to sample later, instead of in the apps (or in, say, Envoy).
Fair enough.
We've also added the new ExtensionService CRD, initially to help with external auth (as outlined in the design doc for auth), but it's been built to be able to add any Kubernetes service as a referenceable Envoy cluster. So, theoretically, it can be used to avoid having to supply a custom bootstrap with your tracing configuration.
That means that, in order to implement tracing we'll need to:
Basically, the next step is that we need a design doc for how to use the new ExtensionService to implement tracing. We don't currently have it listed on our roadmap (oddly; we've obviously missed it in planning), but if anyone wants to take this up for themselves, or would like for it to be officially on there, this is the place to say so.
We do have a couple of must-deliver things soon (the xDS v3 upgrade is the most pressing), so I can't make any guarantees yet about when we can get to this. But requests from users are the fastest way for me to be able to shuffle priorities, hint hint.
I'm not sure if this should be a global Contour config, a global default overridable in HTTPProxy, or a per-HTTPProxy virtualhost config. I'd love to hear from people using this right now what you'd prefer.
Per-namespace is how I would envision it for our use case (multi-tenant cluster).
Variations I can think of at the namespace level:
I suppose a per-HTTPProxy design also works; we could inherit from a namespace config through an admission controller.
For reference, this blog post mostly matches what we have in mind:
https://itnext.io/jaegers-multitenancy-with-elasticsearch-ae318501f415
The way I see it:
Personally my hope for tracing with Contour is to be able to just tell Envoy where to dump all its tracing to, globally, with the potential to override it per-namespace. Specifically I'd like to tell it to dump OpenCensus format traces to a locally deployed OpenTelemetry collector using https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/trace/v2/opencensus.proto (v2). I'd really only need ocagent_address exposed, and have that automatically set ocagent_exporter_enabled if set. I think that would allow quite a few use cases since once the traces are in the OpenTelemetry collector you can export it to basically any observability backend (jaeger, zipkin, stackdriver, AWS x-ray, honeycomb, datadog, etc).
Presumably, longer term, Envoy will add support for OpenTelemetry-format traces directly, but for now OpenCensus seems the most generic option.
We generally haven't done any config on a per-namespace basis, really, for Contour, because of the way Contour is triggered to generate Envoy config (basically, when one of a set of objects change). It seems like a global option, overridable per-HTTPProxy, would meet the most use cases out of this issue.
I'll wait to hear from any other interested parties, but if that's the case, the next step will be a design document, outlining what we'll support (if it's opencensus or some other format, and how the global with override setup will work).
We will probably also need to include samples of config to pass to Gatekeeper for the case that someone wants to _prevent_ HTTPProxies from overriding the config, as we have done for other settings.
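The "global option, overridable per-HTTPProxy" model boils down to a simple resolution rule. A minimal sketch, assuming a hypothetical `TracingConfig` type (nothing here is an existing Contour API) and treating a nil override as "inherit the global setting":

```go
package main

import "fmt"

// TracingConfig is a hypothetical tracing setting.
type TracingConfig struct {
	Enabled          bool
	CollectorAddress string
}

// effectiveTracing returns the per-HTTPProxy override when one is set,
// otherwise the global default from the Contour config file.
func effectiveTracing(global TracingConfig, override *TracingConfig) TracingConfig {
	if override != nil {
		return *override
	}
	return global
}

func main() {
	global := TracingConfig{Enabled: true, CollectorAddress: "otel-collector.tracing:55678"}
	fmt.Println(effectiveTracing(global, nil).CollectorAddress)                 // inherits the global collector
	fmt.Println(effectiveTracing(global, &TracingConfig{Enabled: false}).Enabled) // false: proxy opted out
}
```

Preventing overrides would live outside this function, for example in the Gatekeeper policies mentioned above.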
I agree with @johanbrandhorst. We had to make a fork to configure opencensus. We're using envoy as an edge proxy and the sampling decision is being made at this point. That decision is being propagated through the network to the rest of the services. Then we're centralizing the metrics/traces ingestion with the opentelemetry collector.
> Specifically I'd like to tell it to dump OpenCensus format traces to a locally deployed OpenTelemetry collector using https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/trace/v2/opencensus.proto (v2). I'd really only need ocagent_address exposed, and have that automatically set ocagent_exporter_enabled if set.
We took the same approach but added an additional flag to set the probability sampling value. In the future, we'll move to tail-based sampling in the OpenTelemetry collector.
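The two knobs described here (an `ocagent_address` that implicitly enables the exporter, plus a sampling probability) can be sketched as a small options type. This is a hypothetical illustration: the field names only mirror the Envoy v2 `OpenCensusConfig` fields quoted above (`ocagent_address`, `ocagent_exporter_enabled`), it is a plain struct rather than the go-control-plane type, and the 0 to 1 range check on the probability is an assumption about how such a flag would be validated.

```go
package main

import (
	"fmt"
	"math"
)

// OpenCensusOptions mirrors the OpenCensusConfig fields discussed above.
type OpenCensusOptions struct {
	OcagentAddress         string
	OcagentExporterEnabled bool
	SamplingProbability    float64 // hypothetical extra flag, 0.0 to 1.0
}

// newOpenCensusOptions enables the ocagent exporter only when an address is
// given, and rejects sampling probabilities outside [0, 1].
func newOpenCensusOptions(address string, probability float64) (OpenCensusOptions, error) {
	if math.IsNaN(probability) || probability < 0 || probability > 1 {
		return OpenCensusOptions{}, fmt.Errorf("sampling probability %v not in [0, 1]", probability)
	}
	return OpenCensusOptions{
		OcagentAddress:         address,
		OcagentExporterEnabled: address != "",
		SamplingProbability:    probability,
	}, nil
}

func main() {
	opts, err := newOpenCensusOptions("dns:///otel-collector.tracing:55678", 0.1)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", opts)
}
```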
Sounds like we really need two, maybe three settings:
Questions I have for people watching this issue:
I'll give this one two weeks lazy consensus timeout - I'll mention it on the meetings and maybe the mailing list, then consider it okayed if there are no objections by EOD November 9, 2020, US Pacific time. If you want something else in the initial featureset, or you need it sooner, please speak now!
I think we _could_ do without probability sampling as an option initially - as mentioned before the world is moving to a tail based sampling model where the aggregators choose what to sample and the producers just blindly fire things to the sink.
Just as a stop gap - what can I do today if I want to play around with this? Modify the Contour ConfigMap before starting contour? What do I have to set exactly? I'm not familiar with using it as a proxy for envoy configuration. Thanks!
> Does OpenCensus meet everyone's needs?
Note, OpenCensus is "deprecated". The project has merged with OpenTracing to become https://opentelemetry.io/
We hope to have a GA release soon. Contour should definitely move forward with OpenTelemetry and not OpenCensus -- or any other option in my opinion :)
Envoy currently does not support exporting OpenTelemetry traces, or that would have been the obvious choice. OpenCensus is indeed deprecated, but in the interest of adding _some_ support: OpenCensus still has wide library support in a range of languages, and the OpenTelemetry collector can consume OpenCensus traces, so it's also a forward-compatible choice.
Ah, so Envoy is instrumented with OpenCensus? Then that makes sense.
OpenCensus and OpenTelemetry use the same propagation format (W3C Trace Context) so it also means propagation upstream will work to OpenTelemetry instrumented services.
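For reference, the W3C Trace Context `traceparent` header has the form `version-traceid-parentid-flags`, for example `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`. Below is a minimal parser sketch for version `00` only; it is not taken from either library, and `parseTraceparent` is a hypothetical helper.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the fields of a version-00 traceparent header.
type TraceParent struct {
	TraceID  string // 32 lowercase hex chars, not all zeros
	ParentID string // 16 lowercase hex chars, not all zeros
	Sampled  bool   // lowest bit of trace-flags
}

// isLowerHex reports whether s has length n and only lowercase hex digits.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceparent parses a version-00 W3C traceparent header value.
func parseTraceparent(v string) (TraceParent, error) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceParent{}, errors.New("unsupported traceparent format or version")
	}
	traceID, parentID, flags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || traceID == strings.Repeat("0", 32) {
		return TraceParent{}, errors.New("invalid trace-id")
	}
	if !isLowerHex(parentID, 16) || parentID == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("invalid parent-id")
	}
	if !isLowerHex(flags, 2) {
		return TraceParent{}, errors.New("invalid trace-flags")
	}
	b, _ := hex.DecodeString(flags) // cannot fail after the isLowerHex check
	return TraceParent{TraceID: traceID, ParentID: parentID, Sampled: b[0]&0x01 == 1}, nil
}

func main() {
	tp, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.Sampled, err) // true <nil>
}
```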
TIL about OpenCensus vs OpenTelemetry, thanks @tsloughter!
> Just as a stop gap - what can I do today if I want to play around with this? Modify the Contour ConfigMap before starting contour? What do I have to set exactly? I'm not familiar with using it as a proxy for envoy configuration. Thanks!
It looks from @pims previous examples like you may be able to get a test version of this working by creating your own bootstrap config and modifying it - contour bootstrap is currently used by the Envoy deployment in an init container to generate the bootstrap for Envoy proper to consume - so you could take that output and munge it a bit to see what you can get working.
That would be valuable data for a design. You (or anyone) could start by PRing a new design document into the design/ directory, based on the design template. Once we've agreed on the design (by merging the PR), anyone can start working on the code.
Having a quick look at the Envoy docs, I'm not sure how the per-httpproxy override would work, but that seems like an important detail to note in a design doc.
Please feel free to ask more questions here, or ping me on Kubernetes slack @youngnick. If you would like to work on the design doc, you can signal that by assigning this issue to yourself for now. If you can't that's fine too, you can also come to the community meeting this week and talk about it if that's what you would prefer.
tl;dr Next steps:
Since there are some questions above, it may be useful to recap a couple things.
Envoy support:
OpenTelemetry (OTel):
Sampling: when we talk about sampling, there's a lot of extra wrinkles, but the basic idea people seem to be pursuing with sampling is that if you record the sample rate applied to chosen traces (e.g., "this chosen sample is 1% of the time" and "this other chosen sample is sampled on some rule that's kept 70% of the time") then later you can statistically reconstruct the overall estimated occurrence of events like those kept. IMHO a first version of tracing configuration from Contour doesn't need to do any sampling, but may benefit from noting that "100% included" fact when sending traces elsewhere.
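The statistical reconstruction described here is usually done by weighting each kept trace by the inverse of its sampling probability. A minimal sketch, assuming each kept trace records the probability it was kept with (the `KeptTrace` type is hypothetical), using the 1% and 70% figures from the comment plus one "100% included" trace:

```go
package main

import "fmt"

// KeptTrace is a trace that survived sampling, tagged with the probability
// with which it was kept (1.0 means "100% included").
type KeptTrace struct {
	Name        string
	Probability float64
}

// estimateTotal estimates how many traces occurred before sampling by
// counting each kept trace as 1/p original traces.
func estimateTotal(kept []KeptTrace) float64 {
	total := 0.0
	for _, t := range kept {
		if t.Probability > 0 {
			total += 1 / t.Probability
		}
	}
	return total
}

func main() {
	kept := []KeptTrace{
		{Name: "sampled 1% of the time", Probability: 0.01},
		{Name: "kept by a rule 70% of the time", Probability: 0.70},
		{Name: "unsampled (100% included)", Probability: 1.0},
	}
	fmt.Printf("%.2f\n", estimateTotal(kept)) // 102.43
}
```

This is why recording the "100% included" fact matters: an unrecorded rate would force downstream tooling to guess the weight.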
So what's being generally suggested above makes sense! If Contour can (1) configure Envoy's OpenCensus plugin for now (and Envoy's OpenTelemetry plugin when that's supported) to export to the address of a given collector system, and (2) configure propagation of the W3C Trace Context headers, then we'll be able to insert Envoy spans into a lot of distributed traces right away.
I'd try to turn this long comment into a spec., but I'm not sure I know enough about the internals of either Envoy's plugin configuration or Contour right now. 😅
Having no knowledge of how anything in Contour works, I spent some time this morning playing around with code changes to manually set the tracing configuration. I hacked together something in bootstrap.go just as a starting point:
```diff
diff --git a/internal/envoy/v2/bootstrap.go b/internal/envoy/v2/bootstrap.go
index 6084da1b..0f00230c 100644
--- a/internal/envoy/v2/bootstrap.go
+++ b/internal/envoy/v2/bootstrap.go
@@ -29,6 +29,7 @@ import (
 	clusterv2 "github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster"
 	envoy_api_v2_core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
 	envoy_api_bootstrap "github.com/envoyproxy/go-control-plane/envoy/config/bootstrap/v2"
+	envoy_config_trace_v2 "github.com/envoyproxy/go-control-plane/envoy/config/trace/v2"
 	matcher "github.com/envoyproxy/go-control-plane/envoy/type/matcher"
 	"github.com/golang/protobuf/proto"
 	"github.com/golang/protobuf/ptypes/any"
@@ -141,6 +142,21 @@ func bootstrap(c *envoy.BootstrapConfig) ([]bootstrapf, error) {
 				upstreamSdsTLSContext(sdsTLSCertificatePath, sdsValidationContextPath))
 			return c.Path, b
 		},
+		func(*envoy.BootstrapConfig) (string, proto.Message) {
+			b := bootstrapConfig(c)
+			b.Tracing = &envoy_config_trace_v2.Tracing{
+				Http: &envoy_config_trace_v2.Tracing_Http{
+					Name: "envoy.tracers.opencensus",
+					ConfigType: &envoy_config_trace_v2.Tracing_Http_TypedConfig{
+						TypedConfig: protobuf.MustMarshalAny(&envoy_config_trace_v2.OpenCensusConfig{
+							OcagentAddress:         "dns:///<open-telemetry-collector-address>",
+							OcagentExporterEnabled: true,
+						}),
+					},
+				},
+			}
+			return c.Path, b
+		},
 	)
 	return steps, nil
```
I rebuilt the Docker image and applied it to my local cluster. This did appear to set the Envoy bootstrap tracing configuration correctly, but I was getting gRPC connection errors in the Envoy logs and didn't have any more time to debug it. It might be a good starting point for someone else.

```
[bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:93] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection termination
```
> So what's being generally suggested above makes sense! If Contour can (1) configure Envoy's OpenCensus plugin for now (and Envoy's OpenTelemetry plugin when that's supported) to export to the address of a given collector system, and (2) configure propagation of the W3C Trace Context headers, then we'll be able to insert Envoy spans into a lot of distributed traces right away.
I expect that we will need to be able to configure a few different tracing systems. The space is really complex right now, as you point out. Even if OpenCensus is the emerging standard, right now I'd expect that a lot of people are using systems built around Zipkin or Jaeger.
> we will need to be able to configure a few different tracing systems.
What you'll want is the ability to configure the propagation and export format.
So for Zipkin that is the B3 spec for propagation, and for Jaeger I'm not sure they have an actual standard for their header keys, so it's best to just allow the user to configure the header key used.
For export it is similar, zipkin and jaeger both have protocols, and all the providers (Stackdriver, Lightstep, Honeycomb, etc) have their own as well.
But with OpenTelemetry (and OpenCensus) the libraries support all those propagation and export protocols, so using a single instrumentation library lets the user configure B3 (Zipkin) propagation while their service might be a Java app instrumented with Brave. Using OTel or OC, and exposing their configuration, allows the user's Java app to remain unchanged.
Thanks @tsloughter for the point about the propagation and export formats being separate things. That is a good point for whoever ends up doing the design doc for this.
I'm mostly in sync with what @inigohu, @johanbrandhorst & @kevincantu proposed. In my opinion we can base the design doc on these premises.
otel-collector is going to support both protocols, as their initial aim is to help people transition easily.

WDYT?
Yes, that sounds like a great plan, thanks @glerchundi!
I'll be honest, I don't think the current maintainers are going to be able to get to this design doc until early next year sometime, so if anyone else wants to get started before then, please see the design doc template in design/ in this repo, and feel free to post a WIP PR for early feedback. Of course, posting a non-WIP PR is fine too!
I've been thinking about this, and before getting involved in the design doc I would like to know what people think about a key question: multi-tenant deployments.
I started by thinking that the best approach would be a global configuration (i.e. in the config file), but now I'm more inclined to configure tracing on a per-HTTPProxy basis. With the new Envoy API (v3), tracing could be configured at the HTTP connection manager level, which means that each TLS-based virtualhost would have its own tracing configuration parameters.
The drawback here is the same one you're facing with rate limiting support: how should we deal with non-TLS virtualhosts, which all converge on the same HTTP connection manager and therefore share the same tracing config? @skriss raised an issue where an answer to the third bullet point would resolve the concerns stated here.
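The constraint can be sketched as a grouping step: each TLS virtualhost gets its own filter chain and HTTP connection manager, while all non-TLS virtualhosts share one, so a per-virtualhost tracing policy can only be honored on the insecure listener if every insecure virtualhost agrees. This is a hypothetical model (`VirtualHost`, `TracingPolicy`, and `hcmTracing` are not Contour types) that reports a conflict in that case rather than picking a winner.

```go
package main

import "fmt"

// TracingPolicy is a hypothetical per-virtualhost tracing setting.
type TracingPolicy struct {
	Enabled bool
}

// VirtualHost is a simplified HTTPProxy virtualhost.
type VirtualHost struct {
	FQDN    string
	TLS     bool
	Tracing TracingPolicy
}

// hcmTracing returns the tracing policy for each HTTP connection manager:
// one per TLS virtualhost (keyed by FQDN) and a single shared entry
// (keyed "insecure") for all non-TLS virtualhosts. It returns an error if
// the non-TLS virtualhosts disagree, since they share one HCM.
func hcmTracing(vhosts []VirtualHost) (map[string]TracingPolicy, error) {
	out := map[string]TracingPolicy{}
	var insecure *TracingPolicy
	for _, vh := range vhosts {
		if vh.TLS {
			out[vh.FQDN] = vh.Tracing
			continue
		}
		if insecure == nil {
			p := vh.Tracing
			insecure = &p
		} else if *insecure != vh.Tracing {
			return nil, fmt.Errorf("virtualhost %q: conflicting tracing policy on shared non-TLS HCM", vh.FQDN)
		}
	}
	if insecure != nil {
		out["insecure"] = *insecure
	}
	return out, nil
}

func main() {
	_, err := hcmTracing([]VirtualHost{
		{FQDN: "a.example.com", Tracing: TracingPolicy{Enabled: true}},
		{FQDN: "b.example.com", Tracing: TracingPolicy{Enabled: false}},
		{FQDN: "secure.example.com", TLS: true, Tracing: TracingPolicy{Enabled: true}},
	})
	fmt.Println(err)
}
```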
So lets start a poll to see what do you think:
tracingPolicy at the virtualhost levelUPDATE: By the way, sorry for choosing those biased emojis...
@glerchundi Does following the pattern for external auth make sense here? IMHO we should aim to make similar things similar :)
This is an interesting approach @jpeach, but I wonder whether it is too much for this. Let me sketch how it would look, and tell me what you think:
Extension service definition:
apiVersion: projectcontour.io/v1alpha1
kind: ExtensionService
metadata:
  namespace: projectcontour
  name: tracing
spec:
  services:
  - name: tracing
    port: 55678
HTTPProxy extension point:
tracingService:
  extensionService: projectcontour/tracing
The benefit of this design is that it can be extended easily, for example with the format in which we want to send the traces (although I would avoid that initially).
I would expect a relatively minimal config to further include (or default to) the tracing provider (e.g. opencensus) and the trace context format (e.g. w3c).
Edit: the latter seems to be part of the OpenCensus configuration in v3.
I guess both would be part of the extension service definition, @glerchundi?
Edit 2: oh, so for auth the extension service is really just a reference to another Kubernetes service...
So this choice of tracing provider and provider-specific config (e.g., OpenCensus) would be defined as part of the HTTPProxy?
To answer all the questions I can see here:
tracingService stanza). @jpeach, I'm interested in your thoughts on this one.
Although I wanted to invest time in this design doc, I won't be able to do so for at least the next 3-6 months.
Just wanted to share this with you all to avoid creating false expectations, and to unblock things in case someone else wants to take this and push it forward.
🙏
Wanted to revive this thread and update everyone that we are looking at adding tracing now, perhaps in the next release or the one after that. A lot of the comments already touch on use cases and what tracing stack to use, so just adding couple of quick thoughts on how we might go about this here.
Regarding which tracing technologies to integrate with, we're pretty open. Jaeger is very popular right now because its distributed architecture makes it highly scalable. Both Zipkin and Jaeger support OpenTracing and OpenTelemetry and have pretty robust communities.
cc @skriss
We could also look at making tracing data actionable, e.g. for preventative security measures. I don't know if OPA is appropriate here. The scenario I have in mind is that services should be accessed in a particular order, i.e. you can only hit the 'shipping' service after hitting the 'checkout' service. If it happens the other way around, this should be logged and surfaced to the admin.
Tagging this 1.17 to signal that we are now looking into the design seriously.
Envoy v3 API reference: https://www.envoyproxy.io/docs/envoy/v1.13.1/api-v3/config/trace/v3/trace.proto
I vaguely remember an upstream envoy discussion about some tracers being removed, I’ll try to find it
from https://github.com/projectcontour/contour/issues/399#issuecomment-741624428:
The drawback here is the same you're also having with rate limiting support and it's how we should deal with non-TLS virtualhosts where all converge in the same HTTP connection manager and therefore in the same tracing config. @skriss raised an issue where an answer for the third bullet point will solve the stated concerns here.
Yeah, unfortunately we still don't have a solution here. For rate limiting we ended up going with a single global rate limit service configured at the Contour level. We could do the same for tracing, though the votes on the original comment make it clear that folks prefer per-vhost settings. If we go with per-vhost settings, then they can only be provided for TLS vhosts, since all non-TLS vhosts share a single HCM/tracing config.
I think that adding a single, global tracing sink first is at least a start, and we then add the ability to override that sink on a per-object basis later?
Edit: I should note that if you wanted to use a sidecar sink, a single global one would be fine, since you could point it at localhost and run the tracing sink in the Envoy pods. This would work well for tracing sinks that allow a separate ingestion tier that handles sampling, etc.
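The "global sink first, per-object override later" approach could resolve the effective sink roughly as follows. This Go sketch assumes a hypothetical global setting that may point at a sidecar collector on localhost in the Envoy pod. It also assumes an optional per-vhost override that is honored only for TLS vhosts, since non-TLS vhosts share one connection manager. All type names and the addresses are illustrative.

```go
package main

import "fmt"

// Sink is a hypothetical tracing sink address (host:port).
type Sink string

// GlobalConfig is the hypothetical Contour-wide default, e.g. a sidecar
// collector running in the Envoy pod and reached over localhost.
type GlobalConfig struct {
	Sink Sink
}

// VHost is a hypothetical virtual host with an optional sink override.
type VHost struct {
	FQDN     string
	TLS      bool
	Override Sink // empty means "use the global sink"
}

// EffectiveSink returns the sink to use for vh. The second result reports
// whether an override was requested but ignored because vh is not TLS.
func EffectiveSink(global GlobalConfig, vh VHost) (Sink, bool) {
	if vh.Override == "" {
		return global.Sink, false
	}
	if !vh.TLS {
		// Non-TLS vhosts share one HCM, so they must use the global sink.
		return global.Sink, true
	}
	return vh.Override, false
}

func main() {
	global := GlobalConfig{Sink: "127.0.0.1:55678"}
	for _, vh := range []VHost{
		{FQDN: "a.example.com", TLS: true, Override: "tracing.team-a:55678"},
		{FQDN: "b.example.com", Override: "tracing.team-b:55678"},
		{FQDN: "c.example.com"},
	} {
		sink, ignored := EffectiveSink(global, vh)
		fmt.Println(vh.FQDN, sink, ignored)
	}
}
```

Shipping only the global path first keeps the behavior identical for every vhost. The override branch can be added later without changing the default.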
I think that adding a single, global tracing sink first is at least a start, and we then add the ability to override that sink on a per-object basis later?
That'd be consistent with rate limiting, so I like that in principle.