Contour: Tracing Configuration

Created on 17 May 2018  ·  57 Comments  ·  Source: projectcontour/contour

It would be nice to expose the bootstrap config a bit more with things like tracing and stats configs. We could even offer something like a file path that gets merged into the bootstrap YAML on boot.

For context, we're building our own control plane for service -> service mesh for Envoy. We're including stats and Zipkin traces (via the jaeger endpoint). We lose the trace context between Contour and our sidecar'd envoys because we can't configure Contour to include trace information.

kind/feature

Most helpful comment

I've been thinking about this, and before getting involved in the design doc I would like to know what people think about a key question: multi-tenant deployments.

I started by thinking that the best approach would be to have a global configuration (i.e. in the config file), but now I'm more inclined to configure tracing on a per-HTTPProxy basis. With the new Envoy API (v3), tracing can be configured at the HTTP connection manager level, which means that each TLS-based virtualhost has its own tracing configuration parameters.

The drawback here is the same one you're having with rate limiting support: how should we deal with non-TLS virtualhosts, which all converge on the same HTTP connection manager and therefore share the same tracing config? @skriss raised an issue where an answer to the third bullet point would resolve the concerns stated here.

So let's start a poll to see what you think:

  • 👍 : tracingPolicy at the virtualhost level
  • 👎 : global config

UPDATE: By the way, sorry for choosing those biased emojis...

All 57 comments

Hi @bobbytables. To be very transparent, merging pieces of abstract configuration is not a feature I plan to add.

However, the contour bootstrap initContainer is not required to use Contour, it's just a convenience. You could replace the output of contour bootstrap with a ConfigMap into a volume. See #1

Roger that, I'll probably end up doing the mounted configmap then. Thanks!

That’s the approach I’d recommend. If you check out the heptio/gimbal repo, they’re probably already doing that.

@bobbytables @davecheney
I loved this suggestion (replacing contour bootstrap with a ConfigMap mounted volume); it's fantastically simple. I'm also trying to enable tracing (span injection) at the ingress controller; here are the results of my experiment.

TL;DR

  • configuring envoy opentracing manually (without bootstrap) is easy
  • envoy.http_connection_manager still needs a "tracing" definition to emit headers
  • I have a patch! 😄 (if you're interested)

Contour with OpenTracing

First, deploy contour without bootstrap and mount an updated envoy configuration from a ConfigMap.

---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: contour
    role: ingress
  name: contour-config
  namespace: ingress-system
data:
  contour.yaml: |
    dynamic_resources:
      lds_config:
        api_config_source:
          api_type: GRPC
          cluster_names: [contour]
          grpc_services:
          - envoy_grpc:
              cluster_name: contour
      cds_config:
        api_config_source:
          api_type: GRPC
          cluster_names: [contour]
          grpc_services:
          - envoy_grpc:
              cluster_name: contour
    static_resources:
      clusters:
      - name: contour
        connect_timeout: { seconds: 5 }
        type: STRICT_DNS
        hosts:
          - socket_address:
              address: 127.0.0.1
              port_value: 8001
        lb_policy: ROUND_ROBIN
        http2_protocol_options: {}
        circuit_breakers:
          thresholds:
            - priority: high
              max_connections: 100000
              max_pending_requests: 100000
              max_requests: 60000000
              max_retries: 50
            - priority: default
              max_connections: 100000
              max_pending_requests: 100000
              max_requests: 60000000
              max_retries: 50
      - name: service_stats
        connect_timeout: 0.250s
        type: LOGICAL_DNS
        lb_policy: ROUND_ROBIN
        hosts:
          - socket_address:
              protocol: TCP
              address: 127.0.0.1
              port_value: 9001
      - name: zipkin
        connect_timeout: { seconds: 5 }
        type: LOGICAL_DNS
        lb_policy: ROUND_ROBIN
        hosts:
          - socket_address:
              protocol: TCP
              address: zipkin-collector.tracing-system
              port_value: 9411
    tracing:
      http:
        name: envoy.zipkin
        config:
          collector_cluster: zipkin
          collector_endpoint: /api/v1/spans
    admin:
      access_log_path: /dev/null
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 9001
---

```bash
kubectl create ns ingress-system
kubectl apply -f issue-399.yaml -l "app=contour"
kubectl -n ingress-system logs $POD_NAME -c envoy
```

[2018-07-19 17:52:21.931][1][info][main] source/server/server.cc:178] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2018-07-19 17:52:21.935][1][info][config] source/server/configuration_impl.cc:52] loading 0 listener(s)
[2018-07-19 17:52:21.937][1][info][config] source/server/configuration_impl.cc:92] loading tracing configuration
[2018-07-19 17:52:21.938][1][info][config] source/server/configuration_impl.cc:101]   loading tracing driver: envoy.zipkin
[2018-07-19 17:52:21.938][1][info][config] source/server/configuration_impl.cc:119] loading stats sink configuration
[2018-07-19 17:52:21.939][1][info][main] source/server/server.cc:353] starting main dispatch loop

Echo

Next, deploy a service to help inspect the upstream request.

kubectl apply -f issue-399.yaml -l "app=echo"
curl -iv http://echo.127.0.0.1.xip.io/

Request Headers:
    accept=*/*
    content-length=0
    host=echo.127.0.0.1.xip.io
    user-agent=curl/7.60.0
    x-envoy-expected-rq-timeout-ms=15000
    x-envoy-internal=true
    x-forwarded-for=192.168.65.3
    x-forwarded-proto=http
    x-request-id=b645e1b6-2634-43f5-bd95-b6bac6b61c26
  • missing tracing headers in the request.

If I update the listener creation code (specifically, the function httpFilter) to conditionally add a tracing definition, spans are emitted correctly to my configured opentracing backend.

+       if enabletracing {
+               filter.Config.Fields["tracing"] = st(map[string]*types.Value{
+                       "operation_name": sv("egress"),
+               })
+       }
+       return filter
  • This isn't all of the patch, but it's the relevant part

Swapping out the container definition with this:

        - name: contour
          image: docker.io/mattalberts/contour:0.5.0-test
          imagePullPolicy: Always
          command: ["contour"]
          args:
            - serve
            - --incluster
            - --enable-tracing
            - --ingress-class-name=contour

curl -iv http://echo.127.0.0.1.xip.io/

Request Headers:
    accept=*/*
    content-length=0
    host=echo.127.0.0.1.xip.io
    user-agent=curl/7.60.0
    x-b3-sampled=1
    x-b3-spanid=31bd1c8e3db02128
    x-b3-traceid=31bd1c8e3db02128
    x-envoy-expected-rq-timeout-ms=15000
    x-envoy-internal=true
    x-forwarded-for=192.168.65.3
    x-forwarded-proto=http
    x-request-id=45b03ab1-2271-92e0-b926-c8946aa9c0d3
  • tracing headers in the request!

My patch currently relies on a flag --enable-tracing. Looking through the envoy code base, the overhead when tracing is enabled feels fairly minimal, but it does cause extra work to be done.

issue-399.yaml.txt

  • remove the .txt

My initial work used a command-line option --enable-tracing to globally add the tracing definitions to all listeners. Looking through the annotations, it might make more sense to expose this as an annotation. Which would be preferred?

  1. per-service annotation enable?
  2. global enable with per-service annotation disable?
  3. global enable?

I'm leaning towards 1 or 2.
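
To make the three options concrete, here is a minimal Go sketch of how the decision could be resolved per service. The `Policy` type and the annotation key `contour.example.com/tracing` are hypothetical names made up for illustration; nothing like them exists in Contour today.

```go
package main

import "fmt"

// Policy mirrors the three options listed above (hypothetical type).
type Policy int

const (
	PerServiceOptIn  Policy = iota + 1 // 1: off unless the service opts in
	GlobalWithOptOut                   // 2: on unless the service opts out
	GlobalOnly                         // 3: on everywhere, annotations ignored
)

// tracingAnnotation is a made-up annotation key used only in this sketch.
const tracingAnnotation = "contour.example.com/tracing"

// tracingEnabled decides whether a service's listener gets a tracing block.
func tracingEnabled(p Policy, annotations map[string]string) bool {
	v, ok := annotations[tracingAnnotation]
	switch p {
	case PerServiceOptIn:
		return ok && v == "enabled"
	case GlobalWithOptOut:
		return !(ok && v == "disabled")
	case GlobalOnly:
		return true
	}
	return false
}

func main() {
	optOut := map[string]string{tracingAnnotation: "disabled"}
	fmt.Println(tracingEnabled(PerServiceOptIn, nil))     // false
	fmt.Println(tracingEnabled(GlobalWithOptOut, nil))    // true
	fmt.Println(tracingEnabled(GlobalWithOptOut, optOut)) // false
	fmt.Println(tracingEnabled(GlobalOnly, optOut))       // true
}
```

Options 1 and 2 differ only in the default for services that carry no annotation, which is probably the main thing to agree on.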

@rosskukulinski I probably missed something. Setting up the static cluster/trace resource was easy; I used the zipkin trace config

      - name: zipkin
        connect_timeout: { seconds: 5 }
        type: LOGICAL_DNS
        lb_policy: ROUND_ROBIN
        hosts:
          - socket_address:
              protocol: TCP
              address: zipkin-collector.tracing-system
              port_value: 9411
    tracing:
      http:
        name: envoy.zipkin
        config:
          collector_cluster: zipkin
          collector_endpoint: /api/v1/spans

The more difficult part was enabling it as part of the envoy filter. All the code has very much moved around by now :), I should pull 0.8.1 and play with it (I'm living back on 0.5.0).

Moving to the unplanned milestone. We don’t plan on looking at this til after Contour 1.0

Blocked:

  • [x] #1130

The approach we have taken for structured logs (#624) without having to fork Contour is to introduce a middleware gRPC proxy between Envoy and Contour and activate the gRPC Access Log Service (ALS).

This effectively augments the xDS responses "in flight" to enable certain features (such as ALS) for which Contour does not have a design yet (or may never implement for various reasons). We can also remove/modify certain switches that Contour configures (for example, disable the stdout access log).

This is probably going to be our strategy to get tracing (this ticket) working. Unlike Contour, this proxy gRPC server does not have to be very fancy with caching as it can effectively pass-through most requests verbatim. With its limited scope, it can also be implemented in any language that supports gRPC. This extra latency is out of the data path and is a reasonable compromise.

Another potential option is to re-use or modify the FluentD plugin from Google that is covered in "Envoy, Nginx, Apache HTTP Structured Logs with Google Cloud Logging".

How's the situation with this issue?
Should I replace the output of contour bootstrap with a ConfigMap mounted into a volume? Or is there another recommended approach to enable envoy tracing?

@inigohu the best way to do this at the moment would be to provide your own configuration to envoy. contour bootstrap is a convenience to provide the parameters we expect for Envoy and Contour to work together but it is not required. You can replace it with your own mechanism for providing bootstrap configuration to Envoy.

I found altering the envoy config to be insufficient to get tracing working. I had to patch the setup of the HTTPConnectionManager:

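// tracing returns the HCM tracing block when tracing is enabled,
// or nil so that the field is left unset.
func tracing(enableTracing bool) *http.HttpConnectionManager_Tracing {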
    if enableTracing {
        return &http.HttpConnectionManager_Tracing{
            OperationName: http.EGRESS,
        }
    }
    return nil
}

Faced this problem too.
Adding HttpConnectionManager_Tracing helps and enables traces, but it is not mentioned here how to configure sampling (as we set defaults - 100%).

According to the envoy HttpConnectionManager.Tracing configuration (https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto.html#extensions-filters-network-http-connection-manager-v3-httpconnectionmanager-tracing), it is possible to have the default filter inserted and to configure it with runtime variables (https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/runtime#config-http-conn-man-runtime).

I tried this and it works.

Patch listener.go to add tracing filter to HttpConnectionManager

Tracing: &http.HttpConnectionManager_Tracing{},

Configure envoy with bootstrap config in configmap

Add zipkin cluster.

clusters:
...
  - name: zipkin
  ...
...

Add tracing configuration.

tracing:
  http:
    name: envoy.zipkin
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v2.ZipkinConfig
      collector_cluster: zipkin
      collector_endpoint: "/api/v2/spans"
      collector_endpoint_version: HTTP_JSON

Add layered_runtime configuration to set sampling rates.

layered_runtime:
  layers:
    - name: static_layer
      static_layer:
        tracing:
          client_enabled: 100
          global_enabled: 100
          random_sampling: 0

How it works now:
We make a simple request with random_sampling set to 0.
curl helloworld-go.default.127.0.0.1.xip.io
The container received the header

X-B3-Sampled: 0

=> not sampled.

Now trying with x-envoy-force-trace header.
curl http://helloworld-go.default.127.0.0.1.xip.io -H 'x-envoy-force-trace: 1'

X-B3-Sampled: 1

=> sampled.
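
The two results above can be reproduced with a small model of how the three runtime keys interact. This is a simplified sketch based on my reading of the Envoy runtime docs, not Envoy's actual code: the `Runtime` and `Request` types are made up here, and a single `Roll` value stands in for Envoy's per-request random draw.

```go
package main

import "fmt"

// Runtime holds the three tracing percentages (0-100) from the static layer.
type Runtime struct {
	ClientEnabled  int // tracing.client_enabled
	GlobalEnabled  int // tracing.global_enabled
	RandomSampling int // tracing.random_sampling
}

// Request carries only what this model needs. Roll is expected in [0, 100).
type Request struct {
	ForceTrace    bool // x-envoy-force-trace header present
	ClientTraceID bool // x-client-trace-id header present
	Roll          int
}

// enabled reports whether a roll falls inside the given percentage.
func enabled(percent, roll int) bool { return roll < percent }

// sampled reports whether the upstream would see X-B3-Sampled: 1.
func sampled(rt Runtime, rq Request) bool {
	// A request becomes a tracing candidate through any of three routes.
	candidate := (rq.ClientTraceID && enabled(rt.ClientEnabled, rq.Roll)) ||
		rq.ForceTrace ||
		enabled(rt.RandomSampling, rq.Roll)
	// global_enabled is applied last, after all other checks.
	return candidate && enabled(rt.GlobalEnabled, rq.Roll)
}

func main() {
	rt := Runtime{ClientEnabled: 100, GlobalEnabled: 100, RandomSampling: 0}
	fmt.Println(sampled(rt, Request{Roll: 42}))                   // false
	fmt.Println(sampled(rt, Request{Roll: 42, ForceTrace: true})) // true
}
```

With random_sampling at 0, only requests that explicitly ask for a trace are sampled, which matches the curl results.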

Maybe someone can propose a much better solution; that would be great!
Maybe go with 100% tracing, but do the sampling on the collector side...

Sorry to revive an old issue, but as discussed with @youngnick recently, to enable tracing as described by @mattalberts we had to fork Contour.

To summarize our current implementation:

  1. we skip contour bootstrap and provide Envoy configuration through a volume mounted with the content of a ConfigMap. This requires no changes to contour itself.
  2. we provide a new --enable-tracing serve flag and patch HttpConnectionManager with pretty much verbatim code described in https://github.com/projectcontour/contour/issues/399#issuecomment-534623703

It'd be great to be able to do this natively in Contour, without having to maintain our own fork.

@pims can you send as a PR?

If (2) were merged into Contour then the only custom part would be the need to skip contour bootstrap and create an Envoy configuration manually, correct?

That would be a good first step, but I assume it would be better if this also worked with contour bootstrap. Would that be done as a "merge" operation, where you provide an Envoy config that Contour merges in during bootstrap, or would we need to add new Contour configuration that changes bootstrap to include the trace configuration for Envoy?

I have a few questions for those of you who've made this change:

What happens if you enable tracing, but don't set up the cluster in the bootstrap?

I'd really like Contour to be able to tell you something about whether the config is working, in general. Do any of you who use, or would use, this feature want that here? What information would you want Contour to provide, in an ideal world? Please see #2495 for some other discussion around surfacing information to HTTPProxy users, and #2325 for more closely related ideas.

If we did add something to contour bootstrap to add tracing configuration, seems like it would need at least three parameters - enable-tracing, tracing-address, tracing-endpoint, and maybe whatever endpoint_version does. Does that seem right?

To be clear, I think that Contour needs to have the ability to tell Envoy to do tracing, and I want the feature in. I just want to make sure that we've made the feature useful, and operable, even for people who don't know much about Envoy.

Edit: No problems with reviving an old issue @pims. Thanks for restarting the discussion here.

@youngnick nothing happens if tracing is enabled but not configured in bootstrap, nor if the tracing cluster is unavailable. I can't really test those two cases with regard to memory/cpu consumption on a busy cluster at this time though.

The only error that we've experienced during upgrades is:

Proto constraint validation failed (Using the default now-deprecated value HTTP_JSON_V1 for enum 'envoy.config.trace.v2.ZipkinConfig.collector_endpoint_version' from file trace.proto. This enum value will be removed from Envoy soon so a non-default value must now be explicitly set. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.): collector_cluster: "zipkin"
collector_endpoint: "/api/v2/spans"

when we didn't specify the collector_endpoint_version.

Thanks for that @pims! So, do you think that Contour (that is, contour serve) needs to tell you anything other than "Hey, you enabled tracing"? Feels like contour serve would need at least a cluster name (that we could default if necessary), and we could get away with just logging "Traces will be sent to this cluster, if it's up".

As an aside, I'd really like a way to put some of this info on some object's status, somewhere. Not sure where that would be, yet.

What do you think about my suggestion for contour bootstrap as well? Do you think that it captures the info the bootstrap config would need, in a general case?

Also, everyone watching this issue, but particularly @v0id3r: what do you think about configuring sampling? Is it worth bringing this feature in with everything set to 100% sampling, and adding configurability later, or do we need to discuss sampling now too?

@youngnick this is a test configuration that should highlight the moving parts:

static_resources:
  listeners:
  - address:
      socket_address:
        address: 127.0.0.1
        port_value: 8002
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
          tracing: {} #empty config is enough, no need for cluster info here.
          codec_type: auto
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: my-service
          http_filters:
          - name: envoy.filters.http.router
            typed_config: {}
          use_remote_address: true
  clusters:
  - name: my-service
    connect_timeout: 0.250s
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: my-service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8003
  - name: zipkin # this is a typical cluster, nothing tracing specific
    connect_timeout: 1s
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: zipkin
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 9411

# this is the key part
tracing:
  http:
    name: envoy.tracers.zipkin
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v2.ZipkinConfig
      collector_cluster: zipkin # matches the cluster name defined above
      collector_endpoint: /api/v2/spans
      collector_endpoint_version: HTTP_JSON
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 8001

so something along the lines of --tracing-cluster=zipkin should be enough given an envoy configuration with a static cluster of the same name and the tracing.http configured as required.
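
Since the tracer only refers to its collector by cluster name, a cheap pre-flight check is to confirm that the name actually exists among the static clusters. The sketch below assumes the bootstrap is available as JSON (to stay within the standard library) and that the tracer config carries a `collector_cluster` field as in the Zipkin example; `checkTracingCluster` is a hypothetical helper, not something Contour provides.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bootstrap models only the fields this check reads; everything else in
// the document is ignored by the JSON decoder.
type bootstrap struct {
	StaticResources struct {
		Clusters []struct {
			Name string `json:"name"`
		} `json:"clusters"`
	} `json:"static_resources"`
	Tracing *struct {
		HTTP struct {
			TypedConfig struct {
				CollectorCluster string `json:"collector_cluster"`
			} `json:"typed_config"`
		} `json:"http"`
	} `json:"tracing"`
}

// checkTracingCluster returns an error when the tracing section names a
// collector cluster that is not defined under static_resources.
func checkTracingCluster(raw []byte) error {
	var b bootstrap
	if err := json.Unmarshal(raw, &b); err != nil {
		return fmt.Errorf("parsing bootstrap: %w", err)
	}
	if b.Tracing == nil {
		return nil // tracing not configured, nothing to check
	}
	want := b.Tracing.HTTP.TypedConfig.CollectorCluster
	for _, c := range b.StaticResources.Clusters {
		if c.Name == want {
			return nil
		}
	}
	return fmt.Errorf("tracing collector_cluster %q is not a static cluster", want)
}

func main() {
	cfg := []byte(`{
	  "static_resources": {"clusters": [{"name": "my-service"}]},
	  "tracing": {"http": {"name": "envoy.tracers.zipkin",
	    "typed_config": {"collector_cluster": "zipkin"}}}
	}`)
	fmt.Println(checkTracingCluster(cfg))
}
```

Here the example config deliberately omits the zipkin cluster, so the check reports the mismatch instead of letting it go unnoticed.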

For bootstrap, I believe it's a bit more complicated given the different tracers:

    // - *envoy.tracers.lightstep*
    // - *envoy.tracers.zipkin*
    // - *envoy.tracers.dynamic_ot*
    // - *envoy.tracers.datadog*
    // - *envoy.tracers.opencensus*
    // - *envoy.tracers.xray*

the config is provider specific, which makes it quite difficult to generalize in the bootstrap process.
I'm not quite up-to-date on opencensus and when/if we can expect it to be the standard. I'd say it's easier to punt and let the static configuration deal with it.

So, for your use case @pims, the only code changes required to contour serve is that the HCM should have the empty tracing config? Then you would configure the tracing clusters and the rest using a custom bootstrap config you generate yourself? In that case, a config setting (enableTracing: true or similar) would seem to be sufficient.

Other people watching this issue, would this meet your bar for an MVP solution? Another important thing to note is that we have not changed to the v3 API yet, so we will need to check v2 docs for many xDS things.

@youngnick for our use-case, yes, all that's needed on contour's side is enabling the empty tracing config for HCM; everything else is taken care of by our bootstrap config.

I'm happy to submit a PR with our changes to get the ball rolling on this if it helps.

Thanks @pims, I'd like if possible to hear from anyone else who's watching this issue. Would requiring that you generate your own bootstrap config be a good first step for now?

I'll have a think about whether we then do a first cut of a contour bootstrap update that assumes a zipkin tracing type.

@youngnick asked:

Is it worth bringing this feature in with everything set to 100% sampling

Seems like both OpenTelemetry and Honeycomb are saying (a) the new hotness is tail-based sampling and (b) to do that well right now you need to centralize ALL your traces and do that sampling in one spot (where the feature is alpha), or else use a custom-built version of something or other. Perhaps later the tools will improve on that point.

So FWIW I, personally, am about to venture down that path and try to sample later, instead of in the apps (or in, say, Envoy).

Fair enough.

We've also added the new ExtensionService CRD, initially to help with external auth (as outlined in the design doc for auth), but it's been built to be able to add any Kubernetes service as a reference-able Envoy cluster. So it can be used to avoid having to supply a custom bootstrap with your tracing configuration, theoretically.

That means that, in order to implement tracing we'll need to:

  • add a tracing configuration section somewhere. I'm not sure if this should be a global Contour config, a global default overridable in HTTPProxy, or a per HTTPproxy virtualhost config. I'd love to hear from people using this right now what you'd prefer.
  • Figure out what, if any, tunables need to be in the tracing config (sample rate, etc.)

Basically, the next steps are that we need a design doc for how to use the new ExtensionService to implement tracing. We don't currently have it listed on our roadmap, (oddly, we've obviously missed it in planning) but if anyone wants to take this up for themselves, or would like for it to be officially on there, this is the place to say so.

We do have a couple of must-deliver things soon (the xDS v3 upgrade is the most pressing), so I can't make any guarantees yet about when we can get to this. But, requests from users are the fastest way for me to be able to shuffle priorities, hint hint.

I'm not sure if this should be a global Contour config, a global default overridable in HTTPProxy, or a per HTTPproxy virtualhost config. I'd love to hear from people using this right now what you'd prefer.

Per-namespace is how I would envision it for our use case (multi-tenant cluster).
Variations I can think of at the namespace level:

  • on/off switch
  • sampling controls
  • destination jaeger receiver -- since jaeger isn't multi-tenant aware, we'll likely have a receiver in every namespace
  • static tags on traces -- if Envoy allows such a thing; otherwise I believe the receiver can handle that

I suppose a per-HTTPProxy design also works; we could inherit from a namespace config through an admission controller.

For reference, this blog post mostly matches what we have in mind:
https://itnext.io/jaegers-multitenancy-with-elasticsearch-ae318501f415

The way I see it:

  • doing per-HTTPProxy (well, per HTTPProxy virtualhost or root proxy), allows people to use the feature as quickly as possible, at the cost of making people have to configure the tracing on every vhost.
  • Having a Contour-level default available seems like it might possibly be useful, but is more work.

Personally my hope for tracing with Contour is to be able to just tell Envoy where to dump all its tracing to, globally, with the potential to override it per-namespace. Specifically I'd like to tell it to dump OpenCensus format traces to a locally deployed OpenTelemetry collector using https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/trace/v2/opencensus.proto (v2). I'd really only need ocagent_address exposed, and have that automatically set ocagent_exporter_enabled if set. I think that would allow quite a few use cases since once the traces are in the OpenTelemetry collector you can export it to basically any observability backend (jaeger, zipkin, stackdriver, AWS x-ray, honeycomb, datadog, etc).

Presumably longer term envoy is going to add support for OpenTelemetry format traces directly but for now OpenCensus seems the most generic version.

We generally haven't done any config on a per-namespace basis, really, for Contour, because of the way Contour is triggered to generate Envoy config (basically, when one of a set of objects change). It seems like a global option, overridable per-HTTPProxy, would meet the most use cases out of this issue.
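
A global option that is overridable per HTTPProxy could resolve roughly as in the sketch below. All type and field names (`TracingConfig`, `TracingOverride`, `Collector`, `SampleRate`) and the collector address are hypothetical; this is only meant to show the inheritance rule, not a proposed API.

```go
package main

import "fmt"

// TracingConfig is a hypothetical global tracing setting.
type TracingConfig struct {
	Enabled    bool
	Collector  string  // address of the trace collector
	SampleRate float64 // percentage, 0-100
}

// TracingOverride is a hypothetical per-HTTPProxy override.
// A nil field means "inherit the global value".
type TracingOverride struct {
	Enabled    *bool
	Collector  *string
	SampleRate *float64
}

// effective merges an optional override on top of the global default.
func effective(global TracingConfig, o *TracingOverride) TracingConfig {
	out := global
	if o == nil {
		return out
	}
	if o.Enabled != nil {
		out.Enabled = *o.Enabled
	}
	if o.Collector != nil {
		out.Collector = *o.Collector
	}
	if o.SampleRate != nil {
		out.SampleRate = *o.SampleRate
	}
	return out
}

func main() {
	global := TracingConfig{Enabled: true, Collector: "otel-collector.tracing:55678", SampleRate: 100}

	off := false
	fmt.Printf("%+v\n", effective(global, nil))                            // inherits everything
	fmt.Printf("%+v\n", effective(global, &TracingOverride{Enabled: &off})) // opts out only
}
```

Pointer fields make "not set" distinguishable from "set to the zero value", which matters for a proxy that wants to switch tracing off explicitly.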

I'll wait to hear from any other interested parties, but if that's the case, the next step will be a design document, outlining what we'll support (if it's opencensus or some other format, and how the global with override setup will work).

We will probably also need to include samples of config to pass to Gatekeeper for the case that someone wants to _prevent_ HTTPProxies from overriding the config, as we have done for other settings.

I agree with @johanbrandhorst. We had to make a fork to configure opencensus. We're using envoy as an edge proxy and the sampling decision is being made at this point. That decision is being propagated through the network to the rest of the services. Then we're centralizing the metrics/traces ingestion with the opentelemetry collector.

Specifically I'd like to tell it to dump OpenCensus format traces to a locally deployed OpenTelemetry collector using https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/trace/v2/opencensus.proto (v2). I'd really only need ocagent_address exposed, and have that automatically set ocagent_exporter_enabled if set.

We took the same approach but added an additional flag to set the probability sampling value. In the future, we'll move to tail-based sampling in the opentelemetry collector.

Sounds like we really need two, maybe three settings:

  • the place to send traces
  • the probability sampling value
  • the type of the trace (if we end up supporting more than one).

Questions I have for people watching this issue:

  • Does OpenCensus meet everyone's needs? Obviously it does for @johanbrandhorst and @inigohu, but can I get any comments from anyone else?
  • How important is probability sampling? Not a big deal, just wanting to know.

I'll give this one two weeks lazy consensus timeout - I'll mention it on the meetings and maybe the mailing list, then consider it okayed if there are no objections by EOD November 9, 2020, US Pacific time. If you want something else in the initial featureset, or you need it sooner, please speak now!

I think we _could_ do without probability sampling as an option initially - as mentioned before the world is moving to a tail based sampling model where the aggregators choose what to sample and the producers just blindly fire things to the sink.

Just as a stop gap - what can I do today if I want to play around with this? Modify the Contour ConfigMap before starting contour? What do I have to set exactly? I'm not familiar with using it as a proxy for envoy configuration. Thanks!

* Does OpenCensus meet everyone's needs?

Note, OpenCensus is "deprecated". The project has merged with OpenTracing to become https://opentelemetry.io/

We hope to have a GA release soon. Contour should definitely move forward with OpenTelemetry and not OpenCensus -- or any other option in my opinion :)

Envoy currently does not support exporting OpenTelemetry traces, or that would have been the obvious choice. OpenCensus is indeed deprecated, but in the interest of adding _some_ support, OpenCensus still has wide library support in a range of languages, and the OpenTelemetry collector can consume the OpenCensus traces, so it's also a forward compatible choice.

Ah, so Envoy is instrumented with OpenCensus? Then that makes sense.

OpenCensus and OpenTelemetry use the same propagation format (W3C Trace Context) so it also means propagation upstream will work to OpenTelemetry instrumented services.
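
For anyone who hasn't looked at the W3C Trace Context format: the context travels in a `traceparent` header of the form `version-traceid-parentid-flags`, all hex. Here is a minimal Go sketch of parsing it, assuming version 00 only and skipping some of the spec's stricter checks (lowercase-only hex, all-zero IDs). `parseTraceParent` is a name made up for this example.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent is the decoded form of a version 00 traceparent header.
type TraceParent struct {
	Version  string
	TraceID  string // 16 bytes, hex encoded
	ParentID string // 8 bytes, hex encoded
	Sampled  bool   // lowest bit of the trace-flags byte
}

// parseTraceParent splits and validates a traceparent header value.
func parseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	wantLen := []int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != wantLen[i] {
			return TraceParent{}, fmt.Errorf("traceparent: field %d has length %d, want %d", i, len(p), wantLen[i])
		}
		if _, err := hex.DecodeString(p); err != nil {
			return TraceParent{}, fmt.Errorf("traceparent: field %d is not hex: %w", i, err)
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return TraceParent{
		Version:  parts[0],
		TraceID:  parts[1],
		ParentID: parts[2],
		Sampled:  flags[0]&0x01 == 1,
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", tp)
}
```

The sampled bit plays the same role as x-b3-sampled in the B3 headers shown earlier in the thread.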

TIL about OpenCensus vs OpenTelemetry, thanks @tsloughter!

Just as a stop gap - what can I do today if I want to play around with this? Modify the Contour ConfigMap before starting contour? What do I have to set exactly? I'm not familiar with using it as a proxy for envoy configuration. Thanks!

It looks from @pims's previous examples like you may be able to get a test version of this working by creating your own bootstrap config and modifying it - contour bootstrap is currently used by the Envoy deployment in an init container to generate the bootstrap for Envoy proper to consume - so you could take that output and munge it a bit to see what you can get working.

That would be valuable data for a design. You (or anyone) could start with PRing a new design document into the design/ directory, based off the design template. Once we've agreed on the design (by merging the PR), then anyone can start working on the code.

Having a quick look at the Envoy docs, I'm not sure how the per-httpproxy override would work, but that seems like an important detail to note in a design doc.

Please feel free to ask more questions here, or ping me on Kubernetes slack @youngnick. If you would like to work on the design doc, you can signal that by assigning this issue to yourself for now. If you can't that's fine too, you can also come to the community meeting this week and talk about it if that's what you would prefer.

tl;dr Next steps:

  • [ ] Design document specifying what we need to build
  • [ ] Someone builds that
  • [ ] Contour has tracing!

Since there are some questions above, it may be useful to recap a couple things.

Envoy support:

  • Envoy supports the creation and export of traces to several different kinds of system (OpenCensus, AWS X-Ray, Zipkin, and several others) through plugins.
  • There's both v2 and v3 config for them.
  • There's some kind of sampling support I've seen pull reqs. about (but not found docs for yet).
  • Envoy also knows how to receive, extend, and transmit several kinds of headers for distributed tracing, and the OpenCensus plugin apparently can handle the W3C Trace context headers.

OpenTelemetry (OTel):

  • The otel-collector can be configured with receivers to ingest data from a variety of different sources of trace data (including Jaeger, Kafka, OpenCensus, OpenTelemetry, Zipkin) and sources of metrics data (including OpenTelemetry, OpenCensus, and Prometheus).
  • OpenTelemetry is standardizing on the W3C Trace Context headers, one of several which can be used to connect parent/child spans between different systems instrumented for tracing.

Sampling: when we talk about sampling, there's a lot of extra wrinkles, but the basic idea people seem to be pursuing with sampling is that if you record the sample rate applied to chosen traces (e.g., "this chosen sample is 1% of the time" and "this other chosen sample is sampled on some rule that's kept 70% of the time") then later you can statistically reconstruct the overall estimated occurrence of events like those kept. IMHO a first version of tracing configuration from Contour doesn't need to do any sampling, but may benefit from noting that "100% included" fact when sending traces elsewhere.
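
To illustrate the reconstruction idea: if every kept trace records the rate at which it was kept, each one stands in for 1/rate original events. A minimal Go sketch, with made-up trace names and the 1% and 70% rates from the example above:

```go
package main

import "fmt"

// KeptTrace is a sampled trace annotated with the probability it was kept.
type KeptTrace struct {
	Name       string
	SampleRate float64 // 0 < rate <= 1
}

// estimateTotal weights every kept trace by the inverse of its sample rate
// to estimate how many events occurred before sampling.
func estimateTotal(kept []KeptTrace) float64 {
	total := 0.0
	for _, t := range kept {
		if t.SampleRate > 0 {
			total += 1 / t.SampleRate
		}
	}
	return total
}

func main() {
	var kept []KeptTrace
	for i := 0; i < 3; i++ {
		kept = append(kept, KeptTrace{Name: "checkout", SampleRate: 0.01})
	}
	for i := 0; i < 7; i++ {
		kept = append(kept, KeptTrace{Name: "healthz", SampleRate: 0.70})
	}
	// 3 traces kept at 1% stand for about 300 events and 7 kept at 70% for
	// about 10, so the estimate comes out at roughly 310.
	fmt.Printf("estimated original events: %.0f\n", estimateTotal(kept))
}
```

A trace exported with 100% sampling simply carries a rate of 1 and counts once, which is why noting that fact is enough for a first version.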

So what's being generally suggested above makes sense! If Contour can (1) configure Envoy's OpenCensus plugin for now (and Envoy's OpenTelemetry plugin when that's supported) to export to the address of a given collector system, and (2) configure propagation of the W3C Trace Context headers, then we'll be able to insert Envoy spans into a lot of distributed traces right away.

I'd try to turn this long comment into a spec., but I'm not sure I know enough about the internals of either Envoy's plugin configuration or Contour right now. 😅

Having no knowledge of how anything on contour works I spent some time this morning playing around with code changes to manually set the tracing configuration. I hacked together something in bootstrap.go just as a starting point:

diff --git a/internal/envoy/v2/bootstrap.go b/internal/envoy/v2/bootstrap.go
index 6084da1b..0f00230c 100644
--- a/internal/envoy/v2/bootstrap.go
+++ b/internal/envoy/v2/bootstrap.go
@@ -29,6 +29,7 @@ import (
    clusterv2 "github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster"
    envoy_api_v2_core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
    envoy_api_bootstrap "github.com/envoyproxy/go-control-plane/envoy/config/bootstrap/v2"
+   envoy_config_trace_v2 "github.com/envoyproxy/go-control-plane/envoy/config/trace/v2"
    matcher "github.com/envoyproxy/go-control-plane/envoy/type/matcher"
    "github.com/golang/protobuf/proto"
    "github.com/golang/protobuf/ptypes/any"
@@ -141,6 +142,21 @@ func bootstrap(c *envoy.BootstrapConfig) ([]bootstrapf, error) {
                upstreamSdsTLSContext(sdsTLSCertificatePath, sdsValidationContextPath))
            return c.Path, b
        },
+       func(*envoy.BootstrapConfig) (string, proto.Message) {
+           b := bootstrapConfig(c)
+           b.Tracing = &envoy_config_trace_v2.Tracing{
+               Http: &envoy_config_trace_v2.Tracing_Http{
+                   Name: "envoy.tracers.opencensus",
+                   ConfigType: &envoy_config_trace_v2.Tracing_Http_TypedConfig{
+                       TypedConfig: protobuf.MustMarshalAny(&envoy_config_trace_v2.OpenCensusConfig{
+                           OcagentAddress:         "dns:///<open-telemetry-collector-address>",
+                           OcagentExporterEnabled: true,
+                       }),
+                   },
+               },
+           }
+           return c.Path, b
+       },
    )

    return steps, nil

I rebuilt the Docker image and applied it to my local cluster. This did appear to set the Envoy bootstrap tracing configuration correctly, but I was getting gRPC connection errors in the Envoy logs and didn't have any more time to debug it. It might be a good starting point for someone else.

[bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:93] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection termination

So what's being generally suggested above makes sense! If Contour can (1) configure Envoy's OpenCensus plugin for now (and Envoy's OpenTelemetry plugin when that's supported) to export to the address of a given collector system, and (2) configure propagation of the W3C Trace Context headers, then we'll be able to insert Envoy spans into a lot of distributed traces right away.

I expect that we will need to be able to configure a few different tracing systems. The space is really complex right now, as you point out. Even if OpenCensus is the emerging standard, right now I'd expect that a lot of people are using systems built around Zipkin or Jaeger.

we will need to be able to configure a few different tracing systems.

What you'll want is the ability to configure the propagation and export format.

So for Zipkin that is the B3 spec for propagation. For Jaeger, I'm not sure they have an actual standard for their header keys, so it's best to just allow the user to configure the header key used.

For export it is similar, zipkin and jaeger both have protocols, and all the providers (Stackdriver, Lightstep, Honeycomb, etc) have their own as well.

But with OpenTelemetry (and OpenCensus) the libraries support all those propagation and export protocols, so using the single instrumentation library allows the user to configure the use of B3 (Zipkin) propagation even though their service might be a Java app instrumented with Brave. Using OTel or OC, and exposing their configuration, allows the user's Java app to remain unchanged.

Thanks @tsloughter for the point about the propagation and export formats being separate things. That is a good point for whoever ends up doing the design doc for this.

I'm mostly in sync with what @inigohu, @johanbrandhorst & @kevincantu proposed. In my opinion we can base the design doc on these premises:

  • Use OpenCensus knowing that it will be "ephemeral" as the plan is to migrate to use OpenTelemetry
  • Try to surface the minimum quantity of parameters, at least initially. That is, just the otel-collector endpoint.
  • In order to achieve the previous point, use sane defaults for trace format (W3C) and sampling (always): the former to stick with standards, and the latter to let the collector decide, as it seems quite "standard" nowadays to do tail-based sampling at the collector level.
  • This migration should be safe because otel-collector is going to support both protocols, as their initial aim is to help people transition easily.
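
The premises above could be sketched as a config type with a single required field and defaults for everything else. This is only an illustration: `TracingConfig`, its field names, and the example address are assumptions, not existing Contour types or settings.

```go
package main

import (
	"errors"
	"fmt"
)

// TracingConfig is a hypothetical shape for the minimal settings discussed
// above. It is not an existing Contour type.
type TracingConfig struct {
	// CollectorAddress is the otel-collector endpoint (required).
	CollectorAddress string
	// Propagation is the trace header format; empty defaults to "w3c".
	Propagation string
	// Sampling is the sampling strategy; empty defaults to "always",
	// leaving tail-based sampling decisions to the collector.
	Sampling string
}

// WithDefaults returns a copy of c with unset optional fields filled in.
func (c TracingConfig) WithDefaults() (TracingConfig, error) {
	if c.CollectorAddress == "" {
		return TracingConfig{}, errors.New("tracing: collector address is required")
	}
	if c.Propagation == "" {
		c.Propagation = "w3c"
	}
	if c.Sampling == "" {
		c.Sampling = "always"
	}
	return c, nil
}

func main() {
	// The address below is a placeholder, not a real default.
	cfg, err := TracingConfig{CollectorAddress: "otel-collector:55678"}.WithDefaults()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Keeping the optional fields as plain strings with defaults means more formats can be accepted later without changing what a minimal configuration looks like.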

WDYT?

Yes, that sounds like a great plan, thanks @glerchundi!

I'll be honest, I don't think the current maintainers are going to be able to get to this design doc until early next year sometime, so if anyone else wants to get started before then, please see the design doc template in design/ in this repo, and feel free to post a WIP PR for early feedback. Of course, posting a non-WIP PR is fine too!

I've been thinking about this, and before getting involved in the design doc I would like to know what people think about a key question: multi-tenant deployments.

I started by thinking that the best approach would be to have a global configuration (i.e., in the config file), but now I'm more inclined to configure tracing on a per-HTTPProxy basis. With the new Envoy API (v3), tracing can be configured at the HTTP connection manager level, which means that each TLS-based virtualhost has its own tracing configuration parameters.

The drawback here is the same you're also having with rate limiting support and it's how we should deal with non-TLS virtualhosts where all converge in the same HTTP connection manager and therefore in the same tracing config. @skriss raised an issue where an answer for the third bullet point will solve the stated concerns here.

So let's start a poll to see what you think:

  • 👍 : tracingPolicy at the virtualhost level
  • 👎 : global config

UPDATE: By the way, sorry for choosing those biased emojis...

@glerchundi Does following the pattern for external auth make sense here? IMHO we should aim to make similar things similar :)

This is an interesting approach, @jpeach, but I wonder whether it might be too much for this. Let me sketch how it would look, and tell me what you think:

Extension service definition:

apiVersion: projectcontour.io/v1alpha1
kind: ExtensionService
metadata:
  namespace: projectcontour
  name: tracing
spec:
  services:
    - name: tracing
      port: 55678

HTTPProxy extension point:

tracingService:
  extensionService: projectcontour/tracing

The benefit of this design is that it can be extended easily, for example with the format in which we want to send the traces (although I would avoid that initially).

I would expect a relatively minimal config to further include (or default to):

  • not just the port, but the kind of tracing (default: opencensus)
  • the propagation header format (default: w3c).

Edit: the latter seems to be part of the OpenCensus configuration in v3.

I guess both would be part of the extension service definition, @glerchundi?

Edit 2: oh, so for auth the extension service is really just a reference to another Kubernetes service...

So this choice of tracing provider and provider-specific config (e.g., OpenCensus) would be defined as part of the HTTPProxy?

To answer all the questions I can see here:

  • I think the tracing config should be a virtualhost-level thing. We can look at using Gatekeeper for policy enforcement later, if operators want to ensure that all HTTPProxies have tracing configured.
  • We should use ExtensionService to define the actual tracing service to connect to.
  • Additional config for the service (tracing type, propagation formats, etc.) should probably be in fields in the HTTPProxy extension point (@glerchundi's tracingService stanza). @jpeach, interested in your thoughts on this one.

Although I wanted to invest time in writing this design doc, I won't be able to do so for at least the next 3-6 months.

Just wanted to share this with you all to avoid creating false expectations and to unblock this in case someone wants to take it and push it forward.

🙏

Wanted to revive this thread and update everyone that we are looking at adding tracing now, perhaps in the next release or the one after that. A lot of the comments already touch on use cases and what tracing stack to use, so I'm just adding a couple of quick thoughts here on how we might go about this.

  • Leverage Envoy’s native tracing capabilities to tag incoming requests with some identifier (eg https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability/tracing.htm)
  • Allow data to be collected by something like OpenTelemetry which is then persisted to a Jaeger or Zipkin deployment. OpenCensus is also suggested here, this can work as well.
  • Allow the Contour user to choose a trace provider in our init file, or specify this in the Contour deployment somehow
  • Tracers supported by Envoy listed here : https://www.envoyproxy.io/docs/envoy/v1.13.1/api-v2/config/trace/v2/trace.proto#config-trace-v2-tracing-http like lightstep, Zipkin
  • We should also allow setting this provider in the Contour Operator
  • We should validate tracing a request that is rate limited and passes through external auth.

Regarding which tracing technologies to integrate against, we're pretty open. Jaeger is very popular right now due to its distributed architecture making it highly scalable. Both Zipkin and Jaeger support OpenTracing and OpenTelemetry and have pretty robust communities.

cc @skriss

Also, we can look at making tracing data actionable, e.g. for preventative security measures. I don't know if OPA is appropriate here; the scenario I have in mind is that services should be accessed in a particular order, i.e. you can only hit the 'shipping' service after hitting the 'checkout' service. If it happened the other way around, this should be logged and surfaced to the admin.

Tagging this 1.17 to signify that we are seriously looking into the design now.

V3 envoy api reference: https://www.envoyproxy.io/docs/envoy/v1.13.1/api-v3/config/trace/v3/trace.proto

I vaguely remember an upstream envoy discussion about some tracers being removed, I’ll try to find it

from https://github.com/projectcontour/contour/issues/399#issuecomment-741624428:

The drawback here is the same you're also having with rate limiting support and it's how we should deal with non-TLS virtualhosts where all converge in the same HTTP connection manager and therefore in the same tracing config. @skriss raised an issue where an answer for the third bullet point will solve the stated concerns here.

Yeah, unfortunately we still don't have a solution here. For rate limiting we ended up going with a single global rate limit service configured at the Contour level. We could do the same for tracing, though the votes on the original comment make it clear that folks prefer per-vhost settings. If we go with per-vhost settings, then they can only be provided for TLS vhosts, since all non-TLS vhosts share a single HCM/tracing config.

I think that adding a single, global tracing sink first is at least a start, and we then add the ability to override that sink on a per-object basis later?

Edit: I should note that, in the case you wanted to use a sidecar sink, that a single global one would be fine (since you could point it to localhost, and run the tracing sink in the Envoy pods. This would work well for tracing sinks that allow a separate ingestion tier that handles the sampling etc.)
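
To illustrate the "global sink first, per-object override later" idea together with the non-TLS limitation mentioned above, here is a rough Go sketch. The `VirtualHost` type, the `EffectiveSink` function, and the example addresses are hypothetical and do not reflect Contour's internal model.

```go
package main

import "fmt"

// VirtualHost is a simplified, hypothetical stand-in for a vhost.
type VirtualHost struct {
	FQDN string
	TLS  bool
	// TracingSink is an optional per-vhost override; empty means "use the global sink".
	TracingSink string
}

// EffectiveSink returns the tracing sink that applies to vh.
// An override is only honoured for TLS vhosts, because all non-TLS vhosts
// share a single HTTP connection manager and therefore one tracing config.
func EffectiveSink(global string, vh VirtualHost) (string, error) {
	if vh.TracingSink == "" {
		return global, nil
	}
	if !vh.TLS {
		return "", fmt.Errorf("vhost %q: per-vhost tracing sink requires TLS", vh.FQDN)
	}
	return vh.TracingSink, nil
}

func main() {
	// A sidecar sink running in the Envoy pods; the address is a placeholder.
	global := "localhost:55678"
	vhosts := []VirtualHost{
		{FQDN: "plain.example.com"},
		{FQDN: "secure.example.com", TLS: true, TracingSink: "team-a-collector:55678"},
		{FQDN: "bad.example.com", TracingSink: "team-b-collector:55678"},
	}
	for _, vh := range vhosts {
		sink, err := EffectiveSink(global, vh)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", vh.FQDN, sink)
	}
}
```

Rejecting the override on a non-TLS vhost, rather than silently ignoring it, is one possible choice here; a real design might instead report it as a status condition on the object.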

I think that adding a single, global tracing sink first is at least a start, and we then add the ability to override that sink on a per-object basis later?

That'd be consistent with rate limiting, so I like that in principle.
