Jaeger: distribution of traces/spans amongst collectors

Created on 23 Jul 2019 · 39 Comments · Source: jaegertracing/jaeger

Requirement - what kind of business use case are you trying to solve?

Are collectors load balanced?

Problem - what in Jaeger blocks you from solving the requirement?



We have a Jaeger tracing setup working with Elasticsearch configured as the back end. Currently we have two collector replicas. There are 5-10 services that send traces to the collectors (the number of services keeps changing). I see that the collectors are not evenly loaded with traffic: one collector reaches its maximum queue usage, whereas the other is hardly using 20-30% of its capacity. This causes span drops on the collector that is loaded to capacity.
Can we load balance the traffic (spans) between both collectors? I am not sure whether there is a config option that I am missing.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Any open questions to address

question

All 39 comments

If you are using the Jaeger Agent, you can configure them to use gRPC instead of Thrift (--reporter.type=grpc). You can then either pass a static list of collectors, or use gRPC's notation for discovering the servers (--reporter.grpc.host-port=dns:///service-name:14250).

If your tracers are connecting directly to the collector, only TChannel is supported at the moment, and it's not possible to load balance individual requests.
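
As a rough sketch of the two agent options described above (DNS discovery versus a static list), the agent container in a Kubernetes Deployment might be configured as follows. The two flags are the ones named in this comment; the container name, image tag, namespace ("observability"), and host names are hypothetical placeholders, not values from this thread.

```yaml
# Sketch only: names, image tag, and hosts below are placeholders.
containers:
  - name: jaeger-agent
    image: jaegertracing/jaeger-agent:1.13.1
    args:
      - "--reporter.type=grpc"
      # Option 1: let gRPC discover the collectors through DNS.
      - "--reporter.grpc.host-port=dns:///jaeger-collector.observability.svc.cluster.local:14250"
      # Option 2 (use instead of option 1): a static, comma-separated list.
      # - "--reporter.grpc.host-port=collector-a.example.com:14250,collector-b.example.com:14250"
```

Only one --reporter.grpc.host-port value should be active at a time; the second is left commented out to show the alternative form.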

We are using the agent; I will check the configuration. As of now we manage the Jaeger setup, and the different teams are just bombarding it with traces and spans. Is there anything we can do at the collector level?

@prana24 at Uber we recommend that all teams use an internal wrapper for the Jaeger client libraries, which makes sure that production services always use the remote sampler that pulls sampling strategies from the backend. This way you can create configuration for the collectors controlling how much each service should sample.

If you have no control over the clients, the brute-force solution is to implement downsampling in the collector (which we do at Uber, but at this point as more of a safety measure). Downsampling is consistently based on trace ID hash, so you don't get partial traces, but downsampling affects all users equally, not just the offending service.

Another approach is throttling clients doing sampling, but it's not currently implemented (#1676).

The best solution imo is tail-based sampling, which Jaeger does not support yet directly, but you can get it with OpenCensus Service.

We were using jaeger-agent 1.8.x; I see gRPC was probably not enabled in that version. I am upgrading the agent to the latest (1.13.x). My collector is still 1.9.x. Is this version OK, or should I upgrade that as well?

If you can, keep both the collector and the agent at the same version.

Thank you @jpkrohling, I have done that. I have a basic question about the host part of dns:///:14250: what goes there? Is it the same name that we get from the command kubectl get service for the collector service?

It's the DNS name under which the service can be reached. In Kubernetes, this is typically service_name.namespace.svc.cluster.local, but depending on the cluster configuration, you might be able to use only the service name as the hostname, if both the client and the agent/collector are in the same namespace.

If you are using Kubernetes, I recommend taking a look at the jaeger-operator. Even if you decide not to use it for production, you might benefit from seeing how it deploys Jaeger.

Sure, thank you. I am taking a look.

Hi,
I have made changes to my agent.yaml, but somehow it still sends traffic to only one of the collectors. It looks like it is not able to do the DNS lookup. I am adding my agent.yaml and agent log here for reference.

2019/08/06 11:19:50 maxprocs: Leaving GOMAXPROCS=24: CPU quota undefined
{"level":"info","ts":1565090391.0357707,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1565090391.0362382,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1565090391.0362995,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14271}
{"level":"info","ts":1565090391.0363176,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14271,"health-status":"unavailable"}
{"level":"info","ts":1565090391.0373657,"caller":"grpc/builder.go:75","msg":"Agent requested insecure grpc connection to collector(s)"}
{"level":"info","ts":1565090391.041124,"caller":"grpc/clientconn.go:242","msg":"parsed scheme: \"dns\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.067472,"caller":"agent/main.go:74","msg":"Starting agent"}
{"level":"info","ts":1565090391.0675416,"caller":"healthcheck/handler.go:129","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1565090391.0675752,"caller":"app/agent.go:68","msg":"Starting jaeger-agent HTTP server","http-port":5778}
{"level":"info","ts":1565090391.0754118,"caller":"dns/dns_resolver.go:264","msg":"grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.jaeger-collector-dev.sampling.svc.cluster.local on 192.168.0.3:53: no such host.\n","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.119492,"caller":"dns/dns_resolver.go:289","msg":"grpc: failed dns TXT record lookup due to lookup _grpc_config.jaeger-collector-dev.sampling.svc.cluster.local on 192.168.0.3:53: no such host.\n","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1195319,"caller":"grpc/resolver_conn_wrapper.go:140","msg":"ccResolverWrapper: got new service config: ","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.119683,"caller":"grpc/resolver_conn_wrapper.go:126","msg":"ccResolverWrapper: sending new addresses to cc: [{192.168.172.54:14250 0  <nil>}]","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1197627,"caller":"base/balancer.go:76","msg":"base.baseBalancer: got new resolver state: {[{192.168.172.54:14250 0  <nil>}] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1197968,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1241517,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, READY","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1241896,"caller":"roundrobin/roundrobin.go:50","msg":"roundrobinPicker: newPicker called with readySCs: map[{192.168.172.54:14250 0  <nil>}:0xc00018d560]","system":"grpc","grpc_log":true}

The agent.yaml is also pasted here:

# Source: jaeger-client-mon/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: jaeger-app-1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: jaeger-app-1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: jaeger-app-1
    spec:
      containers:
      - image: docker.artifactory.prod.adnxs.net/jaeger_client_1
        name: jaeger-app-1
        ports:
        - containerPort: 8080
      - image: docker.artifactory.prod.mycompany.net/jaegertracing/jaeger-agent:1.13.1-1-b8a6d4ea680063ab03575e864f233841cfcb45cb58a9c5ddde2e287844c1b679
        name: jaeger-agent-1
        #args: ["--collector.host-port=jaeger-collector-dev.sampling.svc:14267"]
        args: ["--reporter.grpc.host-port=dns:///jaeger-collector-dev.sampling.svc.cluster.local:14250"]
        ports:
        - containerPort: 5775
          protocol: UDP
        - containerPort: 6831
          protocol: UDP
        - containerPort: 6832
          protocol: UDP
        - containerPort: 5778
          protocol: TCP

Any idea what is wrong here?

Nothing seems wrong there: gRPC tried to load some extra configuration via DNS but couldn't find anything "extra". As you can see in the following log entries, the connection with the collector was established and is ready:

{"level":"info","ts":1565090391.1197968,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1565090391.1241517,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc00018d560, READY","system":"grpc","grpc_log":true}

So, looks like it's working ;-)

It is working, but I was expecting the agent to send traces/spans to both collectors; currently it sends to only one. I mean to say it is not load balanced. Am I missing something here?

You might not see round-robin load balancing, as gRPC will reuse the same pipe for multiple requests, but one easy way to check that it's working as expected is by killing one of the collectors. If the agent switches over to the remaining collector, the load balancing is working.

Oops! That is failover, right? That is not load balancing. I want to avoid a situation like this one: I have added Grafana images here, where collector1 reaches its maximum capacity while collector2 sits idle, and because of this we see span drops (of course, the implementation contains TChannel communication between agent and collector). So, as advised earlier in this issue, I am adding gRPC, but somehow I still do not see spans being load balanced between the two collectors.
[image: collector_load]
[image: collector_span_drop]

Let me know if I am doing anything wrong here.

I just checked the gRPC docs, and it seems that it should indeed be doing round-robin balancing:

It is worth noting that load-balancing within gRPC happens on a per-call basis, not a per-connection basis. In other words, even if all requests come from a single client, we still want them to be load-balanced across all servers.

Source: https://github.com/grpc/grpc/blob/master/doc/load-balancing.md

"of course the implementation contains tchannel communication between agent and collector"

What do you mean here? The communication between Agent and Collector should be via gRPC, not via TChannel.

@jkandasa, @kevinearls I think one of you ran some tests for this behavior in the past. Can you spot if there's anything missing here?

Just to clear up the confusion: the Grafana images that I posted show the production problem that I want to solve (agent and collector running on 1.8.x with TChannel).
Since it was recommended that with gRPC on the latest version (1.13.1) we could see traces/spans load balanced, I am trying the same in our dev environment to see whether the traffic is really load balanced. But somehow all the spans are going to one collector.
I am just concerned with how I can get my traffic load balanced, so that all the collectors do the work and drops are minimal.

I'm confused now: you are seeing load balanced traffic in production, but not on your dev environment?

My production version is 1.8, communicating over TChannel, and the Grafana images are from the production environment. They show that the load is unbalanced and that spans are dropped.

I want to check whether moving to 1.13.x with gRPC can solve the problem in production, and that is why I am trying 1.13.1 + gRPC in dev (the agent.yaml and log that I shared).
But in the dev environment I do not see load-balanced traffic either.

@jpkrohling In OpenShift, we create an additional service (jaegerqe-collector-headless) on the collector for gRPC load balancing, with clusterIP: None.

That's probably the trick that @prana24 is missing! Thanks @jkandasa!

Wow! I can't wait. @jkandasa, can you give me more information? Where do I get it? Any reference or topology?

@prana24 AFAIK, there is no specific example for creating a collector headless service. @objectiser can guide better here.
jaeger-operator creates a headless service by default. Reference in jaeger-operator code

I just copied and modified the collector service YAML from the service file generated by jaeger-operator.
I hope this will work (not tested).
The important line is spec.clusterIP: None.
You may add it to your existing service and test. If you create a new service named jaeger-collector-headless, do not forget to change the name on your agent.

apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector-headless
  labels:
    app: jaeger
    jaeger-infra: collector-service
spec:
  # Headless service: DNS returns the individual collector pod IPs
  clusterIP: None
  ports:
    - name: jaeger-collector-grpc
      port: 14250
      protocol: TCP
      targetPort: 14250
  selector:
    jaeger-infra: collector-pod
  type: ClusterIP
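
To go with the reminder about updating the agent, here is a sketch of the agent arguments pointing at a headless service named jaeger-collector-headless. The "sampling" namespace is taken from the agent.yaml posted earlier in this thread; whether the new service lives in that namespace is an assumption.

```yaml
# Sketch only: assumes the headless Service was created in the "sampling" namespace.
args:
  - "--reporter.type=grpc"
  - "--reporter.grpc.host-port=dns:///jaeger-collector-headless.sampling.svc.cluster.local:14250"
```

Because a headless service has no cluster IP, its DNS name resolves to the IPs of the collector pods themselves, so the gRPC resolver should report one address per collector instead of the single address seen in the agent log above.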

Thanks @jkandasa, I will give it a shot today.

I have the same problem!

The error is:
root@ubuntu-165:~/jaeger# kubectl logs productpage-v1-787bcf4b68-j88qj jaeger-agent | grep addrConn.createTransport
{"level":"info","ts":1574242860.4926956,"caller":"grpc/clientconn.go:1191","msg":"grpc: addrConn.createTransport failed to connect to {10.33.36.204:14250 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.33.36.204:14250: connect: connection refused\". Reconnecting...","system":"grpc","grpc_log":true}
{"level":"info","ts":1574242861.493872,"caller":"grpc/clientconn.go:1191","msg":"grpc: addrConn.createTransport failed to connect to {10.33.36.204:14250 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.33.36.204:14250: connect: connection refused\". Reconnecting...","system":"grpc","grpc_log":true}

But the network looks right:
root@ubuntu-165:~/jaeger# kubectl exec box2 -- nslookup my-jaeger-collector-headless.kube-system
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: my-jaeger-collector-headless.kube-system
Address 1: 10.33.36.204 10-33-36-204.my-jaeger-collector.kube-system.svc.cluster.local
root@ubuntu-165:~/jaeger# kubectl exec box2 -- nslookup my-jaeger-collector.kube-system.svc.cluster.local

@pujunYang could you please share what's your my-jaeger-collector-headless definition? kubectl get service my-jaeger-collector-headless -o yaml should do the trick. How are you setting it up? Is it via the Operator?

@jpkrohling Yes, I use the Operator to start Jaeger.

kubectl get service my-jaeger-collector-headless -n kube-system  -o yaml 
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "false"
  creationTimestamp: "2019-11-20T09:40:13Z"
  labels:
    app: jaeger
    app.kubernetes.io/component: service-collector
    app.kubernetes.io/instance: my-jaeger
    app.kubernetes.io/managed-by: jaeger-operator
    app.kubernetes.io/name: my-jaeger-collector
    app.kubernetes.io/part-of: jaeger
  name: my-jaeger-collector-headless
  namespace: kube-system
  ownerReferences:
  - apiVersion: jaegertracing.io/v1
    controller: true
    kind: Jaeger
    name: my-jaeger
    uid: c00e3485-0b79-11ea-ab62-5254006535e0
  resourceVersion: "1933"
  selfLink: /api/v1/namespaces/kube-system/services/my-jaeger-collector-headless
  uid: c0869851-0b79-11ea-ab62-5254006535e0
spec:
  clusterIP: None
  ports:
  - name: zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  - name: grpc
    port: 14250
    protocol: TCP
    targetPort: 14250
  - name: c-tchan-trft
    port: 14267
    protocol: TCP
    targetPort: 14267
  - name: c-binary-trft
    port: 14268
    protocol: TCP
    targetPort: 14268
  selector:
    app: jaeger
    app.kubernetes.io/component: collector
    app.kubernetes.io/instance: my-jaeger
    app.kubernetes.io/managed-by: jaeger-operator
    app.kubernetes.io/name: my-jaeger-collector
    app.kubernetes.io/part-of: jaeger
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

@jpkrohling This is the Jaeger.yaml:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger
  namespace: kube-system
spec:
  strategy: production # <1>
  allInOne:
    image: jaegertracing/all-in-one:latest # <2>
    options: # <3>
      log-level: debug # <4>
  storage:
    type: elasticsearch # <5>
    options: # <6>
      es: # <7>
        server-urls: http://elasticsearch-logging:9200
        tls:
          skip-host-verify: true
  ingress:
    enabled: false # <8>
  agent:
    strategy: DaemonSet # <9>
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: "" # <10>

I have another concern about load balancing.

We use "--reporter.grpc.host-port=dns:///jaeger-collector-gRPC.service.consul:14250" to get the list of collectors, which is working fine. All collectors receive spans.

The problem:
If we scale out the collectors, the agent never gets a new list.
This also means that if one or more collectors are removed or go offline, the list of collectors on the agents remains the same.

It seems it only resolves the list when the agent starts?
Or am I missing something?

Which version of Jaeger are you using, @parberge? We've bumped the gRPC client version in v1.20.0, which was recently released, and I know it has some improvements in this area, although I'm not 100% sure this case is covered. When fixing #2443, I remember reading that the gRPC client will get a new list of backends only when it runs out of healthy connections, but hopefully this newest gRPC client is smarter.

I'm closing, as I think this has been answered some time ago, but feel free to reopen if there are still questions.

Not 1.20 that's for sure.

Will test and create an issue if the problem remains. Thanks.

Hey guys - I am seeing some of the same issues. I am on v1.19 right now, so I will try the upgrade. But it is much like what some of the folks here are seeing: I am using the jaeger-operator and have the HPA set up with a min/max of 2/10, and when CPU gets hammered during our bot/soak tests the collectors scale up as expected, but the agent connections continue to fire spans down their already existing connections. So effectively it feels more like a fault-tolerance setup than a high-availability one. @parberge before I go super deep, did you see any positive changes with v1.20.0?

Haven't upgraded yet 😑

Feel free to reopen this issue if you see the same problem happening on v1.20.

Hi folks, unfortunately I see the same behavior as before. Running 1.20.0 container and agent (see below) via jaeger-operator.

When we run our bots firing off spans/traces, we can observe the following in our graphs. You can see below that we scaled to 4 collector instances, but the agents have no knowledge that they should reconnect, and they continue to saturate the collectors they are already connected to. The situation makes sense, but am I missing a configuration that lets the collectors notify the agents that they should reconnect when spans are being dropped?

[image]

Containers:
  jaeger-collector:
    Container ID:  docker://98c391498dc8cdb605f5d001350c2451d16800470a7e03c096cab7b808ff7b95
    Image:         jaegertracing/jaeger-collector:1.20.0

...

Containers:
  jaeger-agent-daemonset:
    Container ID:  docker://fa6d25021bc5531053b998789d34fcdb0520d2a8f7907125951c44feeb10ffa4
    Image:         jaegertracing/jaeger-agent:1.20.0

@jpkrohling will try to reopen, need to figure out how =)

I'll check what we can do, but I think the gRPC client might need some time to update the list of backends. In earlier versions, it would update only if all known backends were failing.

OK - I am letting this soak. This may also be something unique to how our bots are running, as they are spun up asynchronously in a single service, so it would make sense that they send traffic to a single agent and thus overload the collector it's connected to. Is the agent designed to have a single connection to a collector at a given point in time? If that's the case, this MAY be OK for us in production, when the bots are replaced with real traffic that gets load balanced across our edge service, thus distributing across the agents more naturally.

"Is the agent designed to have a single connection to a collector at a given point in time?"

I'd have to double-check with the gRPC client load balancer documentation, but I think that's indeed the case. The agent has a list of backends, but will only failover once its "current" backend fails.

@jkandasa do you remember from your load-tests what's the expected behavior here?
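
A possible mitigation for the scale-out case, not confirmed anywhere in this thread: cap the lifetime of gRPC connections on the collector side, so that agents are forced to reconnect periodically and pick up newly added collectors when they do. The sketch below assumes a collector release newer than the ones discussed here. The two flag names are an assumption to verify against jaeger-collector --help for your version, and the durations are arbitrary examples.

```yaml
# Sketch only: flag names must be checked against your collector version;
# the durations are illustrative, not recommendations.
containers:
  - name: jaeger-collector
    args:
      # Close each agent connection after roughly this long.
      - "--collector.grpc-server.max-connection-age=2m"
      # Extra time allowed for in-flight requests before the connection is closed.
      - "--collector.grpc-server.max-connection-age-grace=30s"
```

The trade-off is extra reconnect churn: a shorter age spreads agents across new collectors sooner, at the cost of more frequent connection setup.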
