Jaeger: JaegerAgent (1.9.2) running as DaemonSet not always forwarding spans to JaegerCollector (1.9.2)

Created on 10 Sep 2020  路  14Comments  路  Source: jaegertracing/jaeger

Describe the bug

With Jaeger 1.9.2 even Agent receives the log it does not forward logs to Collector (1.9.2). I set debug level on both collector and Agent. I see spans are received by Agent but it is not forwarded to Collector. After i kill same Agent (via kubectl delete -n observability jaeger-agent-... ) It forwards logs to Jaeger Collector.

To Reproduce
Steps to reproduce the behavior:

  1. Install jaeger agent 1.9.2 as DaemonSet with log level of Debug (to see an error)
  2. Send logs to DaemonSet (not via SideCar)
  3. See Jaeger-Agent received spans but not sending it to Jaeger-Collector 1.9.2

We are running Kafka but since this issue is Agent <--> Collector no details are given for Kafka as after killing and respawning Agent DaemonSet it works fine.

Expected behavior
Agent should forward spans to Collector even on DaemonSet mode without issues. This situation is flaky. Can't get a path to repeat it. It is something to do with connection from Agent to Collector I think.

Screenshots
If applicable, add screenshots to help explain your problem.

Version (please complete the following information):

  • OS: Ubuntu 16.04 (running as Azure AKS 1.16.13)
  • Jaeger version: 1.9.2
  • Deployment: Kubernetes 1.16.13 on AKS

What troubleshooting steps did you try?

stern --tail 20 -n observability jaeger-agent-daemo*
stern --tail 20 -n observability jaeger-collector-*

Additional context
I see logs at Jaeger Agent running as DaemonSet:

jaeger-agent-daemonset-xjjxk jaeger-agent-daemonset {"level":"debug","ts":1599743431.949259,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":649}
jaeger-agent-daemonset-xjjxk jaeger-agent-daemonset {"level":"debug","ts":1599743432.9496808,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":643}
jaeger-agent-daemonset-xjjxk jaeger-agent-daemonset {"level":"debug","ts":1599743433.9493701,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":644}
jaeger-agent-daemonset-xjjxk jaeger-agent-daemonset {"level":"debug","ts":1599743434.949377,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":652}
jaeger-agent-daemonset-xjjxk jaeger-agent-daemonset {"level":"debug","ts":1599743435.9492238,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":644}

but not on Collector.

After killing jaeger-agent pod spans are received by Collector and was sent to Storage as expected:

ce-id":"4c3dae19927fccf6","span-id":"0beb7aa3f306f3ff"}
jaeger-collector-959b7f564-85dck jaeger-collector {"level":"debug","ts":1599743435.6712654,"caller":"static/strategy_store.go:77","msg":"sampling strategy not found, using default","service":"executor"}
jaeger-collector-959b7f564-85dck jaeger-collector {"level":"debug","ts":1599743435.6715622,"caller":"app/span_processor.go:147","msg":"Span written to the storage by the collector","trace-id":"37c1963c629b5f83","span-id":"62499a50f177ae93"}
jaeger-collector-959b7f564-85dck jaeger-collector {"level":"debug","ts":1599743435.671604,"caller":"app/span_processor.go:147","msg":"Span written to the storage by the collector","trace-id":"37c1963c629b5f83","span-id":"37c1963c629b5f83"}
jaeger-collector-959b7f564-85dck jaeger-collector {"level":"debug","ts":1599743435.671599,"caller":"app/span_processor.go:147","msg":"Span written to the storage by the collector","trace-id":"05d426c26771f2a7","span-id":"05d426c26771f2a7"}
jaeger-collector-959b7f564-85dck jaeger-collector {"level":"debug","ts":1599743435.6716194,"caller":"app/span_processor.go:147","msg":"Span written to the storage by the collector","trace-id":"05d426c26771f2a7","span-id":"498394fccffe4a96"}
bug needs-info wontfix

All 14 comments

Our template for Jaeger CR for this issue is as follows:


apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger # For clarity we just used jaeger 
  namespace: {{seldon_tracing_ns}}
spec:
  strategy: streaming
  storage:
    type: elasticsearch 
    options:
      es:
        server-urls: {{es_protocol}}://{{es_host}}:{{es_port}}
        index-prefix: {{es_jaeger_prefix}}
        tls: 
          enabled: yes
          ca: /es/certificates/ca.crt
      log-level: debug 
    secretName: {{seldon_tracing_secret}}
  volumeMounts: 
    - name: certificates
      mountPath: /es/certificates/
      readOnly: true
  volumes:
    - name: certificates
      secret:
        secretName: {{seldon_tracing_secret_ca}}
  agent:
    strategy: DaemonSet
    options:
      log-level: debug 
  ingress:
    enabled: false  #Default true. But we need to create custom Ingress for Nginx Ingress and serving thru a subdir
    options:
      log-level: debug 
  collector:
    options:
      log-level: debug
      kafka: 
        producer:
          topic: {{seldon_tracing_kafka_topic}}
          brokers: {{seldon_tracing_kafka_name}}-kafka-bootstrap.{{seldon_tracing_kafka_ns}}:{{seldon_tracing_kafka_port}}
  ingester:
    options:
      log-level: debug
      kafka: 
        consumer:
          topic: {{seldon_tracing_kafka_topic}} 
          brokers: {{seldon_tracing_kafka_name}}-kafka-bootstrap.{{seldon_tracing_kafka_ns}}:{{seldon_tracing_kafka_port}}
      ingester:
        deadlockInterval: 0s #umber of minutes to wait for a message before terminating ingester 
  query:
    options:
      query:
        base-path: /{{seldon_tracing_path}}
      log-level: debug
  ui:
    options: 
      log-level: debug
      dependencies:
        menuEnabled: false # Default False

Could you get us the whole agent logs? I'm particularly interested in the gRPC-related entries. Also, are you able to reproduce it on plain minikube? If this is about networking issues in AKS, I won't be able to reproduce, as I don't have such an environment readily available.

After using helm-chart jaeger-operator-2.17.0 i dont see that issue again. So will watch for a few days and if i dont see this issue we can close it.

helm list -n observability
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
jaeger-operator observability   8               2020-09-15 11:05:01.307903837 +0000 UTC deployed        jaeger-operator-2.17.0  1.19.0

We have a similar problem, same version: jaeger-operator-2.17.0 App VERSION 1.19.0:
Agent + Collector have version 1.19.2.

There seems to be a problem with the agent:
Many times the agent does not receive any spans (or only 4-5 spans) from the client, right from the start of the agent. We tested different clients (Java, NodeJS), same behaviour. Loglevel=Debug, in Agent + Collector.

The agent seems to receive some spans 0 or (4 or 5), and than it stalls, nothing is logged anymore in the debug log (the 4-5 spans were logged as recieved), and now it is broken. After restarting the agent 2-3 times, it recieves all? spans, and sends them to the connector which sends them to Elasticsearch. In this case, it keeps on working and working..

The error is very hard to track down. We switched to an older versions of jaeger (1.18.1.), and this flaky behaviour is gone. Seems to be a bug in the receiving part of the agent in 1.19.0 (we use UDP+Thrift, standard java client lib, standard nodejs client lib).

Any hints how to debug this issue? Is there some easy way to remote-debug the golang agent, like a special docker-image with delve enabled? We would like to have a look at this.

We have a similar problem, same version: jaeger-operator-2.17.0 App VERSION 1.19.0:
Agent + Collector have version 1.19.2.

Could you please confirm the operator version? It probably is 1.17.0. Are you overriding the images in the CR, so that it uses 1.19.2? It probably won't matter for your case, but if you use the operator v1.19.x, it should automatically get you Jaeger v1.19.2.

Would you be open to use a sidecar for some deployments, and daemonset for others? I wonder if the deployment mode affects the behavior in this case, and if you could run some deployments with sidecars, that would help confirm this idea.

I would also appreciate if you could send us more details on how your infra looks like: how are your deployments configured? Is it like this:

      containers:
      - name: myapp
        image: acme/myapp:myversion
        env:
        - name: JAEGER_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP

like a special docker-image with delve enabled? We would like to have a look at this.

We don't have that, but we do have pprof endpoints that you can inspect:
http://localhost:14269/debug/pprof/

You should also take a look at the metrics:
http://localhost:14269/metrics

You can also use the LoggingReporter on the client side to give a positive confirmation that the client sent a batch to the agent. For most clients, it's sufficient to set JAEGER_REPORTER_LOG_SPANS to true as an env var.

Related to @bjung82

We used operator version 1.19.0.

We now changed it to operator version 1.18.1.
Related to connection issues with Elasticsearch, we changed the image version from the jaeager-collector and the jaeger jaeger-query to image version 19.0.0.
With this configuration options it works fine.

Images:
jaeager-operator: jaegertracing/jaeger-operator:1.18.1
jaeager-collector: jaegertracing/jaeger-collector:1.19.0
jaeager-query: jaegertracing/jaeger-query:1.19.0
jaeger-agent: jaegertracing/jaeger-agent:1.18.1

Starting with image jaegertracing/jaeger-agent:1.19.0 it stops working.

Another question, is it normal behavior that if there are already spans send when the agent is starting, that the agent will not record any spans?
Jaeger-Agent will then only start collecting spans if it is restarted.

I think the problems you are experiencing, @SimoneFalzone and @bjung82, isn't the same as the originally reported. It sound suspiciously similar to these: https://github.com/jaegertracing/jaeger-operator/issues/1175 and https://github.com/jaegertracing/jaeger/issues/2443 . Could you try with jaegertracing/jaeger-agent:latest, which should have the fix for the mentioned issue?

If possible, provide the bootstrap logs for your agent.

EDIT: actually, I think even the original report might be the same as the issues I linked above... @omerfsen, could you also try to use the latest agent and see if your problem is fixed?

@jpkrohling Ok let me try that.

Hi @jpkrohling Does Release 1.20.0 contains fix for this issue ?

It does contain a fix for jaegertracing/jaeger-operator#1175 and #2443. Whether those are the fixes for the problem you reported, I hope you'll tell me :-)

Ok i will try this and let you know after we get helm chart updated to have 1.20.0. Will also try the latest

Is there an ETA for jaegertracing/jaeger-operator that supports 1.20 ?

Probably something to raise in the Helm Chart repo: https://github.com/jaegertracing/helm-charts/issues/new

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Was this page helpful?
0 / 5 - 0 ratings