Jaeger: Memory peaks on agent v1.19.2

Created on 3 Sep 2020 · 44Comments · Source: jaegertracing/jaeger

Describe the bug
One group of jaeger agents (3 replicas), deployed on kubernetes. Not so loaded system - 500-700 spans per second.
Used v1.13.1 for year without any issues, upgraded all componenet to 1.19.2 (also replaced Tchannel with gRPC) and agent memory usage become unstable. Before upgrade agent instances used as low as 16 mb with 64 limit, but after upgrade memory usage peaks appeared and agent got oomkilled. I raised limits, but even 512 mb is not enough.

After few hours and tens of oomkills, I downgraded agent to 1.18.1 version and see no issues so far.

To Reproduce
Steps to reproduce the behavior:

Deploy agent v1.19.2
Send 500-700 spans per second
???
Observe memory usage peaks.

Expected behavior
Agent memory usage close to lower versions.

Screenshots

Version (please complete the following information):

OS: Kubernetes nodes based on Ubuntu 20.04 LTS, 5.4.0-39 kernel
Jaeger version: 1.19.2
Deployment: Kubernetes

What troubleshooting steps did you try?
Collected /metrics, will provide if needed.

Additional context
-

bug needs-info performance

Source

zigmund

All 44 comments

When you downgraded to 1.18.1, did you change any other configuration? Would you mind sharing your agent configuration?

jpkrohling on 3 Sep 2020

@jpkrohling the only agent parameter used before and after downgrade is --reporter.grpc.host-port=jaeger-collector:14250.

Also played for a while after got first OOMkills with parameters, but reverted changes singe they didn't helped much:

        - --processor.jaeger-binary.server-queue-size=...
        - --processor.jaeger-binary.workers=...
        - --processor.jaeger-compact.server-queue-size=...
        - --processor.jaeger-compact.workers=...
        - --processor.zipkin-compact.server-queue-size=...
        - --processor.zipkin-compact.workers=...

zigmund on 3 Sep 2020

👍1

I'll take a look at list of commits affecting the agent between 1.18.1 and 1.19.2, perhaps I can spot a suspect there. I'll also try to reproduce the problem here. Do you think it would be reproducible with minikube and a simple agent with 64Mi of requests , 256Mi limit? Jaeger's tracegen would probably generate enough load ;-)

jpkrohling on 3 Sep 2020

@zigmund I'll need your help here. I tried to reproduce your problem but didn't succeed. I have a local minikube cluster with 20Gi of memory available, initialized with:

$ minikube start --vm-driver kvm2 --cpus 6 --memory 20480 --container-runtime=crio --addons=ingress

Using the Jaeger Operator, I provisioned a Jaeger instance with the following spec:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: agent-as-daemonset
spec:
  agent:
    image: jaegertracing/jaeger-agent:latest
    strategy: DaemonSet
    resources:
      requests:
        memory: 64Mi
      limits:
        memory: 256Mi
  storage:
    options:
      memory:
        max-traces: 100000

And here's a trace generator, creating 1 trace per millisecond:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tracegen
  annotations:
    "sidecar.jaegertracing.io/inject": "false"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tracegen
  template:
    metadata:
      labels:
        app: tracegen
    spec:
      containers:
      - name: tracegen
        image: jaegertracing/jaeger-tracegen:1.19.2
        env:
        - name: JAEGER_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        args:
        - -pause=1ms
        - -duration=1h
        - -workers=1

The agent pod was up for the whole time, and the metrics shows me that the agent didn't even blink. I changed the Jaeger CR to use 1.18.1, to see if latest was in the same ballpark, and I couldn't spot any differences:

Screenshot_2020-09-03 Prometheus Time Series Collection and Processing Server

The first part of the graph is for an early test, where I set the tracegen to have 10 workers and no pause. I also forgot to set the pod limits and left for lunch. If I'm reading this data correctly, it shows about 200Mi of memory consumption for about 4M spans per minute.

The second part is jaegertracing/jaeger-agent:latest, with a 64Mi of requests and 256Mi of limit. The tracegen is with a pause of 1ms and 1 worker.

The third part is the same a second part, only with jaegertracing/jaeger-agent:1.18.1 as the image.

Here's how I need your help: are you able to reproduce your performance problems based on this setup? If you can't, could you perhaps then run a git bisect with the first good commit being v1.18.1 and the bad as being v1.19.2? Like:

$ git bisect start 
$ git bisect good v1.18.1
$ git bisect bad v1.19.2
$ make docker # this is the easiest, but will build everything, which might take a while
$ podman tag jaegertracing/jaeger-agent:latest quay.io/jpkroehling/jaeger-agent:issue2452-step1
$ podman push quay.io/jpkroehling/jaeger-agent:issue2452-step1

Change the last two commands accordingly, to tag and push to a namespace you own. At this point, configure your agent to use the image you just pushed.

Then, test it:

$ # test it
$ git bisect good # if the performance is good, or use "bad" if it shows the problem already

Rinse and repeat, until git says that it found the problematic commit.

If you don't have Jaeger source code locally to run the previous steps, you should probably do this first (see CONTRIBUTING.md):

$ git clone https://github.com/jaegertracing/jaeger.git
$ cd jaeger
$ git submodule update --init --recursive
$ make install-tools

jpkrohling on 3 Sep 2020

@jpkrohling I'll try to test with git bisect, but it will take some time.

zigmund on 4 Sep 2020

👍1

It's really hard to test good steps since it could take from minutes to much more time to catch first oom. But at least already reproduced oom's at the first two steps. At the moment I'm testing the third step for an hour already and I don't know if it is enough to be certain that the step is good.

zigmund on 4 Sep 2020

How fast did you get OOMs for the first two steps? If they are at around 15m, it might be safe to say that a commit is good if it's not crashing after 30m.

jpkrohling on 4 Sep 2020

@jpkrohling at the first two steps got first ooms in few minutes.
Got compile error at 4 and 5 steps, testing 6 step at the moment. It seems that https://github.com/jaegertracing/jaeger/commit/be17169a417954a347408708182519e267618500 affected memory usage behavior in out environment.

zigmund on 4 Sep 2020

@jpkrohling completed 6 step with hour+ test period.

be17169a417954a347408708182519e267618500 is the first bad commit
commit be17169a417954a347408708182519e267618500
Author: Pavol Loffay <[email protected]>
Date:   Fri Jun 26 16:11:45 2020 +0200

    Upgrade thrift to 0.13 (#2311)

    * Upgrade thrift to 0.13

    Signed-off-by: Pavol Loffay <[email protected]>

    * Add context to sampling managers

    Signed-off-by: Pavol Loffay <[email protected]>

zigmund on 4 Sep 2020

🚀1

cc @pavolloffay

jpkrohling on 4 Sep 2020

Here is the upgrade to thrift 0.13 PR https://github.com/jaegertracing/jaeger/pull/2311. The upgrade was done due to security issues in the old version and also it was needed in OpenTelemetry collector IIRC.

I am not sure what is causing the regression.

pavolloffay on 4 Sep 2020

@zigmund are you able to get us a memory profile for the running agent? The admin port has pprof endpoints, which you can browse like this:

$ go tool pprof -http localhost:6060 http://your-agent-host:14271/debug/pprof/heap

This will get a web UI on http://localhost:6060 for browsing your agent's memory.

jpkrohling on 7 Sep 2020

@jpkrohling I'll try but don't think it will help. Agent allocates large amount of memory and releases it next moment, so I have to catch that moment with profiler(?)

zigmund on 7 Sep 2020

That's true. Ideally, you'd capture it every N minutes, so that we'd get at least one profile when the memory is ramping up. There are a few experimental tools that can help with that, but I'm really not sure about their state, like conprof (cc @brancz).

Once we understand better where this memory usage is coming from, we can probably try to reproduce it ourselves, which is the first step to fix the problem.

jpkrohling on 8 Sep 2020

Conprof is not well optimized itself but used by various organizations so it's definitely worth a try :)

brancz on 8 Sep 2020

@jpkrohling lowered replicas to 1 and collected profile every second via forwarded port with curl:

while true; do curl http://localhost:14271/debug/pprof/heap -o `date +%s`.pb.gz; sleep 1; done

Raised limit to 2Gb to avoid oom but agent still been killed. :D
But at least I got few identical profiles right before oom. Here is one of them

zigmund on 9 Sep 2020

🎉1

Looks interesting

I'll try to reproduce it here again. Just to confirm: data is flowing from the agent to the collector, right?

jpkrohling on 9 Sep 2020

@jpkrohling yes, agent sends data to collector.

zigmund on 9 Sep 2020

@zigmund it's still quite difficult to tell what's going on. The profile you sent seems to indicate that all memory is being allocated in the ReadString from Thrift's Compact protocol, which seems to indicate that you have a span bigger than 930MiB which I doubt is the case, especially because a batch has a max size equal to the limits of a UDP packet, typically 64KiB. Either that, or quite a bunch of spans are locked in that function, which also seems unlikely.

I tried to put as much pressure as I could on the agent, but it keeps with a minimal memory usage. I don't think Elasticsearch is part of the equation here, but I used it anyway as the backend, without luck. Even my memory profile is quite different than yours: every time I get a snapshot, it's mostly showing 1MiB here or 512KiB there, which is in line with the inflight data (like the image at the end).

Could you also get us a goroutine snapshot? If we have a single call to ReadString, it would probably be apparent there.

I would also like to ask you for your exact runtime configuration, like:

where are you deploying it? Bare metal? VM? Kubernetes?
the exact command used to start your Collector (or Deployment, in case of Kubernetes, or the Jaeger CR in case you provisioned it using the Jaeger Operator)
same for the Agent
which client are you using? Jaeger Go Client? OpenTelemetry? Something else?

jpkrohling on 9 Sep 2020

@jpkrohling

Could you also get us a goroutine snapshot?

Do you mean profile from /debug/pprof/goroutine?debug=2?

where are you deploying it? Bare metal? VM? Kubernetes?

Kubernetes

the exact command used to start your Collector (or Deployment, in case of Kubernetes, or the Jaeger CR in case you provisioned it using the Jaeger Operator)

Env vars passed via configmap:

SPAN_STORAGE_TYPE=elasticsearch
ES_SERVER_URLS=http://es-node1:9200,http://es-node2:9200,http://es-node3:9200
ES_NUM_SHARDS=3
ES_NUM_REPLICAS=0
ES_MAX_SPAN_AGE=24h

same for the Agent

Single argument: --reporter.grpc.host-port=jaeger-collector:14250

which client are you using? Jaeger Go Client? OpenTelemetry? Something else?

golang - https://github.com/jaegertracing/jaeger-client-go
node.js - https://github.com/jaegertracing/jaeger-client-node
php - https://github.com/jukylin/jaeger-php
nginx - https://github.com/opentracing-contrib/nginx-opentracing

zigmund on 9 Sep 2020

Do you mean profile from /debug/pprof/goroutine?debug=2?

Yes, please

golang - https://github.com/jaegertracing/jaeger-client-go
node.js - https://github.com/jaegertracing/jaeger-client-node
php - https://github.com/jukylin/jaeger-php
nginx - https://github.com/opentracing-contrib/nginx-opentracing

Would it be possible to have one agent per client type? If we notice that it only happens with spans from one specific client, we might have a lead...

jpkrohling on 9 Sep 2020

Yes, please

I'll try to catch memory eating profile with method I used before.

Would it be possible to have one agent per client type? If we notice that it only happens with spans from one specific client, we might have a lead...

I don't think it's possible since we're have tens of services pointed to same jaeger agent endpoint. I probably could configure nginx and node.js services by myself... but all other services' are out of my control.

zigmund on 9 Sep 2020

👍1

@jpkrohling Tried to collect goroutine profile containing thrift.(*TCompactProtocol).ReadString for few hours with no luck. :(
Memory allocations can be seen via pprof for a time till GC comes, I think. But gouroutine profile have to be collected right at the moment(?).

zigmund on 14 Sep 2020

Maybe I should make custom build with some debug logs in thrift.(*TCompactProtocol).ReadString? But I'll need help since I'm not so strong in golang.

zigmund on 14 Sep 2020

That is generated code and is being called for almost every field in every object: it would probably generate a lot of noise. We have a log statement at debug level around this place already:

https://github.com/jaegertracing/jaeger/blob/df2582d5331f8b0931a88e0a7c4c32ef788561f0/cmd/agent/app/processors/thrift_processor.go#L120

Are you seeing it often, or does it seem stuck right before the process crashes? But just in case, I built an image for the agent which would add two new prints to the console:

diff --git a/thrift-gen/jaeger/jaeger.go b/thrift-gen/jaeger/jaeger.go
index 1e8cb9f3..dcbeb44a 100644
--- a/thrift-gen/jaeger/jaeger.go
+++ b/thrift-gen/jaeger/jaeger.go
@@ -1855,6 +1855,7 @@ func (p *Batch) Read(iprot thrift.TProtocol) error {
         if err := p.ReadField1(iprot); err != nil {
           return err
         }
+       fmt.Printf("processing batch for service %q\n", p.Process.GetServiceName())
         issetProcess = true
       } else {
         if err := iprot.Skip(fieldTypeId); err != nil {
@@ -1910,6 +1911,7 @@ func (p *Batch) Read(iprot thrift.TProtocol) error {
   if !issetSpans{
     return thrift.NewTProtocolExceptionWithType(thrift.INVALID_DATA, fmt.Errorf("Required field Spans is not set"));
   }
+  fmt.Printf("processed batch for service %q\n", p.Process.GetServiceName())
   return nil
 }

This will print the service name for a batch before processing its spans, and after it finished parsing the batch: it's reading one span from the batch that seems to be breaking, so, I expect you to eventually see one print but not the other. The hypothesis I had earlier will confirm if you end up with the name of the service for all crashes, or at least with services using the same client type.

Image with these extra prints: quay.io/jpkroehling/jaeger-agent:issue2452-extra-verbose

jpkrohling on 14 Sep 2020

@jpkrohling launched agent from your image, enabled debug and see tons of logs:

{"level":"debug","ts":1600148307.8616223,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":847}
{"level":"debug","ts":1600148307.863402,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":847}
{"level":"debug","ts":1600148307.8717968,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":25111}

But don't see any lines processing batch for service...

Also we have some logs about bad spans:

{"level":"error","ts":1600148418.579937,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
{"level":"error","ts":1600148418.816928,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 27 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
{"level":"error","ts":1600148433.9862144,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Required field TraceIdLow is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
{"level":"error","ts":1600148451.748264,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Required field TraceIdLow is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}

But they been there for long time before upgrade.

zigmund on 15 Sep 2020

launched agent from your image, enabled debug and see tons of logs:

What's the last Span(s) received by the agent entry you see before the crash?

But don't see any lines processing batch for service...

Something might have happened between building the binary and placing it in the image. Anyway, here's a new one, which I confirmed to work: quay.io/jpkroehling/jaeger-agent:issue2452-extra-verbose-2:

$ podman run -p 6831/udp quay.io/jpkroehling/jaeger-agent:issue2452-extra-verbose-2 --reporter.grpc.host-port=192.168.2.114:14250 --log-level=debug
...
{"level":"info","ts":1600158147.7916539,"caller":"command-line-arguments/main.go:78","msg":"Starting agent"}
{"level":"info","ts":1600158147.7916827,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
...
{"level":"debug","ts":1600158156.744476,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":376}
processing batch for service "jaeger-query"
processed batch for service "jaeger-query"
{"level":"debug","ts":1600158156.7448108,"caller":"reporter/client_metrics.go:177","msg":"received batch from a new client, starting to keep stats","client-uuid":"5b9222a904b62ac6"}

Also we have some logs about bad spans:

Do you know which clients/services are sending them? Those failures are happening very close to where you are seeing the memory explode. How does the metric jaeger_agent_thrift_udp_t_processor_handler_errors_total looks like? Are they all part of model="jaeger",protocol="compact", or do you see some other variants there? How about jaeger_agent_thrift_udp_server_packets_processed_total?

jpkrohling on 15 Sep 2020

@zigmund, do you have any news for us? I'm really curious on how this develops :-)

jpkrohling on 16 Sep 2020

@jpkrohling sorry, I'm now in business trip for few days. Will test as soon as possible.

zigmund on 16 Sep 2020

@jpkrohling, checked metrics.

sum(rate(jaeger_agent_thrift_udp_t_processor_handler_errors_total[1m])) by (model, protocol):

{model="jaeger",protocol="binary"} | 0
{model="jaeger",protocol="compact"} | 0.06667950988310203
{model="zipkin",protocol="compact"} | 0

sum(rate(jaeger_agent_thrift_udp_server_packets_processed_total[1m])) by (model, protocol):

{model="jaeger",protocol="binary"} | 0.35553778133262237
{model="jaeger",protocol="compact"} | 73.39500099980003
{model="zipkin",protocol="compact"} | 0

Currently unable to test debug image v2, will try to do it later.

zigmund on 17 Sep 2020

👍1

@jpkrohling, I think I found two agent-breaking services, both are php-written. Will try to catch breaking traces, but don't know how at the moment. :D

zigmund on 17 Sep 2020

🎉1

I think I found two agent-breaking services, both are php-written

That's actually great news. This means that the Agent might not be at fault here at all for common cases. We still need to figure out what type of incoming data is causing the problem, but I believe it does change the criticality of the issue.

Will try to catch breaking traces, but don't know how at the moment.

I can't help much here, but perhaps you can start by checking which operations are expected to exist vs. which ones actually exist. The missing one(s) might be worth double-checking.

jpkrohling on 17 Sep 2020

start by checking which operations are expected to exist vs. which ones actually exist.

These two services are pretty large monolith php applications, traces often consist of hundreds spans, so It's near impossible to catch missing.

zigmund on 21 Sep 2020

Alright, let us know once you have more info then, or if there's anything we can do on our side to help you.

jpkrohling on 21 Sep 2020

@zigmund, were you able to find a pattern already? I'm worried that we might have a bigger issue, as bad clients should never be able to bring down the agent...

jpkrohling on 24 Sep 2020

@jpkrohling, unfortunately, didn't found any additional info. :(

zigmund on 24 Sep 2020

Are you still experiencing those problems, or have you disabled tracing for the services that are causing problems?

jpkrohling on 24 Sep 2020

@jpkrohling I'm using your extra verbose image for debug and reverting back to 1.18.1 for regular usage. The services' tracing is enabled.

zigmund on 24 Sep 2020

Got it. We might have newer versions of the models/protobuf soon, which _might_ help here. I'll ping you once we have it ready, in case you are able to try it out before we release it.

jpkrohling on 24 Sep 2020

OK, I'm ready to test.

zigmund on 24 Sep 2020

@zigmund We released 1.20 some days ago. Would you be able to try it out?

jpkrohling on 9 Oct 2020

@jpkrohling checked 1.20 - same behavior as 1.19.2

zigmund on 12 Oct 2020

👍1

Thanks for the confirmation. I'm leaving this open, but unless there's new information, there's not much we can do to reproduce and finally fix this problem.

jpkrohling on 12 Oct 2020

Hi @zigmund
Thank you so much because of this issue. I'm facing to the same problem. I'm also using jukylin/jaeger-php for my project. After a long day for debugging, I figure out that the reason is about my data. It isn't all data, just only malformed data - where percentage is very small. So it is really difficult to debug if you are on production environment.
You can read this comment #2638 (comment). Thank you so much @jpkrohling
Hope you can resolve this issue.