Describe the bug
One group of jaeger agents (3 replicas), deployed on kubernetes. Not so loaded system - 500-700 spans per second.
Used v1.13.1 for year without any issues, upgraded all componenet to 1.19.2 (also replaced Tchannel with gRPC) and agent memory usage become unstable. Before upgrade agent instances used as low as 16 mb with 64 limit, but after upgrade memory usage peaks appeared and agent got oomkilled. I raised limits, but even 512 mb is not enough.
After few hours and tens of oomkills, I downgraded agent to 1.18.1 version and see no issues so far.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Agent memory usage close to lower versions.
Screenshots

Version (please complete the following information):
What troubleshooting steps did you try?
Collected /metrics, will provide if needed.
Additional context
-
When you downgraded to 1.18.1, did you change any other configuration? Would you mind sharing your agent configuration?
@jpkrohling the only agent parameter used before and after downgrade is --reporter.grpc.host-port=jaeger-collector:14250.
Also played for a while after got first OOMkills with parameters, but reverted changes singe they didn't helped much:
- --processor.jaeger-binary.server-queue-size=...
- --processor.jaeger-binary.workers=...
- --processor.jaeger-compact.server-queue-size=...
- --processor.jaeger-compact.workers=...
- --processor.zipkin-compact.server-queue-size=...
- --processor.zipkin-compact.workers=...
I'll take a look at list of commits affecting the agent between 1.18.1 and 1.19.2, perhaps I can spot a suspect there. I'll also try to reproduce the problem here. Do you think it would be reproducible with minikube and a simple agent with 64Mi of requests , 256Mi limit? Jaeger's tracegen would probably generate enough load ;-)
@zigmund I'll need your help here. I tried to reproduce your problem but didn't succeed. I have a local minikube cluster with 20Gi of memory available, initialized with:
$ minikube start --vm-driver kvm2 --cpus 6 --memory 20480 --container-runtime=crio --addons=ingress
Using the Jaeger Operator, I provisioned a Jaeger instance with the following spec:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: agent-as-daemonset
spec:
agent:
image: jaegertracing/jaeger-agent:latest
strategy: DaemonSet
resources:
requests:
memory: 64Mi
limits:
memory: 256Mi
storage:
options:
memory:
max-traces: 100000
And here's a trace generator, creating 1 trace per millisecond:
apiVersion: apps/v1
kind: Deployment
metadata:
name: tracegen
annotations:
"sidecar.jaegertracing.io/inject": "false"
spec:
replicas: 1
selector:
matchLabels:
app: tracegen
template:
metadata:
labels:
app: tracegen
spec:
containers:
- name: tracegen
image: jaegertracing/jaeger-tracegen:1.19.2
env:
- name: JAEGER_AGENT_HOST
valueFrom:
fieldRef:
fieldPath: status.hostIP
args:
- -pause=1ms
- -duration=1h
- -workers=1
The agent pod was up for the whole time, and the metrics shows me that the agent didn't even blink. I changed the Jaeger CR to use 1.18.1, to see if latest was in the same ballpark, and I couldn't spot any differences:

The first part of the graph is for an early test, where I set the tracegen to have 10 workers and no pause. I also forgot to set the pod limits and left for lunch. If I'm reading this data correctly, it shows about 200Mi of memory consumption for about 4M spans per minute.
The second part is jaegertracing/jaeger-agent:latest, with a 64Mi of requests and 256Mi of limit. The tracegen is with a pause of 1ms and 1 worker.
The third part is the same a second part, only with jaegertracing/jaeger-agent:1.18.1 as the image.
Here's how I need your help: are you able to reproduce your performance problems based on this setup? If you can't, could you perhaps then run a git bisect with the first good commit being v1.18.1 and the bad as being v1.19.2? Like:
$ git bisect start
$ git bisect good v1.18.1
$ git bisect bad v1.19.2
$ make docker # this is the easiest, but will build everything, which might take a while
$ podman tag jaegertracing/jaeger-agent:latest quay.io/jpkroehling/jaeger-agent:issue2452-step1
$ podman push quay.io/jpkroehling/jaeger-agent:issue2452-step1
Change the last two commands accordingly, to tag and push to a namespace you own. At this point, configure your agent to use the image you just pushed.
Then, test it:
$ # test it
$ git bisect good # if the performance is good, or use "bad" if it shows the problem already
Rinse and repeat, until git says that it found the problematic commit.
If you don't have Jaeger source code locally to run the previous steps, you should probably do this first (see CONTRIBUTING.md):
$ git clone https://github.com/jaegertracing/jaeger.git
$ cd jaeger
$ git submodule update --init --recursive
$ make install-tools
@jpkrohling I'll try to test with git bisect, but it will take some time.
It's really hard to test good steps since it could take from minutes to much more time to catch first oom. But at least already reproduced oom's at the first two steps. At the moment I'm testing the third step for an hour already and I don't know if it is enough to be certain that the step is good.
How fast did you get OOMs for the first two steps? If they are at around 15m, it might be safe to say that a commit is good if it's not crashing after 30m.
@jpkrohling at the first two steps got first ooms in few minutes.
Got compile error at 4 and 5 steps, testing 6 step at the moment. It seems that https://github.com/jaegertracing/jaeger/commit/be17169a417954a347408708182519e267618500 affected memory usage behavior in out environment.
@jpkrohling completed 6 step with hour+ test period.
be17169a417954a347408708182519e267618500 is the first bad commit
commit be17169a417954a347408708182519e267618500
Author: Pavol Loffay <[email protected]>
Date: Fri Jun 26 16:11:45 2020 +0200
Upgrade thrift to 0.13 (#2311)
* Upgrade thrift to 0.13
Signed-off-by: Pavol Loffay <[email protected]>
* Add context to sampling managers
Signed-off-by: Pavol Loffay <[email protected]>
cc @pavolloffay
Here is the upgrade to thrift 0.13 PR https://github.com/jaegertracing/jaeger/pull/2311. The upgrade was done due to security issues in the old version and also it was needed in OpenTelemetry collector IIRC.
I am not sure what is causing the regression.
@zigmund are you able to get us a memory profile for the running agent? The admin port has pprof endpoints, which you can browse like this:
$ go tool pprof -http localhost:6060 http://your-agent-host:14271/debug/pprof/heap
This will get a web UI on http://localhost:6060 for browsing your agent's memory.
@jpkrohling I'll try but don't think it will help. Agent allocates large amount of memory and releases it next moment, so I have to catch that moment with profiler(?)
That's true. Ideally, you'd capture it every N minutes, so that we'd get at least one profile when the memory is ramping up. There are a few experimental tools that can help with that, but I'm really not sure about their state, like conprof (cc @brancz).
Once we understand better where this memory usage is coming from, we can probably try to reproduce it ourselves, which is the first step to fix the problem.
Conprof is not well optimized itself but used by various organizations so it's definitely worth a try :)
@jpkrohling lowered replicas to 1 and collected profile every second via forwarded port with curl:
while true; do curl http://localhost:14271/debug/pprof/heap -o `date +%s`.pb.gz; sleep 1; done
Raised limit to 2Gb to avoid oom but agent still been killed. :D
But at least I got few identical profiles right before oom. Here is one of them
Looks interesting

I'll try to reproduce it here again. Just to confirm: data is flowing from the agent to the collector, right?
@jpkrohling yes, agent sends data to collector.
@zigmund it's still quite difficult to tell what's going on. The profile you sent seems to indicate that all memory is being allocated in the ReadString from Thrift's Compact protocol, which seems to indicate that you have a span bigger than 930MiB which I doubt is the case, especially because a batch has a max size equal to the limits of a UDP packet, typically 64KiB. Either that, or quite a bunch of spans are locked in that function, which also seems unlikely.
I tried to put as much pressure as I could on the agent, but it keeps with a minimal memory usage. I don't think Elasticsearch is part of the equation here, but I used it anyway as the backend, without luck. Even my memory profile is quite different than yours: every time I get a snapshot, it's mostly showing 1MiB here or 512KiB there, which is in line with the inflight data (like the image at the end).
Could you also get us a goroutine snapshot? If we have a single call to ReadString, it would probably be apparent there.
I would also like to ask you for your exact runtime configuration, like:
Deployment, in case of Kubernetes, or the Jaeger CR in case you provisioned it using the Jaeger Operator)
@jpkrohling
Could you also get us a goroutine snapshot?
Do you mean profile from /debug/pprof/goroutine?debug=2?
where are you deploying it? Bare metal? VM? Kubernetes?
Kubernetes
the exact command used to start your Collector (or Deployment, in case of Kubernetes, or the Jaeger CR in case you provisioned it using the Jaeger Operator)
Env vars passed via configmap:
SPAN_STORAGE_TYPE=elasticsearch
ES_SERVER_URLS=http://es-node1:9200,http://es-node2:9200,http://es-node3:9200
ES_NUM_SHARDS=3
ES_NUM_REPLICAS=0
ES_MAX_SPAN_AGE=24h
same for the Agent
Single argument: --reporter.grpc.host-port=jaeger-collector:14250
which client are you using? Jaeger Go Client? OpenTelemetry? Something else?
golang - https://github.com/jaegertracing/jaeger-client-go
node.js - https://github.com/jaegertracing/jaeger-client-node
php - https://github.com/jukylin/jaeger-php
nginx - https://github.com/opentracing-contrib/nginx-opentracing
Do you mean profile from /debug/pprof/goroutine?debug=2?
Yes, please
golang - https://github.com/jaegertracing/jaeger-client-go
node.js - https://github.com/jaegertracing/jaeger-client-node
php - https://github.com/jukylin/jaeger-php
nginx - https://github.com/opentracing-contrib/nginx-opentracing
Would it be possible to have one agent per client type? If we notice that it only happens with spans from one specific client, we might have a lead...
Yes, please
I'll try to catch memory eating profile with method I used before.
Would it be possible to have one agent per client type? If we notice that it only happens with spans from one specific client, we might have a lead...
I don't think it's possible since we're have tens of services pointed to same jaeger agent endpoint. I probably could configure nginx and node.js services by myself... but all other services' are out of my control.
@jpkrohling Tried to collect goroutine profile containing thrift.(*TCompactProtocol).ReadString for few hours with no luck. :(
Memory allocations can be seen via pprof for a time till GC comes, I think. But gouroutine profile have to be collected right at the moment(?).
Maybe I should make custom build with some debug logs in thrift.(*TCompactProtocol).ReadString? But I'll need help since I'm not so strong in golang.
That is generated code and is being called for almost every field in every object: it would probably generate a lot of noise. We have a log statement at debug level around this place already:
Are you seeing it often, or does it seem stuck right before the process crashes? But just in case, I built an image for the agent which would add two new prints to the console:
diff --git a/thrift-gen/jaeger/jaeger.go b/thrift-gen/jaeger/jaeger.go
index 1e8cb9f3..dcbeb44a 100644
--- a/thrift-gen/jaeger/jaeger.go
+++ b/thrift-gen/jaeger/jaeger.go
@@ -1855,6 +1855,7 @@ func (p *Batch) Read(iprot thrift.TProtocol) error {
if err := p.ReadField1(iprot); err != nil {
return err
}
+ fmt.Printf("processing batch for service %q\n", p.Process.GetServiceName())
issetProcess = true
} else {
if err := iprot.Skip(fieldTypeId); err != nil {
@@ -1910,6 +1911,7 @@ func (p *Batch) Read(iprot thrift.TProtocol) error {
if !issetSpans{
return thrift.NewTProtocolExceptionWithType(thrift.INVALID_DATA, fmt.Errorf("Required field Spans is not set"));
}
+ fmt.Printf("processed batch for service %q\n", p.Process.GetServiceName())
return nil
}
This will print the service name for a batch before processing its spans, and after it finished parsing the batch: it's reading one span from the batch that seems to be breaking, so, I expect you to eventually see one print but not the other. The hypothesis I had earlier will confirm if you end up with the name of the service for all crashes, or at least with services using the same client type.
Image with these extra prints: quay.io/jpkroehling/jaeger-agent:issue2452-extra-verbose
@jpkrohling launched agent from your image, enabled debug and see tons of logs:
{"level":"debug","ts":1600148307.8616223,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":847}
{"level":"debug","ts":1600148307.863402,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":847}
{"level":"debug","ts":1600148307.8717968,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":25111}
But don't see any lines processing batch for service...
Also we have some logs about bad spans:
{"level":"error","ts":1600148418.579937,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
{"level":"error","ts":1600148418.816928,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: *jaeger.Span field 27 read error: don't know what type: 14","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
{"level":"error","ts":1600148433.9862144,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Required field TraceIdLow is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
{"level":"error","ts":1600148451.748264,"caller":"processors/thrift_processor.go:123","msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error reading struct: Required field TraceIdLow is not set","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:123\ngithub.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2\n\tgithub.com/jaegertracing/jaeger/cmd/agent/app/processors/thrift_processor.go:87"}
But they been there for long time before upgrade.
launched agent from your image, enabled debug and see tons of logs:
What's the last Span(s) received by the agent entry you see before the crash?
But don't see any lines processing batch for service...
Something might have happened between building the binary and placing it in the image. Anyway, here's a new one, which I confirmed to work: quay.io/jpkroehling/jaeger-agent:issue2452-extra-verbose-2:
$ podman run -p 6831/udp quay.io/jpkroehling/jaeger-agent:issue2452-extra-verbose-2 --reporter.grpc.host-port=192.168.2.114:14250 --log-level=debug
...
{"level":"info","ts":1600158147.7916539,"caller":"command-line-arguments/main.go:78","msg":"Starting agent"}
{"level":"info","ts":1600158147.7916827,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
...
{"level":"debug","ts":1600158156.744476,"caller":"processors/thrift_processor.go:120","msg":"Span(s) received by the agent","bytes-received":376}
processing batch for service "jaeger-query"
processed batch for service "jaeger-query"
{"level":"debug","ts":1600158156.7448108,"caller":"reporter/client_metrics.go:177","msg":"received batch from a new client, starting to keep stats","client-uuid":"5b9222a904b62ac6"}
Also we have some logs about bad spans:
Do you know which clients/services are sending them? Those failures are happening very close to where you are seeing the memory explode. How does the metric jaeger_agent_thrift_udp_t_processor_handler_errors_total looks like? Are they all part of model="jaeger",protocol="compact", or do you see some other variants there? How about jaeger_agent_thrift_udp_server_packets_processed_total?
@zigmund, do you have any news for us? I'm really curious on how this develops :-)
@jpkrohling sorry, I'm now in business trip for few days. Will test as soon as possible.
@jpkrohling, checked metrics.
sum(rate(jaeger_agent_thrift_udp_t_processor_handler_errors_total[1m])) by (model, protocol):
{model="jaeger",protocol="binary"} | 0
{model="jaeger",protocol="compact"} | 0.06667950988310203
{model="zipkin",protocol="compact"} | 0
sum(rate(jaeger_agent_thrift_udp_server_packets_processed_total[1m])) by (model, protocol):
{model="jaeger",protocol="binary"} | 0.35553778133262237
{model="jaeger",protocol="compact"} | 73.39500099980003
{model="zipkin",protocol="compact"} | 0
Currently unable to test debug image v2, will try to do it later.
@jpkrohling, I think I found two agent-breaking services, both are php-written. Will try to catch breaking traces, but don't know how at the moment. :D
I think I found two agent-breaking services, both are php-written
That's actually great news. This means that the Agent might not be at fault here at all for common cases. We still need to figure out what type of incoming data is causing the problem, but I believe it does change the criticality of the issue.
Will try to catch breaking traces, but don't know how at the moment.
I can't help much here, but perhaps you can start by checking which operations are expected to exist vs. which ones actually exist. The missing one(s) might be worth double-checking.
start by checking which operations are expected to exist vs. which ones actually exist.
These two services are pretty large monolith php applications, traces often consist of hundreds spans, so It's near impossible to catch missing.
Alright, let us know once you have more info then, or if there's anything we can do on our side to help you.
@zigmund, were you able to find a pattern already? I'm worried that we might have a bigger issue, as bad clients should never be able to bring down the agent...
@jpkrohling, unfortunately, didn't found any additional info. :(
Are you still experiencing those problems, or have you disabled tracing for the services that are causing problems?
@jpkrohling I'm using your extra verbose image for debug and reverting back to 1.18.1 for regular usage. The services' tracing is enabled.
Got it. We might have newer versions of the models/protobuf soon, which _might_ help here. I'll ping you once we have it ready, in case you are able to try it out before we release it.
OK, I'm ready to test.
@zigmund We released 1.20 some days ago. Would you be able to try it out?
@jpkrohling checked 1.20 - same behavior as 1.19.2
Thanks for the confirmation. I'm leaving this open, but unless there's new information, there's not much we can do to reproduce and finally fix this problem.
Hi @zigmund
Thank you so much because of this issue. I'm facing to the same problem. I'm also using jukylin/jaeger-php for my project. After a long day for debugging, I figure out that the reason is about my data. It isn't all data, just only malformed data - where percentage is very small. So it is really difficult to debug if you are on production environment.
You can read this comment #2638 (comment). Thank you so much @jpkrohling
Hope you can resolve this issue.