After a while of running, my post process forwarder appear to be having issues with communication to Kafka:
Traceback (most recent call last):
File "/usr/local/bin/sentry", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/site-packages/sentry/runner/__init__.py", line 164, in main
cli(prog_name=get_prog(), obj={}, max_content_width=100)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/sentry/runner/decorators.py", line 68, in inner
return ctx.invoke(f, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/sentry/runner/decorators.py", line 30, in inner
return ctx.invoke(f, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/sentry/runner/commands/run.py", line 318, in post_process_forwarder
initial_offset_reset=options["initial_offset_reset"],
File "/usr/local/lib/python2.7/site-packages/sentry/utils/services.py", line 104, in <lambda>
context[key] = (lambda f: lambda *a, **k: getattr(self, f)(*a, **k))(key)
File "/usr/local/lib/python2.7/site-packages/sentry/eventstream/kafka/backend.py", line 184, in run_post_process_forwarder
raise Exception(error)
Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
Oddly, my Snuba consumer is working fine and Sentry appears to be running as expected and I see the Snuba consumer running in the Kafka logs.
Can someone give me some suggestions on what to look for to figure out what the post-processor is looking for?
Can you try modifying this line to:
command: run post-process-forwarder --commit-batch-size 1 --initial-offset-reset earliest
and see if it helps?
@BYK Thank you, I'll have a try on the manifest and update it.
Just an FYI I did fix this by effectively destroying the volume for the Kafka backend and recreate it.
I also noticed my volume size is pretty low for the amount of tasks we have coming in. Does Kafka have an inherent GC that cleans things up? It looks as though the disk was filling upwards of 8GB.
@BYK It came back, looks like with both earliest and latest offsets I get the same error. This is POST fixing things, so it's unexpectedly broken itself again
Trying to think of what I can do to fix it.
Hello,
I am getting same issue, and it eventually kills sentry, it simply stops logging any issues.
Been going on for some time
I can give you access to a Digital Ocean Droplet where this is happening, if that will help.
I upped the instance to 8gb memory, to test if resources is issue.
https://sentry.uptactics.com/share/issue/2a83806a230e4c4196f86c83890ad62e/
I upped the instance to 8gb memory, to test if resources is issue.
I think that helped. Today (since 7h ago) the error stopped.

The readme/docs note at least 2GB memory. Seems I needed 8GB to stop the error. Possible resource related?
Thanks @DandyDeveloper for outlining your solution path. We had the same issue and were also able to resolve it for now by destroying the persistent volume of Kafka. Until now, we didn't have the issue occurring again like it did in your case.
I have a solution:
docker exec -it onpremise_kafka_1 bashkafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --listkafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor -describekafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latestkafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latest --executeThis solution work the same way for other Snuba components (just replace group and\or topic)
@rmisyurev This actually worked for me, thank you very much! But to be honest I don't know why.
The first time I got an error message like this:
Error: Assignments can only be reset if the group 'snuba-post-processor' is inactive, but the current state is Stable.
I just stopped all containers, then only started zookeeper and kafka containers, executed the statement above and stopped both containers again.
After that I started all containers and everything was fine.
I am sure it is not the right solution, so be careful, but for me it worked.
@rmisyurev - thanks for sharing the steps that helped you! I'll make sure to add these to some future documentation.
Thanks so much for the helpful comments here, especially @rmisyurev — I looked at your post with the kafka-consumer-groups commands at least 30 times when debugging this.
This issue has been plaguing me for weeks, and I tried everything including adding a ton of consumer replicas, which didn't help at all. What I _think_ has finally solved it (in addition to giving Kafka 8GB of RAM) was actually setting the --commit-batch-size argument on the post-processor. The Helm chart for Sentry defaults this to 1 for some reason, which is way too low (see here). I set it to 100 and for the first time I'm seeing the lag go down rather than up. Wanted to share this in case anyone else runs across this issue.
This issue has gone three weeks without activity. In another week, I will close it.
But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Accepted, I will leave it alone ... forever!
"A weed is but an unloved flower." โ _Ella Wheeler Wilcox_ ๐ฅ
Hi, thanks for all the quick fixes, but unfortunately none of them fixed it for us. Currently Sentry runs ok for a few days with:
--commit-batch-size 20 --initial-offset-reset earliest.
The first problem i'am experiencing is that the 2FA check fails, the rest seems to operate fine.
The only way to get the system fully operational again is deleting everything related to kafka (volumes etc) and rebuilding it.
Any suggestions on how to make sentry run more stable?
Same issue here on 20.11.1. I did all the steps mentioned by @rmisyurev, but to no avail.
Now upgrading sentry to the latest version and redoing all the steps. Hardware is an intel i7-4770@ 3.40Ghz with 16GB on Docker under windows 10 (using WSL), so that shouldn't be an issue for +/- 10k events per day I would say? The sentry worker is running with 2x100% CPU every now or then still.
@bbreijer where do you run this setting?
EDIT: Update tot 21.1.0 did not solve the issue. sentry_onpremise_snuba-consumer_1, sentry_onpremise_snuba-outcomes-consumer_1, sentry_onpremise_snub-transactions-consumer_1 all keep restarting all the time.
EDIT2: I changed the commit-batch-size in the docker-compose.yml file.
EDIT2: I changed the commit-batch-size in the docker-compose.yml file.
That is where I also changed it.
Edited the commit-batch-size and removed all snuba (kafka) instances, but still:
f.__message = self.__consumer.poll(timeout=1.0)
File "/usr/src/snuba/snuba/subscriptions/consumer.py", line 120, in poll
message = self.__consumer.poll(timeout)
File "/usr/src/snuba/snuba/utils/streams/synchronized.py", line 217, in poll
message = self.__consumer.poll(timeout)
File "/usr/src/snuba/snuba/utils/streams/backends/kafka.py", line 400, in poll
raise OffsetOutOfRange(str(error))
snuba.utils.streams.backends.abstract.OffsetOutOfRange: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
````
errors all the time and continuous restarts.
EDIT:
Ah found out that I had to do:
````
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --execute
````
etc..
But how do I know which topic is used?
Stuck at these issues that keep coming back:
2021-01-18 15:13:33,298 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 5877628}
2021-01-18 15:13:33,299 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...
EDIT: Fixed this by:
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-events-subscriptions-consumers --topic events --reset-offsets --to-latest --execute
2021-01-18 15:27:59,791 New partitions assigned: {Partition(topic=Topic(name='outcomes'), index=0): 6250846}
2021-01-18 15:27:59,795 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...
EDIT: Tried:
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group outcomes_raw --topic outcomes --reset-offsets --to-latest --execute
and:
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group outcomes_raw --topic events --reset-offsets --to-latest --execute
```
but this issues still persists with the sentry_onpremise_snuba_outcomes-consumer_1 restarting every time. The other containers do stay up now though :)
I have the same issue, how I can fix this?
@kfkawalec
You have to look at the error. And then per error determine which snuba-consumers to update and update it manually.
@gabn88 thanks a lot for sharing your solution here. It is a bit hard to follow though, do you think this is something we can implement in the repo to save other's some hassle and time?
@BYK No, the issue is consumers actively registered to the topic/group MUST be killed before you can actually run through and reset the offsets. Because some consumers may still be running.
Unfortunately, I think this is a deeper issue with the Snuba components being bad at "finding" where they left off in the case of disconnects/other criteria for failure. Although that's difficult to prove.
This is housekeeping that needs to be performed manually for the time being, the best that can be done is provide clear instructions on how to resolve it (which ideally should be provided by Sentry themselves, or contributed to their docs).
The best way I've found it using similar steps as @gabn88 above. Specifically, running into Kafka and resetting the topic/consumer group that has problems.
@BYK I knew I recognised your name! As a follow up, since you're actively part of the Sentry dev cycle. Is this something Sentry is aware of with the Snuba components?
Unfortunately, I think this is a deeper issue with the Snuba components being bad at "finding" where they left off in the case of disconnects/other criteria for failure. Although that's difficult to prove.
Pinging @lynnagara to represent @getsentry/sns here and prove you right or wrong ๐
This is housekeeping that needs to be performed manually for the time being, the best that can be done is provide clear instructions on how to resolve it (which ideally should be provided by Sentry themselves, or contributed to their docs).
We are collecting some common cases and solutions to document. This may even be a fix-kafka-snuba.sh script if I understand correctly.
As a follow up, since you're actively part of the Sentry dev cycle. Is this something Sentry is aware of with the Snuba components?
As far as we are concerned (and I am aware), we don't have these issues so I don't think Snuba is at fault here. Typically this happens when there's a burst of events and then Kafka just rolls over, causing offsets to got lost or invalid and then hell breaks lose. Sounds familiar? ๐
@BYK I knew I recognised your name!
From where tho, have we met before and I am being a terrible person for not recognizing your GitHub handle? ๐
@BYK I'm not trying to point fingers, I agree it looks like a combination of Kafka exploding from an excessive number of events and the consumer being unable to start back up and finds its way.
I think adding an inherent failover mechanism to the Snuba consumers for this would work for everyone. Though the biggest problem would be coordinating consumers to terminate their connections to Kafka whilst the "problem child" resolves its offset issue. I'm sure there's a bunch more things as well.
From where tho
Sentry forums from a while back. Thank you for all your service!
Stupidity Alert: I thought I was in the Sentry chart repository which is why I was being more defensive about introducing an automated fix in the chart not realising I am in the onprem repo.... I am an idiot.
Hi @DandyDeveloper ,
a couple of clarifications about the behavior you are observing. You see an OFFSET_OUT_OF_RANGE from the post-processor while the primary snuba consumer is fine from your first message.
This is not a surprising behavior if your ingestion rate is too high for the ingestion pipeline as the snuba consumer processes messages in parallel while the post processor does not yet.
https://getsentry.github.io/event-ingestion-graph/ for details. See that the post-process is downstream with respect to the snuba consumer.
What is likely to happen and that I would like you to confirm is that the post process cannot keep up with the events ingestion rate. Events keep piling up until Kafka drops the old ones for running out of space and the post process consumer does not have a committed offset to start from anymore. Would you share the amount of events you ingest per second to ensure this is the scenario we are talking about?
Now, one point that was not discussed was that resetting the offset of the post process consumer means loosing data. The events post processing feature would never run on the events that were dropped and all those that were skipped. This is not a behavior snuba would take automatically. I'd argue it should not as this kind of data loss should not happen transparently.
Now what to do about it?
events topic in kafka. At this point if you start multiple instances of the post process consumer (post-process-forwarder ) these will consume the partitions in parallel, and you will increase your throughput.Happy to provide further clarifications.
Filippo
@fpacifici Thank you for the reply. Just to be clear, I have overcome these issues for the time being myself. My comments and observations were in response to @BYK and just giving my 2 cents on the situation as other users are experiencing this.
I am almost certain this was the case though, and for everyone else above, that the ingestion rate was far below expectations on and Kafka dropped them. I'll see if I can fetch some numbers for you, but I have since made sure events were scaled down (as individuals were misusing the Sentry instance and misunderstood what is classified as an event ;) ). As a result, I don't really have these problems anymore.
I believe Snuba having the capabilities to at least self-diagnose to inform the user on what may have happened could be helpful here though, or even documentation that covers the ingestion workflow and what people should "look out for". I understand that self-healing might not be ideal because that could increase the loss of documents.
Parallel consumption could well solve this issue for most individuals, presumably this means a single consumer could be fed more resources in order to open more threads to processing events? OR, do you mean we could have multiple consumers on a single topic?
Either way, in my case, and other using Kubernetes based deployments. It'll make things much easier to scale out.
@DandyDeveloper I came back to mention potential data loss but @fpacifici beat me to it. I think we can at least show people the option with a big red warning: we can automatically recover but that will mean you'll lose all the events in the past x hours. Transparency and easiness to understand the system is quite important, I agree.
Sentry forums from a while back. Thank you for all your service!
Oh, thanks a lot. Quite literally my duty but still happy that it was a memorable experience for you :)
Most helpful comment
I have a solution:
docker exec -it onpremise_kafka_1 bashkafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --listkafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor -describekafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latestkafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latest --executeThis solution work the same way for other Snuba components (just replace group and\or topic)