Onpremise: Post Process Forwarder - KafkaError "Offset Out of Range"

Created on 7 May 2020 · 25Comments · Source: getsentry/onpremise

After a while of running, my post process forwarder appear to be having issues with communication to Kafka:

Traceback (most recent call last):
  File "/usr/local/bin/sentry", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/__init__.py", line 164, in main
    cli(prog_name=get_prog(), obj={}, max_content_width=100)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/decorators.py", line 68, in inner
    return ctx.invoke(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/decorators.py", line 30, in inner
    return ctx.invoke(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/commands/run.py", line 318, in post_process_forwarder
    initial_offset_reset=options["initial_offset_reset"],
  File "/usr/local/lib/python2.7/site-packages/sentry/utils/services.py", line 104, in <lambda>
    context[key] = (lambda f: lambda *a, **k: getattr(self, f)(*a, **k))(key)
  File "/usr/local/lib/python2.7/site-packages/sentry/eventstream/kafka/backend.py", line 184, in run_post_process_forwarder
    raise Exception(error)
Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}

Oddly, my Snuba consumer is working fine and Sentry appears to be running as expected and I see the Snuba consumer running in the Kafka logs.

Can someone give me some suggestions on what to look for to figure out what the post-processor is looking for?

Docs In Progress

Source

DandyDeveloper

👍2

Most helpful comment

I have a solution:

Log on to Kafka container
docker exec -it onpremise_kafka_1 bash
Receive consumers list
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --list
Get group info
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor -describe
Watching what gonna happend with offset by using dry-run (optional)
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latest
Set offset to latest and execute
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latest --execute
Profit

This solution work the same way for other Snuba components (just replace group and\or topic)

rmisyurev on 30 Jul 2020

❤21 👍13 🚀12

All 25 comments

Can you try modifying this line to:

command: run post-process-forwarder --commit-batch-size 1 --initial-offset-reset earliest

and see if it helps?

BYK on 7 May 2020

@BYK Thank you, I'll have a try on the manifest and update it.

Just an FYI I did fix this by effectively destroying the volume for the Kafka backend and recreate it.

I also noticed my volume size is pretty low for the amount of tasks we have coming in. Does Kafka have an inherent GC that cleans things up? It looks as though the disk was filling upwards of 8GB.

DandyDeveloper on 8 May 2020

@BYK It came back, looks like with both earliest and latest offsets I get the same error. This is POST fixing things, so it's unexpectedly broken itself again

Trying to think of what I can do to fix it.

DandyDeveloper on 11 May 2020

Hello,
I am getting same issue, and it eventually kills sentry, it simply stops logging any issues.
Been going on for some time
I can give you access to a Digital Ocean Droplet where this is happening, if that will help.
I upped the instance to 8gb memory, to test if resources is issue.

https://sentry.uptactics.com/share/issue/2a83806a230e4c4196f86c83890ad62e/

ProxiBlue on 5 Jun 2020

I upped the instance to 8gb memory, to test if resources is issue.

I think that helped. Today (since 7h ago) the error stopped.

Selection_320

The readme/docs note at least 2GB memory. Seems I needed 8GB to stop the error. Possible resource related?

ProxiBlue on 5 Jun 2020

Thanks @DandyDeveloper for outlining your solution path. We had the same issue and were also able to resolve it for now by destroying the persistent volume of Kafka. Until now, we didn't have the issue occurring again like it did in your case.

joscha-alisch on 19 Jun 2020

I have a solution:

Log on to Kafka container
docker exec -it onpremise_kafka_1 bash
Receive consumers list
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --list
Get group info
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor -describe
Watching what gonna happend with offset by using dry-run (optional)
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latest
Set offset to latest and execute
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latest --execute
Profit

This solution work the same way for other Snuba components (just replace group and\or topic)

rmisyurev on 30 Jul 2020

❤21 👍13 🚀12

@rmisyurev This actually worked for me, thank you very much! But to be honest I don't know why.

The first time I got an error message like this:

Error: Assignments can only be reset if the group 'snuba-post-processor' is inactive, but the current state is Stable.

I just stopped all containers, then only started zookeeper and kafka containers, executed the statement above and stopped both containers again.
After that I started all containers and everything was fine.

I am sure it is not the right solution, so be careful, but for me it worked.

flokoe on 31 Jul 2020

🎉5 🚀1 ❤1

@rmisyurev - thanks for sharing the steps that helped you! I'll make sure to add these to some future documentation.

BYK on 3 Aug 2020

Thanks so much for the helpful comments here, especially @rmisyurev — I looked at your post with the kafka-consumer-groups commands at least 30 times when debugging this.

This issue has been plaguing me for weeks, and I tried everything including adding a ton of consumer replicas, which didn't help at all. What I _think_ has finally solved it (in addition to giving Kafka 8GB of RAM) was actually setting the --commit-batch-size argument on the post-processor. The Helm chart for Sentry defaults this to 1 for some reason, which is way too low (see here). I set it to 100 and for the first time I'm seeing the lag go down rather than up. Wanted to share this in case anyone else runs across this issue.

jdotjdot on 30 Nov 2020

👍1

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Accepted, I will leave it alone ... forever!

"A weed is but an unloved flower." ― _Ella Wheeler Wilcox_ 🥀

github-actions[bot] on 4 Jan 2021

Hi, thanks for all the quick fixes, but unfortunately none of them fixed it for us. Currently Sentry runs ok for a few days with:
--commit-batch-size 20 --initial-offset-reset earliest.
The first problem i'am experiencing is that the 2FA check fails, the rest seems to operate fine.

The only way to get the system fully operational again is deleting everything related to kafka (volumes etc) and rebuilding it.

Any suggestions on how to make sentry run more stable?

bbreijer on 6 Jan 2021

👍2

Same issue here on 20.11.1. I did all the steps mentioned by @rmisyurev, but to no avail.
Now upgrading sentry to the latest version and redoing all the steps. Hardware is an intel i7-4770@ 3.40Ghz with 16GB on Docker under windows 10 (using WSL), so that shouldn't be an issue for +/- 10k events per day I would say? The sentry worker is running with 2x100% CPU every now or then still.

@bbreijer where do you run this setting?

EDIT: Update tot 21.1.0 did not solve the issue. sentry_onpremise_snuba-consumer_1, sentry_onpremise_snuba-outcomes-consumer_1, sentry_onpremise_snub-transactions-consumer_1 all keep restarting all the time.

EDIT2: I changed the commit-batch-size in the docker-compose.yml file.

gabn88 on 18 Jan 2021

EDIT2: I changed the commit-batch-size in the docker-compose.yml file.

That is where I also changed it.

bbreijer on 18 Jan 2021

👍1

Edited the commit-batch-size and removed all snuba (kafka) instances, but still:

f.__message = self.__consumer.poll(timeout=1.0)

File "/usr/src/snuba/snuba/subscriptions/consumer.py", line 120, in poll

message = self.__consumer.poll(timeout)

File "/usr/src/snuba/snuba/utils/streams/synchronized.py", line 217, in poll

message = self.__consumer.poll(timeout)

File "/usr/src/snuba/snuba/utils/streams/backends/kafka.py", line 400, in poll

raise OffsetOutOfRange(str(error))

snuba.utils.streams.backends.abstract.OffsetOutOfRange: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
````

errors all the time and continuous restarts.

EDIT:
Ah found out that I had to do:

````
kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --execute
```` 
etc..
But how do I know which topic is used?

Stuck at these issues that keep coming back:

exec gosu snuba snuba subscriptions --auto-offset-reset=latest --consumer-group=snuba-events-subscriptions-consumers --topic=events --result-topic=events-subscription-results --dataset=events --commit-log-topic=snuba-commit-log --commit-log-group=snuba-consumers --delay-seconds=60 --schedule-ttl=60

2021-01-18 15:13:33,298 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 5877628}

2021-01-18 15:13:33,299 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...

EDIT: Fixed this by:

kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-events-subscriptions-consumers --topic events --reset-offsets --to-latest --execute

exec gosu snuba snuba consumer --storage outcomes_raw --auto-offset-reset=earliest --max-batch-time-ms 750

2021-01-18 15:27:59,791 New partitions assigned: {Partition(topic=Topic(name='outcomes'), index=0): 6250846}

2021-01-18 15:27:59,795 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...

EDIT: Tried:

kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group outcomes_raw --topic outcomes --reset-offsets --to-latest --execute

and:

kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group outcomes_raw --topic events --reset-offsets --to-latest --execute
```
but this issues still persists with the sentry_onpremise_snuba_outcomes-consumer_1 restarting every time. The other containers do stay up now though :)

gabn88 on 18 Jan 2021

👀1

I have the same issue, how I can fix this?

kfkawalec on 18 Jan 2021

@kfkawalec
You have to look at the error. And then per error determine which snuba-consumers to update and update it manually.

gabn88 on 18 Jan 2021

@gabn88 thanks a lot for sharing your solution here. It is a bit hard to follow though, do you think this is something we can implement in the repo to save other's some hassle and time?

BYK on 19 Jan 2021

@BYK No, the issue is consumers actively registered to the topic/group MUST be killed before you can actually run through and reset the offsets. Because some consumers may still be running.

Unfortunately, I think this is a deeper issue with the Snuba components being bad at "finding" where they left off in the case of disconnects/other criteria for failure. Although that's difficult to prove.

This is housekeeping that needs to be performed manually for the time being, the best that can be done is provide clear instructions on how to resolve it (which ideally should be provided by Sentry themselves, or contributed to their docs).

The best way I've found it using similar steps as @gabn88 above. Specifically, running into Kafka and resetting the topic/consumer group that has problems.

DandyDeveloper on 20 Jan 2021

@BYK I knew I recognised your name! As a follow up, since you're actively part of the Sentry dev cycle. Is this something Sentry is aware of with the Snuba components?

DandyDeveloper on 20 Jan 2021

Unfortunately, I think this is a deeper issue with the Snuba components being bad at "finding" where they left off in the case of disconnects/other criteria for failure. Although that's difficult to prove.

Pinging @lynnagara to represent @getsentry/sns here and prove you right or wrong 🙂

This is housekeeping that needs to be performed manually for the time being, the best that can be done is provide clear instructions on how to resolve it (which ideally should be provided by Sentry themselves, or contributed to their docs).

We are collecting some common cases and solutions to document. This may even be a fix-kafka-snuba.sh script if I understand correctly.

As a follow up, since you're actively part of the Sentry dev cycle. Is this something Sentry is aware of with the Snuba components?

As far as we are concerned (and I am aware), we don't have these issues so I don't think Snuba is at fault here. Typically this happens when there's a burst of events and then Kafka just rolls over, causing offsets to got lost or invalid and then hell breaks lose. Sounds familiar? 😁

@BYK I knew I recognised your name!

From where tho, have we met before and I am being a terrible person for not recognizing your GitHub handle? 😅

BYK on 20 Jan 2021

👍1

@BYK I'm not trying to point fingers, I agree it looks like a combination of Kafka exploding from an excessive number of events and the consumer being unable to start back up and finds its way.

I think adding an inherent failover mechanism to the Snuba consumers for this would work for everyone. Though the biggest problem would be coordinating consumers to terminate their connections to Kafka whilst the "problem child" resolves its offset issue. I'm sure there's a bunch more things as well.

From where tho

Sentry forums from a while back. Thank you for all your service!

Stupidity Alert: I thought I was in the Sentry chart repository which is why I was being more defensive about introducing an automated fix in the chart not realising I am in the onprem repo.... I am an idiot.

DandyDeveloper on 21 Jan 2021

Hi @DandyDeveloper ,
a couple of clarifications about the behavior you are observing. You see an OFFSET_OUT_OF_RANGE from the post-processor while the primary snuba consumer is fine from your first message.
This is not a surprising behavior if your ingestion rate is too high for the ingestion pipeline as the snuba consumer processes messages in parallel while the post processor does not yet.
https://getsentry.github.io/event-ingestion-graph/ for details. See that the post-process is downstream with respect to the snuba consumer.

What is likely to happen and that I would like you to confirm is that the post process cannot keep up with the events ingestion rate. Events keep piling up until Kafka drops the old ones for running out of space and the post process consumer does not have a committed offset to start from anymore. Would you share the amount of events you ingest per second to ensure this is the scenario we are talking about?

Now, one point that was not discussed was that resetting the offset of the post process consumer means loosing data. The events post processing feature would never run on the events that were dropped and all those that were skipped. This is not a behavior snuba would take automatically. I'd argue it should not as this kind of data loss should not happen transparently.

Now what to do about it?

good news is that the parallel consumer is going to be deployed for post processing soonish. It is likely to happen within H1, that would make it much faster at processing events.
The reason the post process cannot keep up (assuming this was your issue) is that by default the onpremise deployment creates only one partition, which means events are strictly consumed sequentially. Do you send all your load on one single project or is it spread across multiple projects? If your load is already spread among multiple projects you can increase the partitions number of the events topic in kafka. At this point if you start multiple instances of the post process consumer (post-process-forwarder ) these will consume the partitions in parallel, and you will increase your throughput.
If your load is entirely in one project the situation is more complex. Every project is assigned to one and only one partition, which means that increasing partitions, if you have one project only, won't help. In this case is there any way you can divide your load across multiple sentry projects?

Happy to provide further clarifications.

Filippo

fpacifici on 22 Jan 2021

@fpacifici Thank you for the reply. Just to be clear, I have overcome these issues for the time being myself. My comments and observations were in response to @BYK and just giving my 2 cents on the situation as other users are experiencing this.

I am almost certain this was the case though, and for everyone else above, that the ingestion rate was far below expectations on and Kafka dropped them. I'll see if I can fetch some numbers for you, but I have since made sure events were scaled down (as individuals were misusing the Sentry instance and misunderstood what is classified as an event ;) ). As a result, I don't really have these problems anymore.

I believe Snuba having the capabilities to at least self-diagnose to inform the user on what may have happened could be helpful here though, or even documentation that covers the ingestion workflow and what people should "look out for". I understand that self-healing might not be ideal because that could increase the loss of documents.

Parallel consumption could well solve this issue for most individuals, presumably this means a single consumer could be fed more resources in order to open more threads to processing events? OR, do you mean we could have multiple consumers on a single topic?

Either way, in my case, and other using Kubernetes based deployments. It'll make things much easier to scale out.

DandyDeveloper on 22 Jan 2021

@DandyDeveloper I came back to mention potential data loss but @fpacifici beat me to it. I think we can at least show people the option with a big red warning: we can automatically recover but that will mean you'll lose all the events in the past x hours. Transparency and easiness to understand the system is quite important, I agree.