Onpremise: Worker / celery stops working

Created on 17 Aug 2020  Â·  55Comments  Â·  Source: getsentry/onpremise

After upgrading to 20.8.0.dev 069e8ccd events stop showing up in the frontend sporadically.

The UI shows Background workers haven't checked in recently. It seems that you have a backlog of 71 tasks. Either your workers aren't running or you need more capacity.

The worker log shows:

11:42:28 [INFO] sentry.deletions.async: object.delete.executed (object_id=26394L model=u'Group' transaction_id=u'2109ad23955d4c59                               91283be5016442b7')
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1047, in on_readable
    self.cycle.on_readable(fileno)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 344, in on_readable
    chan.handlers[type]()
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 721, in _brpop_read
    **options)
  File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 680, in parse_response
    response = connection.read_response()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 403, in read_response
    (e.args,))
ConnectionError: Error while reading from socket: ('Connection closed by server.',)
11:43:29 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
Restoring 13 unacknowledged message(s)

Restarting the worker reprocesses the events and works for a short time. The connection won't be re-established as printed.

Besides fixing a potential bug while re-establishing the connection the worker should exit in order for docker restart policy to kick in as a last resort.

Most helpful comment

For anyone, who is looking for a fast solution:

Gotta say, the 0 * * * * cd /opt/sentry && docker-compose restart worker 2> /dev/null solution works pretty well with 20.8 😄

All 55 comments

Same problem here.

btw: Health-Checks in the compose-file would be good to restart the worker automatically on such errors

Maybe updating celery will help regarding to https://github.com/celery/celery/issues/3932

I’m having the same problem in the last few weeks.

Restarting docker-compose down && docker-compose up -d helps but not for a long time.

Hey all,

We suspect this to be due to a recent Celery upgrade. We have submitted a revert over at getsentry/sentry#20531. If you can test this out by running: SENTRY_IMAGE=us.gcr.io/sentryio/sentry:8e03c697cd50ceba9e73ae5801729f86624c6989 ./install.sh and report back here, it would really help us determine how to proceed with @wedamija.

Thanks a lot for your cooperation and apologies for the inconvenience!

Did:

git pull
SENTRY_IMAGE=us.gcr.io/sentryio/sentry:8e03c697cd50ceba9e73ae5801729f86624c6989
./install.sh
docker-compose up -d

Seems like it’s working fine now. At least, docker-compose logs -f isn’t yielding much about failed jobs and /manage/queue/ page isn’t showing huge amount of jobs in Global Throughput graph.

@maximal Are things still running well after a day? If so, I'll look into bumping to latest Celery and see whether it helps.

@wedamija, I cannot definitely say it’s completely fine, but the system behaves in much more stable way.

Once during the day it yielded a standard queue overflow message (and then stopped receiving jobs):

Background workers haven't checked in recently. It seems that you have a backlog of 2382 tasks. Either your workers aren't running or you need more capacity.

I restarted Sentry’s docker containers, and it went okay. At least I don’t have to restart those twice an hour.

This is what I see regularly on the worker, after restart it continues to run for 20-40 minutes:

worker_1                       | 11:27:04 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Restoring 31 unacknowledged message(s)
worker_1                       | Traceback (most recent call last):
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
worker_1                       |     blueprint.start(self)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
worker_1                       |     step.start(parent)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
worker_1                       |     c.loop(*c.loop_args())
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
worker_1                       |     next(loop)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 276, in create_loop
worker_1                       |     tick_callback()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1040, in on_poll_start
worker_1                       |     cycle_poll_start()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 322, in on_poll_start
worker_1                       |     self._register_BRPOP(channel)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 308, in _register_BRPOP
worker_1                       |     channel._brpop_start()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 714, in _brpop_start
worker_1                       |     self.client.connection.send_command('BRPOP', *keys)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 610, in send_command
worker_1                       |     self.send_packed_command(self.pack_command(*args))
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 603, in send_packed_command
worker_1                       |     (errno, errmsg))
worker_1                       | ConnectionError: Error 32 while writing to socket. Broken pipe.
worker_1                       | 11:27:13 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Restoring 31 unacknowledged message(s)

Another thing that happened to me with 8e03c697cd50ceba9e73ae5801729f86624c6989, redis server consumes ton of memory. I've restricted it now to 4G - it was eating all RAM up to 11G before.

Some solution?

Follow-up: I'm restarting worker from time to time now, but this morning processing stopped because post-process-forwarder was not connected.

post-process-forwarder_1       | %3|1599378769.399|FAIL|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Disconnected
post-process-forwarder_1       | %3|1599378769.404|ERROR|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Disconnected
post-process-forwarder_1       | %3|1599378769.404|ERROR|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: 1/1 brokers are down
post-process-forwarder_1       | %3|1599378769.399|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Disconnected
post-process-forwarder_1       | %3|1599378769.404|ERROR|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Disconnected
post-process-forwarder_1       | %3|1599378769.404|ERROR|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: 1/1 brokers are down
post-process-forwarder_1       | %3|1599378769.513|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#10.17.6.10:9092 failed: Connection refused
post-process-forwarder_1       | %3|1599378769.513|ERROR|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#10.17.6.10:9092 failed: Connection refused
post-process-forwarder_1       | %3|1599378769.513|FAIL|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#10.17.6.10:9092 failed: Connection refused
post-process-forwarder_1       | %3|1599378769.513|ERROR|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#10.17.6.10:9092 failed: Connection refused
post-process-forwarder_1       | %3|1599378773.518|FAIL|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Failed to resolve 'kafka:9092': Name or service not known
post-process-forwarder_1       | %3|1599378773.518|ERROR|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Failed to resolve 'kafka:9092': Name or service not known
post-process-forwarder_1       | %3|1599378773.519|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Failed to resolve 'kafka:9092': Name or service not known
post-process-forwarder_1       | %3|1599378773.519|ERROR|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Failed to resolve 'kafka:9092': Name or service not known
post-process-forwarder_1       | %3|1599378774.519|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#10.17.6.10:9092 failed: Connection refused
post-process-forwarder_1       | %3|1599378774.519|ERROR|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#10.17.6.10:9092 failed: Connection refused
post-process-forwarder_1       | %3|1599378774.521|FAIL|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#10.17.6.10:9092 failed: Connection refused
post-process-forwarder_1       | %3|1599378774.521|ERROR|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#10.17.6.10:9092 failed: Connection refused

Without restarting kafka, but only post-process-forwarder Sentry started to process again.

@e2-robert this is interesting, could it be that when you restarted kafka, it got a new IP address and other services failed due to DNS caching?

Okay, a new build is out with an upgrade (getsentry/sentry#20625). Can anyone try

SENTRY_IMAGE=us.gcr.io/sentryio/sentry:583756a81710fa11a0a19017654dbc09b390ab65 ./install.sh

@BYK works for me

@Madcat148 - nice! If it still works after a few days, I think we have a winner here.

@BYK, works fine so far.

I updated to this around 10PM last night, and my worker stopped processing events just after midnight.

Has anyone else seen this on the 583756a81710fa11a0a19017654dbc09b390ab65 release?

For anyone, who is looking for a fast solution:

Gotta say, the 0 * * * * cd /opt/sentry && docker-compose restart worker 2> /dev/null solution works pretty well with 20.8 😄

I'm now on 583756a81710fa11a0a19017654dbc09b390ab65 as well, and the worker stopped with:

worker_1                       | %3|1599726789.977|ERROR|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: 1 request(s) timed out: disconnect
worker_1                       | %3|1599726789.977|ERROR|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: 1/1 brokers are down
worker_1                       | %3|1599726789.977|ERROR|rdkafka#producer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: 1 request(s) timed out: disconnect
worker_1                       | %3|1599726789.977|ERROR|rdkafka#producer-2| [thrd:kafka:9092/bootstrap]: 1/1 brokers are down

Restarting the worker manually continued processing of events (without touching Kafka).

@e2-robert this is interesting, could it be that when you restarted kafka, it got a new IP address and other services failed due to DNS caching?

I did not restart kafka back then...

Same problem with kafka - both on 20.7.0 and 20.8.0.
Tried to connect to different kafka clusters w/ different version - the same situation:

```
%3|1599734018.282|FAIL|rdkafka#producer-2| [thrd:kfk-3:6667/2638]: kfk-3:6667/2638: 1 request(s) timed out: disconnect
%3|1599734018.282|ERROR|rdkafka#producer-2| [thrd:kfk-3:6667/2638]: kfk-3:6667/2638: 1 request(s) timed out: disconnect
````

583756a81710fa11a0a19017654dbc09b390ab65 is working fine for about 24 hours by this time without any restarts.

Based on feedback here it looks like upgrading celery to latest likely fixes the celery related issues. @Madcat148 is it still working for you?

I suspect these Kafka timeouts are a separate issue.

@giggsey Could you post any logs you have after events stop processing? Just want to confirm whether this is Kafka or Celery related.

Based on feedback here it looks like upgrading celery to latest likely fixes the celery related issues. @Madcat148 is it still working for you?

I suspect these Kafka timeouts are a separate issue.

It still ok, but have to say (not sure if it's related) that the previous failure happened during a burst of errors, and now traffic is lower.

worker_1                       | ConnectionError: Error 32 while writing to socket. Broken pipe.
worker_1                       | 09:05:52 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Traceback (most recent call last):
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
worker_1                       |     blueprint.start(self)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
worker_1                       |     step.start(parent)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
worker_1                       |     c.loop(*c.loop_args())
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
worker_1                       |     next(loop)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
worker_1                       |     cb(*cbargs)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1047, in on_readable
worker_1                       |     self.cycle.on_readable(fileno)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 344, in on_readable
worker_1                       |     chan.handlers[type]()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 721, in _brpop_read
worker_1                       |     **options)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 680, in parse_response
worker_1                       |     response = connection.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response
worker_1                       |     response = self._parser.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 403, in read_response
worker_1                       |     (e.args,))
worker_1                       | ConnectionError: Error while reading from socket: ('Connection closed by server.',)
worker_1                       | 11:36:44 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Restoring 23 unacknowledged message(s)

Redis logs appear normal and last logs in kafka are 3 hours before this. Any other logs you need?

@giggsey Those logs are helpful, thanks. And just confirming, this is on the 583756a81710fa11a0a19017654dbc09b390ab65 release?

@giggsey Those logs are helpful, thanks. And just confirming, this is on the 583756a81710fa11a0a19017654dbc09b390ab65 release?

Correct.

Interesting, I seem to have had some more socket errors recently that have kept processing fine without intervention.

worker_1                       | 17:57:52 [WARNING] sentry.tasks.release_registry: Release registry URL is not specified, skipping the task.
worker_1                       | Traceback (most recent call last):
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
worker_1                       |     blueprint.start(self)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
worker_1                       |     step.start(parent)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
worker_1                       |     c.loop(*c.loop_args())
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
worker_1                       |     next(loop)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
worker_1                       |     cb(*cbargs)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1047, in on_readable
worker_1                       |     self.cycle.on_readable(fileno)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 344, in on_readable
worker_1                       |     chan.handlers[type]()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 721, in _brpop_read
worker_1                       |     **options)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 680, in parse_response
worker_1                       |     response = connection.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response
worker_1                       |     response = self._parser.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 403, in read_response
worker_1                       |     (e.args,))
worker_1                       | ConnectionError: Error while reading from socket: ('Connection closed by server.',)
worker_1                       | 18:01:47 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Traceback (most recent call last):
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
worker_1                       |     blueprint.start(self)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
worker_1                       |     step.start(parent)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
worker_1                       |     c.loop(*c.loop_args())
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
worker_1                       |     next(loop)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
worker_1                       |     cb(*cbargs)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1047, in on_readable
worker_1                       |     self.cycle.on_readable(fileno)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 344, in on_readable
worker_1                       |     chan.handlers[type]()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 721, in _brpop_read
worker_1                       |     **options)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 680, in parse_response
worker_1                       |     response = connection.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response
worker_1                       |     response = self._parser.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 403, in read_response
worker_1                       |     (e.args,))
worker_1                       | ConnectionError: Error while reading from socket: ('Connection closed by server.',)
worker_1                       | 18:02:21 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Traceback (most recent call last):
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
worker_1                       |     blueprint.start(self)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
worker_1                       |     step.start(parent)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
worker_1                       |     c.loop(*c.loop_args())
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
worker_1                       |     next(loop)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 276, in create_loop
worker_1                       |     tick_callback()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1040, in on_poll_start
worker_1                       |     cycle_poll_start()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 322, in on_poll_start
worker_1                       |     self._register_BRPOP(channel)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 308, in _register_BRPOP
worker_1                       |     channel._brpop_start()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 714, in _brpop_start
worker_1                       |     self.client.connection.send_command('BRPOP', *keys)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 610, in send_command
worker_1                       |     self.send_packed_command(self.pack_command(*args))
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 603, in send_packed_command
worker_1                       |     (errno, errmsg))
worker_1                       | ConnectionError: Error 32 while writing to socket. Broken pipe.
worker_1                       | 18:02:37 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Restoring 2 unacknowledged message(s)
worker_1                       | 18:02:52 [WARNING] sentry.tasks.release_registry: Release registry URL is not specified, skipping the task.
worker_1                       | Traceback (most recent call last):
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
worker_1                       |     blueprint.start(self)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
worker_1                       |     step.start(parent)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
worker_1                       |     c.loop(*c.loop_args())
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
worker_1                       |     next(loop)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
worker_1                       |     cb(*cbargs)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1047, in on_readable
worker_1                       |     self.cycle.on_readable(fileno)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 344, in on_readable
worker_1                       |     chan.handlers[type]()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 721, in _brpop_read
worker_1                       |     **options)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 680, in parse_response
worker_1                       |     response = connection.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response
worker_1                       |     response = self._parser.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 403, in read_response
worker_1                       |     (e.args,))
worker_1                       | ConnectionError: Error while reading from socket: ('Connection closed by server.',)
worker_1                       | 18:02:58 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Restoring 1 unacknowledged message(s)
worker_1                       | Traceback (most recent call last):
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
worker_1                       |     blueprint.start(self)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
worker_1                       |     step.start(parent)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
worker_1                       |     c.loop(*c.loop_args())
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
worker_1                       |     next(loop)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 276, in create_loop
worker_1                       |     tick_callback()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1040, in on_poll_start
worker_1                       |     cycle_poll_start()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 322, in on_poll_start
worker_1                       |     self._register_BRPOP(channel)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 308, in _register_BRPOP
worker_1                       |     channel._brpop_start()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 714, in _brpop_start
worker_1                       |     self.client.connection.send_command('BRPOP', *keys)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 610, in send_command
worker_1                       |     self.send_packed_command(self.pack_command(*args))
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 603, in send_packed_command
worker_1                       |     (errno, errmsg))
worker_1                       | ConnectionError: Error 32 while writing to socket. Broken pipe.
worker_1                       | 18:03:39 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Restoring 2 unacknowledged message(s)
worker_1                       | Traceback (most recent call last):
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
worker_1                       |     blueprint.start(self)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
worker_1                       |     step.start(parent)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
worker_1                       |     c.loop(*c.loop_args())
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
worker_1                       |     next(loop)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
worker_1                       |     cb(*cbargs)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1047, in on_readable
worker_1                       |     self.cycle.on_readable(fileno)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 344, in on_readable
worker_1                       |     chan.handlers[type]()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 721, in _brpop_read
worker_1                       |     **options)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 680, in parse_response
worker_1                       |     response = connection.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response
worker_1                       |     response = self._parser.read_response()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 403, in read_response
worker_1                       |     (e.args,))
worker_1                       | ConnectionError: Error while reading from socket: (104, 'Connection reset by peer')
worker_1                       | 18:03:53 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Restoring 1 unacknowledged message(s)
worker_1                       | 18:07:52 [WARNING] sentry.tasks.release_registry: Release registry URL is not specified, skipping the task.

@giggsey can you also share your redis logs?

@BYK Nothing odd in the redis logs:

redis_1                        | 1971:C 11 Sep 2020 09:01:56.115 * DB saved on disk
redis_1                        | 1971:C 11 Sep 2020 09:01:56.116 * RDB: 1 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 09:01:56.168 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 09:02:57.051 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 09:02:57.052 * Background saving started by pid 1972
redis_1                        | 1972:C 11 Sep 2020 09:02:57.102 * DB saved on disk
redis_1                        | 1972:C 11 Sep 2020 09:02:57.103 * RDB: 0 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 09:02:57.153 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 09:03:58.012 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 09:03:58.013 * Background saving started by pid 1973
redis_1                        | 1973:C 11 Sep 2020 09:03:58.068 * DB saved on disk
redis_1                        | 1973:C 11 Sep 2020 09:03:58.069 * RDB: 0 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 09:03:58.113 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 09:04:59.097 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 09:04:59.099 * Background saving started by pid 1974
redis_1                        | 1974:C 11 Sep 2020 09:04:59.150 * DB saved on disk
redis_1                        | 1974:C 11 Sep 2020 09:04:59.152 * RDB: 1 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 09:04:59.199 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 09:06:00.056 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 09:06:00.059 * Background saving started by pid 1975
redis_1                        | 1975:C 11 Sep 2020 09:06:00.111 * DB saved on disk
redis_1                        | 1975:C 11 Sep 2020 09:06:00.113 * RDB: 0 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 09:06:00.159 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 09:07:01.077 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 09:07:01.078 * Background saving started by pid 1976
redis_1                        | 1976:C 11 Sep 2020 09:07:01.125 * DB saved on disk
redis_1                        | 1:M 11 Sep 2020 11:33:57.759 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 11:34:58.045 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 11:34:58.055 * Background saving started by pid 2044
redis_1                        | 2044:C 11 Sep 2020 11:34:59.290 * DB saved on disk
redis_1                        | 2044:C 11 Sep 2020 11:34:59.299 * RDB: 6 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 11:34:59.379 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 11:36:00.094 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 11:36:00.103 * Background saving started by pid 2045
redis_1                        | 2045:C 11 Sep 2020 11:36:00.999 * DB saved on disk
redis_1                        | 2045:C 11 Sep 2020 11:36:01.007 * RDB: 5 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 11:36:01.018 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 11:37:02.036 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 11:37:02.042 * Background saving started by pid 2046
redis_1                        | 2046:C 11 Sep 2020 11:37:02.576 * DB saved on disk

The only thing jumps out to me is there seems to be a lot of changes happening to Redis, which I wouldn't expect. Could it be that Redis is not able to keep up with the load?

Removing the release-blocker label as we have merged the upgrade for Celery, Kombu and redis-py. Let's see how this plays out.

I'm just thinking if it makes sense to implement healthchecks in docker-compose.yml. Then if e.g. Celery breaks the worker container get's restarted (as I'm now doing by hand).

A specialised solution for Celery it described on Stack Overflow. But I think a more generic solution would make more sense, maybe based on django-healthcheck? This could then work for the other containers based on the main sentry image, too.

Just my five pence...

I'm currently looking in the logs but we see the same problems after some days with 583756a81710fa11a0a19017654dbc09b390ab65 too

worker_1 throws same exceptions like already posted

Traceback (most recent call last): File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start blueprint.start(self) File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start step.start(parent) File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start c.loop(*c.loop_args()) File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop next(loop) File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop cb(*cbargs) File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1047, in on_readable self.cycle.on_readable(fileno) File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 344, in on_readable chan.handlers[type]() File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 721, in _brpop_read **options) File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 680, in parse_response response = connection.read_response() File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response response = self._parser.read_response() File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 403, in read_response (e.args,)) ConnectionError: Error while reading from socket: ('Connection closed by server.',) 22:21:28 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...

redis & kafka logs only contains standard/regular log messages

There is a difference to previous versions: it seems that after restarting containers, all missing messages are post-processed successfully. No event has been lost. with the version before the bugfix the messages were lost within the processing-break

@gbrehmer

it seems that after restarting containers, all missing messages are post-processed successfully. No event has been lost. with the version before the bugfix the messages were lost within the processing-break

Okay, this is great to hear. So we fixed _something_. I think things getting stuck is celery and redis not being able to keep up with the bursty traffic.

@mpibpc-mroose

I'm just thinking if it makes sense to implement healthchecks in docker-compose.yml. Then if e.g. Celery breaks the worker container get's restarted (as I'm now doing by hand).

Yeah, I agree. We'll try to get to this but not sure when. If you don't mind submitting a PR, we can work together to get it implemented tho.

For folks still having issues after upgrading to 20.9.0, can you add the following line to your config.yml file under the sentry directory and restart all Sentry instances (especially workers):

postprocess.use-cache-key: 1.0

This should enable a new optimization we introduced and reduce the load on Redis & Celery.

/cc @markstory

@BYK When using postprocess.use-cache-key: 1 in config.yml raises TypeError.

TypeError: 'postprocess.use-cache-key': got <type 'int'>, expected float

Shall we use 1.0 as value?

@sumit4613 - oh, sorry didn't realize that. Yeah 1.0 should do the trick. Updated the above comment now. Thanks a lot!

Could we please consider a release version 2.8.1 with a fix for this problem?

I cannot update to 2.9.0 due to the docker version bump (and that version of docker not yet being available from Amazon Linux extras), and this bug in 2.8.0 is causing some troubles for many.

P.S., the current version of docker in Amazon Linux 2, with Amazon Linux extras is 19.03.6

# yum info docker
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
202 packages excluded due to repository priority protections
Installed Packages
Name        : docker
Arch        : x86_64
Version     : 19.03.6ce
Release     : 4.amzn2
Size        : 161 M
Repo        : installed
From repo   : amzn2extra-docker

@christopherowen you can manually change the install script to remove or bypass the docker version check. We bumped it mostly for new installs to avoid some weird networking issues (and also we were requiring a pretty dated version)

I don't want to hijack this thread, but I see we reduced the docker version requirement for GCP, could it be reduced further to the AML version?

I don't want to hijack this thread, but I see we reduced the docker version requirement for GCP, could it be reduced further to the AML version?

The way to do that is to file a new issue or better, submit a PR? 😉

I've upgraded to 2.9.0 by editing the install.sh. I'll report back if the issue in this ticket persists.

After running the upgrade I'm getting a number of errors. It ingested events for about 5 minutes and has now stopped working again

ingest-consumer_1              | Traceback (most recent call last):
ingest-consumer_1              |   File "/usr/local/bin/sentry", line 8, in <module>
ingest-consumer_1              |     sys.exit(main())
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/sentry/runner/__init__.py", line 166, in main
ingest-consumer_1              |     cli(prog_name=get_prog(), obj={}, max_content_width=100)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
ingest-consumer_1              |     return self.main(*args, **kwargs)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
ingest-consumer_1              |     rv = self.invoke(ctx)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
ingest-consumer_1              |     return _process_result(sub_ctx.command.invoke(sub_ctx))
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
ingest-consumer_1              |     return _process_result(sub_ctx.command.invoke(sub_ctx))
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
ingest-consumer_1              |     return ctx.invoke(self.callback, **ctx.params)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
ingest-consumer_1              |     return callback(*args, **kwargs)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
ingest-consumer_1              |     return f(get_current_context(), *args, **kwargs)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/sentry/runner/decorators.py", line 68, in inner
ingest-consumer_1              |     return ctx.invoke(f, *args, **kwargs)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
ingest-consumer_1              |     return callback(*args, **kwargs)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
ingest-consumer_1              |     return f(get_current_context(), *args, **kwargs)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/sentry/runner/decorators.py", line 30, in inner
ingest-consumer_1              |     return ctx.invoke(f, *args, **kwargs)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
ingest-consumer_1              |     return callback(*args, **kwargs)
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/sentry/runner/commands/run.py", line 440, in ingest_consumer
ingest-consumer_1              |     get_ingest_consumer(consumer_types=consumer_types, **options).run()
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/sentry/utils/batching_kafka_consumer.py", line 212, in run
ingest-consumer_1              |     self._run_once()
ingest-consumer_1              |   File "/usr/local/lib/python2.7/site-packages/sentry/utils/batching_kafka_consumer.py", line 230, in _run_once
ingest-consumer_1              |     raise Exception(msg.error())
ingest-consumer_1              | Exception: KafkaError{code=UNKNOWN_TOPIC_OR_PART,val=3,str="Subscribed topic not available: ingest-attachments: Broker: Unknown topic or partition"}
relay_1                        | 2020-09-22T09:53:24Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
relay_1                        |   caused by: Failed to connect to host: Failed resolving hostname: no record found for name: web.eu-west-1.compute.internal. type: AAAA class: IN
relay_1                        |   caused by: Failed resolving hostname: no record found for name: web.eu-west-1.compute.internal. type: AAAA class: IN
relay_1                        |   caused by: Failed resolving hostname: no record found for name: web.eu-west-1.compute.internal. type: AAAA class: IN

Would you like me to open a separate issue?

Edit: Fixed the above by docker execing into kafka and running kafka-topics --create --topic ingest-attachments --bootstrap-server localhost:9092

I did notice this in the redis logs, maybe it's why it can't keep up?

redis_1                        | 1:M 22 Sep 2020 09:52:35.493 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
redis_1                        | 1:M 22 Sep 2020 09:52:35.493 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
redis_1                        | 1:M 22 Sep 2020 09:52:35.493 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
redis_1                        | 1:M 22 Sep 2020 09:52:35.493 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

Celery with redis broker seems to be very unstable. How about using different broker?

Sorry if I wrote in the wrong place.
But I might have a similar problem.
A couple of weeks ago I installed Sentry 20.8.0 via docker compose. The only change I had made is the nginx port change. Further, errors in the sentry were sent by another department.
A couple of days ago, 2 problems appeared,

  1. the first OOM was killing processes, we added memory to the server up to 16 GB (specifically, it killed redis).
  2. new errors sent to Sentry, but no displayed in the web interface.
    Next, we decided to upgrade to version 20.9.0 and add the line postprocess.use-cache-key: 1.0.

But as a result, the problem with displaying data in the web interface persists. I tried to figure out what could be wrong. and while I got the impression that redis is only growing since the size is not reduced (although it probably depends on the data flow). Those. I couldn't even send the python script test error.
I understand that the matter is most likely in some kind of worker, but I do not understand why it suddenly broke and does not work on the updated installation.

Release 20.9.0 has fixed this issue?🤔

Upgrading to version 20.9.0 didn't help, I still can't see events in the web interface. (the first few days I actually saw the events, but then they suddenly stopped appearing)

I'm having the same issue. After a while (1-3 days) the OOM starts killing processes. A temporary fix is to restart sentry every night using cron jobs but obviously that isn't a good solution. Are there any logs I can provide to help fix the issue?

Celery getting stuck because of redis broker and OOM issues are unrelated.

@chevvi @madmurl0c - Your issues seem more like scaling issues rather than being specific to workers as this issue covers.

Since we are not receiving this specific issue and many people using the config option we shared or the new 20.9.0 version report more stability, I'll be closing the issue.

Please use the forum or file a new issue with the proper issue template so we can help you better.

Upgraded to 20.9.0

sentry --version
sentry, version 20.9.0 (bb3d590af6fa)

Works like a few hours and again.
See in worker logs those errors (and actually in all sentry services, that use kafka)

%3|1601458843.486|ERROR|rdkafka#producer-1| [thrd:kfk-1:6667/2644]: kfk-1:6667/2644: 1 request(s) timed out: disconnect
%3|1601458843.539|FAIL|rdkafka#producer-1| [thrd:kfk-9:6667/2636]:kfk-9:6667/2636: 3 request(s) timed out: disconnect

@mikhno-s if you look at the original report, the issue was with the connection with Redis. If you are having kafka connection issues, it maybe because your kafka instance is down or unable to handle the load.

I'll go ahead and lock this issue to prevent further piling up on an already fixed problem. Feel free to filing a new issue if you think this is a bug in Sentry itself with as much logging as possible. Otherwise I recommend using the forum for seeking scaling help.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dotconnor picture dotconnor  Â·  6Comments

meriturva picture meriturva  Â·  6Comments

TheRatG picture TheRatG  Â·  4Comments

SteveEdson picture SteveEdson  Â·  3Comments

eandersons picture eandersons  Â·  5Comments