Channels: ChannelFull exception crashing workers during backlog

Created on 26 May 2017 · 6 Comments · Source: django/channels

We recently encountered ChannelFull worker crashes while clearing ~15 minutes of messages (after the runworker processes had been temporarily offline).

At the time there were approximately 20,000 keys in the appropriate redis db and we were using the default channel capacity of 100.

After research it seems like the suggested solution to clear the backlog is to increase the number of workers – so we did. They proceeded to crash with the included stack trace and didn't help to process messages any faster.

We found ourselves in a situation where the only way to get things operating correctly was to pull the plug on new incoming WebSocket connections. This is because the ChannelFull error was crashing workers, which in turn meant that the channels weren't actually being cleared (leading to more crashes, and so on).

At the time we had 32 worker processes across a number of machines attempting to catch up.

Is it expected behaviour for the workers to crash like this, and how could we mitigate similar problems in the future?

Setup

Nginx proxying to upstream daphne running in containers
Using channels to service only WebSocket requests
asgi_redis.RedisSentinelChannelLayer backend
Running runworker via supervisor on a number of machines

Versions

asgi-redis==1.3.0
channels==1.1.3
daphne==1.2.0
django==1.11.1
Twisted==17.1.0

Traceback

Traceback (most recent call last):
  File "/home/team/releases/current/manage.py", line 9, in <module>
    execute_from_command_line(sys.argv)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 363, in execute_from_command_line
    utility.execute()
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 355, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/channels/management/commands/runworker.py", line 83, in handle
    worker.run()
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/channels/worker.py", line 151, in run
    consumer_finished.send(sender=self.__class__)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 193, in send
    for receiver in self._live_receivers(sender)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/channels/message.py", line 93, in send_and_flush
    sender.send(message, immediately=True)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/channels/channel.py", line 44, in send
    self.channel_layer.send(self.name, content)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/asgi_redis/core.py", line 177, in send
    raise self.ChannelFull
asgiref.base_layer.ChannelFull

Source

millar

All 6 comments

It looks like you hooked up a signal to a channel sender - is that the case?

andrewgodwin on 26 May 2017

We haven't knowingly attached any signals to channel senders.

millar on 26 May 2017

Ah yes, I see what's happening, the atomic message handling is not correctly dealing with ChannelFull. I'll work on a fix for it soon.

andrewgodwin on 27 May 2017

I have implemented a patch for this in 34a047a - in particular, it retries messages onto a full channel up to a time limit (two seconds), and then quits out with a much more detailed error message that suggests alternatives.
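For readers who want a feel for the behaviour described above, here is a rough sketch of "retry until a time limit, then fail with a descriptive error". It is not the code from 34a047a: the function name, the `pause` interval, the stand-in `ChannelFull` class and the wording of the error message are all assumptions made for illustration.

```python
import time


class ChannelFull(Exception):
    """Stand-in for the channel layer's ChannelFull exception (assumption)."""


def send_with_deadline(send, content, time_limit=2.0, pause=0.05,
                       clock=time.time, sleep=time.sleep):
    """Call send(content), retrying on ChannelFull until time_limit elapses.

    `send` is any callable that raises ChannelFull when the target channel
    is at capacity. `clock` and `sleep` are injectable for testing.
    """
    deadline = clock() + time_limit
    while True:
        try:
            return send(content)
        except ChannelFull:
            if clock() >= deadline:
                # Give up with a message that points at possible remedies.
                raise RuntimeError(
                    "Channel still full after %.1f seconds; consider raising "
                    "the channel capacity or adding workers." % time_limit
                )
            # Back off briefly so workers get a chance to drain the channel.
            sleep(pause)
```

The point of the deadline is that a briefly full channel no longer kills the worker outright, while a channel that stays full still produces a clear failure instead of blocking forever.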

If you want different handling of ChannelFull (dropping, longer retries), you should switch to immediately=True mode on the send() call, and then the message will send in-line and you'll be responsible for handling ChannelFull yourself however you like.
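As an illustration of the "handle it yourself" route, the sketch below sends in-line with immediately=True and drops the message when the channel is full. The helper name `send_or_drop` is hypothetical, and it assumes only that the channel object's send() accepts immediately=True (as seen in the traceback above) and that the caller passes in the layer's ChannelFull exception class.

```python
import logging

logger = logging.getLogger(__name__)


def send_or_drop(channel, content, channel_full_exc):
    """Send in-line; drop the message (and log it) if the channel is full.

    `channel` is expected to behave like a Channels 1.x Channel, whose
    send() accepts immediately=True. `channel_full_exc` is the channel
    layer's ChannelFull exception class. Returns True if sent, False if
    the message was dropped.
    """
    try:
        channel.send(content, immediately=True)
        return True
    except channel_full_exc:
        logger.warning("Channel full; dropping message")
        return False
```

In a consumer this might be called as something like `send_or_drop(message.reply_channel, {"text": "..."}, message.channel_layer.ChannelFull)`, assuming the message object exposes its channel layer that way. Dropping is only one policy; the same try/except shape works for longer retries or for queueing the message elsewhere.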

andrewgodwin on 29 May 2017

Thanks for the really quick work on this – it's super helpful!

millar on 29 May 2017

This issue started affecting me in production and development too; thanks for the super quick fix.