Python-slack-sdk: Random Crash When RTM Reading

Created on 24 Aug 2016 · 19Comments · Source: slackapi/python-slack-sdk

[X ] I've read and understood the Contributing guidelines and have done my best effort to follow them.
[ X] I've read and agree to the Code of Conduct.
[ X] I've searched for any related issues and avoided creating a duplicate issue.

Description

Slack client randomly disconnects with the following stack trace. I have tried adding a thread to my bot that pings on a regular basis, but that doesn't prevent the crash. I found what I thought was a related bug, but I'm not sure if it's the same issue or not. While this bug persists, bots don't stay running and have to be watched by a parent process that restarts it.

Traceback (most recent call last):
File "./vobot.py", line 163, in
mainloop()
File "./vobot.py", line 142, in mainloop
channel, message = read_output(SLACK_CLIENT.rtm_read())
File "/usr/local/lib/python2.7/dist-packages/slackclient/_client.py", line 120, in rtm_read
json_data = self.server.websocket_safe_read()
File "/usr/local/lib/python2.7/dist-packages/slackclient/_server.py", line 140, in websocket_safe_read
data += "{0}\n".format(self.websocket.recv())
File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 298, in recv
opcode, data = self.recv_data()
File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 315, in recv_data
opcode, frame = self.recv_data_frame(control_frame)
File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 328, in recv_data_frame
frame = self.recv_frame()
File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 360, in recv_frame
return self.frame_buffer.recv_frame()
File "/usr/local/lib/python2.7/dist-packages/websocket/_abnf.py", line 312, in recv_frame
self.recv_header()
File "/usr/local/lib/python2.7/dist-packages/websocket/_abnf.py", line 261, in recv_header
header = self.recv_strict(2)
File "/usr/local/lib/python2.7/dist-packages/websocket/_abnf.py", line 346, in recv_strict
bytes = self.recv(min(16384, shortage))
File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 429, in _recv
return recv(self.sock, bufsize)
File "/usr/local/lib/python2.7/dist-packages/websocket/_socket.py", line 89, in recv
raise WebSocketConnectionClosedException("Connection is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: Connection is already closed.

Reproducible in:

[ ] This is reproducible in the sample project.
python-slackclient version: 1.0.1
Python Version: 2.7.6
OS Version: Ubuntu 14.04.5 LTS

Steps to reproduce:

connect to Slack using this library with a call to rtm_connect()
read from Slack using rtm_read()
wait for the crash to happen at a random time

Expected result:

I expect the bot to keep running.

Actual result:

The bot crashes.

Attachments:

Source

KnightOfNight

👍11

Most helpful comment

warn_log_070617.txt
I receive websocket-related errors at least once per day. I'm attaching my error log here, along with the log entry preceding each error.

I've noticed that in the event preceding every WebSocketConnectionClosedException, rtm_read() returns an [{}], whereas the standard empty response is [].

Additionally, i should note that my bot currently runs rtm_read() every second. Sample code:

def _auto_reconnect(self, running):
    """Validates the connection boolean returned via `SlackClient.rtm_connect()`
    if running is False:
        * Attempt to auto reconnect a max of 5 times
        * For every attempt, delay reconnection (F_n)*5, where n = num of retries
    Parameters
    -----------
    running : bool
        The boolean value returned via `SlackClient.rtm_connect()`
    Returns
    --------
    running : bool
        The validated boolean value returned via `SlackClient.rtm_connect()`
    """
    while not running:
        if self._retries < self.slacker.settings.max_retries:
            self._retries += 1
            try:
                # delay for longer and longer each retry in case of extended outages
                current_delay = (self._retries + (self._retries - 1))*5 # fibonacci, bro
                self._logger.info(
                    "Attempting reconnection %s of %s in %s seconds...",
                    self._retries,
                    self.slacker.settings.max_retries,
                    current_delay
                )
                sleep(current_delay)
                running = self.slacker.rtm_api.rtm_connect()
            except KeyboardInterrupt:
                self._logger.info("KeyboardInterrupt received.")
                break
        else:
            self._logger.error("Max retries exceeded")
            break
    return running

def run(self):
    """Passes the `SlackClient.rtm_connect()` method into `self._auto_reconnect()` for validation
    if running:
        * Capture and parse events via `Slacker.process_events()` every second
        * Close gracefully on KeyboardInterrupt (for testing)
        * log any Exceptions and try to reconnect if needed
    Parameters
    -----------
    n/a
    Returns
    --------
    n/a
    """
    running = self._auto_reconnect(self.slacker.rtm_api.rtm_connect())
    while running:
        self._logger.debug(
            "Slack connection successful using token <%s>. Response: %s",
            self.slacker.rtm_api.slacktoken,
            running
            )
        try:
            self.slacker.process_events()
            sleep(self.slacker.settings.sleep_secs)
        except KeyboardInterrupt:
            self._logger.info("KeyboardInterrupt received.")
            running = False
        except Exception as e:
            self._logger.exception(e)
            running = self._auto_reconnect(self.slacker.rtm_api.rtm_connect())

SupermanIsEvious on 7 Jul 2017

👍3 ❤1

All 19 comments

I have this problem too, added my traceback
python3.5.2 ubuntu16.04

Traceback (most recent call last):
  File "/home/dkorytkin/mario/slack_bot/main.py", line 117, in <module>
    req = sc.rtm_read()
  File "/usr/local/lib/python3.5/dist-packages/slackclient/_client.py", line 120, in rtm_read
    json_data = self.server.websocket_safe_read()
  File "/usr/local/lib/python3.5/dist-packages/slackclient/_server.py", line 140, in websocket_safe_read
    data += "{0}\n".format(self.websocket.recv())
  File "/usr/local/lib/python3.5/dist-packages/websocket/_core.py", line 298, in recv
    opcode, data = self.recv_data()
  File "/usr/local/lib/python3.5/dist-packages/websocket/_core.py", line 315, in recv_data
    opcode, frame = self.recv_data_frame(control_frame)
  File "/usr/local/lib/python3.5/dist-packages/websocket/_core.py", line 328, in recv_data_frame
    frame = self.recv_frame()
  File "/usr/local/lib/python3.5/dist-packages/websocket/_core.py", line 360, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/usr/local/lib/python3.5/dist-packages/websocket/_abnf.py", line 312, in recv_frame
    self.recv_header()
  File "/usr/local/lib/python3.5/dist-packages/websocket/_abnf.py", line 261, in recv_header
    header = self.recv_strict(2)
  File "/usr/local/lib/python3.5/dist-packages/websocket/_abnf.py", line 346, in recv_strict
    bytes = self.recv(min(16384, shortage))
  File "/usr/local/lib/python3.5/dist-packages/websocket/_core.py", line 429, in _recv
    return recv(self.sock, bufsize)
  File "/usr/local/lib/python3.5/dist-packages/websocket/_socket.py", line 89, in recv
    raise WebSocketConnectionClosedException("Connection is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: Connection is already closed.`

example my code

sc = SlackClient(SLACK_TOKEN)
if sc.rtm_connect():

    while True:
        req = sc.rtm_read()

DKorytkin on 26 Aug 2016

Traceback (most recent call last):
  File "rtmbot.py", line 28, in <module>
    bot.start()
  File "/var/scripts/devSFTP/douglask/visionBot-dev/rtmbot/core.py", line 78, in start
    self._start()
  File "/var/scripts/devSFTP/douglask/visionBot-dev/rtmbot/core.py", line 65, in _start
    for reply in self.slack_client.rtm_read():
  File "/usr/local/lib/python2.7/dist-packages/slackclient/_client.py", line 120, in rtm_read
    json_data = self.server.websocket_safe_read()
  File "/usr/local/lib/python2.7/dist-packages/slackclient/_server.py", line 140, in websocket_safe_read
    data += "{0}\n".format(self.websocket.recv())
  File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 298, in recv
    opcode, data = self.recv_data()
  File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 315, in recv_data
    opcode, frame = self.recv_data_frame(control_frame)
  File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 328, in recv_data_frame
    frame = self.recv_frame()
  File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 360, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/usr/local/lib/python2.7/dist-packages/websocket/_abnf.py", line 312, in recv_frame
    self.recv_header()
  File "/usr/local/lib/python2.7/dist-packages/websocket/_abnf.py", line 261, in recv_header
    header = self.recv_strict(2)
  File "/usr/local/lib/python2.7/dist-packages/websocket/_abnf.py", line 346, in recv_strict
    bytes = self.recv(min(16384, shortage))
  File "/usr/local/lib/python2.7/dist-packages/websocket/_core.py", line 429, in _recv
    return recv(self.sock, bufsize)
  File "/usr/local/lib/python2.7/dist-packages/websocket/_socket.py", line 89, in recv
    raise WebSocketConnectionClosedException("Connection is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: Connection is already closed.

Adding traceback as I am also seeing this issue when running against a list. Seems like a timeout is reached and will continue to dig. The problem is intermittent for me when testing but does cause stability issues for my bot

kyledouglas on 18 Sep 2016

wait for the crash to happen at a random time

I'm curious, do you notice it happening unpredictably, or does it tend to crash after staying idle for some time?

ajferrick on 24 Sep 2016

In my testing I wasn't able to find any pattern. Sometimes it crashes soon after my read loop starts, other times it takes a number (as in dozens) of hours.

cgreatorexatonshape on 24 Sep 2016

Commenting since I've run into this issue on-and-off with thekevjames/jarvis since version 0.16 of this library at least. My personal suspicion is that it's an issue with the Slack websocket closing for some reason unrelated to the python library specifically, though I've only done a cursory look at this library.

As a workaround, I do:

while is_polling:
    try:
        for message in slack.rtm_read():
            # processing, etc
    except WebSocketConnectionClosedException:
        slack.rtm_connect()

TheKevJames on 28 Sep 2016

This might seem unrelated, but how many members do your teams have?

I'm wondering if this is related to https://github.com/slackapi/python-slackclient/issues/127

Roach on 9 Dec 2016

I've seen this occur on test teams with a single user (plus the bot).

TheKevJames on 9 Dec 2016

Traceback (most recent call last):
  File "bot.py", line 61, in <module>
    command, channel = parse_slack_output(slack_client.rtm_read())
  File "/home/hathor/bot/local/lib/python2.7/site-packages/slackclient/_client.py", line 120, in rtm_read
    json_data = self.server.websocket_safe_read()
  File "/home/hathor/bot/local/lib/python2.7/site-packages/slackclient/_server.py", line 138, in websocket_safe_read
    data += "{0}\n".format(self.websocket.recv())
  File "/home/hathor/bot/local/lib/python2.7/site-packages/websocket/_core.py", line 293, in recv
    opcode, data = self.recv_data()
  File "/home/hathor/bot/local/lib/python2.7/site-packages/websocket/_core.py", line 310, in recv_data
    opcode, frame = self.recv_data_frame(control_frame)
  File "/home/hathor/bot/local/lib/python2.7/site-packages/websocket/_core.py", line 323, in recv_data_frame
    frame = self.recv_frame()
  File "/home/hathor/bot/local/lib/python2.7/site-packages/websocket/_core.py", line 357, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/home/hathor/bot/local/lib/python2.7/site-packages/websocket/_abnf.py", line 336, in recv_frame
    self.recv_header()
  File "/home/hathor/bot/local/lib/python2.7/site-packages/websocket/_abnf.py", line 286, in recv_header
    header = self.recv_strict(2)
  File "/home/hathor/bot/local/lib/python2.7/site-packages/websocket/_abnf.py", line 371, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "/home/hathor/bot/local/lib/python2.7/site-packages/websocket/_core.py", line 427, in _recv
    return recv(self.sock, bufsize)
  File "/home/hathor/bot/local/lib/python2.7/site-packages/websocket/_socket.py", line 93, in recv
    "Connection is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: Connection is already closed.

I'm also experiencing this issue. @ajferrick it does seem to occur (for me, at least) when saying idle for some time. Also noticed the data it read just before crashing was: {}.

Any way to work around this for now?

chrissphinx on 25 Jan 2017

Afraid that I'm not adding anything other than "me too"; similar stack trace after being online for some unknown number of hours/days (less than 72 hours, at least). I'll attempt wrapping this in the exception handling listed above (I already had to add exception handling for socket.error to handle connection timeouts). This really seems like it should be more robust, or at the very least the documentation should cover what sort of exceptions we should expect from slack.rtm_read().

This may be silly, but is this related to https://github.com/gkbrk/hackchat/issues/1 by any chance?

Traceback (most recent call last):
  File "./netsuitebot.py", line 142, in <module>
    if __name__ == "__main__":
  File "./netsuitebot.py", line 139, in main
    bot = SlackBot(args.token, args.name, args.file)
  File "./netsuitebot.py", line 66, in run
    command, channel = self.__parse_slack_output(self.slack.rtm_read())
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/slackclient/_client.py", line 123, in rtm_read
    json_data = self.server.websocket_safe_read()
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/slackclient/_server.py", line 141, in websocket_safe_read
    data += "{0}\n".format(self.websocket.recv())
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/websocket/_core.py", line 293, in recv
    opcode, data = self.recv_data()
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/websocket/_core.py", line 310, in recv_data
    opcode, frame = self.recv_data_frame(control_frame)
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/websocket/_core.py", line 323, in recv_data_frame
    frame = self.recv_frame()
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/websocket/_core.py", line 357, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/websocket/_abnf.py", line 336, in recv_frame
    self.recv_header()
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/websocket/_abnf.py", line 286, in recv_header
    header = self.recv_strict(2)
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/websocket/_abnf.py", line 371, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/websocket/_core.py", line 427, in _recv
    return recv(self.sock, bufsize)
  File "/home/jochs/netsuitebot/slackclient/local/lib/python2.7/site-packages/websocket/_socket.py", line 93, in recv
    "Connection is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: Connection is already closed.

diamondsw on 27 Feb 2017

@diamondsw I can't give you a definitive answer there, but from personal experience it seems that running a regular ping task doesn't seem to prevent this error. See my bot's keepalive task.

TheKevJames on 27 Feb 2017

Hi guys.
This is what I found (my bot connects really seldom) during my debugging.
When RTM (real time websocket) starts it sends rtm.start request first. In the response there is websocket url along with other information such as users (all users) etc...
In my case this response has cca 40 megabytes (yeah megabytes).
As per documentation:
The Websocket URLs provided by rtm.start are single-use and are only valid for 30 seconds, so make sure to connect quickly. If you connect successfully the first event received will be a hello:
The message is consumed and parsed usually not in this interval (30 seconds) so I got usually this response:
{u'type': u'error', u'error': {u'msg': u'Socket URL has expired', u'code': 1}}

Then my bot tries to reconnect and this scenario is usually repeating over and over.

phonkee on 4 Apr 2017

We encounter it as well every once in a while :(

Ni-Knight on 13 Apr 2017

i think @phonkee's issue is distinct from the one being discussed here. that is related to very large teams having a lot more data in the rtm.start method's response, which can sometimes timeout because its so big. i've created this issue to address this problem with a new feature: #190.

we're going to need more data on the problem if we want to solve this issue. one thing we've gathered is potentially disconnects tend to occur when the bot is relatively idle as opposed to busy. i'd really appreciate any other patterns or data points you all have regarding this issue.

im creating a related issue for implementing better, deeper logging. if we had that before all these reports, we might have had the right data to diagnose and solve the issue already: #192

aoberoi on 27 Jun 2017

warn_log_070617.txt
I receive websocket-related errors at least once per day. I'm attaching my error log here, along with the log entry preceding each error.

I've noticed that in the event preceding every WebSocketConnectionClosedException, rtm_read() returns an [{}], whereas the standard empty response is [].

Additionally, i should note that my bot currently runs rtm_read() every second. Sample code:

def _auto_reconnect(self, running):
    """Validates the connection boolean returned via `SlackClient.rtm_connect()`
    if running is False:
        * Attempt to auto reconnect a max of 5 times
        * For every attempt, delay reconnection (F_n)*5, where n = num of retries
    Parameters
    -----------
    running : bool
        The boolean value returned via `SlackClient.rtm_connect()`
    Returns
    --------
    running : bool
        The validated boolean value returned via `SlackClient.rtm_connect()`
    """
    while not running:
        if self._retries < self.slacker.settings.max_retries:
            self._retries += 1
            try:
                # delay for longer and longer each retry in case of extended outages
                current_delay = (self._retries + (self._retries - 1))*5 # fibonacci, bro
                self._logger.info(
                    "Attempting reconnection %s of %s in %s seconds...",
                    self._retries,
                    self.slacker.settings.max_retries,
                    current_delay
                )
                sleep(current_delay)
                running = self.slacker.rtm_api.rtm_connect()
            except KeyboardInterrupt:
                self._logger.info("KeyboardInterrupt received.")
                break
        else:
            self._logger.error("Max retries exceeded")
            break
    return running

def run(self):
    """Passes the `SlackClient.rtm_connect()` method into `self._auto_reconnect()` for validation
    if running:
        * Capture and parse events via `Slacker.process_events()` every second
        * Close gracefully on KeyboardInterrupt (for testing)
        * log any Exceptions and try to reconnect if needed
    Parameters
    -----------
    n/a
    Returns
    --------
    n/a
    """
    running = self._auto_reconnect(self.slacker.rtm_api.rtm_connect())
    while running:
        self._logger.debug(
            "Slack connection successful using token <%s>. Response: %s",
            self.slacker.rtm_api.slacktoken,
            running
            )
        try:
            self.slacker.process_events()
            sleep(self.slacker.settings.sleep_secs)
        except KeyboardInterrupt:
            self._logger.info("KeyboardInterrupt received.")
            running = False
        except Exception as e:
            self._logger.exception(e)
            running = self._auto_reconnect(self.slacker.rtm_api.rtm_connect())

SupermanIsEvious on 7 Jul 2017

👍3 ❤1

@SupermanIsEvious this data is super useful, thanks so much for sharing!

i haven't yet been able to reproduce this myself, but i haven't let my connection stay alive for more than a couple hours. i suspect that we'd have to leave it on for a day to reproduce.

i don't have a lot of bandwidth at the moment for investigating further (but i agree its imp and will get to it eventually) so if anyone else wants to help solve this, i'd welcome any contribution!

aoberoi on 2 Aug 2017

This happens to my bot on many early mornings, so after a few quiet hours

Gus-Corvil on 13 Oct 2017

I've observed that some Slack teams have (recently) been opted into an experiment (by SlackHQ) which requires that ping messages are sent by RTM clients, otherwise they will be disconnected after 2-3 minutes. This is more frequently than the other reporters, and has only become an issue in the last 4 days for me.

This issue does not appear to be a connectivity issue (observed from multiple ASNs in different continents), is not limited in scope to one API implementation, and disconnections still occur even when there is a constant stream of events coming from the server.

However, this experiment has only been applied to a limited number of teams (I see only one team with the issue and all others without, even when running the same bot code), which don't appear to correlate with age of the team.

There is an existing Server.ping method, however this is not called by any of the code within slackclient itself. Additionally, the example code for using RTM neither mentions pinging as a requirement, nor includes the sufficient boilerplate to implement this.

My feeling is that the API surface should be simple to use, so python-slackclient should implement the pinging rather than individual implementors requiring additional, currently undocumented boilerplate.

I've fixed this in my own Slack client implementation with https://github.com/micolous/python-slackrealtime/commit/e9c94416f979a6582110ebba09c147de2bfe20a1

micolous on 3 Dec 2017

I receive this message instantly upon sending anything using sc.server.send_to_websocket().

For example:

def send_typing_indicator(sc, channel):
    import json
    sc.server.send_to_websocket(json.dumps({'type': 'typing', 'channel': channel}))

^{Untested code!}

Which ultimately causes the exception to be thrown on the next events read. Maybe that will help in debugging this.

Qix- on 30 Dec 2017

Fixed in 1.2.0 using client.rtm_connect(auto_reconnect=True)

See the note at the bottom of this section: https://slackapi.github.io/python-slackclient/real_time_messaging.html#rtm-start-vs-rtm-connect

Roach on 30 Mar 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

files_upload ignores filetype

schlegelp · 3Comments

Message is not being formatted and channels/names are not linked

avinassh · 3Comments

file_not_found after uploading file

ErikKalkoken · 3Comments

Return rtm.start API response to callers of rtm_connect()

Terrance · 3Comments

No handlers could be found for logger "slackclient.client" (ubuntu WORKS, same code centos7 does not)

miklasz · 5Comments