Synapse: "Fatal Python error: Cannot recover from stack overflow" from logger

Created on 19 Nov 2018  路  14Comments  路  Source: matrix-org/synapse

Description

The server crashes and stops with error Fatal Python error: Cannot recover from stack overflow.
The full stacktrace can be found here: matrix-stackoverflow.log

Steps to reproduce

The only thing that I guess that could help to reproduce the bug is to edit log.yaml and set every log levels to INFO or even DEBUG.

I expect that my matrix server doesn't crash.

Version information

  • Homeserver: my personal homeserver (im.deuxfleurs.fr)
  • Version: v0.33.9rc1
  • Install method: Homemade Docker (with Python 3 + downloading synapse archive from Github)
  • Platform: Running in Docker, on a Debian Buster host, on a Scaleway VPS

All 14 comments

I can't reproduce this error - can you try against python 3.6? It may be version dependent

related to #4191?

@neilisfragile seems dup, ill try to make our docker image switch to py3.6 , and it should then reproduce exactly your error.

I encountered this error while having lot of instabilities with matrix, with some strange errors due to encoding and errors loop (an error handler that failed, triggering an error handler that fail and so on).
However, I was not able to clearly identify the problem at the moment, and even now I struggle to understand why Matrix went wild. It seems that setting the error verbosity to ERROR mitigated the problem.

I am a bit lost as some stacktraces seem normal as this one:

2018-11-26 08:53:36,253 - twisted - 244 - CRITICAL - - Unexpected exception from twisted.web.client.FileBodyProducer.stopProducing
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/web/_newclient.py", line 195, in _callAppFunction
    function()
  File "/usr/local/lib/python3.5/dist-packages/twisted/web/client.py", line 1087, in stopProducing
    self._task.stop()
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 497, in stop
    self._checkFinish()
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 507, in _checkFinish
    raise self._completionState
twisted.internet.task.TaskStopped

In my case, I suppose that the reported error here has been triggered by this error, which seem to be also reported in #4191 :

2018-11-26 09:04:47,977 - synapse.state.v1 - 196 - ERROR - PUT-260673-$1543223055125NpvOU:encom.eu.org-$2458-4-4-6T06252.913452|Marina's_Silly_Donkey_Sits_Snowpain's_Good_Newt:ponies.im- Failed to resolve state
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.5/dist-packages/synapse/storage/events_worker.py", line 364, in _enqueue_events
    for e in res if e
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: {"$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im": _EventCacheEntry(event=<FrozenEvent event_id='$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im', type='m.room.member', state_key='@/dev/ponies:ponies.im'>, redacted_event=None)}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.5/dist-packages/synapse/storage/events_worker.py", line 197, in _get_events
    defer.returnValue(events)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: [<FrozenEvent event_id='$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im', type='m.room.member', state_key='@/dev/ponies:ponies.im'>]

During handling of the above exception, another exception occurred:

[...]

  File "/usr/local/lib/python3.5/dist-packages/synapse/state/v1.py", line 301, in key_func
    return -int(e.depth), hashlib.sha1(e.event_id.encode('ascii')).hexdigest()
builtins.UnicodeEncodeError: 'ascii' codec can't encode character '\u205f' in position 42: ordinal not in range(128)

And a verbosity greater than ERROR. But it's only speculation.

But the stacktrace is so huge, and I struggle to find where is the useful information in this stacktrace.
Here are the current logs of my server which really look like to #4191 but seem a bit different too (maybe due to python3?).
stderr.txt

Sorry for this not so useful bug report, I think you could close it in favor of #4191, even if I suspect that this error can cause many different stacktraces and even crash the whole server.
Thanks for the help, and if you want that I repost some elements from the last logfile I have attached in #4191, just ask.

For us the current workaround was to filter with CRITICAL, ERROR was not sufficient

2018-11-26 08:53:36,253 - twisted - 244 - CRITICAL - - Unexpected exception from twisted.web.client.FileBodyProducer.stopProducing
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/web/_newclient.py", line 195, in _callAppFunction
    function()
  File "/usr/local/lib/python3.5/dist-packages/twisted/web/client.py", line 1087, in stopProducing
    self._task.stop()
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 497, in stop
    self._checkFinish()
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 507, in _checkFinish
    raise self._completionState
twisted.internet.task.TaskStopped

This is https://github.com/matrix-org/synapse/issues/4003.

2018-11-26 09:04:47,977 - synapse.state.v1 - 196 - ERROR - PUT-260673-$1543223055125NpvOU:encom.eu.org-$2458-4-4-6T06252.913452|Marina's_Silly_Donkey_Sits_Snowpain's_Good_Newt:ponies.im- Failed to resolve state
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.5/dist-packages/synapse/storage/events_worker.py", line 364, in _enqueue_events
    for e in res if e
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: {"$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im": _EventCacheEntry(event=<FrozenEvent event_id='$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im', type='m.room.member', state_key='@/dev/ponies:ponies.im'>, redacted_event=None)}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.5/dist-packages/synapse/storage/events_worker.py", line 197, in _get_events
    defer.returnValue(events)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: [<FrozenEvent event_id='$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im', type='m.room.member', state_key='@/dev/ponies:ponies.im'>]

During handling of the above exception, another exception occurred:

[...]

  File "/usr/local/lib/python3.5/dist-packages/synapse/state/v1.py", line 301, in key_func
    return -int(e.depth), hashlib.sha1(e.event_id.encode('ascii')).hexdigest()
builtins.UnicodeEncodeError: 'ascii' codec can't encode character '\u205f' in position 42: ordinal not in range(128)

This is #4226, and not related to #4191 afaict.

A couple of things which might be worth trying here:

Let me know if either of the above help.

I have already set the encoding for my log file.


log.yaml

version: 1

formatters:
  precise:
   format: '%(asctime)s - %(name)s - %(lineno)d - %(levelname)s - %(request)s- %(message)s'

filters:
  context:
    (): synapse.util.logcontext.LoggingContextFilter
    request: ""

handlers:
  file:
    class: logging.handlers.RotatingFileHandler
    formatter: precise
    filename: /var/log/matrix-synapse/homeserver.log
    maxBytes: 104857600
    backupCount: 10
    filters: [context]
    level: ERROR
    encoding: utf8
  console:
    class: logging.StreamHandler
    formatter: precise
    level: ERROR

loggers:
  synapse:
    level: ERROR

  synapse.storage.SQL:
    level: ERROR

  ldap3:
    level: ERROR
  ldap_auth_provider:
    level: ERROR

root:
  level: ERROR
  handlers: [file, console]

I was running synapse with /usr/bin/python3 -m synapse.app.homeserver -c /etc/matrix-synapse/homeserver.yaml, I have changed it for /usr/bin/python3 -m synapse.app.homeserver -n -c /etc/matrix-synapse/homeserver.yaml as suggested.

If I still encounter some bugs, I will try to switch from ERROR to CRITICAL in log.yaml.
Anyway, I will update this comment after a while to see how synapse behave.

for what it's worth, I think most of those level settings are irrelevant, because they will be inherited from the root logger.

I get the same log as here

The server crashes and stops with error Fatal Python error: Cannot recover from stack overflow.
The full stacktrace can be found here: matrix-stackoverflow.log

Not sure if this is any useful, but for me it was enough to increase the level to ERROR in file and disabling the level in root to make synapse run again

handlers:
    file:
        class: logging.handlers.RotatingFileHandler
        formatter: precise
        filename: /var/log/synapse/homeserver.log
        maxBytes: 104857600
        backupCount: 10
        filters: [context]
        encoding: utf8
        level: ERROR
    console:
        class: logging.StreamHandler
        formatter: precise
        filters: [context]
        level: INFO

loggers:
    synapse:
        level: INFO

    synapse.storage.SQL:
        # beware: increasing this to DEBUG will make synapse log sensitive
        # information such as access tokens.
        level: INFO

root:
    #level: ERROR
    handlers: [file, console]

Ok no it just happened again

Since the creation of this bug report, I have updated matrix many times and have not encountered this bug anymore. Feel free to close this issue.

so the basis of the problem is:

  • if the log subsystem encounters an error when writing a log line, it attempts to write an error message to stderr
  • synapse redirects stderr to the twisted logger (at level ERROR) unless --no-redirect-stdio (aka -n) is set on the commandline
  • so the error message is sent back through the log subsystem
  • boom

So there are two problems here. One is that stderr is piped back through the log subsystem, which would be fixed with something like https://github.com/matrix-org/synapse/issues/1539#issuecomment-435084421, but also there was an initial error which set off the cascade.

Not sure if this is any useful, but for me it was enough to increase the level to ERROR in file and disabling the level in root to make synapse run again

This sort of thing can reduce the amount of logging that is written, thus reducing the likelihood of hitting an error in the log subsystem - but as you have discovered, is not a complete solution.

It is tricky to figure out what is going on to cause the initial error, beyond obvious things like /var/log/synapse running out of diskspace. The easiest approach is probably:

  • make sure synapse is configured not to daemonize itself (ie, don't use use --daemonize, or if you're using synctl, set --no-daemonize).
  • set --no-redirect-stdio on the commandline
  • remove console from the list of handlers under the root logger, to reduce the amount of stuff going to stderr
  • pipe stderr to somewhere useful
  • leave it running
  • see what comes up in stderr

closing as a dup of #4240

Was this page helpful?
0 / 5 - 0 ratings