Synapse: workers stop working after elevated traffic

Created on 17 Dec 2017 · 11Comments · Source: matrix-org/synapse

Description

There appears to be nothing indicating a problem in the logs, however there's circumstantial evidence that when synapse receives higher than normal traffic it can cause the federation_sender to stop working (no activity), therefore not federating with remote servers. The federation_sender logs don't seem to have anything out of the ordinary - it just stops sending requests. The main synapse process complains about the events stream falling behind, but doesn't seem to cause problems until 12 minutes later.

This has happened about 10 times in the past to t2bot.io, and each time the number of events being persisted was always elevated (double it's normal rate) before the federation_sender stopped working. For t2bot.io "normal" is defined as 2-3Hz. Each time the federation_sender has stopped the persisted events were going through at >6Hz (this latest being ~6-10Hz).

Here's the timeline for the problem (in UTC):

01:56:09 Synapse crosses the 6Hz persisted events line
03:07:28 The main synapse process started complaining that the events stream was falling behind
03:10:03 Synapse falls below the 6Hz persisted events line
03:19:56 The federation_sender officially stopped working
04:27:22 The entire stack was restarted, restoring federation

During this time the only error spat out was (repeated every few seconds):

homeserver - 2017-12-17 03:13:04,014 - twisted - 131 - CRITICAL - - 
Traceback (most recent call last):
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/resource.py", line 164, in on_notifier_poke
    updates, current_token = yield stream.get_updates()
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 169, in get_updates
    updates, current_token = yield self.get_updates_since(self.last_token)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 200, in get_updates_since
    raise Exception("stream %s has fallen behined" % (self.NAME))
Exception: stream current_state_deltas has fallen behined

Further, during this time incoming federation was unaffected. Synapse was still processing events and passing them along to appservices. Only outbound federation was affected.

More in-depth logs are available upon request.

Version information

Homeserver: t2bot.io
Version: 0.26.0-rc1
Install method: pip
Platform: container, ubuntu host.

bug p1

Source

turt2live

👍1

All 11 comments

apparently this is still happening on 0.27.3 - @erikjohnston did you see anything in this area when painting go-faster stripes on the federation-sender the other day (by parallelising sends?)

ara4n on 13 Apr 2018

fwiw this always correlates with a higher number of events being persisted. I wouldn't be surprised if the federation_sender was spamming get event requests, choking out other traffic.

turt2live on 13 Apr 2018

This was triggered on the utwente.io homeserver when it persisted about 120k m.room.member events from matrix.org in <2 minutes (according to synapse_storage_events_persisted_events_sep).

Unfortunately, the logs don't seem to contain anything useful other than the error given above.

silkeh on 9 May 2018

Judging from the fact that the federation sender flatlines, my original hypothesis of it doing a lot of work doesn't stand.

turt2live on 8 Jun 2018

The latest outage (screenshot in last comment) appears to be directly attributed to Voyager joining HQ (https://github.com/matrix-org/synapse/issues/3337), however I'm still not sure why the federation sender gave up on life. It appears as though the stream falling behind breaks replication, therefore the federation sender does nothing.

Although the logs don't actually show a 35Hz stream of events being logged, the timing and verbosity of Voyager's join to HQ is oddly convenient.

10:02:26 - Voyager tries to join HQ
10:02:26 - matrix.org replies 200 OK to /make_join
10:02:54 - matrix.org replies 200 OK to /send_join
10:03:20 - Synapse starts complaining about signature checks and tampered events in HQ
10:03:40 - Synapse tries to store HQ again, and fails (https://github.com/matrix-org/synapse/issues/3374)
10:03:41 - Synapse asks matrix.org for an event in HQ
10:03:43 - matrix.org replies 200 OK for the event, Synapse starts rejecting events in HQ (due to visibility)
10:19:45 - Voyager tries to join HQ again (timeout caused it to try again 'later')
10:19:45 - Synapse rejects the join due to an AssertionError (room_id is in the room_queue)
10:29:18 - The join completes
10:29:19 - stream caches falls behind

From that point on, more streams fall behind and eventually the federation sender stops getting traffic.

turt2live on 8 Jun 2018

Had something similar on my server. Joined 3 largeish rooms in a short period (#archlinux:matrix.org, #manjaro:matrix.org and #general:privacytools.io) and things became very unstable. CPU was at 100% due to replication_notifier being stuck on full throttle. Didn't resolve until I restarted both Synapse and the event_creator - the only worker I run.

Screenshot from 2019-07-09 12-59-00

Events shows 40Hz m.room.member related to matrix.org when things got wedged after joining the rooms.

Selection_215

Running Synapse develop branch at commit cb8d568cf90bb0f5f07ee9e7e6796ad7cd83361f

jaywink on 9 Jul 2019

A bot from my server was invited to #fedora:matrix.org, which eventually made federation sending stop working. When I noticed it, synapse was making logs at around 300mb/min, so the original errors are long gone.

Here's one of the stack traces about the federation sender stream failing. The first two error messages were ~1.3mb each (containing the whole latest room state?), maybe something should prevent that?

2019-09-25 18:58:45,371 - synapse.replication.tcp.resource - 200 - INFO - replication_notifier-2176370- Failed to handle stream events
2019-09-25 18:58:45,372 - synapse.metrics.background_process_metrics - 202 - ERROR - replication_notifier-2176370- Background process 'replication_notifier' threw an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(8595496, '!sWugvEqzNjUsmflyPC:matrix.org', 'm.room.member', '@wayne734:matrix.org', '$1506025804571122MsqQl:matrix.org'), 1309125 more characters...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(8595496, '!sWugvEqzNjUsmflyPC:matrix.org', 'm.room.member', '@wayne734:matrix.org', '$1506025804571122MsqQl:matrix.org'), 1309125 more characters...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <generator object merge at 0x7f9155c1cf48>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/synapse/metrics/background_process_metrics.py", line 200, in run
    yield func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/resource.py", line 198, in _run_notifier_loop
    updates, current_token = yield stream.get_updates()
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/streams/_base.py", line 158, in get_updates
    updates, current_token = yield self.get_updates_since(self.last_token)
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/streams/_base.py", line 199, in get_updates_since
    raise Exception("stream %s has fallen behind" % (self.NAME))
Exception: stream events has fallen behind

tulir on 25 Sep 2019

Same in assorted colours:

2019-11-29 07:01:30,589 - synapse.replication.tcp.protocol - 532 - ERROR - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Failed to handle REPLICATE command
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wking:matrix.org', '$15009018172787008uGnlB:matrix.org'),.......


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wking:matrix.org', '$15009018172787008uGnlB:matrix.org'), (1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wkennington:matrix.org', '$1484152256110686udaKk:matrix.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <generator object merge at 0x7effd5f59660>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/synapse/replication/tcp/protocol.py", line 484, in subscribe_to_stream
    stream_name, token
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/synapse/util/metrics.py", line 71, in measured_func
    r = yield func(self, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python3/dist-packages/synapse/replication/tcp/streams/_base.py", line 199, in get_updates_since
    raise Exception("stream %s has fallen behind" % (self.NAME))
Exception: stream events has fallen behind
2019-11-29 07:01:30,629 - synapse.replication.tcp.protocol - 252 - WARNING - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Closing connection
2019-11-29 07:01:30,629 - synapse.replication.tcp.protocol - 373 - INFO - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Connection was closed
2019-11-29 07:01:30,643 - synapse.replication.tcp.protocol - 354 - INFO - - [synapse.app.synchrotron-iFLjN] Replication connection closed: <twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>
2019-11-29 07:01:30,643 - synapse.replication.tcp.protocol - 373 - INFO - - [synapse.app.synchrotron-iFLjN] Connection was closed

grinapo on 29 Nov 2019

the stacktrace has evolved slightly since 2017 as this code has been updated, but the problem is still very much there and biting people today.

@tulir's stacktrace in https://github.com/matrix-org/synapse/issues/2738#issuecomment-535180595 is representative; the hallmarks are Background process 'replication_notifier' threw an exception, with a huge stacktrace culminating in Exception: stream events has fallen behind.

Certain actions (joining a large room) can require a huge update to be sent over the replication stream to the workers; the huge update is misinterpreted as the workers getting behind and causes an exception. The only way out is to restart the master and all the workers.

This has been fudged around on matrix.org via 1766a5fdc0. I am ... displeased ... to discover this hasn't made it to mainline.

@grinapo's similar exception with Failed to handle REPLICATE command is due to workers getting behind; the fact they don't recover is https://github.com/matrix-org/synapse/issues/4963.