There appears to be nothing indicating a problem in the logs, however there's circumstantial evidence that when synapse receives higher than normal traffic it can cause the federation_sender to stop working (no activity), therefore not federating with remote servers. The federation_sender logs don't seem to have anything out of the ordinary - it just stops sending requests. The main synapse process complains about the events stream falling behind, but doesn't seem to cause problems until 12 minutes later.
This has happened about 10 times in the past to t2bot.io, and each time the number of events being persisted was always elevated (double it's normal rate) before the federation_sender stopped working. For t2bot.io "normal" is defined as 2-3Hz. Each time the federation_sender has stopped the persisted events were going through at >6Hz (this latest being ~6-10Hz).
Here's the timeline for the problem (in UTC):
events stream was falling behindDuring this time the only error spat out was (repeated every few seconds):
homeserver - 2017-12-17 03:13:04,014 - twisted - 131 - CRITICAL - -
Traceback (most recent call last):
File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/resource.py", line 164, in on_notifier_poke
updates, current_token = yield stream.get_updates()
File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 169, in get_updates
updates, current_token = yield self.get_updates_since(self.last_token)
File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 200, in get_updates_since
raise Exception("stream %s has fallen behined" % (self.NAME))
Exception: stream current_state_deltas has fallen behined
Further, during this time incoming federation was unaffected. Synapse was still processing events and passing them along to appservices. Only outbound federation was affected.
More in-depth logs are available upon request.
apparently this is still happening on 0.27.3 - @erikjohnston did you see anything in this area when painting go-faster stripes on the federation-sender the other day (by parallelising sends?)
fwiw this always correlates with a higher number of events being persisted. I wouldn't be surprised if the federation_sender was spamming get event requests, choking out other traffic.
This was triggered on the utwente.io homeserver when it persisted about 120k m.room.member events from matrix.org in <2 minutes (according to synapse_storage_events_persisted_events_sep).
Unfortunately, the logs don't seem to contain anything useful other than the error given above.

Judging from the fact that the federation sender flatlines, my original hypothesis of it doing a lot of work doesn't stand.
The latest outage (screenshot in last comment) appears to be directly attributed to Voyager joining HQ (https://github.com/matrix-org/synapse/issues/3337), however I'm still not sure why the federation sender gave up on life. It appears as though the stream falling behind breaks replication, therefore the federation sender does nothing.
Although the logs don't actually show a 35Hz stream of events being logged, the timing and verbosity of Voyager's join to HQ is oddly convenient.
From that point on, more streams fall behind and eventually the federation sender stops getting traffic.
Had something similar on my server. Joined 3 largeish rooms in a short period (#archlinux:matrix.org, #manjaro:matrix.org and #general:privacytools.io) and things became very unstable. CPU was at 100% due to replication_notifier being stuck on full throttle. Didn't resolve until I restarted both Synapse and the event_creator - the only worker I run.

Events shows 40Hz m.room.member related to matrix.org when things got wedged after joining the rooms.

Running Synapse develop branch at commit cb8d568cf90bb0f5f07ee9e7e6796ad7cd83361f
A bot from my server was invited to #fedora:matrix.org, which eventually made federation sending stop working. When I noticed it, synapse was making logs at around 300mb/min, so the original errors are long gone.
Here's one of the stack traces about the federation sender stream failing. The first two error messages were ~1.3mb each (containing the whole latest room state?), maybe something should prevent that?
2019-09-25 18:58:45,371 - synapse.replication.tcp.resource - 200 - INFO - replication_notifier-2176370- Failed to handle stream events
2019-09-25 18:58:45,372 - synapse.metrics.background_process_metrics - 202 - ERROR - replication_notifier-2176370- Background process 'replication_notifier' threw an exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
StopIteration: [(8595496, '!sWugvEqzNjUsmflyPC:matrix.org', 'm.room.member', '@wayne734:matrix.org', '$1506025804571122MsqQl:matrix.org'), 1309125 more characters...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
StopIteration: [(8595496, '!sWugvEqzNjUsmflyPC:matrix.org', 'm.room.member', '@wayne734:matrix.org', '$1506025804571122MsqQl:matrix.org'), 1309125 more characters...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
StopIteration: <generator object merge at 0x7f9155c1cf48>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/synapse/metrics/background_process_metrics.py", line 200, in run
yield func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python3.7/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/resource.py", line 198, in _run_notifier_loop
updates, current_token = yield stream.get_updates()
File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python3.7/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/streams/_base.py", line 158, in get_updates
updates, current_token = yield self.get_updates_since(self.last_token)
File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/streams/_base.py", line 199, in get_updates_since
raise Exception("stream %s has fallen behind" % (self.NAME))
Exception: stream events has fallen behind
Same in assorted colours:
2019-11-29 07:01:30,589 - synapse.replication.tcp.protocol - 532 - ERROR - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Failed to handle REPLICATE command
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
StopIteration: [(1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wking:matrix.org', '$15009018172787008uGnlB:matrix.org'),.......
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
StopIteration: [(1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wking:matrix.org', '$15009018172787008uGnlB:matrix.org'), (1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wkennington:matrix.org', '$1484152256110686udaKk:matrix.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
StopIteration: <generator object merge at 0x7effd5f59660>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/protocol.py", line 484, in subscribe_to_stream
stream_name, token
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3/dist-packages/synapse/util/metrics.py", line 71, in measured_func
r = yield func(self, *args, **kwargs)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/streams/_base.py", line 199, in get_updates_since
raise Exception("stream %s has fallen behind" % (self.NAME))
Exception: stream events has fallen behind
2019-11-29 07:01:30,629 - synapse.replication.tcp.protocol - 252 - WARNING - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Closing connection
2019-11-29 07:01:30,629 - synapse.replication.tcp.protocol - 373 - INFO - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Connection was closed
2019-11-29 07:01:30,643 - synapse.replication.tcp.protocol - 354 - INFO - - [synapse.app.synchrotron-iFLjN] Replication connection closed: <twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>
2019-11-29 07:01:30,643 - synapse.replication.tcp.protocol - 373 - INFO - - [synapse.app.synchrotron-iFLjN] Connection was closed
the stacktrace has evolved slightly since 2017 as this code has been updated, but the problem is still very much there and biting people today.
@tulir's stacktrace in https://github.com/matrix-org/synapse/issues/2738#issuecomment-535180595 is representative; the hallmarks are Background process 'replication_notifier' threw an exception, with a huge stacktrace culminating in Exception: stream events has fallen behind.
Certain actions (joining a large room) can require a huge update to be sent over the replication stream to the workers; the huge update is misinterpreted as the workers getting behind and causes an exception. The only way out is to restart the master and all the workers.
This has been fudged around on matrix.org via 1766a5fdc0. I am ... displeased ... to discover this hasn't made it to mainline.
@grinapo's similar exception with Failed to handle REPLICATE command is due to workers getting behind; the fact they don't recover is https://github.com/matrix-org/synapse/issues/4963.
essentially this happens whenever the homeserver joins a room with more than 10000 state events :/
hopefully made to go away by #6967.