Synapse: Outgoing federation suddenly broken

Created on 12 Nov 2020  Â·  9Comments  Â·  Source: matrix-org/synapse

Description

Since a few days (not exactly sure when), my HS seems incapable of sending outgoing federation messages. Incoming messages do arrive correctly.

Weirder part, my HS apparently manages to send read receipts to other servers (people on matrix have told me they see my read icon advance in element). Incoming read receipts arrive as well.

I don't see any error in my synapse log, but whenever I try to send a message in a federated room, the logs are flooded for a few seconds with a lot of lines like

2020-11-12 08:15:29,153 - synapse.federation.sender.transaction_manager - 127 - INFO - federation_transaction_transmission_loop-1480260 - TX [matrix.org] {1605094380906} Sending transaction [
1605094380906], (PDUs: 3, EDUs: 0)
2020-11-12 08:15:29,703 - synapse.http.matrixfederationclient - 204 - INFO - federation_transaction_transmission_loop-1480260 - {PUT-O-1142924} [matrix.org] Completed request: 200 OK in 0.55
secs - PUT matrix://matrix.org/_matrix/federation/v1/send/1605094380906

relating to a lot of other servers (way more than the number of servers participating in the room I'm trying to send a message into).

I don't really see what I can do to figure out what is going on, but I remain available to check anything you think would be useful to debug this. I am available for more direct discussion as Levans on Freenode IRC (I'm in #matrix-synapse / #synapse:matrix.org so you can ping me there for example).

Steps to reproduce

Sadly I don't know what started it, it seemingly started by itself at some point.

Restarting synapse or rebooting the server does not help.

Version information

  • Synapse 1.22.1
  • installed from https://packages.matrix.org/debian buster InRelease on a debian 10.6 server
  • Instance with only a few users (and only one participating in federated rooms, me), and running without workers enabled
  • The instance name is safaradeg.net, and the server is delegated to the subdomain matrix.safaradeg.net
bug p2

All 9 comments

That sounds bad. Do you have some room IDs that you've sent messages in that aren't federating? Preferably one's with matrix.org in so that we can have a look at its logs.

I just sent a test message to #synapse:matrix.org (!ehXvUhWNASUkSLvAGP:matrix.org) from my account @levans:safaradeg.net.

The message as displayed by element, if that helps:

{
  "type": "m.room.message",
  "sender": "@levans:safaradeg.net",
  "content": {
    "msgtype": "m.text",
    "body": "Test message, can you receive me ?"
  },
  "origin_server_ts": 1605175426928,
  "unsigned": {
    "age": 232,
    "transaction_id": "m1605175426260.4"
  },
  "event_id": "$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo",
  "room_id": "!ehXvUhWNASUkSLvAGP:matrix.org"
}

Thanks! That event doesn't seem to have made it to either matrix.org or jki.re :confused:

I've looked at both matrix.org and jki.re logs, and it seems that your server is repeatedly sending the same set of events to repeatedly to matrix.org and jki.re (though different sets to each), which is very weird. The resent events seem to be from 2020-11-09 ~08:00. Though both servers do seem to have a few events that arrived since then, even this morning at 2020-11-12 08:18:19+00.

This sounds like it could be related to the recent work to retry sending events to other servers after outages.

I'd be interested in seeing the contents of some of the federation DB tables if possible, I think the following would be very useful to know:

SELECT * FROM destinations WHERE destination IN ('matrix.org', 'jki.re');
SELECT * FROM destinations WHERE destination IN ('matrix.org', 'jki.re');
SELECT MAX(stream_ordering) FROM events;
SELECT stream_ordering FROM events WHERE event_id = '$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo';
SELECT * FROM federation_stream_position;
synapse_matrix=# SELECT * FROM destinations WHERE destination IN ('matrix.org', 'jki.re');
 destination | retry_last_ts | retry_interval | failure_ts | last_successful_stream_ordering
-------------+---------------+----------------+------------+---------------------------------
 matrix.org  |             0 |              0 |            |                         7972966
 jki.re      |             0 |              0 |            |                         7972924
(2 rows)

synapse_matrix=# SELECT MAX(stream_ordering) FROM events;
   max
---------
 7982676
(1 row)

synapse_matrix=# SELECT stream_ordering FROM events WHERE event_id = '$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo';
 stream_ordering
-----------------
         7982585
(1 row)

synapse_matrix=# SELECT * FROM federation_stream_position;
    type    | stream_id | instance_name
------------+-----------+---------------
 federation |        -1 | master
 events     |   7972922 | master
(2 rows)

Thanks! This does look like both matrix.org and jki.re are stuck in "catchup" mode (the stream orderings are significantly earlier than the current stream orderings).

@vberger do you see log lines which contain Catching up rooms to?

Also if its easy for you to do it would be very useful if you could apply the following patch and share the logs after letting it run for a bit, if not don't worry about it.

diff --git a/synapse/federation/sender/per_destination_queue.py b/synapse/federation/sender/per_destination_queue.py
index db8e456fe..285f767b9 100644
--- a/synapse/federation/sender/per_destination_queue.py
+++ b/synapse/federation/sender/per_destination_queue.py
@@ -240,11 +240,13 @@ class PerDestinationQueue:
             # hence why we throw the result away.
             await get_retry_limiter(self._destination, self._clock, self._store)

+            logger.info("Catching up %s?", self._destination)
             if self._catching_up:
                 # we potentially need to catch-up first
                 await self._catch_up_transmission_loop()
                 if self._catching_up:
                     # not caught up yet
+                    logger.info("Still catching up %s!", self._destination)
                     return

             pending_pdus = []
@@ -452,6 +454,10 @@ class PerDestinationQueue:
             # Sadly, this means we can't do anything here as we don't know what
             # needs catching up — so catching up is futile; let's stop.
             self._catching_up = False
+            logger.info(
+                "Caught up %s, due to null last successful stream ordering!",
+                self._destination,
+            )
             return

         # get at most 50 catchup room/PDUs
@@ -465,10 +471,15 @@ class PerDestinationQueue:
                 # of a race condition, so we check that no new events have been
                 # skipped due to us being in catch-up mode

+                logger.info("No more events for %s...", self._destination)
+
                 if self._catchup_last_skipped > self._last_successful_stream_ordering:
                     # another event has been skipped because we were in catch-up mode
+                    logger.info("... but got a poke so looping %s!", self._destination)
                     continue

+                logger.info("Caught up as not poked %s!", self._destination)
+
                 # we are done catching up!
                 self._catching_up = False
                 break
@@ -500,6 +511,9 @@ class PerDestinationQueue:
             )

             if not success:
+                logger.info(
+                    "Catch up failed as failed to poke remote %s!", self._destination
+                )
                 return

             sent_transactions_counter.inc()

Yes, I have quite a lot of lines like

2020-11-11 11:16:08,632 - synapse.federation.sender.per_destination_queue - 496 - INFO - federation_transaction_transmission_loop-242 - Catching up rooms to t2bot.io: ['!FKxhjkICpjBVzVeCdF:safaradeg.net']

relating to a lot of different servers in my logs of the last few days.

I'll try the patch.

So I applied the patch and added [DEBG] in the messages for easier grepping. Then grepping for it in the log files just after restarting my HS gives this : https://paste.rs/Ac8

Thanks! Hopefully that is enough to investigate more deeply

After yet another reboot of my HS it looks like things suddenly fixed themselves, on all rooms at once.

Before rebooting I actually was adding some more debug-logging, and noticed something that surprised me (before I realized the federation was working again):

It appeared that this piece of code, which updates the last_successful_stream_ordering for a given destination

https://github.com/matrix-org/synapse/blob/f31f8e63198cfe46af48d788dbb294aba9155e5a/synapse/federation/sender/per_destination_queue.py#L358-L368

was invoked several times in a row with the same value for last_successful_stream_ordering. I don't know if it is relevant to the current bug, but I found it odd so I guess it's worth putting it here.

Was this page helpful?
0 / 5 - 0 ratings