Synapse: Synapse stops responding to incoming requests if PostgreSQL stops responding

Created on 16 Oct 2020  路  24Comments  路  Source: matrix-org/synapse

Description

For a while I've been having an issue where after a certain amount of time after a fresh boot, Synapse stops receiving (or at least handling) federation events and often even stops responding to client connections. CPU usage goes down to basically 0% and only a machine reboot (not just rebooting the service) fixes the issue. Strangely enough the federation tester at that point reports federation works fine, but clearly it doesn't.

Note that I have the federation reader and synchrotron workers enabled and they replicate via Redis.

I get some errors in the federation reader log which might be relevant:

2020-10-16 08:15:27,206 - twisted - 250 - CRITICAL - - Rollback failed
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection
    conn.rollback()
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 52, in rollback
    self._connection.rollback()
psycopg2.InterfaceError: connection already closed

2020-10-16 08:15:27,217 - synapse.util.ratelimitutils - 182 - DEBUG - PUT-52551- Ratelimit [281473023937824]: Processed req
2020-10-16 08:15:27,219 - synapse.util.async_helpers - 293 - DEBUG - PUT-52551- Releasing linearizer lock 'fed_txn_handler' for key 'en-root.org'
2020-10-16 08:15:27,222 - synapse.federation.transport.server - 428 - ERROR - PUT-52551- on_incoming_transaction failed
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/synapse/federation/transport/server.py", line 424, in on_PUT
    code, response = await self.handler.on_incoming_transaction(
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 164, in on_incoming_transaction
    return await self._transaction_resp_cache.wrap(
  File "/usr/lib/python3.8/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3.8/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 190, in _on_incoming_transaction_inner
    result = await self._handle_incoming_transaction(
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 209, in _handle_incoming_transaction
    response = await self.transaction_actions.have_responded(origin, transaction)
  File "/usr/lib/python3.8/site-packages/synapse/federation/persistence.py", line 54, in have_responded
    return await self.store.get_received_txn_response(transaction_id, origin)
  File "/usr/lib/python3.8/site-packages/synapse/storage/databases/main/transactions.py", line 77, in get_received_txn_response
    return await self.db_pool.runInteraction(
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 569, in runInteraction
    result = await self.runWithConnection(
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 646, in runWithConnection
    return await make_deferred_yieldable(
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/lib/python3.8/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 626, in inner_func
    assert not self.engine.in_transaction(conn)
AssertionError
2020-10-16 08:15:27,230 - synapse.http.server - 81 - ERROR - PUT-52551- Failed handle request via 'FederationSendServlet': <SynapseRequest at 0xffffa21b07f0 method='PUT' uri='/_matrix/federation/v1/send/1594808804394' clientproto='HTTP/1.0' site=8084>
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3.8/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 190, in _on_incoming_transaction_inner
    result = await self._handle_incoming_transaction(
  File "/usr/lib/python3.8/site-packages/synapse/federation/federation_server.py", line 209, in _handle_incoming_transaction
    response = await self.transaction_actions.have_responded(origin, transaction)
  File "/usr/lib/python3.8/site-packages/synapse/federation/persistence.py", line 54, in have_responded
    return await self.store.get_received_txn_response(transaction_id, origin)
  File "/usr/lib/python3.8/site-packages/synapse/storage/databases/main/transactions.py", line 77, in get_received_txn_response
    return await self.db_pool.runInteraction(
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 569, in runInteraction
    result = await self.runWithConnection(
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 646, in runWithConnection
    return await make_deferred_yieldable(
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3.8/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/lib/python3.8/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/lib/python3.8/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/lib/python3.8/site-packages/synapse/storage/database.py", line 626, in inner_func
    assert not self.engine.in_transaction(conn)
AssertionError

Steps to reproduce

I don't know honestly. It seems to be specific to something in my setup, but I can't pinpoint what it is.

Version information

  • Homeserver: fam-ribbers.com

If not matrix.org:

  • Version: 1.21.2

  • Install method: The Alpine Linux package manager

  • Platform: Alpine Linux, bare metal. Running on a RockPro64 so ARM64, with the PostgreSQL database running on a different (also RockPro64) machine.

bug info-needed p1 regression

Most helpful comment

Well my homeserver has been unreachable because of this for a few days now since I got annoyed by constantly restarting it. Element seems to be connected, but I haven't received any messages in any room for ages, even though those same rooms seem to be very active if I check them from a matrix.org account.

Could this _please_ get some dev attention? My setup is literally unusable because of this.

All 24 comments

psycopg2.InterfaceError: connection already closed

It seems that Synapse is losing its connection to Postgres. It is known that Synapse does not handle the db connection disappearing and reappearing very well. Is your Postgres overloaded or running out of memory?

If you get stuck investigating postgres, then (redacted) logs from postgres and synapse with SQL debug logging turned out would be helpful.

Hmm, maybe. CPU usage of the postgres process is normally ~10%, but if I start opening a lot of rooms after each other it jumps up a lot and starts maxing out some cores. Out of memory doesn't seem to be it; the machine running the DB has some left and it doesn't really increase if I start switch between a lot of rooms quickly.

It does seem like a probable cause yes, although I then wonder why just restarting the Synapse process and workers isn't enough :thinking:

@PureTryOut I tried to DM you about this on matrix (because I'm running into the exact same problem since a few days ago), but it seems your server is not responding to requests right now.. If we're both experiencing starting a few days ago, I'd blame this on a recent synapse update. Still, maybe it makes sense to compare our setups to help synapse devs with reproducing?

I'm running both synapse and postgres in docker containers, and traefik for getting traffic around between services, both for http traffic coming from outside, and for getting the postgres traffic from synapse to postgres. Any similarities in your setup? Differences I know already: I don't run any workers, and restarting the containers is enough for me to fix it.

I've found the exact same issue happening rather often at home as well, with Synapse running in Kubernetes, with workers.
Seems to be new from around 1.21.
My setup also only needs a container recreation to get things federating again, but then that's basically the same as a reboot of a regular machine.

I'm also using traefik to get traffic in to the Synapse server and the workers, but all internal communication is done with the K8s overlay network. In there I'm running a triple-node HA postgres cluster for the database, with a proxy in front of it to redirect to the active master. (The proxy forwards all traffic, and generates a connection reset if the master dies, which hasn't happened during these issues)

An example of the exceptions I'm seeing; (Here taken from a federation reader worker)

2020-10-21 14:16:24,458 - twisted - 250 - CRITICAL - - Rollback failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection
    conn.rollback()
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 52, in rollback
    self._connection.rollback()
psycopg2.InterfaceError: connection already closed

2020-10-21 14:16:24,460 - synapse.federation.transport.server - 428 - ERROR - PUT-58348- on_incoming_transaction failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/transport/server.py", line 425, in on_PUT
    origin, transaction_data
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 169, in on_incoming_transaction
    request_time,
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 191, in _on_incoming_transaction_inner
    origin, transaction, request_time
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 252, in _handle_incoming_transaction
    await self.transaction_actions.set_response(origin, transaction, 200, response)
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/persistence.py", line 67, in set_response
    transaction_id, origin, code, response
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/transactions.py", line 130, in set_received_txn_response
    desc="set_received_txn_response",
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 727, in simple_insert
    await self.runInteraction(desc, self.simple_insert_txn, table, values)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 577, in runInteraction
    **kwargs
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 647, in runWithConnection
    self._db_pool.runWithConnection(inner_func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 626, in inner_func
    assert not self.engine.in_transaction(conn)
AssertionError
2020-10-21 14:16:24,460 - synapse.http.server - 85 - ERROR - PUT-58348- Failed handle request via 'FederationSendServlet': <SynapseRequest at 0x7efee762f6d0 method='PUT' uri='/_matrix/federation/v1/send/1602441424225' clientproto='HTTP/1.1' site=8083>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 191, in _on_incoming_transaction_inner
    origin, transaction, request_time
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/federation_server.py", line 252, in _handle_incoming_transaction
    await self.transaction_actions.set_response(origin, transaction, 200, response)
  File "/usr/local/lib/python3.7/site-packages/synapse/federation/persistence.py", line 67, in set_response
    transaction_id, origin, code, response
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/transactions.py", line 130, in set_received_txn_response
    desc="set_received_txn_response",
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 727, in simple_insert
    await self.runInteraction(desc, self.simple_insert_txn, table, values)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 577, in runInteraction
    **kwargs
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 647, in runWithConnection
    self._db_pool.runWithConnection(inner_func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 626, in inner_func
    assert not self.engine.in_transaction(conn)
AssertionError

I tried to DM you about this on matrix (because I'm running into the exact same problem since a few days ago), but it seems your server is not responding to requests right now..

Well yes, that's exactly why I made this issue haha. I become unreachable for ages at a time and I'm using an old matrix.org alongside my normal one now just to keep up-to-date with some important rooms, it's really annoying.

I use Nginx for proxying incoming requests to the Synapse workers, but internally Synapse has a direct connection to the PostgreSQL box. PostgreSQL does run in a Docker container, but I'm not sure if that matters.

I've been experiencing this from around 1.20 actually, but I'm not sure if the Synapse version is to blame in my case as I just changed my database setup when this started happening (from a beefy, power hungry x86_64 machine to a lower performance but low power ARM64 machine) which might also be the cause of this.

I usually notice within a few hours, considering matrix is my primary means of communication, but still... I've been offline again for 3 hours this morning due to this bug. @anoadragon453 Is there anything we can do to help this getting fixed?

psycopg2.InterfaceError: connection already closed

It seems that Synapse is losing its connection to Postgres. It is known that Synapse does not handle the db connection disappearing and reappearing very well. Is your Postgres overloaded or running out of memory?

just had a similar issue overnight, after an automated postgres restart at 03:30, which promptly made Synapse break, with low cpu, database failures and http 500 responses.
Any reason why synapse handles postgres connection loss this badly? Mastodon survives this just fine each night, and Synapse deadlocking like this has happened a few times now

t2bot.io also had this overnight - at roughly 08:00 UTC the database went missing on the network for 2 minutes which sent the server into sadness. The synchrotron started consuming memory until it was eventually killed and the workers responded with 500 errors to all other requests. This lead to a 4 hour outage (though that is partially because of how I have the paging rules set up).

It used to recover fine from database problems - it might have been a bit rocky and still wanted a restart, but it at least didn't require manual intervention.

Well my homeserver has been unreachable because of this for a few days now since I got annoyed by constantly restarting it. Element seems to be connected, but I haven't received any messages in any room for ages, even though those same rooms seem to be very active if I check them from a matrix.org account.

Could this _please_ get some dev attention? My setup is literally unusable because of this.

happened today at exactly 3:30 again, I've now disabled my postgres restart service, but that makes it rather reproducible?

Hey all, apologies for the delay on getting this fixed and the annoyance it's caused in the meantime. We suspect this is caused by the postgres autocommit change made in https://github.com/matrix-org/synapse/issues/8456 that went out in v1.21.0.

It's a subtle and nasty bug. We'll try to get this sorted and a fix out asap.

@clokep that title seems wrong. My PostgreSQL instance doesn't restart at any point. It just gets a bit overloaded at some points after which Synapse stops responding. I never had the PostgreSQL service restart at any point.

@clokep that title seems wrong. My PostgreSQL instance doesn't restart at any point. It just gets a bit overloaded at some points after which Synapse stops responding. I never had the PostgreSQL service restart at any point.

Ah, that's good info. It seemed from the conversation it was only due to restarts. Thanks!

I'm running into this issue roughly every other day, it's really annoying. One of the worst parts: It seems like Synapse still answers to requests that don't need to touch the database just fine, so my syncs don't break and I often just realize I've run into this after hours of suspicious silence or when trying to send a message.

Can people try setting cp_reconnect: true under their database.args homeserver config option and see if that helps? e.g:

database:
  names: psycopg2
  args:
    user: xxx
    password: yyy
    database: synapse
    host: localhost
    cp_min: 5
    cp_max: 10
    cp_reconnect: true

We've got a PR (#8726) targeting this issue that enables it, but it'd be good to get confirmation on whether it works early on.

I've reproduced the issue by restarting pg to force connection loss to make sure this works for reproducing the issue, the logs appeared as above and message sending broke. After adding cp_reconnect: true, the "connection already closed" log lines and tracebacks disappeared, but I still can't send messages until I also restart synapse.

@jcgruenhage Are there any logs in postgres during the outage? I know sometimes after restart it can take forever for postgres to start accepting connections.

@jcgruenhage If you can reproduce while having a tcpdump running then that would be awesomely useful

I gave that setting a shot for a few days, but after a day or so of trying to catch up after not having incoming federation for at least a month, it stopped responding again. I saw no PostgreSQL connection errors in the log this time, but the symptons were the same.

Weird. I think we'll need TCP dumps to see what's going on.

A random stab in the dark is that the connections are being dropped/black holed without Synapse noticing, if so then the default tcp timeout settings will mean it can take ages for the connection to finally timeout. Fiddling with the tcp_keepalives_* settings to make things timeout quicker may help in such a case: https://www.postgresql.org/docs/current/runtime-config-connection.html

I'm seeing a bunch of new errors after changing the setting as well, here are copies from my federation sender worker;

2020-11-18 08:37:40,582 - synapse.storage.databases.main.event_push_actions - 530 - INFO - event_push_action_stream_orderings-221- Searching for stream ordering 1 month ago
2020-11-18 08:37:40,582 - synapse.storage.txn - 512 - WARNING - event_push_action_stream_orderings-221- [TXN OPERROR] {_find_stream_orderings_for_times-f5a1a} server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
 0/5
2020-11-18 08:37:40,582 - twisted - 250 - CRITICAL - event_push_action_stream_orderings-221- Rollback failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 702, in inner_func
    return func(db_conn, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 517, in new_transaction
    conn.rollback()
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 157, in rollback
    self.conn.rollback(*args, **kwargs)
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 56, in rollback
    self._connection.rollback()
psycopg2.InterfaceError: connection already closed

2020-11-18 08:37:40,583 - twisted - 250 - CRITICAL - - Rollback failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection
    conn.rollback()
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 56, in rollback
    self._connection.rollback()
psycopg2.InterfaceError: connection already closed

2020-11-18 08:37:40,583 - twisted - 250 - CRITICAL - - Rollback failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection
    conn.rollback()
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 65, in rollback
    self._pool.disconnect(self._connection)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 444, in disconnect
    raise Exception("wrong connection for thread")
builtins.Exception: wrong connection for thread

2020-11-18 08:37:40,583 - synapse.metrics.background_process_metrics - 211 - ERROR - event_push_action_stream_orderings-221- Background process 'event_push_action_stream_orderings' threw an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 505, in new_transaction
    r = func(cursor, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/event_push_actions.py", line 532, in _find_stream_orderings_for_times_txn
    txn, self._clock.time_msec() - 30 * 24 * 60 * 60 * 1000
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/event_push_actions.py", line 584, in _find_first_stream_ordering_after_ts_txn
    txn.execute("SELECT MAX(stream_ordering) FROM events")
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 266, in execute
    self._do_execute(self.txn.execute, sql, *args)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 292, in _do_execute
    return func(sql, *args)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/synapse/metrics/background_process_metrics.py", line 206, in run
    result = await result
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/databases/main/event_push_actions.py", line 526, in _find_stream_orderings_for_times
    self._find_stream_orderings_for_times_txn,
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 635, in runInteraction
    **kwargs
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 708, in runWithConnection
    self._db_pool.runWithConnection(inner_func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 702, in inner_func
    return func(db_conn, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 517, in new_transaction
    conn.rollback()
  File "/usr/local/lib/python3.7/site-packages/synapse/storage/database.py", line 157, in rollback
    self.conn.rollback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py", line 70, in rollback
    raise ConnectionLost()
twisted.enterprise.adbapi.ConnectionLost

These errors continue until I restart the container, at which point everything starts working again until the next time my postgres becomes overloaded.

I think the info from https://github.com/matrix-org/synapse/issues/8574#issuecomment-728104895 is still what is needed to help debug this further! Thanks!

Unfortunately I haven't run into the issue at a time when I haven't _required_ the continued use of federation, but if it happens during non-working hours I'll do my best to grab a tcpdump as well.

So I just ran into this again, I took the time to grab a tcpdump of all PostgreSQL traffic this time, can't really upload all of it though as it's massive (generates at over 10MB per second) and completely full of sensitive information. Though there's not a single lost connection or response-less request in there.
Synapse seems to have gotten stuck constantly running a whole bunch of SELECT state_key FROM current_state_events WHERE type = 'm.room.member' AND room_id = '[...]' AND membership = 'join' queries. Rapidly and repeatedly looping through all the rooms the server is part of, at around two times per second. In 18 seconds of capture I can see a full 38 queries to get all members of !OGEhHVWSdvArJzumhm:matrix.org - the Matrix HQ room (with its 8k members) - for example.

Apart from those, the entire rest of the capture is just a mix of COMMIT/BEGIN ISOLATION LEVEL REPEATABLE READ/SET default_transaction_isolation TO DEFAULT, and what seems like the handling of incoming federation (inserting transactions, SELECT-ing server signatures, INSERT-ing server_keys_json, updating presence, looking at destinations), checking of group information/membership, and updating of user last-seen data.

Could it be that a cache breaks when the database connection resets?

Edit:
After a restart, another 15 seconds of capture shows a grand total of two such SELECT state_key ... queries, so it seems reasonable.

Edit:
I'm going to see if I can get a similar capture from if my federation sender worker breaks, so I can cut away any other noise.

Was this page helpful?
0 / 5 - 0 ratings