Synapse: Some Postgres database queries are long running / blocking

Created on 12 Aug 2018 · 23Comments · Source: matrix-org/synapse

Description

I installed synapse from source on Debian using virtualenv. I am using postgres 9.6 as a backend.
The client is behind a Let's Encrypted nginx reverse proxy.
The federation port is in the open.

The installation is pretty fresh, only one user (me) is on the system, however I already joined some big channels (#matrix:matrix.org).

When the instance boots, some queries blocks requests from my client (Riot), to unblock it, I need to kill the this query:

DROP INDEX IF EXISTS access_tokens_device_id to probably unblock this:
SELECT users.name, users.is_guest, access_tokens.id as token_id, access_tokens.device_id FROM users INNER JOIN access_tokens on users.name = access_tokens.user_id WHERE token = 'FOOBAR'

Afterwards I have new queries blocking:

LOCK TABLE user_ips in EXCLUSIVE MODE and
CREATE INDEX CONCURRENTLY user_ips_device_id ON user_ips (user_id, device_id, last_seen)

This query also sometimes hangs around:
SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 6131 - 100000 GROUP BY room_id

Steps to reproduce

not really sure

this is my DB setup:

database:
  name: "psycopg2"
  args:
    user: "user"
    password: "xxx"
    database: "synapse"
    host: "127.0.0.1"
    cp_min: 5
    cp_max: 10

Version information

Version: 67dbe4c89944cf5957f806ad63f60880e2242b23
Install method: pip
Platform: VPS with Debian Stretch
Postgresql: 9.6

Source

mguentner

👍1

Most helpful comment

Right, I think I know what's causing this. I'll sort out a fix.

richvdh on 24 Aug 2018

❤1 🎉1 😄1 👍1

All 23 comments

So, I changed the value of deadlock_timeout in /etc/postgresql/9.6/main/postgresql.conf
to 1s like this:

deadlock_timeout = 1s

Documentation: https://www.postgresql.org/docs/9.6/static/runtime-config-locks.html

I don't know whether the behavior I described is a bug of synapse or not. But the fact that these queries block themselves seems strange to me.

mguentner on 13 Aug 2018

Apparently deadlock_timeout has little effect.
I now have lock_timeout = 2s in /etc/postgresql/9.6/main/postgresql.conf and see this in the log file:

2018-08-17 22:30:01,071 - synapse.storage.background_updates - 187 - INFO - background_updates-0 - Starting update batch on background update 'user_ips_device_index'
2018-08-17 22:30:01,082 - synapse.storage.background_updates - 361 - INFO - background_updates-0 - Adding index user_ips_device_id to user_ips
2018-08-17 22:30:02,096 - synapse.storage.background_updates - 109 - ERROR - background_updates-0 - Error doing update
Traceback (most recent call last):
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 106, in _run_background_updates
    self.BACKGROUND_UPDATE_DURATION_MS
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 181, in do_next_background_update
    res = yield self._do_background_update(update_name, desired_duration_ms)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 215, in _do_background_update
    items_updated = yield update_handler(progress, batch_size)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 362, in updater
    yield self.runWithConnection(runner)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/_base.py", line 373, in runWithConnection
    inner_func, *args, **kwargs
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/_base.py", line 369, in inner_func
    return func(conn, *args, **kwargs)
  File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 322, in create_index_psql
    c.execute(sql)
OperationalError: canceling statement due to lock timeout

mguentner on 17 Aug 2018

I have the same problem.

database:
  # The database engine name
  name: 'psycopg2'
  # Arguments to pass to the engine
  args:
    # Path to the database
    user: 'messenger'
    dbname: 'messenger'
    cp_min: 5
    cp_max: 10

Synapse sucessfully connects to PostgreSQL via the UNIX socket. I, too, changed the lock_timeout to five seconds which caused the Synapse background update to fail:

2018-08-20 07:37:38,658 - synapse.storage._base - 253 - WARNING - GET-1 - [TXN OPERROR] {get_user_by_access_token-6} terminating connection due to administra│
LINE 1: ...id, access_tokens.device_id FROM users INNER JOIN access_tok...                                                                                   │
                                                             ^                                                                                               │
server closed the connection unexpectedly                                                                                                                    │
        This probably means the server terminated abnormally                                                                                                 │
        before or while processing the request.                                                                                                              │
 0/5                                                                                                                                                         │
2018-08-20 07:37:38,659 - synapse.storage._base - 262 - WARNING - GET-1 - [TXN EROLL] {get_user_by_access_token-6} connection already closed                 │
2018-08-20 07:37:38,660 - twisted - 131 - CRITICAL -  - Rollback failed                                                                                      │
Traceback (most recent call last):                                                                                                                           │
  File "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 250, in inContext                                                         │
    result = inContext.theWork()                                                                                                                             │
  File "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>                                                          │
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)                                                                                         │
  File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 122, in callWithContext                                                      │
    return self.currentContext().callWithContext(ctx, func, *args, **kw)                                                                                     │
  File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 85, in callWithContext                                                       │
    return func(*args,**kw)                                                                                                                                  │
--- <exception caught here> ---                                                                                                                              │
  File "/usr/local/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection                                                │
    conn.rollback()                                                                                                                                          │
  File "/usr/local/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 52, in rollback                                                           │
    self._connection.rollback()                                                                                                                              │
psycopg2.InterfaceError: connection already closed

timoschwarzer on 20 Aug 2018

@timoschwarzer Good to hear, thanks for confirming. What OS and PostgreSQL version are you running?

mguentner on 20 Aug 2018

Debian Stretch and PostgreSQL 9.6

timoschwarzer on 21 Aug 2018

Things logged against "background_updates" are one-time update jobs which happen when you upgrade synapse. They shouldn't take very long. I suggest you just let them run.

richvdh on 22 Aug 2018

They never go away, even after hours. I let them run for ~ 24h. Almost no load while they ran.

mguentner on 22 Aug 2018

ok, can you leave them running for a while, and then grep your logs for background_updates and share what you see there, along with the postgres queries you see in flight.

richvdh on 22 Aug 2018

I have the same issue:

$ grep background_updates path/to/homeserver.log
2018-08-22 17:27:44,370 - synapse.storage.background_updates - 187 - INFO - background_updates-0- Starting update batch on background update 'user_ips_last_seen_index'                                                                                                                                                       
2018-08-22 17:27:44,371 - synapse.storage.background_updates - 361 - INFO - background_updates-0- Adding index user_ips_last_seen to user_ips

$ sudo -u postgres psql synapse_cschwarz_org -c "select * from pg_stat_activity where datname='synapse_cschwarz_org';"
  datid  |       datname        |  pid  | usesysid | usename  | application_name |  client_addr  | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |  wait_event  |        state        | backend_xid | backend_xmin |                                                                                                                               query                                                                                                                               
---------+----------------------+-------+----------+----------+------------------+---------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+--------------+---------------------+-------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 2688374 | synapse_cschwarz_org | 26593 |       10 | postgres | psql             |               |                 |          -1 | 2018-08-22 17:17:15.796457+00 |                               | 2018-08-22 17:28:48.523978+00 | 2018-08-22 17:28:48.525045+00 |                 |              | idle                |             |              | select * from pg_stat_activity where datname='synapse_cschwarz_org';
 2688374 | synapse_cschwarz_org | 78904 |   945951 | synapse  |                  | 10.123.236.10 |                 |       20698 | 2018-08-22 17:27:44.372545+00 | 2018-08-22 17:27:48.411264+00 | 2018-08-22 17:27:48.411264+00 | 2018-08-22 17:27:48.411266+00 | Lock            | virtualxid   | active              |             |              | CREATE  INDEX CONCURRENTLY user_ips_last_seen ON user_ips (user_id, last_seen) 
 2688374 | synapse_cschwarz_org | 70433 |   945951 | synapse  |                  | 10.123.236.10 |                 |       62787 | 2018-08-22 17:27:43.289467+00 | 2018-08-22 17:27:43.291627+00 | 2018-08-22 17:27:43.319919+00 | 2018-08-22 17:27:43.320248+00 |                 |              | idle in transaction |             |    324665381 | SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 35694143 - 100000 GROUP BY room_id
 2688374 | synapse_cschwarz_org | 72459 |   945951 | synapse  |                  | 10.123.236.10 |                 |       37550 | 2018-08-22 17:27:43.367816+00 | 2018-08-22 17:27:53.33506+00  | 2018-08-22 17:27:53.335204+00 | 2018-08-22 17:27:53.335204+00 | Lock            | relation     | active              |             |    324665381 | LOCK TABLE user_ips in EXCLUSIVE MODE
 2688374 | synapse_cschwarz_org | 75793 |   945951 | synapse  |                  | 10.123.236.10 |                 |       32877 | 2018-08-22 17:27:43.369028+00 |                               | 2018-08-22 17:31:02.712776+00 | 2018-08-22 17:31:02.712797+00 |                 |              | idle                |             |              | COMMIT
 2688374 | synapse_cschwarz_org | 78873 |   945951 | synapse  |                  | 10.123.236.10 |                 |       40119 | 2018-08-22 17:27:43.393159+00 |                               | 2018-08-22 17:31:02.714279+00 | 2018-08-22 17:31:02.714302+00 |                 |              | idle                |             |              | COMMIT
 2688374 | synapse_cschwarz_org | 82247 |   945951 | synapse  |                  | 10.123.236.10 |                 |       62285 | 2018-08-22 17:27:44.406715+00 | 2018-08-22 17:31:02.715022+00 | 2018-08-22 17:31:02.715087+00 | 2018-08-22 17:31:02.715089+00 | LWLockNamed     | WALWriteLock | active              |   324665761 |    324665381 | INSERT INTO presence_stream (currently_active, last_active_ts, last_federation_update_ts, last_user_sync_ts, state, status_msg, stream_id, user_id) VALUES(false, 1529430112714, 1534959062714, 0, 'offline', NULL, 2450447, '@bigcheese:adena.otakukingdom.com')
 2688374 | synapse_cschwarz_org | 54040 |       10 | postgres | psql             |               |                 |          -1 | 2018-08-22 17:31:06.360557+00 | 2018-08-22 17:31:06.362679+00 | 2018-08-22 17:31:06.362679+00 | 2018-08-22 17:31:06.362688+00 |                 |              | active              |             |    324665381 | select * from pg_stat_activity where datname='synapse_cschwarz_org';
(8 rows)

problame on 22 Aug 2018

👍1

@problame: thanks, that's very helpful. However in your case it looks like the background update has only been running for three minutes (and yes, it will block the user_ips update transaction while it happens; see also #3652). Could you leave it going for an hour or so and see how things look then?

richvdh on 22 Aug 2018

Same thing, no changes. Also, user_ips is really small, so building an index shouldn't take long. (And all postgres processes are idle anyway):

select COUNT(*) from user_ips;
 count 
-------
   443
(1 row)

problame on 22 Aug 2018

@richvdh

Disabled the lock_timeout (so infinite)

After synapse start, this query is taking very long (minutes, hours, days (?)).

SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 275 - 100000 GROUP BY room_id

it blocks multiple background_updates like DROP INDEX / CREATE INDEX CONCURRENTLY.

after killing this SELECT room_id, MAX... query I see this in the logs:

2018-08-23 00:32:46,725 - synapse.storage.background_updates - 187 - INFO - background_updates-0 - Starting update batch on background update 'state_group_state_type_index'
2018-08-23 00:32:46,742 - synapse.storage.background_updates - 227 - INFO - background_updates-0 - Updating u'state_group_state_type_index'. Updated 1 items in 15ms. (total_rate=None/ms, current_rate=None/ms, total_updated=0, batch_size=100)
2018-08-23 00:32:47,746 - synapse.storage.background_updates - 113 - INFO - background_updates-0 - No more background updates to do. Unscheduling background update task.

The instance now runs rather smoothly but I suspect that it might go into a deadlock state soon. I switched back to lock_timeout = 5s. I will try in a few hours again when the background processes will be scheduled again (?).

@timoschwarzer Could you try the same and report what requests are blocking for you?
You can do that by opening a psql shell, best as the postgres user:

su postgres
psql
>  SELECT query FROM pg_stat_activity;

They should also show up in ps -ef. Try to kill the one that is blocking the other.
I also found this mega statement here to figure out the dependency:

SELECT blocked_locks.pid     AS blocked_pid,
         blocked_activity.usename  AS blocked_user,
         blocking_locks.pid     AS blocking_pid,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query    AS blocked_statement,
         blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks         blocked_locks
    JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks         blocking_locks
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid
    JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.GRANTED;

mguentner on 23 Aug 2018

(Did not reload the thread before posting)
I also see that @problame has the same long running SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 275 - 100000 GROUP BY room_id

mguentner on 23 Aug 2018

@mguentner

I recreated the entire database, so no tables were present before starting Synapse. Then I started it and these two queries are now running forever:

 SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 1 - 100000 GROUP BY room_id
 CREATE  INDEX CONCURRENTLY access_tokens_device_id ON access_tokens (user_id, device_id)

The result of the big query you found is:
Blocked PID: 8470
Blocked user: docker_matrix
Blocking PID: 8463
Blocking user: docker_matrix
Blocked statement:

CREATE  INDEX CONCURRENTLY access_tokens_device_id ON access_tokens (user_id, device_id)

Current statement in blocking process:

SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 1 - 100000 GROUP BY room_id

After killing %8463, Synapse works as expected.

timoschwarzer on 23 Aug 2018

@timoschwarzer can confirm that killing the CREATE INDEX process solved the issue, also after restarts of synapse.

problame on 23 Aug 2018

@problame I've actually killed the SELECT room_id ... query. Appearantly, they are blocking each other so that you can just kill one of them to bring Synapse back to a working state.

timoschwarzer on 23 Aug 2018

@timoschwarzer And you are right, the SELECT room_id shows up again after restarts.

select pid,query from pg_stat_activity where datname='synapse_cschwarz_org';
  pid  |                                                    query                                                    
-------+-------------------------------------------------------------------------------------------------------------
 41600 | COMMIT
 26593 | select pid,query from pg_stat_activity where datname='synapse_cschwarz_org';
 36813 | SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 35694153 - 100000 GROUP BY room_id
 40088 | COMMIT
 41022 | COMMIT
 42121 | COMMIT
 44058 | COMMIT
(7 rows)

What about those other COMMIT queries? shouldn't they go away? they are all in idle state (column not shown here).

problame on 23 Aug 2018

Interestingly, when I execute the SELECT room_id ... query on the database (even while all the other queries are running), I get an instant response (8 rows).

problame on 23 Aug 2018

@problame Can confirm. Saw the same behavior. Manually running the queries works.

mguentner on 23 Aug 2018

What about those other COMMIT queries? shouldn't they go away? they are all in idle state (column not shown here).

No, those are normal idle connections. Synapse uses a connection pool to reduce the overhead of database queries.

richvdh on 24 Aug 2018

Right, I think I know what's causing this. I'll sort out a fix.

richvdh on 24 Aug 2018

❤1 🎉1 😄1 👍1

Folks can you confirm if this is still a problem with synapse 0.33.3? If my theory is correct it was introduced in https://github.com/matrix-org/synapse/pull/3630 (which was in 0.33.2) but will then have been fixed by https://github.com/matrix-org/synapse/pull/3633 (which was in 0.33.3)

richvdh on 28 Aug 2018

Yes, it's gone with 0.33.3 (verified with several synapse restarts). Thanks!

problame on 28 Aug 2018

👍1

Was this page helpful?

0 / 5 - 0 ratings