I installed synapse from source on Debian using virtualenv. I am using postgres 9.6 as a backend.
The client is behind a Let's Encrypted nginx reverse proxy.
The federation port is in the open.
The installation is pretty fresh, only one user (me) is on the system, however I already joined some big channels (#matrix:matrix.org).
When the instance boots, some queries blocks requests from my client (Riot), to unblock it, I need to kill the this query:
DROP INDEX IF EXISTS access_tokens_device_id to probably unblock this:
SELECT users.name, users.is_guest, access_tokens.id as token_id, access_tokens.device_id FROM users INNER JOIN access_tokens on users.name = access_tokens.user_id WHERE token = 'FOOBAR'
Afterwards I have new queries blocking:
LOCK TABLE user_ips in EXCLUSIVE MODE and
CREATE INDEX CONCURRENTLY user_ips_device_id ON user_ips (user_id, device_id, last_seen)
This query also sometimes hangs around:
SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 6131 - 100000 GROUP BY room_id
this is my DB setup:
database:
name: "psycopg2"
args:
user: "user"
password: "xxx"
database: "synapse"
host: "127.0.0.1"
cp_min: 5
cp_max: 10
So, I changed the value of deadlock_timeout in /etc/postgresql/9.6/main/postgresql.conf
to 1s like this:
deadlock_timeout = 1s
Documentation: https://www.postgresql.org/docs/9.6/static/runtime-config-locks.html
I don't know whether the behavior I described is a bug of synapse or not. But the fact that these queries block themselves seems strange to me.
Apparently deadlock_timeout has little effect.
I now have lock_timeout = 2s in /etc/postgresql/9.6/main/postgresql.conf and see this in the log file:
2018-08-17 22:30:01,071 - synapse.storage.background_updates - 187 - INFO - background_updates-0 - Starting update batch on background update 'user_ips_device_index'
2018-08-17 22:30:01,082 - synapse.storage.background_updates - 361 - INFO - background_updates-0 - Adding index user_ips_device_id to user_ips
2018-08-17 22:30:02,096 - synapse.storage.background_updates - 109 - ERROR - background_updates-0 - Error doing update
Traceback (most recent call last):
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 106, in _run_background_updates
self.BACKGROUND_UPDATE_DURATION_MS
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 181, in do_next_background_update
res = yield self._do_background_update(update_name, desired_duration_ms)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 215, in _do_background_update
items_updated = yield update_handler(progress, batch_size)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 362, in updater
yield self.runWithConnection(runner)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/_base.py", line 373, in runWithConnection
inner_func, *args, **kwargs
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
result = inContext.theWork()
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/context.py", line 122, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/python/context.py", line 85, in callWithContext
return func(*args,**kw)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
compat.reraise(excValue, excTraceback)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
result = func(conn, *args, **kw)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/_base.py", line 369, in inner_func
return func(conn, *args, **kwargs)
File "/home/trinity/.synapse/local/lib/python2.7/site-packages/synapse/storage/background_updates.py", line 322, in create_index_psql
c.execute(sql)
OperationalError: canceling statement due to lock timeout
I have the same problem.
database:
# The database engine name
name: 'psycopg2'
# Arguments to pass to the engine
args:
# Path to the database
user: 'messenger'
dbname: 'messenger'
cp_min: 5
cp_max: 10
Synapse sucessfully connects to PostgreSQL via the UNIX socket. I, too, changed the lock_timeout to five seconds which caused the Synapse background update to fail:
2018-08-20 07:37:38,658 - synapse.storage._base - 253 - WARNING - GET-1 - [TXN OPERROR] {get_user_by_access_token-6} terminating connection due to administraβ
LINE 1: ...id, access_tokens.device_id FROM users INNER JOIN access_tok... β
^ β
server closed the connection unexpectedly β
This probably means the server terminated abnormally β
before or while processing the request. β
0/5 β
2018-08-20 07:37:38,659 - synapse.storage._base - 262 - WARNING - GET-1 - [TXN EROLL] {get_user_by_access_token-6} connection already closed β
2018-08-20 07:37:38,660 - twisted - 131 - CRITICAL - - Rollback failed β
Traceback (most recent call last): β
File "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 250, in inContext β
result = inContext.theWork() β
File "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 266, in <lambda> β
inContext.theWork = lambda: context.call(ctx, func, *args, **kw) β
File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 122, in callWithContext β
return self.currentContext().callWithContext(ctx, func, *args, **kw) β
File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 85, in callWithContext β
return func(*args,**kw) β
--- <exception caught here> --- β
File "/usr/local/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection β
conn.rollback() β
File "/usr/local/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 52, in rollback β
self._connection.rollback() β
psycopg2.InterfaceError: connection already closed
@timoschwarzer Good to hear, thanks for confirming. What OS and PostgreSQL version are you running?
Debian Stretch and PostgreSQL 9.6
Things logged against "background_updates" are one-time update jobs which happen when you upgrade synapse. They shouldn't take very long. I suggest you just let them run.
They never go away, even after hours. I let them run for ~ 24h. Almost no load while they ran.
ok, can you leave them running for a while, and then grep your logs for background_updates and share what you see there, along with the postgres queries you see in flight.
I have the same issue:
$ grep background_updates path/to/homeserver.log
2018-08-22 17:27:44,370 - synapse.storage.background_updates - 187 - INFO - background_updates-0- Starting update batch on background update 'user_ips_last_seen_index'
2018-08-22 17:27:44,371 - synapse.storage.background_updates - 361 - INFO - background_updates-0- Adding index user_ips_last_seen to user_ips
$ sudo -u postgres psql synapse_cschwarz_org -c "select * from pg_stat_activity where datname='synapse_cschwarz_org';"
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query
---------+----------------------+-------+----------+----------+------------------+---------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+--------------+---------------------+-------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2688374 | synapse_cschwarz_org | 26593 | 10 | postgres | psql | | | -1 | 2018-08-22 17:17:15.796457+00 | | 2018-08-22 17:28:48.523978+00 | 2018-08-22 17:28:48.525045+00 | | | idle | | | select * from pg_stat_activity where datname='synapse_cschwarz_org';
2688374 | synapse_cschwarz_org | 78904 | 945951 | synapse | | 10.123.236.10 | | 20698 | 2018-08-22 17:27:44.372545+00 | 2018-08-22 17:27:48.411264+00 | 2018-08-22 17:27:48.411264+00 | 2018-08-22 17:27:48.411266+00 | Lock | virtualxid | active | | | CREATE INDEX CONCURRENTLY user_ips_last_seen ON user_ips (user_id, last_seen)
2688374 | synapse_cschwarz_org | 70433 | 945951 | synapse | | 10.123.236.10 | | 62787 | 2018-08-22 17:27:43.289467+00 | 2018-08-22 17:27:43.291627+00 | 2018-08-22 17:27:43.319919+00 | 2018-08-22 17:27:43.320248+00 | | | idle in transaction | | 324665381 | SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 35694143 - 100000 GROUP BY room_id
2688374 | synapse_cschwarz_org | 72459 | 945951 | synapse | | 10.123.236.10 | | 37550 | 2018-08-22 17:27:43.367816+00 | 2018-08-22 17:27:53.33506+00 | 2018-08-22 17:27:53.335204+00 | 2018-08-22 17:27:53.335204+00 | Lock | relation | active | | 324665381 | LOCK TABLE user_ips in EXCLUSIVE MODE
2688374 | synapse_cschwarz_org | 75793 | 945951 | synapse | | 10.123.236.10 | | 32877 | 2018-08-22 17:27:43.369028+00 | | 2018-08-22 17:31:02.712776+00 | 2018-08-22 17:31:02.712797+00 | | | idle | | | COMMIT
2688374 | synapse_cschwarz_org | 78873 | 945951 | synapse | | 10.123.236.10 | | 40119 | 2018-08-22 17:27:43.393159+00 | | 2018-08-22 17:31:02.714279+00 | 2018-08-22 17:31:02.714302+00 | | | idle | | | COMMIT
2688374 | synapse_cschwarz_org | 82247 | 945951 | synapse | | 10.123.236.10 | | 62285 | 2018-08-22 17:27:44.406715+00 | 2018-08-22 17:31:02.715022+00 | 2018-08-22 17:31:02.715087+00 | 2018-08-22 17:31:02.715089+00 | LWLockNamed | WALWriteLock | active | 324665761 | 324665381 | INSERT INTO presence_stream (currently_active, last_active_ts, last_federation_update_ts, last_user_sync_ts, state, status_msg, stream_id, user_id) VALUES(false, 1529430112714, 1534959062714, 0, 'offline', NULL, 2450447, '@bigcheese:adena.otakukingdom.com')
2688374 | synapse_cschwarz_org | 54040 | 10 | postgres | psql | | | -1 | 2018-08-22 17:31:06.360557+00 | 2018-08-22 17:31:06.362679+00 | 2018-08-22 17:31:06.362679+00 | 2018-08-22 17:31:06.362688+00 | | | active | | 324665381 | select * from pg_stat_activity where datname='synapse_cschwarz_org';
(8 rows)
@problame: thanks, that's very helpful. However in your case it looks like the background update has only been running for three minutes (and yes, it will block the user_ips update transaction while it happens; see also #3652). Could you leave it going for an hour or so and see how things look then?
Same thing, no changes. Also, user_ips is really small, so building an index shouldn't take long. (And all postgres processes are idle anyway):
select COUNT(*) from user_ips;
count
-------
443
(1 row)
@richvdh
Disabled the lock_timeout (so infinite)
After synapse start, this query is taking very long (minutes, hours, days (?)).
SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 275 - 100000 GROUP BY room_id
it blocks multiple background_updates like DROP INDEX / CREATE INDEX CONCURRENTLY.
after killing this SELECT room_id, MAX... query I see this in the logs:
2018-08-23 00:32:46,725 - synapse.storage.background_updates - 187 - INFO - background_updates-0 - Starting update batch on background update 'state_group_state_type_index'
2018-08-23 00:32:46,742 - synapse.storage.background_updates - 227 - INFO - background_updates-0 - Updating u'state_group_state_type_index'. Updated 1 items in 15ms. (total_rate=None/ms, current_rate=None/ms, total_updated=0, batch_size=100)
2018-08-23 00:32:47,746 - synapse.storage.background_updates - 113 - INFO - background_updates-0 - No more background updates to do. Unscheduling background update task.
The instance now runs rather smoothly but I suspect that it might go into a deadlock state soon. I switched back to lock_timeout = 5s. I will try in a few hours again when the background processes will be scheduled again (?).
@timoschwarzer Could you try the same and report what requests are blocking for you?
You can do that by opening a psql shell, best as the postgres user:
su postgres
psql
> SELECT query FROM pg_stat_activity;
They should also show up in ps -ef. Try to kill the one that is blocking the other.
I also found this mega statement here to figure out the dependency:
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;
(Did not reload the thread before posting)
I also see that @problame has the same long running SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 275 - 100000 GROUP BY room_id
@mguentner
I recreated the entire database, so no tables were present before starting Synapse. Then I started it and these two queries are now running forever:
SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 1 - 100000 GROUP BY room_id
CREATE INDEX CONCURRENTLY access_tokens_device_id ON access_tokens (user_id, device_id)
The result of the big query you found is:
Blocked PID: 8470
Blocked user: docker_matrix
Blocking PID: 8463
Blocking user: docker_matrix
Blocked statement:
CREATE INDEX CONCURRENTLY access_tokens_device_id ON access_tokens (user_id, device_id)
Current statement in blocking process:
SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 1 - 100000 GROUP BY room_id
After killing %8463, Synapse works as expected.
@timoschwarzer can confirm that killing the CREATE INDEX process solved the issue, also after restarts of synapse.
@problame I've actually killed the SELECT room_id ... query. Appearantly, they are blocking each other so that you can just kill one of them to bring Synapse back to a working state.
@timoschwarzer And you are right, the SELECT room_id shows up again after restarts.
select pid,query from pg_stat_activity where datname='synapse_cschwarz_org';
pid | query
-------+-------------------------------------------------------------------------------------------------------------
41600 | COMMIT
26593 | select pid,query from pg_stat_activity where datname='synapse_cschwarz_org';
36813 | SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 35694153 - 100000 GROUP BY room_id
40088 | COMMIT
41022 | COMMIT
42121 | COMMIT
44058 | COMMIT
(7 rows)
What about those other COMMIT queries? shouldn't they go away? they are all in idle state (column not shown here).
Interestingly, when I execute the SELECT room_id ... query on the database (even while all the other queries are running), I get an instant response (8 rows).
@problame Can confirm. Saw the same behavior. Manually running the queries works.
What about those other COMMIT queries? shouldn't they go away? they are all in idle state (column not shown here).
No, those are normal idle connections. Synapse uses a connection pool to reduce the overhead of database queries.
Right, I think I know what's causing this. I'll sort out a fix.
Folks can you confirm if this is still a problem with synapse 0.33.3? If my theory is correct it was introduced in https://github.com/matrix-org/synapse/pull/3630 (which was in 0.33.2) but will then have been fixed by https://github.com/matrix-org/synapse/pull/3633 (which was in 0.33.3)
Yes, it's gone with 0.33.3 (verified with several synapse restarts). Thanks!
Most helpful comment
Right, I think I know what's causing this. I'll sort out a fix.