Long-running Jepsen append tests (~1200 seconds) of YugaByte DB 1.3.1.0 have seen repeated tserver crashes, due to both file limits and memory exhaustion. There are no forced crashes, partitions, or other nemesis activities in this test: it's just a healthy happy cluster trying to serve transactions. I've attached memory, rpc, and thread stats from the admin interface for the first node in one such test, while it was allocating ~5MB/sec of memory, but before it'd crashed.
Full Jepsen & node logs (16GB uncompressed)
stats.tar.gz
The workload in question is a variant of the append test, where our reads perform a select on an un-indexed secondary key field (k2), rather than selecting by primary key. Updates still use primary keys, in this particular test. You can reproduce this issue using Jepsen 0f46d2ea72b437a7c851bdce71757feaf2a2092f, and running (e.g. on EC2):
lein run test-all --username admin --nodes-file ~/nodes --os debian -w ysql/append --concurrency 3n --time-limit 1200 --test-count 10
A few stats, taken 60s apart, from node n1:
stats.tar.gz
Ah, I think I might have found a possible culprit for memory use: my last test left 1593 postgres processes running:
$ ps aux | grep postgres
root 32700 0.0 0.1 1106284 11996 ? Ssl Aug15 0:47 postgres: postgres postgres 172.31.24.221(59934) idle in transaction (aborted)
root 32707 0.0 0.1 1106524 12424 ? Ssl Aug15 0:52 postgres: postgres postgres 172.31.24.221(38396) idle in transaction (aborted)
root 32752 0.0 0.1 1106396 12092 ? Ssl Aug15 0:53 postgres: postgres postgres 172.31.24.221(38414) idle in transaction (aborted)
...
cc: @m-iancu , @mbautin , @spolitov ^
Doing a little more experimenting: it looks like the number of postgres processes gradually rises through the course of a test, at the rate of roughly 1.5 process/second, across the whole cluster. Killing the client doesn't cause those processes to exit, and they also survive the death of yb-tserver--possibly because they've daemonized? Their parent in pstree is systemd.
Collectively, ~120 postgres processes on one node used about 500MB of memory (as measured by free before and after killing them), and yb-tserver+yb-master were using another ~500MB together. I think this might be where the memory's going.
admin@ip-172-31-24-221:~$ parallel-ssh -i -H n1 -H n2 -H n3 -H n4 -H n5 ps aux | grep postgres | wc -l
594
admin@ip-172-31-24-221:~$ parallel-ssh -i -H n1 -H n2 -H n3 -H n4 -H n5 ps aux | grep postgres | wc -l
598
admin@ip-172-31-24-221:~$ parallel-ssh -i -H n1 -H n2 -H n3 -H n4 -H n5 ps aux | grep postgres | wc -l
600
admin@ip-172-31-24-221:~$ parallel-ssh -i -H n1 -H n2 -H n3 -H n4 -H n5 ps aux | grep postgres | wc -l
601
admin@ip-172-31-5-223:~$ free -m
total used free shared buff/cache available
Mem: 7986 588 6910 29 487 7142
Swap: 0 0 0
admin@ip-172-31-5-223:~$ sudo killall -9 postgres
admin@ip-172-31-5-223:~$ free -m
total used free shared buff/cache available
Mem: 7986 70 7439 19 476 7671
Swap: 0 0 0
Initial finding:
yb-tserver starts a postgres process (aka postmaster). When a client connects to postgres/ysql, the postmaster starts a separate postgres server for each connection. Based on the logs and the above snippet:
$ ps aux | grep postgres
root 32700 0.0 0.1 1106284 11996 ? Ssl Aug15 0:47 postgres: postgres postgres 172.31.24.221(59934) idle in transaction (aborted)
root 32707 0.0 0.1 1106524 12424 ? Ssl Aug15 0:52 postgres: postgres postgres 172.31.24.221(38396) idle in transaction (aborted)
root 32752 0.0 0.1 1106396 12092 ? Ssl Aug15 0:53 postgres: postgres postgres 172.31.24.221(38414) idle in transaction (aborted)
...
it looks like when yb-tserver terminates, it terminates postmaster process but the per-connection postgres processes are not being terminated.
I've been thinking about this a bit, and--even if yb-tserver did kill every postgres process when it exited, that doesn't solve the underlying problem, does it? Since this problem persists even after killing the Jepsen process (which is where all the clients are), it feels... there should be some code that detects the connection is closed and terminates the postgres worker process, yeah? Is it possible that the postgres processes are locked up in some way that's preventing them from exiting correctly, perhaps because of their 'idle in transaction' state?
Per postgres docs (https://www.postgresql.org/docs/11/server-shutdown.html):
It is best not to use SIGKILL to shut down the server. Doing so will prevent the server from releasing shared memory and semaphores, which might then have to be done manually before a new server can be started. Furthermore, SIGKILL kills the postgres process without letting it relay the signal to its subprocesses, so it will be necessary to kill the individual subprocesses by hand as well.
We use sigkill in two places currently:
Subprocess which is used to start the postgres server uses SIGKILL when Subprocess object is destructed.We should fix both. In this case, however, it does not look like (2) is the cause, because we log that we are sending sigkill to the process and I don't see that log line in the uploaded logs.
On Linux, we tell tserver to send SIGINT to its child process when it terminates: pg_proc_->SetParentDeathSignal(SIGINT); in pg_wrapper.cc. It is possible that "idle in transaction" somehow prevents normal termination of backend processes. It is also possible that we have some refcount leaks for C++-side objects that are preventing the backend processes from stopping correctly (we've fixed some of those bugs in the past).
Some PostgreSQL links on "idle in transaction (aborted)":
From the second link:
idle in transaction – This is where your app has run a BEGIN but it’s now waiting somewhere in a transaction and not doing work.
Presumably, all client processes have terminated by this point, so all the connections should be closed. We need to double-check we are detecting this and stopping the backend properly.
This issue is indeed the problem with v1.3.1.0, but no longer reproducible with --version 2.0.0.0.
Closing as resolved.
@frozenspider -- let's try to figure out which commit fixed the issue.
Fixed in 73e966e5735efc21bf2ad43f9d961a488afbe050 (#1857)
Thank you very much @frozenspider for confirming the commit!