This is not strictly reproducible as we tried to reproduce it in non productive environment (same OS and ProxySQL version) but we didn't manage to, I hope it is enough anyway.
# proxysql -V
ProxySQL version 2.0.15-20-g32bb92c, codename Truls
On
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic
Basically we were in a state were proxysql was complaining a lot about
[ERROR] Connect timeout on api-master.mysql:3306 : exceeded by 727us
[ERROR] Detected an offline server during query: api-master.mysql, 3306
while the backend server (AWS MySQL RDS instance, main write instance) was perfectly up and working. But the actual behaviour for which I'm opening this issues is that we got a lot of:
2021-01-22 18:28:37 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:28:46 main.cpp:1865:main(): [ERROR] Watchdog: 3 threads missed a heartbeat
2021-01-22 18:28:52 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:28:58 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:29:07 main.cpp:1865:main(): [ERROR] Watchdog: 1 threads missed a heartbeat
2021-01-22 18:29:40 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:29:46 main.cpp:1865:main(): [ERROR] Watchdog: 8 threads missed a heartbeat
2021-01-22 18:30:04 main.cpp:1865:main(): [ERROR] Watchdog: 9 threads missed a heartbeat
2021-01-22 18:30:16 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:30:31 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:30:40 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:31:13 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:31:28 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:31:37 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:31:46 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:31:52 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:32:04 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:32:16 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:32:31 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:33:04 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:33:10 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:33:46 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:33:55 main.cpp:1865:main(): [ERROR] Watchdog: 8 threads missed a heartbeat
2021-01-22 18:34:07 main.cpp:1865:main(): [ERROR] Watchdog: 4 threads missed a heartbeat
2021-01-22 18:34:31 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:34:52 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:35:10 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:35:25 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:35:37 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:35:55 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:36:10 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:36:40 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:38:10 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:38:55 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:40:04 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:40:16 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:40:31 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:40:43 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:41:01 main.cpp:1865:main(): [ERROR] Watchdog: 9 threads missed a heartbeat
2021-01-22 18:41:10 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:41:43 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:43:10 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:43:40 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:43:52 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:44:13 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:44:28 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:44:46 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:45:19 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:45:55 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:47:01 main.cpp:1865:main(): [ERROR] Watchdog: 9 threads missed a heartbeat
2021-01-22 18:47:34 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:47:43 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:47:55 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:48:04 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:48:13 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:49:01 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:50:01 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:51:53 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:52:02 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:52:29 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:52:44 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:52:50 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:53:05 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:53:11 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:53:26 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:53:32 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:53:38 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:53:53 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:54:14 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:54:20 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:54:29 main.cpp:1865:main(): [ERROR] Watchdog: 1 threads missed a heartbeat
2021-01-22 18:54:35 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:54:41 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:54:56 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:55:02 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:55:17 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:55:23 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:55:38 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:55:44 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:56:02 main.cpp:1865:main(): [ERROR] Watchdog: 4 threads missed a heartbeat
2021-01-22 18:56:08 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:56:14 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:56:23 main.cpp:1865:main(): [ERROR] Watchdog: 2 threads missed a heartbeat
2021-01-22 18:56:29 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:56:35 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:56:44 main.cpp:1865:main(): [ERROR] Watchdog: 4 threads missed a heartbeat
2021-01-22 18:56:50 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:57:05 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:57:11 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:57:26 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:57:32 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:57:47 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:57:53 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:58:08 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:58:14 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:58:23 main.cpp:1865:main(): [ERROR] Watchdog: 1 threads missed a heartbeat
2021-01-22 18:58:29 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:58:35 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:58:50 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:58:56 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:59:11 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:59:17 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:59:32 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:59:38 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:59:53 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:59:59 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:00:14 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:00:20 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:00:35 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 19:00:41 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:00:56 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:01:02 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:01:17 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 19:01:23 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 19:01:38 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:01:44 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:01:59 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:02:05 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 19:02:20 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:02:26 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 19:02:41 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:02:47 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 19:03:02 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:03:08 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:03:23 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:03:29 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
and by a lot I mean
# grep -c heartbeat proxysql.log
126
and we did not set any value for restart_on_missing_heartbeats in which case, according to the documentation, it should be 10. So we would have expected for proxysql to crash at the 11th error and exit the strange situation in which it was. But it did not restart at all automatically. In fact, after manually restarting it (and restarting all the Rails clients we have connected to it), everything started working again as usual. Restarting the clients before restarting proxysql did not fix anything.
Hi.
Is there any reason why timestamps are missing?
Is there any reason why error log is not attached to this issue?
ProxySQL automatically restarts if heartbeats are missed for restart_on_missing_heartbeats in a row, but because you even cut the timestamp it is really impossible for us to make any conclusion.
Thanks
sorry, editing with the timestamps, sure. Anyway they were not "in a row". the whole error log is not attached due to size constraints but I'll bzip it and upload it somewhere
Edit: at 18:55:48 proxysql was restarted, at ~19:04 all the clients were restarted, and the errors stopped.
Before 18:55 we had some clients restarted/joined, they couldn't contact with RDS
Edit2: I have to redact the log file since there is some customers personal data leaked in it (phone numbers are part of some keys and there are some duplicated entry errors related to those keys)
Thank you, now with the timestamp things are more clear.
ProxySQL won't restart on those conditions. When the missed heartbeat are logged, it seems that most of the time the next missed heartbeat is several seconds apart, that means that threads started sending heartbeat for some time, then stopped again.
Also, the fact that the number of threads missing the heartbeats at times decreases, it suggests that proxysql is not really blocked.
It is, probably, extremely slow.
We have seen this type of behavior when proxysql is swapping a lot, thus becoming very slow.
This hypothesis can explain also why restarting it fixed the issue.
Do you have memory graphs?

(1) issue began
(2) proxysql restarted
(3) rails clients restarted
As you can see, plenty of available RAM. CPU and network traffic went down to very low values instead
CPU graph

There was a small spike in memory usage when the problem started, but it doesn't seem too relevant.
But in this case I would exclude the possibility of swapping.
Next time I would recommend to execute gcore on proxysql pid, so to get a core dump and understand what the threads were blocked on
Were there any unusual messages in error log?
If you could attach the error log from around 18:00 to 20:00 , that would be great
Here you go, anonymized logs. Thanks @renecannao for your help 馃檶馃徑
proxysql-reduced.zip
$ cat /tmp/proxysql-reduced.log | egrep -v '(Deadlock found when trying to get lock|Duplicate entry|Closing unhealthy client connection)' | cut -b 21-89 | sort | uniq -c | sort -n -r | head -n 20
251441 mysql_connection.cpp:969:handler(): [ERROR] Failed to mysql_real_conn
77908 mysql_connection.cpp:1006:handler(): [ERROR] Connect timeout on api-m
7645 MySQL_Session.cpp:3854:handler(): [ERROR] Detected an offline server
5803 MySQL_Session.cpp:3863:handler(): [WARNING] Retrying query.
1246 MySQL_Session.cpp:2278:handler_again___status_SETTING_GENERIC_VARIABL
895 MySQL_HostGroups_Manager.cpp:2446:get_random_MySrvC(): [ERROR] Hostgr
755 MySQL_HostGroups_Manager.cpp:906:connect_error(): [ERROR] Shunning se
487 MySQL_Session.cpp:1471:handler_again___status_PINGING_SERVER(): [ERRO
83 MySQL_Session.cpp:135:kill_query_thread(): [WARNING] KILL CONNECTION
32 MySQL_Session.cpp:3455:handler(): [ERROR] Unexpected COM_QUIT from cl
25 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
24 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
22 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
19 MySQL_Thread.cpp:4190:process_data_on_data_stream(): [WARNING] Detect
18 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
17 mysql_connection.cpp:1006:handler(): [ERROR] Connect timeout on api-s
11 MySQL_Session.cpp:3902:handler(): [WARNING] Retrying query.
11 MySQL_Session.cpp:3894:handler(): [ERROR] Detected a broken connectio
11 MySQL_Session.cpp:3851:handler(): [ERROR] Detected a lagging server d
10 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
There was a clear problem related to mysql_real_connect(). Looking a bit more in the log:
$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | awk -F ' , ' '{print $4}' | sort | uniq -c
251441 1129: Host '10.31.30.3' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.
To be more specific, api-master was locking proxysql:
$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | cut -b 99- | awk '{print $1}' | sort | uniq -c
251441 api-master.mysql:3306
The problem lasted from 18:29:03 until 18:52:23 . You should have unblocked it on the backend (this information was omitted so far):
$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | cut -b 1-20 | head -n 5
2021-01-22 18:29:03
2021-01-22 18:29:03
2021-01-22 18:29:03
2021-01-22 18:29:03
2021-01-22 18:29:03
$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | cut -b 1-20 | tail -n 5
2021-01-22 18:52:23
2021-01-22 18:52:23
2021-01-22 18:52:23
2021-01-22 18:52:23
2021-01-22 18:52:23
Before the backend blocked proxysql, proxysql was timing out while trying to connect to the backend:
$ grep "Connect timeout on" /tmp/proxysql-reduced.log | cut -b 1-16 | sort | uniq -c
4419 2021-01-22 18:28
1331 2021-01-22 18:29
40 2021-01-22 18:30
39 2021-01-22 18:31
28 2021-01-22 18:32
27 2021-01-22 18:33
21 2021-01-22 18:34
19 2021-01-22 18:35
7 2021-01-22 18:36
15 2021-01-22 18:38
6 2021-01-22 18:39
15 2021-01-22 18:40
44 2021-01-22 18:41
1 2021-01-22 18:42
21 2021-01-22 18:43
12 2021-01-22 18:44
8 2021-01-22 18:45
7 2021-01-22 18:46
14 2021-01-22 18:47
23 2021-01-22 18:48
7 2021-01-22 18:49
18 2021-01-22 18:50
8 2021-01-22 18:51
3420 2021-01-22 18:52
4856 2021-01-22 18:53
4960 2021-01-22 18:54
3399 2021-01-22 18:55
12427 2021-01-22 18:56
7567 2021-01-22 18:57
6784 2021-01-22 18:58
5422 2021-01-22 18:59
7443 2021-01-22 19:00
5323 2021-01-22 19:01
5432 2021-01-22 19:02
4762 2021-01-22 19:03
To be more specific, the backend started timing out connections at 18:28:32:
$ egrep "^2021-01-22 18:2.*Connect timeout on" /tmp/proxysql-reduced.log | cut -b 1-19 | sort | uniq -c
147 2021-01-22 18:28:32
1690 2021-01-22 18:28:39
13 2021-01-22 18:28:44
1282 2021-01-22 18:28:46
1287 2021-01-22 18:28:53
1313 2021-01-22 18:29:00
7 2021-01-22 18:29:07
2 2021-01-22 18:29:28
9 2021-01-22 18:29:48
Things were really bad at 18:28:32:
$ cat /tmp/proxysql-reduced.log | egrep -v '(Deadlock found when trying to get lock|Duplicate entry|Closing unhealthy client connection)' | egrep "^2021-01-22 18:[0-2]" | cut -b 1-19 | sort | uniq -c | head -n 20
1 2021-01-22 18:15:57
1 2021-01-22 18:25:36
1131 2021-01-22 18:28:32
8 2021-01-22 18:28:33
1 2021-01-22 18:28:34
1 2021-01-22 18:28:37
2601 2021-01-22 18:28:39
1 2021-01-22 18:28:41
41 2021-01-22 18:28:44
146 2021-01-22 18:28:45
1463 2021-01-22 18:28:46
1 2021-01-22 18:28:47
1 2021-01-22 18:28:48
1 2021-01-22 18:28:52
1592 2021-01-22 18:28:53
1 2021-01-22 18:28:55
1 2021-01-22 18:28:58
1380 2021-01-22 18:29:00
1 2021-01-22 18:29:02
159 2021-01-22 18:29:03
To be more specific, connections to backend api-master were completely failing:
$ cat /tmp/proxysql-reduced.log | egrep -v '(Deadlock found when trying to get lock|Duplicate entry|Closing unhealthy client connection)' | egrep "^2021-01-22 18:28:32" | cut -b 21- | sed -e 's/exceeded by.*/exceed by/' | sort | uniq -c | sort -n -r
597 MySQL_Session.cpp:3854:handler(): [ERROR] Detected an offline server during query: api-master.mysql, 3306
251 MySQL_Session.cpp:3863:handler(): [WARNING] Retrying query.
147 mysql_connection.cpp:1006:handler(): [ERROR] Connect timeout on api-master.mysql:3306 : exceed by
134 MySQL_Session.cpp:2278:handler_again___status_SETTING_GENERIC_VARIABLE(): [WARNING] Error while setting sql_mode to "CONCAT(CONCAT(@@sql_mode, ',STRICT_ALL_TABLES'), ',NO_AUTO_VALUE_ON_ZERO')" on api-master.mysql:3306 hg 0 : 0,
1 MySQL_HostGroups_Manager.cpp:906:connect_error(): [ERROR] Shunning server api-master.mysql:3306 with 5 errors/sec. Shunning for 10 seconds
1 MySQL_HostGroups_Manager.cpp:2446:get_random_MySrvC(): [ERROR] Hostgroup 0 has no servers available! Checking servers shunned for more than 1 second
One could make the hypothesis that the backend was unhealthy, but:
a) an unhealthy backend shouldn't block the client (proxysql) thinking it is misbehaving
b) an unhealthy backend shouldn't cause proxysql threads to stall (and miss heartbeats)
A possible theory at this point is a (temporarily) slow or unresponsive name lookup (DNS server): this is the only part in mysql_real_connect() that is blocking operation, everything else should be non-blocking
Thanks for the thorough review, it matches what we discussed during the outage and the internal post-mortem. Anyway we ran mysqladmin flush-hosts against the RDS backend I would say before 18:52:23 but I don't have any hard-proof about it, but we did indeed see those errors quite early in the outage and actioned upon them.
We will try to dig deeper on the DNS slowness hypothesis. Thanks a lot
@renecannao Ok, looks like that adding delay to the DNS with tc (and this nice script) we managed to reproduce the issue in our beta environment.
A couple of caveats thou, since we don't have the same traffic pressure on that proxysql instance:
1) when removing the IP delay, it starts working again almost immediately
2) we only managed to get the api-master SHUNNED when restarting proxysql while already issuing traffic
3) we need to kill backend connections with tcpkill because otherwise DNS is not refreshed enough quickly with the low volume of requests we are generating
I know that this scenario is a bit too much synthetic but there is clearly something fishy going on when DNS has a hiccup and proxysql is under tons of requests/sec
Sorry for the late catchup, after some deeper investigation with the great @renecannao the core issue here was a misconfiguration on our side, specifically the max_connections number which we initially wrongly set to a really high number (9000) trying to match the max connections number accepted by our backend mysql server. We lovered this down to 200 and since then we didn't have any other incident like this, with more load on the DB.
Most helpful comment
There was a clear problem related to
mysql_real_connect(). Looking a bit more in the log:To be more specific,
api-masterwas locking proxysql:The problem lasted from 18:29:03 until 18:52:23 . You should have unblocked it on the backend (this information was omitted so far):
Before the backend blocked proxysql, proxysql was timing out while trying to connect to the backend:
To be more specific, the backend started timing out connections at 18:28:32:
Things were really bad at 18:28:32:
To be more specific, connections to backend api-master were completely failing:
One could make the hypothesis that the backend was unhealthy, but:
a) an unhealthy backend shouldn't block the client (proxysql) thinking it is misbehaving
b) an unhealthy backend shouldn't cause proxysql threads to stall (and miss heartbeats)
A possible theory at this point is a (temporarily) slow or unresponsive name lookup (DNS server): this is the only part in
mysql_real_connect()that is blocking operation, everything else should be non-blocking