Proxysql: Watchdog: threads missed a heartbeat didn't crash ProxySQL

Created on 25 Jan 2021  路  11Comments  路  Source: sysown/proxysql

This is not strictly reproducible as we tried to reproduce it in non productive environment (same OS and ProxySQL version) but we didn't manage to, I hope it is enough anyway.

# proxysql -V
ProxySQL version 2.0.15-20-g32bb92c, codename Truls

On

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.3 LTS
Release:        18.04
Codename:       bionic

Basically we were in a state were proxysql was complaining a lot about

[ERROR] Connect timeout on api-master.mysql:3306 : exceeded by 727us
[ERROR] Detected an offline server during query: api-master.mysql, 3306

while the backend server (AWS MySQL RDS instance, main write instance) was perfectly up and working. But the actual behaviour for which I'm opening this issues is that we got a lot of:

2021-01-22 18:28:37 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:28:46 main.cpp:1865:main(): [ERROR] Watchdog: 3 threads missed a heartbeat
2021-01-22 18:28:52 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:28:58 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:29:07 main.cpp:1865:main(): [ERROR] Watchdog: 1 threads missed a heartbeat
2021-01-22 18:29:40 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:29:46 main.cpp:1865:main(): [ERROR] Watchdog: 8 threads missed a heartbeat
2021-01-22 18:30:04 main.cpp:1865:main(): [ERROR] Watchdog: 9 threads missed a heartbeat
2021-01-22 18:30:16 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:30:31 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:30:40 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:31:13 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:31:28 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:31:37 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:31:46 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:31:52 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:32:04 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:32:16 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:32:31 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:33:04 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:33:10 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:33:46 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:33:55 main.cpp:1865:main(): [ERROR] Watchdog: 8 threads missed a heartbeat
2021-01-22 18:34:07 main.cpp:1865:main(): [ERROR] Watchdog: 4 threads missed a heartbeat
2021-01-22 18:34:31 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:34:52 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:35:10 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:35:25 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:35:37 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:35:55 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:36:10 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:36:40 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:38:10 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:38:55 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:40:04 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:40:16 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:40:31 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:40:43 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:41:01 main.cpp:1865:main(): [ERROR] Watchdog: 9 threads missed a heartbeat
2021-01-22 18:41:10 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:41:43 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:43:10 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:43:40 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:43:52 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:44:13 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:44:28 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:44:46 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:45:19 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:45:55 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:47:01 main.cpp:1865:main(): [ERROR] Watchdog: 9 threads missed a heartbeat
2021-01-22 18:47:34 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:47:43 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:47:55 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:48:04 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:48:13 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:49:01 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:50:01 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:51:53 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:52:02 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:52:29 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:52:44 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:52:50 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:53:05 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:53:11 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:53:26 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:53:32 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:53:38 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:53:53 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:54:14 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:54:20 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:54:29 main.cpp:1865:main(): [ERROR] Watchdog: 1 threads missed a heartbeat
2021-01-22 18:54:35 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:54:41 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat
2021-01-22 18:54:56 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:55:02 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:55:17 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:55:23 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:55:38 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:55:44 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:56:02 main.cpp:1865:main(): [ERROR] Watchdog: 4 threads missed a heartbeat
2021-01-22 18:56:08 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 18:56:14 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:56:23 main.cpp:1865:main(): [ERROR] Watchdog: 2 threads missed a heartbeat
2021-01-22 18:56:29 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:56:35 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:56:44 main.cpp:1865:main(): [ERROR] Watchdog: 4 threads missed a heartbeat
2021-01-22 18:56:50 main.cpp:1865:main(): [ERROR] Watchdog: 10 threads missed a heartbeat
2021-01-22 18:57:05 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:57:11 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:57:26 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:57:32 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:57:47 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:57:53 main.cpp:1865:main(): [ERROR] Watchdog: 12 threads missed a heartbeat
2021-01-22 18:58:08 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:58:14 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:58:23 main.cpp:1865:main(): [ERROR] Watchdog: 1 threads missed a heartbeat
2021-01-22 18:58:29 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:58:35 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 18:58:50 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:58:56 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:59:11 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:59:17 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:59:32 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:59:38 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 18:59:53 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 18:59:59 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:00:14 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:00:20 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:00:35 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 19:00:41 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:00:56 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:01:02 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:01:17 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 19:01:23 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 19:01:38 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:01:44 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:01:59 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
2021-01-22 19:02:05 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 19:02:20 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:02:26 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
2021-01-22 19:02:41 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:02:47 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
2021-01-22 19:03:02 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:03:08 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:03:23 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
2021-01-22 19:03:29 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat

and by a lot I mean

# grep -c heartbeat proxysql.log
126

and we did not set any value for restart_on_missing_heartbeats in which case, according to the documentation, it should be 10. So we would have expected for proxysql to crash at the 11th error and exit the strange situation in which it was. But it did not restart at all automatically. In fact, after manually restarting it (and restarting all the Rails clients we have connected to it), everything started working again as usual. Restarting the clients before restarting proxysql did not fix anything.

Most helpful comment

$ cat /tmp/proxysql-reduced.log | egrep -v '(Deadlock found when trying to get lock|Duplicate entry|Closing unhealthy client connection)' | cut -b 21-89 | sort | uniq -c | sort -n -r | head -n 20
 251441 mysql_connection.cpp:969:handler(): [ERROR] Failed to mysql_real_conn
  77908 mysql_connection.cpp:1006:handler(): [ERROR] Connect timeout on api-m
   7645 MySQL_Session.cpp:3854:handler(): [ERROR] Detected an offline server 
   5803 MySQL_Session.cpp:3863:handler(): [WARNING] Retrying query.
   1246 MySQL_Session.cpp:2278:handler_again___status_SETTING_GENERIC_VARIABL
    895 MySQL_HostGroups_Manager.cpp:2446:get_random_MySrvC(): [ERROR] Hostgr
    755 MySQL_HostGroups_Manager.cpp:906:connect_error(): [ERROR] Shunning se
    487 MySQL_Session.cpp:1471:handler_again___status_PINGING_SERVER(): [ERRO
     83 MySQL_Session.cpp:135:kill_query_thread(): [WARNING] KILL CONNECTION 
     32 MySQL_Session.cpp:3455:handler(): [ERROR] Unexpected COM_QUIT from cl
     25 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
     24 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
     22 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
     19 MySQL_Thread.cpp:4190:process_data_on_data_stream(): [WARNING] Detect
     18 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
     17 mysql_connection.cpp:1006:handler(): [ERROR] Connect timeout on api-s
     11 MySQL_Session.cpp:3902:handler(): [WARNING] Retrying query.
     11 MySQL_Session.cpp:3894:handler(): [ERROR] Detected a broken connectio
     11 MySQL_Session.cpp:3851:handler(): [ERROR] Detected a lagging server d
     10 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat

There was a clear problem related to mysql_real_connect(). Looking a bit more in the log:

$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | awk -F ' , ' '{print $4}' | sort | uniq -c
 251441 1129: Host '10.31.30.3' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.

To be more specific, api-master was locking proxysql:

$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | cut -b 99- | awk '{print $1}' | sort | uniq -c
 251441 api-master.mysql:3306

The problem lasted from 18:29:03 until 18:52:23 . You should have unblocked it on the backend (this information was omitted so far):

$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | cut -b 1-20 | head -n 5
2021-01-22 18:29:03 
2021-01-22 18:29:03 
2021-01-22 18:29:03 
2021-01-22 18:29:03 
2021-01-22 18:29:03 

$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | cut -b 1-20 | tail -n 5
2021-01-22 18:52:23 
2021-01-22 18:52:23 
2021-01-22 18:52:23 
2021-01-22 18:52:23 
2021-01-22 18:52:23 

Before the backend blocked proxysql, proxysql was timing out while trying to connect to the backend:

$ grep "Connect timeout on" /tmp/proxysql-reduced.log | cut -b 1-16 | sort | uniq -c
   4419 2021-01-22 18:28
   1331 2021-01-22 18:29
     40 2021-01-22 18:30
     39 2021-01-22 18:31
     28 2021-01-22 18:32
     27 2021-01-22 18:33
     21 2021-01-22 18:34
     19 2021-01-22 18:35
      7 2021-01-22 18:36
     15 2021-01-22 18:38
      6 2021-01-22 18:39
     15 2021-01-22 18:40
     44 2021-01-22 18:41
      1 2021-01-22 18:42
     21 2021-01-22 18:43
     12 2021-01-22 18:44
      8 2021-01-22 18:45
      7 2021-01-22 18:46
     14 2021-01-22 18:47
     23 2021-01-22 18:48
      7 2021-01-22 18:49
     18 2021-01-22 18:50
      8 2021-01-22 18:51
   3420 2021-01-22 18:52
   4856 2021-01-22 18:53
   4960 2021-01-22 18:54
   3399 2021-01-22 18:55
  12427 2021-01-22 18:56
   7567 2021-01-22 18:57
   6784 2021-01-22 18:58
   5422 2021-01-22 18:59
   7443 2021-01-22 19:00
   5323 2021-01-22 19:01
   5432 2021-01-22 19:02
   4762 2021-01-22 19:03

To be more specific, the backend started timing out connections at 18:28:32:

$ egrep "^2021-01-22 18:2.*Connect timeout on" /tmp/proxysql-reduced.log | cut -b 1-19 | sort | uniq -c
    147 2021-01-22 18:28:32
   1690 2021-01-22 18:28:39
     13 2021-01-22 18:28:44
   1282 2021-01-22 18:28:46
   1287 2021-01-22 18:28:53
   1313 2021-01-22 18:29:00
      7 2021-01-22 18:29:07
      2 2021-01-22 18:29:28
      9 2021-01-22 18:29:48

Things were really bad at 18:28:32:

$ cat /tmp/proxysql-reduced.log | egrep -v '(Deadlock found when trying to get lock|Duplicate entry|Closing unhealthy client connection)' | egrep "^2021-01-22 18:[0-2]" | cut -b 1-19 | sort | uniq -c | head -n 20
      1 2021-01-22 18:15:57
      1 2021-01-22 18:25:36
   1131 2021-01-22 18:28:32
      8 2021-01-22 18:28:33
      1 2021-01-22 18:28:34
      1 2021-01-22 18:28:37
   2601 2021-01-22 18:28:39
      1 2021-01-22 18:28:41
     41 2021-01-22 18:28:44
    146 2021-01-22 18:28:45
   1463 2021-01-22 18:28:46
      1 2021-01-22 18:28:47
      1 2021-01-22 18:28:48
      1 2021-01-22 18:28:52
   1592 2021-01-22 18:28:53
      1 2021-01-22 18:28:55
      1 2021-01-22 18:28:58
   1380 2021-01-22 18:29:00
      1 2021-01-22 18:29:02
    159 2021-01-22 18:29:03

To be more specific, connections to backend api-master were completely failing:

$ cat /tmp/proxysql-reduced.log | egrep -v '(Deadlock found when trying to get lock|Duplicate entry|Closing unhealthy client connection)' | egrep "^2021-01-22 18:28:32" | cut -b 21- | sed -e 's/exceeded by.*/exceed by/' | sort | uniq -c | sort -n -r
    597 MySQL_Session.cpp:3854:handler(): [ERROR] Detected an offline server during query: api-master.mysql, 3306
    251 MySQL_Session.cpp:3863:handler(): [WARNING] Retrying query.
    147 mysql_connection.cpp:1006:handler(): [ERROR] Connect timeout on api-master.mysql:3306 : exceed by
    134 MySQL_Session.cpp:2278:handler_again___status_SETTING_GENERIC_VARIABLE(): [WARNING] Error while setting sql_mode to "CONCAT(CONCAT(@@sql_mode, ',STRICT_ALL_TABLES'), ',NO_AUTO_VALUE_ON_ZERO')" on api-master.mysql:3306 hg 0 :  0,
      1 MySQL_HostGroups_Manager.cpp:906:connect_error(): [ERROR] Shunning server api-master.mysql:3306 with 5 errors/sec. Shunning for 10 seconds
      1 MySQL_HostGroups_Manager.cpp:2446:get_random_MySrvC(): [ERROR] Hostgroup 0 has no servers available! Checking servers shunned for more than 1 second

One could make the hypothesis that the backend was unhealthy, but:
a) an unhealthy backend shouldn't block the client (proxysql) thinking it is misbehaving
b) an unhealthy backend shouldn't cause proxysql threads to stall (and miss heartbeats)

A possible theory at this point is a (temporarily) slow or unresponsive name lookup (DNS server): this is the only part in mysql_real_connect() that is blocking operation, everything else should be non-blocking

All 11 comments

Hi.

Is there any reason why timestamps are missing?
Is there any reason why error log is not attached to this issue?

ProxySQL automatically restarts if heartbeats are missed for restart_on_missing_heartbeats in a row, but because you even cut the timestamp it is really impossible for us to make any conclusion.

Thanks

sorry, editing with the timestamps, sure. Anyway they were not "in a row". the whole error log is not attached due to size constraints but I'll bzip it and upload it somewhere

Edit: at 18:55:48 proxysql was restarted, at ~19:04 all the clients were restarted, and the errors stopped.
Before 18:55 we had some clients restarted/joined, they couldn't contact with RDS

Edit2: I have to redact the log file since there is some customers personal data leaked in it (phone numbers are part of some keys and there are some duplicated entry errors related to those keys)

Thank you, now with the timestamp things are more clear.
ProxySQL won't restart on those conditions. When the missed heartbeat are logged, it seems that most of the time the next missed heartbeat is several seconds apart, that means that threads started sending heartbeat for some time, then stopped again.
Also, the fact that the number of threads missing the heartbeats at times decreases, it suggests that proxysql is not really blocked.
It is, probably, extremely slow.
We have seen this type of behavior when proxysql is swapping a lot, thus becoming very slow.
This hypothesis can explain also why restarting it fixed the issue.
Do you have memory graphs?

image

(1) issue began
(2) proxysql restarted
(3) rails clients restarted

As you can see, plenty of available RAM. CPU and network traffic went down to very low values instead

CPU graph
image

There was a small spike in memory usage when the problem started, but it doesn't seem too relevant.
But in this case I would exclude the possibility of swapping.
Next time I would recommend to execute gcore on proxysql pid, so to get a core dump and understand what the threads were blocked on

Were there any unusual messages in error log?
If you could attach the error log from around 18:00 to 20:00 , that would be great

Here you go, anonymized logs. Thanks @renecannao for your help 馃檶馃徑
proxysql-reduced.zip

$ cat /tmp/proxysql-reduced.log | egrep -v '(Deadlock found when trying to get lock|Duplicate entry|Closing unhealthy client connection)' | cut -b 21-89 | sort | uniq -c | sort -n -r | head -n 20
 251441 mysql_connection.cpp:969:handler(): [ERROR] Failed to mysql_real_conn
  77908 mysql_connection.cpp:1006:handler(): [ERROR] Connect timeout on api-m
   7645 MySQL_Session.cpp:3854:handler(): [ERROR] Detected an offline server 
   5803 MySQL_Session.cpp:3863:handler(): [WARNING] Retrying query.
   1246 MySQL_Session.cpp:2278:handler_again___status_SETTING_GENERIC_VARIABL
    895 MySQL_HostGroups_Manager.cpp:2446:get_random_MySrvC(): [ERROR] Hostgr
    755 MySQL_HostGroups_Manager.cpp:906:connect_error(): [ERROR] Shunning se
    487 MySQL_Session.cpp:1471:handler_again___status_PINGING_SERVER(): [ERRO
     83 MySQL_Session.cpp:135:kill_query_thread(): [WARNING] KILL CONNECTION 
     32 MySQL_Session.cpp:3455:handler(): [ERROR] Unexpected COM_QUIT from cl
     25 main.cpp:1865:main(): [ERROR] Watchdog: 14 threads missed a heartbeat
     24 main.cpp:1865:main(): [ERROR] Watchdog: 15 threads missed a heartbeat
     22 main.cpp:1865:main(): [ERROR] Watchdog: 13 threads missed a heartbeat
     19 MySQL_Thread.cpp:4190:process_data_on_data_stream(): [WARNING] Detect
     18 main.cpp:1865:main(): [ERROR] Watchdog: 16 threads missed a heartbeat
     17 mysql_connection.cpp:1006:handler(): [ERROR] Connect timeout on api-s
     11 MySQL_Session.cpp:3902:handler(): [WARNING] Retrying query.
     11 MySQL_Session.cpp:3894:handler(): [ERROR] Detected a broken connectio
     11 MySQL_Session.cpp:3851:handler(): [ERROR] Detected a lagging server d
     10 main.cpp:1865:main(): [ERROR] Watchdog: 11 threads missed a heartbeat

There was a clear problem related to mysql_real_connect(). Looking a bit more in the log:

$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | awk -F ' , ' '{print $4}' | sort | uniq -c
 251441 1129: Host '10.31.30.3' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'.

To be more specific, api-master was locking proxysql:

$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | cut -b 99- | awk '{print $1}' | sort | uniq -c
 251441 api-master.mysql:3306

The problem lasted from 18:29:03 until 18:52:23 . You should have unblocked it on the backend (this information was omitted so far):

$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | cut -b 1-20 | head -n 5
2021-01-22 18:29:03 
2021-01-22 18:29:03 
2021-01-22 18:29:03 
2021-01-22 18:29:03 
2021-01-22 18:29:03 

$ cat /tmp/proxysql-reduced.log | grep "Failed to mysql_real_conn" | cut -b 1-20 | tail -n 5
2021-01-22 18:52:23 
2021-01-22 18:52:23 
2021-01-22 18:52:23 
2021-01-22 18:52:23 
2021-01-22 18:52:23 

Before the backend blocked proxysql, proxysql was timing out while trying to connect to the backend:

$ grep "Connect timeout on" /tmp/proxysql-reduced.log | cut -b 1-16 | sort | uniq -c
   4419 2021-01-22 18:28
   1331 2021-01-22 18:29
     40 2021-01-22 18:30
     39 2021-01-22 18:31
     28 2021-01-22 18:32
     27 2021-01-22 18:33
     21 2021-01-22 18:34
     19 2021-01-22 18:35
      7 2021-01-22 18:36
     15 2021-01-22 18:38
      6 2021-01-22 18:39
     15 2021-01-22 18:40
     44 2021-01-22 18:41
      1 2021-01-22 18:42
     21 2021-01-22 18:43
     12 2021-01-22 18:44
      8 2021-01-22 18:45
      7 2021-01-22 18:46
     14 2021-01-22 18:47
     23 2021-01-22 18:48
      7 2021-01-22 18:49
     18 2021-01-22 18:50
      8 2021-01-22 18:51
   3420 2021-01-22 18:52
   4856 2021-01-22 18:53
   4960 2021-01-22 18:54
   3399 2021-01-22 18:55
  12427 2021-01-22 18:56
   7567 2021-01-22 18:57
   6784 2021-01-22 18:58
   5422 2021-01-22 18:59
   7443 2021-01-22 19:00
   5323 2021-01-22 19:01
   5432 2021-01-22 19:02
   4762 2021-01-22 19:03

To be more specific, the backend started timing out connections at 18:28:32:

$ egrep "^2021-01-22 18:2.*Connect timeout on" /tmp/proxysql-reduced.log | cut -b 1-19 | sort | uniq -c
    147 2021-01-22 18:28:32
   1690 2021-01-22 18:28:39
     13 2021-01-22 18:28:44
   1282 2021-01-22 18:28:46
   1287 2021-01-22 18:28:53
   1313 2021-01-22 18:29:00
      7 2021-01-22 18:29:07
      2 2021-01-22 18:29:28
      9 2021-01-22 18:29:48

Things were really bad at 18:28:32:

$ cat /tmp/proxysql-reduced.log | egrep -v '(Deadlock found when trying to get lock|Duplicate entry|Closing unhealthy client connection)' | egrep "^2021-01-22 18:[0-2]" | cut -b 1-19 | sort | uniq -c | head -n 20
      1 2021-01-22 18:15:57
      1 2021-01-22 18:25:36
   1131 2021-01-22 18:28:32
      8 2021-01-22 18:28:33
      1 2021-01-22 18:28:34
      1 2021-01-22 18:28:37
   2601 2021-01-22 18:28:39
      1 2021-01-22 18:28:41
     41 2021-01-22 18:28:44
    146 2021-01-22 18:28:45
   1463 2021-01-22 18:28:46
      1 2021-01-22 18:28:47
      1 2021-01-22 18:28:48
      1 2021-01-22 18:28:52
   1592 2021-01-22 18:28:53
      1 2021-01-22 18:28:55
      1 2021-01-22 18:28:58
   1380 2021-01-22 18:29:00
      1 2021-01-22 18:29:02
    159 2021-01-22 18:29:03

To be more specific, connections to backend api-master were completely failing:

$ cat /tmp/proxysql-reduced.log | egrep -v '(Deadlock found when trying to get lock|Duplicate entry|Closing unhealthy client connection)' | egrep "^2021-01-22 18:28:32" | cut -b 21- | sed -e 's/exceeded by.*/exceed by/' | sort | uniq -c | sort -n -r
    597 MySQL_Session.cpp:3854:handler(): [ERROR] Detected an offline server during query: api-master.mysql, 3306
    251 MySQL_Session.cpp:3863:handler(): [WARNING] Retrying query.
    147 mysql_connection.cpp:1006:handler(): [ERROR] Connect timeout on api-master.mysql:3306 : exceed by
    134 MySQL_Session.cpp:2278:handler_again___status_SETTING_GENERIC_VARIABLE(): [WARNING] Error while setting sql_mode to "CONCAT(CONCAT(@@sql_mode, ',STRICT_ALL_TABLES'), ',NO_AUTO_VALUE_ON_ZERO')" on api-master.mysql:3306 hg 0 :  0,
      1 MySQL_HostGroups_Manager.cpp:906:connect_error(): [ERROR] Shunning server api-master.mysql:3306 with 5 errors/sec. Shunning for 10 seconds
      1 MySQL_HostGroups_Manager.cpp:2446:get_random_MySrvC(): [ERROR] Hostgroup 0 has no servers available! Checking servers shunned for more than 1 second

One could make the hypothesis that the backend was unhealthy, but:
a) an unhealthy backend shouldn't block the client (proxysql) thinking it is misbehaving
b) an unhealthy backend shouldn't cause proxysql threads to stall (and miss heartbeats)

A possible theory at this point is a (temporarily) slow or unresponsive name lookup (DNS server): this is the only part in mysql_real_connect() that is blocking operation, everything else should be non-blocking

Thanks for the thorough review, it matches what we discussed during the outage and the internal post-mortem. Anyway we ran mysqladmin flush-hosts against the RDS backend I would say before 18:52:23 but I don't have any hard-proof about it, but we did indeed see those errors quite early in the outage and actioned upon them.

We will try to dig deeper on the DNS slowness hypothesis. Thanks a lot

@renecannao Ok, looks like that adding delay to the DNS with tc (and this nice script) we managed to reproduce the issue in our beta environment.

A couple of caveats thou, since we don't have the same traffic pressure on that proxysql instance:
1) when removing the IP delay, it starts working again almost immediately
2) we only managed to get the api-master SHUNNED when restarting proxysql while already issuing traffic
3) we need to kill backend connections with tcpkill because otherwise DNS is not refreshed enough quickly with the low volume of requests we are generating

I know that this scenario is a bit too much synthetic but there is clearly something fishy going on when DNS has a hiccup and proxysql is under tons of requests/sec

Sorry for the late catchup, after some deeper investigation with the great @renecannao the core issue here was a misconfiguration on our side, specifically the max_connections number which we initially wrongly set to a really high number (9000) trying to match the max connections number accepted by our backend mysql server. We lovered this down to 200 and since then we didn't have any other incident like this, with more load on the DB.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

vlanse picture vlanse  路  31Comments

lengerad picture lengerad  路  23Comments

everpcpc picture everpcpc  路  20Comments

williamsantanamoip picture williamsantanamoip  路  17Comments

rishimittal picture rishimittal  路  48Comments