Proxysql: Issue with the latest version of AWS RDS Aurora

Created on 24 Sep 2020  路  12Comments  路  Source: sysown/proxysql

Description of the issue:

Upgraded AWS RDS Aurora to the latest version version 2.09.0;
Immediately after upgrade ProxySQL started to report the following errror:

  • MySQL_HostGroups_Manager.cpp:5445:aws_aurora_replication_lag_action(): [WARNING] Shunning server :3306 from HG 10 with replication lag of 18446744949882880.000000 microseconds;

It was the first time when we encountered this type of error after Aurora upgrade.

ProxySQL version: proxysql-2.0.8-1.x86_64 (we tried the latest version, same issue)
OS version: Amazon Linux 1
The steps to reproduce the issue: Configure ProxySQL with the AWS Aurora 2.09.0
ProxySQL error log:

```text
2020-09-21 01:58:39 MySQL_Monitor.cpp:985:monitor_read_only_thread(): [ERROR] Timeout on read_only check for :3306 after 502ms. If the server is overload, increase mysql-monitor_read_only_timeout.
2020-09-21 01:58:39 MySQL_Monitor.cpp:1130:monitor_read_only_thread(): [ERROR] Got error: mmsd 0x7eff4420aac0 , MYSQL 0x7eff41eb6a00 , FD 28 : timeout check
2020-09-21 01:58:39 MySQL_Monitor.cpp:985:monitor_read_only_thread(): [ERROR] Timeout on read_only check for :3306 after 503ms. If the server is overload, increase mysql-monitor_read_only_timeout.
2020-09-21 01:58:39 MySQL_Monitor.cpp:1130:monitor_read_only_thread(): [ERROR] Got error: mmsd 0x7eff4420ac00 , MYSQL 0x7eff41eb7900 , FD 29 : timeout check
2020-09-21 01:58:39 MySQL_Monitor.cpp:4005:monitor_AWS_Aurora_thread_HG(): [ERROR] Timeout on AWS Aurora health check for :3306 after 804ms. If the server is overload, increase mysql_aws_aurora_hostgroups.check_timeout_ms
2020-09-21 01:58:39 MySQL_Monitor.cpp:4129:monitor_AWS_Aurora_thread_HG(): [ERROR] Error after 804ms on server :3306 : timeout check
2020-09-21 01:58:40 MySQL_Monitor.cpp:4005:monitor_AWS_Aurora_thread_HG(): [ERROR] Timeout on AWS Aurora health check for :3306 after 804ms. If the server is overload, increase mysql_aws_aurora_hostgroups.check_timeout_ms
2020-09-21 01:58:40 MySQL_Monitor.cpp:4129:monitor_AWS_Aurora_thread_HG(): [ERROR] Error after 804ms on server :3306 : timeout check
2020-09-21 01:58:41 ProxySQL_GloVars.cpp:9:term_handler(): [WARNING] Received TERM signal: shutdown in progress...
2020-09-21 01:58:41 [INFO] Starting shutdown...
2020-09-21 01:58:42 MySQL_Monitor.cpp:3988:monitor_AWS_Aurora_thread_HG(): [ERROR] Error on AWS Aurora check for :3306 after 2099ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout or error in creating new connection: Unknown MySQL server host '' (1).
2020-09-21 01:58:42 [INFO] Stopping Monitor thread for AWS Aurora writer HG 10
2020-09-21 01:58:43 MySQL_Monitor.cpp:960:monitor_read_only_thread(): [ERROR] Timeout on read_only check for :3306 after 2099ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout on creating new connection: Unknown MySQL server host '' (1).
2020-09-21 01:58:43 MySQL_Monitor.cpp:960:monitor_read_only_thread(): [ERROR] Timeout on read_only check for :3306 after 2099ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout on creating new connection: Unknown MySQL server host '' (1).
.....
2020-09-21 02:00:46 [INFO] Creating new server in HG 10 : :3306 , gtid_port=0, weight=1000, status=0
2020-09-21 02:00:46 [INFO] New mysql_group_replication_hostgroups table
2020-09-21 02:00:46 [INFO] New mysql_galera_hostgroups table
2020-09-21 02:00:46 [INFO] New mysql_aws_aurora_hostgroups table
2020-09-21 02:00:46 [INFO] Loading AWS Aurora info for (10,20,on,3306,"..us-east-1.rds.amazonaws.com",600000,1000,800,"(null)")
2020-09-21 02:00:46 [INFO] MySQL_HostGroups_Manager::commit() locked for 1us
Standard Query Processor rev. 2.0.6.0805 -- Query_Processor.cpp -- Tue Nov 5 11:48:50 2019
In memory Standard Query Cache (SQC) rev. 1.2.0905 -- Query_Cache.cpp -- Tue Oct 15 17:03:55 2019
Standard MySQL Monitor (StdMyMon) rev. 2.0.1226 -- MySQL_Monitor.cpp -- Mon Nov 4 10:31:48 2019
2020-09-21 02:00:46 [INFO] Detected new/changed definition for AWS Aurora monitoring
2020-09-21 02:00:46 [INFO] Activating Monitoring of 1 AWS Aurora clusters
2020-09-21 02:00:46 [INFO] Starting Monitor thread for AWS Aurora writer HG 10
2020-09-21 02:00:46 [INFO] Started Monitor thread for AWS Aurora writer HG 10
2020-09-21 02:00:46 MySQL_HostGroups_Manager.cpp:5445:aws_aurora_replication_lag_action(): [WARNING] Shunning server :3306 from HG 10 with replication lag of 18446744949882880.000000 microseconds
2020-09-21 02:00:46 [INFO] Adding new discovered AWS Aurora node :3306 with: hostgroup=20, weight=1, max_connections=1000
2020-09-21 02:00:46 [INFO] Dumping current MySQL Servers structures for hostgroup 10
HID: 10 , address: , port: 3306 , gtid_port: 0 , weight: 1000 , status: SHUNNED , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
2020-09-21 02:00:46 [INFO] Dumping mysql_servers: HG 10

Most helpful comment

Same issue here - this is quite critical.

All 12 comments

I am seeing the same issue with the latest version of Aurora MySQL 5.6 (aurora_version_1.23.0) as well.

ProxySQL verison: version 2.0.14-70-g7b02b19, codename Truls
OS version: Amazon Linux 2

Monitor output from latest Aurora MySQL 5.6 (aurora_version_1.23.0):
select hostname, replica_lag_in_milliseconds from mysql_server_aws_aurora_log; +---------------------------------------------------------------+-----------------------------+ | hostname | replica_lag_in_milliseconds | +---------------------------------------------------------------+-----------------------------+ | rds-aurora-*** | 18446744949882880 |

Monitor output from previous Aurora MySQL 5.6 (aurora_version_1.22.0):
select hostname, replica_lag_in_milliseconds from mysql_server_aws_aurora_log; +-------------------------------------------------------------------+-----------------------------+ | hostname | replica_lag_in_milliseconds | +-------------------------------------------------------------------+-----------------------------+ | rds-aurora-*** | 19.3509998321533 |

It appears to be a unit conversion change for the lag reporting.

+1. I've seen the same issue.

Yes, currently hit by this. Our only option to recover was to snapshot and relaunch Aurora on 2.07.2. No bueno :(

+1. I have the same problem in aurora 2.09.1

Same issue here - this is quite critical.

I had to quickly patch proxysql to get it to work, specifically:

diff --git a/lib/MySQL_Monitor.cpp b/lib/MySQL_Monitor.cpp
index 346cbde8..813cf00b 100644
--- a/lib/MySQL_Monitor.cpp
+++ b/lib/MySQL_Monitor.cpp
@@ -4079,9 +4079,9 @@ void * monitor_AWS_Aurora_thread_HG(void *arg) {
        mmsd->t1=monotonic_time();
        mmsd->interr=0; // reset the value
 #ifdef TEST_AURORA
-       mmsd->async_exit_status = mysql_query_start(&mmsd->interr, mmsd->mysql, "SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU FROM REPLICA_HOST_STATUS ORDER BY SERVER_ID");
+       mmsd->async_exit_status = mysql_query_start(&mmsd->interr, mmsd->mysql, "SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, IF(REPLICA_LAG_IN_MILLISECONDS >= 2147483647, 0, REPLICA_LAG_IN_MILLISECONDS) AS REPLICA_LAG_IN_MILLISECONDS, CPU FROM REPLICA_HOST_STATUS ORDER BY SERVER_ID");
 #else
-       mmsd->async_exit_status = mysql_query_start(&mmsd->interr, mmsd->mysql, "SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS WHERE (REPLICA_LAG_IN_MILLISECONDS > 0 AND REPLICA_LAG_IN_MILLISECONDS <= 600000) OR SESSION_ID = 'MASTER_SESSION_ID' ORDER BY SERVER_ID");
+       mmsd->async_exit_status = mysql_query_start(&mmsd->interr, mmsd->mysql, "SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, IF(REPLICA_LAG_IN_MILLISECONDS >= 2147483647, 0, REPLICA_LAG_IN_MILLISECONDS) AS REPLICA_LAG_IN_MILLISECONDS, CPU FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS WHERE (REPLICA_LAG_IN_MILLISECONDS > 0 AND REPLICA_LAG_IN_MILLISECONDS <= 600000) OR SESSION_ID = 'MASTER_SESSION_ID' ORDER BY SERVER_ID");
 #endif // TEST_AURORA
        while (mmsd->async_exit_status) {
                mmsd->async_exit_status=wait_for_mysql(mmsd->mysql, mmsd->async_exit_status);
@@ -4493,9 +4493,9 @@ void * monitor_AWS_Aurora_thread(void *arg) {
        mmsd->t1=monotonic_time();
        mmsd->interr=0; // reset the value
 #ifdef TEST_AURORA
-       mmsd->async_exit_status = mysql_query_start(&mmsd->interr, mmsd->mysql, "SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU FROM REPLICA_HOST_STATUS ORDER BY SERVER_ID");
+       mmsd->async_exit_status = mysql_query_start(&mmsd->interr, mmsd->mysql, "SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, IF(REPLICA_LAG_IN_MILLISECONDS >= 2147483647, 0, REPLICA_LAG_IN_MILLISECONDS) AS REPLICA_LAG_IN_MILLISECONDS, CPU FROM REPLICA_HOST_STATUS ORDER BY SERVER_ID");
 #else
-       mmsd->async_exit_status = mysql_query_start(&mmsd->interr, mmsd->mysql, "SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS ORDER BY SERVER_ID");
+       mmsd->async_exit_status = mysql_query_start(&mmsd->interr, mmsd->mysql, "SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, IF(REPLICA_LAG_IN_MILLISECONDS >= 2147483647, 0, REPLICA_LAG_IN_MILLISECONDS) AS REPLICA_LAG_IN_MILLISECONDS, CPU FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS ORDER BY SERVER_ID");
 #endif // TEST_AURORA
        while (mmsd->async_exit_status) {
                mmsd->async_exit_status=wait_for_mysql(mmsd->mysql, mmsd->async_exit_status);

The replica lag in milliseconds should never be larger than MAX_INT, so in that case I just default it to 0. Probably this is not the proper fix, but it got me going. Maybe a fix would be to default the replication lag to 0 on the master session..

Looks like the column format changed in 1.23+ and 2.9+. Previously MASTER_SESSION_ID seems to have always been 0, now it is being populated and it seems proxysql is handling it incorrectly and setting it to max int
Prior to 1.23 and 2.9

MySQL [test]> select SERVER_ID,SESSION_ID,REPLICA_LAG_IN_MILLISECONDS from INFORMATION_SCHEMA.REPLICA_HOST_STATUS;
+-----------+--------------------------------------+-----------------------------+
| SERVER_ID | SESSION_ID                           | REPLICA_LAG_IN_MILLISECONDS |
+-----------+--------------------------------------+-----------------------------+
| node1     | MASTER_SESSION_ID                    |                           0 |
| node2     | c3p3f337-c95d-4ss9-a05e-sdgsd5454sdf |                          19 |
| node3     | 0033ds35-c241-4rrb-a444-aasdasd44sd4 |                          19 |
+-----------+--------------------------------------+-----------------------------+
3 rows in set (0.00 sec)

Post 1.23 and 2.9, m4ce prob has the right idea here that master session id should always be 0. replica lag is only valid for readers
```
MySQL [test]> select SERVER_ID,SESSION_ID,REPLICA_LAG_IN_MILLISECONDS from INFORMATION_SCHEMA.REPLICA_HOST_STATUS;
+-----------+--------------------------------------+-----------------------------+
| SERVER_ID | SESSION_ID | REPLICA_LAG_IN_MILLISECONDS |
+-----------+--------------------------------------+-----------------------------+
| node1 | MASTER_SESSION_ID | 1.844674407370955e16 |
| node2 | c3p3f337-c95d-4ss9-a05e-sdgsd5454sdf | 19 |
| node3 | 0033ds35-c241-4rrb-a444-aasdasd44sd4 | 19 |
+-----------+--------------------------------------+-----------------------------+
3 rows in set (0.01 sec)
````

Hi I'm facing this issue with 5.7.mysql_aurora.2.07.2 - I have observed something weird:

If started as a service it fails instantly:

/var/lib/proxysql/proxysql.log
Standard Query Processor rev. 2.0.6.0805 -- Query_Processor.cpp -- Wed Oct 28 20:09:40 2020
In memory Standard Query Cache (SQC) rev. 1.2.0905 -- Query_Cache.cpp -- Wed Oct 28 20:09:40 2020
Standard MySQL Monitor (StdMyMon) rev. 2.0.1226 -- MySQL_Monitor.cpp -- Wed Oct 28 20:09:40 2020
2021-01-03 19:23:23 [INFO] Latest ProxySQL version available: 2.0.15-20-g32bb92c

2021-01-03 19:23:45 ProxySQL_GloVars.cpp:9:term_handler(): [WARNING] Received TERM signal: shutdown in progress...
2021-01-03 19:23:45 [INFO] Starting shutdown...
2021-01-03 19:23:45 [INFO] Shutdown completed!
2021-01-03 19:23:45 [INFO] Exiting...
2021-01-03 19:23:45 [INFO] Shutdown angel process

If started as foreground app it works!:

sudo /usr/bin/proxysql --idle-threads -c /etc/proxysql.cnf -f
2021-01-03 19:42:24 [INFO] Using config file /etc/proxysql.cnf
2021-01-03 19:42:24 [INFO] Using OpenSSL version: OpenSSL 1.1.1d  10 Sep 2019
2021-01-03 19:42:24 [INFO] SSL keys/certificates found in datadir (/var/lib/proxysql): loading them.
2021-01-03 19:42:24 [INFO] ProxySQL version 2.0.15-20-g32bb92cd
2021-01-03 19:42:24 [INFO] Detected OS: Linux ip-172-31-25-163 5.4.0-1032-aws #33-Ubuntu SMP Wed Dec 9 17:13:04 UTC 2020 x86_64
2021-01-03 19:42:24 [INFO] ProxySQL SHA1 checksum: ef0c72078a3b9fb57a613c91d1fd1eef4f1cd314
Standard ProxySQL MySQL Logger rev. 2.0.0714 -- MySQL_Logger.cpp -- Wed Oct 28 20:09:40 2020
Standard ProxySQL Cluster rev. 0.4.0906 -- ProxySQL_Cluster.cpp -- Wed Oct 28 20:09:40 2020
Standard ProxySQL Statistics rev. 1.4.1027 -- ProxySQL_Statistics.cpp -- Wed Oct 28 20:09:40 2020
Standard ProxySQL HTTP Server Handler rev. 1.4.1031 -- ProxySQL_HTTP_Server.cpp -- Wed Oct 28 20:09:40 2020
Standard ProxySQL Admin rev. 2.0.6.0805 -- ProxySQL_Admin.cpp -- Wed Oct 28 20:09:40 2020
2021-01-03 19:42:24 [INFO] ProxySQL SHA1 checksum: ef0c72078a3b9fb57a613c91d1fd1eef4f1cd314
Standard MySQL Threads Handler rev. 0.2.0902 -- MySQL_Thread.cpp -- Wed Oct 28 20:09:40 2020
Standard MySQL Authentication rev. 0.2.0902 -- MySQL_Authentication.cpp -- Wed Oct 28 20:09:40 2020
2021-01-03 19:42:24 [INFO] Dumping mysql_servers_incoming
+--------------+----------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                                       | port | gtid_port | weight | status | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+----------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 1            | database-1.cluster-xxxxx.us-east-2.rds.amazonaws.com    | 3306 | 0         | 1      | 0      | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | database-1.cluster-ro-xxxxx.us-east-2.rds.amazonaws.com | 3306 | 0         | 1      | 0      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+----------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
2021-01-03 19:42:24 [INFO] Dumping mysql_servers LEFT JOIN mysql_servers_incoming
+-------------+--------------+----------+------+
| mem_pointer | hostgroup_id | hostname | port |
+-------------+--------------+----------+------+
+-------------+--------------+----------+------+
2021-01-03 19:42:24 [INFO] Dumping mysql_servers JOIN mysql_servers_incoming
+--------------+----------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+-------------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                                       | port | gtid_port | weight | status | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment | mem_pointer | gtid_port | weight | status | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+----------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+-------------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 1            | database-1.cluster-xxxx.us-east-2.rds.amazonaws.com    | 3306 | 0         | 1      | 0      | 0           | 1000            | 0                   | 0       | 0              |         | 0           | 0         | 1      | 0      | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | database-1.cluster-ro-xxxxx.us-east-2.rds.amazonaws.com | 3306 | 0         | 1      | 0      | 0           | 1000            | 0                   | 0       | 0              |         | 0           | 0         | 1      | 0      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+----------------------------------------------------------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+-------------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
2021-01-03 19:42:24 [INFO] Creating new server in HG 1 : database-1.cluster-xxxxx.us-east-2.rds.amazonaws.com:3306 , gtid_port=0, weight=1, status=0
2021-01-03 19:42:24 [INFO] Creating new server in HG 2 : database-1.cluster-ro-xxxx.us-east-2.rds.amazonaws.com:3306 , gtid_port=0, weight=1, status=0
2021-01-03 19:42:24 [INFO] New mysql_group_replication_hostgroups table
2021-01-03 19:42:24 [INFO] New mysql_galera_hostgroups table
2021-01-03 19:42:24 [INFO] New mysql_aws_aurora_hostgroups table
2021-01-03 19:42:24 [INFO] MySQL_HostGroups_Manager::commit() locked for 1ms
Standard Query Processor rev. 2.0.6.0805 -- Query_Processor.cpp -- Wed Oct 28 20:09:40 2020
In memory Standard Query Cache (SQC) rev. 1.2.0905 -- Query_Cache.cpp -- Wed Oct 28 20:09:40 2020
Standard MySQL Monitor (StdMyMon) rev. 2.0.1226 -- MySQL_Monitor.cpp -- Wed Oct 28 20:09:40 2020
2021-01-03 19:42:25 [INFO] Latest ProxySQL version available: 2.0.15-20-g32bb92c

Please nvm above message, turns out that I started proxysql in foreground as root and it messed up the file ownership, but service start didn't complain about it, it just died. Executing sudo chown -R proxysql:proxysql /var/lib/proxysql/ fixed it for me.

At least I can confirm that it works with 5.7.mysql_aurora.2.07.2 :)

@renecannao - any chance we could apply my fix? I have been running it in prod for weeks now and it's been working fine.

See #3234 for a PR based on @m4ce's patch at https://github.com/sysown/proxysql/issues/3082#issuecomment-743352833 with a slightly adjusted query.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

vy-nguyentan picture vy-nguyentan  路  4Comments

ShahriyarR picture ShahriyarR  路  3Comments

eightnoteight picture eightnoteight  路  4Comments

renecannao picture renecannao  路  3Comments

renecannao picture renecannao  路  3Comments