During backend server master failover (ie., in the target host group all the write nodes are either offline soft or offline hard), write queries are failing due to time-out from backend server (_2013, 'Lost connection to MySQL server during query (timed out)_). But, front-end proxysql connection is getting dropped immediately. We need to reconnect to query again.
Although mysql-connect_retries_on_failure is set to 10, it is returning after a single failure.
Transition of nodes status during this failover:
Initial:
Master - ONLINE
StandBy - OFFLINE_SOFT
During failover (this is where we are getting query timeouts):
Master - OFFLINE_HARD
StandBy - OFFLINE_SOFT
After failover:
Master - OFFLINE_HARD
StandBy(new Master) - ONLINE
ProxySQL version: v1.4.9-1.1.el7
@chaithanyaganta , if the client got disconnect it means that (for a variety of reasons) ProxySQL has disabled multiplexing in the connection used by the client and therefore cannot re-establish a new connection to the backend and consider that connection safe.
The classic example is if you have a transaction still running: proxysql cannot recreate the transaction on a new connection, therefore it disconnects the client.
There are a variety of other reasons why ProxySQL disables multiplexing, listed in the wiki.
A very common reason for multiplexing to be disabled is due to session variables used in many drivers (for example JDBC). You often should instruct proxysql to not disable multiplexing for such queries. See again the wiki
@renecannao Thanks for the quick reply. Sorry that I'm new to proxysql.
Our proxysql setup is as below:
INSERT INTO test values (NOW()). Still we are seeing the connection drop. Please let me know any other configuration I need to look into which may cause this issue.
Is it possible to have the output of SELECT * FROM stats_mysql_query_digest and SHOW MYSQL STATUS?
Sure, Updated the gist with the required output.
Thanks.
From the output it is clear that the application never issues a commit, and since it uses autocommit sets to False, it is all a big transaction. So back on my first comment is explained why the connection is being dropped
Actually, we tried setting autocommit to True as well. Still we are seeing this issue.
@chaithanyaganta , please provide a reproducible test case.
It is possible you aren't doing the failover correctly, or something else is happening.
For completeness, below is an example of how it works successfully using sysbench.
Starting point:
mysql> SELECT hostgroup_id, hostname, port, status FROM mysql_servers;
+--------------+-----------+-------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-----------+-------+--------------+
| 1 | 127.0.0.1 | 17822 | ONLINE |
| 1 | 127.0.0.1 | 17823 | OFFLINE_SOFT |
+--------------+-----------+-------+--------------+
2 rows in set (0,00 sec)
Running sysbench:
../sysbench oltp_read_write.lua --threads=16 --report-interval=1 --max-requests=0 \
--time=30 --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
--table-size=1000000 --db-ps-mode=disable --mysql-ignore-errors=1062 \
--skip-trx=on --mysql-port=6033 run
Note: sysbench is ignoring duplicate key error (1062) because they are expected to happen when --skip-trx=on is used.
Performing the failover as you described it:
mysql> UPDATE mysql_servers SET status='OFFLINE_HARD' WHERE port=17822; SELECT hostgroup_id, hostname, port, status FROM mysql_servers; LOAD MYSQL SERVERS TO RUNTIME;
Query OK, 1 row affected (0,00 sec)
+--------------+-----------+-------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-----------+-------+--------------+
| 1 | 127.0.0.1 | 17822 | OFFLINE_HARD |
| 1 | 127.0.0.1 | 17823 | OFFLINE_SOFT |
+--------------+-----------+-------+--------------+
2 rows in set (0,00 sec)
Query OK, 0 rows affected (0,00 sec)
... and after few seconds:
mysql> UPDATE mysql_servers SET status='ONLINE' WHERE port=17823; SELECT hostgroup_id, hostname, port, status FROM mysql_servers; LOAD MYSQL SERVERS TO RUNTIME;
Query OK, 1 row affected (0,00 sec)
+--------------+-----------+-------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-----------+-------+--------------+
| 1 | 127.0.0.1 | 17822 | OFFLINE_HARD |
| 1 | 127.0.0.1 | 17823 | ONLINE |
+--------------+-----------+-------+--------------+
2 rows in set (0,00 sec)
Query OK, 0 rows affected (0,01 sec)
Output of sysbench:
../sysbench oltp_read_write.lua --threads=16 --report-interval=1 --max-requests=0 --time=30 --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest --table-size=1000000 --db-ps-mode=disable --mysql-ignore-errors=1062 --skip-trx=on --mysql-port=6033 run
sysbench 1.1.0-81ae246 (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 16
Report intermediate results every 1 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 1s ] thds: 16 tps: 281.03 qps: 5301.70 (r/w/o: 4152.67/1149.03/0.00) lat (ms,95%): 77.19 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 16 tps: 282.13 qps: 5084.27 (r/w/o: 3954.77/1129.51/0.00) lat (ms,95%): 95.81 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 16 tps: 337.98 qps: 6062.65 (r/w/o: 4713.73/1348.92/0.00) lat (ms,95%): 66.84 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 16 tps: 321.98 qps: 5822.57 (r/w/o: 4525.67/1295.90/1.00) lat (ms,95%): 73.13 err/s: 0.00 reconn/s: 0.00
[ 5s ] thds: 16 tps: 380.03 qps: 6824.52 (r/w/o: 5319.40/1503.11/2.00) lat (ms,95%): 50.11 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 16 tps: 342.93 qps: 6181.70 (r/w/o: 4801.99/1378.71/1.00) lat (ms,95%): 59.99 err/s: 0.00 reconn/s: 0.00
[ 7s ] thds: 16 tps: 293.08 qps: 5246.36 (r/w/o: 4086.06/1159.30/1.00) lat (ms,95%): 64.47 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 9s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 11s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 12s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 13s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 16s ] thds: 16 tps: 137.00 qps: 2481.91 (r/w/o: 1930.93/550.98/0.00) lat (ms,95%): 8795.93 err/s: 1.00 reconn/s: 0.00
[ 17s ] thds: 16 tps: 313.01 qps: 5638.10 (r/w/o: 4382.08/1256.02/0.00) lat (ms,95%): 73.13 err/s: 0.00 reconn/s: 0.00
[ 18s ] thds: 16 tps: 330.01 qps: 5882.11 (r/w/o: 4569.09/1313.03/0.00) lat (ms,95%): 86.00 err/s: 0.00 reconn/s: 0.00
[ 19s ] thds: 16 tps: 345.99 qps: 6286.83 (r/w/o: 4888.87/1397.96/0.00) lat (ms,95%): 61.08 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 16 tps: 365.01 qps: 6575.13 (r/w/o: 5118.10/1455.03/2.00) lat (ms,95%): 54.83 err/s: 0.00 reconn/s: 0.00
[ 21s ] thds: 16 tps: 357.98 qps: 6400.59 (r/w/o: 4974.68/1425.91/0.00) lat (ms,95%): 61.08 err/s: 0.00 reconn/s: 0.00
[ 22s ] thds: 16 tps: 328.01 qps: 5948.22 (r/w/o: 4632.18/1316.05/0.00) lat (ms,95%): 71.83 err/s: 0.00 reconn/s: 0.00
[ 23s ] thds: 16 tps: 337.96 qps: 6018.24 (r/w/o: 4675.41/1342.83/0.00) lat (ms,95%): 70.55 err/s: 0.00 reconn/s: 0.00
[ 24s ] thds: 16 tps: 352.03 qps: 6418.51 (r/w/o: 4997.40/1421.11/0.00) lat (ms,95%): 62.19 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 16 tps: 362.97 qps: 6510.54 (r/w/o: 5063.64/1446.90/0.00) lat (ms,95%): 62.19 err/s: 0.00 reconn/s: 0.00
[ 26s ] thds: 16 tps: 338.03 qps: 6131.60 (r/w/o: 4764.47/1366.13/1.00) lat (ms,95%): 65.65 err/s: 1.00 reconn/s: 0.00
[ 27s ] thds: 16 tps: 367.00 qps: 6588.06 (r/w/o: 5133.05/1454.01/1.00) lat (ms,95%): 63.32 err/s: 0.00 reconn/s: 0.00
[ 28s ] thds: 16 tps: 327.00 qps: 5893.07 (r/w/o: 4583.05/1310.01/0.00) lat (ms,95%): 84.47 err/s: 0.00 reconn/s: 0.00
[ 29s ] thds: 16 tps: 325.01 qps: 5818.25 (r/w/o: 4516.20/1301.06/1.00) lat (ms,95%): 78.60 err/s: 1.00 reconn/s: 0.00
[ 30s ] thds: 16 tps: 321.98 qps: 5841.56 (r/w/o: 4555.66/1285.90/0.00) lat (ms,95%): 75.82 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 100352
write: 28659
other: 10
total: 129021
transactions: 7165 (238.42 per sec.)
queries: 129021 (4293.32 per sec.)
ignored errors: 3 (0.10 per sec.)
reconnects: 0 (0.00 per sec.)
Throughput:
events/s (eps): 238.4235
time elapsed: 30.0516s
total number of events: 7165
Latency (ms):
min: 26.00
avg: 67.06
max: 8868.70
95th percentile: 69.29
sum: 480498.49
Threads fairness:
events (avg/stddev): 447.8125/7.90
execution time (avg/stddev): 30.0312/0.01
In the above output you can see that the failover took around 9 seconds, and there was no reconnects (sysbench was never disconnected).
Starting point:
mysql> SELECT hostgroup_id, hostname, port, status FROM mysql_servers;
+--------------+-----------+-------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-----------+-------+--------------+
| 1 | 127.0.0.1 | 17822 | ONLINE |
| 1 | 127.0.0.1 | 17823 | OFFLINE_SOFT |
+--------------+-----------+-------+--------------+
2 rows in set (0,00 sec)
Running sysbench:
../sysbench oltp_read_write.lua --threads=16 --report-interval=1 --max-requests=0 \
--time=30 --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
--table-size=1000000 --db-ps-mode=disable --mysql-ignore-errors=1062 \
--skip-trx=off --mysql-port=6033 run
Performing the graceful switchover in 3 steps (it can be done in 2 steps combining the last 2, but I want to show the overall reliability of the system).
Step1: current writer is set OFFLINE_SOFT:
mysql> UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE port=17822; SELECT hostgroup_id, hostname, port, status FROM mysql_servers; LOAD MYSQL SERVERS TO RUNTIME;
Query OK, 1 row affected (0,00 sec)
+--------------+-----------+-------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-----------+-------+--------------+
| 1 | 127.0.0.1 | 17822 | OFFLINE_SOFT |
| 1 | 127.0.0.1 | 17823 | OFFLINE_SOFT |
+--------------+-----------+-------+--------------+
2 rows in set (0,00 sec)
After few seconds, step 2: current master is set OFFLINE_HARD:
mysql> UPDATE mysql_servers SET status='OFFLINE_HARD' WHERE port=17822; SELECT hostgroup_id, hostname, port, status FROM mysql_servers; LOAD MYSQL SERVERS TO RUNTIME;
Query OK, 1 row affected (0,00 sec)
+--------------+-----------+-------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-----------+-------+--------------+
| 1 | 127.0.0.1 | 17822 | OFFLINE_HARD |
| 1 | 127.0.0.1 | 17823 | OFFLINE_SOFT |
+--------------+-----------+-------+--------------+
2 rows in set (0,00 sec)
After few more seconds, step 3: new master is set ONLINE:
mysql> UPDATE mysql_servers SET status='ONLINE' WHERE port=17823; SELECT hostgroup_id, hostname, port, status FROM mysql_servers; LOAD MYSQL SERVERS TO RUNTIME; Query OK, 1 row affected (0,00 sec)
+--------------+-----------+-------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-----------+-------+--------------+
| 1 | 127.0.0.1 | 17822 | OFFLINE_HARD |
| 1 | 127.0.0.1 | 17823 | ONLINE |
+--------------+-----------+-------+--------------+
2 rows in set (0,00 sec)
Query OK, 0 rows affected (0,01 sec)
Complete output of sysbench:
../sysbench oltp_read_write.lua --threads=16 --report-interval=1 --max-requests=0 --time=30 --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest --table-size=1000000 --db-ps-mode=disable --mysql-ignore-errors=1062 --skip-trx=off --mysql-port=6033 run
sysbench 1.1.0-81ae246 (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 16
Report intermediate results every 1 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 1s ] thds: 16 tps: 834.09 qps: 16977.05 (r/w/o: 11899.70/3393.22/1684.14) lat (ms,95%): 30.26 err/s: 0.00 reconn/s: 0.00
[ 2s ] thds: 16 tps: 875.33 qps: 17479.67 (r/w/o: 12240.67/3488.33/1750.67) lat (ms,95%): 29.72 err/s: 0.00 reconn/s: 0.00
[ 3s ] thds: 16 tps: 831.80 qps: 16542.95 (r/w/o: 11569.16/3310.19/1663.59) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 16 tps: 728.23 qps: 14388.78 (r/w/o: 10062.36/2884.95/1441.48) lat (ms,95%): 36.24 err/s: 0.00 reconn/s: 0.00
[ 5s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 7s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 9s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 11s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 12s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 13s ] thds: 16 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 16 tps: 179.25 qps: 3800.36 (r/w/o: 2672.77/753.06/374.53) lat (ms,95%): 9799.46 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 16 tps: 843.04 qps: 16919.76 (r/w/o: 11851.53/3382.15/1686.08) lat (ms,95%): 29.72 err/s: 0.00 reconn/s: 0.00
[ 16s ] thds: 16 tps: 921.01 qps: 18357.27 (r/w/o: 12840.19/3675.05/1842.03) lat (ms,95%): 27.66 err/s: 0.00 reconn/s: 0.00
[ 17s ] thds: 16 tps: 855.82 qps: 17111.39 (r/w/o: 11993.45/3406.30/1711.64) lat (ms,95%): 29.19 err/s: 0.00 reconn/s: 0.00
[ 18s ] thds: 16 tps: 811.97 qps: 16326.59 (r/w/o: 11418.70/3283.94/1623.95) lat (ms,95%): 31.37 err/s: 0.00 reconn/s: 0.00
[ 19s ] thds: 16 tps: 923.90 qps: 18457.03 (r/w/o: 12929.61/3679.62/1847.80) lat (ms,95%): 28.16 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 16 tps: 940.91 qps: 18730.21 (r/w/o: 13112.75/3735.64/1881.82) lat (ms,95%): 29.72 err/s: 0.00 reconn/s: 0.00
[ 21s ] thds: 16 tps: 861.08 qps: 17341.66 (r/w/o: 12123.16/3496.33/1722.16) lat (ms,95%): 34.33 err/s: 0.00 reconn/s: 0.00
[ 22s ] thds: 16 tps: 926.79 qps: 18418.91 (r/w/o: 12901.13/3664.19/1853.59) lat (ms,95%): 30.26 err/s: 0.00 reconn/s: 0.00
[ 23s ] thds: 16 tps: 959.45 qps: 19272.16 (r/w/o: 13491.41/3861.84/1918.90) lat (ms,95%): 36.89 err/s: 0.00 reconn/s: 0.00
[ 24s ] thds: 16 tps: 906.03 qps: 18061.63 (r/w/o: 12644.44/3605.13/1812.06) lat (ms,95%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 16 tps: 844.87 qps: 16967.29 (r/w/o: 11872.11/3405.46/1689.73) lat (ms,95%): 50.11 err/s: 0.00 reconn/s: 0.00
[ 26s ] thds: 16 tps: 942.12 qps: 18757.35 (r/w/o: 13132.64/3740.47/1884.24) lat (ms,95%): 31.94 err/s: 0.00 reconn/s: 0.00
[ 27s ] thds: 16 tps: 907.81 qps: 18116.17 (r/w/o: 12677.32/3623.23/1815.61) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00
[ 28s ] thds: 16 tps: 817.76 qps: 16461.25 (r/w/o: 11526.68/3299.05/1635.53) lat (ms,95%): 50.11 err/s: 0.00 reconn/s: 0.00
[ 29s ] thds: 16 tps: 908.51 qps: 18066.94 (r/w/o: 12634.94/3614.99/1817.01) lat (ms,95%): 37.56 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 16 tps: 892.89 qps: 18002.86 (r/w/o: 12606.50/3610.57/1785.79) lat (ms,95%): 48.34 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 248234
write: 70923
other: 35463
total: 354620
transactions: 17731 (590.84 per sec.)
queries: 354620 (11816.83 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
Throughput:
events/s (eps): 590.8416
time elapsed: 30.0097s
total number of events: 17731
Latency (ms):
min: 5.19
avg: 27.08
max: 9891.49
95th percentile: 33.12
sum: 480070.03
Threads fairness:
events (avg/stddev): 1108.1875/68.58
execution time (avg/stddev): 30.0044/0.00
As you can see now, also in this case we performed a failover in 9 seconds, using transactions, and with absolutely no reconnects because OFFLINE_SOFT allows transactions to complete.
@renecannao Thank you very much for your detailed analysis. I've figured out that my write-query time-out is very less which is causing connection drop although auto-commit is set to True.
I do understand the reasoning behind disabling the multiplexing during a transaction. But, I'm not sure why proxysql has to drop front-end client connection immediately incase of transaction failure instead of failing that particular transaction and holding the connection to execute further transaction/queries ?
For ex:
I have a use case where I want to hold the client connection with proxysql in case of transaction failure as well. Is there a way to achieve this ?
There are mostly 2 reasons why the connection is being dropped.
Reason 1, not so serious:
There is no official error code that indicates that a transaction is broken half way.
See here.
There is a workaround to this, for example it is possible to return error 1205 simulating a lock wait timeout, but of course it is not really a lock wait timeout.
Another workaround is to return a new error code specific to ProxySQL, but the application should start becoming aware that ProxySQL exists and that custom errors code exist too (maybe it is too early for this to happen).
Reason 2:
ProxySQL cannot be sure that the transaction failed.
If after a successful commit (explicit or not) to disk, but before it sends the OK to ProxySQL, the backend server dies or the network become unavailable, should ProxySQL tell the client that the transaction failed?
As there is the some probability that a transaction committed during a failover, ProxySQL cannot safely conclude if the transaction succeeded or failed, thus it cannot reply correctly to the client.
Mostly because of reason 2 , ProxySQL drops the client connection because the session is an unknown state.
Of course, it is possible to implement some logic to assume that a transaction didn't complete if the connection is not gracefully closed, but I think it is not generally safe.
@renecannao Thanks for the detailed reasoning.
Currently we are in process of migrating our databases managed by Tungsten to self managed using ProxySQL as a proxy layer, during which we are hitting this issue where tungsten is able to hold the connections during manual backend failover although a transaction is going on.
Is it possible for ProxySQL during graceful failover while master node is in OFFLINE_SOFT mode, to restrict further/new transactions (not only connections)? So, that when we move to OFFLINE_HARD afterwards there won't be any current transaction and the connection will hold.
Hi @chaithanyaganta .
The answer is absolutely yes!
I showed a practical example (step by step) in my previous comment (https://github.com/sysown/proxysql/issues/1632#issuecomment-409948646) in section "Using sysbench with transactions and performing a graceful switchover"
@renecannao The connection drop behaviour seems to depend upon mysql connector. I'm not seeing any connection drop using Sysbench. But, If I use any JDBC connector, connection is getting dropped during graceful switchover with an error:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 65 milliseconds ago. The last packet sent successfully to the server was 64 milliseconds ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:983)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3457)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3357)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3797)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2470)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2617)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2550)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2073)
at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2009)
at com.mysql.jdbc.PreparedStatement.executeLargeUpdate(PreparedStatement.java:5094)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1994)
at com.jcg.examples.JdbcTest.main(JdbcTest.java:44)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2949)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3367)
... 11 more
And I've observed that a typical sysbench transaction is like:
UPDATE sbtest1 SET k=k+1 WHERE id=502
SELECT c FROM sbtest1 WHERE id=501
UPDATE sbtest1 SET c='45581742431-85388569041-07163901568' WHERE id=501
SELECT c FROM sbtest1 WHERE id=500
SELECT c FROM sbtest1 WHERE id=544
SELECT c FROM sbtest1 WHERE id BETWEEN 505 AND 604 ORDER BY c
INSERT INTO sbtest1
COMMIT
Where as my Jdbc transactions are like:
SELECT @@session.auto_increment_increment AS auto_increment_increment, @@character_set_client AS character_set_client, @@character_set_connection AS character_set_connection, @@character_set_results AS character_set_results, @@character_set_server AS character_set_server, @@init_connect AS init_connect, @@interactive_timeout AS interactive_timeout, @@license AS license, @@lower_case_table_names AS lower_case_table_names, @@max_allowed_packet AS max_allowed_packet, @@net_buffer_length AS net_buffer_length, @@net_write_timeout AS net_write_timeout, @@query_cache_size AS query_cache_size, @@query_cache_type AS query_cache_type, @@sql_mode AS sql_mode, @@system_time_zone AS system_time_zone, @@time_zone AS time_zone, @@tx_isolation AS tx_isolation, @@wait_timeout AS wait_timeout
SET character_set_results = NULL
SET autocommit=1
SET autocommit=0
INSERT INTO test values (NOW())
INSERT INTO test values (NOW())
commit
INSERT INTO test values (NOW())
INSERT INTO test values (NOW())
commit
Any idea why there is a connection drop with the above transactions in case of graceful switch-over ?
Hi @chaithanyaganta .
You are in the right track, because the reason is listed in your comment.
The first SELECT executed by JDBC disables multiplexing, and ProxySQL cannot establish a new backend connection for the same client connection. More details [https://github.com/sysown/proxysql/wiki/Multiplexing].
The good news is that it is very easy to fix: you need to instruct proxysql to not disable multiplexing for that specific query. Details here
Hi @renecannao
I've added a query rule to not disable multiplexing for that JDBC select query
INSERT INTO mysql_query_rules (active, digest, multiplex) SELECT 1,digest,2 FROM stats_mysql_query_digest WHERE digest_text LIKE '%@%';
And yet I'm seeing a connection drop, in case of graceful-switch over.
@chaithanyaganta : our tests show a different behaviour of what you are stating.
Please provide a clear reproducible test case.
I've observed that further transactions are still going on in case of OFFLINE_SOFT, opposed to what you have mentioned.
This is possible only if multiplexing is disabled. For example, if you didn't run LOAD MYSQL QUERY RULES TO RUNTIME after you created the rules.
Does multiplexing really affect connection drop assuming if further transactions are restricted in OFFLINE_SOFT mode, then same front-end connection can be held right ?
If status=OFFLINE_SOFT , backend connections are dropped when they return to the connection pool. If the connection has multiplexing disabled it will not return to the connection pool. Thus, when the failover is triggered, the frontend connection is dropped because it is using a backend connection with multiplexing disabled.
@renecannao I've verified that my query rules are loaded to runtime time table. But, still I'm seeing the issue with Jdbc connector. Here is the sample code through which I'm able to reproduce the issue consistently.
I'm attaching the dump of mysql_query_rules and stats_mysql_query_digest tables here for reference
Please let me know if anything else is required.
And is there a way I can check multiplexing status at runtime ?
PS: With the same configuration, I don't see any issue with Sysbench.
@chaithanyaganta , I slighting modified your code, here is the diff:
$ diff -u JdbcTest.java.orig JdbcTest.java
--- JdbcTest.java.orig 2018-08-20 11:48:21.268316327 +0200
+++ JdbcTest.java 2018-08-20 11:51:44.027986404 +0200
@@ -1,9 +1,8 @@
-package com.jcg.examples;
-
import java.sql.DriverManager;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
+import java.util.concurrent.TimeUnit;
/**
* @author ganta
@@ -27,7 +26,7 @@
try {
connection = DriverManager
- .getConnection("jdbc:mysql://127.0.0.1:3306/test","admin", "admin");
+ .getConnection("jdbc:mysql://127.0.0.1:6033/sbtest","sbtest", "sbtest");
connection.setAutoCommit(false);
int i=0;
while (i<500000) {
@@ -42,6 +41,11 @@
System.out.println("Query status: " + pstmt.executeUpdate());
connection.commit();
+ System.out.println(java.time.LocalTime.now());
+ try {
+ TimeUnit.SECONDS.sleep(1);
+ } catch (InterruptedException ignore) {
+ }
} catch (SQLException e) {
e.printStackTrace();
//connection.rollback();
I didn't change anything in the way the code behaves, I only changed credentials, and added a timer and a sleep.
Here is the output:
java -cp .:./mysql-connector-java-5.1.47.jar JdbcTestest.java
-------- MySQL JDBC Connection Testing ------------
MySQL JDBC Driver Registered!
Query status: 1
Query status: 1
11:47:00.942
Query status: 1
Query status: 1
11:47:01.971
Query status: 1
Query status: 1
11:47:02.994
Query status: 1
Query status: 1
11:47:09.012
Query status: 1
Query status: 1
11:47:10.031
Query status: 1
Query status: 1
11:47:11.055
Query status: 1
Query status: 1
11:47:12.083
Query status: 1
Query status: 1
11:47:13.124
You can easily see that graceful switchover happened from 11:47:02.994 to 11:47:09.012 (6 seconds).
No errors, at all!
What are you doing wrong?
I don't know, but looking at your code my first concern is: are you really connecting to ProxySQL?
The doubt comes from 2 reasons:
admin:admin, although the wiki specifies:users in admin-admin_credentials cannot be used also in mysql_users table.
stats_mysql_query_digest it seems user mdsperf is being used.@renecannao
Yes, I'm connected to proxysql only, I just modified the credentials before uploading to gist and mdsperf is my actual user.
I think I've figured out the issue, If I explicitly set autocommit=1 after the transaction I don't see connection drop.
So, code will be like..
while(i < 500) {
connection.setAutoCommit(false);
pstmt = connection.prepareStatement("INSERT INTO test values (NOW())");
pstmt.executeUpdate();
connection.commit();
connection.setAutoCommit(true);
}
It seems we have to explicitly set autocommit=1, then only proxysql will re-enable multiplexing after transaction and restrict further transactions in case of OFFLINE_SOFT mode.
In the code I am using (that is mostly your original code) I am not setting autocommit=1, so I doubt this is the reason why it works or doesn't work.
After some private communication, the root cause of this issue was found in mysql-autocommit_false_is_transaction=true (the default is false).
When mysql-autocommit_false_is_transaction=true, proxysql considers a connection with autocommit=0 to have a transaction running. This is, of course, not true, but rather a misconfiguration.
Closing this issue.
Thanks
Most helpful comment
There are mostly 2 reasons why the connection is being dropped.
Reason 1, not so serious:
There is no official error code that indicates that a transaction is broken half way.
See here.
There is a workaround to this, for example it is possible to return error 1205 simulating a lock wait timeout, but of course it is not really a lock wait timeout.
Another workaround is to return a new error code specific to ProxySQL, but the application should start becoming aware that ProxySQL exists and that custom errors code exist too (maybe it is too early for this to happen).
Reason 2:
ProxySQL cannot be sure that the transaction failed.
If after a successful commit (explicit or not) to disk, but before it sends the
OKto ProxySQL, the backend server dies or the network become unavailable, should ProxySQL tell the client that the transaction failed?As there is the some probability that a transaction committed during a failover, ProxySQL cannot safely conclude if the transaction succeeded or failed, thus it cannot reply correctly to the client.
Mostly because of reason 2 , ProxySQL drops the client connection because the session is an unknown state.
Of course, it is possible to implement some logic to assume that a transaction didn't complete if the connection is not gracefully closed, but I think it is not generally safe.