proxysql high cpu usage after 24 hours

Created on 7 Dec 2016 · 76Comments · Source: sysown/proxysql

Proxysql CPU usage greatly increase in 24 hours with only 2 connections active.
Server is RHEL 6.8 running 2.6.32-642.4.2.el6.x86_64 kernel running proxysql-1.3.0h-1.x86_64

 
top - 15:42:22 up 68 days, 13 min,  1 user,  load average: 1.44, 1.38, 1.36
Tasks: 111 total,   1 running, 110 sleeping,   0 stopped,   0 zombie
Cpu(s): 24.3%us, 75.6%sy,  0.0%ni,  0.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1914216k total,  1223260k used,   690956k free,   173528k buffers
Swap:  2093052k total,    11216k used,  2081836k free,   711984k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3554 root      20   0  222m  31m 2632 S 199.5  1.7 391:54.63 proxysql
 1520 root      20   0  930m 6424 2980 S  0.3  0.3  49:55.22 dad

root      3553     1  0 Dec06 ?        00:00:00 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
root      3554  3553 26 Dec06 ?        06:41:30 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql

bug

Source

ayder

👍3

Most helpful comment

A bit of bash kung fu:

for i in `head -n 43723 report-latest.txt | tail -n 43712 | grep -v '0.00%' | awk '{print $2}' | sort | uniq -c | awk '{print $2}'` ; do echo -n `cut -b 1-29 report-latest.txt | grep "$i " | tr '%' ' ' | awk '{print $1}' | grep -v '0.00' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
37.04% swapper
22.33% proxysql
15.15% mysqld
5.61% corosync
4.26% mongod
2.74% rsync
1.93% kthreadd
1.72% proxysql_galera
1.38% node_exporter
0.9% kworker/3:0
0.81% mysql
0.77% rcu_sched
0.45% kworker/2:1
0.44% systemd
0.44% modprobe
0.41% jbd2/dm-7-8
0.33% ha_logd
0.33% flannel
0.32% date
0.28% pcs
0.21% bash
0.21% IPaddr2
0.19% ps
0.19% awk
0.17% docker
0.17% dbus-daemon
0.15% proxysql_export
0.14% mongodb_exporte
0.14% kworker/2:1H
0.13% ksoftirqd/2
0.11% systemctl
0.11% lrmd
0.11% attrd
0.1% tail
0.08% pacemakerd
0.08% crmd
0.07% kworker/3:1H
0.04% timeout
0.04% mkdir
0.04% grep
0.03% wsrep_sst_rsync
0.03% systemd-journal
0.03% proxySQLWatcher
0.03% jbd2/dm-2-8
0.03% iptables
0.02% sshd
0.02% send_arp
0.02% ntpd

The sum of these values is 99.43% , therefore the rounding error is minimal!
The above already show something interesting: proxysql is not the process/service using more cpu.

A bit more shell scripting, specifically to proxysql, a lot of time is spent in kernel space:

grep "proxysql " report-latest.txt | grep -v '0.00%' > proxysql_report-latest.txt
$ for i in `cat proxysql_report-latest.txt | awk '{print $3}' | sort | uniq -c | awk '{print $2}' | tr -d '[]'` ; do echo -n `awk '{print $1" "$3}' proxysql_report-latest.txt | grep $i | awk '{print $1}' | tr -d '%' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
13.35% kernel.kallsyms
6.9% proxysql
1.42% libpthread-2.24.so
0.31% unknown
0.29% libc-2.24.so
0.03% vdso
0.03% libstdc++.so.6.0.22

In user space, the amount of cpu time spent by proxysql is less than mysqld:

grep "mysqld " report-latest.txt | grep -v '0.00%' > mysqld_report-latest.txt
for i in `cat mysqld_report-latest.txt | awk '{print $3}' | sort | uniq -c | awk '{print $2}' | tr -d '[]'` ; do echo -n `awk '{print $1" "$3}' mysqld_report-latest.txt | grep $i | awk '{print $1}' | tr -d '%' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
8.68% mysqld
3.51% kernel.kallsyms
1.24% libc-2.19.so
0.98% unknown
0.27% libgalera_smm.so
0.18% libaio.so.1.0.1
0.15% libstdc++.so.6.0.20
0.11% libpthread-2.19.so
0.03% ld-2.19.so

But what concerns me more is the first line of perf report:

    12.55%  swapper          [kernel.kallsyms]            [k] intel_idle

I suspect that the odd CPU behaviour you are seeing is caused by power management.
Is it possible to disable any power management in this box?
Thanks

renecannao on 21 Mar 2017

👍3

All 76 comments

Hi Sinan,

Thank you for the report, seems like a bug: it is spinning somewhere.
Can you please:

check if there is anything odd reported in error log /var/lib/proxysql/proxysql.log
run the follow command with pt-pmp:

sudo pt-pmp --pid 3554 --iterations 30

run the follow command with perf, let it run for few second before killing it (ctrl+c):

sudo perf record -p 3554

Once perf record is completed, please run the follow and attach report.txt

sudo perf report > report.txt

Thanks

renecannao on 7 Dec 2016

I haven't notice any errors in proxysql.log, only some users activity logged.

Report file is attached.

report.txt

ayder on 7 Dec 2016

Thanks.
Could you please attach also the output of pt-pmp ?
From the output of perf it seems it is spinning on poll().

Is the two connections in use stop sending traffic, does the CPU usage go to zero?

renecannao on 7 Dec 2016

Sorry, I have already restarted the application.

Now pt-pmp is returning

Unhandled dwarf expression opcode 0xf3

rapidly

I'll check to drop connections when this occurs agains

ayder on 7 Dec 2016

I'm seeing similar with the latest 1.3.1 release. Attached is the result from a run under low load. I'll rerun this later under our full query load where we see two things happening:

CPU usage seems to creep up over time, starting at around 200-300% reported by top, to 700% or more hours later under the same query load
Once CPU usage is up we also notice a large increase in memory use by the proxysql process, which stays constantly low for many hours after a fresh start, until something happens and it balloons

Some stats: in full query load we have 20-40k qps, generating ~700MBit/s+ of db traffic. We recently switched to compressing mysql traffic to keep things in check until we optimize our db usage to reduce the amount of data we push into mysql and read out of it is a good test bed for proxysql performance.

Attached is a report you can see in the mean time but we'll rerun it later when more load is introduced

rep.txt

geotro on 18 Dec 2016

I met same problem with 1.3.1 release. The proxysql running about 2 hours. I found CPU usage very high.

ee90a5ab-a333-42ac-8bec-88ea36dd3a46

proxysql.log include some errors message:

2016-12-29 10:30:01 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot start a transaction within a transaction --- BEGIN
2016-12-29 10:30:08 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot commit - no transaction is active --- COMMIT
2016-12-29 10:30:08 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot start a transaction within a transaction --- BEGIN
2016-12-29 10:30:17 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot commit - no transaction is active --- COMMIT
2016-12-29 10:39:01 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot start a transaction within a transaction --- BEGIN
2016-12-29 10:39:02 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot commit - no transaction is active --- COMMIT

Attached perf command report file.

report.txt

taylor840326 on 29 Dec 2016

This CPU usage seems a spike caused by some spinloop.
@taylor840326 : are you graphing also ProxySQL's metrics?
Would be interesting to find a correlation between that spike and other metrics.

Thanks

renecannao on 29 Dec 2016

I use telegraf+grafana monitoring CPU load

taylor840326 on 5 Jan 2017

@renecannao version 1.2.6 is ok.

taylor840326 on 6 Jan 2017

@taylor840326 : which version is not ok?
When is not ok, do you have any profiling from perf or pt-pmp ?

renecannao on 6 Jan 2017

@renecannao I have tested v1.2.6/v1.3.1/v1.3.2 version, all the version have same problem.
I'm not sure when the problem happended. Maybe 2 hours or 6 hours or 12 hours.

attached the perf report file.
perf_1.2.6.txt

taylor840326 on 9 Jan 2017

@taylor840326 : this perf report doesn't _seem_ related to what reported in #870 .
Can you please run pt-pmp against proxysql?
Also please share your configuration (from mysql_servers , global_variables and mysql_query_rules)?

renecannao on 9 Jan 2017

@renecannao ok, I will use pt-pmp comamnd next time.

taylor840326 on 9 Jan 2017

proxysql-1.3.3 rpm running on RHEL 6.8 has the same issues

My Problem occurs mainly when there is a lag between the proxsql and mysql serves. ( Our case Proxysql is in branch office, and mysql is at production site)

# Samples: 3  of event 'cpu-clock'
# Event count (approx.): 750000
#
# Overhead  Command   Shared Object      Symbol
# ........  ........  .................  ...........
#
   100.00%  proxysql  [kernel.kallsyms]  [k] do_wait

ayder on 10 Feb 2017

@ayder , can you please run pt-pmp against proxysql?
When running ps you will notice two proxysql processes, one is the parent and the other is the child.
Assuming 12345 is the pid of the child process, you need to run something like:

pt-pmp --iterations 30 -p 12345

I really want to close this issue, but I need to understand where it is using a lot of CPU.
Thank you for your help.

renecannao on 10 Feb 2017

The output of pt-pmp command against the child process of pt-pmp is
Unhandled dwarf expression opcode 0xf3

    240 pthread_cond_wait,remove,ConsumerThread::run,start_thread(libpthread.so.0),clone(libc.so.6)
    120 epoll_wait(libc.so.6),MySQL_Thread::run,mysql_worker_thread_func_idles,start_thread(libpthread.so.0),clone(libc.so.6)
    118 poll(libc.so.6),MySQL_Thread::run,mysql_worker_thread_func,start_thread(libpthread.so.0),clone(libc.so.6)
     30 poll(libc.so.6),admin_main_loop,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),Query_Cache::purgeHash_thread,mysql_shared_query_cache_funct,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::run,libstdc++::??(libstdc++.so.6),start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::monitor_replication_lag,monitor_replication_lag_pthread,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::monitor_read_only,monitor_read_only_pthread,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::monitor_ping,monitor_ping_pthread,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::monitor_connect,monitor_connect_pthread,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),main
      2 ::??,::??,::??

ayder on 10 Feb 2017

@ayder , thank you very much!
From the output it seems that all the threads are waiting.
The threads in MySQL_Monitor are sleeping;
The Admin thread is waiting on poll();
The 4 main MySQL_Thread are waiting on poll();
The 4 auxiliary MySQL_Thread are waiting on epoll();
The 8 ConsumerThread() are waiting in the queue.
I am not sure what it is causing the CPU spike: from the above everything seems quiet, very idle.

Maybe it is possible that your system isn't able to manage so many threads.
Can you please reconfigure proxysql to use just one thread? Thanks

set mysql-threads=1;
save mysql variables to disk;
proxysql restart;

renecannao on 10 Feb 2017

After setting the
mysql-threads =1
Variable
Cpu usage is now %100 of 1 core of 2 core vm. (That was %200 before utilizing all the cores)

ayder on 15 Feb 2017

Without restarting proxysql, can you please try again to run pt-pmp , maybe with an increased number of iteration:

pt-pmp --iterations 60 -p 12345

Thank you.

renecannao on 15 Feb 2017

Oh the output this time is:

/usr/bin/proxysql (deleted): No such file or directory.
Cannot access memory at address 0x687fa290
60 ::??

I have not deleted the binary nor changed it.
[root@XXDBAPPSL01 ~]# ls -l /usr/bin/proxysql
-rwxr-xr-x 1 root root 13087538 Jan 21 00:10 /usr/bin/proxysql

ayder on 15 Feb 2017

@ayder : this probably mean that proxysql was upgraded.
You can verify this comparing the out of these two commands:

/usr/bin/proxysql --version

mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT @@version" -NB

renecannao on 15 Feb 2017

Well versions consistent, no upgrade has done.
I'll restart the aplicattion and wait again

[salyuruk@XXDBAPPSL01 ~]$ mysql --defaults-file=.proxysql.cfg -p -e "SELECT @@version" -NB
Enter password:
1.3.3-0-gf2b677a
[salyuruk@OSLDBAPPSL01 ~]$ proxysql --version
ProxySQL version 1.3.3-0-gf2b677a, codename Truls

ayder on 15 Feb 2017

This is getting really odd. Another hypothesis is that there are two proxies running.
Can you please run this:

ps aux | grep proxysql

renecannao on 15 Feb 2017

[root@XXDBAPPSL01 ~]# ps aux | grep proxysql
root 19223 0.0 0.2 51260 5124 ? S Feb10 0:00 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
root 19224 16.9 1.6 172348 31804 ? Sl Feb10 1217:10 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
root 60320 0.0 0.0 103324 848 pts/1 S+ 15:18 0:00 grep proxysql

Nope

ayder on 15 Feb 2017

Honestly, I am out of ideas for now.
The previous output of pt-pmp doesn't show anything odd, and now /usr/bin/proxysql (deleted): No such file or directory is very odd too.
Is there anything unusual reported in dmesg or syslog or proxysql log file?
What about lsof -n -p 19224 ?

renecannao on 15 Feb 2017

[root@XXDBAPPSL01 COMMAND PID USER FD TYPE proxysql 19224 root cwd DIR proxysql 19224 root rtd DIR proxysql 19224 root txt REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root 0r CHR proxysql 19224 root 1w REG proxysql 19224 root 2w REG proxysql 19224 root 3r FIFO proxysql 19224 root 4w FIFO proxysql 19224 root 5u REG proxysql 19224 root 6u REG proxysql 19224 root 7r FIFO proxysql 19224 root 8w FIFO proxysql 19224 root 9r FIFO proxysql 19224 root 10w FIFO proxysql 19224 root 11u IPv4 proxysql 19224 root 12u REG proxysql 19224 root proxysql 19224 root 14u IPv4 proxysql 19224 root proxysql 19224 root 16u IPv4 proxysql 19224 root 17u IPv4 proxysql 19224 root 18u IPv4 proxysql 19224 root 19u sock proxysql 19224 root 20u IPv4 proxysql 19224 root 21u IPv4 ~]# lsof -n -p 19224
DEVICE SIZE/OFF NODE NAME
253,0 4096 139114 /var/lib/proxysql
253,0 4096 2 /
253,0 12982906 390056 /usr/bin/proxysql (deleted)
253,0 1969952 395943 /usr/lib64/libcrypto.so.1.0.1e
253,0 448344 392447 /usr/lib64/libssl.so.1.0.1e
253,0 161704 396570 /lib64/ld-2.12.so
253,0 1930416 398992 /lib64/libc-2.12.so
253,0 146592 399382 /lib64/libpthread-2.12.so
253,0 23088 399347 /lib64/libdl-2.12.so
253,0 47760 399391 /lib64/librt-2.12.so
253,0 600048 399989 /lib64/libm-2.12.so
253,0 91096 392580 /lib64/libz.so.1.2.3
253,0 124640 392013 /lib64/libselinux.so.1
253,0 114496 399177 /lib64/libresolv-2.12.so
253,0 93320 391716 /lib64/libgcc_s-4.4.7-20120601.so.1
253,0 989840 400216 /usr/lib64/libstdc++.so.6.0.13
253,0 17256 392093 /lib64/libcom_err.so.2.1
253,0 12592 406828 /lib64/libkeyutils.so.1.3
253,0 177520 392097 /lib64/libk5crypto.so.3.1
253,0 46368 392717 /lib64/libkrb5support.so.0.1
253,0 280520 392095 /lib64/libgssapi_krb5.so.2.2
253,0 948840 392087 /lib64/libkrb5.so.3.3
253,0 27896 391297 /lib64/libnss_dns-2.12.so
253,0 66432 391299 /lib64/libnss_files-2.12.so
1,3 0t0 4217 /dev/null
253,0 584020 129989 /var/lib/proxysql/proxysql.log
253,0 584020 129989 /var/lib/proxysql/proxysql.log
0,8 0t0 20510499 pipe
0,8 0t0 20510499 pipe
253,0 33792 139105 /var/lib/proxysql/proxysql.db
253,0 33792 139105 /var/lib/proxysql/proxysql.db
0,8 0t0 20510502 pipe
0,8 0t0 20510502 pipe
0,8 0t0 20510503 pipe
0,8 0t0 20510503 pipe
20510505 0t0 TCP *:6033 (LISTEN)
0,9 0 4213 [eventpoll]
13u unix 0xffff88007d1c4540 0t0 20510506 /tmp/proxysql.sock
20510518 0t0 TCP 127.0.0.1:6032 (LISTEN)
15u unix 0xffff88007d006800 0t0 20510519 /tmp/proxysql_admin.sock
21257635 0t0 TCP 10.35.100.197:36666->10.35.78.78:mysql (ESTABLISHED)
21274282 0t0 TCP 10.35.100.197:59106->10.35.71.23:mysql (ESTABLISHED)
21235705 0t0 TCP 10.35.100.197:6033->10.238.236.15:49559 (ESTABLISHED)
0,6 0t0 21235708 can't identify protocol
21269066 0t0 TCP 10.35.100.197:6033->10.238.236.15:56218 (ESTABLISHED)
21269069 0t0 TCP 10.35.100.197:37218->10.35.78.78:mysql (ESTABLISHED)

ayder on 15 Feb 2017

Hi,
Today we put proxysql on our production server, and we are facing the same issue. We haven't loaded any query rules yet

> select @@version
1.3.4-0-gd158707

proxysql

select * from global_variables
global.txt

alejg on 21 Feb 2017

@alejg: can you please run pt-pmp as described above?
Also, would be possible to generate a core dump of the running process using gcore ?
Thanks

renecannao on 21 Feb 2017

Yes,
proxysql-pt-pmp.txt
I'll send the core dump to an email (at sysown.com) ok?

Thanks!

alejg on 21 Feb 2017

Thank you @alejg .
My email is rene.[email protected] .
One more question: which package are you using? Which OS?

But before sending the core...
Looking the output of pt-pmp I have the feeling that proxysql is not spinning in a loop without reason, but actually serving a lot of traffic:

49 send (to backend)
7 send (to client)
4 recv (from client)
3 recv (from backend)
1 send (to backend)
1 recv (from backend)

From the screenshot I can also deduce that mysqld running on the same box is quite busy too.
Therefore it is possible that the high CPU utilization you are seeing in ProxySQL is not because of a bug, but because it is really doing a lot of work.
Can you please also run these commands and let me know the output?

mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM stats_mysql_connection_pool; SHOW MYSQL STATUS;"
sleep 30
mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM stats_mysql_connection_pool; SHOW MYSQL STATUS;"

Thanks

renecannao on 21 Feb 2017

Thank you @renecannao!!
CentOS Linux release 7.2.1511
proxysql-1.3.4-1-centos7.x86_64.rpm

You are right, is a very busy server which spends most of his time doing IO. Worked fine for several hours, now the CPU raised up the load average above 18.

Thanks!

alejg on 21 Feb 2017

Continuing this thread via email.
I will update the issue with further finding.

renecannao on 21 Feb 2017

👍1

I am also having this problem intermittently. My Server is idle with

Tasks: 126 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s): 38.9%us, 58.8%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi,  0.1%si,  1.2%st
Mem:   8059196k total,  7313936k used,   745260k free,   140376k buffers
Swap:        0k total,        0k used,        0k free,   974752k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                          
30931 proxysql  20   0  318m  86m 2624 S 363.7  1.1 120:20.48 proxysql                                                                                                         
19023 root      20   0  325m  57m 4832 S 14.3  0.7  89:14.53 mysqld_exporter                                                                                                   
28096 mysql     20   0 14.1g 1.2g  33m S  9.6 15.2   5:55.45 mysqld                                                                                                            
 1372 root      20   0  343m  24m 3288 S  8.0  0.3 797:42.80 node_exporter

# sudo strace -c -p 30931
Process 30931 attached
^CProcess 30931 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.72    4.380782      273799        16           nanosleep
  1.28    0.056584       56584         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    4.437366                    17           total

restarting proxysql solves the issue temporarily

doddys on 14 Mar 2017

@doddys , if you run strace without -f flag, strace will only check the main thread and nothing more: this is exactly what you got, a strace of the main thread that is doing nothing more but nanosleep() .

How frequently does it happen?
Would it be possible to generate a core dump of the running process using gcore ? From the output above it seems it uses really little memory, so a dump should take really a small amount of time (possibly a fraction of second).
Thanks

renecannao on 14 Mar 2017

let me try to get a core dump and get back to you

On Mar 14, 2017, at 8:14 PM, René Cannaò notifications@github.com wrote:

@doddys https://github.com/doddys , if you run strace without -f flag, strace will only check the main thread and nothing more: this is exactly what you got, a strace of the main thread that is doing nothing more but nanosleep() .

How frequently does it happen?
Would it be possible to generate a core dump of the running process using gcore https://sourceware.org/gdb/onlinedocs/gdb/gcore-man.html ? From the output above it seems it uses really little memory, so a dump should take really a small amount of time (possibly a fraction of second).
Thanks

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/sysown/proxysql/issues/826#issuecomment-286416977, or mute the thread https://github.com/notifications/unsubscribe-auth/AQmB1O57zQ8Kl_xBoS3TxTGV022njG44ks5rlpLNgaJpZM4LGjqA.

doddys on 14 Mar 2017

@renecannao this is the dump that I get from

gcore -o proxysql.dump 30931

please rename to tgz and extract

proxysql.dump.30931.pdf

doddys on 14 Mar 2017

@doddys : thanks a lot.
To confirm, this dump is generated while CPU usage was high, or idle? It needs to be while CPU usage is high.
Also, can you please let me know exactly which ProxySQL package you have installed?
Thanks

renecannao on 14 Mar 2017

@renecannao Yes, it is generated during the cpu ~300% like the top above

# rpm -qa | grep proxy
proxysql-1.3.2-1.1.x86_64
# uname -a
Linux centos-8gb-sgp1-01 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

It happens a few times already but I never pay attention when and how.
Let me know know if you need further info.
Thanks

doddys on 14 Mar 2017

Perfect!
I started looking into it, but it took me some time to figure out that the version installed is from Percona repository and from GitHub.
What version of CentOS are you using exactly? I tried on CentOS 6.7 but seems not the right one.
Thank you

renecannao on 14 Mar 2017

@renecannao yes it is from percona repo.
I am using

# cat /etc/redhat-release 
CentOS release 6.8 (Final)

i just notice that scheduler is invoking duplicate process like below

proxysql  7864 30931  1 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 3 2 1 1 /var/lib/proxysql/galera_hg3.log
proxysql  7866 30931  1 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 4 2 1 1 /var/lib/proxysql/galera_hg4.log
proxysql  7949  7866  0 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 4 2 1 1 /var/lib/proxysql/galera_hg4.log
proxysql  7953  7864  0 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 3 2 1 1 /var/lib/proxysql/galera_hg3.log
proxysql  7955  7949  0 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 4 2 1 1 /var/lib/proxysql/galera_hg4.log
proxysql  7956  7955  0 13:52 ?        00:00:00 timeout 10 mysql -nNE -uproxysql -h 10.130.34.56 -P 3306 -e SHOW STATUS LIKE 'wsrep_local_state'
proxysql  7957  7955  0 13:52 ?        00:00:00 tail -1
proxysql  7958  7956  5 13:52 ?        00:00:00 mysql -nNE -uproxysql -h 10.130.34.56 -P 3306 -e SHOW STATUS LIKE 'wsrep_local_state'
proxysql  7960  7953  0 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 3 2 1 1 /var/lib/proxysql/galera_hg3.log
proxysql  7961  7960  0 13:52 ?        00:00:00 timeout 10 mysql -nNE -uproxysql -h 10.130.34.56 -P 3306 -e SHOW STATUS LIKE 'wsrep_local_state'
proxysql  7962  7960  0 13:52 ?        00:00:00 tail -1
proxysql  7963  7961  4 13:52 ?        00:00:00 mysql -nNE -uproxysql -h 10.130.34.56 -P 3306 -e SHOW STATUS LIKE 'wsrep_local_state'
root      7965 16029  0 13:52 pts/0    00:00:00 grep proxysql
proxysql 30930     1  0 11:23 ?        00:00:00 /usr/bin/proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
proxysql 30931 30930 99 11:23 ?        03:56:50 /usr/bin/proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql

Is this normal?

btw for whatever reason the cpu drop back to normal now. it lasted for about 1 hour

doddys on 14 Mar 2017

Ok, I will start looking into the dump using 6.8 .
With regards to the duplicate processes, there is no mechanism to limit the number of processes started by scheduler.
What's the output of SELECT * FROM scheduler ?

renecannao on 14 Mar 2017

below are the ouput

mysql> select * from scheduler;
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
| id | active | interval_ms | filename                                        | arg1 | arg2 | arg3 | arg4 | arg5                                          | comment |
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
| 1  | 0      | 10000       | /usr/bin/proxysql_galera_checker                | 0    | 0    | 3    | 1    | /var/lib/proxysql/proxysql_galera_checker.log |         |
| 2  | 0      | 10000       | /usr/bin/proxysql_galera_checker                | 2    | 2    | 3    | 1    | /var/lib/proxysql/galera_hg2.log              |         |
| 3  | 1      | 10000       | /var/lib/proxysql/proxysql_galera_checker_v2.sh | 3    | 2    | 1    | 1    | /var/lib/proxysql/galera_hg3.log              |         |
| 4  | 1      | 10000       | /var/lib/proxysql/proxysql_galera_checker_v2.sh | 4    | 2    | 1    | 1    | /var/lib/proxysql/galera_hg4.log              |         |
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
4 rows in set (0.00 sec)

mysql> select * from runtime_scheduler;
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
| id | active | interval_ms | filename                                        | arg1 | arg2 | arg3 | arg4 | arg5                                          | comment |
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
| 1  | 0      | 10000       | /usr/bin/proxysql_galera_checker                | 0    | 0    | 3    | 1    | /var/lib/proxysql/proxysql_galera_checker.log |         |
| 2  | 0      | 10000       | /usr/bin/proxysql_galera_checker                | 2    | 2    | 3    | 1    | /var/lib/proxysql/galera_hg2.log              |         |
| 3  | 1      | 10000       | /var/lib/proxysql/proxysql_galera_checker_v2.sh | 3    | 2    | 1    | 1    | /var/lib/proxysql/galera_hg3.log              |         |
| 4  | 1      | 10000       | /var/lib/proxysql/proxysql_galera_checker_v2.sh | 4    | 2    | 1    | 1    | /var/lib/proxysql/galera_hg4.log              |         |
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
4 rows in set (0.00 sec)

as you can see I have 2 active schedulers, but it spawn more than 1 process per scheduler at the same time.

doddys on 14 Mar 2017

_maybe_ it is not spawning more than 1 process per scheduler at the same time.
It should spawn a process ever 10 seconds, and check the backends.
The fact that from ps you can see two checks against 10.130.34.56 is curious: a check should be very very fast, so if you manage to catch two checks against the same hosts maybe the checks are being slow, and a scheduled process takes longer than 10 seconds to complete, and therefore there is an overlap.

renecannao on 14 Mar 2017

The check is fast. The script finished within 1 second. the weird thing eventhough two process running the same script, I can only see a single set of log entry

Tue Mar 14 14:22:04 UTC 2017 ###### proxysql_galera_checker.sh SUMMARY ######
Tue Mar 14 14:22:04 UTC 2017 Hostgroup writers 4
Tue Mar 14 14:22:04 UTC 2017 Hostgroup readers 2
Tue Mar 14 14:22:04 UTC 2017 Number of writers 1
Tue Mar 14 14:22:04 UTC 2017 Writers are readers 1
Tue Mar 14 14:22:04 UTC 2017 log file /var/lib/proxysql/galera_hg4.log
Tue Mar 14 14:22:05 UTC 2017 ###### HANDLE WRITER NODES ######
Tue Mar 14 14:22:05 UTC 2017 --> Checking WRITE server 4:10.130.34.56:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 4:10.130.34.56:3306 is already ONLINE: 1 of 1 write nodes
Tue Mar 14 14:22:05 UTC 2017 --> Checking WRITE server 4:10.130.33.225:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 4:10.130.33.225:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Mar 14 14:22:05 UTC 2017 --> Checking WRITE server 4:10.130.33.238:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 4:10.130.33.238:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Mar 14 14:22:05 UTC 2017 ###### HANDLE READER NODES ######
Tue Mar 14 14:22:05 UTC 2017 --> Checking READ server 2:10.130.34.56:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 2:10.130.34.56:3306 is already ONLINE
Tue Mar 14 14:22:05 UTC 2017 --> Checking READ server 2:10.130.33.225:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 2:10.130.33.225:3306 is already ONLINE
Tue Mar 14 14:22:05 UTC 2017 --> Checking READ server 2:10.130.33.238:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 2:10.130.33.238:3306 is already ONLINE
Tue Mar 14 14:22:05 UTC 2017 ###### SUMMARY ######
Tue Mar 14 14:22:05 UTC 2017 --> Number of writers that are 'ONLINE': 1 : hostgroup: 4
Tue Mar 14 14:22:05 UTC 2017 --> Number of readers that are 'ONLINE': 3 : hostgroup: 2
Tue Mar 14 14:22:05 UTC 2017 ###### Loading mysql_servers config into runtime ######

Tue Mar 14 14:22:15 UTC 2017 ###### proxysql_galera_checker.sh SUMMARY ######
Tue Mar 14 14:22:15 UTC 2017 Hostgroup writers 4
Tue Mar 14 14:22:15 UTC 2017 Hostgroup readers 2
Tue Mar 14 14:22:15 UTC 2017 Number of writers 1
Tue Mar 14 14:22:15 UTC 2017 Writers are readers 1
Tue Mar 14 14:22:15 UTC 2017 log file /var/lib/proxysql/galera_hg4.log
Tue Mar 14 14:22:15 UTC 2017 ###### HANDLE WRITER NODES ######
Tue Mar 14 14:22:15 UTC 2017 --> Checking WRITE server 4:10.130.34.56:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 4:10.130.34.56:3306 is already ONLINE: 1 of 1 write nodes
Tue Mar 14 14:22:15 UTC 2017 --> Checking WRITE server 4:10.130.33.225:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 4:10.130.33.225:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Mar 14 14:22:15 UTC 2017 --> Checking WRITE server 4:10.130.33.238:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 4:10.130.33.238:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Mar 14 14:22:15 UTC 2017 ###### HANDLE READER NODES ######
Tue Mar 14 14:22:15 UTC 2017 --> Checking READ server 2:10.130.34.56:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 2:10.130.34.56:3306 is already ONLINE
Tue Mar 14 14:22:15 UTC 2017 --> Checking READ server 2:10.130.33.225:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 2:10.130.33.225:3306 is already ONLINE
Tue Mar 14 14:22:15 UTC 2017 --> Checking READ server 2:10.130.33.238:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 2:10.130.33.238:3306 is already ONLINE
Tue Mar 14 14:22:15 UTC 2017 ###### SUMMARY ######
Tue Mar 14 14:22:15 UTC 2017 --> Number of writers that are 'ONLINE': 1 : hostgroup: 4
Tue Mar 14 14:22:15 UTC 2017 --> Number of readers that are 'ONLINE': 3 : hostgroup: 2
Tue Mar 14 14:22:15 UTC 2017 ###### Loading mysql_servers config into runtime ######

doddys on 14 Mar 2017

@doddys :
I couldn't yet find anything odd in the dump.
It seems that the instance is indeed idle:

the uptime of this proxy was 6930 seconds
the number of queries executed is 17229
there were 412 connections

What bugs me is servers_table_version = 2544 : that should means that LOAD MYSQL SERVERS TO RUNTIME was executed 2543 times, or about once every 2.72 seconds.

Can you please share proxysql's error log? (/var/lib/proxysql/proxysql.log)
I would like to dig into it.

Thank you

renecannao on 14 Mar 2017

@renecannao I did notice that there are a lot of reload in the proxysql.log but I am not sure it was normal or not.

please rename to tgz and extract.

proxysql.log.pdf

doddys on 15 Mar 2017

@doddys : there is a bug in proxysql_galera_checker.sh , because LOAD MYSQL SERVERS TO RUNTIME is always called even if no changes are performed.
Speaking with @geotro few days ago, I think we have a solution for this.
Calling LOAD MYSQL SERVERS TO RUNTIME has some costs, and it is possible that calling it too often has unexpected side effect. I will investigate more.
Thank you for the file!

renecannao on 15 Mar 2017

@doddys , interesting finding!
Number of LOAD MYSQL SERVERS per hour:

$ egrep '^2017-03-14 .*LOAD MYSQL SERVERS' proxysql.log | cut -b 1-13 | sort | uniq -c
 714 2017-03-14 00
 706 2017-03-14 01
 704 2017-03-14 02
 704 2017-03-14 03
 706 2017-03-14 04
 702 2017-03-14 05
 704 2017-03-14 06
 704 2017-03-14 07
 842 2017-03-14 08
1408 2017-03-14 09
1416 2017-03-14 10
1408 2017-03-14 11
1350 2017-03-14 12
1011 2017-03-14 13
 708 2017-03-14 14
 706 2017-03-14 15
 710 2017-03-14 16
 710 2017-03-14 17
 714 2017-03-14 18
 712 2017-03-14 19
 712 2017-03-14 20
 714 2017-03-14 21
 714 2017-03-14 22
  74 2017-03-14 23

From the above, it seems that something happens between 8am and 9am, and something changed again between 1pm and 2pm.
This is indeed the case:

$ egrep -v '^HID' proxysql.log | egrep -v '(Received LOAD MYSQL SERVERS TO RUNTIME command|New mysql_replication_hostgroups table)' | egrep '^2017-03-14.*INFO'
2017-03-14 08:47:18 [INFO] Received save scheduler from mem command
2017-03-14 08:47:25 [INFO] Received load scheduler from mem command
2017-03-14 08:47:36 [INFO] Received save mysql servers from mem command
2017-03-14 08:47:47 [INFO] Received load mysql servers from mem command
2017-03-14 08:49:46 [INFO] Received save mysql users from mem command
2017-03-14 08:49:59 [INFO] Received load mysql users from mem command
2017-03-14 08:51:18 [INFO] Received load mysql query rules from mem command
2017-03-14 08:51:29 [INFO] Received save mysql query rules from mem command
2017-03-14 08:54:19 [INFO] Received load scheduler from mem command
2017-03-14 08:54:24 [INFO] Received save scheduler from mem command
2017-03-14 08:56:42 [INFO] Received load scheduler from mem command
2017-03-14 08:56:47 [INFO] Received save scheduler from mem command
2017-03-14 08:59:38 [INFO] Received load scheduler from mem command
2017-03-14 08:59:41 [INFO] Received save scheduler from mem command
2017-03-14 11:23:31 [INFO] Starting shutdown...
2017-03-14 11:23:32 [INFO] Shutdown completed!
2017-03-14 11:23:32 [INFO] Exiting...
2017-03-14 11:23:32 [INFO] Shutdown angel process
2017-03-14 11:23:32 [INFO] Starting ProxySQL
2017-03-14 11:23:32 [INFO] Sucessfully started
2017-03-14 11:23:32 [INFO] Angel process started ProxySQL process 30931
2017-03-14 12:51:55 [INFO] Received load scheduler from mem command
2017-03-14 12:52:08 [INFO] Received save scheduler to mem command
2017-03-14 13:52:14 [INFO] Received load scheduler from mem command
2017-03-14 13:52:19 [INFO] Received save scheduler from mem command

Scheduler entries changed:

between 8:47 and 8:59am: they probably doubled
at 12:51:55 : probably no real changes
at 13:52:14 : they probably halves

ProxySQL was restarted at 11:23:32 , that means that the core dump you sent was generated at 13:19:02 (6930 seconds after 11:23:32).

In https://github.com/sysown/proxysql/issues/826#issuecomment-286429201 you commented that :

you were seeing double processes at 13:52
the issue is now solved, after about 1 hour

This is actually very interesting, because commands related to the scheduler were executed at 12:52ish and 13:52ish : exactly one hour apart.
Do you recall:
a) what did you change in the scheduler at these times?
b) if you run ps before of after running the commands on the scheduler?

Do you mind sharing the files /var/lib/proxysql/galera_hg*.log ? I would like to see if I can find anything else interesting during this time window.

renecannao on 15 Mar 2017

Looking at the number of LOAD MYSQL SERVERS per hour I actually missed a detail, and I should have checked more granularity, using:

egrep '^2017-03-14 1[1234].*LOAD MYSQL SERVERS' proxysql.log | cut -b 1-16 | sort | uniq -c

The number of LOAD MYSQL SERVERS per minute:

24 calls/min till 12:51
18 calls/min from 12:52 till 13:51
12 calls/min from 13:52

@doddys : can you please try this modified version of proxysql_galera_checker.sh ?
proxysql_galera_checker-v2.sh.zip

@taylor840326 , @aydancoskun : are you also using proxysql_galera_checker.sh ?

renecannao on 15 Mar 2017

@renecannao I did make some config changes after proxy restart at 11 ish ...

I applied the script that your provided. I notice a few things differ from previous behavior

password is pass through command which give a warning
mysql: [Warning] Using a password on the command line interface can be insecure.
ONLINE writer is not flagged based on weight anymore.

# watch -n 1 'mysql --login-path=proxysql -t -e "select * from stats_mysql_connection_pool where hostgroup in (2,3,4,94,92,93) order by hostgroup,srv_host ;" -e " select hostgroup_id,hostname,status,weight,comment from mysql_servers where hostgroup_id in (2,3,4)  order by hostgroup_id,hostname ;"'

Every 1.0s: mysql --login-path=proxysql -t -e "select * from stats_mysql_connection_pool where hostgroup in (2,3,4,94,92,93) order by hostgrou...  Wed Mar 15 02:23:21 2017

+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host      | srv_port | status       | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_ms |
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 2         | 10.130.33.225 | 3306     | ONLINE       | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 205        |
| 2         | 10.130.33.238 | 3306     | ONLINE       | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 1384       |
| 2         | 10.130.34.56  | 3306     | ONLINE       | 0        | 13       | 13     | 0       | 19678   | 519672          | 186768          | 447        |
| 3         | 10.130.33.225 | 3306     | OFFLINE_SOFT | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 205        |
| 3         | 10.130.33.238 | 3306     | ONLINE       | 200  | 0        | 594    | 0       | 3806    | 460314          | 2670117         | 1384       |
| 3         | 10.130.34.56  | 3306     | OFFLINE_SOFT | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 447        |
| 4         | 10.130.33.225 | 3306     | OFFLINE_SOFT | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 205        |
| 4         | 10.130.33.238 | 3306     | OFFLINE_SOFT | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 1384       |
| 4         | 10.130.34.56  | 3306     | ONLINE       | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 447        |
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
+--------------+---------------+--------------+------------+-----------+
| hostgroup_id | hostname      | status       | weight     | comment   |
+--------------+---------------+--------------+------------+-----------+
| 2            | 10.130.33.225 | ONLINE       | 1          | PRIMARY   |
| 2            | 10.130.33.238 | ONLINE       | 1          | BACKUP    |
| 2            | 10.130.34.56  | ONLINE       | 1000000000 | BACKUP    |
| 3            | 10.130.33.225 | ONLINE       | 1          | BACKUP    |
| 3            | 10.130.33.238 | OFFLINE_SOFT | 1000000000 | PRIMARY   |
| 3            | 10.130.34.56  | OFFLINE_SOFT | 1          | BACKUP    |
| 4            | 10.130.33.225 | ONLINE       | 100000     | SECONDARY |
| 4            | 10.130.33.238 | OFFLINE_SOFT | 1          | BACKUP    |
| 4            | 10.130.34.56  | OFFLINE_SOFT | 100000000  | PRIMARY   |
+--------------+---------------+--------------+------------+-----------+

Server logs status is proxysql.log seems to dissapear

attached is the galera_hg logs

galera_hg.pdf

anyway, if the cpu spike again, is there anything you would like me to capture?

doddys on 15 Mar 2017

It seems I need to merge this checker with the checker from Percona.

anyway, if the cpu spike again, is there anything you would like me to capture?

Yes. Please refer to this comment.
Thanks.

renecannao on 15 Mar 2017

I'm getting this within a few hours of starting up, for about a month now.

[root@LION:~]# mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT @@version" -NB
1.3.4-0-gd158707
[root@LION:~]# ps aux | grep proxysql
root     28041  0.0  0.0  16800 13148 ?        Ssl  Mar16   0:26 /srv/proxysql_exporter -web.listen-address=:42004
root     28482  0.0  0.0  34452  2972 ?        S    Mar16   0:00 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
root     28484 87.9  0.4 336200 71332 ?        Sl   Mar16 5223:20 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
[root@LION:~]# uname -a
Linux LION 4.9.0-2-amd64 #1 SMP Debian 4.9.13-1 (2017-02-27) x86_64 GNU/Linux
[root@LION:~]# mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM scheduler" -NB
1   1   5000    /var/lib/proxysql/proxysql_galera_checker.sh    1   2   1   1   /var/lib/proxysql/proxysql_galera_checker.log   
[root@LION:~]# mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM runtime_scheduler" -NB
1   1   5000    /var/lib/proxysql/proxysql_galera_checker.sh    1   2   1   1   /var/lib/proxysql/proxysql_galera_checker.log

screenshot from 2017-03-20 14-54-13

bweston92 on 20 Mar 2017

@bweston92 : from the graphs, it I am reading them correctly, it seems you have spiky CPU usage, and not a constantly high CPU usage. This could be anything, even traffic related.
Yet, in case it is proxysql_galera_checker.sh, can you modify it as reported in https://github.com/sysown/proxysql/issues/826#issuecomment-286611276 ?
Thanks

While I am on this issue...
@doddys : any more CPU weirdness after the change?

renecannao on 20 Mar 2017

@renecannao the spike we have it at most 10 users a minute. Not only that I'd expect some correlation with mysqld
screenshot from 2017-03-20 15-58-41

bweston92 on 20 Mar 2017

@renecannao I have not encountered the same problem anymore but I have upgraded the to 1.3.4 due to some crash during high number of connections. I will report when the problem reappear.

doddys on 20 Mar 2017

@renecannao also when I look at top it is just constantly 100% however everything still works fine.

bweston92 on 20 Mar 2017

@doddys : great!

@bweston92 : ok, this needs further investigation. Can you please collect some metrics as described in https://github.com/sysown/proxysql/issues/826#issuecomment-265441199 ?
Thanks

renecannao on 20 Mar 2017

Ok, see attached for report.txt
report.txt

pt-pmp --pid 28484 --iterations 30

[root@LION:~]# pt-pmp --pid 28484 --iterations 30
Mon 20 Mar 16:06:17 GMT 2017
84  ../sysdeps/unix/syscall-template.S: No such file or directory.
    240 pthread_cond_wait,wqueue::remove,ConsumerThread::run,start_thread,clone
    120 epoll_wait,MySQL_Thread::run,mysql_worker_thread_func_idles,start_thread,clone
    106 poll,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
     30 poll,admin_main_loop,start_thread,clone
     30 nanosleep,usleep,Query_Cache::purgeHash_thread,mysql_shared_query_cache_funct,start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::run,libstdc++::??(libstdc++.so.6),start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::monitor_replication_lag,monitor_replication_lag_pthread,start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::monitor_read_only,monitor_read_only_pthread,start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::monitor_ping,monitor_ping_pthread,start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::monitor_connect,monitor_connect_pthread,start_thread,clone
     30 nanosleep,usleep,main
      4 __waitpid,waitpid_thread,start_thread,clone
      4 nanosleep,usleep,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
      4 MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
      2 __random,rand,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
      2 __libc_send,my_send_async(mysql_async.c:174),vio_write(violite.c:441),net_real_write(net.c:438),net_flush(net.c:255),net_write_command(net.c:333),mthd_my_send_cmd(libmariadb.c:612),mysql_real_query(libmariadb.c:2538),mysql_real_query_start_internal(mysql_async.c:490),my_context_spawn(my_context.c:205)
      1 poll,child_mysql,start_thread,clone
      1 MySQL_Data_Stream::set_pollout,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
      1 __libc_send,my_send_async(mysql_async.c:174),vio_write(violite.c:441),net_real_write(net.c:438),net_flush(net.c:255),net_write_command(net.c:333),mthd_my_send_cmd(libmariadb.c:612),send_change_user_packet(my_auth.c:239),client_mpvio_write_packet(my_auth.c:239),native_password_auth_client(my_auth.c:132),run_plugin_auth(my_auth.c:646),mysql_change_user(libmariadb.c:2222),mysql_change_user_start_internal(mysql_async.c:846),my_context_spawn(my_context.c:205)

bweston92 on 20 Mar 2017

@bweston92 :
According to pt-pmp , proxysql is mainly sending traffic to backend.

But the output of perf is really really odd.
clear_page_c_e is a kernel call related to memory management, and SpookyHash::Final should be extremely fast! Is this a VM? Which Linux version?

renecannao on 20 Mar 2017

@renecannao nope bare metal.

[root@LION:~]# cat /etc/issue
Debian GNU/Linux 9 \n \l

[root@LION:~]# uname -a
Linux LION 4.9.0-2-amd64 #1 SMP Debian 4.9.13-1 (2017-02-27) x86_64 GNU/Linux

[root@LION:~]# cat /proc/cpuinfo | grep -c hyper
0

bweston92 on 20 Mar 2017

Let's start tuning ProxySQL to use the correct number of threads.
As your system seems to have 4 cores, and there are a lot of services running, I would configured ProxySQL to use only one core running this on Admin:

SET mysql-threads=1;
SAVE MYSQL VARIABLES TO DISK;
PROXYSQL RESTART;

Thanks

renecannao on 20 Mar 2017

By there is a lot you mean ProxySQL and MySQL alone right?
Also the CPU has just jumped right down, will report back if anything goes back up.
What may of caused this?

bweston92 on 20 Mar 2017

By "a lot" I mean that maybe there is more than ProxySQL and MySQL : I see also mongod and corosync using their amount of CPU.
I am not sure what causes the issue, but the output of perf report is really odd.
Let's start configuring the right number of threads, and let's see if the issue is related or not

renecannao on 20 Mar 2017

👍1

Ok left it over night and now it is back.

Threads is still set to one:

| Variable_name                          | Value                                        |
+----------------------------------------+----------------------------------------------+
| mysql-threads                          | 1                                            |
+----------------------------------------+----------------------------------------------+

[root@LION:~]# systemctl status proxysql
● proxysql.service - Cluster Controlled proxysql
   Loaded: loaded (/etc/init.d/proxysql; generated; vendor preset: enabled)
  Drop-In: /run/systemd/system/proxysql.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2017-03-16 11:52:19 GMT; 4 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 28473 ExecStart=/etc/init.d/proxysql start (code=exited, status=0/SUCCESS)
    Tasks: 13 (limit: 4915)
   Memory: 326.0M
      CPU: 4d 2h 1min 13.877s
   CGroup: /system.slice/proxysql.service
           ├─28482 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
           └─28484 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql

screenshot from 2017-03-21 08-22-09

bweston92 on 21 Mar 2017

@bweston92 : I am really sorry to see that you still have these spikes.
Did you modify proxysql_galera_checker.sh as suggested before?

After you modify proxysql_galera_checker.sh , please collect metrics again with perf, but this type using this command:

perf record -a

Leave it run for some time, like 15-30 minutes.
Then again:

perf report > report.txt

Thank you

renecannao on 21 Mar 2017

report-latest.txt

bweston92 on 21 Mar 2017

A bit of bash kung fu:

for i in `head -n 43723 report-latest.txt | tail -n 43712 | grep -v '0.00%' | awk '{print $2}' | sort | uniq -c | awk '{print $2}'` ; do echo -n `cut -b 1-29 report-latest.txt | grep "$i " | tr '%' ' ' | awk '{print $1}' | grep -v '0.00' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
37.04% swapper
22.33% proxysql
15.15% mysqld
5.61% corosync
4.26% mongod
2.74% rsync
1.93% kthreadd
1.72% proxysql_galera
1.38% node_exporter
0.9% kworker/3:0
0.81% mysql
0.77% rcu_sched
0.45% kworker/2:1
0.44% systemd
0.44% modprobe
0.41% jbd2/dm-7-8
0.33% ha_logd
0.33% flannel
0.32% date
0.28% pcs
0.21% bash
0.21% IPaddr2
0.19% ps
0.19% awk
0.17% docker
0.17% dbus-daemon
0.15% proxysql_export
0.14% mongodb_exporte
0.14% kworker/2:1H
0.13% ksoftirqd/2
0.11% systemctl
0.11% lrmd
0.11% attrd
0.1% tail
0.08% pacemakerd
0.08% crmd
0.07% kworker/3:1H
0.04% timeout
0.04% mkdir
0.04% grep
0.03% wsrep_sst_rsync
0.03% systemd-journal
0.03% proxySQLWatcher
0.03% jbd2/dm-2-8
0.03% iptables
0.02% sshd
0.02% send_arp
0.02% ntpd

The sum of these values is 99.43% , therefore the rounding error is minimal!
The above already show something interesting: proxysql is not the process/service using more cpu.

A bit more shell scripting, specifically to proxysql, a lot of time is spent in kernel space:

grep "proxysql " report-latest.txt | grep -v '0.00%' > proxysql_report-latest.txt
$ for i in `cat proxysql_report-latest.txt | awk '{print $3}' | sort | uniq -c | awk '{print $2}' | tr -d '[]'` ; do echo -n `awk '{print $1" "$3}' proxysql_report-latest.txt | grep $i | awk '{print $1}' | tr -d '%' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
13.35% kernel.kallsyms
6.9% proxysql
1.42% libpthread-2.24.so
0.31% unknown
0.29% libc-2.24.so
0.03% vdso
0.03% libstdc++.so.6.0.22

In user space, the amount of cpu time spent by proxysql is less than mysqld:

grep "mysqld " report-latest.txt | grep -v '0.00%' > mysqld_report-latest.txt
for i in `cat mysqld_report-latest.txt | awk '{print $3}' | sort | uniq -c | awk '{print $2}' | tr -d '[]'` ; do echo -n `awk '{print $1" "$3}' mysqld_report-latest.txt | grep $i | awk '{print $1}' | tr -d '%' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
8.68% mysqld
3.51% kernel.kallsyms
1.24% libc-2.19.so
0.98% unknown
0.27% libgalera_smm.so
0.18% libaio.so.1.0.1
0.15% libstdc++.so.6.0.20
0.11% libpthread-2.19.so
0.03% ld-2.19.so

But what concerns me more is the first line of perf report:

    12.55%  swapper          [kernel.kallsyms]            [k] intel_idle

I suspect that the odd CPU behaviour you are seeing is caused by power management.
Is it possible to disable any power management in this box?
Thanks

renecannao on 21 Mar 2017

👍3

@renecannao ok, however:

We have 3 boxes
They all have MySQL on them
They all have ProxySQL, only started when the "box" is master on pacemaker.

Only when the machine is running ProxySQL does this issue occur.

bweston92 on 21 Mar 2017

@bweston92 , to be clear, I am not saying "nothing wrong with ProxySQL" .
I am saying that the behavior you are seeing (and that I am not able to reproduce in hundreds of different servers) may be triggered by something specific in your system, and we need to understand what triggers it.

renecannao on 21 Mar 2017

We also encountered this, CPU usage around 200%.

I realized the log file keep recording this every 2 seconds:

2017-03-26 21:34:04 [INFO] New mysql_replication_hostgroups table
writer_hostgroup: 1 , reader_hostgroup: 2, TMS Base Writer and Reader Group
writer_hostgroup: 3 , reader_hostgroup: 4, TMS TM Writer and Reader Group
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
2017-03-26 21:34:06 [INFO] New mysql_replication_hostgroups table
writer_hostgroup: 1 , reader_hostgroup: 2, TMS Base Writer and Reader Group
writer_hostgroup: 3 , reader_hostgroup: 4, TMS TM Writer and Reader Group

OFFLINE_HARD looks weird here; the host is accessible otherwise.

oytuntez on 26 Mar 2017

This is the top output:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  997 root      20   0  276160  37768   5052 S 199.7  0.5  14575:48 proxysql

oytuntez on 27 Mar 2017

@oytuntez: the output you provided is not really helpful in debugging what is the problem.
This issue is full of examples of how to collect metric: the output of top gives no information.
Similarly about the entries in the error log: copying the last 200-300 lines gives more information than "this is recorded every 2 seconds", because, maybe, it is also written what cause these lines to be written

renecannao on 27 Mar 2017

I will collect more once it occurs again. This was in production, so didn't have much time.

oytuntez on 27 Mar 2017

i'm am also experiencing this issue. CPU drops to 0% for a few seconds, swap spikes to 100% usage, then CPU spikes to 100% across all cores. I have a fairly simple setup. 3 servers, singlewrite mode. Some times it goes days without an issue, Other-times a few hours. In all cases, when the CPU drops to 0%, proxysql stops handling requests and CPU spikes.

I first thought it was a configuration error so I reinstalled the entire OS and re-setup proxysql. But it started happening last night again. I'm using proxysql version 1.3.6-1 on Ubuntu 16.04.2 LTS. I am running 5.7.18-15-57-log Percona XtraDB Cluster (GPL), Release rel15, Revision 7693d6e, WSREP version 29.20, wsrep_29.20.

What information would you need from me to help you debug if this is a proxysql issue?

NextStepGuru on 15 Jun 2017

@NextStepGuru : try 1.3.7 , this patch may solve random hangs.
But what concerns me in your comment is swap spiking to 100% : this could be a cause for unpredictable consequences. Is it proxysql swapping? How much memory is using?

renecannao on 15 Jun 2017

I have deployed multi-cluster MySQL with Jelastic Cloud and also have peaks at 200% CPU for no apparent reason.

The servers are isolated, there is only ProxySQL on it, the only connection that uses it is mine and ProxySQL always ends up racing.

I am using version 2.0.12 with mysql 8.0.20.