I upgraded to 2016.3 in my VMs and file states are significantly slower on the first pass (10-20 times slower). However, on subsequent state.apply calls, things seem fine again.
For every file that the minion has to fetch from the master, I see about 30-40 occurrences of these log lines in the minion log:
2016-05-27 12:52:34,313 [ DEBUG] [17955] [salt.transport.zeromq] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506', 'aes')
2016-05-27 12:52:34,313 [ DEBUG] [17955] [salt.crypt] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')
2016-05-27 12:52:34,319 [ DEBUG] [17955] [salt.crypt] Re-using SAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')
Followed by:
2016-05-27 12:52:35,930 [ DEBUG] [17955] [salt.fileclient] In saltenv 'base', looking at rel_path '|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [ DEBUG] [17955] [salt.fileclient] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [ DEBUG] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** attempting ** 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [ DEBUG] [17955] [salt.fileclient] No dest file found
Then some more of the zeromq-related log lines, then this:
2016-05-27 12:52:37,632 [ INFO] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** done ** '|salt/minion/configs/master.conf'
2016-05-27 12:52:37,633 [ DEBUG] [17955] [salt.utils.jinja] Jinja search path: '['/var/cache/salt/minion/files/base']'
I had a look at logs from before the upgrade and there were nowhere near as many log lines related to re-auth.
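A rough way to quantify the re-auth churn is to count the channel initializations in the minion log (a sketch, assuming the default log location /var/log/salt/minion and debug logging enabled):

# how many times the req channel was re-initialized during the run
grep -c 'Initializing new AsyncZeroMQReqChannel' /var/log/salt/minion
# for comparison, how many file fetches actually completed
grep -c '\*\* done \*\*' /var/log/salt/minion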
vagrant@orch:~$ salt-call --versions
Salt Version:
Salt: 2016.3.0
Dependency Versions:
cffi: 0.8.6
cherrypy: 3.2.3
dateutil: 2.2
gitdb: 0.5.4
gitpython: 0.3.2 RC1
ioflo: Not Installed
Jinja2: 2.7.3
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.2
mysql-python: 1.2.3
pycparser: 2.10
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.9 (default, Mar 1 2015, 12:57:24)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.4.0
RAET: Not Installed
smmap: 0.8.2
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
System Versions:
dist: debian 8.4
machine: x86_64
release: 3.16.0-4-amd64
system: Linux
version: debian 8.4
Similar experience after upgrading to 2016.3.0. The first run of highstate on a clean install is significantly slower, and subsequent runs are also slower than on 2015.8.10.
Salt Version:
Salt: 2016.3.0
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 3.10.0-327.13.1.el7.x86_64
system: Linux
version: CentOS Linux 7.2.1511 Core
@anlutro I am able to replicate this and am seeing a performance difference with a small test case. Here's the test case:
[root@ch3ll-master salt]# cat issues/33575.sls
{% for file in range(1,20) %}
add-file-{{ file }}:
file.managed:
- name: /tmp/test
- source: salt://test{{ file }}
{% endfor %}
[root@ch3ll-master salt]# cat /srv/salt/top.sls
base:
'*':
- issues.33575
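For the salt:// sources in that state to resolve, the files test1 through test19 need to exist in the master's file roots; a quick way to create them (a sketch, assuming the default file_roots of /srv/salt):

# create dummy source files for the test case above (range(1,20) yields 1..19)
for i in $(seq 1 19); do echo "test file $i" > /srv/salt/test$i; done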
When running the following: time sudo salt 'ch3ll-mas*' state.highstate -ldebug
on 2015.8.10:
real 0m2.728s
user 0m0.888s
sys 0m0.157s
on 2016.3.0
real 0m3.795s
user 0m0.931s
sys 0m0.151s
@cachedout do you think that this might be part of the multimaster caching file client fix?
Hmmm, that's a good thought. Could be. Trying to find the culprit with a git bisect
could be a good approach here.
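A rough outline of that approach, assuming a checkout of the salt repo with the release tags available, using the test case above as the timing check at each step:

git bisect start
git bisect bad v2016.3.0       # slow
git bisect good v2015.8.10     # fast
# at each step: install the checked-out tree, time the test case, then mark it
#   git bisect good   (fast)   or   git bisect bad   (slow)
git bisect reset               # when finished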
Looks like I have the same problem: salt.fileclient is super slow. Some examples from the log:
2016-06-09 09:14:00,996 [salt.fileclient ][INFO ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/taskhandle.py'
2016-06-09 09:14:04,187 [salt.fileclient ][INFO ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/utils.py'
It took ~4 seconds to copy a 2891-byte file!
Yeah, we're experiencing this issue too... a job that used to take about 10 minutes now takes 3 hours. What takes the time is copying about 8000 images from the master.
We also have this problem: a highstate now takes about 15 minutes on each server, so for now I reverted back to 2015.8.
I would say that this is a good reason for a bugfix release.
Yes, we are still hunting this one down.
This should be resolved here and available in 2016.3.1:
Phew! Thanks @cachedout ! I thought this had been missed for a minute!
so can this be closed?
Fixed by #33896
Just upgraded again, to 2016.3.1, and it still takes a long time to process... the salt-minion keeps using 100% of the CPU until it finishes the highstate.
Running salt-minion -l garbage, the log is dominated by AsyncZeroMQReqChannel initializations, and only every 1-2 seconds does a normal state actually get applied:
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156465323904 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 12483
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156183639104 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156185678768 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG ] In saltenv 'base', looking at rel_path 'templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[INFO ] Fetching file from saltenv 'base', ** skipped ** latest already in cache 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG ] Jinja search path: ['/var/cache/salt/minion/files/base']
[INFO ] File /etc/nsswitch.conf is in the correct state
[INFO ] Completed state [/etc/nsswitch.conf] at time 20:25:29.541349 duration_in_ms=5715.604
[INFO ] Running state [/etc/rsyslog.conf] at time 20:25:29.541611
[INFO ] Executing state file.managed for /etc/rsyslog.conf
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156182034624 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[TRACE ] ret_val = None
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156182063512 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156185516080 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156183189480 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156157189872 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156181971744 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156184754224 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE ] Inserted key into loop_instance_map id 140156183288000 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
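The Completed state lines in the excerpt above carry a duration_in_ms field, so it is easy to rank where the time goes (a sketch, assuming the output is captured to a file such as the default minion log /var/log/salt/minion):

# list the slowest states by their reported duration
grep 'Completed state' /var/log/salt/minion | sort -t= -k2 -rn | head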
Doing strace -enetwork -f on the minion shows this in the loop:
[pid 7894] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 48
[pid 7894] connect(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 7894] getsockopt(48, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 7894] setsockopt(48, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 7894] setsockopt(48, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 7894] setsockopt(48, SOL_TCP, TCP_KEEPIDLE, [300], 4) = 0
[pid 7894] getpeername(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, [16]) = 0
[pid 7894] recvfrom(48, "\377\0\0\0\0\0\0\0\1\177", 12, 0, NULL, NULL) = 10
[pid 7894] recvfrom(48, 0x7f5340000ef2, 2, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 7894] sendto(48, "\377\0\0\0\0\0\0\0\1\177\3", 11, 0, NULL, 0) = 11
[pid 7894] recvfrom(48, "\3\0", 2, 0, NULL, NULL) = 2
[pid 7894] recvfrom(48, "NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 52, 0, NULL, NULL) = 52
[pid 7894] recvfrom(48, 0x7f5340003200, 8192, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 7894] sendto(48, "\0NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 53, 0, NULL, 0) = 53
[pid 7894] sendto(48, "\4&\5READY\vSocket-Type\0\0\0\3REQ\10Identity\0\0\0\0", 40, 0, NULL, 0) = 40
[pid 7894] recvfrom(48, "\4)\5READY\vSocket-Type\0\0\0\6ROUTER\10Identity\0\0\0\0", 8192, 0, NULL, NULL) = 43
[pid 7893] +++ exited with 0 +++
[pid 7894] +++ exited with 0 +++
So this handshake is being done several times per second, and with the encryption it eats all the CPU on the machine.
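A quick way to put a number on the reconnect rate (a sketch; it assumes a single salt-minion process and a master listening on port 4506):

# count new connections to the master over 10 seconds while a highstate runs
sudo timeout 10 strace -f -e trace=network -p "$(pidof -s salt-minion)" 2>&1 \
  | grep -c 'connect(.*htons(4506)'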
Reverting to 2015.8, "Initializing new AsyncZeroMQReqChannel" only shows up 5 times in the "garbage" output during the machine's highstate, versus thousands of times on 2016.3.1.
I can't reproduce this problem in 2016.3.1, I'll leave another comment if it should pop up again.
@markahopper this (specifically what @danielmotaleite pointed out) is huge for us.
@danielmotaleite, @anlutro, @ahammond
As @thatch45 pointed out above, this bug is fixed by #33896. Currently only the 2016.3 branch and the develop branch contain the fix.
You should either wait for the 2016.3.2 release or apply #33896 yourself.
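One way to apply it by hand (a sketch only; the .patch URL follows GitHub's usual convention and the install path of the salt package varies by distro and packaging):

# fetch the PR as a patch and apply it to the installed salt package
curl -L -o /tmp/33896.patch https://github.com/saltstack/salt/pull/33896.patch
cd /usr/lib/python2.7/site-packages    # adjust to wherever the salt package lives
sudo patch -p1 < /tmp/33896.patch
sudo systemctl restart salt-minion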
Oh, odd. I was sure it was merged before 2016.3.1. I guess it's just a coincidence that I can't reproduce it.
Ooh, I see! Since this ticket was marked as fixed and closed before 2016.3.1 was released, we all assumed that version had the fix.
I tested 2016.3.1 with the fix patch applied... and it worked!
It still seems slower than 2015.8, but it is a lot faster than it was without the fix.
Thanks, I will wait for the 2016.3.2 release.