Salt: File states seem slower in 2016.3, especially on first cache retrieval

Created on 27 May 2016 · 19 comments · Source: saltstack/salt

I upgraded my VMs to 2016.3 and file states seem significantly slower on the first pass (10-20 times slower). On subsequent state.apply calls, things seem fine again.
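
A minimal way to reproduce the comparison, run on the minion (a sketch; the cache path is the one that appears in the debug log below):

rm -rf /var/cache/salt/minion/files/base   # force a cold file cache
time salt-call state.apply -l debug        # first pass: slow
time salt-call state.apply -l debug        # second pass: fast again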

For every file that the minion has to fetch from the master, I see about 30-40 occurrences of these log lines in the minion log:

2016-05-27 12:52:34,313 [   DEBUG] [17955] [salt.transport.zeromq] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506', 'aes')
2016-05-27 12:52:34,313 [   DEBUG] [17955] [salt.crypt] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')
2016-05-27 12:52:34,319 [   DEBUG] [17955] [salt.crypt] Re-using SAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')

Followed by:

2016-05-27 12:52:35,930 [   DEBUG] [17955] [salt.fileclient] In saltenv 'base', looking at rel_path '|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** attempting ** 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] No dest file found 

Then some more of the zeromq-related log lines, then this:

2016-05-27 12:52:37,632 [    INFO] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** done ** '|salt/minion/configs/master.conf'
2016-05-27 12:52:37,633 [   DEBUG] [17955] [salt.utils.jinja] Jinja search path: '['/var/cache/salt/minion/files/base']'

I had a look at logs from before the upgrade, and there were nowhere near as many log lines related to re-authentication.
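
A rough way to quantify the difference, assuming the default minion log location and a saved pre-upgrade copy (the second path is hypothetical):

grep -c "Initializing new AsyncZeroMQReqChannel" /var/log/salt/minion
grep -c "Initializing new AsyncZeroMQReqChannel" /var/log/salt/minion.pre-upgrade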

vagrant@orch:~$ salt-call --versions
Salt Version:
           Salt: 2016.3.0

Dependency Versions:
           cffi: 0.8.6
       cherrypy: 3.2.3
       dateutil: 2.2
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: 1.2.3
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.4 
        machine: x86_64
        release: 3.16.0-4-amd64
         system: Linux
        version: debian 8.4
Labels: Bug, Confirmed, Core, P1, ZRELEASED - 2016.3.2, fixed-pending-your-verification, severity-high


All 19 comments

Similar experience after upgrading to 2016.3.0. The first highstate run on a clean install is significantly slower, and subsequent runs are also slower than on 2015.8.10.

Salt Version:
Salt: 2016.3.0

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 3.10.0-327.13.1.el7.x86_64
system: Linux
version: CentOS Linux 7.2.1511 Core

@anlutro I am able to replicate this and am seeing a performance difference with a small test case. Here it is:

[root@ch3ll-master salt]# cat issues/33575.sls 
{% for file in range(1,20) %}
add-file-{{ file }}:
  file.managed:
    - name: /tmp/test
    - source: salt://test{{ file }}
{% endfor %}
[root@ch3ll-master salt]# cat /srv/salt/top.sls 
base:
  '*':
    - issues.33575
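
For the states above to have something to fetch, the salt://test{{ file }} sources need to exist on the master; a minimal sketch, assuming /srv/salt is the base file_roots directory:

for i in $(seq 1 19); do echo "test file $i" > /srv/salt/test$i; done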

When running the following:

time sudo salt 'ch3ll-mas*' state.highstate -ldebug

on 2015.8.10:

real    0m2.728s
user    0m0.888s
sys     0m0.157s

on 2016.3.0

real    0m3.795s
user    0m0.931s
sys     0m0.151s

@cachedout do you think that this might be part of the multi-master caching file client fix?

Hmmm, that's a good thought. Could be. Trying to find the culprit with a git bisect could be a good approach here.
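
A rough bisect sketch, assuming the v2015.8.10 and v2016.3.0 release tags and the timed highstate above as the test:

git clone https://github.com/saltstack/salt.git && cd salt
git bisect start
git bisect bad v2016.3.0     # slow
git bisect good v2015.8.10   # fast
# at each step: install the checked-out tree (e.g. pip install -e .), restart
# master and minion, re-run the timed highstate, then mark the result:
git bisect good   # or: git bisect bad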

Looks like I have the same problem. salt.fileclient is super slow. Some examples from the log:

2016-06-09 09:14:00,996 [salt.fileclient  ][INFO    ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/taskhandle.py'
2016-06-09 09:14:04,187 [salt.fileclient  ][INFO    ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/utils.py'

It took ~4 seconds to copy a 2891-byte file!

Yeah, we're experiencing this issue too... a job that used to take about 10 minutes now takes 3 hours. Most of the time is spent copying about 8000 images from the master.

We also have this problem: a highstate now takes about 15 minutes on each server, so for now I have reverted back to 2015.8.
I would say that this is a good reason for a bugfix release.

Yes, we are still hunting this one down.

This should be resolved here and available in 2016.3.1:

https://github.com/saltstack/salt/pull/33896

Phew! Thanks @cachedout! I thought this had been missed for a minute!

So, can this be closed?

Fixed by #33896

Just upgraded again, to 2016.3.1, and it still takes a long time to process...

The salt-minion keeps using 100% of the CPU until it finishes the highstate.

Running salt-minion -l garbage, the log constantly shows AsyncZeroMQReqChannel initializations, and only every 1-2 seconds do I finally see a normal state being applied:

[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156465323904 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 12483
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183639104 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156185678768 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] In saltenv 'base', looking at rel_path 'templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG   ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[INFO    ] Fetching file from saltenv 'base', ** skipped ** latest already in cache 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG   ] Jinja search path: ['/var/cache/salt/minion/files/base']
[INFO    ] File /etc/nsswitch.conf is in the correct state
[INFO    ] Completed state [/etc/nsswitch.conf] at time 20:25:29.541349 duration_in_ms=5715.604
[INFO    ] Running state [/etc/rsyslog.conf] at time 20:25:29.541611
[INFO    ] Executing state file.managed for /etc/rsyslog.conf
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156182034624 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[TRACE   ] ret_val = None
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156182063512 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156185516080 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183189480 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156157189872 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156181971744 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156184754224 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183288000 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')

Running strace -enetwork -f shows this in a loop:

[pid  7894] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 48
[pid  7894] connect(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  7894] getsockopt(48, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid  7894] setsockopt(48, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid  7894] setsockopt(48, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid  7894] setsockopt(48, SOL_TCP, TCP_KEEPIDLE, [300], 4) = 0
[pid  7894] getpeername(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, [16]) = 0
[pid  7894] recvfrom(48, "\377\0\0\0\0\0\0\0\1\177", 12, 0, NULL, NULL) = 10
[pid  7894] recvfrom(48, 0x7f5340000ef2, 2, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7894] sendto(48, "\377\0\0\0\0\0\0\0\1\177\3", 11, 0, NULL, 0) = 11
[pid  7894] recvfrom(48, "\3\0", 2, 0, NULL, NULL) = 2
[pid  7894] recvfrom(48, "NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 52, 0, NULL, NULL) = 52
[pid  7894] recvfrom(48, 0x7f5340003200, 8192, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7894] sendto(48, "\0NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 53, 0, NULL, 0) = 53
[pid  7894] sendto(48, "\4&\5READY\vSocket-Type\0\0\0\3REQ\10Identity\0\0\0\0", 40, 0, NULL, 0) = 40
[pid  7894] recvfrom(48, "\4)\5READY\vSocket-Type\0\0\0\6ROUTER\10Identity\0\0\0\0", 8192, 0, NULL, NULL) = 43
[pid  7893] +++ exited with 0 +++
[pid  7894] +++ exited with 0 +++

So this exchange is performed several times per second, and with the encryption overhead it eats all the CPU of the machine.

After reverting to 2015.8, "Initializing new AsyncZeroMQReqChannel" shows up only 5 times in the garbage output during a full highstate, versus the thousands of times seen on 2016.3.1.

I can't reproduce this problem in 2016.3.1; I'll leave another comment if it pops up again.

@markahopper this (specifically what @danielmotaleite pointed out) is huge for us.

@danielmotaleite, @anlutro, @ahammond
As @thatch45 pointed out above, this bug is fixed by #33896. Currently only the 2016.3 branch and the develop branch contain the fix.
You should either wait for the 2016.3.2 release or apply #33896 yourself.
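
A rough sketch of applying the PR as a patch to an installed 2016.3.1 (the site-packages path is an assumption and varies by distro, e.g. dist-packages on Debian):

curl -L https://github.com/saltstack/salt/pull/33896.patch -o 33896.patch
cd /usr/lib/python2.7/site-packages && patch -p1 < 33896.patch
# then restart salt-master and salt-minion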

Oh, odd. I was sure it was merged before 2016.3.1. I guess it's just a coincidence that I can't reproduce it.

Ohh, I see! As this ticket was marked as fixed and closed before 2016.3.1 was released, we all assumed that version had the fix.

I tested with 2016.3.1 plus the fix patch applied... and it worked!
It still seems slower than 2015.8, but it's now a lot faster than it was without the fix.

Thanks, I will wait for the 2016.3.2 release.
