Salt: File states seem slower in 2016.3, especially on first cache retrieval

Created on 27 May 2016 · 19 comments · Source: saltstack/salt

I upgraded my VMs to 2016.3 and file states seem significantly slower on the first pass (10-20 times slower). On subsequent state.apply calls, things seem fine again.
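
A minimal way to reproduce the comparison, run on the minion (a sketch; the cache path is the one that appears in the debug log below):

rm -rf /var/cache/salt/minion/files/base   # force a cold file cache
time salt-call state.apply -l debug        # first pass: slow
time salt-call state.apply -l debug        # second pass: fast again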

For every file that the minion has to fetch from the master, I see about 30-40 occurrences of these log lines in the minion log:

2016-05-27 12:52:34,313 [   DEBUG] [17955] [salt.transport.zeromq] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506', 'aes')
2016-05-27 12:52:34,313 [   DEBUG] [17955] [salt.crypt] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')
2016-05-27 12:52:34,319 [   DEBUG] [17955] [salt.crypt] Re-using SAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')

Followed by:

2016-05-27 12:52:35,930 [   DEBUG] [17955] [salt.fileclient] In saltenv 'base', looking at rel_path '|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** attempting ** 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] No dest file found 

Then some more of the zeromq-related log lines, then this:

2016-05-27 12:52:37,632 [    INFO] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** done ** '|salt/minion/configs/master.conf'
2016-05-27 12:52:37,633 [   DEBUG] [17955] [salt.utils.jinja] Jinja search path: '['/var/cache/salt/minion/files/base']'

I had a look at logs from before the upgrade, and there were nowhere near as many log lines related to re-authentication.
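
A rough way to quantify the difference, assuming the default minion log location and a saved pre-upgrade copy (the second path is hypothetical):

grep -c "Initializing new AsyncZeroMQReqChannel" /var/log/salt/minion
grep -c "Initializing new AsyncZeroMQReqChannel" /var/log/salt/minion.pre-upgrade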

vagrant@orch:~$ salt-call --versions
Salt Version:
           Salt: 2016.3.0

Dependency Versions:
           cffi: 0.8.6
       cherrypy: 3.2.3
       dateutil: 2.2
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: 1.2.3
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.4 
        machine: x86_64
        release: 3.16.0-4-amd64
         system: Linux
        version: debian 8.4
Labels: Bug, Confirmed, Core, P1, ZRELEASED - 2016.3.2, fixed-pending-your-verification, severity-high


All 19 comments

Similar experience after upgrading to 2016.3.0. The first highstate run on a clean install is significantly slower, and subsequent runs are also slower than on 2015.8.10.

Salt Version:
Salt: 2016.3.0

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 3.10.0-327.13.1.el7.x86_64
system: Linux
version: CentOS Linux 7.2.1511 Core

@anlutro I am able to replicate this and am seeing a performance difference with a small test case. Here it is:

[root@ch3ll-master salt]# cat issues/33575.sls 
{% for file in range(1,20) %}
add-file-{{ file }}:
  file.managed:
    - name: /tmp/test
    - source: salt://test{{ file }}
{% endfor %}
[root@ch3ll-master salt]# cat /srv/salt/top.sls 
base:
  '*':
    - issues.33575
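
For the states above to have something to fetch, the salt://test{{ file }} sources need to exist on the master; a minimal sketch, assuming /srv/salt is the base file_roots directory:

for i in $(seq 1 19); do echo "test file $i" > /srv/salt/test$i; done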

When running the following:

time sudo salt 'ch3ll-mas*' state.highstate -ldebug

on 2015.8.10:

real    0m2.728s
user    0m0.888s
sys     0m0.157s

on 2016.3.0

real    0m3.795s
user    0m0.931s
sys     0m0.151s

@cachedout do you think that this might be part of the multi-master caching file client fix?

Hmmm, that's a good thought. Could be. Trying to find the culprit with a git bisect could be a good approach here.
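
A rough bisect sketch, assuming the v2015.8.10 and v2016.3.0 release tags and the timed highstate above as the test:

git clone https://github.com/saltstack/salt.git && cd salt
git bisect start
git bisect bad v2016.3.0     # slow
git bisect good v2015.8.10   # fast
# at each step: install the checked-out tree (e.g. pip install -e .), restart
# master and minion, re-run the timed highstate, then mark the result:
git bisect good   # or: git bisect bad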

Looks like I have the same problem. salt.fileclient is super slow. Some examples from the log:

2016-06-09 09:14:00,996 [salt.fileclient  ][INFO    ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/taskhandle.py'
2016-06-09 09:14:04,187 [salt.fileclient  ][INFO    ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/utils.py'

It took ~4 seconds to copy a 2891-byte file!

Yeah, we're experiencing this issue too... a job that used to take about 10 minutes now takes 3 hours. Most of the time is spent copying about 8000 images from the master.

We also have this problem: a highstate now takes about 15 minutes on each server, so for now I have reverted back to 2015.8.
I would say that this is a good reason for a bugfix release.

Yes, we are still hunting this one down.

This should be resolved here and available in 2016.3.1:

https://github.com/saltstack/salt/pull/33896

Phew! Thanks @cachedout! I thought this had been missed for a minute!

So, can this be closed?

Fixed by #33896

Just upgraded again, to 2016.3.1, and it still takes a long time to process...

The salt-minion keeps using 100% of the CPU until it finishes the highstate.

Running salt-minion -l garbage, the log constantly shows AsyncZeroMQReqChannel initializations, and only every 1-2 seconds do I finally see a normal state being applied:

[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156465323904 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 12483
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183639104 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156185678768 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] In saltenv 'base', looking at rel_path 'templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG   ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[INFO    ] Fetching file from saltenv 'base', ** skipped ** latest already in cache 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG   ] Jinja search path: ['/var/cache/salt/minion/files/base']
[INFO    ] File /etc/nsswitch.conf is in the correct state
[INFO    ] Completed state [/etc/nsswitch.conf] at time 20:25:29.541349 duration_in_ms=5715.604
[INFO    ] Running state [/etc/rsyslog.conf] at time 20:25:29.541611
[INFO    ] Executing state file.managed for /etc/rsyslog.conf
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156182034624 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[TRACE   ] ret_val = None
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156182063512 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156185516080 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183189480 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156157189872 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156181971744 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156184754224 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183288000 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')

Running strace -enetwork -f shows this in a loop:

[pid  7894] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 48
[pid  7894] connect(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  7894] getsockopt(48, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid  7894] setsockopt(48, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid  7894] setsockopt(48, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid  7894] setsockopt(48, SOL_TCP, TCP_KEEPIDLE, [300], 4) = 0
[pid  7894] getpeername(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, [16]) = 0
[pid  7894] recvfrom(48, "\377\0\0\0\0\0\0\0\1\177", 12, 0, NULL, NULL) = 10
[pid  7894] recvfrom(48, 0x7f5340000ef2, 2, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7894] sendto(48, "\377\0\0\0\0\0\0\0\1\177\3", 11, 0, NULL, 0) = 11
[pid  7894] recvfrom(48, "\3\0", 2, 0, NULL, NULL) = 2
[pid  7894] recvfrom(48, "NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 52, 0, NULL, NULL) = 52
[pid  7894] recvfrom(48, 0x7f5340003200, 8192, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7894] sendto(48, "\0NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 53, 0, NULL, 0) = 53
[pid  7894] sendto(48, "\4&\5READY\vSocket-Type\0\0\0\3REQ\10Identity\0\0\0\0", 40, 0, NULL, 0) = 40
[pid  7894] recvfrom(48, "\4)\5READY\vSocket-Type\0\0\0\6ROUTER\10Identity\0\0\0\0", 8192, 0, NULL, NULL) = 43
[pid  7893] +++ exited with 0 +++
[pid  7894] +++ exited with 0 +++

So this exchange is performed several times per second, and with the encryption overhead it eats all the CPU of the machine.

After reverting to 2015.8, "Initializing new AsyncZeroMQReqChannel" shows up only 5 times in the garbage output during a full highstate, versus the thousands of times seen on 2016.3.1.

I can't reproduce this problem in 2016.3.1; I'll leave another comment if it pops up again.

@markahopper this (specifically what @danielmotaleite pointed out) is huge for us.

@danielmotaleite, @anlutro, @ahammond
As @thatch45 pointed out above, this bug is fixed by #33896. Currently only the 2016.3 branch and the develop branch contain the fix.
You should either wait for the 2016.3.2 release or apply #33896 yourself.
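
A rough sketch of applying the PR as a patch to an installed 2016.3.1 (the site-packages path is an assumption and varies by distro, e.g. dist-packages on Debian):

curl -L https://github.com/saltstack/salt/pull/33896.patch -o 33896.patch
cd /usr/lib/python2.7/site-packages && patch -p1 < 33896.patch
# then restart salt-master and salt-minion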

Oh, odd. I was sure it was merged before 2016.3.1. I guess it's just a coincidence that I can't reproduce it.

Ohh, I see! As this ticket was marked as fixed and closed before 2016.3.1 was released, we all assumed that version had the fix.

I tested with 2016.3.1 plus the fix patch applied... and it worked!
It still seems slower than 2015.8, but it's now a lot faster than it was without the fix.

Thanks, I will wait for the 2016.3.2 release.
