Salt: Minion did not return

Created on 30 Jan 2015 · 52 comments · Source: saltstack/salt

Hi,

I have found the 2014.7 salt minion very unstable; it often does not return.
I am using 2014.7 at commit 3e2b366cd6f8c8bcb37eb55f59853178a0420072.

#salt 'console1*' test.ping -v 
Executing job with jid 20150130165611866138
-------------------------------------------

console1.xxx:
    Minion did not return. [No response]

Looking up the job also returns nothing:

salt-run jobs.lookup_jid 20150130165611866138

Minion debug log

2015-01-30 16:56:12,695 [salt.minion                                 ][INFO    ] /usr/local/salt/packages/salt/minion.py:927 User root Executing command test.ping with jid 20150130165611866138
2015-01-30 16:56:12,696 [salt.minion                                 ][DEBUG   ] /usr/local/salt/packages/salt/minion.py:933 Command details {'tgt_type': 'glob', 'jid': '20150130165611866138', 'tgt': 'console1*', 'ret': '', 'user': 'root', 'arg': [], 'fun': 'test.ping'}
2015-01-30 16:56:12,707 [salt.minion                                 ][INFO    ] /usr/local/salt/packages/salt/minion.py:1006 Starting a new job with PID 15879
2015-01-30 16:56:12,708 [salt.minion                                 ][INFO    ] /usr/local/salt/packages/salt/minion.py:1174 Returning information for job: 20150130165611866138
2015-01-30 16:56:12,841 [salt.crypt                                  ][DEBUG   ] /usr/local/salt/packages/salt/crypt.py:399 Decrypting the current master AES key
2015-01-30 16:56:12,842 [salt.crypt                                  ][DEBUG   ] /usr/local/salt/packages/salt/crypt.py:324 Loaded minion key: /etc/salt/pki/minion/minion.pem
2015-01-30 16:56:17,715 [salt.minion                                 ][INFO    ] /usr/local/salt/packages/salt/minion.py:927 User root Executing command saltutil.find_job with jid 20150130165616879623
2015-01-30 16:56:17,716 [salt.minion                                 ][DEBUG   ] /usr/local/salt/packages/salt/minion.py:933 Command details {'tgt_type': 'glob', 'jid': '20150130165616879623', 'tgt': 'console1*', 'ret': '', 'user': 'root', 'arg': ['20150130165611866138'], 'fun': 'saltutil.find_job'}
2015-01-30 16:56:17,726 [salt.minion                                 ][INFO    ] /usr/local/salt/packages/salt/minion.py:1006 Starting a new job with PID 15961
2015-01-30 16:56:17,730 [salt.minion                                 ][INFO    ] /usr/local/salt/packages/salt/minion.py:1174 Returning information for job: 20150130165616879623
2015-01-30 16:56:17,860 [salt.crypt                                  ][DEBUG   ] /usr/local/salt/packages/salt/crypt.py:399 Decrypting the current master AES key
2015-01-30 16:56:17,861 [salt.crypt                                  ][DEBUG   ] /usr/local/salt/packages/salt/crypt.py:324 Loaded minion key: /etc/salt/pki/minion/minion.pem

Minion error log

#less /var/log/salt/minion                             
2015-01-30 14:39:28,359 [salt.minion                                 ][CRITICAL] /usr/local/salt/packages/salt/minion.py:1734 An exception occurred while polling the minion
Traceback (most recent call last):
  File "/usr/local/salt/packages/salt/minion.py", line 1726, in tune_in_no_block
    self._do_socket_recv(socks)
  File "/usr/local/salt/packages/salt/minion.py", line 1760, in _do_socket_recv
    self._handle_payload(payload)
  File "/usr/local/salt/packages/salt/minion.py", line 866, in _handle_payload
    payload['sig'] if 'sig' in payload else None)
  File "/usr/local/salt/packages/salt/minion.py", line 897, in _handle_aes
    data = self.crypticle.loads(load)
  File "/usr/local/salt/packages/salt/crypt.py", line 791, in loads
    data = self.decrypt(data)
  File "/usr/local/salt/packages/salt/crypt.py", line 774, in decrypt
    raise AuthenticationError('message authentication failed')
AuthenticationError: message authentication failed
           Salt: 2014.7.0
         Python: 2.6.6 (r266:84292, Sep 12 2011, 14:03:14)
         Jinja2: 2.8-dev
       M2Crypto: 0.21.1
 msgpack-python: 0.4.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 13.0.2
           RAET: Not Installed
            ZMQ: 3.2.2
           Mako: 0.9.0

But 0.17.5 is very fast:

#time salt '*' test.ping -v 
Executing job with jid 20150130161206249724
-------------------------------------------
lvs.xxx:
    True
cache.xxx:
    True
web.xxx:
    True
log.xxx:
    True

real    0m0.666s
user    0m0.422s
sys     0m0.044s
           Salt: 0.17.5
         Python: 2.6.6 (r266:84292, Sep 12 2011, 14:03:14)
         Jinja2: 2.8-dev
       M2Crypto: 0.21.1
 msgpack-python: 0.4.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
         PyYAML: 3.10
          PyZMQ: 13.0.2
            ZMQ: 3.2.2
Labels: Bug, Core, P3, severity-medium, stale

All 52 comments

I can confirm I am also seeing this behavior using the v2014.7.1 tag. The worst part is that the minion 'acts' alive but is definitely not responding, so I cannot do anything with it. This appears to have happened after a saltutil.sync_modules.

2015-01-30 22:54:38,872 [salt.minion][CRITICAL] An exception occurred while polling the minion
Traceback (most recent call last):
  File "/opt/salt/lib/python2.6/site-packages/salt/minion.py", line 1747, in tune_in_no_block
    self._do_socket_recv(socks)
  File "/opt/salt/lib/python2.6/site-packages/salt/minion.py", line 1781, in _do_socket_recv
    self._handle_payload(payload)
  File "/opt/salt/lib/python2.6/site-packages/salt/minion.py", line 867, in _handle_payload
    payload['sig'] if 'sig' in payload else None)
  File "/opt/salt/lib/python2.6/site-packages/salt/minion.py", line 898, in _handle_aes
    data = self.crypticle.loads(load)
  File "/opt/salt/lib/python2.6/site-packages/salt/crypt.py", line 796, in loads
    data = self.decrypt(data)
  File "/opt/salt/lib/python2.6/site-packages/salt/crypt.py", line 779, in decrypt
    raise AuthenticationError('message authentication failed')
AuthenticationError: message authentication failed

           Salt: 2014.7.1
         Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
         Jinja2: 2.7.3
       M2Crypto: 0.22
 msgpack-python: 0.4.4
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.11
          ioflo: Not Installed
          PyZMQ: 14.5.0
           RAET: Not Installed
            ZMQ: 4.0.5
           Mako: Not Installed

Does restarting the minion do anything? Or forcing regeneration of keys for that minion (rm /etc/salt/pki/minion/*, rm /etc/salt/pki/master/minions/, restart master and minion, and re-accept keys)?

Yes, restarting the minion has been the fix for me. Here is what I see in the log: first the message authentication failure, then, after I restarted, the errors about not being able to deserialize the msgpack message.

2015-01-31 00:29:47,760 [salt.minion][CRITICAL] An exception occurred while polling the minion
Traceback (most recent call last):
  File "/opt/salt/lib/python2.6/site-packages/salt/minion.py", line 1747, in tune_in_no_block
    self._do_socket_recv(socks)
  File "/opt/salt/lib/python2.6/site-packages/salt/minion.py", line 1781, in _do_socket_recv
    self._handle_payload(payload)
  File "/opt/salt/lib/python2.6/site-packages/salt/minion.py", line 867, in _handle_payload
    payload['sig'] if 'sig' in payload else None)
  File "/opt/salt/lib/python2.6/site-packages/salt/minion.py", line 898, in _handle_aes
    data = self.crypticle.loads(load)
  File "/opt/salt/lib/python2.6/site-packages/salt/crypt.py", line 796, in loads
    data = self.decrypt(data)
  File "/opt/salt/lib/python2.6/site-packages/salt/crypt.py", line 779, in decrypt
    raise AuthenticationError('message authentication failed')
AuthenticationError: message authentication failed
2015-02-01 00:45:34,862 [salt.payload][CRITICAL] Could not deserialize msgpack message: In an attempt to keep Salt running, returning an empty dict.This often happens when trying to read a file not in binary mode.Please open an issue and include the following error: Unpack failed: error = 0
2015-02-01 00:45:34,862 [salt.payload][CRITICAL] Could not deserialize msgpack message: In an attempt to keep Salt running, returning an empty dict.This often happens when trying to read a file not in binary mode.Please open an issue and include the following error: Unpack failed: error = 0
2015-02-01 00:45:34,863 [salt.payload][CRITICAL] Could not deserialize msgpack message: In an attempt to keep Salt running, returning an empty dict.This often happens when trying to read a file not in binary mode.Please open an issue and include the following error: Unpack failed: error = 0
2015-02-01 00:45:36,211 [salt.payload][CRITICAL] Could not deserialize msgpack message: In an attempt to keep Salt running, returning an empty dict.This often happens when trying to read a file not in binary mode.Please open an issue and include the following error: Unpack failed: error = 0
2015-02-01 00:45:36,212 [salt.payload][CRITICAL] Could not deserialize msgpack message: In an attempt to keep Salt running, returning an empty dict.This often happens when trying to read a file not in binary mode.Please open an issue and include the following error: Unpack failed: error = 0
2015-02-01 00:45:36,212 [salt.payload][CRITICAL] Could not deserialize msgpack message: In an attempt to keep Salt running, returning an empty dict.This often happens when trying to read a file not in binary mode.Please open an issue and include the following error: Unpack failed: error = 0

I tried restarting the minion, and it was normal after the restart. But over time, "Minion did not return" keeps reappearing.

The salt-call command works normally, though.

@pitatus @terminalmage @thatch45

[*******lvs2 ~]# salt-call state.sls reactor -l debug 
[DEBUG   ] Reading configuration from /etc/salt/minion
[DEBUG   ] Guessing ID. The id can be explicitly in set /etc/salt/minion
[INFO    ] Found minion id from getfqdn(): lvs2.***
[DEBUG   ] loading log_handlers in ['/var/cache/salt/minion/extmods/log_handlers', '/usr/local/salt/packages/salt/log/handlers']
[DEBUG   ] Skipping /var/cache/salt/minion/extmods/log_handlers, it is not a directory
[DEBUG   ] /usr/local/salt/packages/salt/utils/parsers.py:171:parse_args Configuration file path: /etc/salt/minion
[DEBUG   ] /usr/local/salt/packages/salt/config.py:427:_read_conf_file Reading configuration from /etc/salt/minion
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:561:gen_functions loading grain in ['/var/cache/salt/minion/extmods/grains', '/usr/local/salt/packages/salt/grains']
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:587:gen_functions Skipping /var/cache/salt/minion/extmods/grains, it is not a directory
[DEBUG   ] /usr/local/salt/packages/salt/crypt.py:216:get_keys Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] /usr/local/salt/packages/salt/crypt.py:268:decrypt_aes Decrypting the current master AES key
[DEBUG   ] /usr/local/salt/packages/salt/crypt.py:216:get_keys Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] /usr/local/salt/packages/salt/crypt.py:216:get_keys Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:561:gen_functions loading module in ['/opt/lib/cdn/py/salt_module', '/var/cache/salt/minion/extmods/modules', '/usr/local/salt/packages/salt/modules']
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:617:gen_functions Skipping .init, it does not end with an expected extension
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:587:gen_functions Skipping /var/cache/salt/minion/extmods/modules, it is not a directory
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:617:gen_functions Skipping cytest.pyx, it does not end with an expected extension
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:617:gen_functions Skipping .grains.py.swp, it does not end with an expected extension
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:617:gen_functions Skipping .cp.py.swp, it does not end with an expected extension
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded localemod as virtual locale
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded groupadd as virtual group
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded rh_service as virtual service
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded yumpkg as virtual pkg
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded parted as virtual partition
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded linux_sysctl as virtual sysctl
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded mdadm as virtual raid
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded linux_acl as virtual acl
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded sysmod as virtual sys
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded rpm as virtual lowpkg
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded useradd as virtual user
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded grub_legacy as virtual grub
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded rh_ip as virtual ip
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded cmdmod as virtual cmd
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded virtualenv_mod as virtual virtualenv
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded linux_lvm as virtual lvm
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded djangomod as virtual django
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:561:gen_functions loading returner in ['/var/cache/salt/minion/extmods/returners', '/usr/local/salt/packages/salt/returners']
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:587:gen_functions Skipping /var/cache/salt/minion/extmods/returners, it is not a directory
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded couchdb_return as virtual couchdb
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded syslog_return as virtual syslog
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded carbon_return as virtual carbon
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded sqlite3_return as virtual sqlite3
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:561:gen_functions loading states in ['/var/cache/salt/minion/extmods/states', '/usr/local/salt/packages/salt/states']
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:587:gen_functions Skipping /var/cache/salt/minion/extmods/states, it is not a directory
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded saltmod as virtual salt
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded mdadm as virtual raid
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:764:gen_functions Loaded virtualenv_mod as virtual virtualenv
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:561:gen_functions loading render in ['/var/cache/salt/minion/extmods/renderers', '/usr/local/salt/packages/salt/renderers']
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:587:gen_functions Skipping /var/cache/salt/minion/extmods/renderers, it is not a directory
[DEBUG   ] /usr/local/salt/packages/salt/loader.py:617:gen_functions Skipping .py.py.swp, it does not end with an expected extension
[DEBUG   ] /usr/local/salt/packages/salt/minion.py:166:parse_args_and_kwargs Parsed args: ['reactor']
[DEBUG   ] /usr/local/salt/packages/salt/minion.py:167:parse_args_and_kwargs Parsed kwargs: {'__pub_fun': 'state.sls', '__pub_jid': '20150202153415445097', '__pub_pid': 30269, '__pub_tgt': 'salt-call'}

I tried rolling back to tag v2014.7.0; the problem persists.

I have the same issue with 2014.7.0 and 2014.7.1.
Master and minions are located in different datacenters, if that should matter.

$ sudo salt -v '*log*' state.highstate
Executing job with jid 20150202160533182941
-------------------------------------------

yr-log-1:
    Minion did not return. [No response]

There is no entry in the minion log.
Restarting salt-minion seems to reinstate contact between master and minion for a short period.
Once the minion and master have contact, keeping the connection alive with test.ping seems to work.

Running salt-call from the minion also works.

With the upgrading and downgrading that has been happening, can those of you who are affected try removing keys and re-adding them (all minions), and also deleting the cache (but not while ANY jobs are active) and restarting? Specifically: shut down master and minion, remove keys and cache, start up, and re-accept keys. Also check that both the master and all minions are running the same version (is there any chance a minion was upgraded before the master?).

I tried stopping master, deleting all keys and cache, restart all minions, start up master and then re-add keys.
Minions respond right away, but after about 5 minutes of inactivity, minions stop returning.

It's possible the routers/switches between the datacenters are timing out the connection (busy switches). When that happens, do you know if the minion still thinks it has a connection?

netstat -an | grep 4505

I suspect the TCP connection status will show ESTABLISHED

FYI, I can simulate this with two boxes: master on one, minion on the other; establish the connection, accept keys, etc.

Pull the network cable on the minion and do a test.ping from the master. The master will time out waiting for the minion to respond, and the network socket to the minion will be discarded. Plug the minion's network cable back in and check netstat -an on the minion: it shows an ESTABLISHED connection to the master. The minion (TCP stack) doesn't know the connection is invalid, and won't until it tries to use the socket.

In my case, the minion will eventually reconnect. However, if the network is busy enough that the routers/switches are removing idle connections, then there may be another complicating component in the mix.
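
For what it's worth, this half-open state is standard TCP behavior: an idle connection only discovers that the peer is gone when it next tries to send. Kernel TCP keepalives are the usual remedy. Below is a minimal Python sketch (assuming Linux socket options) of roughly what Salt's tcp_keepalive_* minion settings toggle on the underlying socket; it is an illustration, not Salt's actual code:

import socket

# Enable kernel TCP keepalives so a dead peer is detected even while the
# connection is idle (Linux-specific socket options assumed).
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # tcp_keepalive: True
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 80)   # seconds idle before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # unanswered probes before reset
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 20)  # seconds between probes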

@pitatus Thanks for investigating. It would seem you're right. (And I wonder why I didn't check connections myself...)

$ netstat -an | grep '4506'
tcp        0      0 10.0.1.10:44578         salt-master:4506     ESTABLISHED
$ netstat -an | grep '4505'
tcp        0      0 10.0.1.10:54115         salt-master:4505     ESTABLISHED

On the master, the minion did not return.
I'm really not sure how to go about this. I guess I could set up a cron job that runs salt-call on the minions to make sure the connection stays open.

I've been running things in Azure (both master and minions) and had these problems as well. At first I put up a cron job that pinged every minute (Azure seemed to drop connections idle for over 60 s), and it worked at first. But when the number of minions grew, this solution just swamped us.

I've got a test setup running with one master and one minion now, and activated the keepalive settings with intervals less than 60s on the minion. This seems to work (2014.7.1 minion, master from dev branch).

So far this seems to work! However, after a long idle period the first ping takes 10 s and the rest take 1 s, but that's another issue, I think. The connection seems to be there!

I also added keepalive settings on one of my minions, with intervals of 60 s.
So far, it seems like it's working for keeping minions responding.
Like @andrejohansson, I use Azure as a provider.

Hi @andrejohansson @pitatus,
My minion configuration is as follows. On the master, the minion did not return; on the minion, netstat shows an ESTABLISHED connection to the master.
I don't think this error occurs at the network layer; I use my own private network.

[**** ~]# netstat -antp | grep '4506'
tcp        0      0 ****.194:41260        ****.157:4506         ESTABLISHED 12480/python2.6     
tcp        0      0 ****.194:39668        ****.156:4506         ESTABLISHED 12480/python2.6     
[**** ~]# 
#more /etc/salt/minion
master: 
    - master1
    - master2

pidfile: /var/run/salt-minion.pid
log_fmt_console: '[%(levelname)-8s] %(pathname)s:%(lineno)d:%(funcName)s %(message)s'
log_fmt_logfile: '%(asctime)s,%(msecs)03.0f [%(name)-17s][%(levelname)-8s] %(pathname)s:%(lineno)d %(message)s'
log_level_logfile: debug

module_dirs: ['/opt/lib/py/salt_module']
minion_id_caching: False
tcp_keepalive: True
tcp_keepalive_idle: 80
tcp_keepalive_cnt: 3
tcp_keepalive_intvl: 20

Hi,
After the minion runs for a few days, the problem reappears.
I think this is a serious problem.
@pitatus @terminalmage @thatch45

@Jiaion What do you see from netstat for port 4505 on the master and minion when you are seeing this problem (does the master show an established connection with the minion experiencing the problem)? Also, what are the minion config settings for recon_*?

I can confirm the same issue after searching for it.
This is not a matter of the config settings used: mine are the defaults, since there is no need to change reconnection settings or regenerate keys except for troubleshooting purposes.

@pitatus
Yes, master and minion both show an established connection for the minion experiencing the problem.

Minion netstat

[root@ca** ~]# netstat -antp | awk '$5~/:(4505|4506)/ && $5~/23.(156|157)/'   
tcp        0      0 ****.23.166:47568        ****.23.157:4506         ESTABLISHED 7773/python2.6      
tcp        0      0 ****.23.166:47822        ****.23.156:4505         ESTABLISHED 7773/python2.6      
tcp        0      0 ****.23.166:37487        ****.23.156:4506         ESTABLISHED 7773/python2.6      
tcp        0      0 ****.23.166:39379        ****.23.157:4505         ESTABLISHED 7773/python2.6      
[root@ca** ~]#

Master netstat

[root@console** /root]
#netstat -antp | awk '$4~/:(4505|4506)/ && $5~/166/'    
tcp        0      0 0.0.0.0:4505                0.0.0.0:*                   LISTEN      107104/python2.6    
tcp        0      0 0.0.0.0:4506                0.0.0.0:*                   LISTEN      107350/python2.6      
tcp        0      0 ****.23.156:4506         ****.23.166:37487        ESTABLISHED 107350/python2.6    
tcp        0      0 ****.23.156:4505         ****.23.166:47822        ESTABLISHED 107104/python2.6 

I'm not able to reproduce this, so if there is any other data (logs, other evidence, etc.), please add it to this issue to help find a solution.

Hi @pitatus,
The minion raises the exception at line 779 of crypt.py. Under what circumstances would result be non-zero?
How long has your minion been running?
I suspect my multiple-master configuration is the cause; I am testing.

763     def decrypt(self, data):
...
774         result = 0
775         for zipped_x, zipped_y in zip(mac_bytes, sig):
776             result |= ord(zipped_x) ^ ord(zipped_y)
777         if result != 0:
778             log.debug('Failed to authenticate message')
779             raise AuthenticationError('message authentication failed')

(from /usr/local/salt/packages/salt/crypt.py)

I'm having the same issue. My servers are located in different Datacenters of Hetzner hosting network.

All minions and the master are 2014.7.1

It looks like after some period of inactivity, the first command to a minion times out and a subsequent command succeeds. For example, when I run 'salt * test.ping' on the master for the first time, some minions do not return; the next time I ping, more minions return; and the third time, all of them return.

I have logs at INFO level, don't see anything related to this problem.

I was trying to find an optimal configuration of minion and master parameters in the config files (reconnects, timeouts, etc.), but that didn't change anything.

The only thing that solves the problem is running this command in crontab every 5 minutes:

*/5 * * * * /usr/bin/salt '*' test.ping > /dev/null

Having this in crontab makes minions always return and not time out on commands.

Hi @pitatus,
This problem occurs in the case of multiple masters.

/etc/salt/minion
master:
- master1
- master2

@pitatus ?

Not sure if this is directly relevant, but I have a couple of (separate) masters, each with its own swarm of minions. Each of the masters is running Helium (2014.7.1), and the minions in each system are a mix of 2014.1.11, 2014.7.0, and 2014.7.1. I am consistently seeing that all servers on anything 2014.7.x respond more slowly to the test module and are less often able to respond to my first test.ping. All 2014.1.11 servers respond, and in a fairly timely manner.

A Salt-User has reported the following to SaltStack:


Effectively this kills our ability to use salt-api and the Cloudify salt plugin or the salt step in Rundeck, because it's not reliable.


ZD-218

I'm using the salt-api (in a deploy script) despite the problem with "sleepy minions".
The test.ping crontab workaround (see above) works great :)

I'm also running into this.

Hey guys! I found a solution for this! Now moving to ansible :)

I also encountered this issue; it turned out that at some point the salt master had hung, so we ended up with two salt-master processes. Our minion behavior was that everything worked perfectly when doing salt-call from the minion, but any time we tried to run a salt state from the master, it was a crap-shoot whether it would time out, fail with "authentication failed", or hit some other error. Once we killed the extra process and rebooted, we haven't had an issue since. The bizarre thing is that we actually performed an upgrade of the salt-master while this hung process was around, and it still didn't kill it; we had to kill it manually and then reboot for good measure.


@svetlyak40wt good idea..

Just upgraded from 0.17.2 to 2014.7.2 and everything went crazy. We have autoscaling in AWS and all our systems depend on Salt to work properly. We often find that we cannot deploy software on different machines due to this connection problem. The work-around is to restart the minions, but it becomes very frustrating.

Guys, as I said, just get this in your crontab on salt-master:

*/5 * * * * /usr/bin/salt '*' test.ping > /dev/null

That should do it.

Actually, downgrading to 0.17.5 solved most of the issues. It looks far more stable than 2014.x.y.

The issue may be related to the problem with ZMQ described here: http://lucumr.pocoo.org/2012/6/26/disconnects-are-good-for-you/
That article is discussed here: https://news.ycombinator.com/item?id=4161073

This might be fixed by the ret-port connection keepalives that we added recently. We added connection persistence for those in 2014.7, and they can die sometimes. That patch is in the latest 2014.7 branch (@cachedout, did that make it into 2014.7.4?) and in 2015.2.

Moving to RAET will probably fix the issue, right?

Yes, it should fix it as well

@Jiaion - has this been fixed?

@ssgward
I have not tested the latest version

Just confirmed: use of multiple masters is causing this issue for me as well. (Lithium)

Seeing the same problem on 2015.5.2

Not particularly helpful at this stage, I guess, but I'm seeing the same issue here.
Will try to nail down more debugging info tomorrow.

I have a similar issue:
# salt --versions-report
Salt: 2015.5.2
Python: 2.7.3 (default, Dec 18 2014, 19:10:20)
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.4.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 13.0.0
RAET: Not Installed
ZMQ: 3.2.2
Mako: Not Installed

When executing commands with cmd.run, minions sometimes return:
Minion did not return. [No response]

The commands still execute, but the output is empty.

I have had the same issue, only it was DNS related.
In my case the minion is called centos, but the name centos could not be resolved.

Sometimes it worked and sometimes it didn't. Is it possible to look up the minion by IP? The master was looking for centos in DNS; after adding the minion 'centos' to the master's /etc/hosts, I had no problems any more.
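
A quick way to check from the master whether a minion id resolves ('centos' here mirrors the comment above; substitute your own minion id):

import socket

# Does the minion id resolve from the master? A socket.gaierror here matches
# the DNS failure described above.
try:
    print(socket.gethostbyname('centos'))
except socket.gaierror as exc:
    print('cannot resolve minion id:', exc)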

Same issue here.
We detected the problem a while ago, and using a large timeout configuration on the master worked around it.
Now it has occurred in an initial highstate execution.

salt --versions
Salt: 2015.5.3
Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
Jinja2: 2.7.3
M2Crypto: 0.22
msgpack-python: 0.4.6
msgpack-pure: Not Installed
pycrypto: 2.6.1
libnacl: Not Installed
PyYAML: 3.11
ioflo: Not Installed
PyZMQ: 14.4.1
RAET: Not Installed
ZMQ: 4.0.5
Mako: 0.9.1
Tornado: Not Installed
Debian source package: 2015.5.3+ds-1trusty1

I have the same problem with the latest 2015.8.0

# salt --versions
Salt Version:
           Salt: 2015.8.0

Dependency Versions:
         Jinja2: 2.7.3
       M2Crypto: Not Installed
           Mako: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.7.0
         Python: 2.7.5 (default, Jun 24 2015, 00:41:19)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed

System Versions:
           dist: centos 7.1.1503 Core
        machine: x86_64
        release: 3.10.0-229.14.1.el7.x86_64
         system: CentOS Linux 7.1.1503 Core

Same issue:

$ salt --versions-report
           Salt: 2014.7.6-134-gd284eb1
         Python: 2.7.3 (default, Jun 22 2015, 19:33:41)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 14.5.0
           RAET: Not Installed
            ZMQ: 4.0.5
           Mako: Not Installed

Same issue here. Usually restarting the salt-minion service fixes it temporarily, but it eventually comes back later.

Manual commands such as
salt '*' cmd.run 'apt-get install vim'
work fine, but running pkg.installed in a state file for vim gets no response.

Salt Version:
           Salt: 2015.8.0

Dependency Versions:
         Jinja2: 2.8
       M2Crypto: Not Installed
           Mako: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.7.0
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.2
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-58-generic
         system: Ubuntu 14.04 trusty

Just confirming the same issue here.

salt version: 2014.7.0

I'm upset about setting up multiple masters for failover, because it only works for several minutes after a minion restart. After that, all minions go down with "Minion did not return".

I suspect it's related to the network delay between master and minions. I hope this problem will be solved.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

Same issue.

Salt Version:
           Salt: 2019.2.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8.1
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Aug  7 2019, 17:28:10)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.2
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.10.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core