I did a test.ping and had a handful of minions not responding. The version reports are below. At random times I can also have 20 clients not responding, though not right now. I have a cron job which restarts the service on my minions every hour. Some of these servers are on the other side of a WAN connection; because of the NAT'ing I have configured a TCP keepalive of 60 seconds for them.
Why do I keep losing my connection with the minions?
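Roughly, that keepalive setting looks like this in the minion config (a sketch of what I have; it assumes the tcp_keepalive_* minion options, which as I understand it only take effect with ZMQ 3.x):
# minion config sketch - probe idle connections after 60 seconds so the NAT mapping stays alive
tcp_keepalive: True
tcp_keepalive_idle: 60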
Master on CentOS 6
[root@saltstack ~]# salt --versions-report
Salt: 0.15.3
Python: 2.6.6 (r266:84292, Jul 10 2013, 22:48:45)
Jinja2: unknown
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3
[root@saltstack ~]# cat /etc/redhat-release
CentOS release 6.4 (Final)
[root@saltstack ~]#
CentOS 6.4 Minion
Salt: 0.15.3
Python: 2.6.6 (r266:84292, Feb 21 2013, 23:54:59)
Jinja2: unknown
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3
CentOS 6.4 Minion
Salt: 0.15.3
Python: 2.6.6 (r266:84292, Feb 22 2013, 00:00:18)
Jinja2: 2.2.1
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3
CentOS 5.9 Minion
Salt: 0.15.3
Python: 2.6.8 (unknown, Nov 7 2012, 14:47:34)
Jinja2: unknown
M2Crypto: 0.21.1
msgpack-python: 0.1.12
msgpack-pure: Not Installed
pycrypto: 2.3
PyYAML: 3.08
PyZMQ: 2.1.9
ZMQ: 2.1.9
Windows 2008 R2 64-bit Minion
Salt: 0.16.0
Python: 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.12
msgpack-pure: Not Installed
pycrypto: 2.3
PyYAML: 3.10
PyZMQ: 13.1.0
ZMQ: 3.2.2
The primary cause for minions losing their connection is ZMQ2, which I see on at least one of your version reports. Definitely upgrade to ZMQ3 to prevent many of these issues.
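If it helps to narrow down which minions are still on ZMQ2, the versions can usually be pulled remotely from the master with something like the following (a sketch; it assumes your salt version ships the test.versions_report module function, which prints each minion's Salt, Python, and ZMQ versions):
salt '*' test.versions_report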
However, we have been seeing a few reports of minions losing connection recently. Some problems are solved with a minion restart, some with a master restart. Is there anything in the logs of the disconnected minions or the master that might clue us into what is happening?
Also, are you using any IPv6? We're wondering if the recent problems are related to re-enabling IPv6 support.
Oh, and I only just connected you with the other "restarting the master" issue. =P Thanks for creating a separate issue.
Yeah, I am the person who reported my issue incorrectly on another ticket. We are not using IPv6.
I have looked at upgrading to ZMQ3, but I can't find any RPMs that work for my CentOS 5 machines. There are too many version conflicts when I have tried to get ZMQ3 onto my CentOS 5 machines. But I am not that worried, since in reality the issue is more widespread and is affecting my ZMQ3 clients just as much. If I could get the ZMQ3 machines to work well I would be happy.
Ya, this is high on our priority list. The problem is the difficulty of reproducing issues like this. =\
Good. I am now seeing 20 of my 30-odd minions go offline until the cron job restarts the minion; then they are reachable for about 5-15 minutes and then unreachable until the cron job runs again. I am not finding salt to be very useful in this situation. This happens for both ZMQ3 and ZMQ2 minions.
20 out of 30? That's high; I don't think I've seen anyone with that high a percentage of disconnects. We'll definitely look into it.
I'm having the same problem, but mine is reproducible.
If the master server's IP address changes and DNS is updated, the minions have to be restarted to regain connection. There was a fix about a year ago for this where the minion would re-resolve the IP address of the master if it lost connection.
I'm not sure if that is happening correctly now. The IP address of a salt master running in EC2 changes often when it's shut down and brought back up.
This is super annoying because my remote execution to restart the minion process is well...done by salt :)
My master's IP address is not changing; we have a static IP for it. This morning I have 19 servers not responding 30 minutes after the minion service was restarted.
I have a local cron job on each minion server that runs at 3 minutes past every hour to restart salt-minion, but I can only reach the minions right after the service restarts. This is a real pain when trying to evaluate this as a possible solution for my organisation.
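The crontab entry on each minion is roughly the following (a sketch from memory; it assumes the stock SysV init script named salt-minion that the CentOS packages install):
# restart the salt-minion service at 3 minutes past every hour
3 * * * * /sbin/service salt-minion restart >/dev/null 2>&1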
This is happening for minions that are on the same LAN, as well as minions connecting over a WAN, right?
Same LAN and across a WAN. My Windows box that I have put into salt is usually the first to go, and I know it is running ZMQ3. I haven't bothered putting more machines into salt because of this issue.
OK, I think I might have a solution for my Linux machines. We run 20-odd CentOS 5 machines, and without ZMQ3 installed, with its ability to configure TCP keepalives, all these machines go offline after some time. I tried using a cron job on the master which did a test.ping every 5 minutes to all devices as a poor man's keepalive, but my minions would still go offline occasionally.
I eventually found some time to build my own 32-bit and 64-bit RPMs for PyZMQ 13.1.0, which was the limiting factor before. There is a public repo for ZMQ3, but none for a version of PyZMQ that supports ZMQ3 on CentOS. I was getting library incompatibilities until I had the latest version of PyZMQ.
It is early days still, but I have yet to lose any minions since getting ZMQ3 onto my CentOS 5 machines. I really wish there were a public repo for ZMQ3 and PyZMQ 13.1.0, because I don't want to have to maintain my own RPMs for PyZMQ for both i386 and x64. I saw an open ticket about getting a salt repo set up for CentOS 5. That would be brilliant for others who plan to run CentOS 5 until its support ends in 2017 and also want to run salt.
My Windows machine still keeps dropping after 15 minutes or so. I have enabled keepalives on this Windows machine. I have a number more Windows machines I would like to use salt on, but am holding off until I can get it working reliably on at least one Windows machine. I have seen that I need to increase the timeout for Windows machines; I am using 45 seconds for a test.ping and still get no response.
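For what it's worth, this is roughly the command I'm testing with (a sketch; the target glob is a placeholder for my Windows minion's ID):
# 45-second timeout for the ping
salt -t 45 'win2008*' test.ping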
Windows Server 2008 R2 Datacenter SP1 Machine:
C:\salt> salt --versions-report
Salt: 0.16.2
Python: 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.12
msgpack-pure: Not Installed
pycrypto: 2.3
PyYAML: 3.10
PyZMQ: 13.1.0
ZMQ: 3.2.2
Master running CentOS 6.4
[root@saltstack ~]# salt --versions-report
Salt: 0.16.0
Python: 2.6.6 (r266:84292, Jul 10 2013, 22:48:45)
Jinja2: unknown
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3
When I run tcpdump and do a test.ping to my Windows machine, absolutely nothing shows in the dump from the master's perspective. When I restart the minion, I then see traffic in my tcpdump. Somehow the connection is dropping and I don't know how to test why this is happening.
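In case anyone wants to reproduce the capture, this is roughly what I ran on the master, filtered on salt's publish (4505) and return (4506) ports (a sketch; the interface name and the minion address are placeholders):
# 192.0.2.10 stands in for the Windows minion's IP
tcpdump -nn -i eth0 'host 192.0.2.10 and (port 4505 or port 4506)'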
I talked to @UtahDave (who's in charge of Windows development) and he says he's going to package up a new Windows installer with ZMQ 3.2.3 just to make sure that's not the issue. Then we can work from there.
Hi, I have the same problem with my Windows minions (0.16.2). Contact is lost after a few hours, except for one minion, which is in the same VLAN as the salt-master. I think it's a keepalive problem. A simple restart of the minion solves the problem. I installed 0.16.3 today; some news tomorrow ;)
Hello, today contact was lost with my Windows minion :-(. 0.16.3 didn't address this issue.
In the master log I can see some minion activity every hour:
2013-08-13 06:29:53,066 [salt.master ][INFO ] Authentication request from XML01.services.sib.fr
2013-08-13 06:29:53,067 [salt.master ][INFO ] Authentication accepted from XML01.services.sib.fr
2013-08-13 07:29:52,619 [salt.master ][INFO ] Authentication request from XML01.services.sib.fr
2013-08-13 07:29:52,619 [salt.master ][INFO ] Authentication accepted from XML01.services.sib.fr
2013-08-13 08:29:52,147 [salt.master ][INFO ] Authentication request from XML01.services.sib.fr
2013-08-13 08:29:52,147 [salt.master ][INFO ] Authentication accepted from XML01.services.sib.fr
But the minion doesn't respond to salt-master requests...
More tests:
I can use salt-call on the minion without any problems. A salt-call state.sls xxxx works like a charm. But on the master I still get no results.
[root@salt ~]# salt -v 'XML01*' test.version
Executing job with jid 20130813100802364942
XML01.services.sib.fr:
Minion did not return
Until I restart salt-minion.
Can I help you with a pcap capture or more detailed logs?
I think I am having the same or a similar issue: minions stop responding. I have 50+ minions out of 133 not responding now. I, too, cannot update ZMQ everywhere without massive pain.
@caseybea For ZMQ < 3, we really can't do anything. There are severe bugs in those versions which make the connection very unstable. Some have gotten around it by using cron to restart the minion service. But your best bet is still to upgrade ZMQ, unfortunately.
Damn. I'm not surprised to find out ZMQ 2.x is the main problem; the unfortunate part is that there's no clean way to install updates on a RHEL5 box with the zmq repo(s) out there, because there's a twisty maze of ugly dependencies that makes upgrading to ZMQ3 more or less unrealistic.
That said, when I have some time I will still try, and see how it goes.
(Meanwhile... I'm tossing in my vote to have SSH as an option for connectivity. Just in case a crafty developer is listening. :-)
Salt is still a really cool deal. I'll get it into production. Eventually.....
@basepi, I did more tests with my Windows minions. If I kill the TCP connection to the salt server on the minion side (it's marked as ESTABLISHED on the minion but doesn't exist on the server side) with a utility like TCPView, everything becomes OK. Do you think it's a ZMQ issue (ver 3.2.2)?
@caseybea salt-ssh is shipping with 0.17. Your wish is our command! =P (Though it's way slower than ZMQ, as expected)
@equinoxefr Strange that it's ESTABLISHED on minion and doesn't show up at all on master. That shouldn't be possible, right? o.O We've actually heard of a few different people having connection problems recently on Windows, we're still trying to track down the cause. But 3.2.2 is the most up-to-date version of ZMQ on Windows, so if it's ZMQ, it's an unfixed bug.
Wooo! I didn't know salt-ssh was already in the mix. This is good news for those of us stuck with RedHat 5. (Actually, the whole thing could be resolved if the 0MQ folks updated the repos with 3.x for RHEL -- but I know that's not your responsibility.) I'll take whichever solution arrives first -- salt-ssh or ZeroMQ 3.x for RHEL. ☺
Ya, we've been considering creating our own repos for RHEL 5 to host ZMQ3 packages, but we just haven't had time.
@basepi You are right, that shouldn't be possible but... http://www.evanjones.ca/tcp-stuck-connection-mystery.html
I did more tests and I can confirm that:
Do you want me to open another issue only for the Windows minion?
@equinoxefr Yes, could you create a new issue? That seems to be a different problem from the Linux disconnection issues.
Please include as much information from this thread as you think is relevant. Thanks!
Same issue here; master and minion are on FreeBSD 9.1 with saltstack installed from ports. Any ideas or possible workarounds?
@basepi @equinoxefr I confirm the issue of a connection showing as established on the minion but not established on the master,
and the minion is running CentOS 5, not Windows.
Thanks for the input, @Abukamel. It's helpful to know that it can occur on non-Windows machines as well.
I have written a simple script to install salt and its dependencies from source on CentOS 5 to solve the ZMQ problem.
Here is the gist:
https://gist.github.com/Abukamel/7515248
I see this on Ubuntu 12.04, ZMQ 3.2.2 on 0.17.2. Minions lose connection to the master; they do have a connection on 4506 but not on 4505.
strace shows the minion not doing anything:
root@idd0012:/home/mrten# strace -F -f -p 2910
Process 2910 attached with 3 threads - interrupt to quit
[pid 2964] epoll_wait(8, <unfinished ...>
[pid 2963] epoll_wait(6, <unfinished ...>
[pid 2910] restart_syscall(<... resuming interrupted call ...>
The only thing in the minion log is the hsum issue I reported earlier (#8653), but the timestamp suggests it is not relevant.
I'm having the exact same problem as @Mrten, on Ubuntu 12.04 with ZMQ 3 on 0.16.2.
This has been a constant theme throughout my salt experience: minions dropping off and requiring a restart to be fixed. I'm actually regretting picking saltstack.
In my case I see the minion with an ESTABLISHED netstat line for port 4505 (not 4506) but the master apparently has no idea about this connection.
tcp 0 0 10.x.x.x:54281 x.x.x.x:4505 ESTABLISHED
The master on the other side has no reference to this connection.
What is worth noting, I think, is that these connections are being routed through (in my case) Azure's load balancer, so it's not a direct connection, and I wonder whether it's dropping idle connections from time to time and both master and minion don't realise the socket is actually dead.
What do the master and minion do to ensure the connection is kept open?
Idle connections dropping elsewhere have a solution, I think:
I put this in /etc/salt/minion.d/tcp-keepalive.conf:
# so that we only have to wait two minutes for a broken connection
# managed on salt.ii.nl
#start
tcp_keepalive_idle: 300
# this many missed probes = broken
tcp_keepalive_cnt: 3
# repeat every X seconds
tcp_keepalive_intvl: 60
So there's a ping at least once a minute.
Azure and AWS load balancers don't respect TCP keepalives, unfortunately.
Try a cronjob, then?
*/5 * * * * salt '*' test.ping > /dev/null
I think salt should have this as a feature; maybe it does, but I haven't found it yet.
Ya, we don't have the ability to just ping the minions on a regular basis built into salt yet. cron is the way to go.
If Azure isn't respecting keepalive, that could definitely be causing your problems. As of right now, the minions will not attempt to reconnect outside of the ZMQ keepalive routines. (We recognize that this is a problem -- the biggest blocker is the fact that ZMQ is not very good at reporting that connections are dead. We've been trying to find a good way around this problem)
Pretty sure AWS ELBs and other load balancers in general shed "idle" connections (clear routing tables, etc.) constantly.
Would a cron job set up on the master to ping clients be good enough?
You can watch for down minions from the master server via this command:
salt-run -t30 manage.down
If the return value is not none, it will be a line-delimited list with one minion per line. You can loop over them and try to restart those minions to get them back online.
I suggest you monitor the master and minions via Nagios and NRPE, and then fire an NRPE script on the minions that the master's Nagios plugin reports as down, to restart them.
This is the solution that I ended up using yesterday to overcome this problem.
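A minimal sketch of that loop, assuming passwordless SSH from the master to the minions and that each minion ID resolves as a hostname (manage.down prints one minion per line, possibly prefixed with "- "):
#!/bin/sh
# restart salt-minion on every minion the master reports as down
salt-run -t30 manage.down | sed 's/^- *//' | while read -r minion; do
  [ -n "$minion" ] && ssh "root@$minion" 'service salt-minion restart'
done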
I'm seeing this with 0.17.4-4 minions on Windows Server 2012 R2. Minion will stop responding to master, but local salt-call commands work normally.
I'm seeing this issue with Windows Server 2008 R2 with 0.17.5-2 as well.
Workaround was to make a cron job on the master that test.ping's all minions every minute.
@bkeroack : We are trying this workaround.
I am also experiencing this problem with hosts connected via VPN, running 2014.1.0 on Debian Wheezy (amd64) on all hosts except one.
I first tried using a cronjob on the master to do a test.ping to all hosts every 5 minutes, but that did not help, so I changed it to run every minute, which seems to help, except for the one Windows 7 minion...
I suspect there might be another issue with the Windows minion and will investigate further, but the problem now is that the job history for the test.ping cron job is causing my master's file system to run out of inodes :(
@jakwas Have you set a non-default value for keep_jobs in your master config? We fixed some weirdness in the job cleanup routines which hopefully will have fixed this issue for you in future versions of salt. Additionally, the latest Windows installers contain the newest version of ZeroMQ, which fixes the keepalive routines for Windows. So you shouldn't actually need your test.ping cron job anymore!
@basepi No, keep_jobs is commented out in my master config, so the default of 24 hours should apply. That is good news, but unfortunately without the test.ping cron job my Debian Wheezy minions running 2014.1.0 don't reconnect until I restart the minion service on each host. Please let me know if there is anything else I should try.
@jakwas Can you inspect your job cache (/var/cache/salt/master/jobs/) and see if there are any files in there with timestamps greater than 24 hours? Curious if you're being bit by the cleanup, or if you're just generating a ton of cached jobs in a 24-hour period and running out of inodes that way. You could also set keep_jobs to a lower setting (like 1 hour) and see if that solves your problem.
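In the master config that would look something like this (a sketch; keep_jobs is specified in hours, and the master needs a restart to pick up the change):
# /etc/salt/master - expire cached job data after 1 hour instead of the default 24
keep_jobs: 1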
@basepi I have already deleted it, since the master service did not want to start, but it was over 10GB in size, and I only have about 50 minions. I have set keep_jobs to 1 hour and will reply here if it happens again. Any idea if/why the other minions still require the cron job?
As far as I understand, as long as you have at least ZMQ 3.2 on Unix minions and 4.0.4 on Windows minions, the keepalive routines are pretty solid. I suppose certain network situations could maybe cause problems, but most often an old ZMQ version is to blame. I can't imagine that you would have a very old version of ZMQ on Wheezy, though...
All my Linux minions have ZMQ v3.2.3. I have installed the newest Windows version and will test.
It might be worth noting that most of my minions are connecting via dynamic public IP addresses to my master, which is also on a dynamic public IP address...
Interesting. That may very well be the problem. But it seems like a problem that test.ping wouldn't necessarily solve, so I'm not sure.
As the original poster on this problem, I would like to say that the latest salt-minion, 2014.1.1 with ZMQ 4.0.4, is actually working. Before, it was a joke for me, because my salt master is in the DMZ and most of my Windows clients are across either WAN connections or on our inside network, and the keepalive was not working until now. I have now rolled out salt-minions to my Windows servers because I feel it finally works.
Thanks for getting this problem resolved.
@leonhedding I'm glad to hear this is working for you! I'll go ahead and close this issue out.
@leonhedding thanks for the report! I'm glad it's working for you now. Your help has been much appreciated.
In my setup, I have 10 minions and a master Ubuntu 12.04 instance on Azure. The connections are not stable. Sometimes some of the minions can reconnect after I restart the salt-minion service, but then they lose the connection again soon... like in less than 5 minutes. Here are the version reports:
$ salt-minion --versions-report
Salt: 2014.7.2
Python: 2.7.3 (default, Dec 18 2014, 19:10:20)
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.4.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 14.5.0
RAET: Not Installed
ZMQ: 4.0.5
Mako: Not Installed
Debian source package: 2014.7.2+ds-1precise2
On the master:
$ sudo salt --versions-report
Salt: 2014.7.2
Python: 2.7.3 (default, Dec 18 2014, 19:10:20)
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.6.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 14.5.0
RAET: Not Installed
ZMQ: 4.0.5
Mako: Not Installed
Debian source package: 2014.7.2+ds-1precise2
The 10 minions were created by salt-cloud with a map file.
I had to manually upgrade them from ZMQ2 to ZMQ4.
Not sure if this is an Azure-related issue? Anything else I can try, or any other useful information I can provide?
I was having problems with my Azure minions maintaining a connection to the salt master (which was also on Azure). I moved the master to another provider and continued to have issues maintaining contact with the Azure minions. Minions on other providers were fine.
Today I set some explicit keepalive settings on the Azure minions:
tcp_keepalive: True
tcp_keepalive_idle: 60
I've not had an issue keeping in touch with these minions the way I used to before this change. Just a few minutes ago, I created a new minion on Azure without these keepalive settings, and it's already lost contact with the master.
I'm going to bounce that new minion again and see if it loses contact. If it does, I'll update the keepalive settings and see how it looks.
A day later, the keepalive settings seem to have solved everything for me with my Azure minions.
I put my minion in the DMZ and then there were no lost connections any more, so I removed the minion from the DMZ and used the @codekoala solution:
tcp_keepalive: True
tcp_keepalive_idle: 60
and it works for me. So there is some reason why we have to use this solution: either some configuration on the master server or on the minion requires communication from the master to the minion on a port that is otherwise closed, or you use the configuration above.
Best regards
P.S. I'm using 2018.3