I did a test.ping and had a handful of minions not responding. The version reports are below. At random times I can also have 20 clients not responding, though not right now. I have a cron job which restarts the service on my minions every hour. Some of these servers are on the other side of a WAN connection; because of the NAT'ing I have configured a TCP keepalive of 60 seconds for them.
Why do I keep losing my connection with the minions?
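Roughly, that keepalive setting looks like this in the minion config (a sketch of what I have; it assumes the tcp_keepalive_* minion options, which as I understand it only take effect with ZMQ 3.x):
# minion config sketch - probe idle connections after 60 seconds so the NAT mapping stays alive
tcp_keepalive: True
tcp_keepalive_idle: 60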
Master on CentOS 6
[root@saltstack ~]# salt --versions-report
Salt: 0.15.3
Python: 2.6.6 (r266:84292, Jul 10 2013, 22:48:45)
Jinja2: unknown
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3
[root@saltstack ~]# cat /etc/redhat-release
CentOS release 6.4 (Final)
[root@saltstack ~]#
CentOS 6.4 Minion
Salt: 0.15.3
Python: 2.6.6 (r266:84292, Feb 21 2013, 23:54:59)
Jinja2: unknown
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3
CentOS 6.4 Minion
Salt: 0.15.3
Python: 2.6.6 (r266:84292, Feb 22 2013, 00:00:18)
Jinja2: 2.2.1
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3
CentOS 5.9 Minion
Salt: 0.15.3
Python: 2.6.8 (unknown, Nov 7 2012, 14:47:34)
Jinja2: unknown
M2Crypto: 0.21.1
msgpack-python: 0.1.12
msgpack-pure: Not Installed
pycrypto: 2.3
PyYAML: 3.08
PyZMQ: 2.1.9
ZMQ: 2.1.9
Windows 2008 R2 64-bit Minion
Salt: 0.16.0
Python: 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.12
msgpack-pure: Not Installed
pycrypto: 2.3
PyYAML: 3.10
PyZMQ: 13.1.0
ZMQ: 3.2.2
The primary cause for minions losing their connection is ZMQ2, which I see on at least one of your version reports. Definitely upgrade to ZMQ3 to prevent many of these issues.
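If it helps to narrow down which minions are still on ZMQ2, the versions can usually be pulled remotely from the master with something like the following (a sketch; it assumes your salt version ships the test.versions_report module function, which prints each minion's Salt, Python, and ZMQ versions):
salt '*' test.versions_report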
However, we have been seeing a few reports of minions losing connection recently. Some problems are solved with a minion restart, some with a master restart. Is there anything in the logs of the disconnected minions or the master that might clue us into what is happening?
Also, are you using any IPv6? We're wondering if the recent problems are related to re-enabling IPv6 support.
Oh, and I only just connected you with the other "restarting the master" issue. =P Thanks for creating a separate issue.
Yeah, I am the person who reported my issue incorrectly on another ticket. We are not using IPv6.
I have looked at upgrading to ZMQ3, but I can't find any RPMs that work for my CentOS 5 machines. There are too many version conflicts when I have tried to get ZMQ3 onto my CentOS 5 machines. But I am not that worried, since in reality the issue is more widespread and is affecting my ZMQ3 clients just as much. If I could get the ZMQ3 machines to work well I would be happy.
Ya, this is high on our priority list. The problem is the difficulty of reproducing issues like this. =\
Good. I am now seeing 20 of my 30-odd minions go offline until the cron job restarts the minion; then they are reachable for about 5-15 minutes and then unreachable until the cron job runs again. I am not finding salt to be very useful in this situation. This happens for both ZMQ3 and ZMQ2 minions.
20 out of 30? That's high; I don't think I've seen anyone with that high a percentage of disconnects. We'll definitely look into it.
I'm having the same problem, but mine is reproducible.
If the master server's IP address changes and DNS is updated, the minions have to be restarted to regain connection. There was a fix about a year ago for this where the minion would re-resolve the IP address of the master if it lost connection.
I'm not sure if that is happening correctly now. The IP address of a salt master running in EC2 changes often when it's shut down and brought back up.
This is super annoying because my remote execution to restart the minion process is well...done by salt :)
My master's IP address is not changing; we have a static IP for it. This morning I have 19 servers not responding 30 minutes after the minion service was restarted.
I have a local cron job on each minion server that runs at 3 minutes past every hour to restart salt-minion, but I can only reach the minions right after the service restarts. This is a real pain when trying to evaluate this as a possible solution for my organisation.
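The crontab entry on each minion is roughly the following (a sketch from memory; it assumes the stock SysV init script named salt-minion that the CentOS packages install):
# restart the salt-minion service at 3 minutes past every hour
3 * * * * /sbin/service salt-minion restart >/dev/null 2>&1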
This is happening for minions that are on the same LAN, as well as minions connecting over a WAN, right?
Same LAN and across a WAN. My Windows box that I have put into salt is usually the first to go, and I know it is running ZMQ3. I haven't bothered putting more machines into salt because of this issue.
OK, I think I might have a solution for my Linux machines. We run 20-odd CentOS 5 machines, and without ZMQ3 installed, with its ability to configure TCP keepalives, all these machines go offline after some time. I tried using a cron job on the master which did a test.ping every 5 minutes to all devices as a poor man's keepalive, but my minions would still go offline occasionally.
I eventually found some time to build my own 32-bit and 64-bit RPMs for PyZMQ 13.1.0, which was the limiting factor before. There is a public repo for ZMQ3, but none for a version of PyZMQ that supports ZMQ3 on CentOS. I was getting library incompatibilities until I had the latest version of PyZMQ.
It is early days still, but I have yet to lose any minions since getting ZMQ3 onto my CentOS 5 machines. I really wish there were a public repo for ZMQ3 and PyZMQ 13.1.0, because I don't want to have to maintain my own RPMs for PyZMQ for both i386 and x64. I saw an open ticket about getting a salt repo set up for CentOS 5. That would be brilliant for others who plan to run CentOS 5 until its support ends in 2017 and also want to run salt.
My Windows machine still keeps dropping after 15 minutes or so. I have enabled keepalives on this Windows machine. I have a number more Windows machines I would like to use salt on, but am holding off until I can get it working reliably on at least one Windows machine. I have seen that I need to increase the timeout for Windows machines; I am using 45 seconds for a test.ping and still get no response.
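For what it's worth, this is roughly the command I'm testing with (a sketch; the target glob is a placeholder for my Windows minion's ID):
# 45-second timeout for the ping
salt -t 45 'win2008*' test.ping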
Windows Server 2008 R2 Datacenter SP1 Machine:
C:\salt> salt --versions-report
Salt: 0.16.2
Python: 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.12
msgpack-pure: Not Installed
pycrypto: 2.3
PyYAML: 3.10
PyZMQ: 13.1.0
ZMQ: 3.2.2
Master running CentOS 6.4
[root@saltstack ~]# salt --versions-report
Salt: 0.16.0
Python: 2.6.6 (r266:84292, Jul 10 2013, 22:48:45)
Jinja2: unknown
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3
When I run tcpdump and do a test.ping to my Windows machine, absolutely nothing shows in the dump from the master's perspective. When I restart the minion, I then see traffic in my tcpdump. Somehow the connection is dropping and I don't know how to test why this is happening.
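In case anyone wants to reproduce the capture, this is roughly what I ran on the master, filtered on salt's publish (4505) and return (4506) ports (a sketch; the interface name and the minion address are placeholders):
# 192.0.2.10 stands in for the Windows minion's IP
tcpdump -nn -i eth0 'host 192.0.2.10 and (port 4505 or port 4506)'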
I talked to @UtahDave (who's in charge of Windows development) and he says he's going to package up a new Windows installer with ZMQ 3.2.3 just to make sure that's not the issue. Then we can work from there.
Hi, I have the same problem with my Windows minions (0.16.2). Contact is lost after a few hours, except for one minion, which is in the same VLAN as the salt-master. I think it's a keepalive problem. A simple restart of the minion solves the problem. I installed 0.16.3 today; some news tomorrow ;)
Hello, today contact was lost with my Windows minion :-(. 0.16.3 didn't address this issue.
In the master log I can see some minion activity every hour:
2013-08-13 06:29:53,066 [salt.master ][INFO ] Authentication request from XML01.services.sib.fr
2013-08-13 06:29:53,067 [salt.master ][INFO ] Authentication accepted from XML01.services.sib.fr
2013-08-13 07:29:52,619 [salt.master ][INFO ] Authentication request from XML01.services.sib.fr
2013-08-13 07:29:52,619 [salt.master ][INFO ] Authentication accepted from XML01.services.sib.fr
2013-08-13 08:29:52,147 [salt.master ][INFO ] Authentication request from XML01.services.sib.fr
2013-08-13 08:29:52,147 [salt.master ][INFO ] Authentication accepted from XML01.services.sib.fr
But the minion doesn't respond to salt-master requests...
More tests:
I can use salt-call on the minion without any problems. A salt-call state.sls xxxx works like a charm. But on the master I still get no results.
[root@salt ~]# salt -v 'XML01*' test.version
Executing job with jid 20130813100802364942
XML01.services.sib.fr:
Minion did not return
Until I restart salt-minion.
Can I help you with a pcap capture or more detailed logs?
I think I am having the same or a similar issue: minions stop responding. I have 50+ minions out of 133 not responding now. I, too, cannot update ZMQ everywhere without massive pain.
@caseybea For ZMQ < 3, we really can't do anything. There are severe bugs in those versions which make the connection very unstable. Some have gotten around it by using cron to restart the minion service. But your best bet is still to upgrade ZMQ, unfortunately.
Damn. I'm not surprised to find out ZMQ 2.x is the main problem; the unfortunate part is that there's no clean way to install updates on a RHEL5 box with the zmq repo(s) out there, because there's a twisty maze of ugly dependencies that makes upgrading to ZMQ3 more or less unrealistic.
That said, when I have some time I will still try, and see how it goes.
(Meanwhile... I'm tossing in my vote to have SSH as an option for connectivity. Just in case a crafty developer is listening. :-)
Salt is still a really cool deal. I'll get it into production. Eventually.....
@basepi, I did more tests with my Windows minions. If I kill the TCP connection to the salt server on the minion side (it's marked as ESTABLISHED on the minion but doesn't exist on the server side) with a utility like TCPView, everything becomes OK. Do you think it's a ZMQ issue (ver 3.2.2)?
@caseybea salt-ssh is shipping with 0.17. Your wish is our command! =P (Though it's way slower than ZMQ, as expected)
@equinoxefr Strange that it's ESTABLISHED on minion and doesn't show up at all on master. That shouldn't be possible, right? o.O We've actually heard of a few different people having connection problems recently on Windows, we're still trying to track down the cause. But 3.2.2 is the most up-to-date version of ZMQ on Windows, so if it's ZMQ, it's an unfixed bug.
Wooo! I didn't know salt-ssh was already in the mix. This is good news for those of us stuck with RedHat 5. (Actually, the whole thing could be resolved if the 0MQ folks updated the repos with 3.x for RHEL -- but I know that's not your responsibility.) I'll take whichever solution arrives first -- salt-ssh or ZeroMQ 3.x for RHEL. ☺
Ya, we've been considering creating our own repos for RHEL 5 to host ZMQ3 packages, but we just haven't had time.
@basepi You are right, that shouldn't be possible but... http://www.evanjones.ca/tcp-stuck-connection-mystery.html
I did more tests and I can confirm that:
Do you want me to open another issue only for the Windows minion?
@equinoxefr Yes, could you create a new issue? That seems to be a different problem from the Linux disconnection issues.
Please include as much information from this thread as you think is relevant. Thanks!
Same issue here; master and minion are on FreeBSD 9.1 with saltstack installed from ports. Any ideas or possible workarounds?
@basepi @equinoxefr I confirm the issue of a connection showing as established on the minion but not established on the master,
and the minion is running CentOS 5, not Windows.
Thanks for the input, @Abukamel. It's helpful to know that it can occur on non-Windows machines as well.
I have written a simple script to install salt and its dependencies from source on CentOS 5 to solve the ZMQ problem.
Here is the gist:
https://gist.github.com/Abukamel/7515248
I see this on Ubuntu 12.04, ZMQ 3.2.2 on 0.17.2. Minions lose connection to the master; they do have a connection on 4506 but not on 4505.
strace shows the minion not doing anything:
root@idd0012:/home/mrten# strace -F -f -p 2910
Process 2910 attached with 3 threads - interrupt to quit
[pid 2964] epoll_wait(8, <unfinished ...>
[pid 2963] epoll_wait(6, <unfinished ...>
[pid 2910] restart_syscall(<... resuming interrupted call ...>
The only thing in the minion log is the hsum issue I reported earlier (#8653), but the timestamp suggests it is not relevant.
I'm having the exact same problem as @Mrten, on Ubuntu 12.04 with ZMQ 3 on 0.16.2.
This has been a constant theme throughout my salt experience: minions dropping off and requiring a restart to be fixed. I'm actually regretting picking saltstack.
In my case I see the minion with an ESTABLISHED netstat line for port 4505 (not 4506) but the master apparently has no idea about this connection.
tcp 0 0 10.x.x.x:54281 x.x.x.x:4505 ESTABLISHED
The master on the other side has no reference to this connection.
What is worth noting, I think, is that these connections are being routed through (in my case) Azure's load balancer, so it's not a direct connection, and I wonder whether it's dropping idle connections from time to time and both master and minion don't realise the socket is actually dead.
What do the master and minion do to ensure the connection is kept open?
Idle connections dropping elsewhere have a solution, I think:
I put this in /etc/salt/minion.d/tcp-keepalive.conf:
# so that we only have to wait two minutes for a broken connection
# managed on salt.ii.nl
#start
tcp_keepalive_idle: 300
# this many missed probes = broken
tcp_keepalive_cnt: 3
# repeat every X seconds
tcp_keepalive_intvl: 60
So there's a ping at least once a minute.
Azure and AWS load balancers don't respect TCP keepalives, unfortunately.
Try a cronjob, then?
*/5 * * * * salt '*' test.ping > /dev/null
I think salt should have this as a feature; maybe it does, but I haven't found it yet.
Ya, we don't have the ability to just ping the minions on a regular basis built into salt yet. cron is the way to go.
If Azure isn't respecting keepalive, that could definitely be causing your problems. As of right now, the minions will not attempt to reconnect outside of the ZMQ keepalive routines. (We recognize that this is a problem -- the biggest blocker is the fact that ZMQ is not very good at reporting that connections are dead. We've been trying to find a good way around this problem)
Pretty sure AWS ELBs and other load balancers in general shed "idle" connections (clear routing tables, etc.) constantly.
Would a cron job set up on the master to ping clients be good enough?
You can watch for down minions from the master server via this command:
salt-run -t30 manage.down
If the return value is not none, it will be a line-delimited list with one minion per line. You can loop over them and try to restart those minions to get them back online.
I suggest you monitor the master and minions via Nagios and NRPE, and then fire an NRPE script on the minions that the master's Nagios plugin reports as down, to restart them.
This is the solution that I ended up using yesterday to overcome this problem.
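A minimal sketch of that loop, assuming passwordless SSH from the master to the minions and that each minion ID resolves as a hostname (manage.down prints one minion per line, possibly prefixed with "- "):
#!/bin/sh
# restart salt-minion on every minion the master reports as down
salt-run -t30 manage.down | sed 's/^- *//' | while read -r minion; do
  [ -n "$minion" ] && ssh "root@$minion" 'service salt-minion restart'
done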
I'm seeing this with 0.17.4-4 minions on Windows Server 2012 R2. Minion will stop responding to master, but local salt-call commands work normally.
I'm seeing this issue with Windows Server 2008 R2 with 0.17.5-2 as well.
Workaround was to make a cron job on the master that test.ping's all minions every minute.
@bkeroack : We are trying this workaround.
I am also experiencing this problem with hosts connected via VPN, running 2014.1.0 on Debian Wheezy (amd64) on all hosts except one.
I first tried using a cronjob on the master to do a test.ping to all hosts every 5 minutes, but that did not help, so I changed it to run every minute, which seems to help, except for the one Windows 7 minion...
I suspect there might be another issue with the Windows minion and will investigate further, but the problem now is that the job history for the test.ping cron job is causing my master's file system to run out of inodes :(
@jakwas Have you set a non-default value for keep_jobs in your master config? We fixed some weirdness in the job cleanup routines which hopefully will have fixed this issue for you in future versions of salt. Additionally, the latest Windows installers contain the newest version of ZeroMQ, which fixes the keepalive routines for Windows. So you shouldn't actually need your test.ping cron job anymore!
@basepi No, keep_jobs is commented out in my master config, so the default of 24 hours should apply. That is good news, but unfortunately without the test.ping cron job my Debian Wheezy minions running 2014.1.0 don't reconnect until I restart the minion service on each host. Please let me know if there is anything else I should try.
@jakwas Can you inspect your job cache (/var/cache/salt/master/jobs/) and see if there are any files in there with timestamps greater than 24 hours? Curious if you're being bit by the cleanup, or if you're just generating a ton of cached jobs in a 24-hour period and running out of inodes that way. You could also set keep_jobs to a lower setting (like 1 hour) and see if that solves your problem.
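In the master config that would look something like this (a sketch; keep_jobs is specified in hours, and the master needs a restart to pick up the change):
# /etc/salt/master - expire cached job data after 1 hour instead of the default 24
keep_jobs: 1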
@basepi I have already deleted it, since the master service did not want to start, but it was over 10GB in size, and I only have about 50 minions. I have set keep_jobs to 1 hour and will reply here if it happens again. Any idea if/why the other minions still require the cron job?
As far as I understand, as long as you have at least ZMQ 3.2 on Unix minions and 4.0.4 on Windows minions, the keepalive routines are pretty solid. I suppose certain network situations could maybe cause problems, but most often an old ZMQ version is to blame. I can't imagine that you would have a very old version of ZMQ on Wheezy, though...
All my Linux minions have ZMQ v3.2.3. I have installed the newest Windows version and will test.
It might be worth noting that most of my minions are connecting via dynamic public IP addresses to my master, which is also on a dynamic public IP address...
Interesting. That may very well be the problem. But it seems like a problem that test.ping wouldn't necessarily solve, so I'm not sure.
As the original poster on this problem, I would like to say that the latest salt-minion, 2014.1.1 with ZMQ 4.0.4, is actually working. Before, it was a joke for me, because my salt master is in the DMZ and most of my Windows clients are across either WAN connections or on our inside network, and the keepalive was not working until now. I have now rolled out salt-minions to my Windows servers because I feel it finally works.
Thanks for getting this problem resolved.
@leonhedding I'm glad to hear this is working for you! I'll go ahead and close this issue out.
@leonhedding thanks for the report! I'm glad it's working for you now. Your help has been much appreciated.
In my setup, I have 10 minions and a master Ubuntu 12.04 instance on Azure. The connections are not stable. Sometimes some of the minions can reconnect after I restart the salt-minion service, but then they lose the connection again soon... like in less than 5 minutes. Here are the version reports:
$ salt-minion --versions-report
Salt: 2014.7.2
Python: 2.7.3 (default, Dec 18 2014, 19:10:20)
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.4.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 14.5.0
RAET: Not Installed
ZMQ: 4.0.5
Mako: Not Installed
Debian source package: 2014.7.2+ds-1precise2
On the master:
$ sudo salt --versions-report
Salt: 2014.7.2
Python: 2.7.3 (default, Dec 18 2014, 19:10:20)
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.6.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 14.5.0
RAET: Not Installed
ZMQ: 4.0.5
Mako: Not Installed
Debian source package: 2014.7.2+ds-1precise2
The 10 minions were created by salt-cloud with a map file.
I had to manually upgrade them from ZMQ2 to ZMQ4.
Not sure if this is an Azure-related issue? Anything else I can try, or any other useful information I can provide?
I was having problems with my Azure minions maintaining a connection to the salt master (which was also on Azure). I moved the master to another provider and continued to have issues maintaining contact with the Azure minions. Minions on other providers were fine.
Today I set some explicit keepalive settings on the Azure minions:
tcp_keepalive: True
tcp_keepalive_idle: 60
I've not had an issue keeping in touch with these minions the way I used to before this change. Just a few minutes ago, I created a new minion on Azure without these keepalive settings, and it's already lost contact with the master.
I'm going to bounce that new minion again and see if it loses contact. If it does, I'll update the keepalive settings and see how it looks.
A day later, the keepalive settings seem to have solved everything for me with my Azure minions.
I put my minion in the DMZ and then there were no lost connections any more, so I removed the minion from the DMZ and used the @codekoala solution:
tcp_keepalive: True
tcp_keepalive_idle: 60
and it works for me. So there is some reason why we have to use this solution: either some configuration on the master server or on the minion requires communication from the master to the minion on a port that is otherwise closed, or you use the configuration above.
Best regards
P.S. I'm using 2018.3