Hello,
we now have about 100 salt-minions installed in remote areas over 3G and satellite connections.
We lose connectivity with all of those minions within about 1-2 days of installation, with test.ping reporting "minion did not return". Each time, the minion saw an ESTABLISHED TCP connection while the salt-master listed no connection at all (yes, that is correct). Tighter keepalive settings were tried with no result. (The OS is Linux.)
Each time, restarting the salt-minion fixes the problem immediately.
Obviously the connections are transparently proxied somewhere (who knows what happens on those SAT networks), so the whole TCP keepalive mechanism of 0MQ fails.
Salt should handle this at the application level, so it can determine connection health and reconnect if needed, e.g. by sending dummy ping data every 10 minutes or so and checking for a valid reply. The only workaround we can see is restarting the salt-minion hourly, which is really ugly.
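For reference, the "tighter keepalive settings" mentioned above go in the minion config; a minimal sketch with illustrative values (the exact numbers here are assumptions, not the ones we actually used):

tcp_keepalive: True
tcp_keepalive_idle: 60       # seconds of idle time before the first keepalive probe
tcp_keepalive_cnt: 5         # failed probes before the kernel drops the connection
tcp_keepalive_intvl: 10      # seconds between probes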
What version of salt are you running, and on which OSes?
Can you provide the output of salt 'minion id' test.versions_report?
All minions are on Debian squeeze, salt-minion version 2014.1.1+ds-1~bpo60+1 (freshly installed), with keepalive counters counting normally (as seen when executing ss -e or netstat -ean).
The salt-master is 2014.1.3+ds-2trusty2 on Ubuntu 14.04.
test.versions_report won't work since they are all unreachable, but for one minion I restarted manually I got this:
Salt: 2014.1.1
Python: 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
Jinja2: 2.5.5
M2Crypto: 0.20.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.1.0
PyYAML: 3.09
PyZMQ: 13.1.0
ZMQ: 3.2.3
I also include a tcpdump from the minion. The master shows no connections, while the minion shows established.
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:59:37.751471 IP (tos 0x0, ttl 64, id 49407, offset 0, flags [DF], proto TCP (6), length 52)
10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0x9449 (correct), ack 1897141646, win 92, options [nop,nop,TS val 4774885 ecr 1689822158], length 0
12:59:37.755784 IP (tos 0x0, ttl 63, id 58924, offset 0, flags [DF], proto TCP (6), length 52)
xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0xc7a4 (correct), ack 1, win 46, options [nop,nop,TS val 1690129362 ecr 3537089], length 0
13:04:37.755293 IP (tos 0x0, ttl 64, id 49408, offset 0, flags [DF], proto TCP (6), length 52)
10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0xbf46 (correct), ack 1, win 92, options [nop,nop,TS val 4849886 ecr 1690129362], length 0
13:04:37.762560 IP (tos 0x0, ttl 63, id 58925, offset 0, flags [DF], proto TCP (6), length 52)
xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0x1798 (correct), ack 1, win 46, options [nop,nop,TS val 1690436570 ecr 3537089], length 0
13:09:37.759286 IP (tos 0x0, ttl 64, id 49409, offset 0, flags [DF], proto TCP (6), length 52)
10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0xea3f (correct), ack 1, win 92, options [nop,nop,TS val 4924887 ecr 1690436570], length 0
13:09:37.767481 IP (tos 0x0, ttl 63, id 58926, offset 0, flags [DF], proto TCP (6), length 52)
xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0x678f (correct), ack 1, win 46, options [nop,nop,TS val 1690743774 ecr 3537089], length 0
(xxxxamazonaws.com is the master; 10.11.32.161 is the local IP of the minion)
I think the new RAET UDP transport will be the best answer for situations like these. We won't have to worry about whether TCP reports the connection as alive or not, and it should handle latency a lot better.
(It will also give us a lot more application-side power and introspection into what's happening, so we can solve issues like this more easily. ZMQ tends to be a black box, which makes these types of problems much harder to debug.)
I believe changing the queue mechanism is not the best way to resolve this; UDP is not guaranteed to work either in this case. I urge you not to combine the two issues, since I feel that this will only delay a possible fix.
I don't see why an application-level keepalive cannot be implemented on top of 0MQ; this is not a 0MQ bug.
Oh, I agree that UDP is not a fix-all. The advantage of this new implementation is that it brings the queuing mechanisms much closer to the application level, which will make it much easier to build an application-level keepalive (which is a given, since UDP doesn't have its own keepalive).
We would love to have ZMQ application-level keepalive, and we have by no means written it off, but it will take extensive effort, and we're going to wait until RAET is out in the wild to see what the reception is. We may even end up building a TCP transport for RAET to completely replace ZMQ, at which point application-level keepalive would be a given there as well.
Do you have any time estimate for this release? Does this 0MQ replacement also mean that all the clients will have to be updated manually? That would be a huge IT effort even for the few hundred salt-minions we have. There have been several cases in salt development that resulted in lost minions, and having to fall back on a second mechanism to access those hosts really diminishes salt's usefulness.
Sorry for being a bit bitter; I really appreciate your efforts and salt itself, but I feel that breaking compatibility so often is really not the way to go.
0MQ is not being replaced. This new release will just introduce an alternate transport mechanism, which must be explicitly enabled. Nothing will change unless you want it to. Even people who want to switch to RAET should be able to do so without having to install manually. All it will take is upgrading the master and minions, ensuring that the proper RAET dependencies are installed on all the systems, and then switching first the minions and then the master over to RAET in their config.
All of that said, it will be a beta product in this next release, so we won't recommend immediately switching over an entire infrastructure or anything.
To answer your original question, we are targeting 2 weeks from now for the first release candidate.
@sivann, have you tried setting up the Salt master to run a test.ping on all your minions on a regular basis? Maybe once an hour, or once every 10 minutes? Salt has a scheduler that allows you to do that, and some people have had success with it.
@basepi thanks for clarifying that; 2 weeks is not too long.
@UtahDave We thought of that, but in some cases we lose the minions even within 10 minutes. Another option is to restart salt-minion hourly from cron, but that seems like overkill.
I saw in the sources that salt-minion has a scheduler, but I couldn't find how to re-initialize the 0MQ connection so as to write a simple keepalive patch (to just send a message to the server); it seems the connection is only initialized once, in the tune_in function, so it was not that simple for me to patch.
Sorry, I never read the full thread; I was just reading through random issues.
If you are using the dev version of salt, perhaps something like this in the minion config might help:
master: none_mult_master_ip
ping_interval: 2
auth_timeout: 10
auth_tries: 2
auth_safemode: False
random_reauth_delay: 10
The above will have the minions 'ping' the master at every ping_interval; if that ping fails and the configured re-auth retries also fail, the minion restarts itself.
Thanks @steverweber this looks promising! I will try it.
Please let me know in a week from now if anything can be improved.
Thanks.
Correction: ping_interval is in minutes, so ping_interval: 2 has the minion ping the master every 2 minutes.
@sivann The auto-restart code was patched:
https://github.com/saltstack/salt/pull/13582
How are things going? Is the salt deployment more stable now?
This is not yet released as of 2014.1.7. We just installed today's develop branch from GitHub and will get back with results.
It seems it is not fixed; it's actually worse. The new code leaves lots of stale ESTABLISHED connections on the master.
minion IP: 10.11.40.161, public: 176.227.142.126
saltmaster IP: 10.0.0.212, public: 54.246.180.52
minion:
root@debian:/usr/local/bin# netstat -ean |grep 54.246.180.52
tcp 0 0 10.11.40.161:43693 54.246.180.52:4505 ESTABLISHED 0 125808
root@debian:/usr/local/bin# ss -e|grep 54.246.180.52
ESTAB 0 0 10.11.40.161:43693 54.246.180.52:4505 timer:(keepalive,46sec,0) ino:125808 sk:f3512600
Master:
root@saltmaster:~ # netstat -ean |grep 176.227.142.126
tcp 0 0 10.0.0.212:4505 176.227.142.126:60215 ESTABLISHED 0 149365
tcp 0 0 10.0.0.212:4506 176.227.142.126:53544 ESTABLISHED 0 156632
tcp 0 0 10.0.0.212:4505 176.227.142.126:47687 ESTABLISHED 0 149367
tcp 0 0 10.0.0.212:4505 176.227.142.126:40874 ESTABLISHED 0 149360
tcp 0 1560 10.0.0.212:4505 176.227.142.126:37470 ESTABLISHED 0 149378
tcp 0 0 10.0.0.212:4506 176.227.142.126:53513 ESTABLISHED 0 156639
tcp 0 0 10.0.0.212:4505 176.227.142.126:54876 ESTABLISHED 0 149377
tcp 0 1560 10.0.0.212:4505 176.227.142.126:43693 ESTABLISHED 0 150116
tcp 0 0 10.0.0.212:4505 176.227.142.126:55295 ESTABLISHED 0 149362
tcp 0 0 10.0.0.212:4505 176.227.142.126:39531 ESTABLISHED 0 149361
tcp 0 0 10.0.0.212:4505 176.227.142.126:48655 ESTABLISHED 0 149363
strace on the minion at that time (logs show nothing useful):
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 1000) = 0 (Timeout)
[pid 14053] poll([{fd=10, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] poll([{fd=13, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 508535066}) = 0
[pid 14053] gettimeofday({1405065700, 720156}, NULL) = 0
[pid 14053] gettimeofday({1405065700, 720449}, NULL) = 0
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 509479581}) = 0
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 14053] poll([{fd=10, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] poll([{fd=13, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 509881187}) = 0
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 1000
... and so on forever.
It seems the minion shows 1 ESTABLISHED connection to the master, the master shows 9, and none of them actually works.
What probably happened is that the minion tried to reconnect to the master, leaving all those stale ESTABLISHED connections behind on the master. For some reason the reconnection was unsuccessful, since the minion does not respond to salt commands from the master. Perhaps the master does not know which connection is the right one?
A suggestion: the master could also ping the minions on established connections, and close the stale ones.
Yeah, something was mucked up in that commit.
I created a new fix that seems much more stable:
https://github.com/saltstack/salt/pull/14064
I'll likely make only a small change to that before I give the go-ahead to merge.
Testing is most welcome!
To test this patch you can do:
curl -o install_salt.sh.sh -L https://bootstrap.saltstack.com
sudo sh install_salt.sh.sh -g https://github.com/steverweber/salt.git git fix_restarts
I installed the version above, but it does not even connect to the master once; something's wrong.
First, it starts 2 salt-minion processes, and test.ping never works.
I include the debug logfile:
2014-07-14 10:07:57,527 [salt ][INFO ] Setting up the Salt Minion "battens-c1.insolar-plants.net"
2014-07-14 10:07:57,534 [salt.utils.process][DEBUG ] Created pidfile: /var/run/salt-minion.pid
2014-07-14 10:07:57,537 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion
2014-07-14 10:07:57,799 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/auth_timeout.conf'
2014-07-14 10:07:57,800 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/auth_timeout.conf
2014-07-14 10:07:57,803 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/id.conf'
2014-07-14 10:07:57,803 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/id.conf
2014-07-14 10:07:57,806 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/master.conf'
2014-07-14 10:07:57,807 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/master.conf
2014-07-14 10:07:58,302 [salt.minion ][DEBUG ] Attempting to authenticate with the Salt Master at 54.246.180.52
2014-07-14 10:07:58,305 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:02,290 [salt.crypt ][DEBUG ] Decrypting the current master AES key
2014-07-14 10:08:02,291 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:04,158 [salt.minion ][INFO ] Authentication with master successful!
2014-07-14 10:08:06,970 [salt.crypt ][DEBUG ] Decrypting the current master AES key
2014-07-14 10:08:06,972 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:11,594 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:12,552 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion
2014-07-14 10:08:12,813 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/auth_timeout.conf'
2014-07-14 10:08:12,814 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/auth_timeout.conf
2014-07-14 10:08:12,817 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/id.conf'
2014-07-14 10:08:12,818 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/id.conf
2014-07-14 10:08:12,821 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/master.conf'
2014-07-14 10:08:12,821 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/master.conf
2014-07-14 10:08:13,990 [salt.utils.schedule ][INFO ] Added new job __mine_interval to scheduler
2014-07-14 10:08:13,991 [salt.minion ][DEBUG ] I am battens-c1.insolar-plants.net and I am not supposed to start any proxies. (Likely not a problem)
2014-07-14 10:08:13,991 [salt.minion ][INFO ] Minion is starting as user 'root'
2014-07-14 10:08:13,992 [salt.minion ][DEBUG ] Minion 'battens-c1.insolar-plants.net' trying to tune in
2014-07-14 10:08:13,994 [salt.minion ][DEBUG ] Minion PUB socket URI: ipc:///var/run/salt/minion/minion_event_ce16006525_pub.ipc
2014-07-14 10:08:13,995 [salt.minion ][DEBUG ] Minion PULL socket URI: ipc:///var/run/salt/minion/minion_event_ce16006525_pull.ipc
2014-07-14 10:08:13,995 [salt.minion ][INFO ] Starting pub socket on ipc:///var/run/salt/minion/minion_event_ce16006525_pub.ipc
2014-07-14 10:08:13,996 [salt.minion ][INFO ] Starting pull socket on ipc:///var/run/salt/minion/minion_event_ce16006525_pull.ipc
2014-07-14 10:08:13,997 [salt.minion ][DEBUG ] Generated random reconnect delay between '1000ms' and '11000ms' (2767)
2014-07-14 10:08:13,998 [salt.minion ][DEBUG ] Setting zmq_reconnect_ivl to '2767ms'
2014-07-14 10:08:13,999 [salt.minion ][DEBUG ] Setting zmq_reconnect_ivl_max to '11000ms'
@steverweber does the master expect pings from the minions? If not, it would not forget the stale connections. I think the master must be aware of this "ping"; if the master does not receive pings from a minion, it could close that minion's connections.
The above log looks like the minion connects to the master at 54.246.180.52.
Authentication with master successful!
The master does not "expect" pings, but rather accepts them. The minion sends pings at the ping_interval (in minutes).
Here is an aggressive configuration I use on my minions for testing:
master: ddns.name.com
ping_interval: 2
auth_timeout: 10
auth_tries: 2
auth_safemode: False
random_reauth_delay: 10
Yes, I know the minion thought it had successfully connected, but it hadn't. It seems that just restarting the minion the way it is done now somehow confuses the master. I could not issue a single successful command to the minion with the above version, even after multiple restarts. Reinstalling the "stock" minion version fixed this behaviour. Please tell me how to help debug further.
I found an issue in daemon mode when running under a thread; however, I don't think this would cause the minion to not respond to the master.
I pushed a new fix to fix_restarts that cleans up some issues.
It is strange, though, that there were multiple salt-minion processes running from the beginning. Even after killing them and restarting, there were again multiple salt-minion processes. Perhaps this manifests when the network between minion and master is slow, as in our case.
You should see two processes.
Once some tricky issues are solved, this solution can become a single process.
The current version is holding up well on my systems. However, I'm holding off on my pull request until this solution works in your environment. Are you testing the latest version (https://github.com/steverweber/salt/tree/fix_restarts) that was pushed 2 days ago? Is it working out?
I will test tomorrow.
When testing this patch, please disable your custom tcp_keepalive_* settings and reboot the system.
Thanks.
The keepalive patch has been merged to the develop branch.
Any news?
@steverweber sorry for the long delay, I'm ready to test again. Where could I find your latest code to test?
Ignore my last comment; I'm testing the latest dev branch.
How are the little minions behaving?
I ran it on a minion that normally gets lost within a few hours, and with the dev version it has been responding to occasional pings for the past 8 days, sometimes on the 2nd ping. I would say that's very good news. I'll install it on 2-3 more minions soon. Thanks.
@sivann can this issue be closed?
Will your patch get released? If yes, then yes, I consider it fixed. My minion still responds :-)
I think it's currently only in the develop branch, so that would make it slated for the feature release after 2014.7.
Seeing similar issues to this. Is there a plan for this to make it into a release?
The patch /works/, but it's not elegant.
Personally, I would rather see the minion die hard and have the service manager (systemd, upstart, whatever you have) restart it: https://github.com/saltstack/salt/pull/22313
Does the keepalive patch (https://github.com/saltstack/salt/issues/12540#issuecomment-50223513) simply restart the minion? That was the patch I was referring to.
It restarts the minion... but it's the minion restarting itself. You will see two minion processes in ps: one that keeps the other one running. It was done this way because the salt code was not really built for rebuilding the minion object in the same process (arg parsing and global objects are tricky).
Looking back at the code, it would be simpler to update all the different service launchers (systemd, upstart, init.d, launchd...) to auto-restart the minion if it dies.
https://github.com/saltstack/salt/pull/22313
Surely the better approach would be to resolve the reason a restart is needed in the first place (the minion stops communicating with the master).
init.d, for example, has no auto-restart ability and would need something like supervisord.
I agree; exiting the minion and relying on systemd/init/monit is just another source of technical issues: systemd timeouts, init muting the service, etc. The salt-minion should be robust enough to cope with a simple network reconnection.
Well this is not completely fixed, although the ping does seem to work. I have:
ping_interval: 90
auth_tries: 20
rejected_retry: True
auth_safemode: False
restart_on_error: True
All commands always fail on the first try, and some on several subsequent tries. Not very reliable if you have thousands of minions. Looking forward to RAET in order to actually benefit from saltstack, because in its current state we can only use it for first-time configurations/installations.
You might also try ping_on_rotate: True in your master config, so that the master automatically sends a test.ping job after the AES key rotates. That solves some of the "slow to respond" issues for some users.
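For completeness, that is a single master-side option; a minimal sketch of the relevant line in /etc/salt/master:

ping_on_rotate: True    # send test.ping to all minions after each AES key rotation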
I'm in a situation where VPN connections come and go, sometimes changing the IP address of the endpoint.
I _think_ I can cope with this with the changes to the OpenVPN config that would do a minion restart on the VPN coming up, but of course I'm similarly interested in this open issue.
Also looking forward to RAET, because losing connections to minions is a painful experience.
@basepi thanks, I will try that.
@sivann the kernel option net.ipv4.tcp_mtu_probing might be helpful on some minions.
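If the minions are already under Salt management, one way to pin that kernel option is a sysctl state; a minimal sketch (the state ID, file name, and value are assumptions about what suits your links):

# e.g. mtu_probing.sls (hypothetical state file)
enable_tcp_mtu_probing:
  sysctl.present:
    - name: net.ipv4.tcp_mtu_probing
    - value: 1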
Having exactly the same issues as @sivann.
I tried all the workarounds described here and in various other issues. I only have ~40 minions, all in different clouds (Azure, AWS, etc.). The master is in the Azure cloud; they have a firewall in place and don't allow any ICMP (it shouldn't matter, but just mentioning it).
The minions lose their connection frequently and salt is absolutely unusable, because every time I want to do something I have to go and restart salt-minion on every minion that has lost its connection; after a few days maybe 25% of my minions are still connected.
I have tried to use RAET a few times over the last months; every time, the codebase had serious issues with RAET support that broke it completely (see the various open RAET support issues).
At the end of the day this simple issue makes Salt completely unusable for any "remote configuration management" scenario. I am seriously considering switching to Ansible, even though I would have to rewrite a huge number of states into playbooks. I love Salt when it works, but this simple issue of minions losing their connection is so frustrating and so unacceptable that nothing else matters.
Sorry for the rant, but there are many issues about minions losing their connection, and I don't understand how anyone is actually using Salt over a WAN successfully. I have tried so many different servers, clouds, and configurations, and it happens every time; the thing is that a TCP connection is simply not guaranteed to stay up forever. Why is this not being handled at the application level, as suggested in this issue?
Seeing as this issue has been open for over a year, is there any intention from the Salt team to fix this (or to provide a working RAET implementation), or should I cut my losses and move to another solution?
@bymodude Sorry you're having so many problems!
It's interesting, because for every user who has a ton of problems with salt connections, we have hundreds of other users who are using it successfully. I'm not sure what's going on in your case! Have you tried hosting your master on another cloud, to see whether something is going on specifically with Azure?
Sorry for the RAET issues you've had. We've been focusing our efforts on a replacement TCP transport for ZMQ, which should give us more visibility into what's going on when things fail and in turn make it much easier to implement application-level keepalive. It also lets us continue the paradigm of only having to open ports on the master instead of on all the minions, which is a pretty big "gotcha" when it comes to RAET (or UDP in general).
You might also consider checking out salt-ssh. Although the use of ssh as the transport slows salt down considerably, most state runs will work out of the box with salt-ssh.
@basepi thanks for your prompt response
We have tried hosting the master elsewhere, which reduces the frequency of disconnected minions, but it still happens. Playing with the TCP keepalive kernel parameters also influences this: e.g. a keepalive of 1200 seconds leads to lots of down minions after about 12 hours, while a keepalive of 75 seconds leaves most minions still up after a few days, though some start dropping eventually after days/weeks.
These disconnects have always happened since we started using salt (around the 2014.7 release). However, having just tested 2015.5.3 and specifically tried to reproduce the disconnecting-minion issue (by raising our TCP keepalive to 1200, which seems to trigger it within ~12 hours, and by using one minion<->master connection that experiences packet loss, which triggers the disconnect even more easily), we are now actually seeing some errors on the minion side. I have opened #26888 and left a related comment in #25288 (we were trying a manual workaround, as suggested here: http://comments.gmane.org/gmane.comp.sysutils.salt.user/22264).
From my understanding, the salt-minion now actually has the mechanisms to detect these lost connections; at least we are seeing SaltReqTimeoutErrors on the minion side. The minion just does not seem to do the restart to reconnect.
We are using salt-ssh to bootstrap our minions, but using it for all states is not an option: at least in 2014.7 we found "most state runs will work out of the box" not to be the case. Since we do have a roster containing all minions from the salt-ssh bootstrapping, a feasible workaround for us may be to set up a cronjob that runs "salt '*' test.ping" every few minutes and then runs "salt-ssh dead_minions cmd.run 'service salt-minion restart'". It just feels clunky.
Just to add, regarding Azure, for the benefit of others who run minions or a master in the Azure cloud: the default TCP idle timeout on Azure is 4 minutes, so setting the TCP keepalive below that is essential; alternatively, increase the timeout on the Azure side as documented here: http://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer/
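For example, keeping the minion-side keepalive comfortably below Azure's 4-minute idle timeout might look like this in the minion config (the values are illustrative assumptions, not tested recommendations):

tcp_keepalive: True
tcp_keepalive_idle: 120     # first probe after 2 minutes idle, well under the 4-minute Azure timeout
tcp_keepalive_intvl: 30     # then probe every 30 seconds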
I will note that salt-ssh got a lot of improvements in 2015.5; you might consider testing there. I'd be interested to hear about any gaps in feature compatibility -- we want to give salt-ssh feature parity with normal salt, if not the same performance. (Mine and publish calls are particularly bad performance-wise right now.)
@bymodude using something like this on the minions should cause the minion to restart if it hits 2 SaltReqTimeoutErrors in a row:
ping_interval: 2
auth_timeout: 10
auth_tries: 2
auth_safemode: False
random_reauth_delay: 10
I haven't tested it in some time, because I moved my master to a stable server with a static IP. I was using a laptop with a dynamic IP address at one point, and this worked while moving between home and work :-)
@steverweber thanks for the suggestion, but I am already using that config and the minions still do not restart, even though SaltReqTimeoutErrors show up in the log, as documented in #26888.
This issue is still open. The ping patch improves things a lot, but in the long run we lose most minions. Of our now 830 minions, about 80% do not reply to test.ping, all upgraded to at least version 2015.8.8. Nothing ever shows up in the logs.
It's not apparent whether it's only networking or other bugs (auth issues?) manifesting quietly, but in its current state we unfortunately cannot consider salt for configuration management or orchestration; we only use it for initial installation. We have created more than 100 salt states and we would really like to see salt working for us.
I would be happy to provide more information, but I'm not sure how I could help debug.
The only suggestion I can offer is to drop both of your custom low-level transport implementations, including RAET, and use a well-proven mechanism like MQTT, with an MQTT broker on the salt-master. It would clean up salt's codebase, make protocol debugging much easier, and solve these issues. It would also open the door for thinner minion implementations, e.g. for embedded devices.
Example logs for a minion that no longer responds.
There is a _schedule.conf, with:
schedule:
__mine_interval: {function: mine.update, jid_include: true, maxrunning: 2, minutes: 60}
2016-05-16 09:34:03,620 [salt.utils.event ][DEBUG ][7013] MinionEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pub.ipc
2016-05-16 09:34:03,621 [salt.utils.event ][DEBUG ][7013] MinionEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pull.ipc
2016-05-16 09:34:03,624 [salt.utils.event ][DEBUG ][7013] Sending event - data = {'clear': False, 'cmd': '_mine', 'data': {}, 'id': 'ermine-c0.insolar-plants.net', '_stamp': '2016-05-16T08:34:03.623764'}
2016-05-16 09:34:03,626 [salt.minion ][DEBUG ][1995] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xbcermine-c0.insolar-plants.net\xa6_stamp\xba2016-05-16T08:34:03.623764'
2016-05-16 09:34:03,628 [salt.transport.zeromq ][DEBUG ][1995] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 09:34:03,629 [salt.crypt ][DEBUG ][1995] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 09:34:04,127 [salt.transport.zeromq ][DEBUG ][7013] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 09:34:04,128 [salt.crypt ][DEBUG ][7013] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 09:34:06,338 [salt.utils.schedule ][DEBUG ][7013] schedule.handle_func: Removing /var/cache/salt/minion/proc/20160516093403600494
2016-05-16 10:04:00,595 [salt.transport.zeromq ][DEBUG ][1995] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 10:04:00,596 [salt.crypt ][DEBUG ][1995] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 10:34:03,594 [salt.utils.schedule ][INFO ][1995] Running scheduled job: __mine_interval
2016-05-16 10:34:03,595 [salt.utils.schedule ][DEBUG ][1995] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2016-05-16 10:34:03,596 [salt.utils.schedule ][DEBUG ][1995] schedule: This job was scheduled with a max number of 2
2016-05-16 10:34:03,614 [salt.utils.schedule ][DEBUG ][7061] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20160516103403600217', 'pid': 7061, 'id': 'ermine-c0.insolar-plants.net', 'schedule': '__mine_interval'}
2016-05-16 10:34:03,620 [salt.utils.event ][DEBUG ][7061] MinionEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pub.ipc
2016-05-16 10:34:03,621 [salt.utils.event ][DEBUG ][7061] MinionEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pull.ipc
2016-05-16 10:34:03,624 [salt.utils.event ][DEBUG ][7061] Sending event - data = {'clear': False, 'cmd': '_mine', 'data': {}, 'id': 'ermine-c0.insolar-plants.net', '_stamp': '2016-05-16T09:34:03.623898'}
2016-05-16 10:34:03,627 [salt.minion ][DEBUG ][1995] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xbcermine-c0.insolar-plants.net\xa6_stamp\xba2016-05-16T09:34:03.623898'
2016-05-16 10:34:03,630 [salt.transport.zeromq ][DEBUG ][1995] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 10:34:03,632 [salt.crypt ][DEBUG ][1995] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 10:34:04,128 [salt.transport.zeromq ][DEBUG ][7061] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 10:34:04,129 [salt.crypt ][DEBUG ][7061] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 10:34:06,551 [salt.utils.schedule ][DEBUG ][7061] schedule.handle_func: Removing /var/cache/salt/minion/proc/20160516103403600217
@cachedout Pinging you so we have a current core dev in the conversation.
@sivann Hi there.
I'm not sure I would agree that the ZeroMQ transport is a "custom low-level implementation". It's a series of fairly well-defined patterns on top of PyZMQ, using hooks provided by that library. Perhaps you just meant RAET and TCP? If so, I would agree.
I'm aware of MQTT and one of the reasons that the salt transport system has been designed in a pluggable manner is to allow easy exploration into the feasibility of adding that support. It's not something that's on the roadmap at present.
Regarding the issues you've been facing, we'd have to debug this as one would debug any other networking problem. We'd likely need to look at packet captures from failed minions, and we'd need to know the state of the sockets on both the minion side and the master side. You didn't say whether this was on the publish side or the return side, so that's the first determination we'd need to make.
@cachedout, thank you for your help; I will try to get packet captures.
We also tried the new TCP transport: same behaviour, the minion was lost after 3 days. The minion's side shows ESTABLISHED and logs nothing; the master's side shows no connection. As I stated before, TCP keepalives are not trustworthy; they are terminated locally in some networks.
@cachedout There is nothing to capture; strace shows the minion just polling its sockets:
[pid 29931] gettimeofday({1488274468, 901392}, NULL) = 0
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677798, 684207835}) = 0
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 0) = 0 (Timeout)
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 995) = 0 (Timeout)
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677799, 681042172}) = 0
[pid 29931] gettimeofday({1488274469, 898725}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 899251}, NULL) = 0
[pid 29931] waitpid(29932, 0xbfe1f7f4, WNOHANG) = 0
[pid 29931] gettimeofday({1488274469, 899689}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 900043}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 900350}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 900561}, NULL) = 0
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677799, 683313820}) = 0
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 0) = 0 (Timeout)
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 996) = 0 (Timeout)
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677800, 681053958}) = 0
[pid 29931] gettimeofday({1488274470, 898734}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 899271}, NULL) = 0
[pid 29931] waitpid(29932, 0xbfe1f7f4, WNOHANG) = 0
[pid 29931] gettimeofday({1488274470, 899710}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 900071}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 900379}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 900598}, NULL) = 0
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677800, 683347726}) = 0
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 0) = 0 (Timeout)
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 995) = 0 (Timeout)
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677801, 680110054}) = 0
[pid 29931] gettimeofday({1488274471, 898951}, NULL) = 0
[pid 29931] gettimeofday({1488274471, 900609}, NULL) = 0
[pid 29931] waitpid(29932, 0xbfe1f7f4, WNOHANG) = 0
Please do not implement more access methods; please just fix one of the current implementations. If you don't use your own keepalive payload, this will always be broken. Would a support contract help you move this forward? We have invested too much time in salt and are very sorry to see this basic functionality broken.
@sivann, I am seeing/troubleshooting this as well on my network. We have a DMZ, and connections from our non-DMZ master to the DMZ minions have been getting closed.
When you strace the minion and it shows that it "just polls its socket", what do you see in the minion's logs? Are you running with "log_level_logfile: all" or equivalent? What are the last actions logged by the minion? Also, what does the following command give you?
ss -nao | grep -P ':450(5|6)\s+'
In my case, I believe our DMZ firewall is ignoring the TCP keepalives passing through it, and since we do not (yet) have a scheduled master->minion ping, the firewall destroys the connection after about 60 minutes of what it considers inactivity. I am still troubleshooting, however.
Here is an update.
On my non-firewalled network I have minions that still report timeouts, but they recover adequately:
2017-03-05 13:20:12,920 [salt.transport.zeromq ][DEBUG ][27814] SaltReqTimeoutError, retrying. (1/7)
2017-03-05 13:20:17,920 [salt.transport.zeromq ][DEBUG ][27814] SaltReqTimeoutError, retrying. (2/7)
2017-03-05 13:20:22,920 [salt.transport.zeromq ][DEBUG ][27814] SaltReqTimeoutError, retrying. (3/7)
With this in mind, I configured my DMZ-firewalled systems with an increased "auth_tries" value of 20 instead of the default of 7. This seems to have done the trick for me. I could not get the "set the timeouts really low to force the code to restart the salt-minion child" method to work; I always wound up with my DMZ-firewalled systems running only the parent after it attempted, and failed, to restart the child.
Here is what my DMZ-firewalled systems report in their logs for timeouts:
2017-03-05 13:20:18,155 [salt.transport.zeromq ][DEBUG ][2841] SaltReqTimeoutError, retrying. (1/20)
2017-03-05 13:20:28,157 [salt.transport.zeromq ][DEBUG ][2841] SaltReqTimeoutError, retrying. (2/20)
2017-03-05 13:20:38,158 [salt.transport.zeromq ][DEBUG ][2841] SaltReqTimeoutError, retrying. (3/20)
Hope that helps someone.
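For reference, the change described above is a single minion config option, shown here in isolation (the rest of the config is assumed unchanged):

auth_tries: 20      # raised from the default of 7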
@PeterS242 Hi Peter, as I stated above, the keepalive counters count on one end only (the minion's), because on SAT networks keepalive is proxied (emulated). Nothing shows up in the minion logs.
There is some good news: testing the TCP transport with the latest 2016.11.x, we have now had 2 complete weeks in which all 3 test minions stayed reachable.
At Link Labs we are facing a similar issue with minions on 3G and Satcom connections. We have the minion running under supervisor, which helps some, but it would be very helpful to get a fuller picture of what minion/master settings you used to achieve your reliability, @sivann.
I'm also for application level keepalives. 👍
I also second @emeryray02's request for more details @sivann.
I have a master in AWS and minions in the field. After several hours, the minions lose connectivity to the salt master, but they do not realize it. I have configured reasonable TCP keepalives, but AWS does not seem to be respecting them.
ZeroMQ 4.2 includes heartbeat functionality; maybe that can be used.
@riq-dfaubion we used the TCP transport as an experiment, i.e. "transport: tcp" rather than 0MQ, and that seemed to work in a short test. But we have not found the time to switch over, since we have > 1000 minions, and there are several issues that prevent me from trusting salt for orchestration (several broken releases, version incompatibilities, short-term OS support, half-baked LDAP). We just use it for commissioning now.
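For anyone who wants to run the same experiment, the transport is selected with a single option that has to match on the master and the minions; a minimal sketch (check the docs for your version before switching a large fleet):

# in both /etc/salt/master and /etc/salt/minion
transport: tcp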
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.