Hello,
we now have about 100 salt-minions installed in remote areas over 3G and satellite connections.
We lose connectivity with all of those minions within about 1-2 days of installation, with test.ping reporting "minion did not return". Each time, the minion saw an ESTABLISHED TCP connection while the salt-master listed no connection at all (yes, that is correct). Tighter keepalive settings were tried with no result. (The OS is Linux.)
Each time, restarting the salt-minion fixes the problem immediately.
Obviously the connections are transparently proxied somewhere (who knows what happens on those SAT networks), so the whole TCP keepalive mechanism of 0MQ fails.
Salt should handle this at the application level, so it can determine connection health and reconnect if needed, e.g. by sending dummy ping data every 10 minutes or so and checking for a valid reply. The only workaround we can see is restarting the salt-minion hourly, which is really ugly.
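For reference, the "tighter keepalive settings" mentioned above go in the minion config; a minimal sketch with illustrative values (the exact numbers here are assumptions, not the ones we actually used):

tcp_keepalive: True
tcp_keepalive_idle: 60       # seconds of idle time before the first keepalive probe
tcp_keepalive_cnt: 5         # failed probes before the kernel drops the connection
tcp_keepalive_intvl: 10      # seconds between probes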
What version of salt are you running, and on which OSes?
Can you provide the output of salt 'minion id' test.versions_report?
All minions are on Debian squeeze, salt-minion version 2014.1.1+ds-1~bpo60+1 (freshly installed), with keepalive counters counting normally (as seen when executing ss -e or netstat -ean).
The salt-master is 2014.1.3+ds-2trusty2 on Ubuntu 14.04.
test.versions_report won't work since they are all unreachable, but for one minion I restarted manually I got this:
Salt: 2014.1.1
Python: 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
Jinja2: 2.5.5
M2Crypto: 0.20.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.1.0
PyYAML: 3.09
PyZMQ: 13.1.0
ZMQ: 3.2.3
I also include a tcpdump from the minion. The master shows no connections, while the minion shows established.
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:59:37.751471 IP (tos 0x0, ttl 64, id 49407, offset 0, flags [DF], proto TCP (6), length 52)
10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0x9449 (correct), ack 1897141646, win 92, options [nop,nop,TS val 4774885 ecr 1689822158], length 0
12:59:37.755784 IP (tos 0x0, ttl 63, id 58924, offset 0, flags [DF], proto TCP (6), length 52)
xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0xc7a4 (correct), ack 1, win 46, options [nop,nop,TS val 1690129362 ecr 3537089], length 0
13:04:37.755293 IP (tos 0x0, ttl 64, id 49408, offset 0, flags [DF], proto TCP (6), length 52)
10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0xbf46 (correct), ack 1, win 92, options [nop,nop,TS val 4849886 ecr 1690129362], length 0
13:04:37.762560 IP (tos 0x0, ttl 63, id 58925, offset 0, flags [DF], proto TCP (6), length 52)
xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0x1798 (correct), ack 1, win 46, options [nop,nop,TS val 1690436570 ecr 3537089], length 0
13:09:37.759286 IP (tos 0x0, ttl 64, id 49409, offset 0, flags [DF], proto TCP (6), length 52)
10.11.32.161.58214 > xxxxxx.eu-west-1.compute.amazonaws.com.4505: Flags [.], cksum 0xea3f (correct), ack 1, win 92, options [nop,nop,TS val 4924887 ecr 1690436570], length 0
13:09:37.767481 IP (tos 0x0, ttl 63, id 58926, offset 0, flags [DF], proto TCP (6), length 52)
xxxxxx.eu-west-1.compute.amazonaws.com.4505 > 10.11.32.161.58214: Flags [.], cksum 0x678f (correct), ack 1, win 46, options [nop,nop,TS val 1690743774 ecr 3537089], length 0
(xxxxamazonaws.com is the master; 10.11.32.161 is the local IP of the minion)
I think the new RAET UDP transport will be the best answer for situations like these. We won't have to worry about whether TCP reports the connection as alive or not, and it should handle latency a lot better.
(It will also give us a lot more application-side power and introspection into what's happening, so we can solve issues like this more easily. ZMQ tends to be a black box, which makes these types of problems much harder to debug.)
I believe changing the queue mechanism is not the best way to resolve this; UDP is not guaranteed to work either in this case. I urge you not to combine the two issues, since I feel that this will only delay a possible fix.
I don't see why an application-level keepalive cannot be implemented on top of 0MQ; this is not a 0MQ bug.
Oh, I agree that UDP is not a fix-all. The advantage of this new implementation is that it brings the queuing mechanisms much closer to the application level, which will make it much easier to build an application-level keepalive (which is a given, since UDP doesn't have its own keepalive).
We would love to have ZMQ application-level keepalive, and we have by no means written it off, but it will take extensive effort, and we're going to wait until RAET is out in the wild to see what the reception is. We may even end up building a TCP transport for RAET to completely replace ZMQ, at which point application-level keepalive would be a given there as well.
Do you have any time estimate for this release? Does this 0MQ replacement also mean that all the clients will have to be updated manually? That would be a huge IT effort even for the few hundred salt-minions we have. There have been several cases in salt development that resulted in lost minions, and having to fall back on a second mechanism to access those hosts really diminishes salt's usefulness.
Sorry for being a bit bitter; I really appreciate your efforts and salt itself, but I feel that breaking compatibility so often is really not the way to go.
0MQ is not being replaced. This new release will just introduce an alternate transport mechanism, which must be explicitly enabled. Nothing will change unless you want it to. Even people who want to switch to RAET should be able to do so without having to install manually. All it will take is upgrading the master and minions, ensuring that the proper RAET dependencies are installed on all the systems, and then switching first the minions and then the master over to RAET in their config.
All of that said, it will be a beta product in this next release, so we won't recommend immediately switching over an entire infrastructure or anything.
To answer your original question, we are targeting 2 weeks from now for the first release candidate.
@sivann, have you tried setting up the Salt master to run a test.ping on all your minions on a regular basis? Maybe once an hour, or once every 10 minutes? Salt has a scheduler that allows you to do that, and some people have had success with it.
@basepi thanks for clarifying that; 2 weeks is not too long.
@UtahDave We thought of that, but in some cases we lose the minions even within 10 minutes. Another option is to restart salt-minion hourly from cron, but that seems like overkill.
I saw in the sources that salt-minion has a scheduler, but I couldn't find how to re-initialize the 0MQ connection so as to write a simple keepalive patch (to just send a message to the server); it seems the connection is only initialized once, in the tune_in function, so it was not that simple for me to patch.
Sorry, I never read the full thread; I was just reading through random issues.
If you are using the dev version of salt, perhaps something like this in the minion config might help:
master: none_mult_master_ip
ping_interval: 2
auth_timeout: 10
auth_tries: 2
auth_safemode: False
random_reauth_delay: 10
The above will have the minions 'ping' the master at every ping_interval; if that ping fails and the configured re-auth retries also fail, the minion restarts itself.
Thanks @steverweber this looks promising! I will try it.
Please let me know in a week from now if anything can be improved.
Thanks.
Correction: ping_interval is in minutes, so ping_interval: 2 has the minion ping the master every 2 minutes.
@sivann The auto-restart code was patched:
https://github.com/saltstack/salt/pull/13582
How are things going? Is the salt deployment more stable now?
This is not yet released as of 2014.1.7. We just installed today's develop branch from GitHub and will get back with results.
It seems it is not fixed; it's actually worse. The new code leaves lots of stale ESTABLISHED connections on the master.
minion IP: 10.11.40.161, public: 176.227.142.126
saltmaster IP: 10.0.0.212, public: 54.246.180.52
minion:
root@debian:/usr/local/bin# netstat -ean |grep 54.246.180.52
tcp 0 0 10.11.40.161:43693 54.246.180.52:4505 ESTABLISHED 0 125808
root@debian:/usr/local/bin# ss -e|grep 54.246.180.52
ESTAB 0 0 10.11.40.161:43693 54.246.180.52:4505 timer:(keepalive,46sec,0) ino:125808 sk:f3512600
Master:
root@saltmaster:~ # netstat -ean |grep 176.227.142.126
tcp 0 0 10.0.0.212:4505 176.227.142.126:60215 ESTABLISHED 0 149365
tcp 0 0 10.0.0.212:4506 176.227.142.126:53544 ESTABLISHED 0 156632
tcp 0 0 10.0.0.212:4505 176.227.142.126:47687 ESTABLISHED 0 149367
tcp 0 0 10.0.0.212:4505 176.227.142.126:40874 ESTABLISHED 0 149360
tcp 0 1560 10.0.0.212:4505 176.227.142.126:37470 ESTABLISHED 0 149378
tcp 0 0 10.0.0.212:4506 176.227.142.126:53513 ESTABLISHED 0 156639
tcp 0 0 10.0.0.212:4505 176.227.142.126:54876 ESTABLISHED 0 149377
tcp 0 1560 10.0.0.212:4505 176.227.142.126:43693 ESTABLISHED 0 150116
tcp 0 0 10.0.0.212:4505 176.227.142.126:55295 ESTABLISHED 0 149362
tcp 0 0 10.0.0.212:4505 176.227.142.126:39531 ESTABLISHED 0 149361
tcp 0 0 10.0.0.212:4505 176.227.142.126:48655 ESTABLISHED 0 149363
strace on the minion at that time (logs show nothing useful):
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 1000) = 0 (Timeout)
[pid 14053] poll([{fd=10, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] poll([{fd=13, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 508535066}) = 0
[pid 14053] gettimeofday({1405065700, 720156}, NULL) = 0
[pid 14053] gettimeofday({1405065700, 720449}, NULL) = 0
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 509479581}) = 0
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 14053] poll([{fd=10, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] poll([{fd=13, events=POLLIN}], 1, 0) = 0 (Timeout)
[pid 14053] clock_gettime(CLOCK_MONOTONIC, {335431, 509881187}) = 0
[pid 14053] poll([{fd=10, events=POLLIN}, {fd=13, events=POLLIN}], 2, 1000
... and so on forever.
It seems the minion shows 1 ESTABLISHED connection to the master, the master shows 9, and none of them actually works.
What probably happened is that the minion tried to reconnect to the master, leaving all those stale ESTABLISHED connections behind on the master. For some reason the reconnection was unsuccessful, since the minion does not respond to salt commands from the master. Perhaps the master does not know which connection is the right one?
A suggestion: the master could also ping the minions on established connections, and close the stale ones.
Yeah, something was mucked up in that commit.
I created a new fix that seems much more stable:
https://github.com/saltstack/salt/pull/14064
I'll likely make only a small change to that before I give the go-ahead to merge.
Testing is most welcome!
To test this patch you can do:
curl -o install_salt.sh.sh -L https://bootstrap.saltstack.com
sudo sh install_salt.sh.sh -g https://github.com/steverweber/salt.git git fix_restarts
I installed the version above, but it does not even connect to the master once; something's wrong.
First, it starts 2 salt-minion processes, and test.ping never works.
I include the debug logfile:
2014-07-14 10:07:57,527 [salt ][INFO ] Setting up the Salt Minion "battens-c1.insolar-plants.net"
2014-07-14 10:07:57,534 [salt.utils.process][DEBUG ] Created pidfile: /var/run/salt-minion.pid
2014-07-14 10:07:57,537 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion
2014-07-14 10:07:57,799 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/auth_timeout.conf'
2014-07-14 10:07:57,800 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/auth_timeout.conf
2014-07-14 10:07:57,803 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/id.conf'
2014-07-14 10:07:57,803 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/id.conf
2014-07-14 10:07:57,806 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/master.conf'
2014-07-14 10:07:57,807 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/master.conf
2014-07-14 10:07:58,302 [salt.minion ][DEBUG ] Attempting to authenticate with the Salt Master at 54.246.180.52
2014-07-14 10:07:58,305 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:02,290 [salt.crypt ][DEBUG ] Decrypting the current master AES key
2014-07-14 10:08:02,291 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:04,158 [salt.minion ][INFO ] Authentication with master successful!
2014-07-14 10:08:06,970 [salt.crypt ][DEBUG ] Decrypting the current master AES key
2014-07-14 10:08:06,972 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:11,594 [salt.crypt ][DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
2014-07-14 10:08:12,552 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion
2014-07-14 10:08:12,813 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/auth_timeout.conf'
2014-07-14 10:08:12,814 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/auth_timeout.conf
2014-07-14 10:08:12,817 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/id.conf'
2014-07-14 10:08:12,818 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/id.conf
2014-07-14 10:08:12,821 [salt.config ][DEBUG ] Including configuration from '/etc/salt/minion.d/master.conf'
2014-07-14 10:08:12,821 [salt.config ][DEBUG ] Reading configuration from /etc/salt/minion.d/master.conf
2014-07-14 10:08:13,990 [salt.utils.schedule ][INFO ] Added new job __mine_interval to scheduler
2014-07-14 10:08:13,991 [salt.minion ][DEBUG ] I am battens-c1.insolar-plants.net and I am not supposed to start any proxies. (Likely not a problem)
2014-07-14 10:08:13,991 [salt.minion ][INFO ] Minion is starting as user 'root'
2014-07-14 10:08:13,992 [salt.minion ][DEBUG ] Minion 'battens-c1.insolar-plants.net' trying to tune in
2014-07-14 10:08:13,994 [salt.minion ][DEBUG ] Minion PUB socket URI: ipc:///var/run/salt/minion/minion_event_ce16006525_pub.ipc
2014-07-14 10:08:13,995 [salt.minion ][DEBUG ] Minion PULL socket URI: ipc:///var/run/salt/minion/minion_event_ce16006525_pull.ipc
2014-07-14 10:08:13,995 [salt.minion ][INFO ] Starting pub socket on ipc:///var/run/salt/minion/minion_event_ce16006525_pub.ipc
2014-07-14 10:08:13,996 [salt.minion ][INFO ] Starting pull socket on ipc:///var/run/salt/minion/minion_event_ce16006525_pull.ipc
2014-07-14 10:08:13,997 [salt.minion ][DEBUG ] Generated random reconnect delay between '1000ms' and '11000ms' (2767)
2014-07-14 10:08:13,998 [salt.minion ][DEBUG ] Setting zmq_reconnect_ivl to '2767ms'
2014-07-14 10:08:13,999 [salt.minion ][DEBUG ] Setting zmq_reconnect_ivl_max to '11000ms'
@steverweber does the master expect pings from the minions? If not, it would not forget the stale connections. I think the master must be aware of this "ping"; if the master does not receive pings from a minion, it could close that minion's connections.
The above log looks like the minion connects to the master at 54.246.180.52.
Authentication with master successful!
The master does not "expect" pings, but rather accepts them. The minion sends pings at the ping_interval (in minutes).
Here is an aggressive configuration I use on my minions for testing:
master: ddns.name.com
ping_interval: 2
auth_timeout: 10
auth_tries: 2
auth_safemode: False
random_reauth_delay: 10
Yes, I know the minion thought it had successfully connected, but it hadn't. It seems that just restarting the minion the way it is done now somehow confuses the master. I could not issue a single successful command to the minion with the above version, even after multiple restarts. Reinstalling the "stock" minion version fixed this behaviour. Please tell me how to help debug further.
I found an issue in daemon mode when running under a thread; however, I don't think this would cause the minion to not respond to the master.
I pushed a new fix to fix_restarts that cleans up some issues.
It is strange, though, that there were multiple salt-minion processes running from the beginning. Even after killing them and restarting, there were again multiple salt-minion processes. Perhaps this manifests when the network between minion and master is slow, as in our case.
You should see two processes.
Once some tricky issues are solved, this solution can become a single process.
The current version is holding up well on my systems. However, I'm holding off on my pull request until this solution works in your environment. Are you testing the latest version (https://github.com/steverweber/salt/tree/fix_restarts) that was pushed 2 days ago? Is it working out?
I will test tomorrow.
When testing this patch, please disable your custom tcp_keepalive_* settings and reboot the system.
Thanks.
The keepalive patch has been merged to the develop branch.
Any news?
@steverweber sorry for the long delay, I'm ready to test again. Where could I find your latest code to test?
Ignore my last comment; I'm testing the latest dev branch.
How are the little minions behaving?
I ran it on a minion that normally gets lost within a few hours, and with the dev version it has been responding to occasional pings for the past 8 days, sometimes on the 2nd ping. I would say that's very good news. I'll install it on 2-3 more minions soon. Thanks.
@sivann can this issue be closed?
Will your patch get released? If yes, then yes, I consider it fixed. My minion still responds :-)
I think it's currently only in the develop branch, so that would make it slated for the feature release after 2014.7.
Seeing similar issues to this. Is there a plan for this to make it into a release?
The patch /works/, but it's not elegant.
Personally, I would rather see the minion die hard and have the service manager (systemd, upstart, whatever you have) restart it: https://github.com/saltstack/salt/pull/22313
Does the keepalive patch (https://github.com/saltstack/salt/issues/12540#issuecomment-50223513) simply restart the minion? That was the patch I was referring to.
It restarts the minion... but it's the minion restarting itself. You will see two minion processes in ps: one that keeps the other one running. It was done this way because the salt code was not really built for rebuilding the minion object in the same process (arg parsing and global objects are tricky).
Looking back at the code, it would be simpler to update all the different service launchers (systemd, upstart, init.d, launchd...) to auto-restart the minion if it dies.
https://github.com/saltstack/salt/pull/22313
Surely the better approach would be to resolve the reason a restart is needed in the first place (the minion stops communicating with the master).
init.d, for example, has no auto-restart ability and would need something like supervisord.
I agree; exiting the minion and relying on systemd/init/monit is just another source of technical issues: systemd timeouts, init muting the service, etc. The salt-minion should be robust enough to cope with a simple network reconnection.
Well this is not completely fixed, although the ping does seem to work. I have:
ping_interval: 90
auth_tries: 20
rejected_retry: True
auth_safemode: False
restart_on_error: True
All commands always fail on the first try, and some on several subsequent tries. Not very reliable if you have thousands of minions. Looking forward to RAET in order to actually benefit from saltstack, because in its current state we can only use it for first-time configurations/installations.
You might also try ping_on_rotate: True in your master config, so that the master automatically sends a test.ping job after the AES key rotates. That solves some of the "slow to respond" issues for some users.
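For completeness, that is a single master-side option; a minimal sketch of the relevant line in /etc/salt/master:

ping_on_rotate: True    # send test.ping to all minions after each AES key rotation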
I'm in a situation where VPN connections come and go, sometimes changing the IP address of the endpoint.
I _think_ I can cope with this with the changes to the OpenVPN config that would do a minion restart on the VPN coming up, but of course I'm similarly interested in this open issue.
Also looking forward to RAET, because losing connections to minions is a painful experience.
@basepi thanks, I will try that.
@sivann the kernel option net.ipv4.tcp_mtu_probing might be helpful on some minions.
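If the minions are already under Salt management, one way to pin that kernel option is a sysctl state; a minimal sketch (the state ID, file name, and value are assumptions about what suits your links):

# e.g. mtu_probing.sls (hypothetical state file)
enable_tcp_mtu_probing:
  sysctl.present:
    - name: net.ipv4.tcp_mtu_probing
    - value: 1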
Having exactly the same issues as @sivann.
I tried all the workarounds described here and in various other issues. I only have ~40 minions, all in different clouds (Azure, AWS, etc.). The master is in the Azure cloud; they have a firewall in place and don't allow any ICMP (it shouldn't matter, but just mentioning it).
The minions lose their connection frequently and salt is absolutely unusable, because every time I want to do something I have to go and restart salt-minion on every minion that has lost its connection; after a few days maybe 25% of my minions are still connected.
I have tried to use RAET a few times over the last months; every time, the codebase had serious issues with RAET support that broke it completely (see the various open RAET support issues).
At the end of the day this simple issue makes Salt completely unusable for any "remote configuration management" scenario. I am seriously considering switching to Ansible, even though I would have to rewrite a huge number of states into playbooks. I love Salt when it works, but this simple issue of minions losing their connection is so frustrating and so unacceptable that nothing else matters.
Sorry for the rant, but there are many issues about minions losing their connection, and I don't understand how anyone is actually using Salt over a WAN successfully. I have tried so many different servers, clouds, and configurations, and it happens every time; the thing is that a TCP connection is simply not guaranteed to stay up forever. Why is this not being handled at the application level, as suggested in this issue?
Seeing as this issue has been open for over a year, is there any intention from the Salt team to fix this (or to provide a working RAET implementation), or should I cut my losses and move to another solution?
@bymodude Sorry you're having so many problems!
It's interesting, because for every user who has a ton of problems with salt connections, we have hundreds of other users who are using it successfully. I'm not sure what's going on in your case! Have you tried hosting your master on another cloud, to see whether something is going on specifically with Azure?
Sorry for the RAET issues you've had. We've been focusing our efforts on a replacement TCP transport for ZMQ, which should give us more visibility into what's going on when things fail and in turn make it much easier to implement application-level keepalive. It also lets us continue the paradigm of only having to open ports on the master instead of on all the minions, which is a pretty big "gotcha" when it comes to RAET (or UDP in general).
You might also consider checking out salt-ssh. Although the use of ssh as the transport slows salt down considerably, most state runs will work out of the box with salt-ssh.
@basepi thanks for your prompt response
We have tried hosting the master elsewhere, which reduces the frequency of disconnected minions, but it still happens. Playing with the TCP keepalive kernel parameters also influences this: e.g. a keepalive of 1200 seconds leads to lots of down minions after about 12 hours, while a keepalive of 75 seconds leaves most minions still up after a few days, though some start dropping eventually after days/weeks.
These disconnects have always happened since we started using salt (around the 2014.7 release). However, having just tested 2015.5.3 and specifically tried to reproduce the disconnecting-minion issue (by raising our TCP keepalive to 1200, which seems to trigger it within ~12 hours, and by using one minion<->master connection that experiences packet loss, which triggers the disconnect even more easily), we are now actually seeing some errors on the minion side. I have opened #26888 and left a related comment in #25288 (we were trying a manual workaround, as suggested here: http://comments.gmane.org/gmane.comp.sysutils.salt.user/22264).
From my understanding, the salt-minion now actually has the mechanisms to detect these lost connections; at least we are seeing SaltReqTimeoutErrors on the minion side. The minion just does not seem to do the restart to reconnect.
We are using salt-ssh to bootstrap our minions, but using it for all states is not an option: at least in 2014.7 we found "most state runs will work out of the box" not to be the case. Since we do have a roster containing all minions from the salt-ssh bootstrapping, a feasible workaround for us may be to set up a cronjob that runs "salt '*' test.ping" every few minutes and then runs "salt-ssh dead_minions cmd.run 'service salt-minion restart'". It just feels clunky.
Just to add, regarding Azure, for the benefit of others who run minions or a master in the Azure cloud: the default TCP idle timeout on Azure is 4 minutes, so setting the TCP keepalive below that is essential; alternatively, increase the timeout on the Azure side as documented here: http://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer/
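For example, keeping the minion-side keepalive comfortably below Azure's 4-minute idle timeout might look like this in the minion config (the values are illustrative assumptions, not tested recommendations):

tcp_keepalive: True
tcp_keepalive_idle: 120     # first probe after 2 minutes idle, well under the 4-minute Azure timeout
tcp_keepalive_intvl: 30     # then probe every 30 seconds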
I will note that salt-ssh got a lot of improvements in 2015.5; you might consider testing there. I'd be interested to hear about any gaps in feature compatibility -- we want to give salt-ssh feature parity with normal salt, if not the same performance. (Mine and publish calls are particularly bad performance-wise right now.)
@bymodude using something like this on the minions should cause the minion to restart if it hits 2 SaltReqTimeoutErrors in a row:
ping_interval: 2
auth_timeout: 10
auth_tries: 2
auth_safemode: False
random_reauth_delay: 10
I haven't tested it in some time, because I moved my master to a stable server with a static IP. I was using a laptop with a dynamic IP address at one point, and this worked while moving between home and work :-)
@steverweber thanks for the suggestion, but I am already using that config and the minions still do not restart, even though SaltReqTimeoutErrors show up in the log, as documented in #26888.
This issue is still open. The ping patch improves things a lot, but in the long run we lose most minions. Of our now 830 minions, about 80% do not reply to test.ping, all upgraded to at least version 2015.8.8. Nothing ever shows up in the logs.
It's not apparent whether it's only networking or other bugs (auth issues?) manifesting quietly, but in its current state we unfortunately cannot consider salt for configuration management or orchestration; we only use it for initial installation. We have created more than 100 salt states and we would really like to see salt working for us.
I would be happy to provide more information, but I'm not sure how I could help debug.
The only suggestion I can offer is to drop both of your custom low-level transport implementations, including RAET, and use a well-proven mechanism like MQTT, with an MQTT broker on the salt-master. It would clean up salt's codebase, make protocol debugging much easier, and solve these issues. It would also open the door for thinner minion implementations, e.g. for embedded devices.
Example logs for a minion that no longer responds.
There is a _schedule.conf, with:
schedule:
__mine_interval: {function: mine.update, jid_include: true, maxrunning: 2, minutes: 60}
2016-05-16 09:34:03,620 [salt.utils.event ][DEBUG ][7013] MinionEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pub.ipc
2016-05-16 09:34:03,621 [salt.utils.event ][DEBUG ][7013] MinionEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pull.ipc
2016-05-16 09:34:03,624 [salt.utils.event ][DEBUG ][7013] Sending event - data = {'clear': False, 'cmd': '_mine', 'data': {}, 'id': 'ermine-c0.insolar-plants.net', '_stamp': '2016-05-16T08:34:03.623764'}
2016-05-16 09:34:03,626 [salt.minion ][DEBUG ][1995] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xbcermine-c0.insolar-plants.net\xa6_stamp\xba2016-05-16T08:34:03.623764'
2016-05-16 09:34:03,628 [salt.transport.zeromq ][DEBUG ][1995] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 09:34:03,629 [salt.crypt ][DEBUG ][1995] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 09:34:04,127 [salt.transport.zeromq ][DEBUG ][7013] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 09:34:04,128 [salt.crypt ][DEBUG ][7013] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 09:34:06,338 [salt.utils.schedule ][DEBUG ][7013] schedule.handle_func: Removing /var/cache/salt/minion/proc/20160516093403600494
2016-05-16 10:04:00,595 [salt.transport.zeromq ][DEBUG ][1995] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 10:04:00,596 [salt.crypt ][DEBUG ][1995] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 10:34:03,594 [salt.utils.schedule ][INFO ][1995] Running scheduled job: __mine_interval
2016-05-16 10:34:03,595 [salt.utils.schedule ][DEBUG ][1995] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2016-05-16 10:34:03,596 [salt.utils.schedule ][DEBUG ][1995] schedule: This job was scheduled with a max number of 2
2016-05-16 10:34:03,614 [salt.utils.schedule ][DEBUG ][7061] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20160516103403600217', 'pid': 7061, 'id': 'ermine-c0.insolar-plants.net', 'schedule': '__mine_interval'}
2016-05-16 10:34:03,620 [salt.utils.event ][DEBUG ][7061] MinionEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pub.ipc
2016-05-16 10:34:03,621 [salt.utils.event ][DEBUG ][7061] MinionEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_73b7be1ff6_pull.ipc
2016-05-16 10:34:03,624 [salt.utils.event ][DEBUG ][7061] Sending event - data = {'clear': False, 'cmd': '_mine', 'data': {}, 'id': 'ermine-c0.insolar-plants.net', '_stamp': '2016-05-16T09:34:03.623898'}
2016-05-16 10:34:03,627 [salt.minion ][DEBUG ][1995] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xbcermine-c0.insolar-plants.net\xa6_stamp\xba2016-05-16T09:34:03.623898'
2016-05-16 10:34:03,630 [salt.transport.zeromq ][DEBUG ][1995] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 10:34:03,632 [salt.crypt ][DEBUG ][1995] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 10:34:04,128 [salt.transport.zeromq ][DEBUG ][7061] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506', 'aes')
2016-05-16 10:34:04,129 [salt.crypt ][DEBUG ][7061] Initializing new SAuth for ('/etc/salt/pki/minion', 'ermine-c0.insolar-plants.net', 'tcp://54.246.180.52:4506')
2016-05-16 10:34:06,551 [salt.utils.schedule ][DEBUG ][7061] schedule.handle_func: Removing /var/cache/salt/minion/proc/20160516103403600217
@cachedout Pinging you so we have a current core dev in the conversation.
@sivann Hi there.
I'm not sure I would agree that the ZeroMQ transport is a "custom low-level implementation". It's a series of fairly well-defined patterns on top of PyZMQ, using hooks provided by that library. Perhaps you just meant RAET and TCP? If so, I would agree.
I'm aware of MQTT and one of the reasons that the salt transport system has been designed in a pluggable manner is to allow easy exploration into the feasibility of adding that support. It's not something that's on the roadmap at present.
Regarding the issues you've been facing, we'd have to debug this as one would debug any other networking problem. We'd likely need to look at packet captures from failed minions, and we'd need to know the state of the sockets on both the minion side and the master side. You didn't say whether this was on the publish side or the return side, so that's the first determination we'd need to make.
@cachedout, thank you for your help; I will try to get packet captures.
We also tried the new TCP transport: same behaviour, the minion was lost after 3 days. The minion's side shows ESTABLISHED and logs nothing; the master's side shows no connection. As I stated before, TCP keepalives are not trustworthy; they are terminated locally in some networks.
@cachedout There is nothing to capture; strace shows the minion just polling its sockets:
[pid 29931] gettimeofday({1488274468, 901392}, NULL) = 0
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677798, 684207835}) = 0
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 0) = 0 (Timeout)
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 995) = 0 (Timeout)
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677799, 681042172}) = 0
[pid 29931] gettimeofday({1488274469, 898725}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 899251}, NULL) = 0
[pid 29931] waitpid(29932, 0xbfe1f7f4, WNOHANG) = 0
[pid 29931] gettimeofday({1488274469, 899689}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 900043}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 900350}, NULL) = 0
[pid 29931] gettimeofday({1488274469, 900561}, NULL) = 0
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677799, 683313820}) = 0
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 0) = 0 (Timeout)
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 996) = 0 (Timeout)
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677800, 681053958}) = 0
[pid 29931] gettimeofday({1488274470, 898734}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 899271}, NULL) = 0
[pid 29931] waitpid(29932, 0xbfe1f7f4, WNOHANG) = 0
[pid 29931] gettimeofday({1488274470, 899710}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 900071}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 900379}, NULL) = 0
[pid 29931] gettimeofday({1488274470, 900598}, NULL) = 0
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677800, 683347726}) = 0
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 0) = 0 (Timeout)
[pid 29931] poll([{fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}, {fd=18, events=POLLIN}], 6, 995) = 0 (Timeout)
[pid 29931] clock_gettime(CLOCK_MONOTONIC, {677801, 680110054}) = 0
[pid 29931] gettimeofday({1488274471, 898951}, NULL) = 0
[pid 29931] gettimeofday({1488274471, 900609}, NULL) = 0
[pid 29931] waitpid(29932, 0xbfe1f7f4, WNOHANG) = 0
Please do not implement more access methods; please just fix one of the current implementations. If you don't use your own keepalive payload, this will always be broken. Would a support contract help you move this forward? We have invested too much time in salt and are very sorry to see this basic functionality broken.
@sivann, I am seeing/troubleshooting this as well on my network. We have a DMZ, and connections from our non-DMZ master to the DMZ minions have been getting closed.
When you strace the minion and it shows that it "just polls its socket", what do you see in the minion's logs? Are you running with "log_level_logfile: all" or equivalent? What are the last actions logged by the minion? Also, what does the following command give you?
ss -nao | grep -P ':450(5|6)\s+'
In my case, I believe our DMZ firewall is ignoring the TCP keepalives passing through it, and since we do not (yet) have a scheduled master->minion ping, the firewall destroys the connection after about 60 minutes of what it considers inactivity. I am still troubleshooting, however.
Here is an update.
On my non-firewalled network I have minions that still report timeouts, but they recover adequately:
2017-03-05 13:20:12,920 [salt.transport.zeromq ][DEBUG ][27814] SaltReqTimeoutError, retrying. (1/7)
2017-03-05 13:20:17,920 [salt.transport.zeromq ][DEBUG ][27814] SaltReqTimeoutError, retrying. (2/7)
2017-03-05 13:20:22,920 [salt.transport.zeromq ][DEBUG ][27814] SaltReqTimeoutError, retrying. (3/7)
With this in mind, I configured my DMZ-firewalled systems with an increased "auth_tries" value of 20 instead of the default of 7. This seems to have done the trick for me. I could not get the "set the timeouts really low to force the code to restart the salt-minion child" method to work; I always wound up with my DMZ-firewalled systems running only the parent after it attempted, and failed, to restart the child.
Here is what my DMZ-firewalled systems report in their logs for timeouts:
2017-03-05 13:20:18,155 [salt.transport.zeromq ][DEBUG ][2841] SaltReqTimeoutError, retrying. (1/20)
2017-03-05 13:20:28,157 [salt.transport.zeromq ][DEBUG ][2841] SaltReqTimeoutError, retrying. (2/20)
2017-03-05 13:20:38,158 [salt.transport.zeromq ][DEBUG ][2841] SaltReqTimeoutError, retrying. (3/20)
Hope that helps someone.
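For reference, the change described above is a single minion config option, shown here in isolation (the rest of the config is assumed unchanged):

auth_tries: 20      # raised from the default of 7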
@PeterS242 Hi Peter, as I stated above, the keepalive counters count on one end only (the minion's), because on SAT networks keepalive is proxied (emulated). Nothing shows up in the minion logs.
There is some good news: testing the TCP transport with the latest 2016.11.x, we have now had 2 complete weeks in which all 3 test minions stayed reachable.
At Link Labs we are facing a similar issue with minions on 3G and Satcom connections. We have the minion running under supervisor, which helps some, but it would be very helpful to get a fuller picture of what minion/master settings you used to achieve your reliability, @sivann.
I'm also for application level keepalives. 👍
I also second @emeryray02's request for more details @sivann.
I have a master in AWS and minions in the field. After several hours, the minions lose connectivity to the salt master, but they do not realize it. I have configured reasonable TCP keepalives, but AWS does not seem to be respecting them.
ZeroMQ 4.2 includes heartbeat functionality; maybe that can be used.
@riq-dfaubion we used the TCP transport as an experiment, i.e. "transport: tcp" rather than 0MQ, and that seemed to work in a short test. But we have not found the time to switch over, since we have > 1000 minions, and there are several issues that prevent me from trusting salt for orchestration (several broken releases, version incompatibilities, short-term OS support, half-baked LDAP). We just use it for commissioning now.
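For anyone who wants to run the same experiment, the transport is selected with a single option that has to match on the master and the minions; a minimal sketch (check the docs for your version before switching a large fleet):

# in both /etc/salt/master and /etc/salt/minion
transport: tcp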
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.