I have a minion connecting to a master over the Internet. The routing between these two servers is a bit unstable, i.e. there is some packet loss, but many other services run fine over the same link; only Salt is having bigger problems.
One server is in the Azure cloud, which does not allow ICMP, so here is a tcpping to give an idea of the packet loss encountered:
root@minion1:~# ./tcpping salt-master.example.com 4506
seq 0: tcp response from salt-master.example.com (1.1.1.1) [open] 310.518 ms
seq 1: tcp response from salt-master.example.com (1.1.1.1) [open] 314.011 ms
seq 2: tcp response from salt-master.example.com (1.1.1.1) [open] 310.209 ms
seq 3: tcp response from salt-master.example.com (1.1.1.1) [open] 310.258 ms
seq 5: tcp response from salt-master.example.com (1.1.1.1) [open] 310.698 ms
seq 4: no response (timeout)
seq 7: tcp response from salt-master.example.com (1.1.1.1) [open] 313.063 ms
seq 8: tcp response from salt-master.example.com (1.1.1.1) [open] 309.968 ms
seq 6: no response (timeout)
seq 9: tcp response from salt-master.example.com (1.1.1.1) [open] 313.527 ms
seq 10: tcp response from salt-master.example.com (1.1.1.1) [open] 311.129 ms
seq 11: tcp response from salt-master.example.com (1.1.1.1) [open] 310.477 ms
seq 12: tcp response from salt-master.example.com (1.1.1.1) [open] 313.292 ms
seq 13: tcp response from salt-master.example.com (1.1.1.1) [open] 313.591 ms
seq 15: tcp response from salt-master.example.com (1.1.1.1) [open] 314.750 ms
seq 16: tcp response from salt-master.example.com (1.1.1.1) [open] 313.983 ms
seq 14: no response (timeout)
seq 17: tcp response from salt-master.example.com (1.1.1.1) [open] 310.181 ms
seq 19: tcp response from salt-master.example.com (1.1.1.1) [open] 312.813 ms
here is "salt-minion -l debug", trying "salt minion1 test.ping" from the master only works about 1 out of 10 times, just wondering if the errors below are normal in this situation of a not 100% stable connection and also if its normal to expect salt to work this badly over this kind of connection (pretty much unusable, whereas other services are working fine even though there is some packet loss)
[DEBUG ] Ping master
[DEBUG ] Re-using SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 1 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 2 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 3 of 3)
[INFO ] fire_master failed: Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 992, in _fire_master
result = channel.send(load, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 314, in send
return self._crypted_transfer(load, tries, timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 302, in _crypted_transfer
return _do_transfer()
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 293, in _do_transfer
timeout)
File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 273, in send
'SaltReqTimeoutError: after {0} seconds, ran {1} tries'.format(timeout * tried, tried)
SaltReqTimeoutError: SaltReqTimeoutError: after 180 seconds, ran 3 tries
[DEBUG ] Ping master
[DEBUG ] Re-using SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 1 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 2 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 3 of 3)
[INFO ] fire_master failed: Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 992, in _fire_master
result = channel.send(load, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 314, in send
return self._crypted_transfer(load, tries, timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 302, in _crypted_transfer
return _do_transfer()
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 293, in _do_transfer
timeout)
File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 273, in send
'SaltReqTimeoutError: after {0} seconds, ran {1} tries'.format(timeout * tried, tried)
SaltReqTimeoutError: SaltReqTimeoutError: after 180 seconds, ran 3 tries
I experience similar problems with other minions; all connect to the master over the Internet. You could argue that the connection in the example above is not good, but over a WAN connecting multiple clouds and many servers you have to expect a few of these less-than-ideal links. Other services between the same servers (mostly HTTP/HTTPS based) run fine. Is it normal that ZMQ is this sensitive to packet loss, given that other services recover without problems, while with ZMQ I always seem to have minions losing their connection to the master?
I wanted to try RAET, but it has been broken for a few releases now; I tried following the RAET bugs and the patches there but sadly could never get it to work.
I just observed that this also happens on a minion connecting to the master via WAN over a 100% stable connection. There are firewalls involved, so the TCP connection eventually gets dropped, and the exact same issue occurs; the lossy connections just show the problem much more frequently.
Looking at the stuck minion running in debug mode: the master returns "Minion did not return. [No response]" while the minion still attempts its ping every 5 minutes. From the logs below, the "SaltReqTimeoutError" would imply that the minion correctly identified that its current TCP connection is broken, and given my minion config it should then restart itself? (A manual restart of the minion immediately restores the connection for me, until it fails again.)
Log output on the minion showing that the master ping still happens after the minion gets stuck; however the auth_safemode: False setting (which should make the minion restart automatically?) does not seem to kick in, and instead this "fire_master failed" traceback is shown:
[DEBUG ] Ping master
[DEBUG ] Re-using SAuth for ('/etc/salt/pki/minion', 'de-vultr', 'tcp://1.1.1.1:4506')
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 1 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 2 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 3 of 3)
[INFO ] fire_master failed: Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 992, in _fire_master
result = channel.send(load, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 314, in send
return self._crypted_transfer(load, tries, timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 302, in _crypted_transfer
return _do_transfer()
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 293, in _do_transfer
timeout)
File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 273, in send
'SaltReqTimeoutError: after {0} seconds, ran {1} tries'.format(timeout * tried, tried)
SaltReqTimeoutError: SaltReqTimeoutError: after 180 seconds, ran 3 tries
minion config excerpt
rejected_retry: True
auth_tries: 5
auth_safemode: False
ping_interval: 5
tcp_keepalive: True
minion versions
root@minion1:~# salt-minion --versions-report
Salt: 2015.5.3
Python: 2.7.3 (default, Mar 13 2014, 11:03:55)
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.6
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 13.1.0
RAET: Not Installed
ZMQ: 3.2.3
Mako: Not Installed
Tornado: Not Installed
Debian source package: 2015.5.3+ds-1~bpo70+2
master versions
root@salt-master:~# salt-master --versions-report
Salt: 2015.5.3
Python: 2.7.9 (default, Mar 1 2015, 12:57:24)
Jinja2: 2.7.3
M2Crypto: 0.21.1
msgpack-python: 0.4.2
msgpack-pure: Not Installed
pycrypto: 2.6.1
libnacl: 1.4.3
PyYAML: 3.11
ioflo: 1.3.9
PyZMQ: 14.4.0
RAET: 0.6.3
ZMQ: 4.0.5
Mako: Not Installed
Tornado: Not Installed
Debian source package: 2015.5.3+ds-1~bpo70+2
please try these settings...
ping_interval: 5
auth_timeout: 10
random_reauth_delay: 10
auth_tries: 2
auth_safemode: False
@steverweber I just added your exact values to my config and will report back whether it helped.
I can't find any mention of restart_on_error in the documentation, but it sounds promising.
I applied that config and also happened to catch a minion showing the "SaltReqTimeoutError" errors. I immediately ran test.ping from the master, which failed, but after some time it worked again! I would have expected some info in the salt-minion debug output if it "restarted", or would it not show any output?
[DEBUG ] Ping master
[DEBUG ] Re-using SAuth for ('/etc/salt/pki/minion', 'de-vultr', 'tcp://1.1.1.1:4506')
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 1 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 2 of 3)
exactly at this point I ran test.ping from the master, with the result "Minion did not return. [No response]"
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 3 of 3)
[INFO ] fire_master failed: Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 992, in _fire_master
result = channel.send(load, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 314, in send
return self._crypted_transfer(load, tries, timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 302, in _crypted_transfer
return _do_transfer()
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 293, in _do_transfer
timeout)
File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 273, in send
'SaltReqTimeoutError: after {0} seconds, ran {1} tries'.format(timeout * tried, tried)
SaltReqTimeoutError: SaltReqTimeoutError: after 180 seconds, ran 3 tries
[DEBUG ] Ping master
[DEBUG ] Re-using SAuth for ('/etc/salt/pki/minion', 'de-vultr', 'tcp://1.1.1.1:4506')
When I saw the next "[DEBUG ] Ping master" I went to the master and ran test.ping again, and this time it worked, as seen in the minion log that follows, so salt-minion did in fact manage to recover somehow:
[INFO ] User root Executing command test.ping with jid 20150905152732565088
[DEBUG ] Command details {'tgt_type': 'glob', 'jid': '20150905152732565088', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': [], 'fun': 'test.ping'}
[INFO ] Starting a new job with PID 4541
[DEBUG ] LazyLoaded test.ping
[INFO ] Returning information for job: 20150905152732565088
[DEBUG ] Re-using SAuth for ('/etc/salt/pki/minion', 'de-vultr', 'tcp://1.1.1.1:4506')
Unfortunately I have to add that the same thing just happened again (i.e. I see SaltReqTimeoutError, run test.ping from the master, etc.). The second time around the salt-minion did not recover after the next "Ping master"; however a "[INFO ] Running scheduled job: __mine_interval" started and after that test.ping from the master worked again. Not sure if it needs two failed "Ping master" cycles to recover; maybe I can tweak the ping intervals.
For testing this I have set the TCP keepalive on the minions to 5 minutes or so, and Azure (where my master is) kills the TCP connection after 4 minutes, so it's a pretty reliable testbed for making minions drop their connection; it just takes 5 or 10 minutes to start getting failed connections this way.
We should see a log message: ** Restarting minion **
https://github.com/saltstack/salt/blob/c390e812d8ce19595b0a78b5f8918f4f464055a5/salt/scripts.py
but it does not seem to be passing the error up the stack.
[INFO ] fire_master failed: Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 992, in _fire_master
result = channel.send(load, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 314, in send
return self._crypted_transfer(load, tries, timeout)
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 302, in _crypted_transfer
return _do_transfer()
File "/usr/lib/python2.7/dist-packages/salt/transport/__init__.py", line 293, in _do_transfer
timeout)
File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 273, in send
'SaltReqTimeoutError: after {0} seconds, ran {1} tries'.format(timeout * tried, tried)
SaltReqTimeoutError: SaltReqTimeoutError: after 180 seconds, ran 3 tries
It seems that /usr/lib/python2.7/dist-packages/salt/minion.py in _fire_master just logs the error and returns, without passing it up to salt/scripts.py.
So if you're game, could you add some code to salt/minion.py
on one of the minions...
def _fire_master(self, data=None, tag=None, events=None, pretag=None, timeout=60):
    '''
    Fire an event on the master, or drop message if unable to send.
    '''
    load = {'id': self.opts['id'],
            'cmd': '_minion_event',
            'pretag': pretag,
            'tok': self.tok}
    if events:
        load['events'] = events
    elif data and tag:
        load['data'] = data
        load['tag'] = tag
    elif not data and tag:
        load['data'] = {}
        load['tag'] = tag
    else:
        return
    channel = salt.transport.Channel.factory(self.opts)
    try:
        result = channel.send(load, timeout=timeout)
        return True
+   except (SaltReqTimeoutError) as exc:
+       raise exc
    except Exception:
        log.info('fire_master failed: {0}'.format(traceback.format_exc()))
        return False
That might do the trick... but the error could get swallowed again somewhere else and need to be raised there as well.
I think I have a solution that is less problematic, but it will take some development time. Perhaps later this month I'll have some time.
thanks @steverweber I will try your code modification
thanks for taking the time!
FYI, since this keepalive handling is broken in the Salt releases, I'm going to deploy a cron job workaround to my minions... It's a royal pain to lose a minion in a locked-down environment, so I hope this does the magic in that time of need.
# /state/global/salt_keepalive.sls
# TODO: should use grains['os'] to change strategy to restart service
{{ tpldot }}:
cron.present:
- name: /bin/sh -c "if salt-call -t 20 test.ping ; then echo 'connected'; else service salt-minion restart; fi"
- identifier: salt_keepalive
- user: root
- minute: '*/5'
After the code modification I now get an additional [CRITICAL] entry in the logs as shown below; the behavior is unchanged though (minion not responding, etc.).
[DEBUG ] Ping master
[DEBUG ] Re-using SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 1 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 2 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 3 of 3)
[CRITICAL] An exception occurred while polling the minion
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1879, in tune_in
self._fire_master('ping', 'minion_ping')
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 995, in _fire_master
raise exc
SaltReqTimeoutError: SaltReqTimeoutError: after 180 seconds, ran 3 tries
Regarding the cron job: when I try "salt-call -t 20 test.ping" on a stuck minion it just successfully returns "local: True". I was actually wondering how I could detect that salt-minion has lost its connection (without watching the log file for the SaltReqTimeoutError lines).
Hmm, thanks for that pointer about test.ping. Perhaps there is another call we can use... the same one used to perform "[DEBUG ] Ping master" would be it.
Now that it comes to mind, I think _fire_master might run in another thread. I'll take another look at the source.
As a workaround for that test.ping, try:
event.fire_master test ping
It's so strange: "salt-call -t 20 event.fire_master test ping" also returns "local: True" on a stuck minion - is that supposed to contact the master? The minion log looks like this when I try it, so the event.fire_master does seem to happen, but from the master's side this minion still just gives "Minion did not return. [No response]" on test.ping:
[DEBUG ] Ping master
[DEBUG ] Re-using SAuth for ('/etc/salt/pki/minion', 'de-vultr', 'tcp://23.99.5.65:4506')
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 1 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 2 of 3)
[INFO ] SaltReqTimeoutError: after 60 seconds. (Try 3 of 3)
[CRITICAL] An exception occurred while polling the minion
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1879, in tune_in
self._fire_master('ping', 'minion_ping')
File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 995, in _fire_master
raise exc
SaltReqTimeoutError: SaltReqTimeoutError: after 180 seconds, ran 3 tries
[DEBUG ] Handling event 'salt/event/new_client\n\n\x81\xa6_stamp\xba2015-09-05T17:14:28.280315'
[DEBUG ] Handling event 'fire_master\n\n\x85\xa6pretag\xc0\xa6_stamp\xba2015-09-05T17:14:29.281810\xa3tag\xa4ping\xa4data\xa4test\xa6events\xc0'
[DEBUG ] Forwarding master event tag=ping
[DEBUG ] Re-using SAuth for ('/etc/salt/pki/minion', 'de-vultr', 'tcp://23.99.5.65:4506')
Yes, that does hit the master, so your minion can contact the master... but master to minion is broken. I have reviewed the code and so much has changed that it's tricky to follow everything... but I'll keep digging for a little longer.
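To make that split visible from the minion side, something like this quick check can help (a rough sketch, not Salt code; it assumes Linux with ss installed, the default ports 4505/4506, and the placeholder master IP from the sanitized logs): a brand-new TCP connection to the master usually still succeeds, while the long-lived sockets the running salt-minion process holds are the ones that have silently died.
import socket
import subprocess
MASTER = "1.1.1.1"   # placeholder master IP, as in the sanitized logs above
for port in (4505, 4506):    # 4505 = publish, 4506 = return
    try:
        socket.create_connection((MASTER, port), timeout=10).close()
        print("new connection to port %d: OK" % port)
    except socket.error as exc:
        print("new connection to port %d: FAILED (%s)" % (port, exc))
# and the long-lived sockets the running salt-minion is holding (run as root to see process names):
print(subprocess.check_output(["ss", "-tnp", "dst", MASTER]).decode())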
ok: https://github.com/steverweber/salt/tree/keepalive
it needs some testing... so you could do something like:
# on the minion
curl -L https://bootstrap.saltstack.com | sh -s -- -g https://github.com/steverweber/salt.git git keepalive
@steverweber thanks so much for your help. Unfortunately I just tested and it still shows the same behavior (although the SaltReqTimeoutErrors are gone from the logs now - but I thought those were a good thing, as that is when the restart of the minion should happen).
salt-minion --versions-report
Salt Version:
Salt: 2015.8.0rc3-94-g374c7de
Dependency Versions:
Jinja2: 2.6
M2Crypto: 0.21.1
Mako: Not Installed
PyYAML: 3.10
PyZMQ: 13.1.0
Python: 2.7.3 (default, Mar 13 2014, 11:03:55)
RAET: Not Installed
Tornado: 4.2.1
ZMQ: 3.2.3
ioflo: Not Installed
libnacl: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.1.10
pycrypto: 2.6
timelib: Not Installed
System Versions:
dist: debian 7.8
machine: x86_64
release: 3.2.0-4-amd64
system: debian 7.8
Test Procedure:
1) "salt-minion -l debug" on minion 1, TCP Keepalive set to 5 minutes, master is in Azure Cloud and Azure Firewall will tear down connection after 4 minutes by default
2) "salt minion1 test.ping" on master -> successfull
3) wait ~5-10 Minutes, "salt minion1 test.ping" on master -> fails with "Minion did not return. [No response]" (other minions that have lower TPC Keepalive still responding, only the minion with high tcp keepalive fails so quickly)
4) on minion1: "salt-call -t 20 test.ping" -> "local: True" success / "salt-call -t 20 event.fire_master test ping" -> "local: True" success (so these 2 commands are returning success while test.ping from master side is failing)
5) restart salt-minion on minion1 manually -> test.ping on master starts working again
salt-minion logs (can see first successful test.ping command from master, then can see event.fire_master from minion side later in the logs, in between the connection failed, however with the new version no more SaltReqTimeoutError's are shown, 2015.5.3 would show the SaltReqTimeouError's around 5 minutes after the first connection)
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] schedule.handle_func: Removing /var/cache/salt/minion/proc/20150906022742104874
[INFO ] User root Executing command test.ping with jid 20150906022751211090
[DEBUG ] Command details {'tgt_type': 'glob', 'jid': '20150906022751211090', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': [], 'fun': 'test.ping'}
[INFO ] Starting a new job with PID 11458
[DEBUG ] LazyLoaded test.ping
[INFO ] Returning information for job: 20150906022751211090
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] Handling event 'fire_master\n\n\x85\xa6pretag\xc0\xa6_stamp\xba2015-09-06T02:51:50.495255\xa3tag\xa4ping\xa4data\xa4test\xa6events\xc0'
[DEBUG ] Forwarding master event tag=ping
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] Handling event 'fire_master\n\n\x85\xa6pretag\xc0\xa6_stamp\xba2015-09-06T02:57:17.519208\xa3tag\xa4ping\xa4data\xa4test\xa6events\xc0'
[DEBUG ] Forwarding master event tag=ping
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://1.1.1.1:4506')
That is strange... I should see some SaltReqTimeoutError... unless that ping worked. I'll give it a test on my box.
It seems to work on my end, but I'm cheating... I start the minion with the master running... then I take the master offline...
[DEBUG ] MinionEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_c5ec2b4aea_pull.ipc
[DEBUG ] Sending event - data = {'clear': False, 'cmd': '_mine', 'data': {}, 'id': 'salt.math.private.uwaterloo.ca', '_stamp': '2015-09-06T03:25:09.951590'}
[DEBUG ] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xbesalt.math.private.uwaterloo.ca\xa6_stamp\xba2015-09-06T03:25:09.951590'
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'salt.math.private.uwaterloo.ca', 'tcp://172.27.7.49:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'salt.math.private.xxxx.ca', 'tcp://172.27.7.49:4506')
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'salt.math.private.uwaterloo.ca', 'tcp://172.27.7.49:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'salt.math.private.uwaterloo.ca', 'tcp://172.27.7.49:4506')
[DEBUG ] schedule.handle_func: Removing /var/cache/salt/minion/proc/20150905232509938582
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'salt.math.private.uwaterloo.ca', 'tcp://172.27.7.49:4506', 'aes')
[DEBUG ] Initializing new SAuth for ('/etc/salt/pki/minion', 'salt.math.private.xxxxx.ca', 'tcp://172.27.7.49:4506')
[INFO ] fire_master failed: Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/salt/minion.py", line 934, in _fire_master
result = channel.send(load, timeout=timeout)
File "/usr/lib/python2.7/site-packages/salt/utils/async.py", line 93, in wrap
ret = self._block_future(ret)
File "/usr/lib/python2.7/site-packages/salt/utils/async.py", line 103, in _block_future
return future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 208, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 810, in run
yielded = self.gen.throw(*sys.exc_info())
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 220, in send
ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 807, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 208, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 810, in run
yielded = self.gen.throw(*sys.exc_info())
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 193, in _crypted_transfer
ret = yield _do_transfer()
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 807, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 208, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 810, in run
yielded = self.gen.throw(*sys.exc_info())
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 179, in _do_transfer
timeout=timeout,
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 807, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 208, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
SaltReqTimeoutError: Message timed out
[ERROR ] ** Master Ping failed. Attempting to restart minion**
[INFO ] delaying random_reauth_delay 10s
Perhaps we are coming at this problem from the wrong direction. Do all your minions have this problem? If so, then perhaps the system running the master has some strange issues.
Stopping the master may be different from my case, where merely the TCP connection is "timed out" by a firewall between master and minion.
At the moment this situation only happens on minions where I deliberately set the TCP keepalive to a value higher than the firewall's timeout; this is done to trigger the problem quickly. In real life I have set a lower TCP keepalive, but the minions would still sometimes disconnect (just after days/weeks, which makes it harder to track/analyze) - maybe due to a different problem, but my initial guess was that it is related to this same TCP keepalive issue, since the symptoms are basically the same.
I'm inclined to think the firewall is doing something "fancy". Is the firewall running with some interesting sysctl settings... is ip_conntrack_tcp_timeout_established set to 5 days? I'm weak in this area... perhaps let the minion run overnight and see if anything changes. /sorry, I have to tap out/
The firewall is just the Microsoft Azure cloud; all VMs there are behind it. I do know that it blocks all ICMP and that its TCP connection timeout is 4 minutes.
I did a quick tcpdump on the minion, and it is not receiving any packets from the master when I run test.ping on the master and it returns "Minion did not return", so yes, at this point it is probably the master's side that is at fault.
Could you try the keepalive code on a few minions for a week? Perhaps if they are sitting around, the SaltReqTimeoutError is more likely...
Regarding the root cause/solution, I think I have to pass the torch. Perhaps one of the main (already overworked) developers @basepi @UtahDave has some ideas why Azure should not be used for a Salt master, or how to fix it.
Hello,
we have the same problem.
Our salt-master is on an Azure VM and all of the minions are outside Azure.
We lose communication with all of our minions (we tested with nearly 20 minions) after 3-4 minutes.
We have tested with many versions: 2014.5, 2015.0, 2015.3 (master & minion).
Now we will test the RAET module between the master & minions.
Here is a /bad/ solution, /untested/; it is only helpful if the minions are online for something like a week. @code42fr, sorry, this is not much help if the minions die after 4 minutes - you could try the tcp_keepalive trick to extend that time.
cron job to run on master:
*/2 * * * * salt \* cmd.run 'echo > /tmp/salt_keepalive'
cron state for minions
{{ tpldot }}:
cron.present:
- name: /bin/sh -c "if test "`find /tmp/salt_keepalive -mmin +7`" ; then service salt-minion restart; fi"
- identifier: salt_keepalive
- minute: '*/4'
@code42fr you can solve the 4-minute disconnect problem with the settings shown below in the minion config.
salt-minion config tcp keepalive settings for azure
tcp_keepalive: True
tcp_keepalive_idle: 75
tcp_keepalive_cnt: 9
tcp_keepalive_intvl: 75
Sorry for so much text below; I am just writing down what I know at this point and don't have time right now to articulate it more concisely.
First I thought a tcp_keepalive_idle of 200 (seconds) would be enough, but the relevant timeout somehow seems to be ~120 seconds (the Azure TCP timeout should be 4 minutes, but something else must be in play as well, since 200 seconds for the keepalive was not working); a value of 75 seconds works for me now.
With tcp_keepalive_idle set to 200 seconds or higher, I get the situation documented above where the minion is still able to connect to the master after ~4 minutes, but the master can no longer reach the minion. tcpdump shows the master sending a packet (to the established TCP connection/socket?), which never shows up in tcpdump on the minion side; I assume Azure is dropping those packets since it has somehow decided to drop the connection. I am by no means a network expert and do not really understand all the details of TCP sockets and under what circumstances they time out.
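As far as I understand it, the tcp_keepalive_* minion options just end up as ordinary TCP keepalive socket options on the connection to the master (ZMQ exposes the same knobs), which is why the idle value has to sit below whatever idle timeout the firewall in the middle enforces. A rough illustration in plain Python, not Salt code (option names assume Linux):
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # tcp_keepalive: True
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 75)   # tcp_keepalive_idle: idle seconds before the first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 75)  # tcp_keepalive_intvl: seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 9)     # tcp_keepalive_cnt: failed probes before the socket is declared dead
s.connect(("salt-master.example.com", 4506))                # hostname/port as in the tcpping example above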
I have tried "netstat -otn" on the master with "stuck minion" which shows an established connection:
If I use a high tcp_keepalive_idle=200, I see an established connection ("ESTABLISHED on" with a timeout counter of 120 seconds) between master and minion with that command when the master has problems reaching the minion; the master tries to send a packet to the minion (presumably over this established connection), but the minion never receives it.
Compared to that, on a minion with tcp_keepalive_idle set to 75 I do NOT see such a connection with "netstat -otn" (all connections show "ESTABLISHED off"), and the packet the master sends during test.ping correctly reaches the minion.
So at the end of the day: yes, the root cause may be an "unstable" TCP connection, or a firewall between master and minion that terminates idle TCP connections after X seconds. However, I don't think blaming Azure's firewall (or other firewalls out there) and moving on is entirely correct - it's fine to blame them, but ideally the master could detect such a "stuck connection" state and recover from it.
What seems to be missing at the moment is a way for the master to detect this kind of TCP connection that still shows as "ESTABLISHED on" but whose packets are silently dropped on the way to the minion by the network/firewall.
When building systems that rely on a long-lived TCP connection, my understanding is that at the pure TCP level it is essential to use a "correct" TCP keepalive idle value, "correct" meaning below the idle timeout of any firewall or router in the path. In a WAN scenario, though, the configuration of the firewalls and routers in between is unknown; normally the default TCP idle timeout on a firewall is reasonably high (e.g. 7200 seconds on most Linux systems?), but on Azure, for example, it is not.
So even if we set the keepalive to 75 seconds, there may suddenly be a firewall/router somewhere that times out idle connections after 60 seconds. My understanding is that the master/server side just sends the packet over the established connection; at the TCP level it does not know that the packet gets dropped. But could the master's application logic not implement a mechanism that says: after X seconds without a response, establish a NEW TCP connection instead and see if that gets a response? I really don't know whether that is feasible, or whether the master/ZMQ code even has access to the TCP stack at that level.
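For what it's worth, the generic ZeroMQ answer to exactly this seems to be the "Lazy Pirate" client pattern from the ZMQ guide: poll for the reply with a timeout and, if nothing arrives, close the REQ socket (LINGER 0) and connect a fresh one instead of retrying over the possibly dead connection. A minimal standalone sketch of the idea, not Salt code (the master address is the placeholder from the logs above, and whether Salt's crypto/auth layer allows simply rebuilding the channel like this is exactly the open question):
import zmq
MASTER = "tcp://1.1.1.1:4506"   # placeholder address taken from the sanitized logs
TIMEOUT_MS = 10000              # how long to wait for a reply
RETRIES = 3
ctx = zmq.Context()
def fresh_socket():
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)   # do not block on close if the peer is gone
    sock.connect(MASTER)
    return sock
def request(payload):
    # e.g. request(b"ping")
    sock = fresh_socket()
    try:
        for attempt in range(RETRIES):
            sock.send(payload)
            if sock.poll(TIMEOUT_MS, zmq.POLLIN):
                return sock.recv()
            # no reply in time: assume the connection is dead, rebuild it and retry
            sock.close()
            sock = fresh_socket()
        raise RuntimeError("no reply from %s after %d tries" % (MASTER, RETRIES))
    finally:
        sock.close()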
Personally I suspect quite a few people (not only Azure users) run into this exact problem, and that many cases of "minion not responding" are directly related to this issue. It is not easy to realize what is going on, either: when I started investigating I checked that the connection between the servers was fine, etc., but the only thing that really shows what is happening is running tcpdump on both master and minion while the problem occurs. Since the default idle timeout on most firewalls is quite high (possibly 3600 or 7200 seconds), the problem does not occur immediately after a minion restart, so it is hard to catch it in the act.
@steverweber I will try to put your keepalive branch on some minions in the next few days and then let it run for a week or two. We hope to see a SaltReqTimeoutError at some point and then see the minion restart itself after those errors, right?
To recap:
1) Minions not reachable from the master side 3-4 minutes after starting the minion (with the master on an Azure VM) is "fixed" by setting tcp_keepalive_idle: 75 in the minion config.
2) When leaving tcp_keepalive_idle at the Linux default (e.g. 7200 seconds on most systems), the 2015.5.3 minion shows SaltReqTimeoutError errors after ~5 minutes but never restarts by itself (using ping_interval: 5, auth_safemode: False and the other parameters). event.fire_master still returns successfully and reaches the master, while test.ping from the master side fails.
-> When using your keepalive-branch minion code I am NOT seeing these SaltReqTimeoutErrors anymore; the minion was also not restarting by itself (well, it shows no SaltReqTimeoutErrors or any other errors, so I wouldn't expect it to restart). event.fire_master also returns successfully.
What is strange is that this "event.fire_master ping" from the minion side always seems to work, even on the 2015.5.3 minion while it is showing SaltReqTimeoutError (which means its internal ping failed, yet when I run the event.fire_master ping manually shortly after seeing the error in the logs, it works). Does this event.fire_master somehow open its own connection?
In your own testing, when you stop the master, your keepalive branch correctly restarts the minion. What I was after is that the minion also does this when an established TCP connection has timed out and is being silently dropped by a firewall in between. Seeing the SaltReqTimeoutError logged on the 2015.5.3 minion was good in that sense, as it means the minion correctly detected that it could not get a response from the master and could then act on that and restart itself. That said, the manual event.fire_master ping works fine while the SaltReqTimeoutError is being shown, so maybe the 2015.5.3 minion just had a bug that logs this SaltReqTimeoutError even though the connection was actually fine? With your keepalive-branch minion I don't see the SaltReqTimeoutError anymore (and I don't think I changed anything besides upgrading the minion to your keepalive branch).
Anyway, there are many moving parts to testing this, and I unfortunately do not have the time right now to set up a proper test environment where things can be replicated consistently.
Hi :)
@bymodude: your settings seem to work well. I'm happy.
In the past I had investigated on the master side but not on the minion.
Thanks.
@bymodude
Does this event.fire_master somehow open its own connection?
I suspect the same. I need to figure out how the new code works differently...
old: https://github.com/saltstack/salt/blob/2014.7/salt/minion.py#L1535
new: https://github.com/saltstack/salt/blob/2015.8/salt/minion.py#L1576
/much has changed/
Perhaps the new salt version 2015.8 "solved" the SaltReqTimeoutError for your case but created other issues.
After taking another look at the code: perhaps the ping should somehow go through pub_channel...
https://github.com/saltstack/salt/blob/2015.8/salt/minion.py#L1664
Or the issue might be in the salt-master code... I think it's best if another developer investigates this issue; I'm kind of in the dark.
Till then, perhaps the /hacky/ cron jobs might work...
FYI, I'm glad this is being looked at. If this gets solved, I'm sure it will be the solution to many unexplained master-minion communication issues. Thanks!
@bymodude, @steverweber, thanks for working on this. Communication over sketchy or unfriendly networks has been a problem for a long time. This is one of the problems with ZMQ that RAET and the Tornado TCP transport were created to address.
@bymodude, you may have more success upgrading zmq to > 4.0.0 if possible. Another thing to look at is zmq_monitor. I remember another issue where an aggressive network was unhelpfully trimming packets or frames or something but I can't find that issue, although @cachedout might remember.
Hello,
now we have the same problem as this -> https://github.com/saltstack/salt/issues/20101 . Our job list is full of saltutil.find_job.
@bymodude did you get any SaltReqTimeoutError while using the keepalive branch yet?
Have things stabilized?
thanks.
# on the minion
curl -L https://bootstrap.saltstack.com | sh -s -- -g https://github.com/steverweber/salt.git git keepalive
ping_interval: 5
auth_timeout: 10
random_reauth_delay: 10
auth_tries: 2
auth_safemode: False
@steverweber no SaltReqTimeoutError yet, but I suspect the underlying problem has not yet happened.
My theory is still that the fundamental problem is with TCP keepalive handling, specifically with detecting dead connections: TCP connections that netstat shows as "ESTABLISHED on" but that a firewall in between has already dropped, so any packets sent over them are silently discarded by that firewall. With the keepalive set too high, the applications (salt-master and salt-minion) run into situations where established TCP connections are dead without them knowing it; my testing shows this to be the case. A pragmatic fix may be something like "if no response after X tries and X timeout, re-establish the ZMQ connection", since the manual fix is exactly that (just restart salt-minion; it re-establishes the ZMQ connection and everything works until the next TCP keepalive timeout).
Salt-minion 2015.5.3 was showing SaltReqTimeoutError in this case (but not the manual tests like "salt-call status.master" and "salt-call event.fire status.master", since those probably established a new connection); I suspect it tried to use the dead connection.
The salt-minion keepalive branch (2015.8.x) does not show the SaltReqTimeoutError, but I suspect it is just using a new connection, versus 2015.5.3 re-using the old dead one?
Unfortunately I don't have time at the moment to reproduce this outside Azure, but I assume it should be possible to reproduce the "TCP keepalive + firewall dropping the connection" scenario in a test environment with an iptables rule that DROPs an established ZMQ connection.
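Something along these lines might reproduce it without Azure (an untested sketch; it assumes Linux with iptables and ss available, the default ports 4505/4506, and the placeholder master IP from the logs): it silently DROPs only the minion's currently established connections to the master, so a restarted minion gets new ephemeral ports and reconnects fine, which is roughly what the idle timeout on the Azure side does.
import subprocess
MASTER_IP = "1.1.1.1"            # placeholder master IP
MASTER_PORTS = {"4505", "4506"}  # default publish and return ports
def local_ports_to_master():
    # local (ephemeral) ports of the minion's currently established connections to the master
    out = subprocess.check_output(["ss", "-tn"]).decode()
    ports = set()
    for line in out.splitlines()[1:]:
        cols = line.split()
        if len(cols) < 5 or cols[0] != "ESTAB":
            continue
        peer_ip, peer_port = cols[4].rsplit(":", 1)
        if peer_ip == MASTER_IP and peer_port in MASTER_PORTS:
            ports.add(cols[3].rsplit(":", 1)[1])
    return ports
def black_hole(local_port, delete=False):
    # silently DROP both directions of this one connection (call with delete=True to undo)
    action = "-D" if delete else "-I"
    subprocess.check_call(["iptables", action, "OUTPUT", "-p", "tcp",
                           "-d", MASTER_IP, "--sport", local_port, "-j", "DROP"])
    subprocess.check_call(["iptables", action, "INPUT", "-p", "tcp",
                           "-s", MASTER_IP, "--dport", local_port, "-j", "DROP"])
if __name__ == "__main__":
    for port in local_ports_to_master():
        print("black-holing local port %s" % port)
        black_hole(port)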
@jfindlay I tried zmq4 with 2014.7.x late last year and it made absolutely no difference in my case. I would have to try it again with 2015.5.3 or 2015.8, but all my minions are on Debian Wheezy and I remember it being a major pain to get zmq4 installed, as there is no package for Wheezy. Is there an "easy" way to get zmq4 on Wheezy?
Some food for thought: perhaps a tcp_keepalive_auto option would be an idea.
This theoretical option would detect a tcp_keepalive issue and then reduce the idle and intvl values to a "safemode", e.g.:
tcp_keepalive: True
tcp_keepalive_idle: 75
tcp_keepalive_cnt: 9
tcp_keepalive_intvl: 75
What level of packet loss is typical here? What I'd like to do is create a simulation we can work with to replicate this issue and determine when a fix is sufficient to handle the case at hand.
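Judging from the tcpping output at the top of this issue (3 of the roughly 20 probes shown timed out, at ~310 ms RTT), something around 10-15% loss with ~300 ms of delay would be a reasonable starting point. A rough sketch for injecting that on a test minion with netem (assumes Linux with tc available and run as root; the interface name is a placeholder):
import subprocess
IFACE = "eth0"   # placeholder: the interface the test minion uses to reach the master
# add ~15% packet loss and ~300 ms of delay, roughly matching the tcpping output
subprocess.check_call(["tc", "qdisc", "add", "dev", IFACE, "root",
                       "netem", "loss", "15%", "delay", "300ms"])
# ... run the master/minion tests here ...
# then remove the impairment again
subprocess.check_call(["tc", "qdisc", "del", "dev", IFACE, "root"])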
I still see this issue on Azure with 2016.3.3. I have tried all the things mentioned above, still no luck, and I am not sure where the real issue is. I am just going to add a retry mechanism for running highstates as a last resort.
I believe I have a lead as to what is going on. First, a little background - we are using SaltStack in combination with netbooting and clonezilla to deploy one of several OS images to bare-metal machines and configuring them to customer specifications. We use hardware of several different types: some high-powered (Intel i3/i5/i7 w/ 16G ram), and some very low-powered (Intel J1900 w/ 4G ram). The system is all on the same IP network without firewall or routing of any kind between the master and minions. Finally, for various reasons, we're running on salt 2015.5.11 on Ubuntu 14.04 - even clonezilla is the Ubuntu-based one (14.04) but customized with salt-minion.
We've noticed that the system seems pretty happy when there are no J1900-based machines running. However, we have nearly immediate problems when the J1900s connect and start using resources - they pretty much bring the whole system to a halt with these errors.
Symptoms include many but not all connected systems erroring their highstate runs with the "SaltReqTimeoutError" error. This includes even the high-powered systems. When this happens we sometimes cannot "test.ping" the machines with failed highstate runs, but we can successfully restart the highstate runs (which then fail again within minutes with the same "SaltReqTimeoutError"). However, there are no other errors in logs anywhere on the various systems with any information. And one other interesting tidbit of information - when deploying machines using a "masterless" setup from USB key, the deployment runs without error on ALL of our platforms (so I know the system works and works correctly)!
Furthermore, I have been unable to replicate these symptoms on my smaller testing system - it only happens on our production floor, and only when we have several machines being deployed at the same time. We've gone so far as to replace the cluster hardware running the salt-master as well as change out ethernet switches and cables - none of that solved the issue.
So now with the facts out of the way, here's my hypothesis: by virtue of how slow they are, the J1900s trigger some sort of wait situation in ZMQ which stops the keepalive processing. However, I do not know how to start debugging with this in mind. Maybe someone with more understanding of SaltStack's or ZMQ's internals can find something....
Let me know if I can be of any help. Please remember that the only place I can reliably reproduce this issue is a production floor that must be kept running....
I'm experiencing similar problems with an AWS-hosted master located in Ireland and minions located in Africa and Asia. At the debug log level, I'll see things like the following:
[INFO ] Running state [/opt/shibboleth-idp] at time 10:30:27.490906
[INFO ] Executing state file.recurse for [/opt/shibboleth-idp]
[DEBUG ] LazyLoaded file.user_to_uid
[DEBUG ] LazyLoaded cp.list_master
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', '[MINION ID ELIDED]', 'tcp://[MASTER IP ADDRESS ELIDED]:4506', 'aes')
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', '[MINION ID ELIDED]', 'tcp://[MASTER IP ADDRESS ELIDED]:4506')
[DEBUG ] SaltReqTimeoutError, retrying. (1/3)
[DEBUG ] SaltReqTimeoutError, retrying. (2/3)
This leads to a SaltReqTimeoutError: Message timed out error similar to the following even with the TCP keep-alive settings posted above:
https://gist.github.com/xenophonf/0239c76e939e2cb9ed13a6c39db09b20
Same problem here. When I try to apply a state from a US master to 30 minions in China, about half of the minions report "SaltReqTimeoutError".
Perhaps if a minion can't find its master it should just exit... The service can be configured to perform an automatic restart after a short delay. This would clean up a few issues, like the master changing IP addresses or the connection having strange keepalive issues. It is not the perfect solution, but this fallback would get the minion talking again...
So perhaps you can find a mixture of settings that makes the minion die after too many SaltReqTimeoutErrors, and tweak, say, the systemd service to auto-restart it.
Also, I don't think my old code that forks the minion process and restarts the sub-process works anymore... perhaps it should be removed.
https://github.com/saltstack/salt/blob/develop/salt/scripts.py#L136
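In the same hacky spirit as the cron jobs above, another stopgap could be an external watchdog that tails the minion log and restarts the service once it sees several SaltReqTimeoutError lines in a row. A rough, untested sketch (the log path and service name are assumptions, and it does not handle log rotation):
import subprocess
import time
LOG_PATH = "/var/log/salt/minion"   # assumed default minion log location
THRESHOLD = 3                        # restart after this many timeouts in a row
def watch():
    consecutive = 0
    with open(LOG_PATH) as log:
        log.seek(0, 2)               # start tailing from the end of the file
        while True:
            line = log.readline()
            if not line:
                time.sleep(5)
                continue
            if "SaltReqTimeoutError" in line:
                consecutive += 1
                if consecutive >= THRESHOLD:
                    subprocess.call(["service", "salt-minion", "restart"])
                    consecutive = 0
            elif "Returning information for job" in line:
                consecutive = 0      # the minion clearly talked to the master
if __name__ == "__main__":
    watch()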
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.