The Salt master is at 10.17.15.194
One of the Salt minions is at 10.17.15.127
sh-4.3# salt "*" test.ping
[DEBUG ] Configuration file path: /etc/salt/master
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[DEBUG ] Reading configuration from /etc/salt/master
[DEBUG ] Including configuration from '/etc/salt/master.d/api.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/api.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/debug.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/debug.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/master.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/master.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/peer.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/peer.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/reactor.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/reactor.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/returner.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/returner.conf
[DEBUG ] Missing configuration file: /root/.saltrc
[DEBUG ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'dashboard.k8s.local_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] Initializing new IPCClient for path: /var/run/salt/master/master_event_pub.ipc
[DEBUG ] LazyLoaded local_cache.get_load
[DEBUG ] Reading minion list from /var/cache/salt/master/jobs/86/a48fad8b8da521c6db8a12184ce28e039ca4fd3ea7f3afb4bfbfb67a6bcc65/.minions.p
[DEBUG ] get_iter_returns for jid 20170605101957391229 sent to set(['04efe4d7f00940a799492844ed29ab57', '94b96abeaa41445c8688c454ceaac726', 'ca', '1740fea923c842508012be4aea15d857']) will timeout at 10:20:02.412050
[DEBUG ] jid 20170605101957391229 return from ca
[DEBUG ] LazyLoaded nested.output
ca:
True
[DEBUG ] jid 20170605101957391229 return from 04efe4d7f00940a799492844ed29ab57
[DEBUG ] LazyLoaded nested.output
04efe4d7f00940a799492844ed29ab57:
True
[DEBUG ] jid 20170605101957391229 return from 1740fea923c842508012be4aea15d857
[DEBUG ] LazyLoaded nested.output
1740fea923c842508012be4aea15d857:
True
[DEBUG ] jid 20170605101957391229 return from 94b96abeaa41445c8688c454ceaac726
[DEBUG ] LazyLoaded nested.output
94b96abeaa41445c8688c454ceaac726:
True
[DEBUG ] jid 20170605101957391229 found all minions set(['ca', '94b96abeaa41445c8688c454ceaac726', '04efe4d7f00940a799492844ed29ab57', '1740fea923c842508012be4aea15d857'])
Then I change the minion's IP address:
ip addr del 10.17.15.127 dev eth0
ip addr add 10.17.15.128 dev eth0
After that, running test.ping results in "Minion did not return. [No response]" for the minion whose IP address changed.
sh-4.3# salt "*" test.ping
[DEBUG ] Configuration file path: /etc/salt/master
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[DEBUG ] Reading configuration from /etc/salt/master
[DEBUG ] Including configuration from '/etc/salt/master.d/api.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/api.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/debug.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/debug.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/master.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/master.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/peer.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/peer.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/reactor.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/reactor.conf
[DEBUG ] Including configuration from '/etc/salt/master.d/returner.conf'
[DEBUG ] Reading configuration from /etc/salt/master.d/returner.conf
[DEBUG ] Missing configuration file: /root/.saltrc
[DEBUG ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'dashboard.k8s.local_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] Initializing new IPCClient for path: /var/run/salt/master/master_event_pub.ipc
[DEBUG ] LazyLoaded local_cache.get_load
[DEBUG ] Reading minion list from /var/cache/salt/master/jobs/63/5811ecfd3f2050ff4d461c7d8b230080d061f5834c343305b2eb098dabf489/.minions.p
[DEBUG ] get_iter_returns for jid 20170605102344190820 sent to set(['04efe4d7f00940a799492844ed29ab57', '94b96abeaa41445c8688c454ceaac726', 'ca', '1740fea923c842508012be4aea15d857']) will timeout at 10:23:49.238166
[DEBUG ] jid 20170605102344190820 return from 94b96abeaa41445c8688c454ceaac726
[DEBUG ] LazyLoaded nested.output
94b96abeaa41445c8688c454ceaac726:
True
[DEBUG ] jid 20170605102344190820 return from 1740fea923c842508012be4aea15d857
[DEBUG ] LazyLoaded nested.output
1740fea923c842508012be4aea15d857:
True
[DEBUG ] jid 20170605102344190820 return from ca
[DEBUG ] LazyLoaded nested.output
ca:
True
[DEBUG ] Checking whether jid 20170605102344190820 is still running
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'dashboard.k8s.local_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] Passing on saltutil error. This may be an error in saltclient. 'retcode'
[DEBUG ] LazyLoaded localfs.init_kwargs
[DEBUG ] LazyLoaded localfs.init_kwargs
[DEBUG ] LazyLoaded no_return.output
04efe4d7f00940a799492844ed29ab57:
Minion did not return. [No response]
I can see the master is still trying to contact the minion at its old address:
...
12:35:15.225523 ARP, Request who-has 10.17.15.127 tell 10.17.15.194, length 28
12:35:16.225528 ARP, Request who-has 10.17.15.127 tell 10.17.15.194, length 28
12:35:17.225462 ARP, Request who-has 10.17.15.127 tell 10.17.15.194, length 28
12:35:18.225459 ARP, Request who-has 10.17.15.127 tell 10.17.15.194, length 28
12:35:19.225418 ARP, Request who-has 10.17.15.127 tell 10.17.15.194, length 28
12:35:20.225482 ARP, Request who-has 10.17.15.127 tell 10.17.15.194, length 28
12:35:21.225519 ARP, Request who-has 10.17.15.127 tell 10.17.15.194, length 28
12:35:22.225477 ARP, Request who-has 10.17.15.127 tell 10.17.15.194, length 28
12:35:23.225477 ARP, Request who-has 10.17.15.127 tell 10.17.15.194, length 28
...
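For a possible cross-check from the master side, the manage runners list which minions the master currently considers reachable (just a sketch; these runners send their own ping, so a minion with a stale publish connection would presumably show up as down):
salt-run manage.status   # minions the master sees as up vs. down (ping-based)
salt-run manage.down     # only the minions that did not respond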
sh-4.3# salt --versions-report
Salt Version:
Salt: 2016.11.4
Dependency Versions:
cffi: 1.5.2
cherrypy: Not Installed
dateutil: Not Installed
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: 1.2.5
pycparser: 2.10
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.13 (default, Jan 11 2017, 10:56:06) [GCC]
python-gnupg: Not Installed
PyYAML: 3.10
PyZMQ: 14.0.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.4
System Versions:
dist: SuSE 12 x86_64
machine: x86_64
release: 4.4.59-92.17-default
system: Linux
version: SUSE Linux Enterprise Server 12 x86_64
What do the logs show on the minion side? Also, do you still see this behavior after restarting the minion service?
The minion does not show anything in its logs: I guess it doesn't realize anything has changed. Connectivity is restored after running systemctl restart salt-minion on the minion, and I can then run salt "*" test.ping from the master again without problems.
+1
2017-10-01 21:59:25,498 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to B.B.B.B
2017-10-02 00:49:27,683 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to B.B.B.B
2017-10-02 00:49:27,707 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to B.B.B.B
2017-10-02 08:22:15,266 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to A.A.A.A
2017-10-02 09:20:17,538 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to A.A.A.A
2017-10-02 11:40:20,125 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to A.A.A.A
2017-10-02 12:44:22,485 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to A.A.A.A
2017-10-02 12:50:45,471 [salt.utils ][ERROR ][22935] DNS lookup or connection check of 'saltmaster1' failed.
2017-10-02 12:50:45,490 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to 127.0.0.1
2017-10-02 12:50:50,500 [salt.utils ][ERROR ][22935] DNS lookup or connection check of 'saltmaster2' failed.
2017-10-02 12:50:50,500 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to 127.0.0.1
2017-10-02 12:55:11,299 [salt.utils ][ERROR ][22935] DNS lookup or connection check of 'saltmaster2' failed.
2017-10-02 12:55:11,299 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to 127.0.0.1
2017-10-02 12:55:11,304 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to A.A.A.A
2017-10-02 18:13:15,018 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to B.B.B.B
2017-10-02 20:57:42,191 [salt.utils ][ERROR ][22935] DNS lookup or connection check of 'saltmaster1' failed.
2017-10-02 20:57:50,375 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to 127.0.0.1
2017-10-02 20:58:00,385 [salt.utils ][ERROR ][22935] DNS lookup or connection check of 'saltmaster2' failed.
2017-10-02 20:58:00,386 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to 127.0.0.1
2017-10-02 20:58:01,389 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to B.B.B.B
2017-10-03 00:15:31,060 [salt.utils ][ERROR ][22935] DNS lookup or connection check of 'saltmaster2' failed.
2017-10-03 00:15:31,245 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to 127.0.0.1
2017-10-03 00:15:34,255 [salt.minion ][WARNING ][22935] Master ip address changed from B.B.B.B to A.A.A.A
2017-10-03 00:19:51,984 [salt.utils ][ERROR ][22935] DNS lookup or connection check of 'saltmaster1' failed.
2017-10-03 00:19:51,985 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to 127.0.0.1
2017-10-03 00:20:01,995 [salt.utils ][ERROR ][22935] DNS lookup or connection check of 'saltmaster2' failed.
2017-10-03 00:20:01,996 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to 127.0.0.1
2017-10-03 00:20:12,005 [salt.utils ][ERROR ][22935] DNS lookup or connection check of 'saltmaster1' failed.
2017-10-03 00:20:12,006 [salt.minion ][WARNING ][22935] Master ip address changed from A.A.A.A to 127.0.0.1
2017-10-03 00:20:21,397 [salt.crypt ][WARNING ][22935] SaltReqTimeoutError: Message timed out
2017-10-03 00:20:36,403 [salt.crypt ][WARNING ][22935] SaltReqTimeoutError: Message timed out
2017-10-03 00:20:51,408 [salt.crypt ][WARNING ][22935] SaltReqTimeoutError: Message timed out
2017-10-03 00:21:06,414 [salt.crypt ][WARNING ][22935] SaltReqTimeoutError: Message timed out
2017-10-03 00:21:21,419 [salt.crypt ][WARNING ][22935] SaltReqTimeoutError: Message timed out
2017-10-03 00:21:36,424 [salt.crypt ][WARNING ][22935] SaltReqTimeoutError: Message timed out
2017-10-03 00:21:51,430 [salt.crypt ][WARNING ][22935] SaltReqTimeoutError: Message timed out
2017-10-03 00:22:06,436 [salt.crypt ][WARNING ][22935] SaltReqTimeoutError: Message timed out
Also, netstat shows no connection to saltmaster:4505.
This is apparently a regression on #2358.
I know this would only be a workaround, but do any of the parameters in the example minion config file, starting at https://github.com/saltstack/salt/blob/master/conf/minion#L244, help?
I tried these, but they made no difference.
auth_tries: 10
auth_safemode: False
ping_interval: 2
With or without those settings, the minions are reconnected after ~15 minutes. No restart required.
I guess that counts as working as designed?
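For reference, the nearby options in the example minion config that looked relevant are below (values are only illustrative; apart from the three above I haven't verified whether any of them change this behaviour):
auth_tries: 10            # attempts at the auth handshake before giving up
auth_safemode: False      # don't restart the minion process when auth times out
ping_interval: 2          # minutes between scheduled minion-to-master pings
recon_default: 1000       # initial ZeroMQ reconnect delay, in milliseconds
recon_max: 10000          # upper bound for the reconnect delay, in milliseconds
recon_randomize: True     # randomize the delay to avoid reconnect storms
tcp_keepalive: True       # enable TCP keepalive on the minion's sockets
tcp_keepalive_idle: 300   # seconds of idle time before keepalive probes start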
versions report
Salt Version:
Salt: 3000.3
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.6.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.10
libgit2: 0.26.0
M2Crypto: 0.32.0
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.5.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: 0.26.2
Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
python-gnupg: 0.4.1
PyYAML: 3.12
PyZMQ: 16.0.2
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.2.5
System Versions:
dist: Ubuntu 18.04 bionic
locale: ISO-8859-1
machine: x86_64
release: 4.15.0-106-generic
system: Linux
version: Ubuntu 18.04 bionic
Putting this in the Magnesium release to at least have a discussion, and assigning Dwoz to help facilitate deciding whether this is working as designed or is merely "falling back" to the TCP keepalive default and isn't reliable.
@OrangeDog You write that after 15 minutes, your minions do reconnect.
We have a similar problem to the one you are reporting, but in our case the minions do not reconnect, not even after hours, so I would like to understand what might be different about our setup compared to yours.
In our case, the minions connect to the master (not the other way around), and the master occasionally changes its IP address. The DNS name is updated to point at the new IP address, so I am under the impression that the minions keep trying to connect to the old IP address instead of resolving the DNS name again and using the new one, but I haven't found any setting in the Salt minion that could be used to tell it to resolve names again.
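The closest setting I have found so far is the master "alive" check, though it is not clear to me whether it actually triggers a new DNS lookup outside of multi-master/failover setups; a sketch of what I would try (the name and values are illustrative only):
master: saltmaster.example.com   # hypothetical DNS name that tracks the master's current IP
master_alive_interval: 30        # seconds between checks that the master is still reachable
master_tries: -1                 # keep retrying the connection to the master indefinitely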
@OrangeDog @smarsching if it would help, @cassandrafaris and/or I can set up a time to discuss this as a group. We have engineers in EU timezones as well as US timezones, so we could schedule the meeting in whichever works for everyone involved. If that helps us figure out what may be happening here, and thus leads to a possible solution, it seems like a win-win, no?
Moving this to non-committed work in Magnesium -- but it is still on the table to discuss in a small group setting if that helps. @DmitryKuzmenko assigning you in case the discussion happens in your timezone rather than a US one; your skills are also welcome here. Let's see whether this is working as designed or is merely "falling back" to the TCP keepalive default and isn't reliable.
@sagetherage I am very interested in a discussion of this issue. I am located in the EU, but I am somewhat flexible regarding the time of the discussion (as long as it isn't before 8 am or after 10 pm local time), so I am sure we can figure something out, regardless of where your engineers are located.
Today, when we had the problem again, I made an interesting observation. The minion does not seem to lose the connection to the master completely. Executing modules that communicate with the master (like cp.get_file_str or saltutil.refresh_pillar) still works when they are run locally with salt-call.
But triggering the execution from the master (even something as simple as test.ping) fails. So I guess it is the registration with one of the event buses that doesn't reconnect correctly. In case it helps, this is the associated stack trace from the minion log (both master and minion are on version 3001):
salt.exceptions.SaltReqTimeoutError: Message timed out
2020-08-12 18:41:57,043 [tornado.application:640 ][ERROR ][13860] Exception in callback functools.partial(<function wrap.<locals>.wrapped at 0x7f6672a57ea0>, <salt.ext.tornado.concurrent.Future object at 0x7f6672a845c0>)
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
ret = callback()
File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 346, in wrapped
raise_exc_info(exc)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 317, in wrapped
ret = fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 291, in <lambda>
future, lambda future: callback(future.result()))
File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3/dist-packages/salt/minion.py", line 1564, in _send_req_async
ret = yield channel.send(load, timeout=timeout)
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 442, in send
load, tries=tries, timeout=timeout, raw=raw
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 411, in _crypted_transfer
ret = yield _do_transfer()
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 394, in _do_transfer
tries=tries,
File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
The logfile of the master doesn't contain any associated messages. As expected, restarting the master doesn't fix the problem, but restarting the minion does.
As a workaround, I thought about having a cron job on the minion that periodically checks whether the minion is still connected and restarts it if not, but as all modules seem to execute fine, I am not sure yet how I could detect such a situation on the minion side (without parsing the minion log, which seems too error-prone).
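The best idea I have so far is a cron job that looks for an established connection to the master's publish port and restarts the minion when it is missing; a rough sketch (the default publish port 4505 and the availability of iproute2's ss are assumptions about our setup):
#!/bin/sh
# Rough sketch: restart salt-minion if there is no established TCP
# connection to the master's publish port (4505).
if ss -tn '( dport = :4505 )' | grep -q ':4505'; then
    exit 0   # publish connection still established
fi
logger "salt-minion: no connection to the master publish port, restarting"
systemctl restart salt-minion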