Salt: Consistent Errors on Salt-Master Start with Tornado

Created on 16 Jul 2017 · 13 Comments · Source: saltstack/salt

Description of Issue/Question

Getting salt-master errors on startup. Can't run salt-master for more than a few seconds before it fails. Here is the error that repeats in the logs.

2017-07-16 00:16:59,619 [tornado.application][ERROR ][858] Future exception was never retrieved: Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 285, in wrapper
yielded = next(result)
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 611, in handle_message
stream.send(self.serial.dumps(self._auth(payload['load'])))
File "/usr/lib/python2.7/site-packages/salt/transport/mixins/auth.py", line 210, in _auth
auto_sign = self.auto_key.check_autosign(load['id'])
File "/usr/lib/python2.7/site-packages/salt/daemons/masterapi.py", line 427, in check_autosign
if self.check_autosign_dir(keyid):
File "/usr/lib/python2.7/site-packages/salt/daemons/masterapi.py", line 405, in check_autosign_dir
if not os.path.exists(stub_file):
File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
os.stat(path)
TypeError: stat() argument 1 must be encoded string without null bytes, not str

Setup

Standard install via RPM / DNF on Fedora 26. Starting with systemctl start salt-master

Steps to Reproduce Issue

Any start of the process.

Versions Report

salt --versions-report

Salt Version:
Salt: 2016.11.5

Dependency Versions:
cffi: 1.9.1
cherrypy: Not Installed
dateutil: Not Installed
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.9.6
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: Not Installed
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.13 (default, Jun 26 2017, 10:20:05)
python-gnupg: Not Installed
PyYAML: 3.12
PyZMQ: 16.0.2
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.4.2
ZMQ: 4.1.6

System Versions:
dist: fedora 26 Twenty Six
machine: x86_64
release: 4.11.9-300.fc26.x86_64
system: Linux
version: Fedora 26 Twenty Six

info-needed stale team-core

All 13 comments

We removed all keys and tested - no change. We reinstalled all Salt and Python packages - no change. We built a secondary server from scratch, and it worked fine until we flipped DNS to point the existing minions to it. Then the issue appeared there immediately as well. Our current guess is that there is a bug on the Master side that does not handle certain input from a minion. We are attempting to test this now.

We've traced the issue to running a salt-minion on a Windows Server Preview edition (aka 2016 R2). We were testing it in the first hours after it was released. It failed to register its key on the master in any way, but every time it attempted to connect it would throw the error above and, we think, kill that worker so that it would not come back. That this happens on an untested OS isn't too surprising, but it is a concern that it totally crippled the Master. This represents a trivial DoS attack vector: any system replicating the call made by a 2016 R2 minion would take down even the largest Master in a few minutes.

Hello,
If you don't mind sharing, what is the minion id of the Windows machine you mentioned that is causing the crash?

The reason I ask is because from what I can tell this error should only occur when keyid is not a valid string.

We do not test against the preview images, so there isn't a lot I can do right now.

It does look like the problem is caused by the id of the minion, would you mind providing what the minion_id is?

You should be able to run salt-run state.event pretty=True and look for the minion sending a request to auth. The master pulls the id from the payload and passes it as the keyid, which is tacked onto the autosign_dir, and that is what causes this error.
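The failure mode described above can be reproduced outside Salt. The following is a minimal sketch, not Salt's actual code: `is_plain_id`, `check_autosign_stub`, and the allowed-character rule are hypothetical, and it assumes the offending ID contains a null byte, as the TypeError text suggests. On Python 3 the same `os.stat` call raises ValueError rather than the Python 2.7 TypeError shown in the log.

```python
import os
import string
import tempfile

# Hypothetical whitelist for illustration; Salt's real validation may differ.
ALLOWED = set(string.ascii_letters + string.digits + "._-")


def is_plain_id(keyid):
    """Return True only for non-empty IDs made of plain ASCII characters."""
    return bool(keyid) and all(ch in ALLOWED for ch in keyid)


def check_autosign_stub(autosign_dir, keyid):
    """Reject suspicious IDs before they are joined onto a directory path."""
    if not is_plain_id(keyid):
        return False
    return os.path.exists(os.path.join(autosign_dir, keyid))


autosign_dir = tempfile.mkdtemp()  # stand-in for the configured autosign_dir
bad_id = "B4D5\x00BCF"  # made-up ID with an embedded null byte

try:
    # Unvalidated path: the ID goes straight into a stat() call.
    os.stat(os.path.join(autosign_dir, bad_id))
    stat_failed = False
except (TypeError, ValueError):
    # Python 2.7 raises TypeError here; Python 3 raises ValueError.
    stat_failed = True

print(stat_failed)
print(check_autosign_stub(autosign_dir, bad_id))
```

Validating the ID before building the path means a malformed auth request is refused instead of raising inside the worker.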

Thanks,
Daniel

Let me add some info that may be related. I'm running a master with windows minions.
I'm getting this same error.
The minion_id is built with a windows script and is set to the mac_address of the device.
On the first devices deployed, a bug in this script's output caused the minion_id file to be written with some unreadable characters at the start (our bad, of course).
For example, what should have been this minion id

B4D5BD5E3BCF

was written as this

\357\273\277B4D5BD5E3BCF
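The three leading bytes are recognizable: `\357\273\277` is the octal form of 0xEF 0xBB 0xBF, the UTF-8 byte order mark (BOM), which some Windows tools write at the start of UTF-8 files. A short sketch of how this could be checked and stripped in Python follows; the variable names are illustrative only.

```python
import codecs

raw_id = b"\357\273\277B4D5BD5E3BCF"  # octal escapes as shown in the git output

# codecs.BOM_UTF8 is b"\xef\xbb\xbf", the same three bytes.
has_bom = raw_id.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" drops a leading BOM if one is present.
clean_id = raw_id.decode("utf-8-sig")

print(has_bom)
print(clean_id)
print(len(raw_id), len(clean_id))
```

The raw ID is 15 bytes long while the cleaned ID is 12 characters, which is why the two forms behave as different minion IDs.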

We only found this out because we track our config with git; it's not easy to see from bash. From bash you see the correct name, in salt-key or in the files, but it doesn't work. See:

root@srv-salt-master-1:/srv# salt-key | grep B4D5BD5E3BCF
B4D5BD5E3BCF
root@srv-salt-master-1:/srv# salt 'B4D5BD5E3BCF' test.ping
No minions matched the target. No command was sent, no jid was assigned.
ERROR: No return received
root@srv-salt-master-1:/srv# salt '*B4D5BD5E3BCF' test.ping
B4D5BD5E3BCF:
    Minion did not return. [Not connected]

(It didn't return because it's not connected, but that's not the point. The point is that you see the ID as B4D5BD5E3BCF, but if you target it that way it won't work.)

Also see this, the same happens for the file:

root@srv-salt-master-1:/etc/salt/pki/master/minions# ls B4D5BD5E3BCF
ls: cannot access 'B4D5BD5E3BCF': No such file or directory
root@srv-salt-master-1:/etc/salt/pki/master/minions# ls *B4D5BD5E3BCF
B4D5BD5E3BCF
root@srv-salt-master-1:/etc/salt/pki/master/minions# stat B4D5BD5E3BCF
stat: cannot stat 'B4D5BD5E3BCF': No such file or directory
root@srv-salt-master-1:/etc/salt/pki/master/minions# stat *B4D5BD5E3BCF
  File: 'B4D5BD5E3BCF'
  Size: 450             Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d      Inode: 1062388     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2017-09-11 14:35:11.616365732 +0000
Modify: 2017-09-07 21:01:11.720765645 +0000
Change: 2017-09-07 21:03:06.903611558 +0000
 Birth: -
root@srv-salt-master-1:/etc/salt/pki/master/minions#
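Because the terminal hides the extra bytes, one way to make them visible is to list the directory as raw bytes and print each name with `repr`. This is a sketch under assumptions: `list_raw_names` is a hypothetical helper, and a temporary directory stands in for /etc/salt/pki/master/minions.

```python
import os
import tempfile


def list_raw_names(directory):
    """Return directory entries as raw bytes so hidden bytes stay visible."""
    return sorted(os.listdir(os.fsencode(directory)))


# Stand-in for the real key directory, created only for this demo.
demo_dir = tempfile.mkdtemp()
bom_name = b"\xef\xbb\xbfB4D5BD5E3BCF"
with open(os.path.join(os.fsencode(demo_dir), bom_name), "wb") as handle:
    handle.write(b"dummy key")

for name in list_raw_names(demo_dir):
    print(repr(name))  # the escaped bytes show up in the repr

# The name as it appears on screen does not exist as a file.
plain_exists = os.path.exists(os.path.join(demo_dir, "B4D5BD5E3BCF"))
print(plain_exists)
```

Pointing `list_raw_names` at the real key directory would show which accepted keys carry the prefix.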

I don't know if this is really the cause of the issue, but since we both have it, we both have Windows minions, and @scottalanmiller hasn't sent the minion_ids yet, perhaps the problem is related to this.

Hope this helps. In the following weeks I'll try to remove those minions and rename them, and I'll share the info as soon as I have it.

That is super weird.

@saltstack/team-core @twangboy has anyone seen this before?

Thanks,
Daniel

I haven't....

Let me update you with the tests I made today.
First, how I get to target good or bad minions (considering I use MAC address also as minion_id):

  • I can target all the good minions with salt '????????????'
  • I can target all the bad minions with salt '???????????????' (the 3 strange characters added at minion_id start)
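The 12 versus 15 question marks can be illustrated with Python's `fnmatch` module. This sketch assumes shell-style glob semantics applied byte by byte, as with Python 2 byte strings; `matches` is a hypothetical helper, not part of Salt.

```python
import fnmatch

good_id = b"B4D5BD5E3BCF"
bad_id = b"\xef\xbb\xbf" + good_id  # same ID with the 3 extra bytes in front


def matches(minion_id, pattern):
    """Case-sensitive glob match on raw bytes."""
    return fnmatch.fnmatchcase(minion_id, pattern)


twelve = b"?" * 12
fifteen = b"?" * 15

print(matches(good_id, twelve))   # good ID against 12 wildcards
print(matches(bad_id, twelve))    # bad ID against 12 wildcards
print(matches(bad_id, fifteen))   # bad ID against 15 wildcards
print(matches(bad_id, b"*" + good_id))  # leading * absorbs the prefix
print(matches(bad_id, good_id))   # exact name never matches the bad ID
```

Each `?` consumes exactly one byte, so the prefixed ID needs three more wildcards, and only a leading `*` lets the visible name match it.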

I removed the bad minions using salt-key -d '???????????????' -y (note the 15 ? characters).
As long as these minions don't reappear on the master, I don't get this error, but as soon as one of them appears on the master, I get the error:

TypeError: stat() argument 1 must be encoded string without null bytes, not str
2017-09-26 20:39:50,707 [tornado.application][ERROR   ][21404] Future exception was never retrieved: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 611, in handle_message
    stream.send(self.serial.dumps(self._auth(payload['load'])))
  File "/usr/lib/python2.7/dist-packages/salt/transport/mixins/auth.py", line 210, in _auth
    auto_sign = self.auto_key.check_autosign(load['id'])
  File "/usr/lib/python2.7/dist-packages/salt/daemons/masterapi.py", line 415, in check_autosign
    if self.check_autosign_dir(keyid):
  File "/usr/lib/python2.7/dist-packages/salt/daemons/masterapi.py", line 393, in check_autosign_dir
    if not os.path.exists(stub_file):
  File "/usr/lib/python2.7/genericpath.py", line 26, in exists
    os.stat(path)
TypeError: stat() argument 1 must be encoded string without null bytes, not str

(Remember, it was my mistake in the first place that the first minions got these 3 strange characters at the beginning.)

Sorry I didn't report back.

We have replicated this, and it will be fixed in the next release.

Thanks,
Daniel

Do you think we can patch our current installation so this won't happen, so we don't have to wait for the next release?

This should be fixed in 2017.7.2

Thanks,
Daniel

We seem to be seeing it in 2018.3.0-1 still. Any updates?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
