What is the proper way to re-authenticate a minion to an upgraded master (a master whose keys have changed)?
I have limited access to the minions; some of them are Windows minions.
Regular salt master - minion setup.
New master keys are generated with salt-key --gen-keys master and live in /etc/salt/pki/master. It doesn't matter if you previously deleted the minion from the master via salt-key -d.
As long as the minion keeps the master's key cached somewhere under salt/pki/minion/ as minion_master.pub, it seems that changing the keys on the master is unsupported. Am I right?
This is a problem for me because I have the salt master deployed in a container.
Sometimes I want to upgrade salt by simply replacing the container.
Is the only solution to manually remove minion_master.pub from the minion?
Maybe there is some configuration option on the minion to relax this requirement?
Any version, AFAIR.
I also have this question
I also have this question. Is there any solution for it?
The "old" public master key on the minion prevents connection the (any) master with a "new" key.
This is exaclty what it must do, and there can be no way to relax that.
We must prepare the clients for a master key change before we change the master key.
The only way I know of is to switch to a new master:
1) on the minion, delete the old key of the old master,
2) on the minion, set the new master and store the new public key (a rough sketch of both steps follows).
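A minimal sketch of those two steps on a Linux minion, assuming default paths and systemd; the master hostname is a placeholder and a plain shell session would do the same job:

import os
import subprocess

MINION_PKI = "/etc/salt/pki/minion"          # default Linux location; Windows differs
CACHED_MASTER_KEY = os.path.join(MINION_PKI, "minion_master.pub")

# 1) forget the old master's public key
if os.path.exists(CACHED_MASTER_KEY):
    os.remove(CACHED_MASTER_KEY)

# 2) point the minion at the (new) master; "salt.example.com" is a placeholder
with open("/etc/salt/minion.d/master.conf", "w") as conf:
    conf.write("master: salt.example.com\n")

# Restart so the minion re-requests and caches the new master public key.
subprocess.run(["systemctl", "restart", "salt-minion"], check=True)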
Is there really a second master needed to renew the master key?
I gave it some thought: to handle a master key change, the minion must know both keys (old and new).
I think this is a feature request; there is no "proper way" (yet) that I know of.
Because @kiemlicz said "Sometimes I want to upgrade", I understand this is the use case:
the minion should try the second master key if the first one does not work (any more).
I could also think of a second use case, where the master will change its key at a specific time:
the minion, at that specific time, shall replace the master key with the second one.
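A sketch of what the first idea could look like on the minion side - purely illustrative, none of these names exist in salt; the assumption is that an operator stages the new key next to the cached one before the master is re-keyed:

import os

def load_key(path):
    with open(path) as f:
        return f.read()

def verify_master_pub(received_pub, pki_dir):
    # Accept the currently cached master key, or a pre-staged "next" key.
    # minion_master_next.pub is a hypothetical file, not a real salt feature.
    current = os.path.join(pki_dir, "minion_master.pub")
    staged_next = os.path.join(pki_dir, "minion_master_next.pub")

    if os.path.exists(current) and load_key(current) == received_pub:
        return True
    if os.path.exists(staged_next) and load_key(staged_next) == received_pub:
        # The second key matched: promote it and drop the old one.
        os.replace(staged_next, current)
        return True
    return False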
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
Still a desired feature
Thank you for updating this issue. It is no longer marked as stale.
Also from our part.
This problem bit me, so I added the following code to work around it. @shangxdy @kiemlicz
os.remove(self.opts['pki_dir'] + "/minion_master.pub")
at this location:
/usr/lib/python3.6/site-packages/salt/crypt.py
(near the message "The master key has changed, the salt master could have been subverted, verify salt master's public key")
I wish SaltStack would add an option to automatically delete the minion_master.pub file.
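For orientation, a rough sketch of the pattern that workaround produces - this is not the actual salt/crypt.py source; the function name and surrounding logic are made up for illustration:

import logging
import os

log = logging.getLogger(__name__)

def check_cached_master_pub(opts, received_master_pub):
    # Hypothetical stand-in for the check in salt/crypt.py that compares the
    # cached minion_master.pub with the key the master just presented.
    cached_path = os.path.join(opts['pki_dir'], 'minion_master.pub')
    if not os.path.exists(cached_path):
        return True                      # nothing cached yet: first contact
    with open(cached_path) as f:
        cached = f.read()
    if cached == received_master_pub:
        return True                      # same master key as before
    log.error("The master key has changed, the salt master could have been "
              "subverted, verify salt master's public key")
    # The workaround: drop the stale cached key so the next auth attempt
    # re-caches whatever key the master presents. Note this also silently
    # accepts a genuinely hostile key swap, which is why salt refuses by default.
    os.remove(cached_path)
    return False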
You remove the master key from the minion; I know this will work.
However, I don't understand why this is designed in a way that doesn't allow "changing" the master (it's actually the same master, only its keys have changed).
Maybe there are security reasons for that, yet I would love this to be configurable.
We have a similar method: an AWS autoscaling group of 1 master that recreates it if it dies (or when we need a fresh version). The userdata build of the master copies the master keys (master.pem and master.pub) from /etc/salt/pki/master into a secure S3 bucket if they don't exist there, and downloads them from there if they do (I have the bash if anyone wants it). We also have the master set to use autosign_grains to auto-register minions that match the special grains, so we don't have to accept the keys again.
However - the minions aren't attempting to reconnect at all unless they are restarted (which you can't do via salt at that point!). There seems to be no "poll" of the master happening in the /var/log/salt/minion log either - certainly no warnings about the master not being present (we considered a cron job to check for a lost salt connection and run systemctl restart salt-minion).
If I can just solve this one part then we'll have a self-healing salt infrastructure. Otherwise we need to either manually log onto every minion and restart it, or fully rebuild the platform, which seems overkill for just the salt master being replaced.
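For reference, the userdata key sync described above might look roughly like this in Python (the poster has it in bash); the bucket name is a placeholder and credentials are assumed to come from the instance role:

import os

import boto3
from botocore.exceptions import ClientError

BUCKET = "example-salt-master-keys"        # placeholder bucket name
PKI_DIR = "/etc/salt/pki/master"

s3 = boto3.client("s3")

def sync_master_key(name):
    # Upload the key on first boot, reuse the stored copy on every rebuild.
    local_path = os.path.join(PKI_DIR, name)
    try:
        s3.head_object(Bucket=BUCKET, Key=name)
    except ClientError:
        # Not in the bucket yet: this is the first master, persist its keys.
        s3.upload_file(local_path, BUCKET, name)
    else:
        # Already stored: a replacement master downloads the same keypair,
        # so minions keep trusting their cached minion_master.pub.
        s3.download_file(BUCKET, name, local_path)

for key_name in ("master.pem", "master.pub"):
    sync_master_key(key_name)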
Ahh - just found this to investigate: https://github.com/saltstack/salt/issues/44038#issuecomment-342761567
Just want to clarify
> We have a similar method: an AWS autoscaling group of 1 master that recreates it if it dies (or when we need a fresh version). ...
I'm not very familiar with AWS; does that mean that your Salt Master instances always have the same keypair? (If so, then you should not have a problem.)
> However - the minions aren't attempting to reconnect at all unless they are restarted (which you can't do via salt at that point!). ...
By default, minions die if the connection to the master(s) won't succeed.
You can change that (if the reason is a key rejection: https://docs.saltstack.com/en/latest/ref/configuration/minion.html#rejected-retry).
However, I don't know if this is your case.
> I'm not very familiar with AWS; does that mean that your Salt Master instances always have the same keypair? (If so, then you should not have a problem.)
Yes - we keep the same keypair, but the minions weren't reconnecting (also the master IP address is changing, but DNS should hopefully find it).
> By default, minions die if the connection to the master(s) won't succeed. You can change that (if the reason is a key rejection: https://docs.saltstack.com/en/latest/ref/configuration/minion.html#rejected-retry). However, I don't know if this is your case.
Actually it looks like the salt-minion process (2019.2.0 code) is just sitting there doing nothing. Maybe this is changed in later versions (we got badly bitten by the 2019.2.1 performance bug so we're very cautious now).
Looks like this might work in the salt-minion conf file:
# If authentication fails due to SaltReqTimeoutError during a ping_interval,
# cause sub minion process to restart.
auth_safemode: False
# Ping Master to ensure connection is alive (minutes).
ping_interval: 2
# Number of consecutive SaltReqTimeoutError that are acceptable when trying to
# authenticate
auth_tries: 2
# The number of attempts to connect to a master before giving up.
# Set this to -1 for unlimited attempts. This allows for a master to have
# downtime and the minion to reconnect to it later when it comes back up.
# In 'failover' mode, it is the number of attempts for each set of masters.
# In this mode, it will cycle through the list of masters for each attempt.
#
# This is different than auth_tries because auth_tries attempts to
# retry auth attempts with a single master. auth_tries is under the
# assumption that you can connect to the master but not gain
# authorization from it. master_tries will still cycle through all
# the masters in a given try, so it is appropriate if you expect
# occasional downtime from the master(s).
master_tries: -1
And it doesn't work (at least on version 2019.2.0). The minion does detect that the master is gone now:
2020-03-10 15:10:23,288 [tornado.application:611 ][ERROR ][2789] Exception in callback <functools.partial object at 0x7fdc204d2f18>
Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/tornado/ioloop.py", line 591, in _run_callback
ret = callback()
File "/usr/lib64/python2.7/site-packages/tornado/stack_context.py", line 342, in wrapped
raise_exc_info(exc)
File "/usr/lib64/python2.7/site-packages/tornado/stack_context.py", line 313, in wrapped
ret = fn(*args, **kwargs)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 212, in <lambda>
future, lambda future: callback(future.result()))
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/salt/minion.py", line 1409, in _send_req_async
ret = yield channel.send(load, timeout=timeout)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 373, in send
ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout, raw=raw)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 341, in _crypted_transfer
ret = yield _do_transfer()
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 325, in _do_transfer
tries=tries,
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
SaltReqTimeoutError: Message timed out
However, it just logs that every 2 minutes and doesn't restart and pick up the new master.
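A crude external watchdog along the lines of the cron idea mentioned earlier could paper over this; it is only a sketch (run it from cron), the timeout value is arbitrary, and it assumes systemd and that salt-call failing or timing out means the master connection is gone:

import subprocess

def master_reachable(timeout_seconds=60):
    # salt-call (without --local) needs to reach the master, so a timeout or
    # a non-zero exit code is treated here as a lost connection.
    try:
        result = subprocess.run(
            ["salt-call", "test.ping"],
            capture_output=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

if __name__ == "__main__":
    if not master_reachable():
        # Restart so the minion re-resolves the master and re-authenticates.
        subprocess.run(["systemctl", "restart", "salt-minion"], check=False)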
To me it looks like you're having some other issue (possibly network related).
> This problem bit me, so I added the following code to work around it:
> os.remove(self.opts['pki_dir'] + "/minion_master.pub")
> at this location: /usr/lib/python3.6/site-packages/salt/crypt.py
> (near the message "The master key has changed, the salt master could have been subverted, verify salt master's public key")
On what line number or in which function did you put that, @rico256-cn?
@Slamoth
grep "The master key has changed, the salt master could have been subverted" /usr/lib/python3.6/site-packages/salt/crypt.py
Thanks.