Hello,
I'm currently running 2014.7.2 and after removing some minions that were VMs managed using Foreman with the foreman-salt plugin (hence deleting their keys), the data I get from the mine while targeting by grain still comes from the minions that are now gone.
I've tried using salt-run cache.clear_all tgt='*', salt '*' saltutil.clear_cache, and salt '*' mine.flush, and nothing seems to change the outcome.
What I'm trying to run is salt-call mine.get 'kernel:Linux' backend_ip_addr grain in a template, so it looks like {% for host, ips in salt['mine.get']('kernel:Linux', 'backend_ip_addr', 'grain').items() %}.
If I do salt-call mine.get '*' backend_ip_addr I get data without the dead minions. Weirdly enough, if I do something like salt-call mine.get 'os:Ubuntu' backend_ip_addr grain, I only get data from the same host as the minion I'm running this on, even though there are many more minions running Ubuntu.
I've tried to target minions using pillar and I get either nothing or things that don't make sense...
Thank you.
What I ended up having to do was to go into /var/cache/salt/master and delete all the minions that are dead by removing their whole directory, then run salt '*' saltutil.clear_cache. This finally seemed to work and the data I receive is now correct.
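For anyone else who needs it, here is a rough sketch of that manual cleanup; it assumes the default master cache location and <dead_minion_id> is a placeholder for each removed minion:

# remove the stale cache directory for each dead minion, then clear the minion-side caches
rm -rf /var/cache/salt/master/minions/<dead_minion_id>
salt '*' saltutil.clear_cache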
Hi @numkem - Thanks for reporting this bug! It looks to me like there are two things happening here that need to be addressed to fix this bug:
mine.flush and the other cache function should probably be clearing that out as well.

Hmm, I'm having a little trouble reproducing this. In my testing, using salt-run cache.clear_all tgt='*' performs its intended function and the "dead" minions no longer appear in a mine query.
You've said here that after manually removing the master's cache directory that things work as expected. Could you please try again to use the cache runner to clear the cache and then before deleting this directory by hand, see if the "dead" minion data has been removed? If not, there may be some sort of strange bug in the cache cleaner.
I tried using salt-run cache.clear_all tgt='*' and while it did feel like some things changed, as some of the bad data went away, some stayed and I ended up having to resort to deleting the contents of /var/cache/salt/master anyway.
Also, salt '*' saltutil.clear_cache doesn't change anything either.
@numkem Yes, those two commands operate on completely different caches. The runner command is the one we want here. Can you please try this again and, after running salt-run cache.clear_all tgt='*', post the contents of /var/cache/salt/minions/<dead_minion_in_question>?
@cachedout I don't see any difference after using salt-run cache.clear_all tgt='*'.
The problem is really when doing a mine.get while targeting with either grains or pillar.
I don't have a folder like the one you mentioned, but I did find /var/cache/salt/master/minions containing 2 files for the dead minion. How can I send this to you, as it looks like it contains "private" things?
@cachedout Anything I can do to help to get this issue fixed?
Not sure if "+1"s are useful, but I think I'm experiencing the same problem; even after running salt-run cache.clear_all tgt='*' I still have lots of minions (some dead, some not) listed in /var/cache/salt/master/minions/. Let me know if I can provide any other useful info.
+1
I can confirm that after upgrading from 2014.7.5 to 2015.5.2, whenever I destroy a minion its data stays in the salt mine cache... this was not the case with the previous version.
This workaround works for me, 2015.5.2:
clean_mine_cache:
  cmd.run:
    - name: |
        salt-run cache.clear_mine tgt='*' && \
        salt '*' mine.update
I added this to an orchestrate state and it is executed each time before the other steps.
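A minimal sketch of how the same thing could be wired directly into an orchestration file without shelling out, using the salt.runner and salt.function orchestration states (the file name orch/clean_mine.sls is just an example):

# orch/clean_mine.sls (hypothetical name), run with: salt-run state.orchestrate orch.clean_mine
clear_mine_cache:
  salt.runner:
    - name: cache.clear_mine
    - tgt: '*'

refresh_mine:
  salt.function:
    - name: mine.update
    - tgt: '*'
    - require:
      - salt: clear_mine_cache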
Actually, it seems it doesn't work if any of the minions are down. Tested on two environments:
Env 1, all minions up:
$ find /var/cache/salt/master/minions -name mine.p | wc -l
42
$ sudo salt-run cache.clear_mine tgt='*'
$ find /var/cache/salt/master/minions -name mine.p | wc -l
0
Env 2, 10 minions down:
$ find /var/cache/salt/master/minions -name mine.p | wc -l
16
$ sudo salt-run cache.clear_mine tgt='*'
$ find /var/cache/salt/master/minions -name mine.p | wc -l
16
Same version of salt on both nodes:
$ salt-call --versions-report
Salt: 2015.5.2
Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
Jinja2: 2.7.2
M2Crypto: 0.21.1
msgpack-python: 0.3.0
msgpack-pure: Not Installed
pycrypto: 2.6.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 14.0.1
RAET: Not Installed
ZMQ: 4.0.4
Mako: 0.9.1
Debian source package: 2015.5.2+ds-1trusty1
However, this works, simple and effective:
clean_mine_cache:
  cmd.run:
    - name: |
        rm /var/cache/salt/master/minions/*/mine.p && \
        salt '*' mine.update
Also, it now seems like two servers that shouldn't match come up in any compound match...
root@inf-use1a-pr-01-salt-salt-0001:~# salt-call mine.get 'G@roles:elasticsearch and G@elasticsearch:cluster_name:inf-kibana-cluster' network.ip_addrs compound
local:
----------
dev-use1b-pr-01-utopia-madb-0001:
- 10.10.50.250
dev-use1e-pr-01-utopia-madb-0001:
- 10.10.60.60
inf-use1a-pr-01-kibana-kibana-0001:
- 10.40.100.5
inf-use1b-pr-01-kibana-esdb-0001:
- 10.40.50.84
inf-use1b-pr-01-kibana-kibana-0001:
- 10.40.110.245
inf-use1e-pr-01-kibana-esdb-0001:
- 10.40.60.108
@joshughes I'm running into the exact same issue. What I end up doing is deleting everything under /var/cache/salt/master/minions/ and restarting the salt-master service and the salt-minion service on all minions.
It is also worth noting that I do have some minions that are still running 2014.7.5... I'm hesitant to upgrade them because I am at the point where I think I will be rolling back...
Also I opened this issue...
https://github.com/saltstack/salt/issues/25613
It seems to me that salt-mine should return results correctly even if there is somehow stale data.
@numkem
I am seeing my compound queries return all of my minions now... Which might be better for helping track this bug down but basically makes salt mine useless for me...
salt-call mine.get 'G@roles:mariadb and G@mariadb:cluster_name:adfasd_utopia' network.ip_addrs compound
That should only return 6 servers but is instead returning all of my servers now... Wondering if you see similar behavior if you try a compound query.
@joshughes I really do; it's not only when targeting with grains, it's with any kind of compound query. If I try using I@roles with pillar I get the same erratic result until I restart the minion.
@numkem So I restarted my minions and my salt mine is completely hosed at this point. With any compound query I get all my minions, and any other query returns nothing... I downgraded to salt v2014.7.5 and it's doing the same thing... so I am at a loss for what could be causing the issue... unless it is the v2015.5.2 minions that are reporting data that makes the whole thing break down.
@joshughes I remember having the issue with 2014.7.5 as well. It just seems like the data reported to the mine can only be added or modified, but not removed. It has to be by design, since it would make sense in some way, but one thing is for sure: the behavior you are seeing is as bad as it gets.
I can't recall exactly what I did to make it work again, but I know that deleting the minions' cached data and restarting the master/minions ended up correcting the problem. I also put a lot of pillar data inside the mine, so refreshing pillar was necessary as well.
Yeah, I am not seeing the issue where dead minions are being reported... it's that the salt mine is just not working at all for me anymore, now that I deleted the minion cache and restarted everything.
I am currently experiencing this issue. I upgraded my master and all minions from 2014.1.10 to 2015.5.3. I am getting back 'dead' minion information in a mine.get. Is there a workaround? I am hesitant to do what @joshughes has done simply because I don't want the salt mine to break.
salt '*' mine.flush removes the /var/cache/salt/master/minions/<node>/mine.p file, but only for minions that are 'alive'. The removed minions still have their directory containing mine.p and data.p. Neither salt-run cache.clear_all tgt='*' nor salt '*' saltutil.clear_cache appears to do anything.
My resolution steps; not all of them might be necessary, but I wasn't taking chances:
service salt-master stop
salt '*' mine.flush
rm -rf /var/cache/salt/master/minions/<offending minion dir>
service salt-master start
salt '*' mine.update
Or just this:
rm /var/cache/salt/master/minions/*/mine.p && \
salt '*' mine.update
I've been using this solution in an orchestrate state for 3 months already without any issues.
If you're using config enforcement, removing all the mine data can leave you open to a salt mine call returning no data and misconfiguring a host. Salt is fast, but removing all mine data does create a race condition where, somewhere in your infrastructure, a minion can get some bad data.
I have adopted @powellchristoph's approach after being burned by the race condition with the brute-force solution to this issue. If you have a dead minion returning in your salt mine call, the safest solution I have found is to remove that minion specifically... especially if your formulas rely on the salt mine and you have config enforcement enabled.
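A sketch of doing that selectively rather than wiping everything, assuming the default locations for the master's accepted keys and the minion cache (adjust the paths if yours differ):

# remove cached data only for minions that no longer have an accepted key
for cached in /var/cache/salt/master/minions/*; do
    id=$(basename "$cached")
    if [ ! -e "/etc/salt/pki/master/minions/$id" ]; then
        rm -rf "$cached"    # stale cache dir (mine.p, data.p) for a dead minion
    fi
done
salt '*' mine.update        # repopulate the mine from the live minions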
I just checked my salt state; I have 'sleep 2' after mine.update.
So, yes, it can create some issues, but I'm using it to orchestrate hundreds of Docker containers and the first call in the orchestrate state is to clean the mine data with the above command. I have 30+ minions on this salt master and never got any issues.
I also prefer to clean all mine data from all minions, because I wanted fresh data from the salt mine before deployment. I think I had the salt mine configured to get some grains from minions.
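For context, a minimal example of what that kind of mine setup looks like; the aliased backend_ip_addr entry is only illustrative, and the block goes in the minion config or in pillar:

mine_functions:
  network.ip_addrs: []          # plain mine function, no arguments
  backend_ip_addr:              # aliased entry backed by a grain lookup
    - mine_function: grains.get
    - backend_ip_addr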
I'm facing this issue as well. For some reason, before 2015.5.* deleting a minion using salt-key -d <id> also removed its entries from the mine, but it looks like that's no longer the case...
Yea it looks like the cache.clear_all salt runner function refuses to delete data from deleted minions. Here's a before and after:
sudo salt 'nginx013' mine.get 'role:server' 'profiles' expr_form=grain
nginx013:
----------
app014:
- app
app022:
- app
app027:
- app
beta022:
- beta
jobs012:
- jobs
jobs014:
- jobs
jobs0141:
- jobs
jobs022:
- jobs
jobs026:
- jobs
The minions jobs014 and jobs0141 have been deleted and no longer exist (salt-key -d jobs014*). So I run sudo salt-run cache.clear_all tgt='*' to clear the cache on the master and this is the result:
sudo salt 'nginx013' mine.get 'role:server' 'profiles' expr_form=grain
nginx013:
----------
jobs014:
- jobs
jobs0141:
- jobs
The culprit for this is likely this line which appears to not attempt clearing any data if the minion id is not valid. Obviously in the case of a deleted minion the id wouldn't be valid since the pki path is non-existent...
This is on salt 2015.8.8 (Beryllium)
This is still an issue on 2016.3.5. if a minion
This is still an issue for 2016.11.3, and it's bad. Here's the message from my ops department:
Prod Ops ran into a service-disrupting issue when some mines failed to return lists of servers during a salt state run. It appeared to be caused by targeting by grains while the salt master had some stale grain values cached. We cleared the caches on the salt master and the targeting started working correctly again.
Due to this, we discussed putting a cache.clear_all command on a cron or something to make sure that the salt cache doesn't get stale. The head of Prod Ops would like to know when we have a solution in place.
So, since a cron job is just a black eye on the DevOps team and makes us look bad, I would like this fixed 👍 thanks
Looking at the docs, as a workaround, do you think setting minion_data_cache to false would solve this? Thanks. https://docs.saltstack.com/en/latest/ref/configuration/master.html#minion-data-cache
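(For reference, the setting being asked about is just this in the master config, followed by a master restart; it defaults to True:)

# /etc/salt/master (or a drop-in under /etc/salt/master.d/)
minion_data_cache: False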
I would like to confirm - running 2016.11.3 (Carbon) here and experiencing the same issue. Tried clearing the cache, removing mine.p, etc., to no avail.
It really breaks matching - dead minions stall execution until the timeout, and things generally become out of sync.
I think perhaps this is related to #35439
To clarify: in relation to #35439, if you run salt '*' saltutil.sync_all and ALL of your minions sync, and if you do NOT set minion_data_cache to false in the master config, then this problem and the problem associated with #35439 go away. It's just that this is not a good workaround for us, because we have nodes that are shut down. We can't start them back up just so we can update their minion cache.
One thing I have thought of doing to resolve this is to make the minion cache less ephemeral by using the Consul backend for the minion data cache. However, in doing this, I discovered #40748: the consul minion data cache backend does not work currently. Sad panda face 😢. @micdobro hope this helps.
@djhaskin987 thanks a lot for your reply. I have also been through #35439 - set minion_data_cache to false, etc. The "fun part" of the problem is that the issue pops up periodically, and it's usually 1-2 dead minions that absolutely do not want to disappear (especially now that there are a lot of dev machines coming up and down, also for testing purposes).
Sometimes it's enough to resync all the data - yesterday it helped to use salt-key -d together with all the other "spells".
running into this issue as well:
$ sudo salt --version
salt 2016.11.3 (Carbon)
mine.delete and mine.flush still don't seem to account for dead minions.
This is 100% still an issue. Nodes removed with salt-cloud -d are not removed from the mine data, resulting in stale mine data and effectively making the salt mine useless.
Can confirm that
rm /var/cache/salt/master/minions/*/mine.p && \
salt '*' mine.update
as suggested by @komljen works
I've done more to debug issue #35439 since I last wrote, and I believe this issue
and that one are the same thing. I wanted to give some information to other
ops professionals out there who might read this, and also workarounds.
tl;dr: use saltutil.cmd in an orchestration, if orchestration is an option.

Deeply situated in the logic of salt is a function dealing with how to
target minions based on grains. When salt does this, it consults the
minion data cache. By default, the minion data cache is found in
/var/cache/salt/master/minions/
. If there is an entry for a particular
minion in the cache, salt uses it to determine if the minion should
be targeted. If there is not an entry for the minion, salt assumes
the minion should be targeted.
The last sentence is what fundamentally messes things up for mine
users. It is actually a fairly safe assumption in the case where you
are using grain targeting to run something like state.apply
, since
when a minion is targeted which shouldn't be targeted (that is, the
grain used in targeting isn't set on that minion), it simply ignores
the request and you get "Minion did not return" when the call returns.
However, the same logic is used for figuring out what entries are or
even should be in the mine. On mine.get
calls, say in the jinja of
a state file, this causes minions which shouldn't be returned by mine.get
to be returned, causing mayhem. I'm hazier on the details here, but
I'm sure this is what is happening.
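An easy way to see the mismatch this creates is to compare the accepted keys with the cache directory; these are the default paths, so adjust them if your master is configured differently:

ls /etc/salt/pki/master/minions     # accepted keys: only the live minions
ls /var/cache/salt/master/minions   # minion data cache: may still list dead ones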
So in reality, there is no problem with the mine code; it's deeper than
that. It's in the grain targeting code in salt.
If you want this fixed, please comment on issue #35439 . It's where I
documented all of my debugging work, and where I found out why this is
happening.
You might not be able to do this, but if you are, it's easily the best
way to get around this issue. For example, instead of using this:
salt['mine.get'](expr_form='grain', tgt='role:webserver', fun='network.ip_addrs')
You may want to use this instead:
salt['mine.get'](tgt='*webserver*', fun='network.ip_addrs')
This seems to work under all sorts of conditions. It also is sad :(, since
grain targeting is awesome.
@almoore pointed this one out :)
One way to get around using the mine entirely is to use saltutil.cmd in conjunction with inline pillars in an orchestration.
As an example, the following salt orchestration provides network IP
address information of other minions as a pillar, instead of using
the mine to accomplish the same thing:
{% set ips = salt['saltutil.cmd'](tgt='role:webserver', expr_form='grain', fun='network.ip_addrs') %}

apply_state:
  salt.state:
    - tgt: 'role:webserver'
    - expr_form: 'grain'
    - pillar:
        previously_mined_data:
          ips:
          {%- for name, ip in ips.items() %}
            {{ name }}: {{ ip['ret'][0] }}
          {%- endfor %}
    - sls:
      - applied.state
This workaround has the advantage of getting the "mined data"
immediately before the state is called.
It has the disadvantage that it can't be called using salt-call
; this
orchestration must be run from the master.
You can use Consul as a makeshift mine. You would create one state to populate consul with
"mine data", and one state to consume the mine data. All the minions
that would need to put data into consul get the "populate" state
applied, and all the minions that would need to consume it have states
run on them which contain the appropriate salt['consul.get']()
calls.
The advantage is that you can use this as a drop-in replacement for
the mine. Since consul is populated using a state call, it should be
safe to use grain targeting using this option. salt-call
s should
work as well as calling the states from the master.
The disadvantage is that it's a bit complicated to set up. That said,
I have set up a POC, and I know it works.
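Roughly, the populate side looks something like the outline below. This is just a sketch assuming the consul execution module is available and configured on the minions (consul.url in the minion config); the key layout and state id are made up.

# "populate" state, applied to the minions that provide data
push_ip_to_consul:
  module.run:
    - name: consul.put
    - key: mine/{{ grains['id'] }}/ip_addrs
    - value: "{{ salt['network.ip_addrs']() | join(',') }}"

On the consuming side, a state or template would then read it back with something like salt['consul.get'](key='mine/some-minion-id/ip_addrs') in its jinja.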
This probably won't help at all if you're not running systemd, but this was my workaround:
Create a systemd .path and .service file. When there is any change to the minion directory, it triggers the .path unit, which in turn invokes the .service unit that clears out old keys. It's basically an event-driven version of the cron workaround posted a long time ago by @djhaskin987, so it's not really desirable at all. Any minions using that data still need to have their affected services restarted afterward.
It would be super cool to get a resolution to this.
We also set up a cron job, periodically flushing and re-populating the mine. A proper resolution would be very much appreciated!
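A sketch of what such a cron entry can look like (the schedule and file name are arbitrary; the commands are the same ones from the workarounds above):

# /etc/cron.d/salt-mine-refresh - flush and repopulate the mine every 30 minutes
*/30 * * * * root salt-run cache.clear_mine tgt='*' && salt '*' mine.update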
Same issue here with 2017.7.1. I've added the cron job mentioned as a workaround, but it still leaves a gap that bites us occasionally. We use the salt mine as part of Rundeck's mechanism to target nodes for deployments. After rebuilding EC2 instances quite heavily over the last few weeks, this has gotten worse and worse in our dev and stage salt environments. We are holding off on a large rebuild of production until we find a solid workaround.
Just ran into the same issue on 2016.11.7.
Adding flush_mine_on_destroy: True to the corresponding salt-cloud profile didn't help.
As far as I can see, flush_mine_on_destroy must be supported by the salt-cloud driver; for now I only see it for nova and drivers based on libcloud.
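For anyone who wants to try the same thing, the option just sits at the top level of the cloud profile; the profile, provider, and size names below are placeholders:

my-profile:
  provider: my-nova-provider
  size: m1.small
  flush_mine_on_destroy: True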
Still no update on this? This is making orchestration extremely difficult, since grains are pretty critical and unique to hosts for deploying clustered or distributed services.
Ran into this issue myself yesterday. I destroyed some VMs in my environment through salt-cloud and then ran through some states, orchestration, and modules trying to figure out why minions I destroyed were returning in my mine.get calls. Because they returned, they broke logic and caused things to fail / become misconfigured. Please fix!