Hello,
I'm currently running 2014.7.2 and after removing some minions that were VMs managed using Foreman with the foreman-salt plugin (hence deleting their keys), the data I get from the mine while targeting by grain still comes from the minions that are now gone.
I've tried using salt-run cache.clear_all tgt='*', salt '*' saltutil.clear_cache, and salt '*' mine.flush, and nothing seems to change the outcome.
What I'm trying to run is salt-call mine.get 'kernel:Linux' backend_ip_addr grain in a template, so it looks like {% for host, ips in salt['mine.get']('kernel:Linux', 'backend_ip_addr', 'grain').items() %}.
If I do salt-call mine.get '*' backend_ip_addr I get data without the dead minions. Weirdly enough, if I do something like salt-call mine.get 'os:Ubuntu' backend_ip_addr grain, I only get data from the same host as the minion I'm running this on, even though there are many more minions running Ubuntu.
I've tried to target minions using pillar and I get either nothing or things that don't make sense...
Thank you.
What I ended up having to do was to go into /var/cache/salt/master and delete all the minions that are dead by removing their whole directory, then run salt '*' saltutil.clear_cache. This finally seemed to work and the data I receive is now correct.
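For anyone else who needs it, here is a rough sketch of that manual cleanup; it assumes the default master cache location and <dead_minion_id> is a placeholder for each removed minion:

# remove the stale cache directory for each dead minion, then clear the minion-side caches
rm -rf /var/cache/salt/master/minions/<dead_minion_id>
salt '*' saltutil.clear_cache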
Hi @numkem - Thanks for reporting this bug! It looks to me like there are two things happening here that need to be addressed to fix this bug:
mine.flush and the other cache function should probably be clearing that out as well.

Hmm, I'm having a little trouble reproducing this. In my testing, using salt-run cache.clear_all tgt='*' performs its intended function and the "dead" minions no longer appear in a mine query.
You've said here that after manually removing the master's cache directory that things work as expected. Could you please try again to use the cache runner to clear the cache and then before deleting this directory by hand, see if the "dead" minion data has been removed? If not, there may be some sort of strange bug in the cache cleaner.
I tried using salt-run cache.clear_all tgt='*' and while it did feel like some things changed, as some of the bad data went away, some stayed and I ended up having to resort to deleting the contents of /var/cache/salt/master anyway.
Also, salt '*' saltutil.clear_cache doesn't change anything either.
@numkem Yes, those two commands operate on completely different caches. The runner command is the one we want here. Can you please try this again and, after running salt-run cache.clear_all tgt='*', post the contents of /var/cache/salt/minions/<dead_minion_in_question>?
@cachedout I don't see any difference after using salt-run cache.clear_all tgt='*'.
The problem is really when doing a mine.get while targeting with either grains or pillar.
I don't have a folder like the one you mentioned, but I did find /var/cache/salt/master/minions containing 2 files for the dead minion. How can I send this to you, as it looks like it contains "private" things?
@cachedout Anything I can do to help to get this issue fixed?
Not sure if "+1"s are useful, but I think I'm experiencing the same problem; even after running salt-run cache.clear_all tgt='*' I still have lots of minions (some dead, some not) listed in /var/cache/salt/master/minions/. Let me know if I can provide any other useful info.
+1
I can confirm that after upgrading from 2014.7.5 to 2015.5.2, whenever I destroy a minion its data stays in the salt mine cache... this was not the case with the previous version.
This workaround works for me, 2015.5.2:
clean_mine_cache:
  cmd.run:
    - name: |
        salt-run cache.clear_mine tgt='*' && \
        salt '*' mine.update
I added this to an orchestrate state and it is executed each time before the other steps.
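A minimal sketch of how the same thing could be wired directly into an orchestration file without shelling out, using the salt.runner and salt.function orchestration states (the file name orch/clean_mine.sls is just an example):

# orch/clean_mine.sls (hypothetical name), run with: salt-run state.orchestrate orch.clean_mine
clear_mine_cache:
  salt.runner:
    - name: cache.clear_mine
    - tgt: '*'

refresh_mine:
  salt.function:
    - name: mine.update
    - tgt: '*'
    - require:
      - salt: clear_mine_cache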
Actually, it seems it doesn't work if any of the minions are down. Tested on two environments:
Env 1, all minions up:
$ find /var/cache/salt/master/minions -name mine.p | wc -l
42
$ sudo salt-run cache.clear_mine tgt='*'
$ find /var/cache/salt/master/minions -name mine.p | wc -l
0
Env 2, 10 minions down:
$ find /var/cache/salt/master/minions -name mine.p | wc -l
16
$ sudo salt-run cache.clear_mine tgt='*'
$ find /var/cache/salt/master/minions -name mine.p | wc -l
16
Same version of salt on both nodes:
$ salt-call --versions-report
Salt: 2015.5.2
Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
Jinja2: 2.7.2
M2Crypto: 0.21.1
msgpack-python: 0.3.0
msgpack-pure: Not Installed
pycrypto: 2.6.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 14.0.1
RAET: Not Installed
ZMQ: 4.0.4
Mako: 0.9.1
Debian source package: 2015.5.2+ds-1trusty1
However, this works, simple and effective:
clean_mine_cache:
  cmd.run:
    - name: |
        rm /var/cache/salt/master/minions/*/mine.p && \
        salt '*' mine.update
Also, it now seems like two servers that shouldn't match come up in any compound match...
root@inf-use1a-pr-01-salt-salt-0001:~# salt-call mine.get 'G@roles:elasticsearch and G@elasticsearch:cluster_name:inf-kibana-cluster' network.ip_addrs compound
local:
----------
dev-use1b-pr-01-utopia-madb-0001:
- 10.10.50.250
dev-use1e-pr-01-utopia-madb-0001:
- 10.10.60.60
inf-use1a-pr-01-kibana-kibana-0001:
- 10.40.100.5
inf-use1b-pr-01-kibana-esdb-0001:
- 10.40.50.84
inf-use1b-pr-01-kibana-kibana-0001:
- 10.40.110.245
inf-use1e-pr-01-kibana-esdb-0001:
- 10.40.60.108
@joshughes I'm running into the exact same issue. What I end up doing is deleting everything under /var/cache/salt/master/minions/ and restarting the salt-master service and the salt-minion service on all minions.
It is also worth noting that I do have some minions that are still running 2014.7.5... I'm hesitant to upgrade them because I am at the point where I think I will be rolling back...
Also I opened this issue...
https://github.com/saltstack/salt/issues/25613
It seems to me that salt-mine should return results correctly even if there is somehow stale data.
@numkem
I am seeing my compound queries return all of my minions now... Which might be better for helping track this bug down but basically makes salt mine useless for me...
salt-call mine.get 'G@roles:mariadb and G@mariadb:cluster_name:adfasd_utopia' network.ip_addrs compound
That should only return 6 servers but is instead returning all of my servers now... Wondering if you see similar behavior if you try a compound query.
@joshughes I really do; it's not only when targeting with grains, it's with any kind of compound query. If I try using I@roles with pillar I get the same erratic result until I restart the minion.
@numkem So I restarted my minions and my salt mine is completely hosed at this point. With any compound query I get all my minions, and any other query returns nothing... I downgraded to salt v2014.7.5 and it's doing the same thing... so I am at a loss for what could be causing the issue... unless it is the v2015.5.2 minions that are reporting data that makes the whole thing break down.
@joshughes I remember having the issue with 2014.7.5 as well. It just seems like the data reported to the mine can only be added or modified, but not removed. It has to be by design, since it would make sense in some way, but one thing is for sure: the behavior you are seeing is as bad as it gets.
I can't recall exactly what I did to make it work again, but I know that deleting the minions' cached data and restarting the master/minions ended up correcting the problem. I also put a lot of pillar data inside the mine, so refreshing pillar was necessary as well.
Yeah, I am not seeing the issue where dead minions are being reported... it's that the salt mine is just not working at all for me anymore, now that I deleted the minion cache and restarted everything.
I am currently experiencing this issue. I upgraded my master and all minions from 2014.1.10 to 2015.5.3. I am getting back 'dead' minion information in a mine.get. Is there a workaround? I am hesitant to do what @joshughes has done simply because I don't want the salt mine to break.
salt '*' mine.flush removes the /var/cache/salt/master/minions/<node>/mine.p file, but only for minions that are 'alive'. The removed minions still have their directory containing mine.p and data.p. Neither salt-run cache.clear_all tgt='*' nor salt '*' saltutil.clear_cache appears to do anything.
My resolution steps; not all of them might be necessary, but I wasn't taking chances:
service salt-master stop
salt '*' mine.flush
rm -rf /var/cache/salt/master/minions/<offending minion dir>
service salt-master start
salt '*' mine.update
Or just this:
rm /var/cache/salt/master/minions/*/mine.p && \
salt '*' mine.update
I've been using this solution in an orchestrate state for 3 months already without any issues.
If you're using config enforcement, removing all the mine data can leave you open to a salt mine call returning no data and misconfiguring a host. Salt is fast, but removing all mine data does create a race condition where, somewhere in your infrastructure, a minion can get some bad data.
I have adopted @powellchristoph's approach after being burned by the race condition with the brute-force solution to this issue. If you have a dead minion returning in your salt mine call, the safest solution I have found is to remove that minion specifically... especially if your formulas rely on the salt mine and you have config enforcement enabled.
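A sketch of doing that selectively rather than wiping everything, assuming the default locations for the master's accepted keys and the minion cache (adjust the paths if yours differ):

# remove cached data only for minions that no longer have an accepted key
for cached in /var/cache/salt/master/minions/*; do
    id=$(basename "$cached")
    if [ ! -e "/etc/salt/pki/master/minions/$id" ]; then
        rm -rf "$cached"    # stale cache dir (mine.p, data.p) for a dead minion
    fi
done
salt '*' mine.update        # repopulate the mine from the live minions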
I just checked my salt state; I have 'sleep 2' after mine.update.
So, yes, it can create some issues, but I'm using it to orchestrate hundreds of Docker containers and the first call in the orchestrate state is to clean the mine data with the above command. I have 30+ minions on this salt master and never got any issues.
I also prefer to clean all mine data from all minions, because I wanted fresh data from the salt mine before deployment. I think I had the salt mine configured to get some grains from minions.
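For context, a minimal example of what that kind of mine setup looks like; the aliased backend_ip_addr entry is only illustrative, and the block goes in the minion config or in pillar:

mine_functions:
  network.ip_addrs: []          # plain mine function, no arguments
  backend_ip_addr:              # aliased entry backed by a grain lookup
    - mine_function: grains.get
    - backend_ip_addr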
I'm facing this issue as well. For some reason, before 2015.5.* deleting a minion using salt-key -d <id> also removed its entries from the mine, but it looks like that's no longer the case...
Yea it looks like the cache.clear_all salt runner function refuses to delete data from deleted minions. Here's a before and after:
sudo salt 'nginx013' mine.get 'role:server' 'profiles' expr_form=grain
nginx013:
----------
app014:
- app
app022:
- app
app027:
- app
beta022:
- beta
jobs012:
- jobs
jobs014:
- jobs
jobs0141:
- jobs
jobs022:
- jobs
jobs026:
- jobs
The minions jobs014 and jobs0141 have been deleted and no longer exist (salt-key -d jobs014*). So I run sudo salt-run cache.clear_all tgt='*' to clear the cache on the master and this is the result:
sudo salt 'nginx013' mine.get 'role:server' 'profiles' expr_form=grain
nginx013:
----------
jobs014:
- jobs
jobs0141:
- jobs
The culprit for this is likely this line which appears to not attempt clearing any data if the minion id is not valid. Obviously in the case of a deleted minion the id wouldn't be valid since the pki path is non-existent...
This is on salt 2015.8.8 (Beryllium)
This is still an issue on 2016.3.5. if a minion
This is still an issue for 2016.11.3, and it's bad. Here's the message from my ops department:
Prod Ops ran into a service-disrupting issue when some mines failed to return lists of servers during a salt state run. It appeared to be caused by targeting by grains while the salt master had some stale grain values cached. We cleared the caches on the salt master and the targeting started working correctly again.
Due to this, we discussed putting a cache.clear_all command on a cron or something to make sure that the salt cache doesn't get stale. The head of Prod Ops would like to know when we have a solution in place.
So, since a cron job is just a black eye on the DevOps team and makes us look bad, I would like this fixed 👍 thanks
Looking at the docs, as a workaround, do you think setting minion_data_cache to false would solve this? Thanks. https://docs.saltstack.com/en/latest/ref/configuration/master.html#minion-data-cache
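(For reference, the setting being asked about is just this in the master config, followed by a master restart; it defaults to True:)

# /etc/salt/master (or a drop-in under /etc/salt/master.d/)
minion_data_cache: False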
I would like to confirm - running 2016.11.3 (Carbon) here and experiencing the same issue. Tried clearing the cache, removing mine.p, etc., to no avail.
It really breaks matching - dead minions stall execution until the timeout, and things generally become out of sync.
I think perhaps this is related to #35439
To clarify: in relation to #35439, if you run salt '*' saltutil.sync_all and ALL of your minions sync, and if you do NOT set minion_data_cache to false in the master config, then this problem and the problem associated with #35439 go away. It's just that this is not a good workaround for us, because we have nodes that are shut down. We can't start them back up just so we can update their minion cache.
One thing I have thought of doing to resolve this is to make the minion cache less ephemeral by using the Consul backend for the minion data cache. However, in doing this, I discovered #40748: the consul minion data cache backend does not work currently. Sad panda face 😢. @micdobro hope this helps.
@djhaskin987 thanks a lot for your reply. I have also been through #35439 - set minion_data_cache to false, etc. The "fun part" of the problem is that the issue pops up periodically, and it's usually 1-2 dead minions that absolutely do not want to disappear (especially now that there are a lot of dev machines coming up and down, also for testing purposes).
Sometimes it's enough to resync all the data - yesterday it helped to use salt-key -d together with all the other "spells".
running into this issue as well:
$ sudo salt --version
salt 2016.11.3 (Carbon)
mine.delete and mine.flush still don't seem to account for dead minions.
This is 100% still an issue. Nodes removed with salt-cloud -d are not removed from the mine data, resulting in stale mine data and effectively making the salt mine useless.
Can confirm that
rm /var/cache/salt/master/minions/*/mine.p && \
salt '*' mine.update
as suggested by @komljen works
I've done more to debug issue #35439 since I last wrote, and I believe this issue
and that one are the same thing. I wanted to give some information to other
ops professionals out there who might read this, and also workarounds.
tl;dr: use saltutil.cmd in an orchestration, if orchestration is an option.

Deeply situated in the logic of salt is a function dealing with how to
target minions based on grains. When salt does this, it consults the
minion data cache. By default, the minion data cache is found in
/var/cache/salt/master/minions/
. If there is an entry for a particular
minion in the cache, salt uses it to determine if the minion should
be targeted. If there is not an entry for the minion, salt assumes
the minion should be targeted.
The last sentence is what fundamentally messes things up for mine
users. It is actually a fairly safe assumption in the case where you
are using grain targeting to run something like state.apply
, since
when a minion is targeted which shouldn't be targeted (that is, the
grain used in targeting isn't set on that minion), it simply ignores
the request and you get "Minion did not return" when the call returns.
However, the same logic is used for figuring out what entries are or
even should be in the mine. On mine.get
calls, say in the jinja of
a state file, this causes minions which shouldn't be returned by mine.get
to be returned, causing mayhem. I'm hazier on the details here, but
I'm sure this is what is happening.
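An easy way to see the mismatch this creates is to compare the accepted keys with the cache directory; these are the default paths, so adjust them if your master is configured differently:

ls /etc/salt/pki/master/minions     # accepted keys: only the live minions
ls /var/cache/salt/master/minions   # minion data cache: may still list dead ones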
So in reality, there is no problem with the mine code; it's deeper than
that. It's in the grain targeting code in salt.
If you want this fixed, please comment on issue #35439 . It's where I
documented all of my debugging work, and where I found out why this is
happening.
You might not be able to do this, but if you are, it's easily the best
way to get around this issue. For example, instead of using this:
salt['mine.get'](expr_form='grain', tgt='role:webserver', fun='network.ip_addrs')
You may want to use this instead:
salt['mine.get'](tgt='*webserver*', fun='network.ip_addrs')
This seems to work under all sorts of conditions. It also is sad :(, since
grain targeting is awesome.
@almoore pointed this one out :)
One way to get around using the mine entirely is to use saltutil.cmd in conjunction with inline pillars in an orchestration.
As an example, the following salt orchestration provides network IP
address information of other minions as a pillar, instead of using
the mine to accomplish the same thing:
{% set ips = salt['saltutil.cmd'](tgt='role:webserver', expr_form='grain', fun='network.ip_addrs') %}

apply_state:
  salt.state:
    - tgt: 'role:webserver'
    - expr_form: 'grain'
    - pillar:
        previously_mined_data:
          ips:
          {%- for name, ip in ips.items() %}
            {{ name }}: {{ ip['ret'][0] }}
          {%- endfor %}
    - sls:
      - applied.state
This workaround has the advantage of getting the "mined data"
immediately before the state is called.
It has the disadvantage that it can't be called using salt-call
; this
orchestration must be run from the master.
You can use Consul as a makeshift mine. You would create one state to populate consul with
"mine data", and one state to consume the mine data. All the minions
that would need to put data into consul get the "populate" state
applied, and all the minions that would need to consume it have states
run on them which contain the appropriate salt['consul.get']()
calls.
The advantage is that you can use this as a drop-in replacement for
the mine. Since consul is populated using a state call, it should be
safe to use grain targeting using this option. salt-call
s should
work as well as calling the states from the master.
The disadvantage is that it's a bit complicated to set up. That said,
I have set up a POC, and I know it works.
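Roughly, the populate side looks something like the outline below. This is just a sketch assuming the consul execution module is available and configured on the minions (consul.url in the minion config); the key layout and state id are made up.

# "populate" state, applied to the minions that provide data
push_ip_to_consul:
  module.run:
    - name: consul.put
    - key: mine/{{ grains['id'] }}/ip_addrs
    - value: "{{ salt['network.ip_addrs']() | join(',') }}"

On the consuming side, a state or template would then read it back with something like salt['consul.get'](key='mine/some-minion-id/ip_addrs') in its jinja.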
This probably won't help at all if you're not running systemd, but this was my workaround:
Create a systemd .path and .service file. When there is any change to the minion directory, it triggers the .path unit, which in turn invokes the .service unit that clears out old keys. It's basically an event-driven version of the cron workaround posted a long time ago by @djhaskin987, so it's not really desirable at all. Any minions using that data still need to have their affected services restarted afterward.
It would be super cool to get a resolution to this.
We also set up a cron job, periodically flushing and re-populating the mine. A proper resolution would be very much appreciated!
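A sketch of what such a cron entry can look like (the schedule and file name are arbitrary; the commands are the same ones from the workarounds above):

# /etc/cron.d/salt-mine-refresh - flush and repopulate the mine every 30 minutes
*/30 * * * * root salt-run cache.clear_mine tgt='*' && salt '*' mine.update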
Same issue here with 2017.7.1. I've added the cron job mentioned as a workaround, but it still leaves a gap that bites us occasionally. We use the salt mine as part of Rundeck's mechanism to target nodes for deployments. After rebuilding EC2 instances quite heavily over the last few weeks, this has gotten worse and worse in our dev and stage salt environments. We are holding off on a large rebuild of production until we find a solid workaround.
Just ran into the same issue on 2016.11.7.
Adding flush_mine_on_destroy: True to the corresponding salt-cloud profile didn't help.
As far as I can see, flush_mine_on_destroy must be supported by the salt-cloud driver; for now I only see it for nova and drivers based on libcloud.
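For anyone who wants to try the same thing, the option just sits at the top level of the cloud profile; the profile, provider, and size names below are placeholders:

my-profile:
  provider: my-nova-provider
  size: m1.small
  flush_mine_on_destroy: True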
Still no update on this? This is making orchestration extremely difficult, since grains are pretty critical and unique to hosts for deploying clustered or distributed services.
Ran into this issue myself yesterday. I destroyed some VMs in my environment through salt-cloud and then ran through some states, orchestration, and modules trying to figure out why minions I destroyed were returning in my mine.get calls. Because they returned, they broke logic and caused things to fail / become misconfigured. Please fix!