It turns out that the cache backend fetch() function gets hit a LOT. When I put consul as a cache backend in an env with ~200 minions, salt became unusable because the consul process got overloaded.
When I put a print statement in the fetch() function, it seems it gets called 400+ times per minion, making the consul backend essentially unusable.
Set cache: consul in the master config and count the number of times fetch() in salt/cache/consul.py gets hit.
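For reference, a minimal master config sketch to reproduce this; the consul.* connection values are illustrative and will differ per environment:

# /etc/salt/master (excerpt, illustrative values)
cache: consul
consul.host: 127.0.0.1
consul.port: 8500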
Salt Version:
Salt: 2016.11.2
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 1.5
gitdb: 0.5.4
gitpython: 0.3.2 RC1
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: 0.9.1
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: 1.2.3
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
python-gnupg: Not Installed
PyYAML: 3.10
PyZMQ: 14.0.1
RAET: Not Installed
smmap: 0.8.2
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
System Versions:
dist: Ubuntu 14.04 trusty
machine: x86_64
release: 3.13.0-100-generic
system: Linux
version: Ubuntu 14.04 trusty
Hi @yhekma - that's very interesting to hear. I have played a bit with the cache subsystem, but never in prod.
I have recently introduced a new storage backend for the cache, using Redis. While I know for sure there are some improvements to be applied to the flush, I would expect the fetch to be very fast, as this is what Redis is supposed to be good at.
The Redis plugin is in the develop branch.
I would love to see a benchmark on 200 minions. Would you mind giving it a try?
Thanks,
Mircea
@mirceaulinic sure, I can do that. I was considering using something like redis myself as well, since we need this functionality to go multi-master and consul obviously is not up to the task.
Regardless of whether redis performs though, I think this many lookups of the same key is Not Good(tm)
I'll set up redis and report my findings tomorrow or this weekend.
@DmitryKuzmenko Would you mind taking a look at this?
Thanks,
Daniel
@yhekma This definitely sounds like a bug, but I've tried a simple highstate with the cache enabled and see only the periodic fetch for mine. Could you please provide more details about your configuration and reproduction steps?
@yhekma and/or it would be helpful if you could modify cache/consul.py, adding
log.critical('CACHE FETCH {0}/{1}'.format(bank, key))
to the beginning of the fetch function, reproduce this issue and share the master log file.
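i.e. something along these lines (just a sketch; the rest of fetch() stays unchanged and its exact body depends on the Salt version):

# salt/cache/consul.py (sketch, only the added log line matters)
def fetch(bank, key):
    log.critical('CACHE FETCH {0}/{1}'.format(bank, key))
    # ... existing fetch implementation continues unchanged ...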
@DmitryKuzmenko Upon further investigation it seems I was not complete in my report, although I still believe this is unintended.
The reason we are hitting the cache so much is that we have a lot of pillar data and states that use grains and/or pillar data.
The thing is, there should be no need to do a key lookup every time you need a value, because the key you get already contains all the values in serialized form. So the way I see it, if my cache backend contains mine and cache data of 10 minions, I should never hit my cache more than 20 times per salt run.
Now, as long as you are using the localfs backend, this is not going to hurt that much. The cache (data.p) files are pretty small (in our case ~20k) and end up in the filesystem cache anyway, so you are essentially reading from RAM. But when using consul as a backend, it really hurts that every time we use a grain from a minion, we hit the cache to get the exact same data again and again.
I hope I made myself clearer and apologise for not being complete the first time
@DmitryKuzmenko Please find attached the master log file generated during a salt run on 1 minion in an environment of 15 minions total.
Sorted by minion that gives:
4 minions/consul001.yhekma.dev.ams1.cloud.ecg.so/mine
4 minions/consul002.yhekma.dev.ams1.cloud.ecg.so/mine
4 minions/consul003.yhekma.dev.ams1.cloud.ecg.so/mine
4 minions/graph002.yhekma.dev.ams1.cloud.ecg.so/mine
4 minions/graph003.yhekma.dev.ams1.cloud.ecg.so/mine
8 minions/graph001.yhekma.dev.ams1.cloud.ecg.so/mine
25 minions/consul001.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/consul002.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/consul003.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/fileserver001.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/graph001.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/graph002.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/graph003.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/saltmaster001.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/shellserver001.yhekma.dev.ams1.cloud.ecg.so/data
As I stated earlier, we do a lot of pillar.get and grains.get in our salt code, but that should not result in fetching the serialized grains/pillar again and again imo. Also, the serialized grains are ~20k and I do 25 mine lookups for each minion, so when I run this in my LP env consisting of ~200 minions, that gives me (200 * 25) * 20k = ~100MB in data per salt run per minion. If I multiply that by all the minions (like when I do a salt '*' state.apply) I can see why consul doesn't like that.
Now when we are dealing with localfs, you just hit the fs-cache, so you sort of have infinite bandwidth.
IMO you should either a) read the serialized data once and do subsequent lookups in memory, or b) split the data out into separate key-value pairs, so that when I need for instance the hostname of a minion, I only fetch that key and not all the data again and again.
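For a), something like this rough sketch is what I have in mind (the names here are made up for illustration, not actual Salt internals):

# Hypothetical per-run memoisation in front of the cache backend.
# _memo and cached_fetch are illustrative names, not real Salt APIs.
_memo = {}

def cached_fetch(cache, bank, key):
    if (bank, key) not in _memo:
        # Only the first lookup for this bank/key hits consul/localfs/redis
        _memo[(bank, key)] = cache.fetch(bank, key)
    # Every subsequent lookup is answered from memory
    return _memo[(bank, key)]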
I will also keep an eye on this thread as I'm also interested in the caching system.
Just a couple of observations:
- There are a lot of fetches for /data, but fewer for /mine. Given that you do a lot of pillar.get, I personally don't find it surprising: Salt will go and grab the data each time from the cache, which in this case is consul, not localfs.
- Regarding your suggestions, I will leave @DmitryKuzmenko have his say about that, but I'm not sure about a): what do you mean by "do subsequent lookups in memory"?
b): This sounds sane to me, but is this doable @DmitryKuzmenko? My understanding is that such a model would not be compatible with the localfs structure, but maybe I'm wrong.
@mirceaulinic Consul has no problem with the amount of requests per second, it's the volume that's killing it.
As stated, the value that gets fetched from consul every time contains every grain for that particular minion. It's like you create a python dict like:
import json
mydict = {'a': 10, 'b': 20, 'c': 30}
mydata = { 'mydict': json.dumps(mydict)}
and then query it like:
for k in ['a', 'b', 'c']:
    jsonobject = json.loads(mydata['mydict'])
    print('{} == {}'.format(k, jsonobject[k]))
or maybe I am missing something?
Wrt your question about option a: I mean that you could read the value once, and since you have then already read the complete mine, there is no need to do another fetch() of the data.
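In terms of the toy example above, that simply means hoisting the deserialization out of the loop:

# Read/deserialize once, then answer every lookup from the in-memory copy
jsonobject = json.loads(mydata['mydict'])
for k in ['a', 'b', 'c']:
    print('{} == {}'.format(k, jsonobject[k]))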
I see @yhekma, thanks for the explanation.
As stated, the value that gets fetched from consul every time contains every grain for that particular minion.
If we could change this behaviour, that would be fantastic!
@mirceaulinic Just fyi, using a local redis instance everything runs fine (even with the 200 odd minions). But reading ~2GB from redis in 5k requests is a lot easier than doing the same with consul. So yes, if I want to set up a multi-master environment I could go the redis route, but honestly I think this should be fixed asap and get a pretty high priority label. At the very least it should be mentioned in the documentation that using anything other than localfs as a backend should be tested thoroughly wrt bandwidth, depending on the number of pillar calls and grain lookups.
Thanks for info @yhekma!
I agree with you: even if you didn't have any problems with 200 minions, you'll most likely hit them when you have many hundreds!
Reading the whole data feels (and you have also proved it is) sub-optimal. At the same time, I'm not sure if it would be possible to change this behaviour without affecting localfs, which definitely must be handled with care!
But if we want the caching subsystem to be really useful with backends other than localfs, we need to address that IMO!
@DmitryKuzmenko do you see any improvement plan for this?
Thanks,
Mircea
@mirceaulinic I am not sure you would affect localfs. Couldn't you just fetch the data once, put it in a dict and do subsequent lookups in there? That way you fix this across all the backends. Whether this is feasible I do not know without spelunking in the salt code (I think that would be in masterapi.py?).
This sounds fine to me, at least in theory. But I am not an expert at all on this topic; I'm here just because I'm interested in the subject & looking to move our caching elsewhere.
Your approach sounds good, but I'm afraid it would move the overhead into memory. When the pillars are pretty big, as in your case, and you run against a high number of minions, there might be too much data buffered. Basically those 2GB of information you read from Redis would be moved into the salt process memory.
This is what I suspect, and I think we should go ahead, give it a whirl and compare the performances!
@mirceaulinic The problem is not that we are dealing with a dataset of 2GB; the problem is that we transfer the same data from the cache backend again and again, totalling 2GB.
We don't move any overhead, in so far as we are already doing stuff in memory anyway.
So the current situation:
For every grain and pillar lookup we do a fetch() from the backend and store the result in a variable in memory (I presume it's a dict()). This fetch() fetches, in my case, 20kb worth of serialized data for a specific minion. This is repeated every time I need a value for this minion. So if I store 100 grains, every lookup fetches 100 grains, picks the one it needs and drops the rest.
My proposal:
Keep the data around after fetching it so we don't have to fetch() every time we need a key. The memory consumption would be 20kb per minion, not 2GB.
As stated earlier, option B is to not store one serialized object containing all the values, but to store the key-value pairs themselves (which makes more sense anyway imo). Right now, when you want to look up a single value (say the hostname), you have no choice but to fetch all the other ones as well, which results in the excessive bandwidth usage. This option could be a significant change though, depending on the way it is currently set up.
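To illustrate option B with the same toy data as before (again just a sketch, not actual Salt code):

# Option B: store each value under its own key instead of one serialized blob,
# so a lookup only transfers the single value it needs.
import json

mydata = {'minions/minion1/grains/{0}'.format(k): json.dumps(v)
          for k, v in {'a': 10, 'b': 20, 'c': 30}.items()}

# Looking up one grain now reads one small value...
print(json.loads(mydata['minions/minion1/grains/a']))
# ...instead of fetching and deserializing the whole blob every time.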
@yhekma, @mirceaulinic thank you for sharing your thoughts and experience. I'll take care of this issue as soon as I can.
@DmitryKuzmenko Thanks!
In the meantime, is there a blessed way to run a multi-master setup when using the mine? The only option we have now is to set up a redis cluster, but I guess there should be another way that is already supported since I see documentation on MM on the site....
ZD-1247
Done here. I'll just need to perform one more code update after the merge forward.