It turns out that the cache backend fetch() function gets hit a LOT. When I put consul as a cache backend in an env with ~200 minions, salt became unusable because the consul process got overloaded.
When I put a print statement in the fetch() function, it seems it gets called 400+ times per minion, making the consul backend essentially unusable.
Set cache: consul in the master config and count the number of times fetch() in salt/cache/consul.py gets hit.
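For reference, a minimal master config sketch to reproduce this; the consul.* connection values are illustrative and will differ per environment:

# /etc/salt/master (excerpt, illustrative values)
cache: consul
consul.host: 127.0.0.1
consul.port: 8500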
Salt Version:
Salt: 2016.11.2
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 1.5
gitdb: 0.5.4
gitpython: 0.3.2 RC1
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: 0.9.1
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: 1.2.3
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
python-gnupg: Not Installed
PyYAML: 3.10
PyZMQ: 14.0.1
RAET: Not Installed
smmap: 0.8.2
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
System Versions:
dist: Ubuntu 14.04 trusty
machine: x86_64
release: 3.13.0-100-generic
system: Linux
version: Ubuntu 14.04 trusty
Hi @yhekma - that's very interesting to hear. I have played a bit with the cache subsystem, but never in prod.
I have recently introduced a new storage backend for the cache, using Redis. While I know for sure there are some improvements to be applied to the flush, I would expect the fetch to be very fast, as this is what Redis is supposed to be good at.
The Redis plugin is in the develop branch.
I would love to see a benchmark on 200 minions. Would you mind giving it a try?
Thanks,
Mircea
@mirceaulinic sure, I can do that. I was considering using something like redis myself as well, since we need this functionality to go multi-master and consul obviously is not up to the task.
Regardless of whether redis performs though, I think this many lookups of the same key is Not Good(tm)
I'll set up redis and report my findings tomorrow or this weekend.
@DmitryKuzmenko Would you mind taking a look at this?
Thanks,
Daniel
@yhekma This definitely sounds like a bug, but I've tried a simple highstate with the cache enabled and see only the periodic fetch for mine. Could you please provide more details about your configuration and reproduction steps?
@yhekma and/or it would be helpful if you could modify cache/consul.py, adding
log.critical('CACHE FETCH {0}/{1}'.format(bank, key))
to the beginning of the fetch function, reproduce this issue and share the master log file.
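i.e. something along these lines (just a sketch; the rest of fetch() stays unchanged and its exact body depends on the Salt version):

# salt/cache/consul.py (sketch, only the added log line matters)
def fetch(bank, key):
    log.critical('CACHE FETCH {0}/{1}'.format(bank, key))
    # ... existing fetch implementation continues unchanged ...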
@DmitryKuzmenko Upon further investigation it seems I was not complete in my report, although I still believe this is unintended.
The reason we are hitting the cache so much is that we have a lot of pillar data and states that use grains and/or pillar data.
The thing is, there should be no need to do a key lookup every time you need a value, because the key you get already contains all the values in serialized form. So the way I see it, if my cache backend contains mine and cache data of 10 minions, I should never hit my cache more than 20 times per salt run.
Now, as long as you are using the localfs backend, this is not going to hurt that much. The cache (data.p) files are pretty small (in our case ~20k) and end up in the filesystem cache anyway, so you are essentially reading from RAM. But when using consul as a backend, it really hurts that every time we use a grain from a minion, we hit the cache to get the exact same data again and again.
I hope I made myself clearer and apologise for not being complete the first time
@DmitryKuzmenko Please find attached the master log file generated during a salt run on 1 minion in an environment of 15 minions total.
Sorted by minion that gives:
4 minions/consul001.yhekma.dev.ams1.cloud.ecg.so/mine
4 minions/consul002.yhekma.dev.ams1.cloud.ecg.so/mine
4 minions/consul003.yhekma.dev.ams1.cloud.ecg.so/mine
4 minions/graph002.yhekma.dev.ams1.cloud.ecg.so/mine
4 minions/graph003.yhekma.dev.ams1.cloud.ecg.so/mine
8 minions/graph001.yhekma.dev.ams1.cloud.ecg.so/mine
25 minions/consul001.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/consul002.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/consul003.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/fileserver001.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/graph001.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/graph002.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/graph003.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/saltmaster001.yhekma.dev.ams1.cloud.ecg.so/data
25 minions/shellserver001.yhekma.dev.ams1.cloud.ecg.so/data
As I stated earlier, we do a lot of pillar.get and grains.get in our salt code, but that should not result in fetching the serialized grains/pillar again and again imo. Also, the serialized grains are ~20k and I do 25 mine lookups for each minion, so when I run this in my LP env consisting of ~200 minions, that gives me (200 * 25) * 20k = ~100MB in data per salt run per minion. If I multiply that by all the minions (like when I do a salt '*' state.apply) I can see why consul doesn't like that.
Now when we are dealing with localfs, you just hit the fs-cache, so you sort of have infinite bandwidth.
IMO you should either a) read the serialized data once and do subsequent lookups in memory, or b) split the data out into separate key-value pairs, so that when I need for instance the hostname of a minion, I only fetch that key and not all the data again and again.
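For a), something like this rough sketch is what I have in mind (the names here are made up for illustration, not actual Salt internals):

# Hypothetical per-run memoisation in front of the cache backend.
# _memo and cached_fetch are illustrative names, not real Salt APIs.
_memo = {}

def cached_fetch(cache, bank, key):
    if (bank, key) not in _memo:
        # Only the first lookup for this bank/key hits consul/localfs/redis
        _memo[(bank, key)] = cache.fetch(bank, key)
    # Every subsequent lookup is answered from memory
    return _memo[(bank, key)]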
I will also keep an eye on this thread as I'm also interested in the caching system.
Just a couple of observations:
- There are a lot of fetches for /data, but fewer for /mine. Given that you do a lot of pillar.get, I personally don't find it surprising: Salt will go and grab the data each time from the cache, which in this case is consul, not localfs.
- Regarding your suggestions, I will leave @DmitryKuzmenko have his say about that, but I'm not sure about a): what do you mean by "do subsequent lookups in memory"?
b): This sounds sane to me, but is this doable @DmitryKuzmenko? My understanding is that such a model would not be compatible with the localfs structure, but maybe I'm wrong.
@mirceaulinic Consul has no problem with the amount of requests per second, it's the volume that's killing it.
As stated, the value that gets fetched from consul every time contains every grain for that particular minion. It's like you create a python dict like:
import json
mydict = {'a': 10, 'b': 20, 'c': 30}
mydata = { 'mydict': json.dumps(mydict)}
and then query it like:
for k in ['a', 'b', 'c']:
    jsonobject = json.loads(mydata['mydict'])
    print('{} == {}'.format(k, jsonobject[k]))
or maybe I am missing something?
Wrt your question about option a: I mean that you could read the value once, and since you have then already read the complete mine, there is no need to do another fetch() of the data.
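In terms of the toy example above, that simply means hoisting the deserialization out of the loop:

# Read/deserialize once, then answer every lookup from the in-memory copy
jsonobject = json.loads(mydata['mydict'])
for k in ['a', 'b', 'c']:
    print('{} == {}'.format(k, jsonobject[k]))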
I see @yhekma, thanks for the explanation.
As stated, the value that gets fetched from consul every time contains every grain for that particular minion.
If we could change this behaviour, that would be fantastic!
@mirceaulinic Just fyi, using a local redis instance everything runs fine (even with the 200 odd minions). But reading ~2GB from redis in 5k requests is a lot easier than doing the same with consul. So yes, if I want to set up a multi-master environment I could go the redis route, but honestly I think this should be fixed asap and get a pretty high priority label. At the very least it should be mentioned in the documentation that using anything other than localfs as a backend should be tested thoroughly wrt bandwidth, depending on the number of pillar calls and grain lookups.
Thanks for info @yhekma!
I agree with you: even if you didn't have any problems with 200 minions, you'll most likely hit them when you have many hundreds!
Reading the whole data feels (and you have also proved it is) sub-optimal. At the same time, I'm not sure if it would be possible to change this behaviour without affecting localfs, which definitely must be handled with care!
But if we want the caching subsystem to be really useful with backends other than localfs, we need to address that IMO!
@DmitryKuzmenko do you see any improvement plan for this?
Thanks,
Mircea
@mirceaulinic I am not sure you would affect localfs. Couldn't you just fetch the data once, put it in a dict and do subsequent lookups in there? That way you fix this across all the backends. Whether this is feasible I do not know without spelunking in the salt code (I think that would be in masterapi.py?).
This sounds fine to me, at least in theory. But I am not an expert at all on this topic; I'm here just because I'm interested in the subject & looking to move our caching elsewhere.
Your approach sounds good, but I'm afraid it would move the overhead into memory. When the pillars are pretty big, as in your case, and you run against a high number of minions, there might be too much data buffered. Basically those 2GB of information you read from Redis would be moved into the salt process memory.
This is what I suspect, and I think we should go ahead, give it a whirl and compare the performances!
@mirceaulinic The problem is not that we are dealing with a dataset of 2GB; the problem is that we transfer the same data from the cache backend again and again, totalling 2GB.
We don't move any overhead, in so far as we are already doing stuff in memory anyway.
So the current situation:
For every grain and pillar lookup we do a fetch() from the backend and store the result in a variable in memory (I presume it's a dict()). This fetch() fetches, in my case, 20kb worth of serialized data for a specific minion. This is repeated every time I need a value for this minion. So if I store 100 grains, every lookup fetches 100 grains, picks the one it needs and drops the rest.
My proposal:
Keep the data around after fetching it so we don't have to fetch() every time we need a key. The memory consumption would be 20kb per minion, not 2GB.
As stated earlier, option B is to not store one serialized object containing all the values, but to store the key-value pairs themselves (which makes more sense anyway imo). Right now, when you want to look up a single value (say the hostname), you have no choice but to fetch all the other ones as well, which results in the excessive bandwidth usage. This option could be a significant change though, depending on the way it is currently set up.
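To illustrate option B with the same toy data as before (again just a sketch, not actual Salt code):

# Option B: store each value under its own key instead of one serialized blob,
# so a lookup only transfers the single value it needs.
import json

mydata = {'minions/minion1/grains/{0}'.format(k): json.dumps(v)
          for k, v in {'a': 10, 'b': 20, 'c': 30}.items()}

# Looking up one grain now reads one small value...
print(json.loads(mydata['minions/minion1/grains/a']))
# ...instead of fetching and deserializing the whole blob every time.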
@yhekma, @mirceaulinic thank you for sharing your thoughts and experience. I'll take care of this issue as soon as I can.
@DmitryKuzmenko Thanks!
In the meantime, is there a blessed way to run a multi-master setup when using the mine? The only option we have now is to set up a redis cluster, but I guess there should be another way that is already supported since I see documentation on MM on the site....
ZD-1247
Done here. I'll just need to perform one more code update after the merge forward.