Describe the enhancement:
Add some kind of event rate quota per Cloud Foundry organization, so that events can be dropped once a limit is reached.
There is an issue about adding throttling in general (https://github.com/elastic/beats/issues/17775), but this may require many more changes.
In the case of Cloud Foundry we could add the rate limit per organization in the specific input.
Add a field or tag to the events to indicate that they are being throttled.
Describe a specific use case for the enhancement or feature:
On clusters with many organizations it may be good to limit the event rate so that organizations make fair use of resources.
Pinging @elastic/integrations-platforms (Team:Platforms)
Discussed this with @jsoriano off-issue so wanted to summarize our discussion here:
- […] `kubernetes.namespace`.

Starting to think a bit about what the configuration for such a processor might look like, here's an initial proposal (not married to any of the setting names, of course):
processors:
  - rate_limiter:
      global: "500/m"  # optional, but either global or by_field or both must be specified
      by_field:        # optional, but either global or by_field or both must be specified
        - field: "foo.bar"  # required
          value: "56/s"     # optional, but either value or values or both must be specified
          values:           # optional, but either value or values or both must be specified
            - baz: "4500/h"  # required
The way the above example configuration would be interpreted is:
- For events with `foo.bar == "baz"`, a rate limit of 4500 events per hour will be applied.
- For other events with the `foo.bar` field present, a rate limit of 56 events per second will be applied.
- For all other events, the global rate limit of 500 events per minute will be applied.

This configuration would allow complex rate limiting policies to be configured while also allowing simple ones to be configured quite simply. For example, to enforce a rate limit of 500 events per second for events from the Cloud Foundry org `acme`, the configuration would look like:
processors:
  - rate_limiter:
      by_field:
        - field: "cloudfoundry.org.name"
          values:
            - acme: "500/s"
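To make the interpretation of the first example concrete, here is a rough Python sketch of how a limit might be resolved for an event. It assumes a most-specific-first precedence (per-value limit, then per-field limit, then global), represents events as flat dicts keyed by dotted field names, and models `values` as a plain dict. `resolve_limit` is a hypothetical helper, not part of Beats.

```python
def resolve_limit(event_fields, config):
    """Pick the most specific limit string for an event, or None if none applies.

    Assumption: per-value limits win over the per-field limit, which wins
    over the global limit.
    """
    for rule in config.get("by_field", []):
        name = rule["field"]
        if name not in event_fields:
            continue  # this rule only applies when the field is present
        per_value = rule.get("values", {})
        if event_fields[name] in per_value:
            return per_value[event_fields[name]]
        if "value" in rule:
            return rule["value"]
    return config.get("global")


config = {
    "global": "500/m",
    "by_field": [
        {"field": "foo.bar", "value": "56/s", "values": {"baz": "4500/h"}},
    ],
}
```

With this config, an event with `foo.bar` set to `baz` resolves to `"4500/h"`, any other event carrying `foo.bar` resolves to `"56/s"`, and everything else falls back to `"500/m"`.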
I'm deliberately leaving aside the choice of rate limiting algorithm (fixed window, sliding window, token bucket, leaky bucket) for now. Just trying to focus on the configuration UX first.
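For reference, here is a minimal sketch of one of the candidate algorithms, a token bucket. It is only an illustration of the technique, not a statement about which algorithm the processor should use. The class name, the parameters, and the injectable `clock` are assumptions made for this example.

```python
import time


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the sketch can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if an event may pass, False if it should be dropped."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A fixed or sliding window would replace the refill step with per-window counters; the configuration UX discussed here would not need to change.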
WDYT @jsoriano?
@ycombinator thanks for your proposal, it looks quite good, but I have some observations; maybe we can make the processor simpler.
One is that instead of having global and per field configurations, we could rely on conditional processors for more complex configurations. There are already a couple of ways of defining conditional processors, and maybe we can use them for complex configurations, so we can simplify the processor itself.
Taking the simplicity of rate_limiter to the extreme, imagine for example that it can only be configured with the field name and the limit. The examples in your comment could be defined as something like this:
processors:
  - if.has_fields: ['foo.bar']
    then:
      - if.equals.foo.bar: baz
        then:
          - rate_limiter:
              field: "foo.bar"
              limit: "4500/h"
        else:
          - rate_limiter:
              limit: "56/s"
    else:
      - rate_limiter:
          limit: "500/m"
processors:
  - rate_limiter:
      when.equals.cloudfoundry.org.name: "acme"
      limit: "500/s"
processors:
  - rate_limit:
      field: "cloudfoundry.org.name"
      limit: "500/s"
processors:
  - rate_limiter:
      by_field:
        - field: "cloudfoundry.org.name"
          value: "500/s"
        - field: "cloudfoundry.app.name"
          value: "100/s"
- rate_limiter:
    by_field:
      - fields:
          - "cloudfoundry.app.name"
          - "cloudfoundry.org.name"
        value: "100/s"
- rate_limiter:
    by_field:
      - field_pattern: "%{{cloudfoundry.app.name}}_%{{cloudfoundry.org.name}}"
        value: "100/s"
processors:
  - rate_limiter:
      period: second
      global_limit: 8
      by_field:
        - field: "foo.bar"
          limit: 56
          values:
            - baz: 1.25
But this may be more of an implementation detail, and not really needed if we simplify the processor.
Yeah, I had also thought of using conditionals for complex conditions but what I didn't like about it is there is a lot of repetition (and therefore chance of making errors) with the field name. So if you take the first example with conditionals:
processors:
  - if.has_fields: ['foo.bar']
    then:
      - if.equals.foo.bar: baz
        then:
          - rate_limiter:
              field: "foo.bar"
              limit: "4500/h"
        else:
          - rate_limiter:
              limit: "56/s"
    else:
      - rate_limiter:
          limit: "500/m"
The field foo.bar is repeated in three places.
OTOH, I do like the readability of using conditionals; I think it's much more obvious what rate limits will be applied in which cases. So on the whole, I'm +1 to going with the conditionals approach instead of my original proposed syntax.
Regarding the question about multiple fields causing ambiguity, I like this proposal:
- rate_limiter:
    by_field:
      fields:
        - "cloudfoundry.app.name"
        - "cloudfoundry.org.name"
      limit: "100/s"
It's similar to the field_pattern proposal but I like this one better because it's a bit easier for users to configure IMO. IIUC the purpose of the field_pattern is mainly to build a key for the rate limit tracking. If so, I think we can build this key internally instead of asking the user to supply it via field_pattern. Or are there other use cases you were thinking of unlocking with the field_pattern idea that I'm not thinking of?
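As an illustration of building the tracking key internally, here is a small Python sketch. It assumes events are nested dicts and that the key is simply the tuple of the configured fields' values. `get_field` and `rate_limit_key` are hypothetical names for this example only.

```python
def get_field(event, dotted_name):
    """Look up a dotted field name such as "cloudfoundry.org.name" in a nested dict."""
    value = event
    for part in dotted_name.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def rate_limit_key(event, fields):
    """Build the tracking key from the configured fields; None if any field is missing."""
    values = [get_field(event, name) for name in fields]
    if any(v is None for v in values):
        return None
    return tuple(values)
```

Using a tuple rather than a joined string avoids the collisions a pattern like `app_org` could produce when a value itself contains the separator, which is one more reason not to ask users for a `field_pattern`.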
Regarding the units, I think we should _try_ to keep it as easy for users as possible, which would mean allowing them to define different units for multiple values in the same processor. I think some of this will come down to the implementation of the rate limiting algorithm. If it proves to be too difficult then we can go with your suggestion of asking the user to define a common unit for the entire processor definition. But initially I think it would be nice if we could provide more flexibility to the users and take on the resulting complexity on ourselves in the implementation.
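One way to let users mix units is to normalize every limit to a common rate when the configuration is read. The sketch below assumes the `"<count>/<unit>"` syntax used in the examples in this thread, with `s`, `m`, and `h` as the only units. `parse_limit` is a hypothetical helper, not an existing Beats function.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_LIMIT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*/\s*([smh])\s*$")


def parse_limit(text):
    """Parse a limit such as "500/s" or "4500/h" into events per second."""
    match = _LIMIT_RE.match(text)
    if match is None:
        raise ValueError(f"invalid rate limit: {text!r}")
    count, unit = match.groups()
    return float(count) / _UNIT_SECONDS[unit]
```

With this normalization, `"4500/h"` becomes 1.25 events per second, so limits with different units in the same processor can be compared or enforced by the same algorithm.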
> Regarding the question about multiple fields causing ambiguity, I like this proposal:
>
>     - rate_limiter:
>         by_field:
>           fields:
>             - "cloudfoundry.app.name"
>             - "cloudfoundry.org.name"
>           limit: "100/s"

Reviewing my suggestion, I think one of the `by_field`/`fields` levels is not needed, so it could be like this:
- rate_limiter:
    fields:
      - "cloudfoundry.app.name"
      - "cloudfoundry.org.name"
    limit: "100/s"
Or do you expect to have more rate limiting methods apart from by_fields?

> I think we can build this key internally instead of asking the user to supply it via `field_pattern`. Or are there other use cases you were thinking of unlocking with the `field_pattern` idea that I'm not thinking of?

Agree, better to build this key internally; it will be less error-prone. I cannot think of any legitimate use case that would work with field_pattern and not with fields.

> Regarding the units, I think we should _try_ to keep it as easy for users as possible, which would mean allowing them to define different units for multiple values in the same processor. I think some of this will come down to the implementation of the rate limiting algorithm. If it proves to be too difficult then we can go with your suggestion of asking the user to define a common unit for the entire processor definition. But initially I think it would be nice if we could provide more flexibility to the users and take on the resulting complexity on ourselves in the implementation.

Agree, let's keep them as you proposed for now; we can reconsider this during implementation. In any case, I would only see it as a potential problem if we had multiple limits in the same processor, something we don't have in the simplified one.

> Or do you expect to have more rate limiting methods apart from by_fields?

No, let's simplify as you suggested and drop the extra level of configuration.
BTW, I just had my 1-1 with @exekias and he suggested naming this processor something like `sample` for two reasons:
- […] `add_metadata`, and […]
- […] drop filter plugin. Rate limiting with dropping events is basically sampling.
I slightly disagree :slightly_smiling_face: I agree that they do basically the same thing: dropping a part of the total events, but they have different use cases and expectations.
Rate limiting is related to the resources you have available to monitor something. If that something exceeds these limits, it cannot be properly monitored, and the limit protects the rest of the system so that other things are not affected.
Sampling is an optimization that you can apply to reduce the use of resources when you know that collecting a reduced quantity of data is going to give you similar information.
When you drop events because of a rate limit, you are losing information; when you drop events because of sampling, you do it in a controlled way, losing accuracy at most.
Also, we could decide to support different actions in the future. Apart from a default `action: drop`, we could have `action: write_to_local_file` or `action: wait`. These actions wouldn't make much sense for sampling, but can make sense for a rate limit.
In any case, we could decide to implement sampling in this processor, and if we implement both things in the same processor then I am ok with calling it sample. If not, I think we can call it rate_limit to make it an action, and call sample the one that does sampling when/if we implement it.
Take into account that the parameters required for sampling and rate limiting are different. For example, if I want to get 1% of the metrics, but no more than 100/s, I would need a definition like this one:
- sample:
    fields:
      - "cloudfoundry.app.name"
      - "cloudfoundry.org.name"
    sampling: 0.01
    limit: "100/s"
Or it could be done like this with more specific processors:
- sample:
    when.has_fields: ['cloudfoundry.app.name', 'cloudfoundry.org.name']
    sampling: 0.01
- rate_limit:
    fields:
      - "cloudfoundry.app.name"
      - "cloudfoundry.org.name"
    limit: "100/s"
Continuing with my previous comment, about doing sampling and rate limiting in the same processor. I am thinking now that they cover different use cases, and they probably require two completely different implementations (sampling is simpler than rate limiting), so in my opinion they should be two different processors.
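To make the difference concrete, here is a rough Python sketch of the two behaviours side by side. It assumes probabilistic sampling and, purely for brevity, a fixed-window rate limit. Both function names and the `(timestamp, payload)` event shape are assumptions for this example.

```python
import random


def sample(events, ratio, rng=random):
    """Sampling: keep each event independently with probability `ratio`."""
    return [event for event in events if rng.random() < ratio]


def rate_limit(events, limit, window_seconds):
    """Rate limiting: keep at most `limit` events per fixed time window.

    `events` is an iterable of (timestamp_in_seconds, payload) tuples.
    """
    kept, counts = [], {}
    for timestamp, payload in events:
        window = int(timestamp // window_seconds)
        if counts.get(window, 0) < limit:
            counts[window] = counts.get(window, 0) + 1
            kept.append((timestamp, payload))
    return kept
```

Sampling needs no state and thins the stream evenly regardless of volume, while rate limiting has to track counters over time and only drops once the limit is exceeded. Chaining them, as in `rate_limit(sample(events, 0.01), 100, 1)`, corresponds to the "1% of the metrics, but no more than 100/s" case.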
Thanks @jsoriano for your thoughtful comments on rate limiting vs. sampling. @exekias and I discussed some of these points too (off issue). In the end, I think there are enough differences, in semantics but also in options that might only make sense for either rate limiting or sampling, that I think we should make two separate processors as well.
Just to get things moving for now, I'm going to start working on a rate_limit processor PR. If we decide to include sampling in the same processor we can always make changes before the PR is merged.
Sounds good folks, sorry for the noise. My comment was around the fact that rate limiting doesn't necessarily imply dropping data, where sampling implies it. I think rate_limit as a name is good enough, as long as the expectations are correctly documented.
_The following are more my own notes for writing tests and docs but posting them here in case anyone sees any issues:_
Use case: rate limit all events to 10000/m
Configuration:
processors:
  - rate_limit:
      limit: "10000/m"
Use case: rate limit events from the `acme` Cloud Foundry org to 500/s
Configurations (each of these are alternatives to one another):
processors:
  - rate_limit:
      when.equals.cloudfoundry.org.name: "acme"
      limit: "500/s"
processors:
  - if.equals.cloudfoundry.org.name: "acme"
    then:
      - rate_limit:
          limit: "500/s"
Use case: rate limit events from the `acme` Cloud Foundry org and `roadrunner` space to 1000/h
Configurations (each of these are alternatives to one another):
processors:
  - rate_limit:
      when.and:
        - equals.cloudfoundry.org.name: "acme"
        - equals.cloudfoundry.space.name: "roadrunner"
      limit: "1000/h"
processors:
  - if.and:
      - equals.cloudfoundry.org.name: "acme"
      - equals.cloudfoundry.space.name: "roadrunner"
    then:
      - rate_limit:
          limit: "1000/h"
Use case: rate limit events for each distinct Cloud Foundry org to 400/s
Configuration:
processors:
  - rate_limit:
      fields:
        - "cloudfoundry.org.name"
      limit: "400/s"
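Use case: rate limit events for each distinct Cloud Foundry org and space combination to 20000/h
Configuration:
processors:
  - rate_limit:
      fields:
        - "cloudfoundry.org.name"
        - "cloudfoundry.space.name"
      limit: "20000/h"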
This is a bit of a contrived use case as we could probably just use a single field, cloudfoundry.space.id, in the configuration since that has globally unique values across all orgs, but I wanted to demonstrate the idea of a field combination.
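The "each distinct value" semantics of the last two use cases can be sketched as one counter per key. The Python below is only an illustration. It uses a fixed window for brevity (the actual algorithm is an open choice), and the class name and the explicit `now` argument are assumptions.

```python
class KeyedFixedWindowLimiter:
    """Applies the same limit independently to every distinct key."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # key -> (window index, events allowed in that window)

    def allow(self, key, now):
        """Return True if an event with this key may pass at time `now` (seconds)."""
        index = int(now // self.window)
        seen_index, count = self.counters.get(key, (index, 0))
        if seen_index != index:
            count = 0  # a new window started for this key
        if count >= self.limit:
            self.counters[key] = (index, count)
            return False
        self.counters[key] = (index, count + 1)
        return True
```

For the org and space combination, the key would be a pair such as `("acme", "roadrunner")`, so one busy space does not consume the quota of another.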
The PR for a basic rate_limit processor is up here: https://github.com/elastic/infra/issues/25378.
Additionally, we will need a follow up PR for this requirement:
Add a field or tag to the events to indicate that they are being throttled.
Some quick thoughts about this requirement after discussing it with @jsoriano off-issue:
- […] `rate_limit` processor, the field or tag will need to go on the next event that is allowed through.
- […] `rate_limit` processor.
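The idea of putting the marker on the next event that is allowed through could look roughly like the following Python sketch. The wrapper class and the field names `throttled` and `throttled_count` are hypothetical and chosen only for illustration. Nothing here reflects an agreed design.

```python
class MarkingLimiter:
    """Wraps an allow/deny decision and marks the next allowed event after drops."""

    def __init__(self, allow):
        self.allow = allow   # callable(event) -> bool, e.g. a rate limiter
        self.dropped = 0     # events dropped since the last one let through

    def process(self, event):
        """Return the event (possibly annotated), or None if it is dropped."""
        if not self.allow(event):
            self.dropped += 1
            return None
        if self.dropped:
            # Hypothetical field names, used only in this sketch.
            event = dict(event, throttled=True, throttled_count=self.dropped)
            self.dropped = 0
        return event
```

Carrying a count along with the flag would let consumers see how much data was lost, though whether that is wanted is a separate decision.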