Beats: Add event rate quota per Cloud Foundry organization

Created on 8 Sep 2020 · 14 Comments · Source: elastic/beats

Describe the enhancement:

Add some kind of event rate quota per Cloud Foundry organization, so that events can be dropped once some limit is reached.

There is an issue about adding throttling in general (https://github.com/elastic/beats/issues/17775), but this may require many more changes.
In the case of Cloud Foundry we could add the rate limit per organization in the specific input.

Add a field or tag to the events to indicate that they are being throttled.

Describe a specific use case for the enhancement or feature:

On clusters with many organizations it may be good to limit the event rate so that organizations make fair use of resources.

Platforms enhancement v7.11.0

jsoriano

Most helpful comment

_The following are more my own notes for writing tests and docs but posting them here in case anyone sees any issues:_

Use case: rate limit all events to 10000 /m

Configuration:

processors:
- rate_limit:
   limit: "10000/m"

Use case: rate limit events from the `acme` Cloud Foundry org to 500 /s

Configurations (each one is an alternative to the other):

processors:
- rate_limit:
    when.equals.cloudfoundry.org.name: "acme"
    limit: "500/s"

processors:
- if.equals.cloudfoundry.org.name: "acme"
  then:
  - rate_limit:
      limit: "500/s"

Use case: rate limit events from the `acme` Cloud Foundry org and `roadrunner` space to 1000 /h

Configurations (each one is an alternative to the other):

processors:
- rate_limit:
    when.and:
    - equals.cloudfoundry.org.name: "acme"
    - equals.cloudfoundry.space.name: "roadrunner"
    limit: "1000/h"

processors:
- if.and:
  - equals.cloudfoundry.org.name: "acme"
  - equals.cloudfoundry.space.name: "roadrunner"
  then:
  - rate_limit:
      limit: "1000/h"

Use case: rate limit events for each distinct Cloud Foundry org to 400 /s

processors:
- rate_limit:
   fields:
   - "cloudfoundry.org.name"
   limit: "400/s"

Use case: rate limit events for each distinct Cloud Foundry org and space combination to 20000 /h

processors:
- rate_limit:
   fields:
   - "cloudfoundry.org.name"
   - "cloudfoundry.space.name"
   limit: "20000/h"

This is a bit of a contrived use case as we could probably just use a single field, cloudfoundry.space.id, in the configuration since that has globally unique values across all orgs, but I wanted to demonstrate the idea of a field combination.

ycombinator on 2 Dec 2020

❤1 👍1

All 14 comments

Pinging @elastic/integrations-platforms (Team:Platforms)

elasticmachine on 8 Sep 2020

Discussed this with @jsoriano off-issue so wanted to summarize our discussion here:

There's no way to ask CF to restrict events by org when consuming them from the firehose.
Consequently, Beats would have to consume all the events from the firehose and then drop any events that violate the defined rate limiting policy.
The advantage of this approach is that we can implement this functionality as a generic rate limiting processor that can be used against _any_ events. For example, we could use it to rate limit events based on kubernetes.namespace.
The disadvantage of this approach is that the rate limiting only helps with resource consumption in Beats and downstream components (ES, LS, etc.); it does not help with resource usage in CF itself.
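
To make the consume-then-drop idea concrete, here is a minimal sketch in Python (not Beats code). The function name `drop_over_limit` and its parameters are hypothetical, and the sketch ignores time entirely: it models a single rate-limit window.

```python
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, str]


def drop_over_limit(
    events: Iterable[Event],
    key_of: Callable[[Event], str],
    allowed_per_key: int,
) -> Iterator[Event]:
    """Yield events until a key has used up its allowance, then drop the rest."""
    seen: Dict[str, int] = {}
    for event in events:
        key = key_of(event)
        seen[key] = seen.get(key, 0) + 1
        if seen[key] <= allowed_per_key:
            yield event  # within the allowance: pass downstream
        # Otherwise the event has already been read from the source and is
        # simply discarded here, so the source itself sees no relief.


# Illustrative input only: four events from two namespaces.
firehose = [
    {"kubernetes.namespace": "a"},
    {"kubernetes.namespace": "a"},
    {"kubernetes.namespace": "b"},
    {"kubernetes.namespace": "a"},
]
kept = list(drop_over_limit(firehose, lambda e: e["kubernetes.namespace"], 2))
```

The third event from namespace `a` is read and then dropped, which is exactly why this approach saves resources downstream but not in CF.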

ycombinator on 30 Nov 2020

Starting to think a bit about what the configuration for such a processor might look like, here's an initial proposal (not married to any of the setting names, of course):

processors:
- rate_limiter:
    global: "500/m"          # optional, but either global or by_field or both must be specified
    by_field:                # optional, but either global or by_field or both must be specified
    - field: "foo.bar"       # required
      value: "56/s"          # optional, but either value or values or both must be specified
      values:                # optional, but either value or values or both must be specified
      - baz: "4500/h"        # required

The way the above example configuration would be interpreted is:

For events that have foo.bar == "baz", a rate limit of 4500 events per hour will be applied.
For all other events that have a foo.bar field present, a rate limit of 56 events per second will be applied.
For all other events, a rate limit of 500 events per minute will be applied.
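
The precedence described above can be sketched as a small lookup in Python. The setting names come from the proposal; the in-memory structure (in particular `values` flattened into a single dict) and the function name `resolve_limit` are assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical in-memory form of the proposed settings.
config = {
    "global": "500/m",
    "by_field": [
        {"field": "foo.bar", "value": "56/s", "values": {"baz": "4500/h"}},
    ],
}


def resolve_limit(event: Dict[str, str], cfg: dict) -> Optional[str]:
    """Return the limit string that applies to an event, most specific first."""
    for rule in cfg.get("by_field", []):
        field = rule["field"]
        if field not in event:
            continue
        per_value = rule.get("values", {})
        if event[field] in per_value:
            return per_value[event[field]]  # exact value match wins
        if "value" in rule:
            return rule["value"]  # field present, no value-specific limit
    return cfg.get("global")  # fall back to the global limit, if any
```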

This configuration would allow for complex rate limiting policies to be configured while also allowing simple ones to be configured quite simply. For example, to enforce a rate limit of 500 events per second for events from cloud foundry org acme, the configuration would look like:

processors:
- rate_limiter:
    by_field:
    - field: "cloudfoundry.org.name"
      values: 
      - acme: "500 /s"

I'm deliberately leaving aside the choice of rate limiting algorithm (fixed window, sliding window, token bucket, leaky bucket) for now. Just trying to focus on the configuration UX first.
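
For reference while the algorithm choice is open, this is a minimal token bucket sketch in Python. It is one of the candidates listed above, picked purely for illustration and not a statement about what the processor will use. The class name, the `burst` parameter, and the injectable `clock` are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `burst` events at once, refilled at `rate_per_second`."""

    def __init__(
        self,
        rate_per_second: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = burst  # start full
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the event may pass, False if it should be dropped."""
        now = self.clock()
        # Refill according to elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Compared with a fixed window, the bucket lets short bursts through while holding the long-run average at the configured rate.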

WDYT @jsoriano?

ycombinator on 30 Nov 2020

@ycombinator thanks for your proposal, it looks quite good, but I have some observations; maybe we can make the processor simpler.

One is that instead of having global and per field configurations, we could rely on conditional processors for more complex configurations. There are already a couple of ways of defining conditional processors, and maybe we can use them for complex configurations, so we can simplify the processor itself.
Taking the simplicity of rate_limiter to the extreme, imagine for example that it can only be configured with the field name and the limit. The examples in your comment could be defined as something like this:

processors:
- if.has_fields: ['foo.bar']
  then:
    - if.equals.foo.bar: baz
      then:
      - rate_limiter:
          field: "foo.bar"
          limit: "4500/h"
      else:
      - rate_limiter:
          limit: "56/s"
  else:
    - rate_limiter:
        limit: "500/m"

processors:
- rate_limiter:
    when.equals.cloudfoundry.org.name: "acme"
    limit: "500/s"

processors:
- rate_limit:
    field: "cloudfoundry.org.name"
    limit: "500/s"

processors:
- rate_limiter:
    by_field:
    - field: "cloudfoundry.org.name"
      value: "500/s"
    - field: "cloudfoundry.app.name"
      value: "100/s"

- rate_limiter:
    by_field:
    - fields:
      - "cloudfoundry.app.name"
      - "cloudfoundry.org.name"
      value: "100/s"

- rate_limiter:
    by_field:
    - field_pattern: "%{{cloudfoundry.app.name}}_%{{cloudfoundry.org.name}}"
      value: "100/s"

processors:
- rate_limiter:
    period: second
    global_limit: 8
    by_field:
    - field: "foo.bar"
      limit: 56
      values:
      - baz: 1.25

But this may be more of an implementation detail, and not so necessary if we simplify the processor.

jsoriano on 30 Nov 2020

Yeah, I had also thought of using conditionals for complex conditions but what I didn't like about it is there is a lot of repetition (and therefore chance of making errors) with the field name. So if you take the first example with conditionals:

processors:
- if.has_fields: ['foo.bar']
  then:
    - if.equals.foo.bar: baz
      then:
      - rate_limiter:
          field: "foo.bar"
          limit: "4500/h"
      else:
      - rate_limiter:
          limit: "56/s"
  else:
    - rate_limiter:
        limit: "500/m"

The field foo.bar is repeated in three places.

OTOH, I do like the readability of using conditionals — I think it's much more obvious what rate limits will be applied in which cases. So on the whole, I'm +1 to going with the conditionals approach instead of my original proposed syntax.

Regarding the question about multiple fields causing ambiguity, I like this proposal:

- rate_limiter:
    by_field:
      fields:
      - "cloudfoundry.app.name"
      - "cloudfoundry.org.name"
      limit: "100/s"

It's similar to the field_pattern proposal but I like this one better because it's a bit easier for users to configure IMO. IIUC the purpose of the field_pattern is mainly to build a key for the rate limit tracking. If so, I think we can build this key internally instead of asking the user to supply it via field_pattern. Or are there other use cases you were thinking of unlocking with the field_pattern idea that I'm not thinking of?
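
A minimal sketch of what building the key internally could look like, in Python. The function name `build_key`, the length-prefix encoding, and returning `None` when a field is missing are all assumptions for illustration, not a description of the eventual implementation.

```python
from typing import Dict, Optional, Sequence


def build_key(event: Dict[str, str], fields: Sequence[str]) -> Optional[str]:
    """Join the values of the configured fields into one tracking key.

    Returns None when any configured field is missing from the event; what
    the processor should do in that case is left open here.
    """
    values = []
    for name in fields:
        if name not in event:
            return None
        values.append(event[name])
    # Length-prefix each value so that ("a_b", "c") and ("a", "b_c") produce
    # different keys, which a plain "_" join would not guarantee.
    return "".join(f"{len(v)}:{v}" for v in values)
```

A user-supplied pattern such as `app_org` joined with an underscore can collide when the values themselves contain underscores; generating the key internally avoids pushing that concern onto the user.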

Regarding the units, I think we should _try_ to keep it as easy for users as possible, which would mean allowing them to define different units for multiple values in the same processor. I think some of this will come down to the implementation of the rate limiting algorithm. If it proves to be too difficult then we can go with your suggestion of asking the user to define a common unit for the entire processor definition. But initially I think it would be nice if we could provide more flexibility to the users and take on the resulting complexity on ourselves in the implementation.

ycombinator on 1 Dec 2020

👍1

> Regarding the question about multiple fields causing ambiguity, I like this proposal:
>
> - rate_limiter:
>     by_field:
>       fields:
>       - "cloudfoundry.app.name"
>       - "cloudfoundry.org.name"
>       limit: "100/s"

Reviewing my suggestion, I think one of the by_field/fields levels is not needed, so it could be like this:

- rate_limiter:
    fields:
    - "cloudfoundry.app.name"
    - "cloudfoundry.org.name"
    limit: "100/s"

Or do you expect to have more rate limiting methods apart from by_field?

> I think we can build this key internally instead of asking the user to supply it via field_pattern. Or are there other use cases you were thinking of unlocking with the field_pattern idea that I'm not thinking of?

Agreed, it is better to build this key internally; it will be less error-prone. I cannot think of any legitimate use case that would work with field_pattern and not with fields.

> Regarding the units, I think we should _try_ to keep it as easy for users as possible, which would mean allowing them to define different units for multiple values in the same processor. I think some of this will come down to the implementation of the rate limiting algorithm. If it proves to be too difficult then we can go with your suggestion of asking the user to define a common unit for the entire processor definition. But initially I think it would be nice if we could provide more flexibility to the users and take on the resulting complexity on ourselves in the implementation.

Agreed, let's keep them as you proposed for now; we can reconsider this during implementation. In any case, I would only see it as a potential problem if we had multiple limits in the same processor, which is not the case in the simplified one.
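
To illustrate how mixed units could be handled, here is a minimal Python sketch that parses limit strings of the form used in this thread (`"500/s"`, `"10000/m"`, `"1000/h"`) and normalizes them to a per-second rate. The function names and the tolerance for whitespace around the slash are assumptions.

```python
import re
from typing import Tuple

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_LIMIT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*/\s*([smh])\s*$")


def parse_limit(text: str) -> Tuple[float, int]:
    """Parse "<count>/<unit>" into (count, window length in seconds)."""
    match = _LIMIT_RE.match(text)
    if match is None:
        raise ValueError(f"invalid rate limit: {text!r}")
    return float(match.group(1)), _UNIT_SECONDS[match.group(2)]


def per_second(text: str) -> float:
    """Normalize a limit string to events per second."""
    count, window = parse_limit(text)
    return count / window
```

With this normalization, `"4500/h"` becomes 1.25 events per second, so limits written in different units can be compared or enforced on a common scale.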

jsoriano on 1 Dec 2020

> Or do you expect to have more rate limiting methods apart from by_field?

No, let's simplify as you suggested and drop the extra level of configuration.

ycombinator on 1 Dec 2020

BTW, I just had my 1-1 with @exekias and he suggested naming this processor something like sample for two reasons:

1. All our processors are named like actions, i.e. they are verbs, e.g. add_metadata, and
2. Rate limiting with dropping events is basically sampling. Plus, you had also suggested off-issue that we can introduce a percentage-based sampling strategy to this processor in the future, à la the Logstash drop filter plugin.

ycombinator on 1 Dec 2020

> Rate limiting with dropping events is basically sampling.

I slightly disagree 🙂 I agree that they basically do the same thing, dropping a part of the total events, but they have different use cases and expectations.
Rate limiting is related to the resources you have available to monitor something: if that something exceeds the limits, it cannot be properly monitored, and you protect the rest of the system so that other things are not affected.
Sampling is an optimization that you can apply to reduce resource usage when you know that collecting a reduced quantity of data will give you similar information.
When you drop events because of a rate limit, you lose information; when you drop events because of sampling, you do it in a controlled way, losing accuracy at most.

Also, we could decide to support different actions in the future. Apart from a default action: drop, we could have action: write_to_local_file or action: wait. These actions wouldn't make much sense for sampling, but can make sense for a rate limit.

In any case, we could decide to implement sampling in this processor, and if we implement both things in the same processor then I am OK with calling it sample. If not, I think we can call it rate_limit to make it an action, and use sample for the one that does sampling when/if we implement it.

Take into account that the parameters required for sampling and rate limiting are different. For example, if I want to get 1% of the metrics, but no more than 100/s, I would need a definition like this one:

- sample:
    fields:
    - "cloudfoundry.app.name"
    - "cloudfoundry.org.name"   
    sampling: 0.01
    limit: "100/s"

Or it could be done like this, with more specific processors:

- sample:
    when.has_fields: ['cloudfoundry.app.name', 'cloudfoundry.org.name']
    sampling: 0.01
- rate_limit:
    fields:
    - "cloudfoundry.app.name"
    - "cloudfoundry.org.name"   
    limit: "100/s"
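
As a rough Python sketch of the two-stage variant, sampling and rate limiting can be modelled as separate steps chained together. Everything here is illustrative: the function names mirror the hypothetical processor names, events are assumed to carry a numeric `timestamp`, the rate limiter uses a simple fixed window, and per-field keys are left out for brevity.

```python
import random
from typing import Dict, Iterable, Iterator, Optional

Event = Dict[str, float]


def sample(events: Iterable[Event], fraction: float, rng: random.Random) -> Iterator[Event]:
    """Keep each event independently with probability `fraction`."""
    for event in events:
        if rng.random() < fraction:
            yield event


def rate_limit(events: Iterable[Event], limit: int, window_seconds: float) -> Iterator[Event]:
    """Keep at most `limit` events per fixed window, based on event timestamps."""
    window_start: Optional[float] = None
    count = 0
    for event in events:
        ts = event["timestamp"]
        if window_start is None or ts - window_start >= window_seconds:
            window_start, count = ts, 0  # a new window begins with this event
        count += 1
        if count <= limit:
            yield event


def pipeline(
    events: Iterable[Event],
    fraction: float,
    limit: int,
    window_seconds: float,
    rng: random.Random,
) -> Iterator[Event]:
    """Sample first, then cap whatever survives the sampling."""
    return rate_limit(sample(events, fraction, rng), limit, window_seconds)
```

The sampling stage needs only a fraction and a random source, while the rate limiting stage needs a limit, a window, and state, which is one way to see why the two take different parameters.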

jsoriano on 1 Dec 2020

Continuing with my previous comment, about doing sampling and rate limiting in the same processor. I am thinking now that they cover different use cases, and they probably require two completely different implementations (sampling is simpler than rate limiting), so in my opinion they should be two different processors.

jsoriano on 1 Dec 2020

Thanks @jsoriano for your thoughtful comments on rate limiting vs. sampling. @exekias and I discussed some of these points too (off issue). In the end, I think there are enough differences, in semantics but also in options that might only make sense for either rate limiting or sampling, that I think we should make two separate processors as well.

Just to get things moving for now, I'm going to start working on a rate_limit processor PR. If we decide to include sampling in the same processor, we can always make changes before the PR is merged.

ycombinator on 2 Dec 2020

👍1

Sounds good folks, sorry for the noise. My comment was around the fact that rate limiting doesn't necessarily imply dropping data, where sampling implies it. I think rate_limit as a name is good enough, as long as the expectations are correctly documented.

exekias on 2 Dec 2020

👍2

The PR for a basic rate_limit processor is up here: https://github.com/elastic/infra/issues/25378.

Additionally, we will need a follow up PR for this requirement:

> Add a field or tag to the events to indicate that they are being throttled.

Some quick thoughts about this requirement after discussing it with @jsoriano off-issue:

- As rate-limited events are dropped by the rate_limit processor, the field or tag will need to go on the next event that is allowed through.
- The field or tag name (and maybe even the value?) should probably be configurable via an optional setting on the rate_limit processor.
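
A minimal Python sketch of tagging the next allowed event, under stated assumptions: the function name `limit_and_tag` and the default field name `rate_limit.dropped_before` are hypothetical, and storing the number of dropped events as the field value is just one possible choice.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Event = Dict[str, Any]


def limit_and_tag(
    events: Iterable[Event],
    allow: Callable[[Event], bool],
    tag_field: str = "rate_limit.dropped_before",  # hypothetical field name
) -> Iterator[Event]:
    """Drop disallowed events and mark the next allowed one as throttled."""
    dropped = 0
    for event in events:
        if not allow(event):
            dropped += 1  # remember the drop; the dropped event cannot carry the tag
            continue
        if dropped:
            # Copy the event so the caller's dict is not modified.
            event = {**event, tag_field: dropped}
            dropped = 0
        yield event
```

Here `allow` stands in for whatever rate limiting decision is used; an event only gets the field when at least one event was dropped since the previous one that passed.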

ycombinator on 10 Dec 2020

👍1
