Vector: High cardinality protection

Created on 6 Feb 2020 · 21 Comments · Source: timberio/vector

This is something that came out of meetings with users. High-cardinality tags on metrics data can severely disrupt downstream metrics systems. We should offer a transform, or setting, that protects against this.

Something like hyperloglog is an interesting solution to maintain rough cardinality counts on attributes. And perhaps the lossiness could be configurable too? Ex:

Behavior

  1. If a single field's cardinality exceeds a configurable limit, new values will be rejected/removed.
  2. Log a warning when the count comes within 10% of the limit.
  3. I don't want to concern ourselves with global cardinality here. I think that is better solved by limiting tags to a defined schema?

Example

[transforms.cardinality_no_more]
   type  = "cardinality_limiter"
   max = 500
   algorithm = "hyperloglog"

All 21 comments

Would we essentially build a first-come-first-served list of acceptable values and always let those pass? In that case, we'd need more than a count like HyperLogLog.

If we expect the max to be relatively low (which seems like the point), we'd likely be fine just keeping a set of ~500 strings in memory. If we do want to go probabilistic, we could do something like build up a bloom filter from the first N values and then check against that.
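A minimal sketch of that first-come-first-served approach for a single tag key (names are illustrative, not actual Vector code):

    use std::collections::HashSet;

    // Accept the first `max` distinct values seen; reject anything new after that.
    struct ValueLimiter {
        accepted: HashSet<String>,
        max: usize,
    }

    impl ValueLimiter {
        fn new(max: usize) -> Self {
            Self { accepted: HashSet::new(), max }
        }

        // Returns true if the value should be allowed to pass through.
        fn allow(&mut self, value: &str) -> bool {
            if self.accepted.contains(value) {
                true // previously accepted values always pass
            } else if self.accepted.len() < self.max {
                self.accepted.insert(value.to_owned());
                true // still under the limit: accept and remember it
            } else {
                false // limit reached and this value is new
            }
        }
    }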

That makes sense. As a follow-up, how would we persist this state across restarts? That's important since the order of the values is not guaranteed.

I think the most basic version would not persist it across restarts.

If we want to add that, I think we should put in the time to expose some slightly generalized storage feature for components. The file source offset checkpoints are really the only state we have now, but I wouldn't want to reimplement all of the logic around picking a data directory, etc. We'll also want to handle things like namespacing, etc, in a safe way.

The file source offset checkpoints are really the only state we have now

The journald source also saves a checkpoint. It's just a single file/record though for each journald source. While implementing it I generalized a bit of the filename handling between them.

High-cardinality tags on metrics data can severely disrupt downstream metrics systems.

I'd like to clarify if this issue is about limiting the cardinality of values for metric tags or for metric data. If it's the latter then we have to discuss which types of metrics it makes sense to operate on. For instance I could imagine how this would work for the "set" type pretty easily, but don't think it would make sense for "histogram" data, for instance.

If we only care about 'tags' data, then that makes this more straightforward.

Also should this transform limit the cardinality of all fields in the metrics/tags data equally, or should it be configured to just a certain field/list of fields? Do we need the ability to limit some fields to a certain degree of cardinality and other fields to a different degree within the same Metric event?

Yes, this is about limiting the cardinality of individual tags (not data). The goal for v1 is to provide Vector operators protection from upstream changes and mistakes so that they can sleep at night :). For example:

  1. A developer adds a single request_id tag whose value has a high cardinality.
  2. A developer uses a high cardinality value as part of the tag name. Ex: ‘request_#{request_id}’.

Both mistakes would cause service disruption for most metrics storages. The first is more common, but since both cause service disruptions, we should address both. I propose 3 settings:

  1. global_cardinality_limit - keys X values, the upstream developer can distribute cardinality across keys and values however they wish.
  2. key_cardinality_limit - the maximum number of unique key names.
  3. value_cardinality_limit - the maximum allowed unique values for a _single_ key.

Finally, it’s worth considering a stream_discriminants option (see the merge transform), but this is not required for v1.

And it goes without saying, I want to think carefully about memory usage here and run some tests to give us a general idea of what to expect.

Okay sounds good. Just to be super clear, for value_cardinality_limit, this limit would be a max # of unique values for any single key, or would users specify a single key name that they care to limit?

It would be for any single key. For v1 I don't think we should target specific tags, and based on my user interviews I haven't seen a need for it.

I agree with @lukesteensen's assessment above that the two options that make the most sense are just storing the strings in a set in memory, or using a bloom filter. A bloom filter would be more space efficient, at the cost of slightly higher CPU overhead (the extra hash functions) and occasionally letting new tags through even after the limit has been reached. Also, if we want to be able to persist the state, rust's built-in bloom filter implementation doesn't provide any way to serialize itself, so it couldn't be persisted. I did find some other third-party bloom filter implementations that provide serialization (for example this and this), but who knows what kind of quality they have. We'd probably want to do some decent testing to compare their performance against rust's built-in bloom filter implementation before using any of them.

Since the approach of keeping all keys in a set in memory is appealing due to its simplicity and how straightforward it would be to persist it, let's think about what the memory cost would be.

Let's say that the average character in a tag (field name or value) is 2 bytes (Rust encodes all strings as UTF-8; all US-ASCII characters are only 1 byte, and the alphabets for almost all other languages except Chinese, Japanese, and Korean can be encoded in 2 bytes). Let's say the average tag field name will be less than 100 characters long, and the average value will be less than 1000 characters long. So that's 200 bytes per field name and 2 KB per value.

Next, let's assume that the upper bound of the cardinality limit we need to support is around 1000. If a user configured key_cardinality_limit = 1000 and value_cardinality_limit = 1000, that would give them 1 thousand unique keys and 1 million unique values to work with. So that means we're looking at about 2 GB worth of memory overhead in the worst case. If the average tag values are actually closer to 100 characters long than 1000, then that brings the memory usage down closer to 200 MB.

That seems... probably too high. If we limit the cardinality to 100 for both, however, then it gets much more manageable at roughly 20 MB (or about 2 MB with 100-character values).
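For reference, a quick sketch of the arithmetic behind those estimates (same assumptions as above: ~200 bytes per key name and ~2 KB per value):

    // Rough worst-case memory for the in-memory set approach (back-of-the-envelope only).
    const KEY_BYTES: usize = 100 * 2;    // ~100-char key name * 2 bytes/char
    const VALUE_BYTES: usize = 1000 * 2; // ~1000-char value * 2 bytes/char

    fn worst_case_bytes(key_limit: usize, value_limit: usize) -> usize {
        key_limit * KEY_BYTES + key_limit * value_limit * VALUE_BYTES
    }

    fn main() {
        println!("{}", worst_case_bytes(1000, 1000)); // 2_000_200_000 (~2 GB)
        println!("{}", worst_case_bytes(100, 100));   // 20_020_000 (~20 MB)
    }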

So I think the question comes down to how low we're comfortable limiting the max cardinality users can configure (not saying we would actually enforce a limit, I'm just talking about the practical usability limit implicitly imposed), how important persisting the state is, and how willing we are to use a random third-party implementation of bloom filter to gain persistence.

Design proposal using bloom filters

Config specification

[transforms.tag_cardinality]
  type = "tag_cardinality_limit"
  value_limit = 500 # default, explicit 0 means unset
  keys_limit = 0 # defaults to 0, meaning unset
  global_limit = 0 # defaults to 0, meaning unset
  limit_exceeded_action = "drop_tag" # enum, two options: drop_tag, drop_event.  Default: drop_tag
  false_positive_rate = 0.01 # Will set a sensible default based on testing how much memory is utilized by the bloom filter

Main behavior

The value_limit and keys_limit behaviors will share a single data structure when either or both are in use. global_limit has a separate data structure when in use. Users will be able to use any combination of the three together or separately. We will avoid using any memory or other resources for limit modes that are not in use.

value_limit and key_limit

When non-zero values for both value_limit and key_limit have been specified, we will instantiate a HashMap<Atom, Option<Box<BloomFilter>>>. We will also keep 2 counters: num_values and num_keys. As new Metric events come in, we will first check that num_values and num_keys are less than the configured value_limit and keys_limit, respectively. Assuming so, we will iterate over all tags in the Metric, insert the given key into the HashMap (constructing a new BloomFilter instance if the key did not already exist in the HashMap), and insert into the corresponding BloomFilter a 2-tuple whose first element is the TypeId of the value and whose second element is the Value.as_bytes(), incrementing num_values and num_keys as needed. If at any point key_limit is hit, we will check that the key is already in the HashMap; otherwise we take the configured limit_exceeded_action. If at any point value_limit is hit, we look up the value in the BloomFilter corresponding to the key, and if the value isn't present in the BloomFilter then we take the configured limit_exceeded_action.
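A simplified sketch of that per-tag check, using a HashSet per key as a stand-in for the BloomFilter and a per-key value count instead of the single counters described above (purely illustrative, not the actual implementation):

    use std::collections::{HashMap, HashSet};

    struct TagCardinalityLimit {
        key_limit: usize,
        value_limit: usize,
        // key -> accepted values for that key (a BloomFilter in the real design)
        seen: HashMap<String, HashSet<Vec<u8>>>,
    }

    impl TagCardinalityLimit {
        // Returns true if the (key, value) pair is acceptable; false means the
        // configured limit_exceeded_action should be taken.
        fn check_tag(&mut self, key: &str, value: &[u8]) -> bool {
            if let Some(values) = self.seen.get_mut(key) {
                if values.contains(value) {
                    true // value already accepted for this key
                } else if values.len() < self.value_limit {
                    values.insert(value.to_vec());
                    true // still under value_limit: accept and remember it
                } else {
                    false // value_limit hit for this key
                }
            } else if self.seen.len() < self.key_limit {
                // New key while still under key_limit: accept and remember it.
                let mut values = HashSet::new();
                values.insert(value.to_vec());
                self.seen.insert(key.to_string(), values);
                true
            } else {
                false // key_limit hit and this key is new
            }
        }
    }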

key_limit without value_limit

Works the same as above but the HashMap values are left as None, avoiding allocating a BloomFilter unnecessarily.

value_limit without key_limit

Works the same as above, except we don't enforce any limit on num_keys.

global_limit

If a non-zero value of global_limit is specified, we instantiate a single BloomFilter and a single counter, num_keys_or_values. As metrics come in, we iterate over all the tags, and for each tag we insert the key and the value into the BloomFilter separately. We will create an enum EntryType with two variants, EntryType::Key and EntryType::Value, which we will use when inserting entries into the BloomFilter.

key handling

For each key we will insert into the BloomFilter a 2-tuple whose first element is EntryType::Key and whose second element is the key as an Atom.

value handling

For each value we will insert into the BloomFilter a 4-tuple whose first element is EntryType::Value, second element is the key as an Atom, third element is the TypeId of the Value, and whose fourth element is the Value.as_bytes().
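For illustration, the global_limit entry encoding could look roughly like this, with a byte buffer standing in for whatever hashing the real BloomFilter would do (names and layout are illustrative):

    use std::any::TypeId;

    enum EntryType { Key = 0, Value = 1 }

    // Key entries: (EntryType::Key, key)
    fn key_entry(key: &str) -> Vec<u8> {
        let mut buf = vec![EntryType::Key as u8];
        buf.extend_from_slice(key.as_bytes());
        buf
    }

    // Value entries: (EntryType::Value, key, TypeId of the value, value bytes)
    fn value_entry(key: &str, type_id: TypeId, value: &[u8]) -> Vec<u8> {
        let mut buf = vec![EntryType::Value as u8];
        buf.extend_from_slice(key.as_bytes());
        buf.extend_from_slice(format!("{:?}", type_id).as_bytes());
        buf.extend_from_slice(value);
        buf
    }

Both kinds of entries go into the same filter and count toward the single num_keys_or_values counter.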

Additional details

limit_exceeded_action

When a Metric event comes in with a new tag that would exceed any of the configured limits, we will do one of two things depending on the configured limit_exceeded_action. If limit_exceeded_action is set to drop_tag (the default), then we will remove the tag key-value pair from the incoming Event before passing it on from the Transform. If limit_exceeded_action is set to drop_event, then we will drop the entire event and not emit it from the Transform.
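A minimal sketch of how the two actions could be applied (the MetricEvent type here is a hypothetical stand-in, not Vector's actual event type):

    use std::collections::BTreeMap;

    struct MetricEvent { tags: BTreeMap<String, String> }

    enum LimitExceededAction { DropTag, DropEvent }

    // Returns true if the event should still be emitted after handling the offending tag.
    fn apply_action(action: &LimitExceededAction, event: &mut MetricEvent, tag_key: &str) -> bool {
        match action {
            LimitExceededAction::DropTag => {
                event.tags.remove(tag_key); // drop only the offending key-value pair
                true
            }
            LimitExceededAction::DropEvent => false, // drop the whole event
        }
    }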

Configuring the BloomFilters

BloomFilters have two parameters that need to be set when they are initialized: rate and expected_num_items. expected_num_items will be set to value_limit or global_limit, and rate will be set to false_positive_rate. At startup we will print the memory cost of the data structures backing this Transform, so that users can make more informed decisions about what to set false_positive_rate to. false_positive_rate controls how likely it is that a new, never-before-seen tag will slip through the transform even after one of the limits has been reached.
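For illustration, initialization might look roughly like the following. This assumes the bloom crate's BloomFilter::with_rate constructor and its ASMS trait for insert/contains; treat the exact API as an assumption rather than settled.

    use bloom::{BloomFilter, ASMS};

    // rate = acceptable false-positive probability,
    // expected_num_items = value_limit (or global_limit for the shared filter).
    fn new_filter(false_positive_rate: f32, value_limit: u32) -> BloomFilter {
        BloomFilter::with_rate(false_positive_rate, value_limit)
    }

    fn example() {
        let mut filter = new_filter(0.01, 500);
        filter.insert(&"some-tag-value");
        assert!(filter.contains(&"some-tag-value"));
    }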

I went ahead and wrote up a design proposal based on using BloomFilters since that is the slightly more complicated design. If we decide to go with the other approach, you can basically just replace all instances of the word BloomFilter with HashSet above and remove the false_positive_rate config option; otherwise the design is basically the same.

I think we can simplify the initial implementation by leaving out the concepts of key_limit and global_limit for now. It's not clear to me that users really need those, so I'd rather wait to implement them.

Otherwise, this looks solid! There may be some additional things to add later (e.g. specifying which tags to act on or not, etc), but those can come later after discussion with users.

Okay great, thanks for the review @lukesteensen!

So I take it then that you feel that the memory saving of BloomFilter is worth it even if that means we have to forgo persisting the set of accepted values or switch to a non-standard BloomFilter impl to do that? Just want to double check since @binarylogic specifically called out the persistence as something to think about for the design here.

I could go either way between bloom filters and sets. Each will likely be a better fit for different users. I don't think we should care about persistence right now. We'll likely implement both eventually, so it'd make sense to start with the simplest one and try to keep it relatively isolated.

And just FYI, there's nothing official or built-in about the bloom crate you linked. It seems like a perfectly good library, but not otherwise special :smile:

Okay, if we're indifferent between the two approaches I guess I'll start with the HashSet approach since it's slightly more straightforward.

Since the HashSet approach is so simple to build I wonder if I should just go ahead and build both versions now? If you think it's inevitable that we're going to do both anyway...
Or would you rather just put out one for now and wait until we have some users to get more input on how it's used/what the requirements are?

That would be fine with me! It could be an advantage to require the user to pick a strategy upfront, rather than implementing one and adding the other as an option later.

okay cool, will do! Has the nice added advantage of making the project more interesting/fun for me to build! 😎

I'm still putting some finishing touches on this work before putting it into review, but the core functionality is complete. I wanted to share some numbers around memory consumption of the two approaches in some different configurations. This data was not collected with a particularly rigorous methodology and should be taken with a decent sized grain of salt, but I think it at least gives us a rough idea of the scale and scaling factors we're looking at. This was collected simply by making a unit test that builds Metric events with randomly generated strings for the keys and values, and then loops forever after sending them all through the transform. While it is looping at the end, I just checked the OSX Activity Monitor to see memory consumption of the entire vector process.

Every test ran with 5000 distinct values input for each key. Each key is 15 chars, each value is 100 chars.
bloom_low_rate means false_positive_rate of 0.00001.
bloom_high_rate means false_positive_rate of 0.01.

num_keys  value_limit  hashset   bloom_high_rate  bloom_low_rate
100       100          4.0 MB    2.6 MB           2.6 MB
100       1000         18.9 MB   2.7 MB           2.9 MB
1000      1000         163.5 MB  4.0 MB           5.7 MB
1000      100          16.9 MB   2.9 MB           3.1 MB
