I noticed that the docs say that regex is a more expensive operation than the other conditions:
> Note that this condition is considerably more expensive than a regular string match (such as starts_with or contains) so the use of those conditions are preferred where possible.

https://vector.dev/docs/reference/transforms/filter/#field_nameregex
This is true for matching against a single string. However, with #2745 you can now specify a list of strings to match against. I was curious about how the performance difference changes with the number of matching strings, so I wrote a little test.
Results:
Number of matching strings | Regex | Ends with
------------------ | ------ | ----------
250 | 2.1s | 1m 45s
100 | 2.1s | 38s
10 | 1.9s | 5.1s
5 | 2.2s | 3.4s
1 | 2.4s | 1.8s
In short, if you have a large list of matching strings, it's much better to use regex. My guess is that the regex crate compiles the pattern into a DFA, so it only has to scan the input once no matter how many alternatives there are, which would explain why the time doesn't increase with the number of matching strings. For a list of strings (with conditions like ends_with, starts_with, etc.), on the other hand, we iterate over the list and check each string in turn.
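To make the two strategies concrete, here is a minimal sketch. It is in Python rather than Rust, the function names are hypothetical, and Python's `re` module is a backtracking engine rather than the automaton-based regex crate, so it only illustrates that the two formulations accept the same inputs. It says nothing about the timings in the table.

```python
import re

def ends_with_any(value, suffixes):
    """Check each suffix in turn, so the work grows with len(suffixes)."""
    return any(value.endswith(s) for s in suffixes)

def compile_suffix_regex(suffixes):
    """Fold all suffixes into one pattern so a single search covers them all.

    Assumes a non-empty list of suffixes.
    """
    alternation = "|".join(re.escape(s) for s in suffixes)
    # \Z anchors the match at the very end of the string.
    return re.compile(r"(?:" + alternation + r")\Z")

suffixes = [".log", ".txt", ".gz"]
pattern = compile_suffix_regex(suffixes)

for value in ["app.log", "archive.tar.gz", "notes.md"]:
    # Both formulations should agree on every input.
    assert ends_with_any(value, suffixes) == bool(pattern.search(value))
```

The pattern is compiled once and reused, which is the part that matters for a filter applied to every event.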
I'm not saying regex should be the go-to, though. If you're unfamiliar with regex you can end up with terrible performance (evil regex). Plus, for the case of a single string, ends_with was faster.
Not sure how to sum this up in a useful way for the docs, maybe something like:
> Note that this condition is considerably more expensive than a regular string match (such as starts_with or contains) so the use of those conditions are preferred where possible. There are exceptions where regex may be faster, such as when matching against a list of strings.
That is impressive behavior for the regex engine. I agree it is worth improving the documentation for this. Thanks for the suggestion!
> If unfamiliar with regex you can end up with terrible performance (evil regex).
Our regex engine actually guarantees worst-case linear runtime, so we don't have to worry as much about cases like this. You can read some background on the approach here.
It's also quite intelligent about optimizing literals, so I'd expect it to perform very well in many of these situations.
I was curious about these benchmarks, so I set up one of my own using criterion, and made it more representative of the actual usage in vector. In particular:
The results are more interesting. First, matching on the start or end of a string was never as fast with regex as with plain repeated string matching. There are signs of regex catching up, but it would take many times more than 8 strings to make up the difference. On the other hand, using regex for contains was always faster than string matching, and even improved in speed for more than one string.
These results suggest we should actually consider converting the contains matches to a regex internally as an optimization.
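A rough sketch of what such a conversion could look like, again in Python with hypothetical function names and not taken from Vector's code. The point it illustrates is that each literal has to be escaped before the literals are joined, otherwise characters like `.` would change what the condition matches.

```python
import re

def contains_any(value, needles):
    """Baseline: one substring scan of the value per needle."""
    return any(n in value for n in needles)

def needles_to_regex(needles):
    """Hypothetical conversion: escape each literal, then join with '|'.

    Assumes a non-empty list of needles.
    """
    return re.compile("|".join(re.escape(n) for n in needles))

needles = ["error", "a.b", "timeout"]
pattern = needles_to_regex(needles)

for value in ["disk error on sda", "axb", "path a.b found", "all good"]:
    # The unanchored regex must agree with the plain substring checks.
    assert contains_any(value, needles) == bool(pattern.search(value))
```

In Rust, the regex crate provides `regex::escape` for the same escaping step.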
starts/string/1 time: [42.783 us 43.008 us 43.207 us]
starts/regex/1 time: [130.20 us 130.84 us 131.67 us]
starts/string/2 time: [64.391 us 64.615 us 64.901 us]
starts/regex/2 time: [205.37 us 207.17 us 209.20 us]
starts/string/4 time: [100.13 us 100.24 us 100.34 us]
starts/regex/4 time: [298.95 us 299.23 us 299.52 us]
starts/string/8 time: [186.78 us 187.29 us 187.94 us]
starts/regex/8 time: [412.41 us 413.25 us 414.34 us]
contains/string/1 time: [1.0500 ms 1.0516 ms 1.0531 ms]
contains/regex/1 time: [676.24 us 676.72 us 677.26 us]
contains/string/2 time: [2.3060 ms 2.3079 ms 2.3099 ms]
contains/regex/2 time: [493.41 us 493.72 us 494.03 us]
contains/string/4 time: [4.6768 ms 4.6815 ms 4.6865 ms]
contains/regex/4 time: [504.56 us 505.98 us 507.67 us]
contains/string/8 time: [9.8432 ms 9.8644 ms 9.8851 ms]
contains/regex/8 time: [679.87 us 680.92 us 682.11 us]
ends/string/1 time: [40.508 us 40.700 us 40.966 us]
ends/regex/1 time: [122.45 us 123.21 us 124.33 us]
ends/string/2 time: [63.540 us 63.647 us 63.777 us]
ends/regex/2 time: [171.83 us 172.54 us 173.54 us]
ends/string/4 time: [112.73 us 112.85 us 112.98 us]
ends/regex/4 time: [246.38 us 246.97 us 247.65 us]
ends/string/8 time: [211.92 us 212.25 us 212.60 us]
ends/regex/8 time: [388.29 us 389.65 us 391.32 us]
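To make the comparison easier to read, the short Python sketch below divides the string time by the regex time for each case. It uses the middle value of each criterion interval above, converted to microseconds; the dictionary and function names are just for illustration.

```python
# Middle value of each criterion interval, in microseconds: (string, regex).
estimates_us = {
    "starts": {1: (43.008, 130.84), 2: (64.615, 207.17),
               4: (100.24, 299.23), 8: (187.29, 413.25)},
    "contains": {1: (1051.6, 676.72), 2: (2307.9, 493.72),
                 4: (4681.5, 505.98), 8: (9864.4, 680.92)},
    "ends": {1: (40.700, 123.21), 2: (63.647, 172.54),
             4: (112.85, 246.97), 8: (212.25, 389.65)},
}

def regex_speedup(kind, count):
    """Return string time / regex time; above 1 means regex is faster."""
    string_us, regex_us = estimates_us[kind][count]
    return string_us / regex_us

for kind, rows in estimates_us.items():
    for count in rows:
        ratio = regex_speedup(kind, count)
        print(f"{kind:8} n={count}: regex speedup {ratio:.2f}x")
```

On these numbers the ratio stays below 1 for every starts and ends case and above 1 for every contains case, reaching roughly 14x for contains with 8 strings.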
Just adding a note to clarify the work here: