Vector: Docs: regex performance vs ends_with

Created on 16 Jun 2020 · 4 Comments · Source: timberio/vector

I noticed that the docs describe regex as a more expensive operation than the other conditions:

Note that this condition is considerably more expensive than a regular string match (such as starts_with or contains) so the use of those conditions are preferred where possible.

https://vector.dev/docs/reference/transforms/filter/#field_nameregex

This is true for matching against a single string. However, with #2745 you can now specify a list of strings to match against. I was curious how the performance difference changes with the number of matching strings, so I wrote a little test.

regex_test.zip

Results:

In each case, I tested 10 million 10-character strings (the equivalent of a field in an Event).
The only difference between runs is the number of matching strings to test against, from 1 to 250 (the equivalent of the list of strings provided in the config.toml file).

Number of matching strings | Regex | Ends With
-------------------------- | ----- | ---------
250 | 2.1s | 105s (1 min 45s)
100 | 2.1s | 38s
10 | 1.9s | 5.1s
5 | 2.2s | 3.4s
1 | 2.4s | 1.8s

Basically, if you have a large list of matching strings, it's much better to use regex. My guess is that the regex crate compiles the pattern into a DFA, so it only has to evaluate each input once, which would explain why the time doesn't increase with the number of matching strings. By contrast, for a list of strings (for conditions like ends_with, starts_with, etc.) we iterate over the list and check each entry in turn.
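The difference between the two approaches can be sketched roughly as follows. This is a minimal illustration, not Vector's actual code: `ends_with_any`, `escape_literal`, and `suffix_pattern` are hypothetical names, and the sample suffixes are made up. The first function does one check per list entry; the last one builds a single alternation pattern that a regex engine could compile once. To stay dependency-free the sketch only builds the pattern string; real code would presumably use the regex crate's own `escape` function instead of a hand-rolled one.

```rust
/// Linear scan: one `ends_with` check per candidate suffix, so the cost per
/// field grows with the length of the list.
fn ends_with_any(field: &str, suffixes: &[&str]) -> bool {
    suffixes.iter().any(|s| field.ends_with(*s))
}

/// Hypothetical helper: backslash-escape common regex metacharacters so a
/// literal string can be embedded in a pattern.
fn escape_literal(literal: &str) -> String {
    let mut out = String::with_capacity(literal.len());
    for c in literal.chars() {
        if "\\.+*?()|[]{}^$".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Build one pattern meaning "ends with any of these suffixes".
/// A regex engine can compile this once and then scan each field a single time.
fn suffix_pattern(suffixes: &[&str]) -> String {
    let alternatives: Vec<String> = suffixes.iter().map(|s| escape_literal(s)).collect();
    format!("(?:{})$", alternatives.join("|"))
}

fn main() {
    // Made-up suffixes, standing in for the list from the config file.
    let suffixes = [".log", ".txt", ".gz"];
    println!("linear scan matched: {}", ends_with_any("access.log", &suffixes));
    // Prints the pattern (?:\.log|\.txt|\.gz)$
    println!("single pattern: {}", suffix_pattern(&suffixes));
}
```

With the linear scan, 250 suffixes means up to 250 comparisons per field, which is consistent with the Ends With column growing with the list size.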

Not that I'm saying regex should be the go-to. If unfamiliar with regex you can end up with terrible performance (evil regex). Plus, for the case of a single string, ends_with was faster.

Not sure how to sum this up in a useful way for the docs, maybe something like:

Note that this condition is considerably more expensive than a regular string match (such as starts_with or contains) so the use of those conditions are preferred where possible. There are exceptions where regex may be faster, such as when matching against a list of strings.

external docs performance regex_parser enhancement help

bill-bateman

Most helpful comment

I was curious about these benchmarks, so I set up one of my own using criterion, and made it more representative of the actual usage in Vector. In particular:

- Vector does not do case-insensitive searches, so all of the case mapping is extra work.
- Search terms are highly likely to vary in length, so the benchmark uses terms of different lengths.
- Regex compilation is a fixed cost, and can be quite expensive, so it is not counted.
- It is rather unlikely to have more than a handful of search terms per condition. Even 10 is probably unrepresentative.
- Logs are not random data, and real search terms are likely to actually occur in them, unlike random-vs-random comparisons, where matches are highly unlikely. So I imported some arbitrary Apache logs and used terms that are found at the start, middle, and end of the lines.
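As a rough illustration of that setup, here is a minimal timing sketch. It uses `std::time::Instant` rather than criterion, so it is much cruder than the real benchmark, and the `time_matches` helper, the log lines, and the search terms are all made up for the example. The point is only the shape: the matcher is built before timing starts, the terms vary in length, and they actually occur in the data.

```rust
use std::time::{Duration, Instant};

/// Run `matcher` over every line and return (matching lines, elapsed time).
/// Anything expensive to build, such as a compiled regex, should be created
/// before calling this so that setup cost is not part of the measurement.
fn time_matches<F: Fn(&str) -> bool>(lines: &[&str], matcher: F) -> (usize, Duration) {
    let start = Instant::now();
    let mut hits = 0;
    for &line in lines {
        if matcher(line) {
            hits += 1;
        }
    }
    (hits, start.elapsed())
}

fn main() {
    // Made-up lines in an Apache-like format; not the data used in the thread.
    let lines = [
        r#"127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326"#,
        r#"127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "POST /login HTTP/1.1" 302 512"#,
        r#"10.0.0.7 - - [10/Oct/2000:13:56:01 -0700] "GET /missing HTTP/1.1" 404 209"#,
    ];
    // Search terms of varying length that do occur in the data.
    let terms = ["GET", "POST /login", "404"];

    for &n in [1usize, 2, 3].iter() {
        let active = &terms[..n];
        // Plain repeated string matching: one `contains` scan per term.
        let (hits, elapsed) =
            time_matches(&lines, |line| active.iter().any(|t| line.contains(*t)));
        println!("contains/string/{}: {} hits in {:?}", n, hits, elapsed);
    }
}
```

A single pass over three lines is far too small to time meaningfully; a real run would repeat the loop many times over a large log sample, which is what criterion automates.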

The results are more interesting. First, matching on the start or the end of the string was never as fast with regex as with plain repeated string matching. There are signs of regex catching up, but it would have taken many times more than 8 search terms to make up the difference. On the other hand, using regex for contains was always faster than string matching, and even got faster with more than one search term.

These results suggest we should actually consider converting the contains matches to a regex internally as an optimization.

starts/string/1         time:   [42.783 us 43.008 us 43.207 us]                             
starts/regex/1          time:   [130.20 us 130.84 us 131.67 us]                           
starts/string/2         time:   [64.391 us 64.615 us 64.901 us]                            
starts/regex/2          time:   [205.37 us 207.17 us 209.20 us]                           
starts/string/4         time:   [100.13 us 100.24 us 100.34 us]                            
starts/regex/4          time:   [298.95 us 299.23 us 299.52 us]                           
starts/string/8         time:   [186.78 us 187.29 us 187.94 us]                            
starts/regex/8          time:   [412.41 us 413.25 us 414.34 us]                           

contains/string/1       time:   [1.0500 ms 1.0516 ms 1.0531 ms]                               
contains/regex/1        time:   [676.24 us 676.72 us 677.26 us]                             
contains/string/2       time:   [2.3060 ms 2.3079 ms 2.3099 ms]                               
contains/regex/2        time:   [493.41 us 493.72 us 494.03 us]                             
contains/string/4       time:   [4.6768 ms 4.6815 ms 4.6865 ms]                               
contains/regex/4        time:   [504.56 us 505.98 us 507.67 us]                             
contains/string/8       time:   [9.8432 ms 9.8644 ms 9.8851 ms]                               
contains/regex/8        time:   [679.87 us 680.92 us 682.11 us]                             

ends/string/1           time:   [40.508 us 40.700 us 40.966 us]                           
ends/regex/1            time:   [122.45 us 123.21 us 124.33 us]                         
ends/string/2           time:   [63.540 us 63.647 us 63.777 us]                          
ends/regex/2            time:   [171.83 us 172.54 us 173.54 us]                         
ends/string/4           time:   [112.73 us 112.85 us 112.98 us]                          
ends/regex/4            time:   [246.38 us 246.97 us 247.65 us]                         
ends/string/8           time:   [211.92 us 212.25 us 212.60 us]                          
ends/regex/8            time:   [388.29 us 389.65 us 391.32 us]
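
To make the comparison easier to read, the ratios can be computed directly from the output above. The sketch below copies the middle value of each criterion triple (converted to microseconds) and divides the regex time by the string time; the names `ESTIMATES_US` and `regex_to_string_ratio` are invented for the example.

```rust
/// Middle value of each criterion triple above, in microseconds.
/// Each row: (benchmark, number of search terms, string time, regex time).
const ESTIMATES_US: [(&str, u32, f64, f64); 12] = [
    ("starts", 1, 43.008, 130.84),
    ("starts", 2, 64.615, 207.17),
    ("starts", 4, 100.24, 299.23),
    ("starts", 8, 187.29, 413.25),
    ("contains", 1, 1051.6, 676.72),
    ("contains", 2, 2307.9, 493.72),
    ("contains", 4, 4681.5, 505.98),
    ("contains", 8, 9864.4, 680.92),
    ("ends", 1, 40.700, 123.21),
    ("ends", 2, 63.647, 172.54),
    ("ends", 4, 112.85, 246.97),
    ("ends", 8, 212.25, 389.65),
];

/// A ratio above 1 means regex was slower than plain string matching.
fn regex_to_string_ratio(string_us: f64, regex_us: f64) -> f64 {
    regex_us / string_us
}

fn main() {
    for &(name, terms, string_us, regex_us) in ESTIMATES_US.iter() {
        let ratio = regex_to_string_ratio(string_us, regex_us);
        println!("{}/{}: regex/string = {:.2}", name, terms, ratio);
    }
}
```

For starts, regex goes from roughly 3x slower with one term to about 2.2x slower with eight; for contains with eight terms, regex is about 14x faster than repeated string matching.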

bruceg on 25 Jun 2020

👍2

All 4 comments

That is impressive behavior for the regex engine. I agree it is worth improving the documentation for this. Thanks for the suggestion!

bruceg on 18 Jun 2020

If unfamiliar with regex you can end up with terrible performance (evil regex).

Our regex engine actually guarantees worst-case linear runtime, so we don't have to worry as much about cases like this. You can read some background on the approach here.

It's also quite intelligent about optimizing literals, so I'd expect it to perform very well in many of these situations.

lukesteensen on 19 Jun 2020

🎉1

Just adding a note to clarify the work here: