rg --version
ripgrep 12.1.1 (rev a6d05475fb)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime
I run a locally compiled version.
Ubuntu 20.04.
Searching for an e-mail pattern is faster using grep than rg. I experience this in my local Maildir and in the Linux kernels linux/fs/ subfolder.
I use the Linux kernels linux/fs (rev 85a2c56cb4454c73f56d3099d96942e7919b292f) subfolder as an example for a folder we both have access too. In practice I would like to run the commands in my Maildir.
cd /path/to/linux/fs
pattern="[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
# some grep warm up runs
time grep -rhioE "$pattern" | len
0
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} -rhioE 0,02s user 0,01s system 89% cpu 0,025 total
wc -l 0,00s user 0,00s system 4% cpu 0,024 total
# some ripgrep warm up runs
time rg -ioI "$pattern" | len
0
/tmp/rg -ioI "$pattern" 0,34s user 0,01s system 252% cpu 0,137 total
wc -l 0,00s user 0,00s system 1% cpu 0,136 total
cd /path/to/linux/fs
pattern="[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
# some grep warm up runs
time grep -rhioE "$pattern" | len
10
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} -rhioE 0,02s user 0,03s system 98% cpu 0,048 total
wc -l 0,00s user 0,00s system 5% cpu 0,047 total
# some ripgrep warm up runs
time rg -ioI "$pattern" | len
10
/tmp/rg -ioI "$pattern" 0,16s user 0,03s system 294% cpu 0,065 total
wc -l 0,00s user 0,00s system 4% cpu 0,065 total
I would expect ripgrep to be faster than grep.
I cannot seem to reproduce this easily... But read on.
For your second example, I'd consider those times close enough (especially on such a small corpus) that they are effectively the same. But in any case, ripgrep is faster for me:
$ hyperfine 'rg -i "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: rg -i "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
Time (mean ± σ): 19.5 ms ± 1.4 ms [User: 115.1 ms, System: 22.6 ms]
Range (min … max): 16.2 ms … 23.5 ms 115 runs
$ hyperfine 'grep -riE "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: grep -riE "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
Time (mean ± σ): 46.3 ms ± 1.1 ms [User: 25.3 ms, System: 20.7 ms]
Range (min … max): 43.6 ms … 48.7 ms 58 runs
Now, your first example is more interesting. That's a substantial difference. On my machine, ripgrep runs a hair faster, but they're effectively tied:
$ hyperfine --ignore-failure 'rg -i "[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: rg -i "[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
Time (mean ± σ): 40.1 ms ± 1.6 ms [User: 364.8 ms, System: 22.6 ms]
Range (min … max): 36.7 ms … 44.4 ms 65 runs
Warning: Ignoring non-zero exit code.
$ hyperfine --ignore-failure 'grep -riE "[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: grep -riE "[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
Time (mean ± σ): 46.8 ms ± 1.4 ms [User: 24.0 ms, System: 22.3 ms]
Range (min … max): 43.9 ms … 50.7 ms 59 runs
Warning: Ignoring non-zero exit code.
Do you perhaps have a ripgrep config file that is running ripgrep single threaded? If I force ripgrep into a single thread, then it gets closer to your reproduction:
$ hyperfine 'rg -i -j1 "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: rg -i -j1 "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
Time (mean ± σ): 91.7 ms ± 5.4 ms [User: 71.1 ms, System: 20.3 ms]
Range (min … max): 75.7 ms … 98.1 ms 29 runs
And if I have ripgrep search the same set of files as grep, then a real difference can be observed:
$ hyperfine 'rg -i -j1 -uuu "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: rg -i -j1 -uuu "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
Time (mean ± σ): 151.0 ms ± 10.4 ms [User: 129.2 ms, System: 21.3 ms]
Range (min … max): 140.5 ms … 168.4 ms 17 runs
OK, let's see if this is a regex engine problem or a directory traversal problem. Easiest way to do that is to see if we can reproduce the performance difference on a single file. So I've just concatenated all of the files in linux/fs into a single file. And to make differences more pronounced (and eliminate the effects of noise), I've made the file 10x bigger:
$ find ./ -type f -print0 | xargs -0 cat > /tmp/linux-fs-all
$ for ((i=0;i<10;i++)); do cat /tmp/linux-fs-all; done > /tmp/linux-fs-all.10x
The first thing to note is that there is some binary data:
$ grep -iE "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /tmp/linux-fs-all.10x
grep: /tmp/linux-fs-all.10x: binary file matches
So let's use the -a/--text flag and see if we can re-create the performance difference on a single file:
$ time grep -aciE "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /tmp/linux-fs-all.10x
100
real 0.295
user 0.199
sys 0.096
maxmem 7 MB
faults 0
$ time rg -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /tmp/linux-fs-all.10x
100
real 1.249
user 1.231
sys 0.017
maxmem 580 MB
faults 0
Great, got it. Running it under perf shows the total time split between Teddy (a fast literal prefilter) and the regex engine. My first thought is whether this is a case where ripgrep's more aggressive literal optimizations result in it being slower than grep. To figure that out, we need to find 1) which literals ripgrep is searching for and 2) whether they have a lot of matches or not. We can do (1) by running ripgrep with the --trace flag:
$ rg -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /tmp/linux-fs-all.10x --trace
...
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:109: required literals found: [Cut(VID), Cut(vID), Cut(ViD), Cut(viD), Cut(VId), Cut(vId), Cut(Vid), Cut(vid), Cut(@), Cut(.)]
DEBUG|grep_regex::matcher|crates/regex/src/matcher.rs:50: extracted fast line regex: (?-u:VID|vID|ViD|viD|VId|vId|Vid|vid|@|\x2e)
...
Yeah... That doesn't look so good. Let's see how often the "fast line regex" matches:
$ time rg -c '(?-u:VID|vID|ViD|viD|VId|vId|Vid|vid|@|\x2e)' /tmp/linux-fs-all.10x
4489830
real 0.722
user 0.701
sys 0.020
maxmem 580 MB
faults 0
Yikes. That is almost certainly the problem. The killers are the @ and . literals that were extracted, because they make up the vast majority of all false positives. (0x2E is ASCII for ..) If we remove them from the literal search, the number of matches drops precipitously:
$ time rg -c '(?-u:VID|vID|ViD|viD|VId|vId|Vid|vid)' /tmp/linux-fs-all.10x
24320
real 0.157
user 0.139
sys 0.017
maxmem 580 MB
faults 0
For grins, observe how ripgrep is actually faster than GNU grep on these specific queries:
$ time grep -cE '(?-u:VID|vID|ViD|viD|VId|vId|Vid|vid|@|\.)' /tmp/linux-fs-all.10x
4773860
real 1.098
user 0.990
sys 0.107
maxmem 7 MB
faults 0
$ time grep -cE '(?-u:VID|vID|ViD|viD|VId|vId|Vid|vid)' /tmp/linux-fs-all.10x
22320
real 0.963
user 0.895
sys 0.067
maxmem 7 MB
faults 0
The problem here is that ripgrep's heuristic optimization is sub-optimal in this case. These sorts of things are difficult to fix, because fixing it for one case might actually result in making other cases slower. One possible fix would be to just remove @ and . from the set of literals we search. And indeed this is a correct transformation in this case. And removing things like that is probably a smart thing to do anyway, since they are very common. Well, at least . is. @ is common in some corpora and not others. We can see what performance _could_ be like by replacing @ and . with \S to defeat the literal detection:
$ time rg -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*\S[a-z]+\S[a-z]{2,}" /tmp/linux-fs-all.10x
10760
real 0.144
user 0.100
sys 0.043
maxmem 581 MB
faults 0
Obviously it doesn't produce the same results, but if you put it together into a pipeline...
$ time rg -ai "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*\S[a-z]+\S[a-z]{2,}" /tmp/linux-fs-all.10x | rg -aci '[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}'
100
real 0.168
user 0.134
sys 0.034
maxmem 579 MB
faults 0
Unfortunately, while removing . and @ from the literal detection is conceptually simple, the code for extracting literals is not really flexible enough to do this easily. I've been meaning to rewrite it for years, but it's pretty thorny.
Anyway, good find and thank you for the great report!
Most helpful comment
I cannot seem to reproduce this easily... But read on.
For your second example, I'd consider those times close enough (especially on such a small corpus) that they are effectively the same. But in any case, ripgrep is faster for me:
Now, your first example is more interesting. That's a substantial difference. On my machine, ripgrep runs a hair faster, but they're effectively tied:
Do you perhaps have a ripgrep config file that is running ripgrep single threaded? If I force ripgrep into a single thread, then it gets closer to your reproduction:
And if I have ripgrep search the same set of files as grep, then a real difference can be observed:
OK, let's see if this is a regex engine problem or a directory traversal problem. Easiest way to do that is to see if we can reproduce the performance difference on a single file. So I've just concatenated all of the files in
linux/fsinto a single file. And to make differences more pronounced (and eliminate the effects of noise), I've made the file 10x bigger:The first thing to note is that there is some binary data:
So let's use the
-a/--textflag and see if we can re-create the performance difference on a single file:Great, got it. Running it under
perfshows the total time split between Teddy (a fast literal prefilter) and the regex engine. My first thought is whether this is a case where ripgrep's more aggressive literal optimizations result in it being slower than grep. To figure that out, we need to find 1) which literals ripgrep is searching for and 2) whether they have a lot of matches or not. We can do (1) by running ripgrep with the--traceflag:Yeah... That doesn't look so good. Let's see how often the "fast line regex" matches:
Yikes. That is almost certainly the problem. The killers are the
@and.literals that were extracted, because they make up the vast majority of all false positives. (0x2Eis ASCII for..) If we remove them from the literal search, the number of matches drops precipitously:For grins, observe how ripgrep is actually faster than GNU grep on these specific queries:
The problem here is that ripgrep's heuristic optimization is sub-optimal in this case. These sorts of things are difficult to fix, because fixing it for one case might actually result in making other cases slower. One possible fix would be to just remove
@and.from the set of literals we search. And indeed this is a correct transformation in this case. And removing things like that is probably a smart thing to do anyway, since they are very common. Well, at least.is.@is common in some corpora and not others. We can see what performance _could_ be like by replacing@and.with\Sto defeat the literal detection:Obviously it doesn't produce the same results, but if you put it together into a pipeline...
Unfortunately, while removing
.and@from the literal detection is conceptually simple, the code for extracting literals is not really flexible enough to do this easily. I've been meaning to rewrite it for years, but it's pretty thorny.Anyway, good find and thank you for the great report!