Ripgrep: Filtering on the presence or absence of captures

Created on 17 Sep 2017 · 5 Comments · Source: BurntSushi/ripgrep

💡 [This trick] relies on your ability to inspect Group 1 captures (at least in the generic flavor), so it will not work in a non-programming environment, such as a text editor's search-and-replace function or a grep command -- The Best Regex Trick

TL;DR

Select all lines which match \bTarzan\b but not "Tarzan":

$ rg -w '"Tarzan"|(Tarzan)' --defined '$1'

AKA

$ rg -w '"Tarzan"|(Tarzan)' -d '$1'

Suppose I want to select all lines which contain the unquoted word Tarzan i.e. \bTarzan\b but not "Tarzan" e.g. the first 4 lines of:

test.txt

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
This line doesn't mention him
He's moved to Tarzania
He's no "Tarzan"!

It can be done with a pipeline e.g.:

$ rg -w 'Tarzan' test.txt | rg -v '"Tarzan"'

But that particular example rejects lines which contain both, which is not what we want in this case. The same would be true if ripgrep added e.g. an -E (--no-regexp) option to complement -e/--regexp:

$ rg -we 'Tarzan' -E '"Tarzan"' test.txt
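To make the problem concrete, here is a minimal sketch of that two-pass filter in JavaScript. The `lines` array mirrors test.txt above, and a `\b`-anchored regex plus a substring test stand in for the two rg invocations, so this only approximates ripgrep's matching:

```javascript
// Contents of test.txt from above.
const lines = [
    '"Tarzan and Jane"',
    '"Jane and Tarzan"',
    'Me Tarzan, you Jane',
    'Tarzan vs "Tarzan"',
    "This line doesn't mention him",
    "He's moved to Tarzania",
    'He\'s no "Tarzan"!',
];

// Pass 1 approximates `rg -w 'Tarzan'`; pass 2 approximates `rg -v '"Tarzan"'`.
const twoPass = lines
    .filter(line => /\bTarzan\b/.test(line))
    .filter(line => !line.includes('"Tarzan"'));

console.log(twoPass);
// [ '"Tarzan and Jane"', '"Jane and Tarzan"', 'Me Tarzan, you Jane' ]
```

The fourth line, Tarzan vs "Tarzan", is lost because the second pass rejects any line containing the quoted form, even though the line also contains an unquoted match.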

It can be done in one pass with PCRE-flavored greps such as GNU grep and ack, with varying degrees of difficulty/unreadability, by using negative lookahead/look-behind assertions e.g.:

$ grep -P '^(?:(?!"Tarzan"|Tarzan\w+)(Tarzan|.))+$' test.txt

That's already pretty gnarly for a single exclusion, and quickly becomes impractical/incomprehensible for multiple exclusions. It also matches lines which don't contain Tarzan and, again, excludes lines which contain both patterns.
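Both of those side effects can be checked with a short sketch. JavaScript regexes support the same negative lookahead syntax, so the pattern is copied unchanged from the grep -P command; the `lines` array is assumed to mirror test.txt:

```javascript
// Contents of test.txt from above.
const lines = [
    '"Tarzan and Jane"',
    '"Jane and Tarzan"',
    'Me Tarzan, you Jane',
    'Tarzan vs "Tarzan"',
    "This line doesn't mention him",
    "He's moved to Tarzania",
    'He\'s no "Tarzan"!',
];

// Same pattern as the grep -P example: every position on the line must not
// start with "Tarzan" (quoted) or with Tarzan followed by more word characters.
const lookahead = /^(?:(?!"Tarzan"|Tarzan\w+)(Tarzan|.))+$/;

const selected = lines.filter(line => lookahead.test(line));

console.log(selected);
// [ '"Tarzan and Jane"', '"Jane and Tarzan"', 'Me Tarzan, you Jane',
//   "This line doesn't mention him" ]
```

The line that never mentions Tarzan is selected, and Tarzan vs "Tarzan" is rejected.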

In programming languages, there's a common pattern for performing exclusions in a simple, readable way without multiple passes:

1) match and discard the exclusions
2) match and capture the inclusion
3) test for its existence

e.g.:

JavaScript

[ '', '"Tarzan"', 'Tarzan', 'Tarzania', 'Tarzan vs "Tarzan"' ].filter(it => {
    const m = it.match(/"Tarzan"|\b(Tarzan)\b/)
    return m && m[1]
}) // [ 'Tarzan', 'Tarzan vs "Tarzan"' ]

ES.next

[ '', '"Tarzan"', 'Tarzan', 'Tarzania', 'Tarzan vs "Tarzan"' ].filter(it => {
    return it.match(/"Tarzan"|\b(Tarzan)\b/)?.[1]
}) // [ 'Tarzan', 'Tarzan vs "Tarzan"' ]

Ruby

[ '', '"Tarzan"', 'Tarzan', 'Tarzania', 'Tarzan vs "Tarzan"' ].select { |it|
    it[/"Tarzan"|\b(Tarzan)\b/, 1]
} # => [ 'Tarzan', 'Tarzan vs "Tarzan"' ]

$ ruby -ne 'print if $_[/"Tarzan"|\b(Tarzan)\b/, 1]' test.txt

etc.

This isn't available in any greps I'm aware of, but since the machinery is already there to capture and reference subexpressions by index and name, it seems like a small step to use them in predicates to reproduce the flexibility and simplicity of this pattern on the command line e.g.:

$ rg -w '"Tarzan"|(Tarzan)' -d '$1' test.txt

output

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"

Notes

1) I assume that the predicate can be inverted e.g.:

$ rg --not-defined '$1'

AKA

$ rg -D '$1'

There aren't many single-letter options left. The last remaining pairs are -d/-D, -y/-Y and -z/-Z. The latter are commonly used to denote null/zero values, so they could be used instead, with the meaning of -d and -D inverted e.g.:

$ rg -z '$1' # AKA rg --not-defined '$1'
$ rg -Z '$1' # AKA rg --defined '$1'

2) I assume that indices increment across multiple patterns, and that multiple -d and -D options can be combined e.g.:

$ rg -e 'Foo|(Bar)' -e '(Baz|(Quux))' -d '$1' -D '$3'
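A rough sketch of the numbering I have in mind, written in JavaScript since none of this exists in ripgrep. Joining the -e patterns into one alternation is an assumption about how the indices could be made to increment, and `definedGroups` and `keep` are hypothetical helpers, as are the sample lines:

```javascript
// Assumption: the -e patterns are joined into a single alternation, so capture
// groups are numbered left to right across all of them.
const patterns = ['Foo|(Bar)', '(Baz|(Quux))'];
const combined = new RegExp(patterns.map(p => `(?:${p})`).join('|'), 'g');
// Group 1 = (Bar), group 2 = (Baz|(Quux)), group 3 = (Quux).

// Collect the index of every group that participated in any match on the line.
function definedGroups(line) {
    const seen = new Set();
    for (const m of line.matchAll(combined)) {
        m.forEach((value, index) => {
            if (index > 0 && value !== undefined) seen.add(index);
        });
    }
    return seen;
}

// Assumed semantics of `-d '$1' -D '$3'`: group 1 must be defined somewhere on
// the line, and group 3 must not be.
function keep(line, defined, notDefined) {
    const seen = definedGroups(line);
    return defined.every(i => seen.has(i)) && notDefined.every(i => !seen.has(i));
}

const sample = ['Foo', 'Bar', 'Bar Baz', 'Bar Quux', 'Quux'];
const kept = sample.filter(line => keep(line, [1], [3]));

console.log(kept); // [ 'Bar', 'Bar Baz' ]
```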

3) I also assume that numbered and named captures can be mixed e.g.:

$ rg -e 'Foo|(Bar)' -e 'Baz|(?P<name>Quux)' -d '$1' -D '$name'
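Mixing the two kinds of reference could look something like this sketch. `resolve` and `keep` are hypothetical helpers, the two patterns are pre-joined by hand, and the named group uses JavaScript's `(?<name>...)` spelling rather than `(?P<name>...)`:

```javascript
// Hypothetical helper: resolve a reference such as '$1' or '$name' against a match.
function resolve(match, ref) {
    const key = ref.slice(1); // drop the leading '$'
    return /^\d+$/.test(key) ? match[Number(key)] : match.groups?.[key];
}

// The two -e patterns joined by hand, with the named group in JavaScript syntax.
const combined = /(?:Foo|(Bar))|(?:Baz|(?<name>Quux))/g;

// Assumed semantics: every -d reference must be defined in some match on the
// line, and no -D reference may be defined in any match.
function keep(line, defined, notDefined) {
    const matches = [...line.matchAll(combined)];
    const isDefined = ref => matches.some(m => resolve(m, ref) !== undefined);
    return defined.every(isDefined) && !notDefined.some(isDefined);
}

const sample = ['Foo', 'Bar', 'Bar Baz', 'Bar Quux'];
const kept = sample.filter(line => keep(line, ['$1'], ['$name']));

console.log(kept); // [ 'Bar', 'Bar Baz' ]
```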

4) The full version of the matching command would currently be:

rg '^.*?(?:"Tarzan"|\b(Tarzan)\b).*$' -d '$1' test.txt

Hopefully some of that boilerplate can be removed e.g. via #389 or #593.

5) For clarity, "Tarzan" vs Tarzan is omitted from the examples. Handling it only slightly complicates the regex:

$ rg '^(?:"Tarzan"|\b(Tarzan)\b|.)*$' -d '$1' test.txt
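For comparison, a sketch of the same check in JavaScript. Porting the regex above directly would not work there, because JavaScript resets the captures inside a quantified group on every iteration, so the sketch instead scans all matches on the line with `matchAll`. The `lines` array is test.txt plus the extra "Tarzan" vs Tarzan line:

```javascript
// test.txt from above, plus the line omitted from the earlier examples.
const lines = [
    '"Tarzan and Jane"',
    '"Jane and Tarzan"',
    'Me Tarzan, you Jane',
    'Tarzan vs "Tarzan"',
    '"Tarzan" vs Tarzan',
    "This line doesn't mention him",
    "He's moved to Tarzania",
    'He\'s no "Tarzan"!',
];

// Match and discard the quoted form, match and capture the unquoted word.
const pattern = /"Tarzan"|\b(Tarzan)\b/g;

// Keep a line if group 1 participated in at least one of its matches.
const selected = lines.filter(line =>
    [...line.matchAll(pattern)].some(m => m[1] !== undefined)
);

console.log(selected);
// [ '"Tarzan and Jane"', '"Jane and Tarzan"', 'Me Tarzan, you Jane',
//   'Tarzan vs "Tarzan"', '"Tarzan" vs Tarzan' ]
```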

Labels: enhancement, help wanted, icebox

chocolateboy

👍4

Comments

Thanks for this very thorough write up!

I kind of feel like the semantics of this are too complex, which will probably lead to a feature that almost nobody uses. By that, I don't mean that the flags --not-defined and --defined are themselves complex, but using them effectively (as you've demonstrated here) requires some ingenuity in crafting the regex.

With that said, I'd be willing to adopt a feature like this because I do agree that it could be useful, but I'd have to strongly insist on the following:

1) It should not begin life with short flags. I used short flags whenever the flags are common, or if there was a precedent for their existence in other tools. For a feature like this, that is neither common nor familiar, I would like to hold off on adding short flags. If I'm wrong and it becomes popular, then we can revisit it.
2) The maintenance burden of the feature needs to be low. That means adding the feature shouldn't require any significant complications and it should be reasonably well tested.
3) Since the use case motivating the existence of these flags is somewhat complicated, I would like the documentation to be clear. It should be concise, but contain an example usage. (Perhaps a condensed version of the example in this ticket.)

BurntSushi on 24 Sep 2017

👍4

Now that PCRE is (optionally) supported, can either of you think of a use-case for this that isn't handled by lookahead and lookbehind? I think this would be strictly more powerful than negative lookbehind, since lookbehind can't contain variable-length patterns, but that's the only advantage I can see. (Granted, that's an advantage I think I would occasionally find useful.)

BatmanAoD on 21 Sep 2018

I think it could be possible to define a simpler UX than needing to resort to look-around.

With that said, it's a good point and I was never a big fan of adding this feature anyway. So I'm going to close this.

BurntSushi on 3 Apr 2019

Lookaround assertions still have the issues mentioned above. For anyone looking for a clean solution to this with the PCRE engine, the backtracking-control verbs [1] [2] [3] are your friends:

Input

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
"Tarzan" vs Tarzan
This line doesn't mention him
He's moved to Tarzania
He's no "Tarzan"!

Command

$ rg --pcre2 '(?:"Tarzan")(*SKIP)(*FAIL)|\bTarzan\b' test.txt

Output

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
"Tarzan" vs Tarzan
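JavaScript's RegExp has no backtracking-control verbs, but the capture trick from the original post gives a rough equivalent of what `(*SKIP)(*FAIL)` reports here. In this sketch `unquotedOffsets` is a hypothetical helper that returns the start offsets of the unquoted matches only, i.e. the spans I would expect to see highlighted:

```javascript
// Quoted occurrences are matched and discarded; unquoted ones land in group 1.
const pattern = /"Tarzan"|\b(Tarzan)\b/g;

// Hypothetical helper: start offsets of the matches where group 1 participated.
function unquotedOffsets(line) {
    return [...line.matchAll(pattern)]
        .filter(m => m[1] !== undefined)
        .map(m => m.index);
}

console.log(unquotedOffsets('"Tarzan" vs Tarzan')); // [ 12 ]
console.log(unquotedOffsets('Tarzan vs "Tarzan"')); // [ 0 ]
console.log(unquotedOffsets('He\'s no "Tarzan"!')); // []
```

A line is selected when the returned array is non-empty.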

Or, to exclude lines which contain "Tarzan":

Command

$ rg --pcre2 '(?:.*?"Tarzan".*)(*SKIP)(*FAIL)|\bTarzan\b' test.txt
$ rg --pcre2 '(?:.*?"Tarzan".*)(*COMMIT)(*FAIL)|\bTarzan\b' test.txt

Output

"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane

[1] https://www.rexegg.com/regex-best-trick.html#pcrevariation
[2] https://www.rexegg.com/backtracking-control-verbs.html
[3] https://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs

chocolateboy on 20 Sep 2020

🚀3

@chocolateboy Wow, I had never heard of those before. Thanks for sharing.

BatmanAoD on 20 Sep 2020

👍1
