💡 [This trick] relies on your ability to inspect Group 1 captures (at least in the generic flavor), so it will not work in a non-programming environment, such as a text editor's search-and-replace function or a grep command -- The Best Regex Trick
Select all lines which match \bTarzan\b but not "Tarzan":
$ rg -w '"Tarzan"|(Tarzan)' --defined '$1'
AKA
$ rg -w '"Tarzan"|(Tarzan)' -d '$1'
Suppose I want to select all lines which contain the unquoted word Tarzan, i.e. lines which match \bTarzan\b somewhere other than inside "Tarzan", e.g. the first 4 lines of:
"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
This line doesn't mention him
He's moved to Tarzania
He's no "Tarzan"!
It can be done with a pipeline e.g.:
$ rg -w 'Tarzan' test.txt | rg -v '"Tarzan"'
But that particular example rejects lines which contain both, which is not what we want in this case. The same would be true if ripgrep added e.g. an -E (--no-regexp) option to complement -e/--regexp:
$ rg -we 'Tarzan' -E '"Tarzan"' test.txt
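To make the problem with the two-pass approach concrete, here is a minimal JavaScript sketch that mimics the pipeline on the sample lines. The two regexes are only rough stand-ins for `rg -w 'Tarzan'` and `rg -v '"Tarzan"'`, not what ripgrep does internally:

```javascript
// Sample lines copied from the example file above.
const lines = [
  '"Tarzan and Jane"',
  '"Jane and Tarzan"',
  'Me Tarzan, you Jane',
  'Tarzan vs "Tarzan"',
  "This line doesn't mention him",
  "He's moved to Tarzania",
  'He\'s no "Tarzan"!',
]

// Pass 1: keep lines containing the whole word (roughly `rg -w 'Tarzan'`).
const pass1 = lines.filter(line => /\bTarzan\b/.test(line))

// Pass 2: drop lines containing the quoted form (roughly `rg -v '"Tarzan"'`).
const pass2 = pass1.filter(line => !/"Tarzan"/.test(line))

// 'Tarzan vs "Tarzan"' survives pass 1 but is dropped by pass 2,
// even though it contains an unquoted Tarzan.
console.log(pass2)
```

The second pass can only accept or reject a whole line, so it cannot tell that the line it rejects also has a match we wanted.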
It can be done in one pass with PCRE-flavored greps such as GNU grep and ack, with varying degrees of difficulty/unreadability, by using negative lookahead/look-behind assertions e.g.:
$ grep -P '^(?:(?!"Tarzan"|Tarzan\w+)(Tarzan|.))+$' test.txt
That's already pretty gnarly for a single exclusion, and quickly becomes impractical/incomprehensible for multiple exclusions. It also matches lines which don't contain Tarzan and, again, excludes lines which contain both patterns.
In programming languages, there's a common pattern for performing exclusions in a simple, readable way without multiple passes:
e.g.:
[ '', '"Tarzan"', 'Tarzan', 'Tarzania', 'Tarzan vs "Tarzan"' ].filter(it => {
const m = it.match(/"Tarzan"|\b(Tarzan)\b/)
return m && m[1]
}) // [ 'Tarzan', 'Tarzan vs "Tarzan"' ]
[ '', '"Tarzan"', 'Tarzan', 'Tarzania', 'Tarzan vs "Tarzan"' ].filter(it => {
return it.match(/"Tarzan"|\b(Tarzan)\b/)?.[1]
}) // [ 'Tarzan', 'Tarzan vs "Tarzan"' ]
[ '', '"Tarzan"', 'Tarzan', 'Tarzania', 'Tarzan vs "Tarzan"' ].select { |it|
it[/"Tarzan"|\b(Tarzan)\b/, 1]
} # => [ 'Tarzan', 'Tarzan vs "Tarzan"' ]
$ ruby -ne 'print if $_[/"Tarzan"|\b(Tarzan)\b/, 1]' test.txt
etc.
This isn't available in any greps I'm aware of, but since the machinery is already there to capture and reference subexpressions by index and name, it seems like a small step to use them in predicates to reproduce the flexibility and simplicity of this pattern on the command line e.g.:
$ rg -w '"Tarzan"|(Tarzan)' -d '$1' test.txt
"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
1) I assume that the predicate can be inverted e.g.:
$ rg --not-defined '$1'
AKA
$ rg -D '$1'
There aren't many single-letter options left. The last remaining pairs are -d/-D, -y/-Y and -z/-Z. The last pair is commonly used to denote null/zero values, so it could be used instead, with the meaning of -d and -D inverted e.g.:
$ rg -z '$1' # AKA rg --not-defined '$1'
$ rg -Z '$1' # AKA rg --defined '$1'
2) I assume that indices increment across multiple patterns, and that multiple -d and -D options can be combined e.g.:
$ rg -e 'Foo|(Bar)' -e '(Baz|(Quux))' -d '$1' -D '$3'
3) I also assume that numbered and named captures can be mixed e.g.:
$ rg -e 'Foo|(Bar)' -e 'Baz|(?P<name>Quux)' -d '$1' -D '$name'
4) The full version of the matching command would currently be:
$ rg '^.*?(?:"Tarzan"|\b(Tarzan)\b).*$' -d '$1' test.txt
Hopefully some of that boilerplate can be removed e.g. via #389 or #593.
5) For clarity, "Tarzan" vs Tarzan is omitted from the examples. Handling it only slightly complicates the regex:
$ rg '^(?:"Tarzan"|\b(Tarzan)\b|.)*$' -d '$1' test.txt
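For comparison with the JavaScript examples earlier, the repeated-group form in 5) would not carry over to JavaScript as-is, because captures inside a quantified group are reset on each iteration there. A minimal sketch of the same idea scans every match on the line instead of only the first. The helper name `anyMatchDefines` is hypothetical and only for illustration:

```javascript
// Hypothetical helper: true if ANY match of `pattern` on `line` sets capture `group`.
function anyMatchDefines(line, pattern, group = 1) {
  // matchAll requires a global regex, so copy the pattern with the g flag added.
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g'
  const global = new RegExp(pattern.source, flags)

  for (const m of line.matchAll(global)) {
    // The quoted alternative matches without setting the group, so it is skipped.
    if (m[group] !== undefined) return true
  }

  return false
}

const sample = [
  '"Tarzan and Jane"',
  '"Jane and Tarzan"',
  'Me Tarzan, you Jane',
  'Tarzan vs "Tarzan"',
  '"Tarzan" vs Tarzan',
  "This line doesn't mention him",
  "He's moved to Tarzania",
  'He\'s no "Tarzan"!',
]

const selected = sample.filter(line => anyMatchDefines(line, /"Tarzan"|\b(Tarzan)\b/))
console.log(selected.join('\n'))
```

With a first-match-only check, `"Tarzan" vs Tarzan` would be rejected because the first match on that line is the quoted form. Looking at every match keeps it, which is the behavior the regex in 5) is after.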
Thanks for this very thorough write up!
I kind of feel like the semantics of this are too complex, which will probably lead to a feature that almost nobody uses. By that, I don't mean that the flags --not-defined and --defined are themselves complex, but using them effectively---as you've demonstrated here---requires some ingenuity in crafting the regex.
With that said, I'd be willing to adopt a feature like this because I do agree that it could be useful, but I'd have to strongly insist on the following:
Now that PCRE is (optionally) supported, can either of you think of a use-case for this that isn't handled by lookahead and lookbehind? I think this would be strictly more powerful than negative lookbehind, since lookbehind can't contain variable-length patterns, but that's the only advantage I can see. (Granted, that's an advantage I think I would occasionally find useful.)
I think it could be possible to define a simpler UX than needing to resort to look-around.
With that said, it's a good point and I was never a big fan of adding this feature anyway. So I'm going to close this.
Lookaround assertions still have the issues mentioned above. For anyone looking for a clean solution to this with the PCRE engine, the backtracking-control verbs [1] [2] [3] are your friends:
"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
"Tarzan" vs Tarzan
This line doesn't mention him
He's moved to Tarzania
He's no "Tarzan"!
$ rg --pcre2 '(?:"Tarzan")(*SKIP)(*FAIL)|\bTarzan\b' test.txt
"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
Tarzan vs "Tarzan"
"Tarzan" vs Tarzan
Or, to exclude lines which contain "Tarzan":
$ rg --pcre2 '(?:.*?"Tarzan".*)(*SKIP)(*FAIL)|\bTarzan\b' test.txt
$ rg --pcre2 '(?:.*?"Tarzan".*)(*COMMIT)(*FAIL)|\bTarzan\b' test.txt
"Tarzan and Jane"
"Jane and Tarzan"
Me Tarzan, you Jane
[1] https://www.rexegg.com/regex-best-trick.html#pcrevariation
[2] https://www.rexegg.com/backtracking-control-verbs.html
[3] https://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs
@chocolateboy Wow, I had never heard of those before. Thanks for sharing.