Ripgrep: support more sophisticated boolean matching operations

Created on 3 Apr 2018 · 39 Comments · Source: BurntSushi/ripgrep

With the new convention of using the capitalized version of a short flag to indicate the opposite, it's too bad that -E is already used to mean --encoding. I would like to suggest an "inverse pattern" mode where only lines/words (depending on other parameters as normal) matching pattern e but not matching pattern E are included in the result set.

Andrew, I know you are loath to add more ! support but given the pre-existing -E, perhaps a -e !PATTERN?

enhancement icebox question

mqudsi

👍5

All 39 comments

The name of the flag is really not the interesting part of this feature request. The interesting part is the request to support more sophisticated boolean tests.

I think if we were to decide to do this, then it needs to be part of a larger story that encompasses more sophisticated expressions. We also need to address the fact that, today, we can actually express quite a bit, but it requires piping. Namely, piping permits expressing "and". Piping plus the -v flag permits any arbitrary boolean expression you might want. For example, rg foo | rg -v bar says "show lines that match foo but do not contain bar," which is exactly your feature request.

git grep has support for this via --not, --and and --or. I don't know if I'm willing to add this to ripgrep. There must be a point at which we say, "piping is good enough."

An alternative way to implement this feature is in the regex engine itself (since intersection and complement are available as operations on regular languages), but this is extremely non-trivial to do.

I try not to speak in absolutes, but, "I don't want to add anything else that uses ! in a shell" is as close to an absolute as I can get. Let's drop that idea.

BurntSushi on 3 Apr 2018

👍3

I understand completely. I currently pipe (to grep, I didn't realize I could pipe to rg itself!) but was wondering from a performance perspective basically about using the regex engine itself to optimize the search with the additional boolean constraints.

Thanks.

mqudsi on 3 Apr 2018

👍1

> but was wondering from a performance perspective basically about using the regex engine itself to optimize the search with the additional boolean constraints.

Well, the "best" way is to, as I hinted at, build complement and intersection into the regex engine. But as I said, this is extremely non-trivial to do efficiently. If we were to implement this, then we'd need an algorithm that selects the (attempted) optimal matching path given all of the boolean conditions. e.g., if you said "x and not y and not z," then ripgrep would search for x and only apply the y and z blacklist on matches to filter them out. If you had x or y or z, then ripgrep would, as it does today, combine them into one regex joined by |. If you had not x and not y and not z, then ripgrep would behave as it would today if you ran rg -v x and then use the y and z blacklists to filter out matches. If you had not x or not y or not z, then ripgrep could behave as it does today if you ran rg -v 'x|y|z'. And so on...
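The "search for x, then apply the blacklist" strategy can be sketched in a few lines. This is only an illustration of the planning idea, not ripgrep's implementation: the helper name `search_conjunction` and its parameters are hypothetical, and Python's `re` module stands in for the Rust regex engine.

```python
import re

def search_conjunction(lines, required, forbidden):
    """Yield lines matching every pattern in `required` and none in `forbidden`.

    Hypothetical helper: the first required pattern drives the scan, and the
    remaining tests only run on lines the driver has already matched.
    """
    # All forbidden patterns collapse into one alternation, like 'y|z'.
    blacklist = (
        re.compile("|".join(f"(?:{p})" for p in forbidden)) if forbidden else None
    )
    compiled = [re.compile(p) for p in required]
    driver, rest = (compiled[0], compiled[1:]) if compiled else (None, [])
    for line in lines:
        if driver is not None and not driver.search(line):
            continue  # cheap rejection: the driver pattern is absent
        if not all(r.search(line) for r in rest):
            continue
        if blacklist is not None and blacklist.search(line):
            continue  # a forbidden pattern is present
        yield line

lines = ["foo", "foo bar", "foo baz", "bar", "qux"]
print(list(search_conjunction(lines, ["foo"], ["bar", "baz"])))
print(list(search_conjunction(lines, [], ["foo", "bar", "baz"])))
```

With no required pattern, the conjunction of negations reduces (by De Morgan's law) to a single inverted alternation, comparable to running rg -v 'foo|bar|baz'.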

It is plausible that this would result in a performance improvement. But you can't just throw that out there as a benefit and expect it to stick. :-) Performance does not exist in a vacuum. Pipelines tend to be constructed in a way that iteratively reduces the search space, which in turn makes performance less and less of an issue. The interesting bits are probably pipelines that start with an inverted match on a rarely occurring pattern, which would not reduce the search space much. Regardless, I personally find this to be a somewhat flimsy motivation for a feature like this unless someone can convince me otherwise. IMO, if we add a feature like this, it should be primarily for the UX.

BurntSushi on 3 Apr 2018

👍1

Example of using git grep with AND patterns:

git grep -e pattern1 --and -e pattern2 --and -e pattern3

kenorb on 11 Apr 2018

Example of AND operation using Rust's regex engine:

rg -N '(?P<p1>.*pattern1.*)(?P<p2>.*pattern2.*)(?P<p3>.*pattern3.*)' file.txt

kenorb on 11 Apr 2018

@kenorb That's presumably not the same as what git grep does. git grep -e pattern1 --and -e pattern2 will match pattern2pattern1 but (.*pattern1.*)(.*pattern2.*) will not. The standard way to perform "and" queries in ripgrep is with piping, as I mentioned above in my comment.
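The ordering problem is easy to reproduce. The sketch below uses Python's `re` as a stand-in for the regex engine; `matches_all` is a hypothetical helper that mirrors what a pipeline of searches does, namely testing each pattern independently.

```python
import re

def matches_all(line, patterns):
    # "and": every pattern must match somewhere in the line, in any order.
    return all(re.search(p, line) for p in patterns)

# The concatenated form only matches when pattern1 occurs before pattern2.
concatenated = re.compile(r"(.*pattern1.*)(.*pattern2.*)")

for line in ["pattern1pattern2", "pattern2pattern1"]:
    in_order = bool(concatenated.search(line))
    any_order = matches_all(line, ["pattern1", "pattern2"])
    print(line, in_order, any_order)
```

The concatenated regex rejects pattern2pattern1, while the independent tests accept both orderings.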

BurntSushi on 11 Apr 2018

I quite like the simplicity and "natural feel" of using rg foo | rg bar to do the equivalent of git grep -e foo --and -e bar. The only significant difference is the color.

git grep -e foo --and -e bar
[Screenshot, 2018-06-06 8:03:13 AM]

rg string | rg query
[Screenshot, 2018-06-06 8:04:42 AM]

See, no highlight of the word string in the rg pipe.

peterbe on 6 Jun 2018

👍2

@peterbe You should be able to fix that by adding --color always to your first invocation of ripgrep. Not ideal of course.

BurntSushi on 6 Jun 2018

👍7

I don't even know if it's possible with pipes, but if you could know that the next pipe is another rg, then --color always could be on by default. One can dream.

peterbe on 6 Jun 2018

Piping loses the file headers.

rg abc

a.txt
4: ...abc...xyz...
7: ...abc...

b.txt
3: ...abc...xyz...

rg abc | rg xyz

4: ...abc...xyz...
3: ...abc...xyz...

elbaro on 29 Jun 2018

That example doesn't look right. It should retain file names not as headers but in each line in standard grep format.

BurntSushi on 29 Jun 2018

Sorry my bad. It looks like this:

rg abc | rg xyz
a.txt: ...abc...xyz...
a.txt: ...abc...xyz...
b.txt: ...abc...xyz...
b.txt: ...abc...xyz...

Still hard to parse when there are many files.
I think it's an example where the built-in op can provide better UX than piping.

Another example is piping with -A or -B.

# want to print lines containing both "abc" and "xyz", with 3 lines of context on each side
rg abc -A 3 -B 3 | rg xyz -A 3 -B 3  # not what we want

elbaro on 29 Jun 2018

👍3

That's certainly part of an argument in favor of this, but I will not allow that argument to be used as a hammer. Taken to its logical conclusion, ripgrep should bundle every conceivable transform on its data. At some point, people need to become OK with piping ripgrep's output and dealing with the different format. Different people will have different opinions on where that line is drawn.

BurntSushi on 29 Jun 2018

👍2

I have definitely wished for an easy way to preserve headers when piping rg to rg. Maybe a flag for "header passthrough" would be useful on its own.

BatmanAoD on 21 Sep 2018

👍2

> I have definitely wished for an easy way to preserve headers when piping rg to rg. Maybe a flag for "header passthrough" would be useful on its own.

That would be nice but won't work in all cases. E.g., consider

rg -C5 foo | rg -v bar

Now the context lines around the matched lines in the first rg call are being matched by the second rg call and your output may end up being a bit of a mess and not what you might expect.

> IMO, if we add a feature like this, it should be primarily for the UX.

Looking at a few now-closed duplicate issues, what most people want is just "a and not b" with all headers/context preserved, which might make sense to special-case if that's much simpler than the general case.
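A rough sketch of that special case, assuming the boolean test runs on the original lines first and context is attached afterwards so it is never re-filtered. The function name `and_not_with_context` and its output shape are hypothetical; Python's `re` stands in for the regex engine.

```python
import re

def and_not_with_context(lines, want, reject, context=1):
    """Return (line_number, text, is_match) for matches plus surrounding context.

    Only the original lines are tested against `want` and `reject`; context
    lines are added afterwards, so a context line containing `reject` survives.
    """
    want_re, reject_re = re.compile(want), re.compile(reject)
    matched = {
        i for i, text in enumerate(lines)
        if want_re.search(text) and not reject_re.search(text)
    }
    shown = set()
    for i in matched:
        shown.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return [(i + 1, lines[i], i in matched) for i in sorted(shown)]

lines = ["alpha", "bar before", "foo here", "after", "foo bar", "omega"]
for number, text, is_match in and_not_with_context(lines, "foo", "bar"):
    # Mimic grep-style output: ":" marks a match, "-" marks a context line.
    marker = ":" if is_match else "-"
    print(f"{number}{marker}{text}")
```

Here the context line "bar before" is kept, whereas a second pipeline stage running with -v bar would have dropped it.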

aldanor on 7 Jan 2019

The files look like this:

a.txt
4: ...abc...
30: ...xyz...

b.txt
4: ...abc...
.....
(no 'xyz' in content)

How can I find files like a.txt, with 'abc' and 'xyz' on different lines?

amitbha on 23 Feb 2019

Use multiline search.

BurntSushi on 23 Feb 2019

> Use multiline search.

Thanks for the reply.
I tried rg -U --multiline-dotall -e 'abc.*xyz' and the right files were found. But there was too much output, like:

4: ...abc...
5: xxxxx
6: xxxxx
...
29: xxxxx
30: ...xyz...

rg -U --multiline-dotall -e 'abc.*xyz' | rg abc
No filenames or line numbers.

rg -U --multiline-dotall -l -e 'abc.*xyz' | rg 'abc' -
No result. How to read path from pipe?

rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do rg 'xyz' "$line"; done
Almost done! But filenames are missing. 😔

rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do echo "$line"; rg 'xyz' "$line"; echo; done
Done! 😌

amitbha on 23 Feb 2019

Please skim the options in the man page. Use the -n and --with-filename flags.

BurntSushi on 23 Feb 2019

rg -U --multiline-dotall -l -e 'abc.*xyz' | while read line; do rg --with-filename 'xyz' "$line"; echo; done
Got it!
😌

amitbha on 24 Feb 2019

Friendly note: the utility of this feature is not in question. More comments explaining how useful this is or the kinds of problems it solves that aren't solved well by the status quo aren't necessary. The key thing blocking this feature is the potentially immense complexity that it adds not only to the implementation, but to the UX. It requires serious design work first, and it's still not clear to me that this is a feature I want to add.

It is well known that git grep supports this stuff. If it does what you want, then just use that.

BurntSushi on 24 Feb 2019

👍6

Please consider a utility mode, rg --compile-expr a -and b -and c, that generates the relevant DFA.

Usage would be something like rg --dfa $(rg --compile-expr a -and -not b). This would confine the complexity to the compile-expr option; the rest of the UX would remain identical.

Also, piping is problematic for huge files, as the data is copied again for every pipe.

elazarl on 4 Dec 2019

😕2

Piping is also an issue when using e.g. --heading

zachriggle on 1 Apr 2020

@zachriggle That's already been mentioned.

BurntSushi on 1 Apr 2020

re --and, I'm not sure if this is blasphemy or even correct at all and I'm probably missing edge cases but we could demorgan it...

$ echo -e 'Hello, foo\nBye, baz\nHello, james\nHello, baz\nbaz likes yellow' | \
    rg --pcre2 '^(?!((?!.*baz.*$)|(?!.*ello.*$)))'
Hello, baz
baz likes yellow

for matching any line containing baz and ello. perhaps a useful stop-gap for anyone desperate for a work-around?
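The De Morgan form can be checked against the more direct "two positive lookaheads" form. The sketch below uses Python's `re`, which supports lookaheads, purely as a stand-in for PCRE2; ripgrep's default regex engine does not support look-around, which is why the command above needs --pcre2.

```python
import re

lines = ["Hello, foo", "Bye, baz", "Hello, james", "Hello, baz", "baz likes yellow"]

# The pattern from the comment: not (not baz or not ello).
de_morgan = re.compile(r"^(?!((?!.*baz.*$)|(?!.*ello.*$)))")
# The same condition written directly: baz and ello, in either order.
direct = re.compile(r"^(?=.*baz)(?=.*ello)")

hits_de_morgan = [line for line in lines if de_morgan.search(line)]
hits_direct = [line for line in lines if direct.search(line)]
print(hits_de_morgan)
print(hits_direct)
```

Both patterns select the same two lines on this input; the direct form is usually easier to read and extend with further terms.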

hraban on 20 May 2020

@hraban If you just want a simple and query, then I'd probably recommend just doing

$ echo -e 'Hello, foo\nBye, baz\nHello, james\nHello, baz\nbaz likes yellow' | rg baz | rg ello

With the downsides of course being that you lose the nice formatting and highlighting of baz.

BurntSushi on 20 May 2020

I'm going to suggest that maybe this issue and https://github.com/BurntSushi/ripgrep/issues/473 should be two separate issues.

Personally I'm not that interested in using complex boolean or regex patterns with ripgrep. I just want to be able to specify multiple patterns. Perhaps this could just be specified with a new flag like

rg --patterns "level=error" --patterns "requestID"

Maybe that's too simplistic, but I've been using rg nearly since it was started and I've never had any desire for anything besides a simple 'and' match on multiple patterns.

sparrc on 28 Oct 2020

@sparrc Conceptually, you might be right. But in terms of implementation, I don't think there is much of a difference, so I'm treating them the same. Also, ripgrep _does_ have the ability to search multiple patterns (using the same exact flags as grep). It's just that it's an "or" match.

On top of that, the reason why just wanting "and" match is a little weird is because you can do it with pipelines: rg level=error | rg requestID. It's just that the UX isn't quite as good...

BurntSushi on 28 Oct 2020

@BurntSushi it's not just the UX (which is a major, unfixable problem IMHO. UX issues are much more important than "real" bugs, say, 100% slowdown of some cases).

One of the main reasons for me to use ripgrep, and one of its advantages is speed, so I'm picking it when I'm searching large files. Using multiple pipes slows things down in some cases, as it copies the data, adds syscalls, etc.

This is not 100% the same search, and of course I picked a 3GB file with search terms appearing in most lines, but

$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD.\*LOW dr_agg  >/dev/null

real    0m5.772s
user    0m5.017s
sys     0m0.754s
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD.\*LOW dr_agg  >/dev/null

real    0m5.749s
user    0m4.987s
sys     0m0.760s
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD dr_agg |./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg LOW >/dev/null

real    0m6.330s
user    0m7.147s
sys     0m2.781s
$ time ./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg DMOD dr_agg |./ripgrep-12.1.1-x86_64-unknown-linux-musl/rg LOW >/dev/null

real    0m6.168s
user    0m7.245s
sys     0m2.777s

One second hardly matters, but it is not uncommon for me to search 300GB of files.

elazarl on 29 Oct 2020

👍2

+1 from me for multiple "AND" searches

gd4c on 4 Nov 2020

😕1

@gd4c Please don't post +1 comments. They are noise that makes it into my inbox. If you feel obligated to +1 something, then use GitHub's emoji reactions. See also: https://github.com/BurntSushi/ripgrep/issues/875#issuecomment-466774329

BurntSushi on 4 Nov 2020

👍1

Sorry. I initially upvoted the initial post, but it wasn't what I was after (which kind of evolved into the thread). I just wanted to make it clear what I was thumbing up.

gd4c on 5 Nov 2020

Continuing from #1149

The implementation complexity of more sophisticated boolean matching is precisely my main argument against it. And it requires a very thorough specification of behavior.

And the upsides are limited. Yes, you can't get the "nice" output when using pipelines, but you can still use pipelines and the output is still serviceable.

Is it much trouble to ask you to give us an idea of what you have in mind?
I would imagine it like this:

1. Run rg with the first expression as usual.
2. Instead of formatting file matches as headings, pass them as a list of input files for the second expression.
3. Repeat n times.
4. Show the results of the last run as usual, with color for the last match only (as if you only ran rg with just the last expression).

That would be enough for me, and I suspect many of the people above.
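The steps above amount to file-level narrowing rather than line-level filtering. A minimal sketch, assuming files are represented as an in-memory mapping from path to lines; the function name `narrow_files` is hypothetical and nothing here reflects how ripgrep works internally.

```python
import re

def narrow_files(files, patterns):
    """files: dict mapping a path to its list of lines (stand-in for a directory tree).

    Every pattern except the last only narrows the set of candidate files; the
    last pattern produces the (path, line_number, text) results that are shown.
    `patterns` must contain at least one entry.
    """
    *filters, last = [re.compile(p) for p in patterns]
    candidates = list(files)
    for pattern in filters:
        candidates = [
            path for path in candidates
            if any(pattern.search(line) for line in files[path])
        ]
    return [
        (path, number, line)
        for path in candidates
        for number, line in enumerate(files[path], start=1)
        if last.search(line)
    ]

files = {
    "a.txt": ["...abc...", "filler", "...xyz..."],
    "b.txt": ["...abc...", "filler"],
    "c.txt": ["...xyz..."],
}
print(narrow_files(files, ["abc", "xyz"]))
```

Only a.txt contains both patterns, so only its xyz line is reported, even though abc and xyz sit on different lines.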

gd4c on 5 Nov 2020

@gd4c rg foo | rg bar will only print lines that contain both foo and bar.

BurntSushi on 5 Nov 2020

True, but without the easily readable formatting. 🙂
Regular grep can do that for that matter. It is just that ripgrep is nicer to use, which drives this request: to improve the already nice UX rather than to add an otherwise impossible feature!

In any case, if you can find an easy way to do it, that'd be great. If you consider it too much trouble, there are (less nice) workarounds we can use!

gd4c on 5 Nov 2020

😕1

What's confusing?

Also, forgot to say that piping to rg searches lines in previous stdout, not the matching _files_!

gd4c on 5 Nov 2020

The upsides and downsides here are well known. I've stated repeatedly what the problems are with rg foo | rg bar. They don't need to keep being repeated. So I'm confused at why you're rehashing things.

Adding this feature reflects _significant_ work. The first step is to come up with a comprehensive UX specification of behavior. _That_ would be useful. Further argumentation about _why_ ripgrep should have this feature is _not_ useful. It's just noise and it's just filling up my inbox.

I said about as much almost a year and a half ago, so now I'm just repeating myself. And I'm confused at why I need to do it.

BurntSushi on 5 Nov 2020

Sorry for the misunderstanding. I recognize that you understand its usefulness and that the issue is the complexity. I was just responding to your solution above.

You made a great tool and I am grateful!

Cheers!

gd4c on 6 Nov 2020

❤2

My current work-around for this is effectively to use rg -l to find all OR'ed matches, and then pass them off to git grep.

A statement that looks roughly like this:

$ my-grep -e foo --and -e bar --and --not '(' -e fizz -e buzz ')'

Gets translated roughly to:

rg -e foo -e bar -l -0 | xargs -0 git grep --threads 12 --no-index -e foo --and -e bar --and --not '(' -e fizz -e buzz ')'

There's a LOT of extra plumbing in my shell script to achieve better performance (e.g. don't have ripgrep search for expressions in an --and --not ( -e fizz -e buzz ) block), but ultimately rg -l -0 | xargs -0 git grep --no-index works pretty effectively, and is much faster than git grep by itself if you make use of e.g. rg type filters (e.g. rg -t c -t py).

This also allows you to specify some git grep specific formatting, like --show-function, in addition to those that rg also supports like --break --heading --line-number.

zachriggle on 6 Nov 2020
