Ripgrep: match bug

Created on 5 Jul 2019  路  5Comments  路  Source: BurntSushi/ripgrep

$ rg --version
ripgrep 11.0.1 (rev 7bf7ceb5d3)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

This matches:

$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | egrep 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'
CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC

But this doesn't:

$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | rg 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'

To minimize, this doesn't match:

$ rg 'TTGAGTCCAGGAG[ATCG]{2}C' /tmp/subject

But this does:

$ rg 'TGAGTCCAGGAG[ATCG]{2}C' /tmp/subject
1:CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC

The only difference between the latter two is that the latter removes the first
T from the regex.

From inspecting the --trace output, I note that from the former regex, it
says this:

DEBUG|grep_regex::literal|grep-regex/src/literal.rs:105: required literals found: [Complete(TTGAGTCCAGGAGAC), Complete(TTGAGTCCAGGAGCC), Complete(TTGAGTCCAGGAGGC), Complete(TTGAGTCCAGGAGTC)]
TRACE|grep_regex::matcher|grep-regex/src/matcher.rs:52: extracted fast line regex: (?-u:TTGAGTCCAGGAGAC|TTGAGTCCAGGAGCC|TTGAGTCCAGGAGGC|TTGAGTCCAGGAGTC)

But in the latter regex (the one that works), we have this:

DEBUG|grep_regex::literal|grep-regex/src/literal.rs:59: literal prefixes detected: Literals { lits: [Complete(TGAGTCCAGGAGAAC), Complete(TGAGTCCAGGAGCAC), Complete(TGAGTCCAGGAGGAC), Complete(TGAGTCC
AGGAGTAC), Complete(TGAGTCCAGGAGACC), Complete(TGAGTCCAGGAGCCC), Complete(TGAGTCCAGGAGGCC), Complete(TGAGTCCAGGAGTCC), Complete(TGAGTCCAGGAGAGC), Complete(TGAGTCCAGGAGCGC), Complete(TGAGTCCAGGAGGGC)
, Complete(TGAGTCCAGGAGTGC), Complete(TGAGTCCAGGAGATC), Complete(TGAGTCCAGGAGCTC), Complete(TGAGTCCAGGAGGTC), Complete(TGAGTCCAGGAGTTC)], limit_size: 250, limit_class: 10 }

Therefore, this is almost certainly a bug in literal extraction. Moreover,
this Rust program correctly prints true:

fn main() {
    let pattern = r"CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C";
    let haystack = "CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC";

    let re = regex::Regex::new(pattern).unwrap();
    println!("{:?}", re.is_match(haystack));
}

Which points the finger at grep-regex's inner literal extraction. Sigh.

bug

Most helpful comment

I wrote a fuzzer that uses the regex_generate crate to produce matching inputs for a given fuzz-generated regular expression and then checks if grep_matcher::RegexMatcher successfully matches that input against the regex. It has successfully found the class of errors represented by this issue and with the proposed change applied it has not found any more failing cases yet.

All 5 comments

Oh, forgot to mention that this was originally reported in this StackOverflow question: https://stackoverflow.com/questions/56906725/ripgrep-missing-character-class-repetions

Replacing the line:

https://github.com/BurntSushi/ripgrep/blob/4858267f3b97fe2823d2ce104c1f90ec93eee8d7/grep-regex/src/literal.rs#L217

with

if max.map_or(true, |max| min <= max) {

or just dropping the whole if (as min > max never holds) condition fixes the issue for me. That condition looked suspicious to me at a first glance but as I do not understand that code very well, I sadly cannot substantiate why this is the right fix and I'm not even convinced it is.

I wrote a fuzzer that uses the regex_generate crate to produce matching inputs for a given fuzz-generated regular expression and then checks if grep_matcher::RegexMatcher successfully matches that input against the regex. It has successfully found the class of errors represented by this issue and with the proposed change applied it has not found any more failing cases yet.

@jakubadamw That's awesome! Do you want to submit a PR? If not, I'll get to this eventually!

@BurntSushi, sure! 馃檪

Was this page helpful?
0 / 5 - 0 ratings

Related issues

daxim picture daxim  路  3Comments

timotheecour picture timotheecour  路  3Comments

andschwa picture andschwa  路  3Comments

kenorb picture kenorb  路  3Comments

borekb picture borekb  路  3Comments