$ rg --version
ripgrep 11.0.1 (rev 7bf7ceb5d3)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)
This matches:
$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | egrep 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'
CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC
But this doesn't:
$ echo 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC' | rg 'CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C'
To minimize, this doesn't match:
$ rg 'TTGAGTCCAGGAG[ATCG]{2}C' /tmp/subject
But this does:
$ rg 'TGAGTCCAGGAG[ATCG]{2}C' /tmp/subject
1:CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC
The only difference between the latter two is that the latter removes the first
T from the regex.
From inspecting the --trace output, I note that from the former regex, it
says this:
DEBUG|grep_regex::literal|grep-regex/src/literal.rs:105: required literals found: [Complete(TTGAGTCCAGGAGAC), Complete(TTGAGTCCAGGAGCC), Complete(TTGAGTCCAGGAGGC), Complete(TTGAGTCCAGGAGTC)]
TRACE|grep_regex::matcher|grep-regex/src/matcher.rs:52: extracted fast line regex: (?-u:TTGAGTCCAGGAGAC|TTGAGTCCAGGAGCC|TTGAGTCCAGGAGGC|TTGAGTCCAGGAGTC)
But in the latter regex (the one that works), we have this:
DEBUG|grep_regex::literal|grep-regex/src/literal.rs:59: literal prefixes detected: Literals { lits: [Complete(TGAGTCCAGGAGAAC), Complete(TGAGTCCAGGAGCAC), Complete(TGAGTCCAGGAGGAC), Complete(TGAGTCC
AGGAGTAC), Complete(TGAGTCCAGGAGACC), Complete(TGAGTCCAGGAGCCC), Complete(TGAGTCCAGGAGGCC), Complete(TGAGTCCAGGAGTCC), Complete(TGAGTCCAGGAGAGC), Complete(TGAGTCCAGGAGCGC), Complete(TGAGTCCAGGAGGGC)
, Complete(TGAGTCCAGGAGTGC), Complete(TGAGTCCAGGAGATC), Complete(TGAGTCCAGGAGCTC), Complete(TGAGTCCAGGAGGTC), Complete(TGAGTCCAGGAGTTC)], limit_size: 250, limit_class: 10 }
Therefore, this is almost certainly a bug in literal extraction. Moreover,
this Rust program correctly prints true:
fn main() {
let pattern = r"CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAG[ATCG]{2}C";
let haystack = "CCAGCTACTCGGGAGGCTGAGGCTGGAGGATCGCTTGAGTCCAGGAGTTC";
let re = regex::Regex::new(pattern).unwrap();
println!("{:?}", re.is_match(haystack));
}
Which points the finger at grep-regex's inner literal extraction. Sigh.
Oh, forgot to mention that this was originally reported in this StackOverflow question: https://stackoverflow.com/questions/56906725/ripgrep-missing-character-class-repetions
Replacing the line:
with
if max.map_or(true, |max| min <= max) {
or just dropping the whole if (as min > max never holds) condition fixes the issue for me. That condition looked suspicious to me at a first glance but as I do not understand that code very well, I sadly cannot substantiate why this is the right fix and I'm not even convinced it is.
I wrote a fuzzer that uses the regex_generate crate to produce matching inputs for a given fuzz-generated regular expression and then checks if grep_matcher::RegexMatcher successfully matches that input against the regex. It has successfully found the class of errors represented by this issue and with the proposed change applied it has not found any more failing cases yet.
@jakubadamw That's awesome! Do you want to submit a PR? If not, I'll get to this eventually!
@BurntSushi, sure! 馃檪
Most helpful comment
I wrote a fuzzer that uses the regex_generate crate to produce matching inputs for a given fuzz-generated regular expression and then checks if
grep_matcher::RegexMatchersuccessfully matches that input against the regex. It has successfully found the class of errors represented by this issue and with the proposed change applied it has not found any more failing cases yet.