Runtime: Possible bug in regex' character class subtraction

Created on 8 Apr 2020 · 10 Comments · Source: dotnet/runtime

First of all, I'm not a regex expert so I'm not exactly sure that this really is a bug and not just me but anyway...

When using character class subtraction in combination with the . (dot, any character) wildcard, it doesn't work as I expect. I have a string of architectures. It contains one or more of them, separated by semicolons. Since I want only the first one, I came up with this pattern, but it always returns an empty string:

Regex.Match("x86_64;x86", "[.-[;]]+").Value

However, if I replace the dot wildcard with two whitespace-related classes (which, in my opinion, should together have the same effect), the pattern works as I'd expect and returns the first arch, x86_64:

Regex.Match("x86_64;x86", @"[\S\s-[;]]+").Value

The code here is heavily simplified to get to the point.

I'm targeting netcoreapp3.1 in a simple ASP.NET Core project.

Here's my dotnet --info

.NET Core SDK (reflecting any global.json):
 Version:   3.1.201
 Commit:    b1768b4ae7

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.18363
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\3.1.201\

Host (useful for support):
  Version: 3.1.3
  Commit:  4a9f85e9f8

.NET Core SDKs installed:
  3.1.201 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.13 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.13 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.13 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.2.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.1.3 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
area-System.Text.RegularExpressions question

All 10 comments

Tagging @eerhardt as an area owner

As far as I can tell, here is what is happening in your first example, [.-[;]]. It says: "All the characters in the set of . (which in this case is treated as the literal . character, 0x2E), subtract out ; (which obviously isn't in the set)".

In your second example, [\S\s-[;]], you are saying: "All the characters in the set of any non-whitespace or any whitespace, subtract out ;". That works.

So the question is, why is the . character being treated literally as the character 0x2E and not the wildcard character?

Looking at the code in

https://github.com/dotnet/runtime/blob/6ca5d511a15ff411fef52ffaf2cd9908bc66551e/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs#L548-L552

I don't see any special handling of the . character turning into the wildcard character. Specifically, this code:

https://github.com/dotnet/runtime/blob/6ca5d511a15ff411fef52ffaf2cd9908bc66551e/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs#L694-L700

treats chPrev like any regular character.

To work around this, I would use something like [^;], which means any character other than the ; character.
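
For anyone who wants to try it, here is a minimal sketch comparing the two patterns from the question with the [^;] workaround on the same input. It assumes C# 9 or later for top-level statements (on older compilers, move the body into Main), and the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string archs = "x86_64;x86";

// Inside [...] the '.' is a literal period, so this class contains only '.'.
// The input has no period, so the match fails and Value is empty.
string literalDot = Regex.Match(archs, "[.-[;]]+").Value;

// Every character (\S or \s) minus ';' matches up to the first semicolon.
string subtraction = Regex.Match(archs, @"[\S\s-[;]]+").Value;

// Negated class: any character other than ';'.
string negated = Regex.Match(archs, "[^;]+").Value;

Console.WriteLine($"'{literalDot}' '{subtraction}' '{negated}'");
// prints: '' 'x86_64' 'x86_64'
```

The negated class and the \S\s subtraction give the same result here; [^;] is just shorter to read.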

@stephentoub @ViktorHofer @pgovind - do any of you know if the . character in a character class is supposed to be treated as the literal . character or as the wildcard?

do any of you know if the . character in a character class is supposed to be treated as the literal . character or as the wildcard?

From https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#any-character-
"In a positive or negative character group, a period is treated as a literal period character, and not as a character class."
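
A small sketch of what that sentence from the docs means in practice, assuming default RegexOptions and C# 9 or later for top-level statements. The variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Outside a character group, '.' is the wildcard (any character except \n by default).
bool wildcardMatchesLetter = Regex.IsMatch("a", "^.$");

// Inside a character group, '.' is only the literal period.
bool bracketMatchesLetter = Regex.IsMatch("a", "^[.]$");
bool bracketMatchesPeriod = Regex.IsMatch(".", "^[.]$");

Console.WriteLine($"{wildcardMatchesLetter} {bracketMatchesLetter} {bracketMatchesPeriod}");
// prints: True False True
```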

Note that we seem to parse this differently from JavaScript/PCRE. They do not build a stack of square brackets, so [\S\s-[;]]+ is treated as "one character in the class [\S\s-[;] followed by one or more literal ]", whereas we treat it as stated above.

I wonder whether we ought to fix that.

That's because they don't support character class subtraction.

For reference, see https://www.regular-expressions.info/charclasssubtract.html and the section "Notational Compatibility with Other Regex Flavors".
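
To illustrate the .NET subtraction syntax itself, here is a minimal sketch with two made-up inputs (not from this issue). It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// [a-z-[aeiou]] means: lowercase ASCII letters, minus the vowels.
string consonants = Regex.Match("rhythm and blues", "[a-z-[aeiou]]+").Value;

// [\d-[05]] means: any digit, minus 0 and 5.
string masked = Regex.Replace("2024-05-15", @"[\d-[05]]", "#");

Console.WriteLine(consonants); // prints: rhythm
Console.WriteLine(masked);     // prints: #0##-05-#5
```

In both patterns the -[...] part comes last inside the outer brackets, which is where .NET expects the subtraction to be.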

Ah thank you!

This seems resolved, then.

Closing, as I believe this question is answered. Feel free to re-open if it hasn't been.

Thank you all for the answers. I was looking in the docs but must have overlooked the part about . being treated as a literal character in that case.
Things like this would be much easier to debug if we had something like RegExr for the .NET implementation, since it explains every part of the expression. I looked for one but didn't find anything that advanced. Do you know of any such tool, or is one planned?

@bramborman I am not aware of such a thing - perhaps because we do not expose the parse tree in a public API. It seems that RegExr wrote their own.

It doesn't help you in this case (with the question about the treatment of .), but in case it's interesting to someone, there is a mode that allows us to debug the regex engine. It requires building and patching in a debug System.Text.RegularExpressions.dll and passing (RegexOptions)0x0080, then setting up a listener for the debug output. Since I happened to have it set up, here is what it gives me for your first regex (including tracing of the matching):

[11628] Pattern: [.-[;]]+    Options: None    Timeout: infinite 
[11628] Capture index = 0 
[11628]   Setloopatomic [.-[;]]+ 
[11628]  
[11628] Direction:  left-to-right 
[11628] Anchors:    None 
[11628]  
[11628] First Chars: 
[11628] 0: [.-[;]] 
[11628]  
[11628] 000000 *Lazybranch       addr = 12 
[11628] 000002 *Setmark           
[11628] 000003  Setrep           [.-[;]], rep = 1 
[11628] 000006  Setloopatomic    [.-[;]], rep = inf 
[11628] 000009 *Capturemark      index = 0 
[11628] 000012  Stop              
[11628]  
[11628]  
[11628]  
[11628] Search range: from 0 to 10 
[11628] Firstchar search starting at 0 stopping at 10 
[11628] Capnum 0: 

and for the second pattern:

[11628] Pattern: [\S\s-[;]]+    Options: None    Timeout: infinite 
[11628] Capture index = 0 
[11628]   Setloopatomic [\S\s-[;]]+ 
[11628]  
[11628] Direction:  left-to-right 
[11628] Anchors:    None 
[11628]  
[11628] First Chars: 
[11628] 0: [\S\s-[;]] 
[11628]  
[11628] 000000 *Lazybranch       addr = 12 
[11628] 000002 *Setmark           
[11628] 000003  Setrep           [\S\s-[;]], rep = 1 
[11628] 000006  Setloopatomic    [\S\s-[;]], rep = inf 
[11628] 000009 *Capturemark      index = 0 
[11628] 000012  Stop              
[11628]  
[11628]  
[11628]  
[11628] Search range: from 0 to 10 
[11628] Firstchar search starting at 0 stopping at 10 
[11628] Executing engine starting at 0 
[11628]  
[11628] Text:  0       ^>x86_64;x86$ 
[11628] Track: 0/32    () 
[11628] Stack: 0/24    () 
[11628]        000000 *Lazybranch       addr = 12 
[11628]  
[11628] Text:  0       ^>x86_64;x86$ 
[11628] Track: 2/32    (0 0) 
[11628] Stack: 0/24    () 
[11628]        000002 *Setmark           
[11628]  
[11628] Text:  0       ^>x86_64;x86$ 
[11628] Track: 3/32    (2 0 0) 
[11628] Stack: 1/24    (0) 
[11628]        000003  Setrep           [\S\s-[;]], rep = 1 
[11628]  
[11628] Text:  1       x>86_64;x86$ 
[11628] Track: 3/32    (2 0 0) 
[11628] Stack: 1/24    (0) 
[11628]        000006  Setloopatomic    [\S\s-[;]], rep = inf 
[11628]  
[11628] Text:  6       4>;x86$ 
[11628] Track: 3/32    (2 0 0) 
[11628] Stack: 1/24    (0) 
[11628]        000009 *Capturemark      index = 0 
[11628]  
[11628] Text:  6       4>;x86$ 
[11628] Track: 5/32    (9 0 2 0 0) 
[11628] Stack: 0/24    () 
[11628]        000012  Stop              
[11628]  