Runtime: Regex should have a static TryParse() method

Created on 17 Nov 2017 · 31Comments · Source: dotnet/runtime

[edit by @ViktorHofer to put into review format]

Issue

from @udlose
The System.Text.RegularExpression class does not have a true ability to test for valid regular expression syntax from a string input. The only way that this can currently be accomplished is to surround the Regexctor (that takes a string) with a try/catch and look for an ArgumentExceptionif the string is not a valid regular expression.

Rationale

from @udlose
This is not desirable for 3 reasons:

It doesn't follow the "TryParse" pattern used by other types for "conversion" from strings such as Int32.TryParse( string ), etc.
Performance of possible exception throwing
Exceptions should not be used for control flow

Proposal

from @ViktorHofer

namespace System.Text.RegularExpressions
{
    public class Regex : ISerializable
    {
        public static bool TryParse(string pattern, out Regex regex);
        public static bool TryParse(string pattern, RegexOptions options, out Regex regex);
        public static bool TryParse(string pattern, RegexOptions options, TimeSpan matchTimeout, out Regex regex);

        public static bool Validate(string pattern);
        public static bool Validate(string pattern, Regexoptions options);
    }
}

Names are not final. Could either be TryCreate(...) and ValidatePattern(...)

Discussion

from @ViktorHofer
For this to work we would need to make some changes in RegexParser. Everywhere where we are currently throwing an exception with the helper method MakeException(string) we would need to add a condition and return null if no exceptions are desired (TryParse scenario). We would also need to make sure that the changes won't harm existing inlining optimizations. It would involve adding a lot of repetitive code.

area-System.Text.RegularExpressions

Source

udlose

👍2

Most helpful comment

I've had apps where regex's were stored in databases, xml files, .config files, external services, etc. Different clients, different projects. In other words, it didn't seem like an exception to the rule that I would have benefited from having this API. The most recent example that comes to mind is a coupon validation component that business folks store regex's dynamically in a database when running special events during Black Friday. The coupons (and corresponding regex's for different coupon types) would change very quickly and have to be updated real-time - not just the coupon code, but different coupon formats that are transient over time. Another example is a Rules Engine that consumed data from web services. Part of that data included expressions which had to be parsed. Anytime you have to create Regex's dynamically based on some input which is not constant. Placing the regex's in code is not always an option.

udlose on 17 Nov 2017

👍3

All 31 comments

We intentionally don't spread TryPattern everywhere. Only in places where it makes sense.
What are the scenarios where you need to parse unknown RegEx? How common are those scenarios? (Naively I would assume that RegEx is typically constant in code)

karelz on 17 Nov 2017

udlose on 17 Nov 2017

👍3

For this to work we would need to make some changes in RegexParser. Everywhere where we are currently throwing an exception with the helper method MakeException(string) we would need to add a condition and return null if no exceptions are desired (TryParse scenario). We would also need to make sure that the changes won't harm existing inlining optimizations.

That said, it would involve adding a lot of repetitive code which I'm not really a fan of (without loosing inlining optimization). For the public API surface we could have something like this:

[edit by @ViktorHofer, put up api proposal at the top]

ViktorHofer on 26 Nov 2017

I think this is definitely worthwhile. Dynamic regexes are very common, and I've used them many times.

We intentionally don't spread TryPattern everywhere.

@karelz Why? In .NET, absolutely anything that can be defined at compile time can also be created at runtime. For that reason alone it seems like any and every convert-from-string scenario should have a corresponding Try pattern, because there's a good chance that that string will be user input (requiring validation) in somebody's application somewhere.

MgSam on 27 Nov 2017

👍2

Talked to @ViktorHofer, looks ready for review.

terrajobst on 27 Nov 2017

@MgSam first, you already scoped it to convert-from-string scenarios. TryPattern is used for more than just that.
Second, historically we didn't blindly add hundreds of APIs which may or may not be useful - we always tried to understand the value of each API / API category, while maintaining certain level of consistency (e.g. across collections). Key motivation is to keep the API surface sane and easy to consume/reason about - that means avoiding useless APIs which might be just confusing for developers.
BTW: You can check our recorded API reviews for some of our motivations and reasoning.

karelz on 27 Nov 2017

@karelz Right, I'm asking in the context of this issue, which was about parsing from strings.

"Hundreds of APIs" seems a little extreme- I'm just pointing out that it should factor in the design process for any API that deserializes a string that somewhere someone will be doing this at runtime with user input, and as such, it should probably be the standard, not the exception, to provide Try methods for such cases.

I think it's important to establish design principles like that ahead of time, because as, is pointed out, doing it afterwards often requires a lot of refactoring to get to a good implementation. I've used other libraries which do not provide Try implementations for parsing from strings, and doing so at this point would require a major re-write.

MgSam on 27 Nov 2017

updated for review format

ViktorHofer on 1 Dec 2017

@ViktorHofer why the need for

a lot of repetitive code

This seems to imply the current implementation is not amenable to refactoring? Can you clarify if I misunderstood?

udlose on 1 Dec 2017

Video

We don't believe this API is very useful, because

TryParse is really about parsing user input in tight loops, which regex shouldn't be.
Returning a bool isn't very informative. Granted, our current ArgumentException we expose information as a string, which is not machine readable, but that's probably the area we should be improving instead.
Plumbing it through the current engine would be non-trival as we throw ArgumentException left and right.

We already have plans to rewrite the regex engine; when we do, we should think about diagnostics.

terrajobst on 5 Dec 2017

TryParse is really about parsing user input in tight loops,

This has nothing to do with parsing in tight loops. Rather, it has to do with following the very guidelines that MS has defined in the .NET Framework Design Guidelines (Try-Parse Pattern - Framework Design Guidelines)

which regex shouldn't be.

Unfortunately, when you design and write a public API, you do not have the luxury of dictating or controlling how it is going to be used.

Returning a bool isn't very informative. Granted, our current ArgumentException we expose information as a string, which is not machine readable, but that's probably the area we should be improving instead.

I'm totally missing how this is not informative. It is how the Try-Parse pattern is supposed to work: I give you an input, you validate it and let me know if it's valid. If it's valid, you give me an instance of what I asked for.

Plumbing it through the current engine would be non-trival as we throw ArgumentException left and right.

So it sounds like you are using exceptions for control-flow? If so, it is another violation of your own guideline X DO NOT use exceptions for the normal flow of control, if possible:

Except for system failures and operations with potential race conditions, framework designers should design APIs so users can write code that does not throw exceptions. For example, you can provide a way to check preconditions before calling a member so users can write code that does not throw exceptions.

Without this API, we are still stuck in the position of using exceptions for control-flow and consequently paying a performance penalty. I politely ask you to reconsider for the reasons mentioned above. Though you cannot see usefulness, there are those of us who do.

udlose on 5 Dec 2017

👍2

@terrajobst I don't get it. Users say "Hey, this would be really useful for us" and Microsoft's response is, "No, this is not useful, because we've never had occasion to use it". Well, I've never had occasion to use a cryptography library, that doesn't mean it's not useful.

Maybe your response is going to be "well it only has two votes on Github". Yes, in a repo with a ton of churn where the rare feature requests are mixed in with a thousand bugs. (If you guys properly used the dotnet/designs like you claimed you were going to, this complaint would be moot)

Whether you like it or not, applications use user-supplied regexes all the time. (Example: 43 year old grep) And throwing exceptions is the absolute wrong way to validate something. We should be removing vexing exceptions not encouraging them.

It sounds like the real reason is number 3- we don't think it's worth doing the work. So then either mark this for future or put it as up-for-grabs.

MgSam on 5 Dec 2017

👍2

FYI: The API review discussion was recorded - see https://youtu.be/BI3iXFT8H7E?t=14 (16 min duration)

karelz on 6 Dec 2017

@udlose

This has nothing to do with parsing in tight loops. Rather, it has to do with following the very guidelines that MS has defined in the .NET Framework Design Guidelines (Try-Parse Pattern - Framework Design Guidelines)
Unfortunately, when you design and write a public API, you do not have the luxury of dictating or controlling how it is going to be used.

The guideline is about Exceptions usage in Performance context -- that's what "tight loops" refers to. From that perspective it is the same thing.
We do not believe this scenario fits that guideline - i.e. it is not "_extremely performance-sensitive APIs_" as the guide phrases it.

I'm totally missing how this is not informative. It is how the Try-Parse pattern is supposed to work: I give you an input, you validate it and let me know if it's valid. If it's valid, you give me an instance of what I asked for.

What was left out from the write-up was that we agree there is a need to have better diagnostics for incorrect regexes - ideally with indexes, etc. Mainly for developer scenarios, but potentially could be used for more dynamic scenarios as well in a form of Validation (Validate method, but throwing).

So it sounds like you are using exceptions for control-flow?
Without this API, we are still stuck in the position of using exceptions for control-flow and consequently paying a performance penalty.

We do not believe it is common case to deal with invalid regexes at runtime as part of your control flow. We believe that majority of such cases are developer bugs. [EDIT] (It is not going to help as Validate will throw, sorry for my mistake) ~~The remaining scenarios should be satisfied by Validate method.~~
~~Would that work for your scenarios?~

I politely ask you to reconsider for the reasons mentioned above. Though you cannot see usefulness, there are those of us who do.

We are more than happy to reconsider things, but we need to understand the scenarios and believe they are useful and promote the right patterns in the platform. If there is more evidence we didn't see, feel free to bring it forward.

karelz on 6 Dec 2017

BTW: I had to update my answer above as it was incorrect based on what was discussed and I got a bit confused answering here.

karelz on 6 Dec 2017

@MgSam

I don't get it. Users say "Hey, this would be really useful for us" and Microsoft's response is, "No, this is not useful, because we've never had occasion to use it". Well, I've never had occasion to use a cryptography library, that doesn't mean it's not useful.

If you browse the closed API proposals in this repo, you will notice that we can't take every single API request and often we reject APIs, because we do not believe it is common-enough case, or that it does not belong into CoreFX repo (see e.g. dotnet/runtime#22228).

Maybe your response is going to be "well it only has two votes on Github". Yes, in a repo with a ton of churn where the rare feature requests are mixed in with a thousand bugs.

Number of votes / people interested also plays a role. It can swing the scenario from "we do not quite think it is important" into "oh, so many people really want it? Maybe we were wrong".

If you guys properly used the dotnet/designs like you claimed you were going to, this complaint would be moot

That repo was meant for larger design discussions - just browse the issues there. It was NOT meant for small API additions like this one. If you find any docs saying the contrary, please point us to it and we will fix it.

Whether you like it or not, applications use user-supplied regexes all the time. (Example: 43 year old grep) And throwing exceptions is the absolute wrong way to validate something. We should be removing vexing exceptions not encouraging them.

Yes, applications can do whatever they want, incl. hacking memory and runtime. It does not make it something we recommend to do or promote. It does not mean we should make it faster or easier to use (and hence spread the "bad pattern/practice" even more).
Using user-supplied regexes is a security risk (denial-of-service at minimum), so not quite recommended. Given it is costly operation anyway, one additional exception does not seem to be like a huge perf problem for that scenarios. Moreover it is rare scenario.
Again, if there is evidence that there are LOTS of applications and developers using this pattern, we may reconsider, but it does not look like it is the case based on our insights into nuget, ApiPort and other internal/external telemetry sources.

It sounds like the real reason is number 3- we don't think it's worth doing the work. So then either mark this for future or put it as up-for-grabs.

Yes, on top of all the above, it is technically pretty challenging to make it happen in today's RegEx implementation (both initial cost + significant ongoing maintenance cost which we will have to pay). If we put it up for grabs, we still have to pay the maintenance cost (it would likely increase further unrelated fixes in the space in future).
Given the security angle and all said above, it just doesn't sound like something we want in CoreFX repo.

If you still believe we underestimate the usage, you can also consider forking our RegEx code and shipping its mutation as a separate nuget package (it is under MIT license). Developers with similar dynamic scenarios as yours may help contribute to the library. It is something we recommended even for more common scenarios - see examples in dotnet/runtime#22228.

karelz on 6 Dec 2017

@karelz

I watched the design session and appreciate that you record and share those - so thank you!

Unfortunately, the design discussion seemed to miss the point I was making and centered on the performance aspect - rather than the anti-pattern of using exceptions for control-flow which is what I was most concerned about. Using exceptions for control-flow is specifically called out in the .NET Framework Design Guidelines (with noted exceptions that do not apply in this case). X DO NOT use exceptions for the normal flow of control which says:

Except for system failures and operations with potential race conditions, framework designers should design APIs so users can write code that does not throw exceptions. For example, you can provide a way to check preconditions before calling a member so users can write code that does not throw exceptions.

This point wasn't even mentioned in the Desgin Review recording!

Using user-supplied regexes is a security risk (denial-of-service at minimum), so not quite recommended.

Myabe my use case wasn't clear: every time I've had to use user-supplied regexes, it has been from a "trusted/internal" source. For example, developers crafting them and storing them in a DB, file, etc. Or the most significant cases I can remember were:

internal business users supplying regexes as input to an Admin portal for a Rules Engine Builder
internal business users supplying coupon code patterns (via an Admin portal) for one of the top 10 online retailers during BlackFriday.

Neither of these are from nefarious sources. Thus, the security-risk here is moot.

Given it is costly operation anyway, one additional exception does not seem to be like a huge perf problem for that scenarios. Moreover it is rare scenario.

Again, missed my point of control-flow.

udlose on 6 Dec 2017

Re the cost of implementation -- @udlose et al - would it still be helpful if the implementation of the new methods was simply a try/catch? There's still an exception, but if you are debugging in VS it by default will not break on exceptions in optimized code (like .NET itself). Also the code would not have to do the try/catch.

It would still have the perf impact of an exception, if any, but it sounds like that wasn't the primary concern. And some future point it could hypothetically be improved to not throw anywhere without changing any callers.

danmosemsft on 6 Dec 2017

@udlose

Your "using exceptions for control-flow" point was not missed. We just don't believe it is common scenario / normal flow of control to use invalid regexes. It is definitely not a practice we would recommend.

In general we do not plan to get rid of all exceptions in BCL which could in some not-so-common use cases be normal control flow by adding more Try* methods. We save it for common performance-sensitive scenarios and for common scenarios where it hurts debugging of apps.

karelz on 6 Dec 2017

[edited for clarity]

I think the "control flow" argument could apply to Regex.TryMatch as it's a new flavor of an existing method adjusted to not throw - analogous to Int32.TryParse.

However the purpose of the proposed Regex.Validate is to determine a boolean result. There is no reason it should do so by throwing an exception.

We might think it's not going to get enough use to add, but I don't think the "control flow" argument applies there.

danmosemsft on 6 Dec 2017

@karelz

In general we do not plan to get rid of all exceptions in BCL which could in some not-so-common use cases be normal control flow by adding more Try* methods. We save it for common performance-sensitive scenarios and for common scenarios where it hurts debugging of apps.

I don't think that @KrzysztofCwalina would agree with this practice. I'd be interested to hear his thoughts. If he agrees with your statement, then it would contradict the very design guidelines he authored.

udlose on 6 Dec 2017

This discussion is repeating https://github.com/dotnet/corefx/issues/19928 which was asking for AssemblyName.TryGetAssemblyName

danmosemsft on 6 Dec 2017

👍1

@danmosemsft - I appreciate your input. The actual ask was for a TryParse which would follow the Try-Parse pattern. I give you a string and if it can be created into a valid Regex, you return true and a new instance of that Regex in the out param. If it is not valid, you return false and the out param is null (default for reference types).

Your suggestion of a Validate is close to a compromise (at least control-flow issue is addressed) though it requires the caller to complete the operation in 2 steps instead of one: first call Validate, then manually create an instance if Validate returns true. Whereas with a true TryParse implementation, the caller can both validate and create an instance in a single step. It seems as though we cannot get around the exception throwing happening _inside of_ the Regex class and I acquiesce on that point.

udlose on 6 Dec 2017

@karelz, @terrajobst, @ViktorHofer, @danmosemsft after speaking with some coworkers they reminded me of an important business case for consuming/parsing many different regex patterns at runtime: globalization of address patterns, zip codes, area codes, credit card types (incl. non-US), etc.

udlose on 7 Dec 2017

@udlose how is that dynamic scenario? If your localized regex is incorrect, is that something "normal" you expect and won't enable your feature on the specific language?
IMO it is a case of multiple static regexes (each of them should be validated at build-time), not a truly dynamic scenario.

karelz on 7 Dec 2017

@karelz - it's dynamic in that I might have these loaded into a database, file, etc. and not hardcoded directly into code. I agree that in most cases, they are effectively static- but not always able to be stored in code as I gave examples of in prev comments. What do you mean by

validated at build time

TMK, there's no way to validate a regex at compilation time.

Also, I remember in the design discussion above, Uri class was used as a comparison. This request is basically akin to adding a TryCreate like the Uri class has (call it TryParse or whatever - the naming might be splitting hairs).

udlose on 7 Dec 2017

@udlose I never before heard about globalization resources being stored in database for application. I think that is extremely rare.
By validated at build time I meant you should validate it before release (e.g. via testing). Or run with the risk your app is broken on certain language.
Either way, what is the difference for the app if you get bool or exception -- either way the app will not work on that particular "globalized" language.

karelz on 7 Dec 2017

@karelz CMS is typically used for Globalization (i.e. outside of code)

udlose on 7 Dec 2017

Out of curiosity - what is CMS? (quick internet search didn't reveal anything related)

Either way, globalized regex for address/CC pattern seems to me similar to translated localized string (parameterized). What are you going to do when it is wrong? IMO you have to admit there is a bug in your application in such case. The globalized/localized resources are virtually part of your application (no matter where they are stored - in code, resources, config file or database-style storage) and if they are wrong, the chance is part of your app won't work as you expected. That is rather exceptional case, not a normal control flow. Of course, the more dynamic storage (config/database-style), the bigger need to protect yourself against misconfiguration errors, but they are still errors - something that should be surfaced to admin/user and fixed rather than considered normal as invalid end-user input for a credit card number. Ideally such validation and prompt for a fix should happen upon configuration/database update, not at runtime of the app.
The only scenario I can imagine where regexes are fully dynamic is search & replace function in editor (VS, Notepad++, etc.) -- where user inputs the regex to look for or to replace. That is highly specialized scenario though, far from common/mainstream.

Summary

Overall I think we are running in circles here. To sum it up:
If you have dynamic regexes (which is discouraged practice), you may want to verify if they are ok. You can today check them by catching exceptions:
c# try { Regex regEx = new Regex(dynamicRegex); // Valid regex - react } catch (ArgumentException ex) { // Invalid regex - react }

There may be several potential complains about the code snippet:

Performance (exceptions are slow) -- it is not common case (so not on hot path) & regex parsing is already expensive, so the exception overhead is likely negligible (would be nice to measure with simple and complex regexes).
Debuggability (exceptions under debugger disturb your debugging sessions) -- given we believe it is not common case, it will impact just a few customers in a few scenarios.
Diagnosability (it is hard to find out what exactly is wrong with the regex) -- valid concern and one we want to address in future by throwing more specific exceptions (ideally with offsets, etc.)
Convenience (I don't want to write the code over and over again) -- it should be fairly easy to wrap the code above in a method bool RegExExtensions.Validate(string regex). We believe it is not common case, so it is not good candidate to be added into BCL.
Exceptions should be avoided (exceptions should not be used for normal flow of control design pattern) - given we believe it is not common case, the flow is not considered "normal", but considered exceptional.

Most of the arguments are around belief that the dynamic scenario is not common. To change our position (which is possible), we will need more believable mainstream scenario examples or votes/replies from more developers on the issue (signaling to us we may be wrong at commonality assessment).

karelz on 7 Dec 2017

@karelz - CMS is a Content Management System. It is, in its simplest form, used for dynamic management of web and digital content. In its more complex form (as I've found many times is the usage) it is used for globalization of digital resources. There are literally dozens of enterprise-level CMS 3rd party systems for purchase. Many times I've seen large enterprises create in-house/custom CMS systems to solve this business problem. All of these use some sort of underlying datastore - which is typically a database though can be SaaS (software as a service) too. Just take a look here and you may soon be convinced that the concept of the 'dynamic scenario' is quite common and extensive indeed - https://en.wikipedia.org/wiki/List_of_content_management_systems

udlose on 7 Dec 2017

Sure, that's dynamic scenario for webapps. However, it does not mean storing regexes without prior validation in there is the right pattern and common scenario.
I expect only limited set of people can change the regexes.
If there is invalid regex stored there, it is basically a bug in your application. It is exceptional case, not the common, typical case. It is something you want to fix/log rather than live with it forever, right? You typically don't want to treat it as normal bad input from end-user (like editors do) and do nothing, do you?

karelz on 7 Dec 2017

Was this page helpful?

0 / 5 - 0 ratings