Runtime: Extend Regex capabilities

Created on 14 May 2018  ·  21Comments  ·  Source: dotnet/runtime

1- Let Regex measure word distances

Regex is a great tool for searching text, but it lacks patterns that measure the similarity between words, such as the Levenshtein distance algorithm. This may not matter much in English, because different forms of a word usually share a letter sequence, as in (accura)te, in(accura)te and (accura)cy. That is not always true in other languages. In Arabic, for example, the words يلعب, لاعب and ألعاب all come from the root لعب but follow different patterns. These words can't be matched with a regex pattern unless a morphological analyzer first extracts the root and uses it to generate every regex pattern needed to match all the word forms of that root, which is too complicated.
The easy solution is to measure the distance between words: words with a smaller distance are more likely to be related. I implemented and used Levenshtein distance to do so, and there are some NuGet packages that use such algorithms. But this approach is approximate, and it can be improved by combining it with regex patterns. Things would be easier, and regex more powerful, if it could use such algorithms as part of its patterns, so we could find related words based on complicated criteria. Besides, analyzing the text once with regex is more efficient than analyzing it twice with separate tools.
A suitable regex syntax for supplying the accepted distance range could be something like this:
\b?<xit, 1, 4>\b
This searches for any whole word whose distance from the sequence xit is between 1 and 4.
There are many such algorithms, for example:

  • Levenshtein distance
  • Normalized Levenshtein
  • Weighted Levenshtein
  • Damerau-Levenshtein
  • Tanimoto coefficient
  • Hamming distance
  • Optimal String Alignment
  • Jaro-Winkler
  • Longest Common Subsequence
  • Metric Longest Common Subsequence
  • N-Gram
  • Q-Gram
  • Cosine similarity
  • Jaccard index similarity
  • Sorensen-Dice coefficient
Regex could implement some of them and let us choose which one to use through an option passed to the constructor.
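As a rough illustration of what is possible today without engine support, the two steps can be combined by hand: let Regex find the whole words, then filter them by edit distance. This is only a sketch under stated assumptions. The names `FuzzyWordSearch`, `Levenshtein` and `WordsNear` are hypothetical and not part of any .NET API, and the distance compares UTF-16 code units rather than full Unicode grapheme clusters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of System.Text.RegularExpressions.
public static class FuzzyWordSearch
{
    // Levenshtein distance using two rolling rows of the DP table.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // substitution or match
            }
            (prev, curr) = (curr, prev); // the row just filled becomes "prev"
        }
        return prev[b.Length];
    }

    // Returns every whole word whose distance from target lies in [min, max].
    public static List<string> WordsNear(string text, string target, int min, int max)
    {
        var hits = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\b\w+\b"))
        {
            int d = Levenshtein(m.Value, target);
            if (d >= min && d <= max) hits.Add(m.Value);
        }
        return hits;
    }
}
```

For instance, `FuzzyWordSearch.WordsNear("exit xit exits sit", "xit", 1, 4)` keeps `exit`, `exits` and `sit` and drops the exact match `xit`, whose distance is 0. Unlike the proposed `\b?<xit, 1, 4>\b` syntax, the distance test here runs after matching, so it cannot take part in the rest of a larger pattern.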

2- Allow us to add callback functions in the regex pattern:

I have an idea: can regex allow us to call a custom callback function as part of the pattern? Say:
\b?<funcName(\w{2,6})>\b
where funcName can be the name of any callback function of the form:
bool funcName(string);
Each time the regex evaluates the supplied expression (such as \w{2,6}), it sends the matched string to the callback function. If the function returns true, the regex considers this string a match and continues evaluating the remaining pattern at this position.
This would allow us to use whatever algorithms we need as part of the pattern.
This is a "simple" regex to match an IP address:
\b(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))(?:\.(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))){3}\b
It became that complex because of the need to check that each part is no greater than 255. This could be very simple if we match up to three digits and pass them to a callback function to validate, like this:
\b(?Validate(\d{1,3}))(?:\.(?Validate(\d{1,3}))){3}\b
Where Validate is defined as:

bool Validate(string No)
{
   if (byte.TryParse(No, out byte b))
       return true;
   return false;
}

And that is all!

3- Add validation expressions:

Another solution for this issue, to add to your shelf for future improvements:
Instead of calling an external function, regex could define some validating expressions using Len( ), val( ) and so on, similar to SQL.
'''\b(?n=\d{1,3}, val(n)<256)(?:\.(?d=\d{1,3}, val(d)<256)){3}\b'''

4- Add named subexpressions:

It would also be helpful if we could define named subexpressions:
'''(?DEF validNo:(n=\d{1,3}, val(n)<256))\b\'validNo'(?:\.\'validNo'){3}\b'''

Compare the above to:
'''\b(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))(?:\.(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))){3}\b'''
Which is more readable, easier to write, and more efficient?
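For comparison, part of the readability benefit of a named subexpression can be approximated today by composing the pattern from a C# constant. This is a minimal sketch, not the proposed `(?DEF ...)` syntax. The names `PatternParts`, `Octet` and `Ip` are hypothetical, and the numeric check is still done by the regex alternation rather than by a `val(n)<256` test.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical holder for reusable pattern fragments.
public static class PatternParts
{
    // One octet in the range 0-255, written once and reused below.
    public const string Octet = @"(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))";

    // In an interpolated string, "{{3}}" is emitted as the literal quantifier "{3}".
    public static readonly Regex Ip = new Regex($@"\b{Octet}(?:\.{Octet}){{3}}\b");
}
```

With this, `PatternParts.Ip.IsMatch("192.168.1.1")` is true and `PatternParts.Ip.IsMatch("256.1.1.1")` is false, and the final pattern string is identical to the long hand-written one.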

api-suggestion area-System.Text.RegularExpressions

All 21 comments

This feels more like natural language analysis. Something that is useful, but should be built on top of BCL, not inside.
I am not convinced that RegEx is the right place for APIs like these ...

Agree with what Karel says. Building these algorithms on top of the BCL and delivering them as a nuget package makes most sense here IMO. Still thanks @MohammadHamdyGhanem for your interest in Regex 👍

Closing per agreement above.

@karelz @ViktorHofer
I see, but using them as part of a pattern differs from using them alone.
I have an idea: can regex allow us to call a custom callback function as part of the pattern? Say:
\b?<funcName(\w{2,6})>\b
where funcName can be the name of any callback function of the form:
bool funcName(string);
Each time the regex evaluates the supplied expression (such as \w{2,6}), it sends the matched string to the callback function. If the function returns true, the regex considers this string a match and continues evaluating the remaining pattern at this position.
This would allow us to use whatever algorithms we need as part of the pattern.
What about that?

@karelz

> Closing per agreement above.

I suggested another solution that doesn't need anything implemented in the BCL.

> I have an idea: can regex allow us to call a custom callback function as part of the pattern? Say:
> \b?<funcName(\w{2,6})>\b
> where funcName can be the name of any callback function of the form:
> bool funcName(string);
> Each time the regex evaluates the supplied expression (such as \w{2,6}), it sends the matched string to the callback function. If the function returns true, the regex considers this string a match and continues evaluating the remaining pattern at this position.
> This would allow us to use whatever algorithms we need as part of the pattern.
> What about that?

Please don't close issues too soon.

I closed the issue per your original proposal to bake word distances into RegEx.
Your new proposal arrived after I closed it. Technically speaking, your new proposal is an entirely different API/feature proposal from the original one; they just share the same motivation (natural language search).

Your new proposal is a general extension of RegEx to call functions (not available in other implementations AFAIK https://en.wikipedia.org/wiki/Regular_expression).
It needs to consider security implications as well as how it fits the current model of the RegEx class. Are there other scenarios which need it? (Building a general-purpose concept might be overkill for addressing a narrow scenario.)

I believe that the current implementation of RegEx will be hard to extend in such a general way. It may be much easier to create a new natural-language API (maybe similar to RegEx) and evolve it in a separate package to figure out if it is the right solution even for the natural-language scenarios. Once proven useful and general enough to be applicable in other scenarios, it would make sense to start thinking about how to leverage its parts in BCL/CoreFX.

If you believe there are more scenarios and you want to start a discussion with other community members about them, I'd be ok to leave such an issue open for a while to see if it sparks any interest from the community. I would strongly recommend creating a crisp API suggestion first, using natural language search as a minor motivation and application example, not as the primary focus of the issue. The primary focus should be the API itself and its general capabilities and limitations (maybe even with an outlook into implementation).

Thanks @karelz.
I changed the Title and the content of the issue to fit the new idea, so we can discuss it even if the first part is rejected. So you may reopen it.

The first part was already rejected. It might be best just to take the 2nd part into a separate issue and provide more details about it there (the idea is still very raw).

@MohammadHamdyGhanem thank you for your various recent API suggestions. You may have found we are rather picky! Allow me to share some thoughts.

As you know the cost of an API is not just its implementation but also support and maintenance for a long time, cognitive load, opportunity cost, cost of porting, and other costs. So it has to be more than just a good idea. In the C# language they have said that every feature starts with -100 points - although I don't think the bar for API is quite so high, it is analogous in that it has to be more than just a good idea.

Among other considerations, we look for evidence that it would be widely used. How many people would immediately consume it if we shipped it? It would not be good if we had to maintain something that only a few people used. That evidence might be, for example, similar features in heavily used analogous platforms, or many reimplementations in GitHub projects based on .NET, or a heavily used complementary API. Also possible evidence is a lot of thumbs-ups and comments from other users on this repo. If an API does not necessarily have to be implemented in the runtime (unlike Span<T> for example), it would be interesting if there are libraries that already expose the feature successfully and it gets widely used in those libraries.

Another important consideration is whether it belongs in the platform, rather than a library. Sometimes the platform can make a "better" (faster, more complete) implementation because of its nature - or it really has to be part of a type that is already in the runtime.

Taking the case of using Regex to measure word distance - it is not clear there is such evidence of widespread use. Also, although we do have a Regex engine in the platform, it is not clear why it is important for measuring word distance to be part of that engine, rather than in a library on NuGet.

Hope that along with @karelz 's remarks this gives helpful perspective.

@danmosemsft
Thanks for your concern. I'm aware of your policy, and you are free to make whatever decisions you like about my suggestions. I only have some ideas and no knowledge of their popularity or how many others will need them, so I just share them here and hope for the best. I gain great benefit from the discussions and get more info about some NuGet packages or future features in progress, so I hope you don't close issues prematurely. Besides, I intend to apply my ideas anyway. For example, I finished improving VerbalExpressions (the Regex builder) and I will try to write my LINQ to Regex based on it.
But if I may say so, I am not a fan of Microsoft's strategy in recent years. It seems that MS feels OK letting others lead the market (especially in ASP.NET) and then following in their steps. We got used to waiting for other programmers or companies to innovate things like jQuery, Ajax, MVC, Xamarin, WebAssembly, etc. so that MS finally adopts them! Even Docker started as a graduate project by some students who started a revolution in the OS field! Seems MS got old and wise enough to let others take risks for her!
Thanks.

@MohammadHamdyGhanem Not sure that's fair. For example, have you read http://blog.stevensanderson.com/2018/02/06/blazor-intro/? The ASP.NET team is taking a risk, investing a lot of time into an experimental project just because the possibilities are so cool.

@jnm
I am aware of and excited about Blazor. But it is built on WebAssembly. Years ago I asked for a client-side framework that we could target with VB or C#. I was told about a project called ScriptSharp that uses C# to write client code, which is then translated to a JavaScript file that I can add to my project. I still have a copy of this project on my PC, but it was never completed! This was not the only missed opportunity, since Silverlight was a promising technology that was also aborted (once I heard of Blazor, I asked to make it use XAML instead of HTML5 to revive Silverlight). It seems MS became too cautious after some bad choices in the web and phone markets, and from my point of view, the worst decision of all is to leave VB.NET behind and prefer even F# over it. Obeying the standards shouldn't mean getting rid of what made you special and popular in the first place, and gaining new programmers on other systems shouldn't come at the cost of losing what you already have! I'm happy that MS tries to fill the gap and conquer new worlds, but this doesn't mean giving up its leadership or neglecting its people on the mainland!

@karelz @ViktorHofer @danmosemsft @jnm2 @radical @CyrusNajmabadi
This is a "simple" regex to match an IP address:
\b(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))(?:\.(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))){3}\b
It became that complex because of the need to check that each part is no greater than 255. This could be very simple if we match up to three digits and pass them to a callback function to validate, like this:
\b(?Validate(\d{1,3}))(?:\.(?Validate(\d{1,3}))){3}\b
Where Validate is defined as:

bool Validate(string No)
{
   if (byte.TryParse(No, out byte b))
       return true;
   return false;
}

And that is all!
By the way, I generated the above regex by this Verex:

var validNo = ("1".Maybe() & Digit[1, 2]) |
                       ("2" & 
                              (
                                  (InRange('0', '4') & Digit) | ("5" & InRange('0', '5'))
                              )
                       );

var IP = WordEdge & validNo & ("." & validNo)[3] & WordEdge;

Regex can be improved further, and I'm not convinced by your argument about compatibility with other languages and books. We are .NET programmers, and we can't run our code in other languages directly, so why do we care about other regex implementations?
Make .NET regex the easiest and most powerful implementation ever, and let others follow you, or force these new features into the standards!
Think of making .NET developers happier and more productive, and forget about anything else!
Thanks.

Being able to call a function from inside a Regex pattern is a feature that exists in Perl today. Unfortunately we currently don't have any plans to add that feature to .NET.

@ViktorHofer
:(

@MohammadHamdyGhanem A core virtue here is that you can just write this and supply it as a nuget package. :) If you think this idea is virtuous, make it available for people to use. This is actually better as you can innovate much faster than .Net itself is able to do with all the restrictions in place that they have to consider.

--

Also, it's very easy to have named groups in regex and just write something that walks it and validates those named groups however you want. :) Needing core language support doesn't seem that necessary.
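A minimal sketch of that suggestion, reusing the `Validate` function from the issue text: a loose pattern captures each octet into a named group, and ordinary C# code then checks every capture. The names `IpMatcher`, `Candidate` and `FindIps` are hypothetical, and the approach is only one possible reading of "walk the named groups and validate them".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating post-match validation of named groups.
public static class IpMatcher
{
    // Loose pattern: four dot-separated runs of 1-3 digits.
    // .NET lets a group name repeat; every capture is kept in Group.Captures.
    private static readonly Regex Candidate =
        new Regex(@"\b(?<octet>\d{1,3})(?:\.(?<octet>\d{1,3})){3}\b");

    // Same check as Validate in the issue: the text must parse as a byte (0-255).
    public static bool Validate(string no) => byte.TryParse(no, out _);

    // Keeps only the candidates whose four captured octets all pass Validate.
    public static string[] FindIps(string text) =>
        Candidate.Matches(text)
                 .Cast<Match>()
                 .Where(m => m.Groups["octet"].Captures
                              .Cast<Capture>()
                              .All(c => Validate(c.Value)))
                 .Select(m => m.Value)
                 .ToArray();
}
```

For example, `IpMatcher.FindIps("a 192.168.1.1 b 256.1.1.1 c 10.0.0.255")` keeps `192.168.1.1` and `10.0.0.255` and rejects `256.1.1.1`. One difference from an in-pattern callback is that a rejected candidate is simply dropped here; the engine is not asked to backtrack and try another way of matching at that position.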

I thought about that, but I expect the regex code is complex and needs some time to study before I mess with it!
Also, I am still adding new suggestions about regex, and it is better to wait until the whole picture is complete. I asked an expert in natural languages about regexes in that area, and he said that there are tons of papers about it. I also want to look at the ML framework and AI tools Microsoft added to .NET before reinventing the wheel. So in fact my pace is much slower than the MS team's, and this is what I have been trying to tell you from the beginning! I don't prefer to work alone on such projects, and working alone shouldn't be encouraged in the open source community. And as I told you before, the most important things in programming are the idea, the analysis and the algorithm. Implementation is the easy part. I am a poet and a sci-fi writer besides being an engineer and a programmer, so I like working with natural languages the most. I will give this area some effort, including regex, search indexers, and machine learning, but alone, this will take years!

@ViktorHofer
Another solution for this issue, to add to your shelf for future improvements:
Instead of calling an external function, regex could define some validating expressions using Len( ), val( ) and so on, similar to SQL.
\b(?n=\d{1,3}, val(n)<256)(?:\.(?d=\d{1,3}, val(d)<256)){3}\b
It would also be helpful if we could define named subexpressions:

(?DEF validNo:(n=\d{1,3}, val(n)<256))\b\'validNo'(?:\.\'validNo'){3}\b

Compare the above to:
'''\b(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))(?:\.(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))){3}\b'''
Which is more readable, easier to write, and more efficient?
Thanks.

@MohammadHamdyGhanem the features we take into BCL or language need to be baked and rock solid. Once we take them, we live with them forever.
For that reason, we can't act on just "potentially interesting ideas" (there are thousands of such ideas), unless it is something we feel we should focus on (e.g. Span), or it is something lots of customers need/want (top voted issues). As you see, RegEx is neither (at this moment).

Moving things into separate NuGet packages is the right way to bake ideas and show their usefulness (by growing download numbers). Your ideas need baking and validation of how useful they are.
You don't need to do things alone, you just need to find the right group of people who also believe in usefulness and importance of RegEx. And that is indeed hard. They need to be passionate about RegEx and/or believe that current RegEx state is inefficient to use and having something better would help them and others significantly.
So far we haven't seen much interest in the space from community, but rest assured that if we see some, we will send them to this issue.

Thanks @karelz.
I hope I can put all my ideas about regex in a NuGet and add some features to make it more useful and interesting. But as I said this will take time. I will start by publishing Verex (my verbal Regex) today or tomorrow, and see where to go from there.

Here is Verex (my verbal regex builder):
https://github.com/MohammadHamdyGhanem/Verex
Thanks
