PowerShell: Introduce a -matchall operator that finds *all* regex matches, to complement -match

Created on 26 Sep 2018 · 12 Comments · Source: PowerShell/PowerShell

-match is a handy regex-matching operator, but it is limited to finding (at most) _one_ match, as afterwards reflected in the automatic $Matches variable (with a scalar LHS) or directly returned (with an array-valued LHS).
Additionally, the ability to retrieve the matching part of the input and any capture-group values is lost with an array-valued LHS, because $Matches is then not populated.
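The two behaviors of -match described above can be seen in a short session. This is only an illustrative sketch; the variable names are chosen for the example.

```powershell
# Scalar LHS: -match returns a Boolean and records only the FIRST match in $Matches.
$found = 'foo' -match 'o'
$firstMatch = $Matches[0]
$found        # True
$firstMatch   # o  (the second 'o' is never reported)

# Array LHS: -match acts as a filter and returns the matching input elements;
# it does not populate $Matches for them.
$filtered = 'foo', 'bar', 'baz' -match 'a'
$filtered     # bar, baz
```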

In order to find _all_ matches of a given regex, you currently have two options:

  • Pipe to Select-String -AllMatches, but that is inefficient for matching (collections of) strings already in memory.

  • Use .NET directly, via the [regex]::Matches() method, but that makes for an awkward transition from the PowerShell-native -match operator.
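For reference, the two current workarounds listed above look roughly like this. Note that Select-String is case-insensitive by default, whereas [regex]::Matches() is case-sensitive unless options say otherwise.

```powershell
# Option 1: Select-String -AllMatches; the resulting MatchInfo exposes a .Matches array.
$viaCmdlet = ('foo' | Select-String -AllMatches 'o').Matches

# Option 2: the .NET method, which returns a MatchCollection directly.
$viaDotNet = [regex]::Matches('foo', 'o')

$viaCmdlet.Count   # 2
$viaDotNet.Count   # 2
```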

Therefore, a -matchall (-imatchall, -cmatchall) operator could be introduced, as a PowerShell-friendly wrapper for the [regex]::Matches() method:

# WISHFUL THINKING

# Scalar LHS; returns a collection of 2 matches
'foo' -matchall 'o'
# Array LHS; returns 2 collections of 2 matches each
'foo', 'baa' -matchall 'o|a'

could be the equivalent of:

# Scalar LHS
[regex]::matches('foo', 'o')
# Array LHS
[regex]::matches('foo', 'o|a'), [regex]::matches('baa', 'o|a')

That is, the output would be either a single [System.Text.RegularExpressions.MatchCollection] instance, or an array of them, each of which contains one [System.Text.RegularExpressions.Match] instance _per match_.
A [System.Text.RegularExpressions.Match] instance stringifies to the matching part of the input string, if any, and contains capture-group values as well as additional metadata about the match.

In essence, this is also what you get when you access the .Matches property of the [Microsoft.PowerShell.Commands.MatchInfo] instances returned by Select-String -AllMatches (though in the case of Select-String an [object[]] array of [System.Text.RegularExpressions.Match] is returned instead of a [System.Text.RegularExpressions.MatchCollection] instance).
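As a brief illustration of what each [System.Text.RegularExpressions.Match] instance offers, using the existing .NET method (the input string and pattern are made up for the example):

```powershell
$m = [regex]::Matches('key1=a; key2=b', '(\w+)=(\w)')

$first = "$($m[0])"              # a Match stringifies to the text it matched
$first                           # key1=a
$m[1].Groups[1].Value            # key2  (first capture group of the second match)
$m[1].Groups[2].Value            # b
$m[1].Index                      # 8     (position of the second match in the input)
```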

Environment data

Written as of:

PowerShell Core 6.1.0
Issue-Enhancement Up-for-Grabs WG-Language

All 12 comments

100% yes, this would be awesome to have!

I agree! That would be VERY Useful!!
:)

Just to complete the specification for this:

  1. Is everyone OK with returning the raw match collections, without expecting anything additional to be done to the collections/objects? (I just want to make sure.)

  2. Question: in the case where there is an array on the LHS, what should be the result type of the collection of MatchCollections? Collection<MatchCollection> ?

  3. The Matches method takes up to 4 arguments, including Regex options and a timeout. We should support those as well so it would look like:

$strings -matchall 'pattern'[,<options>[,<timeout>]]
  4. The "default" operator -matchall would be case insensitive unless overridden through options, with -imatchall and -cmatchall variants.

  5. Not matching is not considered a failure so $? is always true afterwards.

  6. $Matches is not set by this operator.

  7. TBD - we need to figure out which language modes this operator is available in. Probably the same as -match but we should still think about it.

Have I missed anything?
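The semantics listed above can be approximated today with a small wrapper function. This is only a sketch: Get-AllMatch and its parameter names are hypothetical and not part of PowerShell, and the choice to emit each MatchCollection as a separate object is one possible answer to question 2, not a settled design.

```powershell
# Hypothetical helper approximating the proposed operator; not part of PowerShell.
function Get-AllMatch {
    param(
        [Parameter(Mandatory)] [string[]] $InputString,
        [Parameter(Mandatory)] [string] $Pattern,
        # Case-insensitive by default, mirroring -match.
        [System.Text.RegularExpressions.RegexOptions] $Options = 'IgnoreCase',
        # Regex.InfiniteMatchTimeout means no timeout is applied.
        [timespan] $Timeout = [regex]::InfiniteMatchTimeout
    )
    foreach ($s in $InputString) {
        # The unary comma keeps each MatchCollection intact instead of
        # letting PowerShell enumerate it into individual Match objects.
        , [regex]::Matches($s, $Pattern, $Options, $Timeout)
    }
}

$result = Get-AllMatch 'foo', 'BAA' 'o|a'
$result.Count      # 2  (one MatchCollection per input string)
$result[1].Count   # 2  (both 'A' characters match because of IgnoreCase)
```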

  1. Unless there's something especially useful we could add on, I think the match collections has it pretty well covered.

  2. Yeah, I think that makes some sense. Not sure if you'd want to pack them all in the same collection, but I guess if you needed that you could make it work regardless, so it's safer to have each MatchCollection separated.

  3. Is it worth including an additional argument to just output an array of the matched strings?

4 & 5. Following established patterns. 👍

  6. Yep, sensible.

  7. I'd echo the modes available to -match; introducing any disparity is liable to create obscure and almost always poorly-documented and poorly-understood edge cases.

You missed 8 -- there's a bunch of documentation that'd need to be written up, at least covering probably a brief overview of the object types that it's returning and how to get your precious match strings out, etc. 😉

PS> 'abc' -matchall '(.)'
PS> $matches[0] # abc
PS> $matches[1] # a
PS> $matches[2] # b
PS> $matches[3] # c

Thanks for helping to flesh this out, @BrucePay.

Re 1, output type:

Outputting a MatchCollection instance (loosely speaking, an array of Match instances) is certainly easiest and fastest, but perhaps, for consistency with the automatic $Matches variable, it should be an _array of [hashtable]s_ (whose entry 0 contains the full match, and the other entries the capture-group matches, as in $Matches).

On the flip side:

  • Not all metadata will be available - but perhaps that won't be a problem in practice.
  • [hashtable]s, unlike Match instances, do not stringify meaningfully, so the individual matches cannot be used as-is in expandable strings - though @vexx32's suggestion of having an option to return strings only could address that (see below).

Re 2, collection output type with array-valued LHS:

Again, for consistency, I suggest outputting a regular [object[]] array (which, combined with the suggestion above, would yield an array whose elements are hashtable arrays).
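To make the hashtable idea concrete, here is a rough sketch of converting Match instances into $Matches-style hashtables. ConvertTo-MatchTable is a hypothetical helper name, and keying numbered groups by integer and named groups by string is an assumption modeled on how $Matches behaves.

```powershell
# Hypothetical helper; not part of PowerShell.
function ConvertTo-MatchTable {
    param(
        [System.Text.RegularExpressions.Match] $Match,
        [regex] $Regex
    )
    $table = @{}
    foreach ($name in $Regex.GetGroupNames()) {
        $group = $Match.Groups[$name]
        if ($group.Success) {
            # Numbered groups get integer keys, named groups get string keys.
            $number = 0
            $key = if ([int]::TryParse($name, [ref] $number)) { $number } else { $name }
            $table[$key] = $group.Value
        }
    }
    $table
}

$regex  = [regex] '(?<letter>[a-z])(\d)'
$tables = foreach ($m in $regex.Matches('a1 b2')) { ConvertTo-MatchTable $m $regex }

$tables.Count          # 2
$tables[1][0]          # b2  (full match)
$tables[1][1]          # 2   (unnamed group)
$tables[1]['letter']   # b   (named group)
```

As noted above, the resulting hashtables lose the position metadata and do not stringify to the match.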

Re 3, optional operands:

My vote is _not_ to implement the options parameter (and, by extension, the timeout parameter), because it introduces complexity that may be confusing and not worth the effort:

  • For instance, the IgnoreCase option could cause confusion with the implied case-insensitivity of -matchall and even be at odds with using -cmatchall; ditto for CultureInvariant.

  • For those in the know, using _inline options_ as _part of the regex_ also provides access to the options (at least the most important ones), if desired (e.g., "a`nb" -match '(?s)^.*$' yielding $True due to inline option (?s) causing . to match newlines too).

  • The timeout option strikes me as too obscure a feature to warrant inclusion in PowerShell.
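The points about inline options can be checked with the existing operators. The snippet below is illustrative only; the variable names are chosen for the example.

```powershell
# An inline (?i) overrides the case-sensitivity implied by -cmatch,
# which is the kind of overlap an explicit options operand would also create.
$plain  = 'FOO' -cmatch 'foo'        # False
$inline = 'FOO' -cmatch '(?i)foo'    # True

# (?s) makes . match newlines as well.
$noSingleLine = "a`nb" -match '^.*$'       # False
$singleLine   = "a`nb" -match '(?s)^.*$'   # True

# Inline options work the same way with the .NET method.
$count = [regex]::Matches('FoO', '(?i)o').Count   # 2
```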

However, @vexx32 's suggestion of introducing an optional operand to return strings only is worth considering:

@vexx32, can you elaborate on that? Have each match be an array of strings whose 1st element is the overall match, with subsequent elements containing the capture-group matches? Note that you'd lose access to capture-group matches by _name_ that way (if applicable).
So with a scalar LHS and multiple matches you'd get an array of string arrays?

Re 4, case-insensitivity by default:

👍

Re 5, setting $? to $True:

👍 - it is consistent with -match.

Re 6, not setting $Matches:

👍

@p0W3RH311

Indexing into $Matches is already being used with -match to access the capture-group matches of a _single_ match:

PS> $null = 'abc' -match '(.)'; $matches

Name                           Value
----                           -----
1                              a
0                              a

@mklement0 I was thinking that for simplicity it would be lovely to simply have an optional second param or a secondary parameter (e.g., -rawmatches/-rmatchall perhaps?) that simply outputs an array of matched strings. So, as a very basic example:

$String = "testing, testing"
$String -rmatchall '.'
# outputs: @('t', 'e', 's', 't', 'i', 'n', 'g' ...) (etc)

This would allow almost the mirror operation to -split where instead of targeting pieces to remove in order to break apart a string, you target pieces you want to keep, and it gives you the pieces in a nice and simple array.

For an array-valued LHS (in both this raw case and the initial idea above) I would defer to -split's default handling of an LHS array for consistency. I believe that would generally result in a flat array.

You are correct in that having a "raw matches" sort of parameter would remove the ability to target matches by name, but if you're using it specifically there's unlikely to be a reason to name your matches. (You may of course choose to do so regardless for clarity in the regex, but that's another story.) 😄
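For comparison, a strings-only result along these lines can already be approximated with the .NET method plus member-access enumeration. This is just a sketch of the idea, not the proposed operator; the '\w+' and '\d' patterns are chosen for the example.

```powershell
$String = 'testing, testing'

# .Value is not a member of MatchCollection, so PowerShell collects it from each Match.
$raw = [regex]::Matches($String, '\w+').Value
$raw.Count    # 2
$raw[0]       # testing

# Array input, flattened into a single array of strings (as -split would do).
$flat = 'a1b2', 'c3' | ForEach-Object { [regex]::Matches($_, '\d').Value }
$flat -join ','   # 1,2,3
```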

@vexx32:

Generally, my vote would be for an _optional operand_ rather than a _new operator_ - aside from avoiding the proliferation of operators, it leaves the door open for implementing _multiple_ alternative output forms.

Yes, returning just strings would be handy, but it sounds like for predictability of processing we'd have to omit capture groups altogether in this case, right?
And leave it to users to use the richer default results if they do need capture-group access.

But note that we would _almost_ get the same if we passed the Match instances out as-is, given that they _stringify_ to what a given match captured in full:

# Would become: 'fo1o2' -matchall '.\d' | % tostring
PS> [regex]::matches('fo1o2', '.\d') | % tostring 
o1
o2

In other words: with a single syntax, you get objects that:

  • in a string context conveniently expand to the full match
  • still provide access to capture groups and position information, if needed

But, as stated, it would be a departure from how -match reports match information in $Matches.

That said, perhaps we can enhance the type of $Matches in a backward-compatible way that would allow use of the same type as -matchall return instances:

  • Make $Matches stringify to $Matches[0], i.e., the full match (I don't think anyone will miss the current stringification behavior, which is expansion to literal 'System.Collections.Hashtable'.)
  • Make $Matches an _ordered_ dictionary, while we're at it (just so that it _displays_ as expected).

Of course, this transformation from Match instances carries a performance cost.


As for producing a flat array with an array LHS: You're right, that is indeed what -split does:

PS> ('a,b', 'c,d' -split ',').count
4

Is this issue still available? I would like to work on this issue.

It's all yours! 💖

Following up from #11755: If -allmatches returns [System.Text.RegularExpressions.Match] instances (at least by default), you'll also be able to use it indirectly to get detailed information about a _single_ (the first) match, which -match with the $Matches variable won't give you:

PS> ('hello' -allmatches 'e')[0]  # same as: [regex]::Match('hello', 'e')

Groups   : {0}
Success  : True
Name     : 0
Captures : {0}
Index    : 1
Length   : 1
Value    : e