PowerShell: string.Split doesn't work with multiple separator characters

Created on 1 Oct 2020 · 14 Comments · Source: PowerShell/PowerShell

Steps to reproduce

"1x2y3".Split(@('x', 'y'))

Expected behavior

1
2
3

Actual behavior

1x2y3

Environment data

Name                           Value
----                           -----
PSVersion                      7.0.3
PSEdition                      Core
GitCommitId                    7.0.3
OS                             Microsoft Windows 10.0.19041
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Comments

Works correctly in Windows PowerShell 5.1.19041.1 (the same command returns the result described in Expected behavior).
The corresponding method in .NET Core 3.1 also splits the string correctly:

foreach (var item in "1x2y3".Split(new[] { 'x', 'y' }))
    Console.WriteLine(item);

output:

1
2
3

A single separator works correctly in PS 7.0.3:

"1x2y3".Split(@('x'))

output:

1
2y3

Also in this syntax:

"1x2y3".Split('x')

output:

1
2y3

I'm not sure whether this is also a bug (or in which PS version), but there is one more difference between PS 5 and PS 7: PS 5 seems to treat "xy" as an array of two separators, while PS 7 seems to treat it as a single two-character separator. The PS 7 behavior seems more correct here:

PS 5:

"1x2y3xy4".Split("xy")

output:

1
2
3

4

PS 7:

"1x2y3xy4".Split("xy")

output:

1x2y3
4

Issue-Question Resolution-Duplicate

Source

jjanuszkiewicz

Most helpful comment

I wouldn't say it's not _possible_ to select a string[] or char[] overload instead, but as @SeeminglyScience mentioned, it's quite hard to know if changing it to improve this one case might make others fall down where they were previously working just fine, or if it might create performance concerns in some scenarios.

But yes, currently the overload resolution / casting behaviour will prefer to transform object[] to string in that way rather than try to match it to an appropriate array type like char[] or string[].

From memory, non-Core PowerShell (Windows PowerShell) doesn't actually have a string overload on .Split(), so it's forced back to char[] or string[] because there isn't a "better" option in the older .NET runtimes.

The logic hasn't changed in PowerShell, it's just that .NET added more overloads for that method, and that causes PowerShell to select one of the new ones preferentially due to the existing logic.

vexx32 on 1 Oct 2020

👍2

All 14 comments

My gut feeling is that @('x', 'y') is converted to a string and an incorrect overload of String.Split from the runtime is called - i.e. one accepting a string separator instead of one accepting char[] or params char[]. But I might be totally wrong, haven't tried looking at the code.

jjanuszkiewicz on 1 Oct 2020

I think the issue here is just that .NET added new overloads to the Split() method since .NET Framework 4.8, so you're getting different methods selected because there are now new ones available. The available overloads in pwsh 7.1 are:

OverloadDefinitions
-------------------
string[] Split(char separator, System.StringSplitOptions options)
string[] Split(char separator, int count, System.StringSplitOptions options)
string[] Split(Params char[] separator)
string[] Split(char[] separator, int count)
string[] Split(char[] separator, System.StringSplitOptions options)
string[] Split(char[] separator, int count, System.StringSplitOptions options)
string[] Split(string separator, System.StringSplitOptions options)
string[] Split(string separator, int count, System.StringSplitOptions options)
string[] Split(string[] separator, System.StringSplitOptions options)
string[] Split(string[] separator, int count, System.StringSplitOptions options)

So you're most likely getting the string or string[] overload there, but I'm unsure which is more likely to be selected in this case. @SeeminglyScience probably remembers (or knows how to find out; I don't recall).
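A listing like the one above can be reproduced by referencing the method without parentheses and reading its OverloadDefinitions property. This is a minimal sketch; the variable name is only illustrative:

```powershell
# Referencing a method without invoking it yields a PSMethod object;
# its OverloadDefinitions property lists every signature the runtime offers.
$overloads = 'any string'.Split.OverloadDefinitions

# Show only the overloads whose first parameter is a plain string.
# On .NET Framework (Windows PowerShell) this filter is expected to return nothing.
$overloads | Where-Object { $_ -match '\(string separator' }
```

Comparing this output between Windows PowerShell and pwsh is a quick way to check whether a newly added overload could explain a change in behavior.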

You'll most likely need to explicitly cast to [char[]] in order to get the correct overload selected for your use case:

"1x2y3".Split([char[]]("x","y"))

.Split("xy") behaving differently in PS 7 is likely due to the newly added string overload; previously PS didn't have that option and would instead convert the argument to a char[] while choosing which method to call.
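As a hedged sketch of the casting approach, the calls below pin the overload for both cases discussed so far: a set of single-character separators and one multi-character separator. The variable names are illustrative, and the options argument in the second call is there because the string[] overloads all take at least one additional parameter:

```powershell
# Character set: cast to [char[]] so that Split(params char[]) is an exact match.
$byChars = '1x2y3'.Split([char[]] ('x', 'y'))
$byChars     # 1, 2, 3

# Whole-string separator: [string[]] plus an explicit options argument,
# matching Split(string[] separator, StringSplitOptions options).
$byString = '1x2y3xy4'.Split([string[]] @('xy'), [System.StringSplitOptions]::None)
$byString    # 1x2y3, 4
```

Both overloads used here exist in .NET Framework as well as .NET Core, so the calls should behave the same in Windows PowerShell and PowerShell 7.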

vexx32 on 1 Oct 2020

> So you're most likely getting the string or string[] overload there, but I'm unsure which is more likely to be selected in this case. @SeeminglyScience probably remembers (or knows how to find out; I don't recall).

Yeah it's the string one.

> You'll most likely need to explicitly cast to [char[]] in order to get the correct overload selected for your use case:

👍

Also related/dupe: #11720

SeeminglyScience on 1 Oct 2020

👍1

Hmm, so the binder's preferring object[] -> string over object[] -> string[] / char[]. That seems a little odd to me tbh, but I'm unsure if changing that would have wide implications that might be undesirable (or if changing it is even really feasible).

It's also important to note that the call doesn't split anything because converting the array to a string joins its elements with a space by default, so these two calls are equivalent:

"1x2y3".Split(@('x', 'y'))

"1x2y3".Split('x y')

And as you saw with your .Split("xy") example, .NET (in current versions) looks for the complete string to split on rather than a set of characters.
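The conversion described above can be observed on its own, without calling Split at all. The sketch below assumes a session where $OFS (the preference variable that controls the joining separator) has not been set; the variable names are illustrative:

```powershell
# Casting an array to [string] joins its elements with $OFS,
# which behaves as a single space when the variable is not defined.
$joined = [string] @('x', 'y')
$joined           # x y

# Setting $OFS changes the separator; the child scope keeps the change local.
$joinedNoSpace = & { $OFS = ''; [string] @('x', 'y') }
$joinedNoSpace    # xy
```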

An option that will work the same way in both PS versions is to use -split '[xy]' which uses PS's own (regex) operator rather than relying on the .NET API, which is subject to changes / additions that PowerShell can't control across different versions.
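A short sketch of the -split approach follows. The second half goes slightly beyond what was suggested above and shows one possible way to build the pattern when the separators come from data; the sample separators and input string are hypothetical:

```powershell
# A character class splits on any one of the listed characters.
$parts = '1x2y3' -split '[xy]'
$parts      # 1, 2, 3

# When separators come from data, escape each one and join them into an alternation
# so that regex metacharacters such as '.' or '|' are matched literally.
$separators = '.', '|'    # hypothetical input
$pattern = ($separators | ForEach-Object { [regex]::Escape($_) }) -join '|'
$escaped = 'a.b|c' -split $pattern
$escaped    # a, b, c
```

Because -split treats its right-hand side as a regular expression, the escaping step matters whenever a separator could contain a character with special meaning in regex.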

I'll mark this as a duplicate of #11720

vexx32 on 1 Oct 2020

👍2

> Hmm, so the binder's preferring object[] -> string over object[] -> string[] / char[]. That seems a little odd to me tbh, but I'm unsure if changing that would have wide implications that might be undesirable (or if changing it is even really feasible).

Feasible for sure, but yeah... who knows what that would break.

Also just to demo/prove it a little more:

'1x y2z3'.Split(@('x', 'y'))
# 1
# 2z3

SeeminglyScience on 1 Oct 2020

Thanks for the explanations, I now better understand what's happening here.

I still think that @('x', 'y') getting converted/bound to string rather than to string[], resulting in a "wrong" overload being called, could be considered a bug. Is this because @('x', 'y') is an object array in PS, so the conversion to a char or string array is not possible and the default "to string" conversion takes place? However, non-Core PowerShell is somehow able to call one of the Split overloads that accept char[] or string[]. If a call like that is possible, shouldn't one of those methods also be chosen as a better-matching overload in PS Core? Or is this a known/documented breaking change? I'd appreciate a link to the docs if so.

I know that the built-in -split works fine, I've actually moved on to using it already, but it took me some time. I was adapting an existing script to PS Core and that change in behavior surprised me.

jjanuszkiewicz on 1 Oct 2020

The logic hasn't changed in PowerShell, it's just that .NET added more overloads for that method, and that causes PowerShell to select one of the new ones preferentially due to the existing logic.

vexx32 on 1 Oct 2020

👍2

@jjanuszkiewicz

> Is this because @('x', 'y') is an object array in PS

A quick aside: Yes, PowerShell array literals are always [object[]]-typed, though @(...) isn't really an array literal; it is a (conceptually unnecessary) application of the array-subexpression operator (even though it may get optimized away), whose purpose isn't to _create_ arrays but to _guarantee_ that command output is an array, by wrapping a single output object in one. That said, it too always outputs [object[]].

In principle, the advice from https://github.com/PowerShell/PowerShell/issues/11720#issuecomment-579866445 applies here as well; to recap:

- A new overload added to a .NET method may cause the PowerShell engine to select it in situations where it previously selected a _different_ overload, due to the new overload now being a _better_ fit.
- This is an _unavoidable_ consequence of PowerShell being a _late-bound_ language, and it is why you should generally prefer PowerShell-native solutions to .NET method calls (-split vs. .Split(), for instance).
- The alternative, cumbersome and possibly non-obvious, is to match the method signature _precisely_, using _casts_; only this guarantees long-term stability.

However, I do wonder if there are problematic overload resolution behaviors we need to address:

- Generally, I don't think converting any array implicitly to [string] is useful in method-parameter binding - we don't even allow it in advanced functions (function foo { [CmdletBinding()]param([string] $foo) $foo }; foo -foo one, two fails)
  - In particular, _how_ the conversion is performed is non-obvious (it's the same mechanism as in expandable strings: space-concatenated elements by default, separator configurable via $OFS).
  - While changing that would certainly be a breaking change, I wonder how much existing code actually relies on this.
- ~~In the particular case of the String.Split() overloads there is an oddity that I can't explain~~: notably, a [string[]] cast only works in the _2+_ argument form; @SeeminglyScience, perhaps you can shed some light on this:

_Update_: The reason is that there's no string[]-_only_ overload, all such overloads have additional parameters - see below.

# !! MALFUNCTIONS, despite exact type match with [string[]], still binds to the [string] overload
PS> '0foo 1bar2'.Split([string[]] ('foo', '1')) # !! Same as: '0foo 1bar2'.Split('foo 1')
0
bar2

# OK: With exact type match *and an additional argument (`options`)*
PS> '0foo1bar2'.Split([string[]] ('bar', 'foo'), 'None')
0
1
2

mklement0 on 1 Oct 2020

> In the particular case of the String.Split() overloads there is an oddity that I can't explain: notably, a [string[]] cast only works in the _2+_ argument form; @SeeminglyScience, perhaps you can shed some light on this:

So method invocation constraints only help guide binding; they can't force an invalid overload. More specifically, in this case there is no Split(string[]) overload, so it reverts to ~~Split(params char[])~~ Split(string, StringSplitOptions = default).

SeeminglyScience on 1 Oct 2020

👍1

No, I think you're correct, @SeeminglyScience: while _some_ overloads have optional parameters, the string[] overloads do not, and I mistakenly assumed they did.

The string[] overloads as of 3.1 / 5.0RC - _no_ optional parameters:

public string[] Split (string[]? separator, int count, StringSplitOptions options);
public string[] Split (string[]? separator, StringSplitOptions options);

mklement0 on 1 Oct 2020

Yeah, sorry, I meant I was wrong about the overload it falls back to; I corrected it.

SeeminglyScience on 1 Oct 2020

👍1

I think it boils down to this:

> I don't think converting any array implicitly to [string] is useful in method-parameter binding

To my taste, the array-to-string conversion could be dropped altogether, and an exception about a missing method overload could be thrown instead. At least that would tell you immediately what's going on. But I guess this is just how PS is supposed to work. I'd be happy to see an improvement in this area; I know I'm not the only one who finds this confusing.

Shall I close this issue, or does anyone want to add something or propose a change here?

jjanuszkiewicz on 2 Oct 2020

@jjanuszkiewicz, I agree.

> But I guess this is just how PS is supposed to work.

Well, it works _inconsistently_ in that regard: cmdlets and advanced scripts/functions do _not_ perform this conversion, but method calls and non-advanced scripts/functions do, so I'd say that this _helps_ the argument that we _consistently shouldn't_ do it.
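The inconsistency can be illustrated with two otherwise identical functions. This is a sketch based on the behavior described in this thread; the function and variable names are made up for the example:

```powershell
# Non-advanced function: an array argument is silently joined into one string.
function Get-SimpleText { param([string] $Text) $Text }

# Advanced function: the same argument is rejected during parameter binding.
function Get-AdvancedText { [CmdletBinding()] param([string] $Text) $Text }

$simple = Get-SimpleText -Text one, two
$simple            # one two

$advancedFailed = $false
try { $null = Get-AdvancedText -Text one, two }
catch { $advancedFailed = $true }
$advancedFailed    # True
```

Method calls currently follow the first pattern, which is why passing an object[] to .Split() ends up bound to a string parameter instead of raising an error.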

The crux, as it frequently is, is backward compatibility:

While I personally find it hard to imagine that someone actually _relies_ on the current behavior, the challenge is that analyzing existing code for such uses is a non-trivial undertaking.

Note that we do have a "bucket" system to categorize breaking changes, and notably among them is Bucket 3: Unlikely Grey Area, which permits breaking changes if the benefit of the change outweighs the risk of breaking a _small_ number of scripts.

Another, less desirable option is to make a change opt-in, such as via Set-StrictMode, though the lack of granularity there is problematic.

In short: I encourage you to close _this_ issue and instead create a new one, of type "Feature Request/Idea", in which you ask for the change you propose.

mklement0 on 2 Oct 2020

This issue has been marked as duplicate and has not had any activity for 1 day. It has been closed for housekeeping purposes.