Powershell: System.Xml.XmlElement instances aren't consistently treatable as single-element collections

Created on 25 Nov 2019 · 13Comments · Source: PowerShell/PowerShell

The PSv3+ unified handling of scalars and collections brought the ability to both call .Count on a scalar and index into a scalar (with only [0] and [-1] being meaningful indices); e.g.:
(42).Count and (42)[0])

With XmlNode instances this works with .Count, but not with indexing, which is an inconsistency that should be resolved.

_Update_: XmlElement instances have a type-native indexer that accepts a _string_, i.e. the _name of a child element_. While that indexer should remain available (even though PowerShell's convenient dot notation makes it virtually unnecessary), a _numeric_ index should be interpreted as with other scalars. Given that XML element names by definition must not start with _digits_, there is no risk of conflict, and no backward-compatibility concern.
It is worth generalizing this change for all types as follows: if a type has a native indexer with an argument type _other than_ object or index, conceive of PowerShell's positional indexer as an int-specific overload that is selected whenever the indexer is called with an int argument.

Steps to reproduce

# OK: Indexing works with 2+ sibling elements, because PowerShell returns
#     the 'bar' child elements as an *array*, where indexing works.
([xml] '<foo><bar id="1"/><bar/></foo>').foo.bar[0].id | Should -Be '1'

# Broken: A single element cannot be indexed into, because the single 'bar' child
#         is returned as itself, a System.Xm.XmlElement instance whose *type-native
#         indexer* preempts PowerShell's automatic indexer - even though the
#          type-native indexer only supports *strings* (child element names).
([xml] '<foo><bar id="1"/></foo>').foo.bar[0].id | Should -Be '1'

Expected behavior

Both tests should pass.

Actual behavior

The 2nd test fails, because indexing into the single XmlElement instance failed quietly:

Expected 1, but got $null.

As an aside: The type-native indexer works, but offers no benefit over PowerShell's dot notation, and is also limited to returning the _first_ child by the given name.

# Access child element <bar> via the type-native indexer, ['bar']
# Using `foo.bar.id` would be simpler.
# Also, the type-native indexer doesn't allow returning *multiple* children by that name.
([xml] '<foo><bar id="1"/></foo>').foo['bar'].id | Should -Be 1

Environment data

PowerShell Core 7.0.0-preview.6

Issue-Question Resolution-Answered WG-Engine

Source

mklement0

All 13 comments

XmlElement defines an indexer, so this would be expected behavior. Same reason hashtable can't work that way:

(@{ 0 = 'Test' }, @{ 0 = 'Test' })[0]
# Gives first hashtable
@{ 0 = 'Test' }[0]
# Gives 'Test'

SeeminglyScience on 25 Nov 2019

👍1

Good point, @SeeminglyScience.

With hashtables, the type-native indexer being given precedence is necessary and I've never run into a problem with that.

With [string], giving precedence to the type-native indexer is problematic, but unavoidable, because the type-native and PowerShell's indexers are both [int]-typed.
This gives us problematic behaviors such as:

(& { param($count) , 'foo' * $count } 2)[0]
foo # OK, array indexer

(& { param($count) , 'foo' * $count } 1)[0]
f   # !! string indexer took precedence

(Honestly, I wish that strings weren't directly indexable in PowerShell, but that would obviously be a breaking change.)

However, I think we can and should avoid this unfortunate collision - which interferes with PowerShell's unified handling of scalars and collections - in the case of XmlElement:

the native indexer is [string]-typed (child element names)
Xml element names must not start with digits anyway (PowerShell swallows the exception in the indexer).

So with an [int] (or generally numeric) index argument, we should apply PowerShell's positional indexer, as usual, to get the desired unified behavior.

No existing code would break, because code that currently passes numbers by definition cannot work, because no child element name can match (always yields $null).

Generally, given the convenience of PowerShell's dot notation, there is no good reason to use the type-native indexer anyway, and I have yet to see it used in the wild in PowerShell code.

mklement0 on 25 Nov 2019

I think that sort of inconsistency would make the code sort of difficult to reason about. Special casing the index binder for a specific type with a specific literal argument type (int matches string for overload selection, so you'd have to be very specific in the binder and ignore conversion) isn't ideal.

SeeminglyScience on 25 Nov 2019

I get the _implementation_ concern, @SeeminglyScience, but from the _user's perspective_:

it is the _current_ behavior that is hard to reason about: the native indexer is little-known (and provides no benefit over the widely known and used dot notation).
the current behavior prevents unified handling of XML elements with situationally varying child-element counts.

Note that hashtables (dictionaries), strings, and XML elements already are all special-cased in general, in that PowerShell (sensibly) doesn't enumerate them, despite their implementing IEnumerable.

Users likely don't expect a single XmlElement to be indexable (it certainly was news to me), except in the treat-a-scalar-as-a-collection-of-one sense.

mklement0 on 25 Nov 2019

I get the _implementation_ concern, @SeeminglyScience, but from the _user's perspective_:

I'm not just speaking from an implementation concern. The more special rules there are around very specific niche cases the harder it is to figure out what something does. It becomes another situation where someone asks "why does x do blank but y does blank" and the answer is because "it's hard coded to do that for that one thing".

it is the _current_ behavior that is hard to reason about: the native indexer is little-known (and provides no benefit over the widely known and used dot notation).

If it's anything like the dictionary key-value property binder, the index notation is likely a lot faster, and one could argue it describes intent better (the latter being admittedly subjective). Plenty of folks (including myself) insist on using the indexer directly for dictionaries in any projects they maintain.

the current behavior prevents unified handling of XML elements with situationally varying child-element counts.

Note that hashtables (dictionaries), strings, and XML elements already are all special-cased in general, in that PowerShell (sensibly) doesn't enumerate them, despite their implementing IEnumerable.

Those are mostly dictating the behavior of an archetype of object though. Dictionaries, strings, etc. Yeah XML and DataTable's are thrown in too, but if that was proposed today I'd be against that as well.

Users likely don't expect a single XmlElement to be indexable (it certainly was news to me), except in the treat-a-scalar-as-a-collection-of-one sense.

Yeah most people probably will expect [0] to work, that's not really my issue with it. The problem is when they go to try it on a different type, find out that it doesn't work, and then start analyzing the two types to figure out why one works and why the other doesn't.

And yeah, there's already plenty of examples of that all throughout the engine, but that's exactly why I don't want to see more. Any time a behavior question is answered with "because of a line in the source" it's incredibly alienating to folks trying to actually understand the rules of the language.

For what it's worth, I wouldn't be surprised if the PS team disagrees with me for this scenario specifically. XmlElement is already more special cased than almost any other type, it even has it's own internal Adapter. So realistically, more fuel on that fire probably isn't the worst thing.

SeeminglyScience on 25 Nov 2019

The more special rules there are around very specific niche cases the harder it is to figure out what something does.

Fully agreed, but that doesn't apply here, because there are only two conceivable scenarios:

You _knowingly_ use the _native_ indexer - in which case you'll now that something like [0] fails _by definition_, and you'll only ever use ['foo']
You index with a _number_ - whether you know about the native indexer or not - in which case you expect normal positional indexing.

It's conceptually obvious, and no one would think twice about the behavior (or even worry about special-casing having been needed behind the scenes).

There's definitely plenty of magic around XmlElement already, as you state, and I think this change fits in well with the adapted XML DOM that PowerShell presents on top of the type-native members.

As for other types:

XmlElement is clearly a very commonly used type; while there's certainly the potential for such collisions elsewhere, I don't know how often that will come up; arguably, it's worth _generalizing_ the proposed behavior: _always_ allow numerical indexing if the type-native indexer is neither of type object nor int - though there may not always be this neat separation of indexing with stringified numbers _categorically_ making no sense.

At the end of the day, the rule wouldn't be hard to explain (though knowing how to tests for native indexers is an advanced technique):

If the type-native indexer is object- or int-typed, it must of necessity _shadow_ (make unavailable) the PowerShell-provided positional indexer (as is the case with string and hashtable)
otherwise, int arguments select the PowerShell-provided indexer, whereas type-appropriate arguments would select the type-native indexer; it would be like having an implicit, int-typed indexer overload.

mklement0 on 25 Nov 2019

The more special rules there are around very specific niche cases the harder it is to figure out what something does.

Fully agreed, but that doesn't apply here, because there are only two conceivable scenarios:

You _knowingly_ use the _native_ indexer - in which case you'll now that something like [0] fails _by definition_, and you'll only ever use ['foo']

You index with a _number_ - whether you know about the native indexer or not - in which case you expect normal positional indexing.

It's conceptually obvious, and no one would think twice about the behavior (or even worry about special-casing having been needed behind the scenes).

Right, the problem is when they try to apply the same logic to a different type and it doesn't work.

XmlElement is clearly a very commonly used type;

I'd say it's use cases are uncommon on the rarity scale. I wouldn't say rare, but I don't think the typical user is very commonly parsing XML files.

while there's certainly the potential for such collisions elsewhere, I don't know how often that will come up; arguably, it's worth _generalizing_ the proposed behavior: _always_ allow numerical indexing if the type-native indexer is neither of type object nor int - though there may not always be this neat separation of indexing with stringified numbers _categorically_ making no sense.

At the end of the day, the rule wouldn't be hard to explain (though knowing how to tests for native indexers is an advanced technique):

If the type-native indexer is object- or int-typed, it must of necessity _shadow_ (make unavailable) the PowerShell-provided positional indexer (as is the case with string and hashtable)

otherwise, int arguments select the PowerShell-provided indexer, whereas type-appropriate arguments would select the type-native indexer; it would be like having an implicit, int-typed indexer overload.

The proposal for XmlElement only makes the sense that it does because the declared indexer will otherwise throw. For types where that isn't the case, that isn't an option.

SeeminglyScience on 25 Nov 2019

The proposal for XmlElement only makes the sense that it does because the declared indexer will otherwise throw. For types where that isn't the case, that isn't an option.

I'd say it's still an option at last _conceptually_ (can't speak to implementation difficulties), if the indexer argument is neither object nor int (as detailed in the previous comment), but I get that the case is then less clear-cut.

However, if you conceive of the PowerShell-provided indexer as simply an additional type-native indexer overload - which would need to be a deliberate design decision - the behavior would be consistent with method overload resolution.

mklement0 on 25 Nov 2019

P.S.: Just as a point of interest (I wouldn't make the fix dependent on how common XML parsing in PowerShell truly is):

Number of PowerShell-tagged Stack Overflow questions relating to CSV / XML / JSON, in descending order:

Note that I've deliberately not used tags for the data formats, because questions are frequently incompletely tagged.

mklement0 on 25 Nov 2019

However, if you conceive of the PowerShell-provided indexer as simply an additional type-native indexer overload - which would need to be a deliberate design decision - the behavior would be consistent with method overload resolution.

Personally I consider it to be a fallback option only. The implementation backs that up, but I don't have any insights into the design intention.

P.S.: Just as a point of interest (I wouldn't make the fix dependent on how common XML parsing in PowerShell truly is):

Number of PowerShell-tagged Stack Overflow questions relating to CSV / XML / JSON, in descending order:

I was referring to the usage of XmlElement in relation to other types in the BCL, not in relation to other data types.

That said, I wouldn't be surprised if the scales tipped towards json if you filtered by questions asked this year. I'd also say those numbers may not reflect usage, but difficulty.

SeeminglyScience on 25 Nov 2019

Good point re XML question count possibly reflecting difficulty (and about JSON catching up), but I'd say that the more interesting gauge is how likely it is that the average user has to deal with XML rather than the relative frequency of type use.

Re fallback: It's clearly _de facto_ the case, but irrespective of the original design intent, I'm suggesting that implementing the positional-indexer-as-implicit-overload approach (within the stated constraints) would give us generally useful behavior that then doesn't require special-casing XmlElement.

It would make the scalar-as-single-element-collection behavior generally more consistent, which I think is important.

mklement0 on 25 Nov 2019

@mklement0 You could look also new System.Text.Json. There are issues too and maybe related to this. (My PR with new ConvertTo-Json can help you)