Powershell: Inconsistent support for converting number strings with binary-multiplier / number-type suffixes to numbers

Created on 5 Sep 2018 · 43Comments · Source: PowerShell/PowerShell

Note:

Changing this would technically be a breaking change and/or may be undesirable for performance reasons, in which case the issue should be _documented_.

When converting from _strings_,

PowerShell currently DOES recognize binary-multiplier suffixes such as kb, mb, ...:
- in _expressions_ (e.g., '1gb' / 1)

but DOES NOT recognize them:
- during _implicit_ type conversion in then context of _parameter binding_ (e.g.,
  & { param([int] $foo) } '1gb')
- (perhaps less surprisingly) in _explicit_ type conversions with _casts_ (e.g., [int] '1gb')

_Update_: The same applies to PowerShell's _number-data-type suffixes_ such as d for [decimal] and l for [long].

Environment data

Written as of:

PowerShell Core 6.1.0-preview.4

Breaking-Change Committee-Reviewed Issue-Discussion Resolution-Fixed Up-for-Grabs WG-Language

Source

mklement0

Most helpful comment

If you look at just the constructor you can see it does very little.

That said, because C# lets you initialize properties outside the constructor, there could be expensive work hiding elsewhere, but by design creating a parser was meant to be inexpensive and if that is not the case, it's something that should be fixed.

lzybkr on 18 Jan 2019

👍2 ❤1

All 43 comments

It is not type conversion. The multipliers is realy "multiplier" in parse time. Type conversion is done in run-time. So $($var)Kb doesn't work too.

iSazonov on 6 Sep 2018

@iSazonov: Parse-time resolution applies to _unquoted number literals_ such as 1gb.

By contrast, my expression example uses a _quoted string_, where type conversion is clearly performed _at runtime_:

PS> $v="1gb"; $v / 1
1073741824

/cc @PetSerAl

mklement0 on 6 Sep 2018

I'm not really sure this should be a thing. How is PS supposed to distinguish cleanly between when you want something to cast as an integer (especially in implicit cases) and when you might want it to throw instead?

I can see that it probably ought to happen for strongly-typed parameters and would make sense there, but in most cases you'll call a parameter like that... just as a number: Do-Thing -Number 1gb rather than as a string Do-Thing -Number '1gb' -- and I think that distinction is probably pretty sensible.

The only real time I can see this being particularly useful is in creating meta-code, but in those cases you can just convert the code string to a script block and the PS tokenizer will handle the multiplication correctly in those cases.

vexx32 on 6 Sep 2018

@vexx32:

Well, it already _is_ a thing, but only _partially_ (only in expressions), and that inconsistency prompted me to raise this issue.

How is PS supposed to distinguish cleanly

Just like with any "number strings": if the context explicitly requires a _number_, try to convert the string to one.

Yes, passing a string to something that expects a number is not the typical case, but it works with all other "number strings" - namely those that _.NET_ recognizes.

Note that the inconsistency is not limited to the binary-multiplier suffix, it also applies to the data-type suffixes:

'1d' / 1  # OK -> 1, of type [decimal]

[int] '1d' # FAILS -> Cannot convert value "1d" to type "System.Int32". Error: "Input string was not in a correct format."

mklement0 on 6 Sep 2018

👍1

@mklement0 I mean that '1Gb' and '1d' is strings and PowerShell have to parse the strings to convert to numerics. In runtime 'Gb' and 'd' would have to be operators.

iSazonov on 6 Sep 2018

@iSazonov:

Yes, they are strings, and PowerShell converts them to numbers in _expressions_; by contrast, it doesn't do so when converting _implicitly_ during parameter binding - that is the inconsistency being discussed here.

I understand that _.NET_ (CoreCLR) doesn't know about _PowerShell's_ number-literal suffixes, but the question is whether _PowerShell_ should handle the conversion part consistently.

As I said, with explicit _casts_ the picture is not as clear, but there already is a precedent for PowerShell doing additional behind-the-scenes work, such as casts sometimes being translated in to ::Parse() calls with the invariant culture.

mklement0 on 6 Sep 2018

@mklement0 I see your point.
What about

$var = '1gb'
& { param([int] $foo) } $var

Seems we'll have to do extra steps to process this.
And then what about:

$var = '1gb'
$var + 1

is this string operation or numeric?

iSazonov on 6 Sep 2018

@iSazonov:

Your 1st example is already in the OP, as an example of where the conversion _doesn't_ work (_implicit_ conversion in the context of _parameter binding_).

As for the 2nd example:

+ - unlike / - does _not_ perform to-number conversion, so nothing would change there:

'1' + 0 # -> '10'
'1gb' + 0 # -> '1gb0'

To summarize: The only potential change being discussed here is:

Should _all contexts where PowerShell currently already performs to-number conversion_ recognize [string representations of] PowerShell-only number-literal formats _too_?

mklement0 on 6 Sep 2018

It is a question for PowerShell Committee.

iSazonov on 7 Sep 2018

@PowerShell/powershell-committee reviewed this. This is a breaking change where it used to fail and now works. We agree that the desired behavior is acceptable. However, this is likely a complex change that need to also work with recently introduced suffixes like l for LONG.

SteveL-MSFT on 6 Dec 2018

👍2

@SteveL-MSFT Would simply passing the string to one of the parser or tokenizer functions be an appropriate way to determine if it's a parseable number and the resultant type? If this occurs after some 'is it a number' check has already been made, one might simply be able to directly call TryGetNumberValue or one of the related tokenizer functions on it directly?

Side note: L is an old suffix, but there are several newer ones that do indeed need to be considered also :wink:

vexx32 on 6 Dec 2018

cc @daxian-dbw @BrucePay

SteveL-MSFT on 12 Jan 2019

Thanks, Steve! I checked in with @TylerLeonhardt who passed it along to @rjmholt, and after some brief suggestions I traced the cached conversion methods that get called from string->integral types to here:

https://github.com/PowerShell/PowerShell/blob/e399b06cd23a8d9a7df9be5ed329471ba5b8eb82/src/System.Management.Automation/engine/LanguagePrimitives.cs#L2888-L2907

My initial thought so far is to add some kind of check in there to determine if the last two characters in this (possibly trimmed?) value are a multiplier or the last character is a type suffix, something kind of like...

ScriptBlockAst result;
char lastChar = strToConvert[strToConvert.Length - 1];
if (!lastChar.IsTypeSuffix() ||
    lastChar == 'b' && strToConvert[strToConvert.Length - 2].IsMultiplierStart())
{
    result = Parser.ParseInput(strToConvert, out Token[] tokens, out ParseError[] errors);
}

And then I think I can grab tokens[0].Value to determine the numeric value and use that. _However_, I don't know that invoking ParseInput() here is the best available option. It also sort of bypasses the _need_ for the latter half of this conversion expression, because the token's .Value property seems to hold the appropriate data type coming out of it, at least where any type suffix may be concerned.

Would definitely appreciate some input from y'all on this, it's a fun little rabbit trail, but I don't want to get too far off track and in a direction that isn't useful to y'all. 😄

EDIT: Spoke a little with @SeeminglyScience this morning, and he and I think it might be better if we can somehow decouple portions of ScanNumberHelper and TryGetNumberValue from the tokenizer stepping logic and reuse that, so that we're guaranteed self-consistent results without having to create an entire parser and tokenizer during this conversion, which is almost certainly an expensive operation. How doable that is, I don't know yet. Interesting thought though. 😄

vexx32 on 12 Jan 2019

👍2

EDIT: Spoke a little with @SeeminglyScience this morning, and he and I think it might be better if we can somehow decouple portions of ScanNumberHelper and TryGetNumberValue from the tokenizer stepping logic and reuse that, so that we're guaranteed self-consistent results without having to create an entire parser and tokenizer during this conversion, which is almost certainly an expensive operation. How doable that is, I don't know yet. Interesting thought though.

My personal feeling is as you say; casting numbers from strings here should work in exactly the same way as they would in the tokenizer (the easiest way being to share the function). If that requires no state, or can be decoupled from the tokenizer state, then the method may as well be shared and used in both places.

I don't know how expensive it is to create a new tokenizer (or if it's possible to do outside of the parser with current type access) but to my mind, casting is often associated with some overhead, so it's not the end of the world if the first implementation is not as efficient as it could be.

I would favour doing it consistently over making it fast, since once it's released we can improve the latter but improving the former would be a breaking change.

rjmholt on 17 Jan 2019

👍1

@rjmholt I think I can agree with all your points there.

Currently you cannot instantiate the tokenizer without the parser (that I can see, anyway), and the main method I know to parse the input would be the static Parser.ParseInput() method mentioned above.

I've been looking at attempting to decouple the methods in the tokenizer, and while I believe it may be eventually worthwhile it may not be especially valuable at this point in time. It's not a small work item by any means; a _lot_ of the underlying logic is coupled tightly to the step-through methods. I can decouple portions of the methods, but I've yet to find any relatively clean ways to thoroughly decouple the methods without causing every possible number to be double-scanned anyway. It may well be simpler to just... invoke the parser on the suspected number and go from there.

If that sounds OK, I can tidy the code in that method and submit a PR?

vexx32 on 17 Jan 2019

Currently you cannot instantiate the tokenizer without the parser (that I can see, anyway), and the main method I know to parse the input would be the static Parser.ParseInput() method mentioned above.

Mmm, I thought that was the case. Bugger.

I've been looking at attempting to decouple the methods in the tokenizer, and while I believe it may be eventually worthwhile it may not be especially valuable at this point in time. It's not a small work item by any means; a lot of the underlying logic is coupled tightly to the step-through methods.

I've wanted parser and tokenizer as nice, standalone objects for a while as well, but I think the likelihood of breakage to a critical (and currently very fault-tolerant) part of PowerShell means we need a large and compelling reason for the work.

rjmholt on 17 Jan 2019

Completely agree there. I think it's certainly _possible_ to handle this, but as you say... not really worth it until there's a real need for it.

I'll corral the code here and get it working properly, then submit the PR. 😄

vexx32 on 17 Jan 2019

I think you're looking for this code. Note that it creates a parser which is not expensive.

lzybkr on 18 Jan 2019

@lzybkr Thank you! That saves me rewriting that code... _fantastic_. This'll be pretty straightforward. 😄

Creating a parser _isn't_ expensive? That's... actually very surprising, looking at the size of those files. Good to know, though!

vexx32 on 18 Jan 2019

If you look at just the constructor you can see it does very little.

lzybkr on 18 Jan 2019

👍2 ❤1

@lzybkr Got it halfway there with that; see the referenced WIP PR.

As long as the value in the string is valid when it's parsed, we get the right value and type back (e.g., if you do [int]"100y" you _do_ get back 100 as an int, because 100 is a valid byte value, can be parsed, and then it gets converted.

But it just completely crashes PS with a StackOverflowException when the value in the string is an invalid number (e.g., 200y, which would usually throw a parsing exception if you attempt to use it normally, just crashes PowerShell ).

vexx32 on 18 Jan 2019

Can we reuse the parser and don't allocate it in every ScanNumber() call?

iSazonov on 18 Jan 2019

🤔 Potentially? I'm not sure how much gain there is in doing so, though, or how complicated it would end up being. I'll look into it tomorrow. 😄

vexx32 on 18 Jan 2019

It is _engine_. If we can avoid an allocation we should do I believe.

iSazonov on 18 Jan 2019

👍2

@iSazonov I attempted to add a static Parser field to LanguagePrimitives but immediately this resulted in an AMSI uninitialization error as soon as the project was built and run; it could not even reach the console prompt.

The shell cannot be started. A failure occurred during initialization:
The type initializer for 'System.Management.Automation.Language.Compiler' threw an exception.
Assertion Failed
AMSI should have been uninitialized.

I'm not sure why this occurs, but I'm sure there's probably some reason for it.

vexx32 on 20 Jan 2019

It come from VerifyAmsiUninitializeCalled(). I guess it is side effect. You could remove the assert temporary to find root cause.

iSazonov on 21 Jan 2019

Not a bad idea... What would I be looking for after it's disabled? I'm not at all familiar with that code.

vexx32 on 21 Jan 2019

I expect that after you disable the assert you will see that process terminated and will need to research why.

iSazonov on 21 Jan 2019

👍1

Ah, my favourite kind! 😄

I'll see if I can find some time to take a close look at it tomorrow. 🙂

vexx32 on 21 Jan 2019

@iSazonov Gave this a look, but it's not actually crashing.

It does however completely disable the error stream. $Error is never populated, errors never display.

Very odd! Debugging time!

Update:

So I still have no idea. The exceptions are thrown and then travel up the stack (which is _really_ tricky to follow at the best of times) and then just seem to get caught somewhere that isn't really obvious and then... nothing. $error never gets populated, and no error output reaches the console.

I moved the static parser declaration to the Parser class itself, and this actually seems to resolve it. PowerShell builds and runs OK with only that change, AMSI doesn't fail anywhere that is immediately obvious, and the error stream is working. However, doing this is potentially a thread-safety concern as I note here: https://github.com/PowerShell/PowerShell/pull/8681#issuecomment-456441592

I think at present it's probably best not to keep a static Parser instance around; it may be worth pursuing in future, but thread-safety is something I do not think I have enough experience with to do a proper job of it here, and the overhead of ensuring thread-safety may well outweigh the costs of instantiating a Parser each time.

vexx32 on 22 Jan 2019

My 2¢ - it's probably not worth the trouble to reuse the parser, but if you did, it would look something like this.

lzybkr on 22 Jan 2019

👍2

I may well have a crack at it and see if I can follow that example, but that will probably be in a followup PR. 😄

Thank you! ❤️

vexx32 on 22 Jan 2019

@vexx32 the AMSI assert is https://github.com/PowerShell/PowerShell/issues/6862

SteveL-MSFT on 22 Jan 2019

@SteveL-MSFT That may be the same thing I ran into, or it may be something else. That one's under specific circumstances, but the probably naive implementation of a static parser I made caused it to happen _immediately_ on startup, rending pwsh completely unusable. 😄

Not totally sure why, but if I go after that route, I'll definitely take @lzybkr's suggestion... and hope it doesn't break the same way! 🙂

vexx32 on 22 Jan 2019

@vexx32 Please see GetStringBuilder() that is for reusing StringBuilder-s.

iSazonov on 23 Jan 2019

👍1

@SteveL-MSFT @lzybkr @iSazonov So I got curious and decided to see if I could put this together; this is what I came up with.

internal static class ParserCache
{
    [ThreadStatic]
    private static Parser t_cachedInstance;

    internal static Parser Acquire()
    {
        Parser p = t_cachedInstance;
        if (p != null)
        {
            t_cachedInstance = null;
            return p;
        }

        return new Parser();
    }

    internal static void Release(Parser p)
    {
        if (p.ErrorList.Count > 0)
        {
            p.ErrorList.Clear();
        }

        p._fileName = null;
        p._ungotToken = null;
        t_cachedInstance = p;
    }
}

I ran a build with it, and it seems to work just fine. There are a handful of callsites in Parser.cs where new Parser() is called that I replaced with the methods here, and it seems to work just fine. Whether there's any performance benefit I'm not clear on.

Think it's worth including in #8681?

vexx32 on 28 Jan 2019

@vexx32 - the Parser class should be responsible for resetting it's state for reuse. You can see the code that does that here.

I think there is significant risk in reusing parser instances. A specific instance may be left in a state not suitable for another parsing task, and tracking down how the parser got into that state might be very difficult. These bugs might not be hard to fix once understood, but I would expect issues to be hard to reproduce, possibly even non-deterministic.

Maybe with some care you could a test that uses reflection to verify a Parser.Reset method resets every field in Parser and for instances that are reused - check that recursively - so Tokenizer is reset appropriately as well. Without such a test, I'd be worried some new code gets added introducing a bug that could take a long time to surface.

lzybkr on 28 Jan 2019

👍1

All very valid concerns, indeed! I appreciate the tips, thank you! 🙂

I'm not familiar with much of reflection, but such a test doesn't seem overly complicated to do... just need to make sure to be thorough. 😄

My thought is that pulling a parser from the cache should get you a parser in the exact same state as a newly-created parser. Doing a comparison between the two should be the best indicator, as well as being fairly future-proof in terms of ensuring it still properly covers the case that someone adds new members to the Parser class down the road. I think. 😄

Would that test be best performed from Pester, do you think, or is xUnit preferable here?

vexx32 on 29 Jan 2019

Why we need new Parser in ScanNumber?

iSazonov on 29 Jan 2019

@iSazonov all of the static Scan*() methods in the parser currently instantiate a new Parser object in order to perform their function. As to _why_ specifically, I'm not familiar with everything about the code (I'm sure Jason could speak to that far better than I, of course) but the tokenizer and parser are tied together; you literally _can't_ instantiate one without the other.

The only available constructor for the tokenizer class requires a Parser object as input, and instantiating a Parser _also_ instantiates a tokenizer. Certainly, perhaps this could be changed, but I don't know whether or not that is... wise. 🙂

vexx32 on 29 Jan 2019

My thoughts is about possibility of using current Tokenizer and Parser. Perhaps they could be improved to support this and perhaps it is more simple.

iSazonov on 29 Jan 2019

The tokenizer uses the parser for error reporting. There may be other important ties that I forget, but that's the big one.

Unless you have concrete data showing a performance problem, I'll suggest the risk isn't worth it.

I've suggested ways to mitigate the risk (and testing that a Reset parser is in the same state as a new Parser is also a good idea - not sure how that will look in code though), but in the end - I'd like data to justify the extra complexity and maintenance risks.

lzybkr on 29 Jan 2019

👍1

@lzybkr Thanks!

Initial request was to reduce garbage pressing. The parser state resetting is not resolve this. So I agree that the change is not worth it.

iSazonov on 30 Jan 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Feature Request: Suppress sleep cmdlet

garegin16 · 3Comments

Splatting a property

JohnLBevan · 3Comments

Write-Output -NoEnumerate outputs PSObject[] rather than Object[] and generally doesn't respect the input collection type

mklement0 · 3Comments

PowerShell core on Linux - Get-Service should mimic linux "service" command.

MaximoTrinidad · 3Comments

New-PSSession error "subsystem request failed on channel 0" when using SSH key authentication

pcgeek86 · 3Comments