PowerShell: Enhance some cmdlets with Culture and Comparison parameters

Created on 12 Apr 2019 · 22 Comments · Source: PowerShell/PowerShell

Approved proposal https://github.com/PowerShell/PowerShell/issues/9348#issuecomment-483051343

Motivation

Currently, some cmdlets process objects using only the current culture. Some of them have a Culture parameter to select another culture and a CaseSensitive parameter to switch from the default case-insensitive behavior to case-sensitive.

Restricting users to the current culture and case-insensitivity is not justified.
In management tasks, culture-invariant and ordinal/ordinal-ignore-case comparisons are often the right choice.

Using non-cultural (especially ordinal) comparisons can significantly improve the performance of operations such as processing large log files.

Summary of the new feature/enhancement

For the following cmdlets:

  • Compare-Object
  • Group-Object
  • Sort-Object
  • Select-Object
  • Select-String

implement:

  • add a Culture parameter if absent
  • add a Comparison parameter (with values Ordinal, OrdinalIgnoreCase, CurrentCulture, CurrentCultureIgnoreCase, InvariantCulture, InvariantCultureIgnoreCase and perhaps SimpleCaseFolding)
  • deprecate the CaseSensitive parameter

Proposed technical implementation details

  • All these parameters are in the ObjectCmdletBase class.
  • To deprecate the CaseSensitive parameter, add Parameter(DontShow = true).
  • The Comparison parameter has priority over the Culture parameter.
  • Defaults stay as they are now: CurrentCulture for the Culture parameter and CurrentCultureIgnoreCase for the Comparison parameter, although we might consider OrdinalIgnoreCase.
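
The six named values proposed for Comparison match the members of .NET's System.StringComparison enum, so their effect can be tried from PowerShell today. The following is only a sketch of what the values mean at the .NET level, not the cmdlet implementation; SimpleCaseFolding is not a StringComparison member and is left out.

```powershell
# Illustration only: what the proposed -Comparison values mean at the .NET level.
$left, $right = 'error', 'ERROR'

# List the six StringComparison member names.
[enum]::GetNames([StringComparison])

# Case-sensitive ordinal comparison: code units are compared as-is, so case matters.
$ordinal = [string]::Equals($left, $right, [StringComparison]::Ordinal)

# Case-insensitive ordinal comparison.
$ordinalIgnoreCase = [string]::Equals($left, $right, [StringComparison]::OrdinalIgnoreCase)

"Ordinal: $ordinal"                       # Ordinal: False
"OrdinalIgnoreCase: $ordinalIgnoreCase"   # OrdinalIgnoreCase: True
```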

Additional information

Comes from the #8180 discussion.

Committee-Reviewed Issue-Enhancement

All 22 comments

@SteveL-MSFT Perhaps the PowerShell Committee should reach a conclusion before I push a PR.

Great idea in principle, but I definitely advise against deprecating -CaseSensitive and generally undermining the default assumption of case-_insensitivity_ by having Ordinal and CurrentCulture mean case-_sensitive_ and suddenly having a case-insensitivity opt-in (OrdinalIgnoreCase, CurrentCultureIgnoreCase); that this is the meaning of these flags at the level of _CoreFx_ is irrelevant, because it is a different realm.

From what I can see, the only thing that's needed is to loosen the definition of the -Culture parameter slightly by accepting two pseudo-culture names, Invariant and Ordinal - no separate -Comparison parameter is then needed:

  • Invariant (currently, at the level of CoreFx, you must pass '' as the name to [cultureinfo]::GetCultureInfo() in order to get the invariant culture, but that is too obscure; of course you can more sensibly use [cultureinfo]::InvariantCulture), which uses StringComparison.InvariantCultureIgnoreCase by default, unless -CaseSensitive is also present (StringComparison.InvariantCulture)

  • Ordinal, which uses ordinal comparison only, StringComparison.OrdinalIgnoreCase, unless -CaseSensitive is also present (StringComparison.Ordinal)

  • Specifying any actual culture name such as ru-RU or omitting -Culture altogether would then imply StringComparison.CurrentCultureIgnoreCase for the specified / current culture, except if -CaseSensitive is also present (StringComparison.CurrentCulture).
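
The mapping in the three bullets above can be sketched as a small helper. This is a minimal illustration under stated assumptions: `Resolve-StringComparer` is a hypothetical name, not an existing PowerShell command, and it returns a .NET `StringComparer` rather than a `StringComparison` value, because the `CurrentCulture*` enum members cannot carry an explicitly named culture.

```powershell
# Hypothetical helper (not part of PowerShell): maps the proposed -Culture values
# plus -CaseSensitive to a .NET StringComparer.
function Resolve-StringComparer {
  param(
    [string] $Culture,
    [switch] $CaseSensitive
  )

  if ($Culture -eq 'Ordinal') {
    # Pseudo-culture: code-unit comparison, no linguistic rules.
    if ($CaseSensitive) { return [StringComparer]::Ordinal }
    return [StringComparer]::OrdinalIgnoreCase
  }

  if ($Culture -eq 'Invariant') {
    # Pseudo-culture: the invariant culture.
    if ($CaseSensitive) { return [StringComparer]::InvariantCulture }
    return [StringComparer]::InvariantCultureIgnoreCase
  }

  # A real culture name such as ru-RU, or the current culture if none was given.
  $cultureInfo = if ($Culture) { [cultureinfo]::GetCultureInfo($Culture) } else { [cultureinfo]::CurrentCulture }
  # Second argument is ignoreCase, so it is the negation of -CaseSensitive.
  return [StringComparer]::Create($cultureInfo, -not $CaseSensitive)
}

(Resolve-StringComparer -Culture Ordinal).Equals('abc', 'ABC')                 # True
(Resolve-StringComparer -Culture Ordinal -CaseSensitive).Equals('abc', 'ABC')  # False
```

Case-insensitivity stays the default in every branch, which is the point of this variant of the proposal.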

The suggestion is not to remove CaseSensitive (only to hide it from IntelliSense) but to expose new functionality to users. Users will be able to continue using the old features if they want.
Enhancing the Culture type makes it difficult to discover the new elements and new functionality. It will also unnecessarily complicate our code, which has to transform the pseudo type. The C# model is more user- and developer-friendly.

> The suggestion is not to remove CaseSensitive

I understand that your intent was to "soft-deprecate" -CaseSensitive, not to remove it.
Yet, that very intent is what I advise against.

> It will also unnecessarily complicate our code

Such considerations should not guide design decisions.

> The C# model is more user- and developer-friendly.

PowerShell is not C#, and there are fundamental differences.

The fundamentally case-INsensitive nature of PowerShell is one of them.

While aligning with C# _where it makes sense_ is commendable, in this case it contravenes PowerShell's fundamental nature and will cause nothing but confusion.

Your proposal may make sense to you because you're immersed in C# - that cannot and shouldn't be assumed for all PowerShell users.

@mklement0 I mentioned C# only because this model is intuitive and easy to use. Especially since we still have no IntelliSense for cultures.

> Especially since we still have no IntelliSense for cultures.

Why not? It's fairly easy to do in PowerShell:

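function Get-Foo {
  param(
    # Complete -Culture with the two pseudo-culture names plus all specific culture names.
    [ArgumentCompleter({
      # $c = command name, $p = parameter name, $w = word being completed
      param($c, $p, $w)
      'Invariant', 'Ordinal' +
        [cultureinfo]::GetCultures('SpecificCultures').Name -like "$w*"
    })]
    [string] $Culture
  )
  $Culture
}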

> this model is intuitive and easy to use.

Perhaps in C#, but not in PowerShell, for the reasons stated; plus, throwing -Culture into the mix complicates things, due to incompatible -Culture and -Comparison values.

Let's juxtapose the two proposals:

Task (case-sensitivity, culture) | single -Culture param | -Culture + C#-style -Comparison param
---- | ------ | -----
insensitive, current | (default behavior) | (default behavior)
sensitive, current | -CaseSensitive | -Comparison CurrentCulture
insensitive, invariant | -Culture Invariant | -Comparison InvariantCultureIgnoreCase
sensitive, invariant | -Culture Invariant -CaseSensitive | -Comparison InvariantCulture
insensitive, ordinal | -Culture Ordinal | -Comparison OrdinalIgnoreCase
sensitive, ordinal | -Culture Ordinal -CaseSensitive | -Comparison Ordinal
insensitive, given culture | -Culture ru-RU | -Culture ru-RU
sensitive, given culture | -Culture ru-RU -CaseSensitive | -Culture ru-RU -Comparison CurrentCulture

  • In all cases, the C#-style comparison mode names contradict the default case-insensitivity expectation in PowerShell.
  • Note the double awkwardness of -Culture ru-RU -Comparison CurrentCulture:

    • The word _current_ can seem contradictory with specifying a specific culture.

    • From a PowerShell mindset, nothing indicates case-_sensitivity_.

  • The -Culture + -Comparison proposal is more verbose in the case-insensitive scenarios (which are arguably more common) and obscure, as well as more complex, given the need to prevent use of -Culture with -Comparison Ordinal | OrdinalIgnoreCase | InvariantCulture | InvariantCultureIgnoreCase.
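
To make the last point concrete, here is a rough sketch of the extra validation the -Culture + -Comparison design would need. `Test-CultureComparisonPair` is a hypothetical name used only for illustration; the rule it encodes (only the `CurrentCulture*` modes may be combined with an explicit culture) is taken from the bullet above.

```powershell
# Hypothetical validation helper (illustration only, not an existing command).
# Returns $true when the argument combination is acceptable.
function Test-CultureComparisonPair {
  param(
    [string] $Culture,
    [System.StringComparison] $Comparison
  )

  # Only these modes can meaningfully be paired with an explicit -Culture value.
  $cultureAware = [StringComparison]::CurrentCulture, [StringComparison]::CurrentCultureIgnoreCase

  $bothGiven = $PSBoundParameters.ContainsKey('Culture') -and
               $PSBoundParameters.ContainsKey('Comparison')

  # Either one alone is always fine; together, the mode must be culture-aware.
  (-not $bothGiven) -or ($Comparison -in $cultureAware)
}

Test-CultureComparisonPair -Culture ru-RU -Comparison CurrentCulture   # True
Test-CultureComparisonPair -Culture ru-RU -Comparison Ordinal          # False
```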

@mklement0 Thanks for making things clear. I see your point and I agree that we could easily implement this by keeping current parameters. This "packaging" looks nice. Nevertheless, I see several problems after attempts to make a prototype that have led me to make the original proposal:

  • it is not a breaking change
  • it keeps the Culture type native. Overlapping native types always complicates code and is a path to performance issues. It can confuse not only C# developers but script authors too, because Ordinal is not a culture; separating ordinal from linguistic comparison keeps things clear and simple.
  • the proposal was made with the intention of adding SimpleCaseFolding to Comparison.
  • as for "names contradict", we could replace "Ordinal" with "OrdinalSensitiveCase" (vs. "OrdinalIgnoreCase"). Interestingly, InvariantCulture is still a culture. :-) However, if we shift the center of gravity to the Comparison parameter, then I would prefer to do all the workarounds there.

> it is not a breaking change

That is largely hypothetical, as the existing -Culture parameters are [string]-typed, and neither Ordinal nor Invariant refer to existing cultures - nor would SimpleCaseFolding.

> it is a path to performance issues.

I don't think that's a concern here. We could stick with string, if we don't want to create a type that is a "superset" of [cultureinfo].

> the proposal was made with the intention of adding SimpleCaseFolding to Comparison.

Then accept SimpleCaseFolding as a pseudo culture too.

As an aside, I'm still unclear on how simple case folding relates to InvariantCultureIgnoreCase in terms of _behavior_ - from what I understand the former is faster than the latter, but are they _functionally_ the same?

> Interestingly, InvariantCulture is still a culture. :-)

Yes, and that's why it makes sense to treat _all_ these cases the same, as pseudo cultures, because it all comes under the heading of "The rules of what culture/non-culture should be applied?"

@PowerShell/powershell-committee reviewed this; we believe @mklement0's proposal with -Culture and -CaseSensitive will be easier for most users. Given the complexity of this and the opportunity for more feedback, we request that an RFC be authored for this work.

Probably want to make sure that includes a parameter transform attribute and argumentcompleter setup to make it as easily reusable as possible, too. :)

When I spoke about complications, I just meant that we would have to add code for a transform attribute and an argument completer.
We could ask the CoreFX team to add the Invariant name to the Culture type, but the Ordinal name will probably never be on the list of cultures.
Also, we have to translate "culture" to both the Culture and Comparison types, so a transform attribute is problematic.

I looked at grep and ripgrep.
They can accept input as a byte stream. For this, grep uses LC_ALL="C", which is a one-byte encoding (.NET Core exposes this as the "invariant" culture on Unix). ripgrep also has an option to switch to a byte stream.
This makes me think that an -AsByteStream parameter could be useful to us.

Interestingly, ripgrep ignores system locale settings (!)
https://github.com/BurntSushi/ripgrep/issues/790#issuecomment-365024942
https://github.com/BurntSushi/ripgrep/blob/7b3fe6b3251a18d2b8d3efe7ee6a85c9e9e4e565/FAQ.md#can-ripgrep-replace-grep
Taking into account that ripgrep is the fastest tool, we might think about how to use the ripgrep experience. Maybe we don't need the Culture parameter either.

@iSazonov The right documentation to read with respect to how ripgrep handles encoding is this: https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md#file-encoding --- Notably, ripgrep transparently supports UTF-16. grep does not. That might be relevant here.

@BurntSushi Thanks for the comment and link! We are very interested in improving search in PowerShell and would be happy to have your help.
Currently the Select-String cmdlet already has an "-Encoding" parameter, and we can either use automatic encoding detection or specify a value. (I am thinking about reducing transcoding, because in the C# world we work with UTF-16 by default and maybe we should consider working directly with UTF-8.)
Our discussion now is that the cmdlet does only a linguistic search for the current culture/locale, and we want to add culture-independent (invariant culture) and non-linguistic (ordinal in the C# world; I don't know the corresponding term in ripgrep) search. I see that ripgrep ignores locale settings, but it is not clear how it supports (if it does) linguistic, invariant and ordinal search.
If you could share your experience, it would be great.

I'm afraid that I don't follow. Apologies, but I'm not a Windows user. I don't know what "culture independent," "invariant culture," "non-linguistic" or "ordinal search" mean in this context. Examples would help.

An easy statement for me to make, based on guessing at what you mean, is to simply state that ripgrep has no support for custom tailorings defined by the various Unicode technical reports. Everything is vanilla Unicode, and its support for case folding is limited to the "simple" case mapping.

Otherwise, what you see is what you get. You give ripgrep a query, and it looks for exact matches in files, just like grep does. That's it.

Yes, terminology is a problem :-(
https://docs.microsoft.com/en-us/dotnet/csharp/how-to/compare-strings
In the docs you find examples.
"Culture" is "locale" on Unix.
"Invariant" is more tricky. It follows the OS. On Windows it is a general culture (with Latin character ordering). On Unix it is the "C" locale. "Invariant" is usually used when an application should behave the same regardless of the system/user locale.
"Ordinal" means non-linguistic comparison, byte-by-byte (char-by-char), optionally ignoring case. On Unix, .NET Core gets this with the "C" locale (if I understand correctly), which is the "invariant culture" too. On Windows, "invariant culture" and "ordinal" are different, as you can see in the docs above.
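
A small example may help here. It is only a sketch, and the invariant result assumes that full globalization data is available to the runtime, so it is marked as typical rather than guaranteed. The ordinal result follows directly from the character codes ('a' is 97, 'B' is 66).

```powershell
# Ordinal: compares character codes, so 'a' (97) sorts after 'B' (66).
$ordinal = [Math]::Sign([string]::CompareOrdinal('a', 'B'))
"Ordinal:   $ordinal"      # Ordinal:   1

# Invariant culture: linguistic ordering, where 'a' typically sorts before 'B'.
$invariant = [Math]::Sign([string]::Compare('a', 'B', [StringComparison]::InvariantCulture))
"Invariant: $invariant"    # typically -1 when full globalization data is present
```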

@mklement0 Since https://github.com/dotnet/corefx/issues/41333, everything is simpler and we could implement your proposal without any concerns about future API changes.
I have only one concern, about the pseudo-culture names "Invariant" and, especially, "Ordinal". Our intention is to simplify user understanding, but these names look very specific: they are familiar only to .NET Core/C# users.

Glad to hear it, @iSazonov.

Yes, they are .NET terms, which I think is appropriate, given PowerShell's foundation, especially if we clearly document these values.

(If they were specific to a .NET _language_, such as C#, I'd be more concerned).

What do you have in mind?

> What do you have in mind?

Perhaps there are suitable terms in Unicode standard?

There is nothing in the Unicode glossary that jumps out: http://unicode.org/glossary/ (but see next section).

Conversely, I do think using the established .NET terms is beneficial - even if they may initially be unfamiliar to non-.NET users: that's where the docs come in.

(I guess CodePoint and Agnostic / Neutral could be considered, but I don't think there's a strong enough case for deviating from the established .NET terms).


As for the relationship with the Unicode standard, from what I gather:

  • _Ordinal_ has no counterpart in the Unicode standard, because it seems that _all_ operations are expected to be linguistically correct, albeit not all of them culture-specifically (_ordinal_ recognizes only the code-point-by-code-point case mappings, i.e., one character to one character, but not multi-character mappings such as ß to SS, defined in file SpecialCasing.txt, which Unicode expects to be _always_ observed)

  • What is referred to in the standard under the umbrella of _default_ case algorithms appears to correspond to .NET's _invariant_ matching.

  • Culture-_specific_ behavior is referred to as _tailoring_.
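
The ordinal point can be checked directly. This is a minimal sketch; the ü/Ü pair is just an illustrative one-to-one mapping chosen here, and the characters are built from code points so that the script does not depend on file encoding.

```powershell
$sharpS = [string][char]0x00DF   # ß
$lowerU = [string][char]0x00FC   # ü
$upperU = [string][char]0x00DC   # Ü

# One-to-one case mapping: recognized by ordinal ignore-case comparison.
$singleChar = [string]::Equals($lowerU, $upperU, [StringComparison]::OrdinalIgnoreCase)

# Multi-character mapping (ß to SS): not recognized by ordinal comparison.
$multiChar = [string]::Equals($sharpS, 'SS', [StringComparison]::OrdinalIgnoreCase)

"one-to-one mapping matches:   $singleChar"   # one-to-one mapping matches:   True
"multi-char mapping matches:   $multiChar"    # multi-char mapping matches:   False
```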

@mklement0 I did not find appropriate terms either. Thanks for the confirmation!
