From http://unicode.org/faq/casemap_charprop.html :
Case folding is mostly used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation.
I opened https://github.com/dotnet/corefx/issues/33047 to implement _simple case folding_.
We could use this for improving performance in:
I started. Alfa code https://github.com/iSazonov/PowerShell/tree/add-unicode3/src/System.Management.Automation/utils/unicode
The code doesn't depend on PowerShell so you can freely copy-past the files to your local folder and run with dotnet run -c release
.
Better result is 10x win in best case and -4x in worst case. It is worth exploring further!
@mklement0 Have you an experience with Unicode? I'll need help to create a test set and review API design.
(Somewhat ironically, the U
in Unicode
in your URL should be lowercase u
- else the link doesn't work: https://github.com/iSazonov/PowerShell/tree/add-unicode3/src/System.Management.Automation/utils/unicode)
Thanks for looking into this, looks promising.
I think I have a basic understanding of Unicode fundamentals, but no experience relating to _implementation_ aspects.
Can you elaborate on how I could help? Not sure what you mean by _test site_.
As for implementing the Unicode-version-dependent case-folding tables ourselves, please note the following comments from https://github.com/dotnet/corefx/issues/17233:
I think the open issue is how we support this on Windows (since I think that we can get Case Folding functionality from ICU).
In general, we've tried to move away from the framework itself shipping globalization data and used the OS provided APIs instead. Since Windows doesn't currently expose a way to do case folding, we'd have to figure out what to do.
So it sounds like we should take advantage of OS support where available (ICU on Linux; ?? on macOS).
As for Windows, and therefore overall: Perhaps it's worth waiting for CoreFx to support this?
Not sure what you mean by test site.
Test set :-) It is Friday evening. If we would develop the APIs we should create a full set of tests. (We could look and grab CoreFX/CoreCLR tests - possible start search point https://github.com/dotnet/corefx/pull/32968).
Can you elaborate on how I could help?
http://unicode.org/ contains standard papers. I could share more tomorrow.
So it sounds like we should take advantage of OS support where available (ICU on Linux; ?? on macOS).
As for Windows, and therefore overall: Perhaps it's worth waiting for CoreFx to support this?
Inital plan is:
Ideal plan is to use the APIs in PowerShell engine for identifers.
A perfect plan :-) is to get full featured API in CoreFX/CoreCLR.
Today I have done some performance tests and it seems it will not be easy to overtake Kernel32 p/invokes. Though I have no experience with low level optimizations at all.
From: http://unicode.org/faq/casemap_charprop.html
Q: What is the difference between case mapping and case folding?
A: Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is mostly used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is primarily based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.
From http://www.unicode.org/reports/tr31/tr31-29.html#R5
Equivalent Case-Insensitive Identifiers: To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Any two identifiers that have the same case-folded form shall be treated as equivalent by the implementation.
From http://www.unicode.org/reports/tr21/tr21-5.html
1.3 Caseless Matching
Caseless matching is implemented using case-folding. The latter is the process of mapping strings to a canonical form where case differences are erased. Case-folding allows for fast caseless matches in lookups, since only binary comparison is required. Case-folding is more than just conversion to lowercase. For example, it handles cases such as the Greek sigma, so that "Μάϊος" and "ΜΆΪΟΣ" will match correctly.
From http://unicode.org/reports/tr18/#Folded_Matching
3.10 Folded Matching (Retracted)
RL3.10 Folded Matching
Previous versions of RL3.10 described tailored folding. However, for most full-featured regular expression engines, it is quite difficult to match under folding equivalences that are not 1:1. For more discussion of this, see 1.5 Simple Loose Matches and 2.1 Canonical Equivalents. Thus RL3.10 has been retracted.
From http://unicode.org/reports/tr18/#Simple_Loose_Matches
1.5 Simple Loose Matches
Most regular expression engines offer caseless matching as the only loose matching. If the engine does offers this, then it needs to account for the large range of cased Unicode characters outside of ASCII.
RL1.5 Simple Loose Matches
To meet this requirement, if an implementation provides for case-insensitive matching, then it shall provide at least the simple, default Unicode case-insensitive matching, and specify which properties are closed and which are not.
To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the simple, default Unicode case folding.
From http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
3.13 Default Case Algorithms
Default Case Folding
Default Case Folding is based on the full case conversion operations without the contextdependent
mappings sensitive to the casing context.
R4 toCasefold(X): Map each character C in X to Case_Folding(C).
• Case_Folding(C) uses the mappings with the status field value “C” or “F” in the
data file CaseFolding.txt in the Unicode Character Database.
A modified form of Default Case Folding is designed for best behavior when doing caseless
matching of strings interpreted as identifiers.
(See above http://www.unicode.org/reports/tr31/tr31-29.html#R5)
Default Caseless Matching
Default caseless matching is the process of comparing two strings for case-insensitive
equality. The definitions of Unicode Default Caseless Matching build on the definitions of
Unicode Default Case Folding.
Default Caseless Matching uses full case folding:
D144 A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)
Not related but interesting http://www.unicode.org/notes/tn31/tn31-2.html
Fast Compression Algorithm for Unicode Text (compression ratio ~20-50%, speed ratio ~15 times)
Additianal link from Rust RegEx library repository https://github.com/rust-lang/regex/blob/77140e7c1628ac31026d2421c6a4c3b0eb19506c/UNICODE.md#rl15-simple-loose-matches
First alfa code performance results (yes, I got the code fast!):
/ * Summary *
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.10240.17146 (1507/RTM/Threshold1)
Intel Core i3-4150 CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
Frequency=3410090 Hz, Resolution=293.2474 ns, Timer=TSC
.NET Core SDK=2.1.403
[Host] : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
DefaultJob : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
| Method | StrA | StrB | Mean | Error | StdDev | Median | Scaled | ScaledSD | Gen 0 | Allocated |
|----------------- |------------- |------------ |----------:|---------:|----------:|----------:|-------:|---------:|-------:|----------:|
| ToLowerRussian | CaseFolding1 | CaseFolding | 73.28 ns | 2.129 ns | 3.120 ns | 72.32 ns | 0.99 | 0.06 | 0.0660 | 104 B |
| ToLowerInvariant | CaseFolding1 | CaseFolding | 73.85 ns | 1.660 ns | 3.978 ns | 72.09 ns | 1.00 | 0.00 | 0.0660 | 104 B |
| StringFold | CaseFolding1 | CaseFolding | 65.15 ns | 1.372 ns | 3.068 ns | 63.81 ns | 0.88 | 0.06 | 0.0660 | 104 B |
| | | | | | | | | | | |
| ToLowerRussian | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 188.23 ns | 5.924 ns | 17.185 ns | 177.66 ns | 1.09 | 0.11 | 0.0660 | 104 B |
| ToLowerInvariant | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 173.58 ns | 3.699 ns | 8.646 ns | 169.48 ns | 1.00 | 0.00 | 0.0660 | 104 B |
| StringFold | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 77.15 ns | 1.580 ns | 2.054 ns | 76.17 ns | 0.45 | 0.02 | 0.0660 | 104 B |
Given how frequently PS uses case-insensitive comparisons, that will likely make a massive difference if it can be utilised widely in the PS code base.
Looking good @iSazonov!!
Yes, it is encouraging. I updated link above to the alfa code. Now I started to work on fold comparison methods and perf tests.
This can not be trusted but the first results are impressive.
// * Summary *
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.10240.17146 (1507/RTM/Threshold1)
Intel Core i3-4150 CPU 3.50GHz (Max: 0.80GHz) (Haswell), 1 CPU, 4 logical and 2 physical cores
Frequency=3410090 Hz, Resolution=293.2474 ns, Timer=TSC
.NET Core SDK=2.1.403
[Host] : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
DefaultJob : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
| Method | StrA | StrB | Mean | Error | StdDev | Median | Allocated |
|-------------------------------------- |------------- |------------ |----------:|-----------:|-----------:|----------:|----------:|
| CoreFXCompareOrdinal | CaseFolding1 | CaseFolding | 15.97 ns | 0.4665 ns | 0.4581 ns | 15.75 ns | 0 B |
| CoreFXCompareOrdinalIgnoreCase | CaseFolding1 | CaseFolding | 28.47 ns | 0.6793 ns | 1.1535 ns | 27.98 ns | 0 B |
| CoreFXCompareInvariantCulture | CaseFolding1 | CaseFolding | 78.80 ns | 1.5962 ns | 3.6354 ns | 77.11 ns | 0 B |
| CoreFXCompareRussianCulture | CaseFolding1 | CaseFolding | 205.74 ns | 4.0894 ns | 3.8252 ns | 203.90 ns | 0 B |
| CoreFXCompareRussianCultureIgnoreCase | CaseFolding1 | CaseFolding | 213.12 ns | 17.6287 ns | 17.3137 ns | 204.75 ns | 0 B |
| CompareFolded | CaseFolding1 | CaseFolding | 23.20 ns | 0.4771 ns | 0.5495 ns | 22.90 ns | 0 B |
| CoreFXCompareOrdinal | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 16.27 ns | 0.3593 ns | 0.3690 ns | 16.21 ns | 0 B |
| CoreFXCompareOrdinalIgnoreCase | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 35.92 ns | 1.1880 ns | 1.3205 ns | 35.35 ns | 0 B |
| CoreFXCompareInvariantCulture | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 101.51 ns | 2.7983 ns | 7.8468 ns | 97.63 ns | 0 B |
| CoreFXCompareRussianCulture | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 205.93 ns | 3.9585 ns | 4.2356 ns | 204.45 ns | 0 B |
| CoreFXCompareRussianCultureIgnoreCase | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 210.18 ns | 15.8387 ns | 14.8155 ns | 204.99 ns | 0 B |
| CompareFolded | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 22.83 ns | 0.1892 ns | 0.1477 ns | 22.79 ns | 0 B |
I updated my work branch. Now the code is placed in correct (I hope :-) ) folders and you can compile PowerShell and run xUnit and performance tests.
Next step is to create a comparer based on simple folded strings. Then we could try it in PowerShell engine.
Update: comparer was pushed.
Great stuff, @iSazonov; thanks also for the helpful collection of quotes from the Unicode standard.
I've never had to deal with Unicode normalization problems in the context of .NET (e.g., that accented letter ü
can be represented either as single code point U+00FC
or as two code points U+0075
+ U+0308
- u
followed by combining diacritic ̈
): do they come into play here?
There is case mapping and case folding.
Case mapping is a conversion like ToLowerCase() and ToUpperCase(). Case mapping is _culture aware_.
Case folding is still another conversion and it is _culture-independent_. The last feature is key for us here.
Case folding can be full and simple.
Full case folding can change string length (i.e. replace 2-byte to 4 byte code point).
Simple case folding always preserves the string length. This is a key for us here too.
Case folding is 1:1 char replacement - no chars added, no chars removed.
Unicode standard exposes CaseFolding.txt file containing a table for case folding (full and simple in one table). We parse the file (we created a PowerShell script for this) and one-time generate C# file with two static char arrays - first for 2-byte code points and second for 4-byte code points - this is important because chars in C# is Utf16 2-byte values, it also means that we have to process surrogates. These array is used for one-step very fast char conversions. Surrogates require additional calculations but surrogates is very rare and the calculations is very simple - this should not degrade performance.
Thanks, @iSazonov.
Minor addition: _default_ case folding - which is the one we're concerned with - is culture-invariant, but case folding _can_ be culture-sensitive (_tailored_), if needed (the T
status values in the case-folding table, and additional language-specific rules not covered there).
While the _normalization_ problem I've mentioned is unrelated to casing per se, I still wonder if it affects us here:
# Composed normalization form: 'ö' as a *single* code point.
$oUmlautNFC = [string] [char] 0xf6
# Decomposed normalization form: 'ö' as *two* code points:
# The base character, 'o', followed by a combining diacritic.
$oUmlautNFD = -join [char[]] (0x6f, 0x308)
# String comparison *does* recognize the equivalence, even though
# the strings have distinct code points:
$oUmlautNFC -eq $oUmlautNFD # -> $True!
Does your code recognize this kind of equivalence too?
I would expect that it doesn't need to, @mklement0. In that example, let's consider how case-folding would treat the characters:
I would expect that folding the single code-point variant would consider both 0xD6
and 0xF6
(upper and lowercase code points of Ö
) the same.
In the case of the regular character and the combining diacritic, it only needs to fold case for the regular character, and the combining diacritic would be unaffected.
@vexx32: That applies if _both_ input strings are in the _same_ normalization form (and only differ by case), but the question is what happens if they're not:
With -eq
(.Equals()
with option InvariantCultureIgnoreCase
- also works with CurrentCultureIgnoreCase
), the equivalence is recognized:
# NFC 'Ö' (LATIN CAPITAL LETTER O WITH DIAERESIS)
$oUmlautNFC_Upper = [string] [char] 0xd6
# Compare uppercase (single-code-point) NFC Ö to lowercase (2-code-point) NFD ö
$oUmlautNFC_Upper -eq $oUmlautNFD # -> !True!
By contrast - much to my surprise - Select-String
does _not_ recognize this equivalence:
$oUmlautNFC_Upper | Select-String $oUmlautNFD # !! NO OUTPUT
Is it by design that Select-String
compares strings differently? (Never looked at the internals).
Either way, while the answer may be that we cannot handle normalization differences for performance reasons, we need to have clarity on that decision and to document it accordingly.
@mklement0 Your question is open question. I think the question is "What is PowerShell identifier" in Unicode terms and our answer should be based on http://www.unicode.org/reports/tr31/
As I mentioned above there is a rule that we could use for PowerShell:
UAX31-R5. Equivalent Case-Insensitive Identifiers: To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Any two identifiers that have the same case-folded form shall be treated as equivalent by the implementation.
There's a chapter on normalization in the document you link to: http://www.unicode.org/reports/tr31/#normalization_and_case
Have only glanced at it at this point, but this caught my eye:
Implementations that take normalization and case into account have two choices: to treat variants as equivalent, or to disallow variants.
So it sounds like disallowing variants (making no attempt to recognize different normalization forms as equivalent) _is_ an option - again, a decision we should be clear about and document.
Yes, the standard describes all common possibilities. And it says about normalization of identifiers. This allows "human" identifiers. I believe it make sense for DSL or AI and make no sense for strong typed languages like C# and perhaps for PowerShell too - I'd very wonder if PowerShell consider "$Ёлка" and "$Елка" as equal although both are Christmas tree (the sample is only "display difference"! not Unicode specific).
I agree that for _identifiers_ disallowing normalization variants makes sense.
(As an aside: your Ёлка
example is not normalization-related and also not covered by the case-folding table - is it an example of language-specific case-folding? Or is it something more informal?)
Select-String
- and anywhere outside input is processed - is a different story, however, and we already have a split there, based on CoreFx behavior:
[string]
IS normalization-aware[regex]
is NOT normalization-awareThat explains why $oUmlautNFC_Upper | Select-String $oUmlautNFD
(regex) does _not_ match, whereas $oUmlautNFC_Upper | Select-String $oUmlautNFD -SimpleMatch
(string) _does_.
In Russian we can freely replace Ё - 0401 with Е - 0415. (For lower case too). But "Ёлка" -eq "Елка" returns false - I don't know why. Perhaps because of Unicode backword compatibility and stability only.
Update: Unicode equivalence is defined for the _same_ abstract character. Ё
and Е
is different chars.
I don't wonder about [string]
. By default it use a culture overload for Equals bool Equals(string value, System.StringComparison comparisonType)
which makes p/invoke of OS culture API.
I am happy that [rexeg]
is not normalization-aware - it makes things simple for us.
I've been going to do this for a long time #8180 for Select-String -SimpleMatch
.
Thanks for the background on Е
/ Ё
. There _are_ cases in the official case-folding table where distinct characters are treated the same, and there are additional, language-specific ones in SpecialCasing.txt, but I guess we needn't worry about cases that are defined in neither.
-eq
consistently uses InvariantCultureIgnoreCase
, from what I gather, which aligns with PowerShell's overall policy of using the invariant culture in explicit string operations.
It is unfortunate that -like
, -match
/ -replace
/ split
and Select-String
all use culture-_sensitive_ comparison (which is documented for Select-String
, but not the others).
The difference rarely matters, but there is the infamous Turkish "I" problem in the tr
and az
cultures.
[The following is at least partially incorrect.]
~Even more curiously, -match
/ -replace
/ split
apply the Turkish / Azerbaijani case mapping irrespective of what culture is in effect: "ı" -match "I"
yields $True
in _any_ culture (don't try this on macOS, where the Turkish rules are fundamentally unsupported).~
~Haven't looked at the implementation yet - any idea what's going on?
I'm also baffled that on macOS -match
/ -replace
/ split
do not support Turkish case rules, seemingly due to lack of OS support, whereas Select-String
(without -SimpleMatch
) does.~
Seems the Turkish case _mapping_ is not related with case _folding_. I believe simple case folding works with Turkish well because the folding is intended to be language-neutral. This is just the reason why we can get a win in regex and PowerShell identifier comparisons internally and can not in -match / -replace / split.
Seems the Turkish case mapping is not related with case folding.
It _is_ part of the case-folding table, but of the _language-specific_ variety: status T
in CaseFolding.txt, where T
, I presume, stands for _tailored_.
# The mappings with status T can be used or omitted depending on the desired case-folding
# behavior. (The default option is to exclude them.)
[...]
0049; T; 0131; # LATIN CAPITAL LETTER I
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
To reiterate: case folding can be language-neutral _or_ language-specific (tailored), _as needed_.
But I agree that for identifiers it definitely makes sense to stay language-neutral.
get a win in regex and PowerShell identifier comparisons internally and can not in -match / -replace / split.
Understood re identifiers, but what _regex_ used do you mean in this case, given that regexes are also used for -match
/ -replace
, / -split
?
It is part of the case-folding table, but of the language-specific variety: status T in CaseFolding.txt, where T, I presume, stands for tailored.
-match / -replace, / -split
-imatch
.Today I measured Import-Csv
on large file (vs 6.1.0 RTM ) and discover up to 8% win with new SimpleFoldedStringComparer
.
I opened new issue in CoreFXlab repo to start a work to add the enhancement in .Net Core.
The main problem now is code performance. A large cache is now used in my alfa code and it is fast. The proposed version with a 3-level cache (size in 10x times less) is too slow (2-3 times). Any help is appreciated how make the code compact and fastest (comparable to ToLowerInvariant() for folding and OrdinalIgnoreCase for comparer).
@vexx32 @HumanEquivalentUnit @mklement0 @daxian-dbw
@SteveL-MSFT I am ready to create a nuget package in CoreFXlab https://github.com/dotnet/corefxlab/issues/2610
Could we use the package after it will created and tested?
@iSazonov I think we can use it with 6.3-preview.1
@SteveL-MSFT Thanks! Currently the code was merged in Corefxlab but nuget package is not still created. Targeting is to .Net Core 3.0 only. Now there need to review _public_ api and to make performance optimizations. This is a work for MSFT experts. It would be nice if you could sync internally with Core team.
As for PowerShell we will need approval using SCF because folding is not full equivalent of ToLowerCase() and correspondingly there is a littel difference with OrdinalIgnoreCase.
Moving out of 7.0 as it seems .NET 5 may support this
Yes, they plan to move ICU internally in Core and unify Core behavior (to fall back to OS with p/invoke sometimes). In the case OrdinalIgnoreCase will implement the SCF, I hope with great performance.
While waiting Core 5.0 we could resolve #8180 without risk to conflict with Core.
Most helpful comment
First alfa code performance results (yes, I got the code fast!):
/ * Summary *
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.10240.17146 (1507/RTM/Threshold1)
Intel Core i3-4150 CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
Frequency=3410090 Hz, Resolution=293.2474 ns, Timer=TSC
.NET Core SDK=2.1.403
[Host] : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
DefaultJob : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
| Method | StrA | StrB | Mean | Error | StdDev | Median | Scaled | ScaledSD | Gen 0 | Allocated |
|----------------- |------------- |------------ |----------:|---------:|----------:|----------:|-------:|---------:|-------:|----------:|
| ToLowerRussian | CaseFolding1 | CaseFolding | 73.28 ns | 2.129 ns | 3.120 ns | 72.32 ns | 0.99 | 0.06 | 0.0660 | 104 B |
| ToLowerInvariant | CaseFolding1 | CaseFolding | 73.85 ns | 1.660 ns | 3.978 ns | 72.09 ns | 1.00 | 0.00 | 0.0660 | 104 B |
| StringFold | CaseFolding1 | CaseFolding | 65.15 ns | 1.372 ns | 3.068 ns | 63.81 ns | 0.88 | 0.06 | 0.0660 | 104 B |
| | | | | | | | | | | |
| ToLowerRussian | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 188.23 ns | 5.924 ns | 17.185 ns | 177.66 ns | 1.09 | 0.11 | 0.0660 | 104 B |
| ToLowerInvariant | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 173.58 ns | 3.699 ns | 8.646 ns | 169.48 ns | 1.00 | 0.00 | 0.0660 | 104 B |
| StringFold | ЯЯЯЯЯЯЯЯЯЯЯ1 | ЯЯЯЯЯЯЯЯЯЯЯ | 77.15 ns | 1.580 ns | 2.054 ns | 76.17 ns | 0.45 | 0.02 | 0.0660 | 104 B |