Runtime: String comparer for sorting numeric strings logically

Created on 11 Jan 2015 · 124Comments · Source: dotnet/runtime

Rationale

For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10".

Using the Ordinal StringComparer to sort the list one would get

Windows 10
Windows 7

but the desired ascending logical sort would be

Windows 7
Windows 10

Proposed API

 namespace System {
     public class StringComparer {
+        public static StringComparer Create(CultureInfo culture, CompareOptions options);
     }
 }
 namespace System.Globalization {
     public enum CompareOptions {
+        NumericOrdering = 0x00000020
     }
 }

Usage

var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(StringComparer.Logical); // List is now "Windows 7", "Windows 10"

This would also be good for sorting strings containing IP addresses.

Details

Logical is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical)
LogicalIgnoreCase is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical | CompareOptions.IgnoreCase)
Non-numeric sequences will be evaluated with the culture provided.
Numeric sequences will be determined by the result of Char.IsDigit.
All UTF-16 digits will be supported and are manually parsed using Char.GetNumericValue.
Only positive integral values without digit separators will be supported directly.
Numbers will be treated as ulongs. Logic for overflows will have to be considered.
The string Windows 8.1 would be considered 4 sequences. The Windows would be a string sequence, the 8 would be a numeric sequence, the . would be another string sequence, and the 1 would be another numeric sequence.
This API could later be expanded to include support for allowing signs, decimals, and digit separators through the use of overloads accepting a NumberStyles parameter.
When a numeric and string sequence are considered at the same time the numeric sequence always comes before the string sequence so when sorting the following list, "a", "7" the number 7 will be sorted before the letter a.
Existing methods that take a CompareOptions parameter as input will need to be updated to support the new Logical member.

Open Questions

Should CompareOptions.Logical be implemented as the flag option SORT_DIGITSASNUMBERS to the dwCmpFlags parameter of CompareStringEx? Using it's implementation should be more efficient but later expanding support for NumberStyles will require a re-implementation with matching behavior.

Updates

Added Logical and LogicalIgnoreCase properties.
Added support for all UTF-16 digits.
Added more CreateLogical overloads to match the Create method.
Added retrieval of the NumberFormatInfo from the StringComparer parameter when not explicitly provided and is a CultureAwareComparer.
Removed CreateLogical overloads that matched the Create method.
Switched to only supporting positive integral values without digit separators.
Added consideration of comparing a numeric sequence with a string sequence.
Added the flag member CompareOptions.Logical and changed CreateLogical to be just an overload of Create.

api-approved area-System.Runtime up-for-grabs

Source

Peter-Juhasz

👍17

Most helpful comment

I think, logically, the value of 011 and 11 should be equivalent, but they would then be sorted among themselves (likely ordinally), to avoid mixing a011b, a11c, a011d, etc

tannergooding on 30 Aug 2019

👍3

All 124 comments

That's a good suggestion. Would you be willing to make an API proposal?

terrajobst on 13 Jan 2015

I think conceptually this is a good idea, but there are a lot of details that need to be worked out. It seems like such a comparer might want to take options which control how it behaves. For example, are commas in numbers ok? Spaces? Decimal points? How are different cultures handled?

MgSam on 22 Jan 2015

👍3

While it's easy to define the behavior on Windows systems as following StrCmpLogicalW, this could be problematic in the long run.

The precise behavior of StrCmpLogicalW is not only undocumented, but it explicitly states that it can change in future releases of Windows.
The behavior of this method is not case-sensitive, and there is no case-sensitive method which is otherwise equivalent. It would be strange, for example, to provide StringComparer.LogicalIgnoreCase without providing StringComparer.Logical.

I believe you could define a reasonable API proposal in terms of a lexicographical ordering of substrings divided at numeric boundaries. Sequences of non-digits would be compared using StringComparer.CurrentCulture or StringComparer.CurrentCultureIgnoreCase, and sequences of digits would be compared according to their numeric values.

sharwell on 24 Jan 2015

The comparer should require an IFormatProvider or NumberFormatInfo; which can be passed in the constructor or it should use the current one from the owning thread.

@terrajobst This is a problem I am facing in almost every business application: proper ordering even if the strings contain numbers. I don't have time to make a PR or document it well, but I would be very happy to see it in the new releases of the core framework. I think the use case is clear enough.

Peter-Juhasz on 26 Jan 2015

This is a need that I have often faced. If @Peter-Juhasz does not have the time to create the API proposal or implement it, I would be interested in taking it on.

eatdrinksleepcode on 26 Jan 2015

@Peter-Juhasz There is probably no need to consider the NumberFormatInfo. As stated in the documentation for NumberFormatInfo.NativeDigits:

The character set that is specified by the NativeDigits property has no effect on parsing or formatting operations. Only the Basic Latin digits 0 (U+0030) through 9 (U+0039) are used when formatting or parsing numeric values or date and time values.

If StringComparer.Logical and StringComparer.LogicalIgnoreCase were defined to use StringComparer.CurrentCulture and StringComparer.CurrentCultureIgnoreCase for comparing non-digits, the culture used for the comparison is clear.

On a side note, even if used there is no need to explicitly provide the NumberFormatInfo to the comparer since the current culture information already provides it.

sharwell on 26 Jan 2015

Looks in Windows, CompareStringEx can be used with the flag SORT_DIGITSASNUMBERS. I didn't look at ICU if it support such functionality too.

tarekgh on 16 Aug 2016

I believe there is few things which need to be worked on here:

check if there is linux implementation existing and if it behaves similarly to CompareStringEx with SORT_DIGITSASNUMBERS
should we provide all permutations for all comparers now or perhaps generate comparer based on some settings (perhaps we already have something like that which we might be able to inject this log in and I might not be aware of it)

I think this would be a nice addition but it needs some work.

krwq on 28 Nov 2016

should we provide all permutations for all comparers now or perhaps generate comparer based on some settings (perhaps we already have something like that which we might be able to inject this log in and I might not be aware of it)

I don't think we need to do that. it would be just option (or flag) passed with the CompareOptions and let the underlying OS handle it.

I think this would be a nice addition but it needs some work.

I agree with that

tarekgh on 28 Nov 2016

I'd see value in both an OS-dependent sort for cases where you want to match the OS (file listing, for example), and an OS-independent .NET sort that natively understands both numbers and dates. A real life example is needing month names to be correctly sorted, even when that's the only date-related part of the string.

The OS-dependent sort should be super simple to implement. After that's done I think it would still be worth putting design time in towards an independent managed sort as well, to figure out if and how people's needs differ.

jnm2 on 30 Nov 2016

Rationale

For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10".

Using the Ordinal StringComparer to sort the list one would get

Windows 10
Windows 7

but the desired ascending logical sort would be

Windows 7
Windows 10

Proposed API

 namespace System {
     public class StringComparer {
+        public static StringComparer Create(CultureInfo culture, CompareOptions options);
+        public static StringComparer Logical { get; }
+        public static StringComparer LogicalIgnoreCase { get; }
     }
 }
 namespace System.Globalization {
     public enum CompareOptions {
+        Logical = 0x00000020
     }
 }

Usage

var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(StringComparer.Logical); // List is now "Windows 7", "Windows 10"

This would also be good for sorting strings containing IP addresses.

Details

Logical is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical)
LogicalIgnoreCase is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical | CompareOptions.IgnoreCase)
Non-numeric sequences will be evaluated with the culture provided.
Numeric sequences will be determined by the result of Char.IsDigit.
All UTF-16 digits will be supported and are manually parsed using Char.GetNumericValue.
Only positive integral values without digit separators will be supported directly.
Numbers will be treated as ulongs. Logic for overflows will have to be considered.
The string Windows 8.1 would be considered 4 sequences. The Windows would be a string sequence, the 8 would be a numeric sequence, the . would be another string sequence, and the 1 would be another numeric sequence.
This API could later be expanded to include support for allowing signs, decimals, and digit separators through the use of overloads accepting a NumberStyles parameter.
When a numeric and string sequence are considered at the same time the numeric sequence always comes before the string sequence so when sorting the following list, "a", "7" the number 7 will be sorted before the letter a.
Existing methods that take a CompareOptions parameter as input will need to be updated to support the new Logical member.

Open Questions

Should CompareOptions.Logical be implemented as the flag option SORT_DIGITSASNUMBERS to the dwCmpFlags parameter of CompareStringEx? Using it's implementation should be more efficient but later expanding support for NumberStyles will require a re-implementation with matching behavior.

Updates

Added Logical and LogicalIgnoreCase properties.
Added support for all UTF-16 digits.
Added more CreateLogical overloads to match the Create method.
Added retrieval of the NumberFormatInfo from the StringComparer parameter when not explicitly provided and is a CultureAwareComparer.
Removed CreateLogical overloads that matched the Create method.
Switched to only supporting positive integral values without digit separators.
Added consideration of comparing a numeric sequence with a string sequence.
Added the flag member CompareOptions.Logical and changed CreateLogical to be just an overload of Create.

TylerBrinkley on 23 Feb 2017

we are limited to what the underlying OS can give us as functionality support, this need to be investigated if both Windows and Linux can provide this functionality before we proceed here.

tarekgh on 23 Feb 2017

I don't think this should necessarily follow the Windows functionality but be entirely implemented in .NET.

TylerBrinkley on 23 Feb 2017

👍1

I don't think this should necessarily follow the Windows functionality but be entirely implemented in .NET.

This is not easy to do especially we don't carry the needed collation tables needed to do this.

if it is just ASCII characters that is possible. if you have more complicated scenario that will be challenging except if you go and parse the string split them. this will not be nice performance wise.

in short, Linguistic comparisons will be challenging.

tarekgh on 23 Feb 2017

You're probably right about the performance. I'm currently using an implementation that splits the string which works for me but I understand it may not be quick enough for the framework due to the string allocations.

TylerBrinkley on 23 Feb 2017

it is not regarding the string allocations. we can pin the strings and avoid the allocations. it is about parsing the string and then compare each part and then handle the digits comparisons. note that, such functionality has to support all digits for all languages too and not just 0~9.

tarekgh on 23 Feb 2017

Ugh, supporting digits for all languages would be a pain as you'd have to go by the NativeDigits property on NumberFormatInfo. I suppose a fast path could be added for 0-9 and a slow path for other languages.

After looking at my implementation I think char.IsDigit would work just fine.

TylerBrinkley on 23 Feb 2017

For some scenarios I'd prefer an OS-dependent logical sort, for others an invariant logical sort.
I'd really love something like this:

c# var comparer = StringComparer.TryGetOSLogical() ?? StringComparer.CreateLogical(StringComparer.CurrentCulture);

I want to match the feel of the native OS if possible.

jnm2 on 23 Feb 2017

👍1

Ugh, supporting digits for all languages would be a pain as you'd have to go by the NativeDigits property on NumberFormatInfo. I suppose a fast path could be added for 0-9 and a slow path for other languages. After looking at my implementation I think char.IsDigit would work just fine.

We don't have to use NumberFormatInfo.NativeDigits because this will make it work with only specific culture while digits in all languages should just work regardless of the chosen culture because digits doesn't have linguistic context (in most cases).

char.IsDigit is not helpful either because you need to know the value of the digit and not just check if it is digit to be able to support the needed functionality.

I am not trying to push back here but I am just explaining the complexity we'll have if we need to do it without OS help.

var comparer = StringComparer.TryGetOSLogical()
?? StringComparer.CreateLogical(StringComparer.CurrentCulture);

I hope StringComparer.TryGetOSLogical() exist and work all the time but the logic to fallback will create some other issues regarding the difference in the behavior if OS support it and if not. This may be OK but we have to understand that.

tarekgh on 23 Feb 2017

I understand and thank you for going through the complexities with me. In my implementation I TryParse the substring number passing in the NumberFormatInfo, why would that not work? Performance?

TylerBrinkley on 23 Feb 2017

I understand and thank you for going through the complexities with me. In my implementation I TryParse the substring number passing in the NumberFormatInfo, why would that not work? Performance?

Int32.TryParse with NumberFormatInfo will not parse native digits.
also if this work, this means if you have en-US NumberFormatInfo you'll not be able to compare strings containing say Urdu digits. and that is what I think we should have all native numbers work regardless of the cultures because numbers are well identified in the Unicode.

tarekgh on 23 Feb 2017

Int32.TryParse with NumberFormatInfo will not parse native digits.

If Int32.TryParse doesn't support native digits, why should this? As far as I can tell even the native StrCmpLogicalW doesn't support native digits.

TylerBrinkley on 24 Feb 2017

👍1

If Int32.TryParse doesn't support native digits, why should this? As far as I can tell even the native StrCmpLogicalW doesn't support native digits.

Because the functionality we are talking about is regarding the collation which support all native digits. Also there is some requests we need to support native digits in the formatting and parsing APIs but we didn't get into it yet.

In short, if we support only 0~9 digits, sooner or later we'll get request to support the native digits. so it will be good to have a good plan in our support before we jump doing it.

tarekgh on 24 Feb 2017

@tarekgh do you think it's valuable to mark this as Api ready for review and then if it is approved then try to sort out implementation details?

joperezr on 8 Mar 2017

do you think it's valuable to mark this as Api ready for review and then if it is approved then try to sort out implementation details?

No, we don't have a good way to implement this at least on Windows today. without figuring out the implementation story we cannot proceed with the API.

tarekgh on 8 Mar 2017

@tarekgh

...we don't have a good way to implement this at least on Windows today...

I'm not sure I agree with this completely. It should be possible to implement a comparer which uses StringComparer.CurrentCulture for substrings that do not contain the code points [0-9], and a numeric comparison based on int.Parse for sequences containing only those code points.

The API could be defined to simply be two new properties of the StringComparer class.

StringComparer.Logical: Uses CurrentCulture for sequences of non-digits
StringComparer.LogicalIgnoreCase: Uses CurrentCultureIgnoreCase for sequences of non-digits

Since neither of these properties is an invariant string comparer, it is allowable that future updates make changes to accommodate various cultures, including but not limited to changes in the way native digits are handled.

sharwell on 19 Apr 2017

👍3

for substrings that do not contain the code points [0-9], and a numeric comparison based on int.Parse for sequences containing only those code points.

This will not work with non Latin digits. if you use other numerals from other languages that will not work. please read the discussion in this issue.
other concern from the proposed implementation is the performance. splitting the string and then comparing parts will be expensive.
Also should we consider cases like (1,0000, 1.1, ...etc.)? did you think about that?

tarekgh on 19 Apr 2017

char.IsDigit is not helpful either because you need to know the value of the digit and not just check if it is digit to be able to support the needed functionality.

@tarekgh I've updated my proposal to specify the use of the Char.GetNumericValue method to retrieve its value. I've also added @sharwell 's Logical and LogicalIgnoreCase properties.

TylerBrinkley on 19 Apr 2017

I think for the API interface we can do the following:

Add a new value to CompareOptions
Then we can use the following in StringComparer API to return the comparer
https://github.com/dotnet/corefx/blob/master/src/System.Globalization.Extensions/src/System/Globalization/Extensions.cs#L13

The benefit here is this will enable string comparison APIs too and not just the string comparer.

tarekgh on 19 Apr 2017

This will not work with non Latin digits. if you use other numerals from other languages that will not work. please read the discussion in this issue.

Yes, this behavior is intentional.

other concern from the proposed implementation is the performance. splitting the string and then comparing parts will be expensive.

There is no need to split the string. You can identify the numeric portions easily, with will also reveal the sequences of non-digits. The sequences of non-digits can be compared using CultureInfo.CurrentCulture.CompareInfo.Compare, which can operate on explicit substrings without [external] allocation.

Also should we consider cases like (1,0000, 1.1, ...etc.)? did you think about that?

Yes, the , and . elements seen here do not fall in the character range [0-9]. The comparison would treat 1,0000 as three sequences: 1 (numeric), , (non-numeric), 0000 (numeric). The string 1.1 would likewise be treated as three sequences. In my proposal, the current culture has no impact on the treatment of any individual code point as a digit or non-digit. Note that the treatment of "fractional digits" in this manner (ignoring them) is consistent with the way Windows already performs logical sorting in Explorer (for example).

My goal is to provide a clean and immediately usable API without restricting its ability to meet needs in the future. I do not believe it needs to be complicated in order to achieve this outcome.

sharwell on 19 Apr 2017

👍2

If we only use the CompareOptions enum then we'll be limited to only supporting the base integer case without sign, decimal, or thousands support. Also, it would be far less visible and hardly anyone would use it that would otherwise.

With my proposal the user can specify whether to AllowLeadingSign, AllowTrailingSign, AllowDecimalPoint, and AllowThousands using the NumberStyles flag enum parameter and also what strings to use for the NegativeSign, PositiveSign, NumberDecimalSeparator, and NumberGroupSeparator using the NumberFormatInfo parameter. And being listed in the StringComparer class makes it very easy to discover.

TylerBrinkley on 19 Apr 2017

@sharwell

Yes, this behavior is intentional.

why you think this is good?

There is no need to split the string.

I didn't mean split the string. I meant you need to search the string for the numbers first and then compare the part before the number then compare the number and then repeat the process. that what I meant will be expensive.

My goal is to provide a clean and immediately usable API without restricting its ability to meet needs in the future. I do not believe it needs to be complicated in order to achieve this outcome.

I understand but in same time we need to get the right design to serve us in the future if we need to expand the functionality.

tarekgh on 19 Apr 2017

@TylerBrinkley please move the proposal to the top of the issue.

@joperezr we can mark this as ready for review according to the latest discussion.

tarekgh on 20 Apr 2017

One last note, for the

C# public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle, NumberFormatInfo numberFormatInfo);

I think this will be a little bit confusing as we can can create StringComparer for specific culture that match the desired number format. I know that could have have a comparer for specific culture and in same time you want to treat the numbers from other culture but that is unlikely and it would be good just not include it till we find a real need for it. other than that the proposal looks reasonable to me

tarekgh on 20 Apr 2017

How do you recommend we get the NumberFormatInfo though? For a CultureAwareComparer we would need to add another field initialized from the constructor and for non-CultureAwareComparer's we would need to use a default such as CurrentInfo or else we would have to add a NumberFormatInfo or IFormatProvider property to StringComparer which seems wrong.

Also since there is already a Create(CultureInfo culture, bool ignoreCase) method, should we also add overloads for those parameters, e.g. CreateLogical(CultureInfo culture, bool ignoreCase) and CreateLogical(CultureInfo culture, bool ignoreCase, NumberStyles numberStyle)?

TylerBrinkley on 20 Apr 2017

I've updated the proposal to retrieve the NumberFormatInfo if not explicitly provided from CultureAwareComparer's and for other comparers a default of NumberFormatInfo.CurrentInfo will be used.

I've also added CreateLogical overloads for the parameters included in the Create(CultureInfo culture, bool ignoreCase) method.

@TylerBrinkley please move the proposal to the top of the issue.

I don't have access to do that.

TylerBrinkley on 20 Apr 2017

How do you recommend we get the NumberFormatInfo though

While creating the string comparer we'll store the corresponding NFI object. for example, if you are creating CurrentCulture comparer we'll get NFI from current culture and use it. this will apply on culture aware comparer. so we should be find here without passing NFI to the constructors of String comparers.

CreateLogical(CultureInfo culture, bool ignoreCase)

please don't add any overloads to this one as we have introduced a new signature that is taking CompareOptions instead of ignoreCase bool. we should advise people to use this one instead.

@TylerBrinkley please update the proposal accordingly and I'll move it to the top after you do that. thanks.

tarekgh on 20 Apr 2017

So do you want me to remove those overloads and add ones with CompareOptions instead of the ignoreCase bool or just remove those overloads?

TylerBrinkley on 20 Apr 2017

I suggest to have only

C# public static StringComparer CreateLogical(StringComparer stringComparer); public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle); public static StringComparer Logical { get; } public static StringComparer LogicalIgnoreCase { get; }

The signature taking StringComparer (the first and second one) is covering the scenario of culture aware comparer which can be created using ignoreCase or CompareOptions. Also I removed the one which is taking NFI.

We still have the freedom in the future to add more signatures if proven we really need them.

What you think?

tarekgh on 20 Apr 2017

Also, do you think naming Logical is good here? I understand Windows API StrCmpLogicalW called logical but I am not sure if this is really easy to understand what it means. anyway, we may get more feedback from the design reviewer as needed if we don't come up with better naming now.

tarekgh on 20 Apr 2017

While that's acceptable I don't necessarily like that a user can't use an Ordinal StringComparer with an NFI that's not the current cultures.

Also, do you think naming Logical is good here?

I can't think of anything better but naming it something different is definitely worth considering.

TylerBrinkley on 20 Apr 2017

While that's acceptable I don't necessarily like that a user can't use an Ordinal StringComparer with a culture's NFI besides the current.

This still can be achieved by create culture aware comparer with CompareOptions.Ordinal and use the first or second signature.

tarekgh on 20 Apr 2017

What's the API to create a culture aware comparer with CompareOptions.Ordinal? Is that not implemented yet?

From a brief search Natural is the only other reasonable name I could find.

TylerBrinkley on 20 Apr 2017

What's the API to create a culture aware comparer with CompareOptions.Ordinal? Is that not implemented yet?

We have it here:
https://github.com/dotnet/corefx/blob/master/src/System.Globalization.Extensions/src/System/Globalization/Extensions.cs#L13

But I can see it will complicate our implementation to extract the NFI. So I am OK to return back the signatures that taking NFI.

Also we need to define the behavior when someone using
public static StringComparer CreateLogical(StringComparer stringComparer);
public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle);

We may be able to construct the NFI from the comparer if it is one of framework comparer but we'll not be able to construct it if the caller used custom comparer. so we may need to add overloads taking NFI for the methods taking StringComparer.

C# public static StringComparer CreateLogical(StringComparer stringComparer); public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle); public static StringComparer CreateLogical(StringComparer stringComparer, NumberFormatInfo nfi); public static StringComparer CreateLogical(StringComparer stringComparer, , NumberFormatInfo nfi, NumberStyles numberStyle); public static StringComparer Logical { get; } public static StringComparer LogicalIgnoreCase { get; }

tarekgh on 20 Apr 2017

I think StringComparer.Logical has the correct connotations. You could also try for StringComparer.Intelligent or StringComparer.Smart.

jnm2 on 20 Apr 2017

And we'll use Invariant.NumberFormatInfo when the caller not passing NFI to us

tarekgh on 20 Apr 2017

I think Logical has the correct connotations. You could also try for Intelligent.

let's stick with Logical for now

tarekgh on 20 Apr 2017

👍2

What, you don't like StringComparer.BestestBetterness? 😆

jnm2 on 20 Apr 2017

😄2

We may be able to construct the NFI from the comparer if it is one of framework comparer but we'll not be able to construct it if the caller used custom comparer. so we may need to add overloads taking NFI for the methods taking StringComparer.

NumberFormatInfo is only needed when using NumberStyles because the default value is specified as None, meaning we don't need to know what the signs, group separators, or decimal separators are. Thus we only need the already proposed method.

c# public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle, NumberFormatInfo numberFormatInfo);

I've removed the CreateLogical overloads that match the Create method's parameters.

TylerBrinkley on 20 Apr 2017

NumberFormatInfo is only needed when using NumberStyles because the default value is specified as None,

NFI is needed to parse the numbers in the strings. right? how you know the decimal point, thousand separator...etc. when parsing the number?

tarekgh on 20 Apr 2017

With the default NumberStyles.None it only allows for positive integral values with no thousand separator, thus the NFI isn't needed.

TylerBrinkley on 20 Apr 2017

Ok. I missed this part. so I think we have the proposal now. I'll move it up

tarekgh on 20 Apr 2017

Note, I didn't update the proposal to use NumberFormatInfo.InvariantInfo vs NumberFormatInfo.CurrentInfo.

TylerBrinkley on 20 Apr 2017

please update it in your comment and I'll re-apply it at the top.

tarekgh on 20 Apr 2017

Done. Just curious, why Invariant vs Current?

TylerBrinkley on 20 Apr 2017

Actually current make more sense. you may update your comment. we should be good now.

thank you.

tarekgh on 20 Apr 2017

I don't like that the proposed CreateLogical method takes a StringComparer argument. This will be difficult to implement without additional allocations (or at least copying). The following overload should be enough:

public static StringComparer CreateLogical(CultureInfo culture, boolean ignoreCase);

I'm also not convinced it makes sense to add the overload which takes a NumberStyles argument.

sharwell on 21 Apr 2017

@sharwell extra allocation is good point.

if we have

```c#
public static StringComparer CreateLogical(CultureInfo culture, boolean ignoreCase);

This will not include the ordinal case. so we need to have a way to provide that. we may consider adding the following to your suggestion

```c#
public static StringComparer CreateLogical(CompareOptions);

And we don't have to provide any NFI parameters.

I'm also not convinced it makes sense to add the overload which takes a NumberStyles argument.

How you think we should support the number parsing? should we just allow the number sign, decimal points, thousand separators...etc. or you are suggesting to only support int numbers with no delimiters (or signs)?

tarekgh on 21 Apr 2017

This will not include the ordinal case

Do you have an example of where the ordinal case would provide some feature that the invariant culture wouldn't work equally well for?

or you are suggesting to only support int numbers with no delimiters (or signs)?

Correct. Prior implementations I've seen, including many of the ones that led people to want this comparer out-of-the-box, are based on integers and non-integers. The behavior isn't absolutely precise in all cases, but overall it makes sense more times than it doesn't.

sharwell on 21 Apr 2017

👍3

I'd be okay with just the simple case being supported initially and later if it's deemed that support for NumberStyles should be added then it can be added to the API pretty easily.

TylerBrinkley on 21 Apr 2017

I also like the uniformity with the existing Create method.

TylerBrinkley on 21 Apr 2017

Do you have an example of where the ordinal case would provide some feature that the invariant culture wouldn't work equally well for?

```C#
var s1 = "Strasse";
var s2 = "Straße";

        Console.WriteLine(s1.Equals(s2, StringComparison.Ordinal));           //false
        Console.WriteLine(s1.Equals(s2, StringComparison.InvariantCulture));  //true

So I think all what we need is just the following API.

```C#
        public static StringComparer CreateLogical(CultureInfo culture, CompareOptions);

Note that we don't even need the following APIs. we can have them only for convenience which might be not initially necessary.

C# public static StringComparer Logical { get; } public static StringComparer LogicalIgnoreCase { get; }

We should clarify the expectation of the numbers would work for integers only (with no signs) and later we can think supporting more complicated cases as needed.

if you all agree, @TylerBrinkley please update the proposal one more time.

tarekgh on 21 Apr 2017

I just did a quick performance comparison between Ordinal vs Invariant and found that on my machine running under .NET 4.5 Invariant takes around 360% longer to compare than Ordinal. So that's definitely a reason Ordinal ought to be supported.

I'll update the proposal to include just the following APIs

c# public static StringComparer CreateLogical(CultureInfo culture, CompareOptions options); public static StringComparer Logical { get; } public static StringComparer LogicalIgnoreCase { get; }

TylerBrinkley on 21 Apr 2017

👍2

@tarekgh I've updated the proposal as requested.

TylerBrinkley on 21 Apr 2017

Thanks @TylerBrinkley

tarekgh on 21 Apr 2017

Sorry, to bring this up again but I just thought of another option. Instead of introducing CreateLogical we could add a CompareOptions.Logical flag member and simply add a Create overload. We could still later add an overload with a NumberStyles parameter if needed, however that overload will only make sense with CompareOptions.Logical. Any thoughts?

TylerBrinkley on 24 Apr 2017

@TylerBrinkley, if we added to CompareOptions we'll have to support this option in all APIs accepting CompareOptions. in the future if we need to support NumberStyles/NFI with this option, that will complicate the APIs taking CompareOptions. Or we can decide to have it in CompareOptions and then we can later add whatever extra params (NumberStyles/NFI) in StringComparer only. I like the later as it matches what I have originally proposed and you pushed back on it :-)

tarekgh on 24 Apr 2017

@tarekgh Agreed, if extra params are needed later then they should only be added in StringComparer. I like your idea of using CompareOptions better now as the proposed API using it is now defined in StringComparer as opposed to the GlobalizationExtensions class and is therefore more visible.

Something to consider though is that CompareOptions maps pretty directly to the dwCmpFlags parameter of CompareStringEx. If we'd ever want to support the native SORT_DIGITSASNUMBERS which is available in Windows 7+ it might get a bit confusing having both Logical and SortDigitsAsNumbers members.

TylerBrinkley on 24 Apr 2017

We may consider using SORT_DIGITSASNUMBERS if it gives us the desired behavior. this will be better for performance too. so CompareOptions.Logical can map to SORT_DIGITSASNUMBERS. This is implementation details which we can investigate more.

if all agree, we can update the proposal to only

C# namespace System { public class StringComparer { public static StringComparer Logical { get; } public static StringComparer LogicalIgnoreCase { get; } } }
with adding Logical option to CompareOptions.

or we can have only adding Logical option to CompareOptions and no change in stringComparer

tarekgh on 24 Apr 2017

If you didn't add StringComparer.LogicalIgnoreCase, I'd create a StringComparerEx.LogicalIgnoreCase for sanity. 😄

jnm2 on 24 Apr 2017

😄2

Will mapping Logical to SORT_DIGITSASNUMBERS present a problem as it's only available on Windows 7+? Also, for Unix environments is it already supported in the Libraries.GlobalizationInterop dll and where is that source code, see https://github.com/dotnet/coreclr/blob/f355c5789d5d7cd6ea10bee326c66f04641cfc90/src/mscorlib/src/Interop/Unix/System.Globalization.Native/Interop.Collation.cs#L22?

TylerBrinkley on 24 Apr 2017

Will mapping Logical to SORT_DIGITSASNUMBERS present a problem as it's only available on Windows 7+?

coreclr in general doesn't support pre-Windows 7. so we should be ok here.

tarekgh on 24 Apr 2017

It looks like the Unix source is located here. It doesn't look like it's currently supported.

Also, does SORT_DIGITSASNUMBERS support any UTF-16 digits or just 0-9?

TylerBrinkley on 24 Apr 2017

It looks like the Unix source is located here. It doesn't look like it's currently supported.

we can support it as I know ICU is supporting such functionality.

Also, does SORT_DIGITSASNUMBERS support any UTF-16 digits or just 0-9?

I'll try to find out.

@TylerBrinkley do you want update the proposal?

tarekgh on 24 Apr 2017

If we'd ever want to support NumberStyles in the future this would have to be entirely re-implemented in .NET with the exact same behavior as CompareStringEx. That seems unlikely to ever happen if it's implemented as SORT_DIGITSASNUMBERS.

@TylerBrinkley do you want update the proposal?

Sure.

TylerBrinkley on 24 Apr 2017

I double checked Windows behavior and it is supporting the following digits:

    0x0030, 0x0039,     // ASCII
    0x0660, 0x0669,     // Arabic-Indic
    0x06f0, 0x06f9,     // Eastern Arabic-Indic
    0x0966, 0x096f,     // Devanagari
    0x09e6, 0x09ef,     // Bengali
    0x0a66, 0x0a6f,     // Gurmukhi
    0x0ae6, 0x0aef,     // Gujarati
    0x0b66, 0x0b6f,     // Oriya
    0x0c66, 0x0c6f,     // Telugu
    0x0ce6, 0x0cef,     // Kannada
    0x0d66, 0x0d6f,     // Malayalam
    0x0e50, 0x0e59,     // Thai
    0x0ed0, 0x0ed9,     // Lao
    0x0f20, 0x0f29,     // Tibetan
    0x1040, 0x1049,     // Myanmar
    0x17e0, 0x17e9,     // Khmer
    0x1810, 0x1819,     // Mongolian
    0x1946, 0x194f,     // Limbu
    0xff10, 0xff19      // Fullwidth
    0x1d7ce-0x1d7d7 Math Bold
    0x1d7d8-0x1d7e1 Math Double
    0x1d7e2-0x1d7eb Math SansSerif
    0x1d7ec-0x1d7f5 Math SS Bold
    0x1d7f6-0x1d7ff Math Monosp

tarekgh on 24 Apr 2017

@tarekgh Great, thanks. I've updated the proposal to include the following API.

c# namespace System { public class StringComparer { public static StringComparer Create(CultureInfo culture, CompareOptions options); public static StringComparer Logical { get; } public static StringComparer LogicalIgnoreCase { get; } } } namespace System.Globalization { public enum CompareOptions { Logical = 0x00000020 } }

TylerBrinkley on 24 Apr 2017

I don't think we need to have

C# public static StringComparer Create(CultureInfo culture, CompareOptions options);

after last proposed change. right?

tarekgh on 24 Apr 2017

I'll keep what you proposed but we need to be clear all StringComparer changes are only there for convenience and discovery.

tarekgh on 24 Apr 2017

Perhaps not, but it could serve as a replacement for

c# public static StringComparer Create(CultureInfo culture, bool ignoreCase)

and is more visible to consumers.

TylerBrinkley on 24 Apr 2017

Another potential issue with implementing this by mapping Logical to SORT_DIGITSASNUMBERS came to me. CompareOptions.Ordinal and CompareOptions.OrdinalIgnoreCase are specially handled by the framework and aren't compatible with the other values of CompareOptions. As such we would not be able to handle Ordinal and Logical together.

TylerBrinkley on 25 Apr 2017

As such we would not be able to handle Ordinal and Logical together.

We'll need to allow having ordinal with Logical. we'll just need to provide some code handle such case. The good thing here is we'll not lose a perf as we don't have search for numbers before performing the comparison. We can just do both operations together.

tarekgh on 25 Apr 2017

I'm not sure how that would work but I'll trust you do.

TylerBrinkley on 25 Apr 2017

When we get to the implementation we can have more details. what I meant here is we'll handle the cases like (CompareOptions.Ordinal | CompareOptions.Logical) as special case that we will not call the OS for it but instead we'll provide the code that go through the string and compare it ordinarily and take care if we find numbers. if there is anything specific not clear I can try to clarify more.

tarekgh on 25 Apr 2017

Okay, that makes sense. It would be a re-implementation of the logic provided by the OS.

TylerBrinkley on 25 Apr 2017

I'm wondering why we need to use the OS implementation here. It means the various platforms may differ, have different performance, etc...

The ANSI implementation itself isn't difficult (compare as ordinal until you hit a number in both strings, then parse out all sequential numerical digits and compare, repeat until you hit the end of the string if the sequence of digits is the same).

The Unicode implementation would likely be similar with a decision only needing to be made w.r.t how sequences of numeric digits from different cultures are compared. For example: if you have 1 (West Arabic) and ١ (East Arabic), do they compare as equal and are sorted ordinarily (١ would come after 1 due to it being the same value but it coming after 1 ordinarily and it would come before 2 due to having a lesser value) or are they just compared ordinarily due to being different number systems (١ would come after all strings of West Arabic digits, as it would in normal ordinal compare).

tannergooding on 25 Apr 2017

@tannergooding are you talking generally or are you talking about the ordinal case only?

tarekgh on 25 Apr 2017

I was talking about implementing this entirely in .NET vs implementing it using the existing underlying OS functionality.

The algorithm to do this is fairly trivial (as listed above) with the only real decision being how identical numbers from different cultures are compared. I am also not sure I buy the argument that it will be less performant. We will have spans shortly and fixed buffers in the meantime, both of which would allow us to compare substrings without needing to worry about expensive allocations or copying. Even a naive implementation shouldn't require any recursing, lookback, etc (comparing two strings should require parsing no more characters than the length of both strings, assuming they both end in a string of digits and are otherwise identical).

The hardest part of the implementation would be ensuring we have the ability to parse strings of digits other than [0..9] (but we should probably have this support somewhere in the framework anyways).

tannergooding on 25 Apr 2017

API review:
We should split it into 2 independent features:

namespace System {
     public class StringComparer {
+        public static StringComparer Create(CultureInfo culture, CompareOptions options);
     }
}

This one is for discoverability. Approved.

 namespace System {
     public class StringComparer {
//REJECTED:  public static StringComparer Logical { get; }
//REJECTED: public static StringComparer LogicalIgnoreCase { get; }
     }
 }
 namespace System.Globalization {
     public enum CompareOptions {
+        NumericOrdering = 0x00000020
     }
 }

We rejected the 2 Local StringComparer.Logical* statics, because we would have to have more combinations (also with Ordinal). We do not believe this option needs such promotion. It is doable via flags.

Discussion about naming -- options: Logical (proposal & Windows API) vs. DigitsAsNumbers (Windows API) vs. NumericOrdering (ICU) vs. NaturalOrdering (Java) vs. NumericCollation (ICU under the hood).

Rejected: Logical seems very general and doesn't quite capture what it does (although Windows has API called strCmpLogicalW).
Rejected: DigitsAsNumbers explains slightly better what it does (used by Windows in SORT_DIGITSASNUMBERS) - Digits and numbers are the same thing, so it is weird.
Rejected: NaturalOrdering is too close to Logical - unclear meaning
Rejected: NumericCollation - just weird, no one likes it
Approved: NumericOrdering

We should make sure the implementation is flushed out, if it doesn't call into OS (maybe experiment in another library, e.g. CoreFXLab). We should not take half-working API into CoreFX.

karelz on 25 Apr 2017

👍2

Top-post updated with approved API shape

karelz on 25 Apr 2017

@tannergooding let's take an example for what I mean by the performance issue if we implement it without calling the OS:

Imagine you are comparing the following 2 strings (using CurrentCulture for instance)
var s1 = "Strasse 10";
var s2 = "Straße 2"

"Strasse" is equals to "Straße"

To do what you have suggested, you'll need first to search the string for numeric (which is 10 in the first string and 2 in the second string) then call the OS for the string comparisons of "Strasse" and "Straße". We have to call the OS because the framework cannot perform such Linguistic comparisons. then we need to parse the numbers and then compare them. we have to repeat this process till we find a difference or reach the end of one of the strings.

If we call the OS, we'll not have to do that and the OS will just go through the string as one pass. so I expect the performance would be much better.

Let me know if this answer your question.

tarekgh on 25 Apr 2017

For the record, adding the ICU info we can use:

http://icu-project.org/apiref/icu4c/ucol_8h.html

When turned on, this attribute makes substrings of digits sort according to their numeric values.

This is a way to get '100' to sort AFTER '2'. Note that the longest digit substring that can be treated as a single unit is 254 digits (not counting leading zeros). If a digit substring is longer than that, the digits beyond the limit will be treated as a separate digit substring.

A "digit" in this sense is a code point with General_Category=Nd, which does not include circled numbers, roman numerals, etc. Only a contiguous digit substring is considered, that is, non-negative integers without separators. There is no support for plus/minus signs, decimals, exponents, etc.

tarekgh on 25 Apr 2017

Is this okay to work on? I'm interested, and I think I have figured out mostly.

Gnbrkm41 on 22 Aug 2019

Also, could there be potential issues with the length of the numbers inside the string? ICU spec says:

Note that the longest digit substring that can be treated as a single unit is 254 digits (not counting leading zeros) If a digit substring is longer than that, the digits beyond the limit will be treated as a separate digit substring.

I do not see any digit limits for the Windows CompareStringEx function; I know 254 digits is an extreme case... but it just makes me wonder.

Gnbrkm41 on 22 Aug 2019

@Gnbrkm41 thanks for willing to help with this issue. Yes, you can start working on it. Please make sure you follow the final approved proposal https://github.com/dotnet/corefx/issues/395#issuecomment-297115553

Note that, the first part of the approved proposal

```C#
public class StringComparer {
public static StringComparer Create(CultureInfo culture, CompareOptions options);
}

is already implemented and exposed so you don't have to worry about this part. you need to focus on supporting the part 

```C#
 namespace System.Globalization {
     public enum CompareOptions {
        NumericOrdering = 0x00000020
     }
 }

For digits length limit 254, I would say we already expect some different collation behavior between ICU and Windows. So, let's stick with whatever the underlying OS behavior is even if it is different. As you mentioned this is extreme case anyway and we can change the behavior as needed if we get any complains about it.

tarekgh on 22 Aug 2019

👍1

@tarekgh, where should the tests for this particular StringComparer go? src/System.Runtime/tests/StringComparer.cs? src/System.Runtime/tests/StringComparer.netcoreapp.cs? I'm guessing it should go to netcoreapp since this only exists in .NET Core... but just asking to make sure.

Gnbrkm41 on 23 Aug 2019

When we get to the implementation we can have more details. what I meant here is we'll handle the cases like (CompareOptions.Ordinal | CompareOptions.Logical) as special case that we will not call the OS for it but instead we'll provide the code that go through the string and compare it ordinarily and take care if we find numbers. if there is anything specific not clear I can try to clarify more.

Almost missed this; should this go in as well? At the moment I merely added more code for calling the OS API.

Gnbrkm41 on 23 Aug 2019

Should comparers created with the NumericSorting flag work with GetHashCode? it doesn't work with StringSort at the moment it seems, so wondering if same should happen with NumericSorting.

Gnbrkm41 on 23 Aug 2019

if the Ordinal | Numeric is allowed as a special case, should other flags be allowed as well, such as IgnoreCase? Ordinal | Numeric | IgnoreCase sounds something that might people want to me...

Gnbrkm41 on 23 Aug 2019

where should the tests for this particular StringComparer go?

corefx\src\System.Runtime.Extensions\tests\SystemStringComparer.netcoreapp.cs. The other string comparisons tests should go to corefx\src\System.Globalization\tests

Almost missed this; should this go in as well? At the moment I merely added more code for calling the OS API.

Yes, we need to have the full implementation of the feature. Let me know if you need any help in this part.

Should comparers created with the NumericSorting flag work with GetHashCode? it doesn't work with StringSort at the moment it seems, so wondering if same should happen with NumericSorting.

Yes, NumericSorting should be included and supported in getting the hashcode. It should be working with StringSort too.
https://github.com/dotnet/coreclr/blob/eab2a618963fc4e193501fb4e1c3fe0809c5de6d/src/System.Private.CoreLib/shared/System/StringComparer.cs#L132

if the Ordinal | Numeric is allowed as a special case, should other flags be allowed as well, such as IgnoreCase? Ordinal | Numeric | IgnoreCase sounds something that might people want to me...

CompareOptions already has OrdinalIgnoreCase. so we should allow only

OrdinalIgnoreCase | NumericSorting and Ordinal | NumericSorting

and we can document the behavior.

tarekgh on 23 Aug 2019

Yes, NumericSorting should be included and supported in getting the hashcode. It should be working with StringSort too.
https://github.com/dotnet/coreclr/blob/eab2a618963fc4e193501fb4e1c3fe0809c5de6d/src/System.Private.CoreLib/shared/System/StringComparer.cs#L132

I thought that changing this would do, but it seems that there's additional checks when it calls the CompareInfo's ones, which makes StringComparer.Create(InvariantCulture, StringSort).GetHashCode() fail:
https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.cs#L23-L40

This makes it throw when the internal GetHashCodeOfString is called.

Gnbrkm41 on 24 Aug 2019

What would be the desired behaviour for the ordinal, case sensitive, numeric sorting comparer in case of comparer.Compare("Windows 7", "windows 10")? should it consider the case difference first or the number difference at the end?

Gnbrkm41 on 24 Aug 2019

@Gnbrkm41 I believe you compare letters first and then whenever you see a number you start comparing by numbers.

Some test cases can look like versions, i.e.:

A.1.2.3
A.12.3.4
A.1.23.4
...

krwq on 24 Aug 2019

I have long written my own StrCmpLogicalW wrapper for use in WPF, so I support this direction.

scalablecory on 24 Aug 2019

@krwq, yes, but what I meant was whether the numbers have more significance. For example, consider these strings: file 1 and File 2. If we abide by that rule (compare letters first), ordinal sorting would put File 2 first then file 1 since it considers the case difference first.

It seems that CompareStringEx function with InvariantCulture and SORT_DIGITSASNUMBERS flag sort file 1 first then File 2, which suggests that it prioritises the numeric difference rather than the casing, even when it's case sensitive. This doesn't apply to StrCmpLogicalW since it is case insensitive by default.

I was wondering, for the ordinal implementation (which we'll take care of ourselves, rather than calling the OS; anything else is handled by the OS.), whether we should do the same as CompareStringEx, which is to consider numbers first if there's only case difference, or to consider the casings first. Personally casings first makes more sense to me since it's ordinal.

Gnbrkm41 on 24 Aug 2019

👍1

Yes, NumericSorting should be included and supported in getting the hashcode. It should be working with StringSort too.
https://github.com/dotnet/coreclr/blob/eab2a618963fc4e193501fb4e1c3fe0809c5de6d/src/System.Private.CoreLib/shared/System/StringComparer.cs#L132

I thought that changing this would do, but it seems that there's additional checks when it calls the CompareInfo's ones, which makes StringComparer.Create(InvariantCulture, StringSort).GetHashCode() fail:
https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.cs#L23-L40

This makes it throw when the internal GetHashCodeOfString is called.

How about NumericSorting | Ordinal or NumericSorting | OrdinalIgnoreCase? should we allow GetHashCode to be called for these as well?

Gnbrkm41 on 24 Aug 2019

@Gnbrkm41 intuitively "File 2" should be first not sure why would we compare in the middle first or how would that be different from sorting "B1" and "A2".

For GetHashCode question: as long as HashCode(stringA) == HashCode(stringB) for every Equals(stringA, stringB) for a given setting then implementation is correct (although hash should also be reasonable since always returning 0 as hash code would also meet the criteria)

krwq on 27 Aug 2019

Let's think more how we can handle the ordinal with the NumericSorting from design and implementation point of view. Ordinal and OrdinalIgnoreCase always treated as exclusive flags which cannot be combined with any other flags. changing that would require a lot changes in all places handle these flags. We may think if there is a better way we can handle this.

tarekgh on 29 Aug 2019

So I did some testing on both Windows and Linux, and it seems that there are some interesting differences:

Windows treats non-canonical numbers (i.e. numbers with leading zeros) non-equal; Linux does
- 00000011 and 11 compares different on Windows (longer one is considered to come earlier)
- However, this compares equal on Linux
- I think this also makes a slight difference when it comes to comparing strings with numbers + α: for example, AA_11_00123A < AA_11_123 on Windows, AA_11_00123A > AA_11_123 on Linux

Should we try to consolidate this behaviour so that they're platform independent, or would it be okay to just leave it as-is?

Gnbrkm41 on 30 Aug 2019

It is expected we can find differences in the sort behavior among Windows and ICU. This is happen in other cases and not with numbers only. we can report the issue to Windows if we think ICU behavior is better.

tarekgh on 30 Aug 2019

I personally think that the current Windows behaviour is better, since it would result in consistent sorting output regardless of the unsorted state of the collection or algorithm... but on the other hand different representations equaling others do make sense as well.

Gnbrkm41 on 30 Aug 2019

considering this is numeric sorting I think 011 should equal 11 (from sorting point of view consistent sorting makes it more convenient but you can say the same with comparing case insensitive "aA" with "Aa"). Perhaps there are some switches somewhere to unify this behavior.

krwq on 30 Aug 2019

I think, logically, the value of 011 and 11 should be equivalent, but they would then be sorted among themselves (likely ordinally), to avoid mixing a011b, a11c, a011d, etc

tannergooding on 30 Aug 2019

👍3

I agree with @tannergooding the ICU behavior looks more correct to me.

CC @ShawnSteele if he has any comment on Windows behavior.

tarekgh on 31 Aug 2019

Still working on this; Just wondering, can anyone point me to any reliable string hashing algorithms? making it to compare different representations of strings equal would mean that I'd have to implement another hashing algorithm for it instead of using existing ones.

Gnbrkm41 on 8 Oct 2019

@Gnbrkm41 why you need a new hashing algorithm? why can't using whatever used today in the string comparers?

tarekgh on 8 Oct 2019

For example, if I were to treat "11" and "0011" equal with the comparer, it should produce the same hashcode, no? None of the existing comparers do that.

If I were to do that myself with the ordinal numeric comparer that I'll have to implement myself, it doesn't look like I can reuse current implementations of existing comparers. the Ordinal/OrdinalIgnoreCase comparers seem to use Marvin implemented in managed code, and all the other ones seem to call the underlying OS (for windows it seem to use LCMapStringEx which does not compare such strings equal).

Gnbrkm41 on 8 Oct 2019

When you enable the NumericOrder option through the underlying native OS APIs (ICU or NLS), then the hashing should automatically work without need to do anything. I agree this will need to do more work for ordinal cases as you pointed. That is why I previously mentioned https://github.com/dotnet/corefx/issues/395#issuecomment-525974010 we need to rethink about the current proposed design as it can have some issues when using ordinal.

tarekgh on 8 Oct 2019

Summary about the behaviour

NumericOrdering will be allowed to be specified with IgnoreCase, IgnoreKanaType, IgnoreNonSpace, IgnoreSymbols, IgnoreWidth and StringSort, just like any other options
This calls into the the globalisation service available from the OS to perform the comparison
- The NumericOrdering flag corresponds to UCOL_NUMERIC_COLLATION attribute of ICU or SORT_DIGITSASNUMBERS flag of NLS (Windows)
- Platform difference 1: ICU has limit of 254 digits (Note that the longest digit substring that can be treated as a single unit is 254 digits (not counting leading zeros) If a digit substring is longer than that, the digits beyond the limit will be treated as a separate digit substring.), whereas NLS does not specify one.
- Platform difference 2: Windows considers leading zeroes when comparing two otherwise equal numbers. For example, "042".Compare("42") will return -1.
- Platform difference 3: Windows considers the number difference to be more significant than case difference even if case sensitivity is specified. For example, "a1".Compare("A2") will return -1.
- Not sure if there's any difference in what the globalisation lib considers as valid numbers; looks like ICU will consider any characters of General category = Nd, and Windows supports the following: https://github.com/dotnet/corefx/issues/395#issuecomment-296773739 Turns out what Windows supports are all of category Nd.

Proposed design for Ordinal comparison

In addition to the flag combination above, allow Ordinal or OrdinalIgnoreCase to be specified with NumericOrdering (NumericOrdering | Ordinal, NumericOrdering | OrdinalIgnoreCase)
- Implemented in managed language, without globalisation support for speed and potentially security reasons
- Do we take alternative route about allowing this? Allowing it to be specified with Ordinal(IgnoreCase) which has always been specified exclusively is surely different than what it used to be, but I'm not sure if there's any other places that validate the flags.
Only the characters '0-9' (\x30 - \x39) will be considered as valid digits
Only whole numbers in decimal without separators are supported. Scientific notation, decimal points, hexadecimal, group separator etc will be treated just like any other non-numeric characters.
Any leading zeroes (e.g. "00000042") will be ignored when compared with other numbers. This means "042".Compare("42") will return 0
- This will mean it will have to implement custom logic for GetHashCode because now numbers will compare equal with leading zeroes
Any non-digit characters will be compared by its underlying value, until it reaches a number from both string; from this point it will compare the digits of both strings to see which come first, or they compare equal

Is there any other suggestions, or potentially points that are missed out?

Gnbrkm41 on 14 Dec 2019

@Gnbrkm41 thanks for your follow up here.

Your suggested plan looks reasonable but I have 2 concerns with it, which is mainly related to the ordinal comparisons:

currently we have a lot of code which assume Ordinal and OrdinalIgnoreCase are exclusives options. touching all places handle such assumptions will be risky for causing functional or performance regression. That is why I was trying to think in avoiding that.
handling ordinal with only 0-9 ASCII numbers is not good thing to start with. it is easy to run into cases that require more than that.

I didn't have time to investigate more, but my thoughts to solve these 2 concerns was:

maybe it is worth to have OrdinalNumericOrdering and OrdinalIgnoreCaseNumericOrdering options. having these will avoid us touching a lot of current code.
We need to look at Windows and ICU how they support the ordinal with the numeric ordering. if they offer any help with these. otherwise, I agree with your thoughts to manually handle the ordinal cases which this will be the most likely approach we'll use.

tarekgh on 15 Dec 2019

currently we have a lot of code which assume Ordinal and OrdinalIgnoreCase are exclusives options. touching all places handle such assumptions will be risky for causing functional or performance regression. That is why I was trying to think in avoiding that.

maybe it is worth to have OrdinalNumericOrdering and OrdinalIgnoreCaseNumericOrdering options. having these will avoid us touching a lot of current code.

Fair point. I could perhaps take a look at the list of methods that take CompareOptions and see if it'd be a huge effort to implement that; Not entirely sure about the performance implications though.

handling ordinal with only 0-9 ASCII numbers is not good thing to start with. it is easy to run into cases that require more than that.

Also fair. If that's the case, what set of characters should we consider as valid digits? Windows and ICU seem to consider different sets of characters as digits. Perhaps we could take from both?

Also repeating @tannergooding's opinion:

The Unicode implementation would likely be similar with a decision only needing to be made w.r.t how sequences of numeric digits from different cultures are compared. For example: if you have 1 (West Arabic) and ١ (East Arabic), do they compare as equal and are sorted ordinarily (١ would come after 1 due to it being the same value but it coming after 1 ordinarily and it would come before 2 due to having a lesser value) or are they just compared ordinarily due to being different number systems (١ would come after all strings of West Arabic digits, as it would in normal ordinal compare).

I agree with Tanner that if we compare different number groups I think we should just compare ordinarily rather than by the value.

We need to look at Windows and ICU how they support the ordinal with the numeric ordering. if they offer any help with these.

While this would be nice, I'm worried that their behaviour might differ across OSes (just like how ICU and NLS support different set of numeric characters), which might lead to cases where the sorted order differ. Even if this is by design this would effectively mean we would not have a way to get consistent results across OSes.

Gnbrkm41 on 16 Dec 2019

While this would be nice, I'm worried that their behaviour might differ across OSes

This is the general case we already have today so we don't have to worry much about that especially we are going to support using ICU on Windows too so we already have a solution for the users regarding this issue.

tarekgh on 17 Dec 2019

👍1

CompareInfo.Compare(string?, string?, CompareOptions)
CompareInfo.Compare(string?, int, string?, int, CompareOptions)
CompareInfo.Compare(string?, int, int, string?, int, int, CompareOptions)
CompareInfo.IsPrefix(string, string, CompareOptions)
CompareInfo.IsSuffix(string, string, CompareOptions)
CompareInfo.IndexOf(string, char, CompareOptions)
CompareInfo.IndexOf(string, string, CompareOptions)
CompareInfo.IndexOf(string, char, int, CompareOptions)
CompareInfo.IndexOf(string, string, int, CompareOptions)
CompareInfo.IndexOf(string, char, int, int, CompareOptions)
CompareInfo.IndexOf(string, string, int, int, CompareOptions)
CompareInfo.LastIndexOf(string, char, CompareOptions)
CompareInfo.LastIndexOf(string, string, CompareOptions)
CompareInfo.LastIndexOf(string, char, int, CompareOptions)
CompareInfo.LastIndexOf(string, string, int, CompareOptions)
CompareInfo.LastIndexOf(string, char, int, int, CompareOptions)
CompareInfo.LastIndexOf(string, string, int, int, CompareOptions)
CompareInfo.GetSortKey(string, CompareOptions)
CompareInfo.GetHashCode(string, CompareOptions)
CompareInfo.GetHashCode(ReadOnlySpan<char>, CompareOptions)
GlobalizationExtensions.GetStringComparer(CompareInfo, CompareOptions)
string.Compare(string?, string?, CultureInfo?, CompareOptions)
string.Compare(string?, int, string?, int, int, CultureInfo?, CompareOptions)
StringComparer.Create(CultureInfo, CompareOptions)

While the amount of public APIs that accept CompareOptions seem like a lot, most of them end up at internal CultureAwareComparer and shared implementation in CompareInfo where the enum is validated; It feels like the amount of work needed to be done will be about the same for introducing a dedicated member for it, when compared to allowing a different combination with existing members.

Gnbrkm41 on 17 Dec 2019

Is this useful?: https://natsort.readthedocs.io/en/master/index.html