For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10"
.
Using the Ordinal
StringComparer
to sort the list one would get
Windows 10
Windows 7
but the desired ascending logical sort would be
Windows 7
Windows 10
namespace System {
public class StringComparer {
+ public static StringComparer Create(CultureInfo culture, CompareOptions options);
}
}
namespace System.Globalization {
public enum CompareOptions {
+ NumericOrdering = 0x00000020
}
}
var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(StringComparer.Logical); // List is now "Windows 7", "Windows 10"
This would also be good for sorting strings containing IP addresses.
Logical
is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical)
LogicalIgnoreCase
is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical | CompareOptions.IgnoreCase)
Char.IsDigit
.Char.GetNumericValue
.ulong
s. Logic for overflows will have to be considered.Windows 8.1
would be considered 4 sequences. The Windows
would be a string sequence, the 8
would be a numeric sequence, the .
would be another string sequence, and the 1
would be another numeric sequence.NumberStyles
parameter."a", "7"
the number 7
will be sorted before the letter a
.CompareOptions
parameter as input will need to be updated to support the new Logical
member.CompareOptions.Logical
be implemented as the flag option SORT_DIGITSASNUMBERS
to the dwCmpFlags
parameter of CompareStringEx
? Using it's implementation should be more efficient but later expanding support for NumberStyles
will require a re-implementation with matching behavior.Logical
and LogicalIgnoreCase
properties.CreateLogical
overloads to match the Create
method.NumberFormatInfo
from the StringComparer
parameter when not explicitly provided and is a CultureAwareComparer
.CreateLogical
overloads that matched the Create
method.CompareOptions.Logical
and changed CreateLogical
to be just an overload of Create
.That's a good suggestion. Would you be willing to make an API proposal?
I think conceptually this is a good idea, but there are a lot of details that need to be worked out. It seems like such a comparer might want to take options which control how it behaves. For example, are commas in numbers ok? Spaces? Decimal points? How are different cultures handled?
While it's easy to define the behavior on Windows systems as following StrCmpLogicalW
, this could be problematic in the long run.
StrCmpLogicalW
is not only undocumented, but it explicitly states that it can change in future releases of Windows.StringComparer.LogicalIgnoreCase
without providing StringComparer.Logical
.I believe you could define a reasonable API proposal in terms of a lexicographical ordering of substrings divided at numeric boundaries. Sequences of non-digits would be compared using StringComparer.CurrentCulture
or StringComparer.CurrentCultureIgnoreCase
, and sequences of digits would be compared according to their numeric values.
The comparer should require an IFormatProvider or NumberFormatInfo; which can be passed in the constructor or it should use the current one from the owning thread.
@terrajobst This is a problem I am facing in almost every business application: proper ordering even if the strings contain numbers. I don't have time to make a PR or document it well, but I would be very happy to see it in the new releases of the core framework. I think the use case is clear enough.
This is a need that I have often faced. If @Peter-Juhasz does not have the time to create the API proposal or implement it, I would be interested in taking it on.
@Peter-Juhasz There is probably no need to consider the NumberFormatInfo
. As stated in the documentation for NumberFormatInfo.NativeDigits
:
The character set that is specified by the NativeDigits property has no effect on parsing or formatting operations. Only the Basic Latin digits 0 (U+0030) through 9 (U+0039) are used when formatting or parsing numeric values or date and time values.
If StringComparer.Logical
and StringComparer.LogicalIgnoreCase
were defined to use StringComparer.CurrentCulture
and StringComparer.CurrentCultureIgnoreCase
for comparing non-digits, the culture used for the comparison is clear.
On a side note, even if used there is no need to explicitly provide the NumberFormatInfo
to the comparer since the current culture information already provides it.
Looks in Windows, CompareStringEx can be used with the flag SORT_DIGITSASNUMBERS. I didn't look at ICU if it support such functionality too.
I believe there is few things which need to be worked on here:
I think this would be a nice addition but it needs some work.
should we provide all permutations for all comparers now or perhaps generate comparer based on some settings (perhaps we already have something like that which we might be able to inject this log in and I might not be aware of it)
I don't think we need to do that. it would be just option (or flag) passed with the CompareOptions and let the underlying OS handle it.
I think this would be a nice addition but it needs some work.
I agree with that
I'd see value in both an OS-dependent sort for cases where you want to match the OS (file listing, for example), and an OS-independent .NET sort that natively understands both numbers and dates. A real life example is needing month names to be correctly sorted, even when that's the only date-related part of the string.
The OS-dependent sort should be super simple to implement. After that's done I think it would still be worth putting design time in towards an independent managed sort as well, to figure out if and how people's needs differ.
For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10"
.
Using the Ordinal
StringComparer
to sort the list one would get
Windows 10
Windows 7
but the desired ascending logical sort would be
Windows 7
Windows 10
namespace System {
public class StringComparer {
+ public static StringComparer Create(CultureInfo culture, CompareOptions options);
+ public static StringComparer Logical { get; }
+ public static StringComparer LogicalIgnoreCase { get; }
}
}
namespace System.Globalization {
public enum CompareOptions {
+ Logical = 0x00000020
}
}
var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(StringComparer.Logical); // List is now "Windows 7", "Windows 10"
This would also be good for sorting strings containing IP addresses.
Logical
is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical)
LogicalIgnoreCase
is a convenience property equivalent to the result of Create(CultureInfo.CurrentCulture, CompareOptions.Logical | CompareOptions.IgnoreCase)
Char.IsDigit
.Char.GetNumericValue
.ulong
s. Logic for overflows will have to be considered.Windows 8.1
would be considered 4 sequences. The Windows
would be a string sequence, the 8
would be a numeric sequence, the .
would be another string sequence, and the 1
would be another numeric sequence.NumberStyles
parameter."a", "7"
the number 7
will be sorted before the letter a
.CompareOptions
parameter as input will need to be updated to support the new Logical
member.CompareOptions.Logical
be implemented as the flag option SORT_DIGITSASNUMBERS
to the dwCmpFlags
parameter of CompareStringEx
? Using it's implementation should be more efficient but later expanding support for NumberStyles
will require a re-implementation with matching behavior.Logical
and LogicalIgnoreCase
properties.CreateLogical
overloads to match the Create
method.NumberFormatInfo
from the StringComparer
parameter when not explicitly provided and is a CultureAwareComparer
.CreateLogical
overloads that matched the Create
method.CompareOptions.Logical
and changed CreateLogical
to be just an overload of Create
.we are limited to what the underlying OS can give us as functionality support, this need to be investigated if both Windows and Linux can provide this functionality before we proceed here.
I don't think this should necessarily follow the Windows functionality but be entirely implemented in .NET.
I don't think this should necessarily follow the Windows functionality but be entirely implemented in .NET.
This is not easy to do especially we don't carry the needed collation tables needed to do this.
if it is just ASCII characters that is possible. if you have more complicated scenario that will be challenging except if you go and parse the string split them. this will not be nice performance wise.
in short, Linguistic comparisons will be challenging.
You're probably right about the performance. I'm currently using an implementation that splits the string which works for me but I understand it may not be quick enough for the framework due to the string allocations.
it is not regarding the string allocations. we can pin the strings and avoid the allocations. it is about parsing the string and then compare each part and then handle the digits comparisons. note that, such functionality has to support all digits for all languages too and not just 0~9.
Ugh, supporting digits for all languages would be a pain as you'd have to go by the NativeDigits
property on NumberFormatInfo
. I suppose a fast path could be added for 0-9 and a slow path for other languages.
After looking at my implementation I think char.IsDigit
would work just fine.
For some scenarios I'd prefer an OS-dependent logical sort, for others an invariant logical sort.
I'd really love something like this:
c#
var comparer = StringComparer.TryGetOSLogical()
?? StringComparer.CreateLogical(StringComparer.CurrentCulture);
I want to match the feel of the native OS if possible.
Ugh, supporting digits for all languages would be a pain as you'd have to go by the NativeDigits property on NumberFormatInfo. I suppose a fast path could be added for 0-9 and a slow path for other languages. After looking at my implementation I think char.IsDigit would work just fine.
We don't have to use NumberFormatInfo.NativeDigits because this will make it work with only specific culture while digits in all languages should just work regardless of the chosen culture because digits doesn't have linguistic context (in most cases).
char.IsDigit is not helpful either because you need to know the value of the digit and not just check if it is digit to be able to support the needed functionality.
I am not trying to push back here but I am just explaining the complexity we'll have if we need to do it without OS help.
var comparer = StringComparer.TryGetOSLogical()
?? StringComparer.CreateLogical(StringComparer.CurrentCulture);
I hope StringComparer.TryGetOSLogical() exist and work all the time but the logic to fallback will create some other issues regarding the difference in the behavior if OS support it and if not. This may be OK but we have to understand that.
I understand and thank you for going through the complexities with me. In my implementation I TryParse
the substring number passing in the NumberFormatInfo
, why would that not work? Performance?
I understand and thank you for going through the complexities with me. In my implementation I TryParse the substring number passing in the NumberFormatInfo, why would that not work? Performance?
Int32.TryParse with NumberFormatInfo will not parse native digits.
If Int32.TryParse
doesn't support native digits, why should this? As far as I can tell even the native StrCmpLogicalW
doesn't support native digits.
If Int32.TryParse doesn't support native digits, why should this? As far as I can tell even the native StrCmpLogicalW doesn't support native digits.
Because the functionality we are talking about is regarding the collation which support all native digits. Also there is some requests we need to support native digits in the formatting and parsing APIs but we didn't get into it yet.
In short, if we support only 0~9 digits, sooner or later we'll get request to support the native digits. so it will be good to have a good plan in our support before we jump doing it.
@tarekgh do you think it's valuable to mark this as Api ready for review and then if it is approved then try to sort out implementation details?
do you think it's valuable to mark this as Api ready for review and then if it is approved then try to sort out implementation details?
No, we don't have a good way to implement this at least on Windows today. without figuring out the implementation story we cannot proceed with the API.
@tarekgh
...we don't have a good way to implement this at least on Windows today...
I'm not sure I agree with this completely. It should be possible to implement a comparer which uses StringComparer.CurrentCulture
for substrings that do not contain the code points [0-9]
, and a numeric comparison based on int.Parse
for sequences containing only those code points.
The API could be defined to simply be two new properties of the StringComparer
class.
StringComparer.Logical
: Uses CurrentCulture
for sequences of non-digitsStringComparer.LogicalIgnoreCase
: Uses CurrentCultureIgnoreCase
for sequences of non-digitsSince neither of these properties is an invariant string comparer, it is allowable that future updates make changes to accommodate various cultures, including but not limited to changes in the way native digits are handled.
for substrings that do not contain the code points [0-9], and a numeric comparison based on int.Parse for sequences containing only those code points.
This will not work with non Latin digits. if you use other numerals from other languages that will not work. please read the discussion in this issue.
other concern from the proposed implementation is the performance. splitting the string and then comparing parts will be expensive.
Also should we consider cases like (1,0000, 1.1, ...etc.)? did you think about that?
char.IsDigit is not helpful either because you need to know the value of the digit and not just check if it is digit to be able to support the needed functionality.
@tarekgh I've updated my proposal to specify the use of the Char.GetNumericValue
method to retrieve its value. I've also added @sharwell 's Logical
and LogicalIgnoreCase
properties.
I think for the API interface we can do the following:
The benefit here is this will enable string comparison APIs too and not just the string comparer.
This will not work with non Latin digits. if you use other numerals from other languages that will not work. please read the discussion in this issue.
Yes, this behavior is intentional.
other concern from the proposed implementation is the performance. splitting the string and then comparing parts will be expensive.
There is no need to split the string. You can identify the numeric portions easily, with will also reveal the sequences of non-digits. The sequences of non-digits can be compared using CultureInfo.CurrentCulture.CompareInfo.Compare
, which can operate on explicit substrings without [external] allocation.
Also should we consider cases like (1,0000, 1.1, ...etc.)? did you think about that?
Yes, the ,
and .
elements seen here do not fall in the character range [0-9]
. The comparison would treat 1,0000
as three sequences: 1
(numeric), ,
(non-numeric), 0000
(numeric). The string 1.1
would likewise be treated as three sequences. In my proposal, the current culture has no impact on the treatment of any individual code point as a digit or non-digit. Note that the treatment of "fractional digits" in this manner (ignoring them) is consistent with the way Windows already performs logical sorting in Explorer (for example).
My goal is to provide a clean and immediately usable API without restricting its ability to meet needs in the future. I do not believe it needs to be complicated in order to achieve this outcome.
If we only use the CompareOptions
enum then we'll be limited to only supporting the base integer case without sign, decimal, or thousands support. Also, it would be far less visible and hardly anyone would use it that would otherwise.
With my proposal the user can specify whether to AllowLeadingSign
, AllowTrailingSign
, AllowDecimalPoint
, and AllowThousands
using the NumberStyles
flag enum parameter and also what strings to use for the NegativeSign
, PositiveSign
, NumberDecimalSeparator
, and NumberGroupSeparator
using the NumberFormatInfo
parameter. And being listed in the StringComparer
class makes it very easy to discover.
@sharwell
Yes, this behavior is intentional.
why you think this is good?
There is no need to split the string.
I didn't mean split the string. I meant you need to search the string for the numbers first and then compare the part before the number then compare the number and then repeat the process. that what I meant will be expensive.
My goal is to provide a clean and immediately usable API without restricting its ability to meet needs in the future. I do not believe it needs to be complicated in order to achieve this outcome.
I understand but in same time we need to get the right design to serve us in the future if we need to expand the functionality.
@TylerBrinkley please move the proposal to the top of the issue.
@joperezr we can mark this as ready for review according to the latest discussion.
One last note, for the
C#
public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle, NumberFormatInfo numberFormatInfo);
I think this will be a little bit confusing as we can can create StringComparer for specific culture that match the desired number format. I know that could have have a comparer for specific culture and in same time you want to treat the numbers from other culture but that is unlikely and it would be good just not include it till we find a real need for it. other than that the proposal looks reasonable to me
How do you recommend we get the NumberFormatInfo
though? For a CultureAwareComparer
we would need to add another field initialized from the constructor and for non-CultureAwareComparer
's we would need to use a default such as CurrentInfo
or else we would have to add a NumberFormatInfo
or IFormatProvider
property to StringComparer
which seems wrong.
Also since there is already a Create(CultureInfo culture, bool ignoreCase)
method, should we also add overloads for those parameters, e.g. CreateLogical(CultureInfo culture, bool ignoreCase)
and CreateLogical(CultureInfo culture, bool ignoreCase, NumberStyles numberStyle)
?
I've updated the proposal to retrieve the NumberFormatInfo
if not explicitly provided from CultureAwareComparer
's and for other comparers a default of NumberFormatInfo.CurrentInfo
will be used.
I've also added CreateLogical
overloads for the parameters included in the Create(CultureInfo culture, bool ignoreCase)
method.
@TylerBrinkley please move the proposal to the top of the issue.
I don't have access to do that.
How do you recommend we get the NumberFormatInfo though
While creating the string comparer we'll store the corresponding NFI object. for example, if you are creating CurrentCulture comparer we'll get NFI from current culture and use it. this will apply on culture aware comparer. so we should be find here without passing NFI to the constructors of String comparers.
CreateLogical(CultureInfo culture, bool ignoreCase)
please don't add any overloads to this one as we have introduced a new signature that is taking CompareOptions instead of ignoreCase bool. we should advise people to use this one instead.
@TylerBrinkley please update the proposal accordingly and I'll move it to the top after you do that. thanks.
So do you want me to remove those overloads and add ones with CompareOptions
instead of the ignoreCase bool or just remove those overloads?
I suggest to have only
C#
public static StringComparer CreateLogical(StringComparer stringComparer);
public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle);
public static StringComparer Logical { get; }
public static StringComparer LogicalIgnoreCase { get; }
The signature taking StringComparer (the first and second one) is covering the scenario of culture aware comparer which can be created using ignoreCase or CompareOptions. Also I removed the one which is taking NFI.
We still have the freedom in the future to add more signatures if proven we really need them.
What you think?
Also, do you think naming Logical is good here? I understand Windows API StrCmpLogicalW called logical but I am not sure if this is really easy to understand what it means. anyway, we may get more feedback from the design reviewer as needed if we don't come up with better naming now.
While that's acceptable I don't necessarily like that a user can't use an Ordinal
StringComparer
with an NFI that's not the current cultures.
Also, do you think naming Logical is good here?
I can't think of anything better but naming it something different is definitely worth considering.
While that's acceptable I don't necessarily like that a user can't use an Ordinal StringComparer with a culture's NFI besides the current.
This still can be achieved by create culture aware comparer with CompareOptions.Ordinal and use the first or second signature.
What's the API to create a culture aware comparer with CompareOptions.Ordinal
? Is that not implemented yet?
From a brief search Natural
is the only other reasonable name I could find.
What's the API to create a culture aware comparer with CompareOptions.Ordinal? Is that not implemented yet?
We have it here:
https://github.com/dotnet/corefx/blob/master/src/System.Globalization.Extensions/src/System/Globalization/Extensions.cs#L13
But I can see it will complicate our implementation to extract the NFI. So I am OK to return back the signatures that taking NFI.
Also we need to define the behavior when someone using
public static StringComparer CreateLogical(StringComparer stringComparer);
public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle);
We may be able to construct the NFI from the comparer if it is one of framework comparer but we'll not be able to construct it if the caller used custom comparer. so we may need to add overloads taking NFI for the methods taking StringComparer.
C#
public static StringComparer CreateLogical(StringComparer stringComparer);
public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle);
public static StringComparer CreateLogical(StringComparer stringComparer, NumberFormatInfo nfi);
public static StringComparer CreateLogical(StringComparer stringComparer, , NumberFormatInfo nfi, NumberStyles numberStyle);
public static StringComparer Logical { get; }
public static StringComparer LogicalIgnoreCase { get; }
I think StringComparer.Logical
has the correct connotations. You could also try for StringComparer.Intelligent
or StringComparer.Smart
.
And we'll use Invariant.NumberFormatInfo when the caller not passing NFI to us
I think Logical has the correct connotations. You could also try for Intelligent.
let's stick with Logical for now
What, you don't like StringComparer.BestestBetterness
? 馃槅
We may be able to construct the NFI from the comparer if it is one of framework comparer but we'll not be able to construct it if the caller used custom comparer. so we may need to add overloads taking NFI for the methods taking StringComparer.
NumberFormatInfo
is only needed when using NumberStyles
because the default value is specified as None
, meaning we don't need to know what the signs, group separators, or decimal separators are. Thus we only need the already proposed method.
c#
public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle, NumberFormatInfo numberFormatInfo);
I've removed the CreateLogical
overloads that match the Create
method's parameters.
NumberFormatInfo is only needed when using NumberStyles because the default value is specified as None,
NFI is needed to parse the numbers in the strings. right? how you know the decimal point, thousand separator...etc. when parsing the number?
With the default NumberStyles.None
it only allows for positive integral values with no thousand separator, thus the NFI isn't needed.
Ok. I missed this part. so I think we have the proposal now. I'll move it up
Note, I didn't update the proposal to use NumberFormatInfo.InvariantInfo
vs NumberFormatInfo.CurrentInfo
.
please update it in your comment and I'll re-apply it at the top.
Done. Just curious, why Invariant vs Current?
Actually current make more sense. you may update your comment. we should be good now.
thank you.
I don't like that the proposed CreateLogical
method takes a StringComparer
argument. This will be difficult to implement without additional allocations (or at least copying). The following overload should be enough:
public static StringComparer CreateLogical(CultureInfo culture, boolean ignoreCase);
I'm also not convinced it makes sense to add the overload which takes a NumberStyles
argument.
@sharwell extra allocation is good point.
if we have
```c#
public static StringComparer CreateLogical(CultureInfo culture, boolean ignoreCase);
This will not include the ordinal case. so we need to have a way to provide that. we may consider adding the following to your suggestion
```c#
public static StringComparer CreateLogical(CompareOptions);
And we don't have to provide any NFI parameters.
I'm also not convinced it makes sense to add the overload which takes a NumberStyles argument.
How you think we should support the number parsing? should we just allow the number sign, decimal points, thousand separators...etc. or you are suggesting to only support int numbers with no delimiters (or signs)?
This will not include the ordinal case
Do you have an example of where the ordinal case would provide some feature that the invariant culture wouldn't work equally well for?
or you are suggesting to only support int numbers with no delimiters (or signs)?
Correct. Prior implementations I've seen, including many of the ones that led people to want this comparer out-of-the-box, are based on integers and non-integers. The behavior isn't absolutely precise in all cases, but overall it makes sense more times than it doesn't.
I'd be okay with just the simple case being supported initially and later if it's deemed that support for NumberStyles
should be added then it can be added to the API pretty easily.
I also like the uniformity with the existing Create
method.
Do you have an example of where the ordinal case would provide some feature that the invariant culture wouldn't work equally well for?
```C#
var s1 = "Strasse";
var s2 = "Stra脽e";
Console.WriteLine(s1.Equals(s2, StringComparison.Ordinal)); //false
Console.WriteLine(s1.Equals(s2, StringComparison.InvariantCulture)); //true
So I think all what we need is just the following API.
```C#
public static StringComparer CreateLogical(CultureInfo culture, CompareOptions);
Note that we don't even need the following APIs. we can have them only for convenience which might be not initially necessary.
C#
public static StringComparer Logical { get; }
public static StringComparer LogicalIgnoreCase { get; }
We should clarify the expectation of the numbers would work for integers only (with no signs) and later we can think supporting more complicated cases as needed.
if you all agree, @TylerBrinkley please update the proposal one more time.
I just did a quick performance comparison between Ordinal
vs Invariant
and found that on my machine running under .NET 4.5 Invariant
takes around 360% longer to compare than Ordinal
. So that's definitely a reason Ordinal
ought to be supported.
I'll update the proposal to include just the following APIs
c#
public static StringComparer CreateLogical(CultureInfo culture, CompareOptions options);
public static StringComparer Logical { get; }
public static StringComparer LogicalIgnoreCase { get; }
@tarekgh I've updated the proposal as requested.
Thanks @TylerBrinkley
Sorry, to bring this up again but I just thought of another option. Instead of introducing CreateLogical
we could add a CompareOptions.Logical
flag member and simply add a Create
overload. We could still later add an overload with a NumberStyles
parameter if needed, however that overload will only make sense with CompareOptions.Logical
. Any thoughts?
@TylerBrinkley, if we added to CompareOptions we'll have to support this option in all APIs accepting CompareOptions. in the future if we need to support NumberStyles/NFI with this option, that will complicate the APIs taking CompareOptions. Or we can decide to have it in CompareOptions and then we can later add whatever extra params (NumberStyles/NFI) in StringComparer only. I like the later as it matches what I have originally proposed and you pushed back on it :-)
@tarekgh Agreed, if extra params are needed later then they should only be added in StringComparer
. I like your idea of using CompareOptions
better now as the proposed API using it is now defined in StringComparer
as opposed to the GlobalizationExtensions
class and is therefore more visible.
Something to consider though is that CompareOptions
maps pretty directly to the dwCmpFlags
parameter of CompareStringEx
. If we'd ever want to support the native SORT_DIGITSASNUMBERS
which is available in Windows 7+ it might get a bit confusing having both Logical
and SortDigitsAsNumbers
members.
We may consider using SORT_DIGITSASNUMBERS if it gives us the desired behavior. this will be better for performance too. so CompareOptions.Logical can map to SORT_DIGITSASNUMBERS. This is implementation details which we can investigate more.
if all agree, we can update the proposal to only
C#
namespace System {
public class StringComparer {
public static StringComparer Logical { get; }
public static StringComparer LogicalIgnoreCase { get; }
}
}
with adding Logical option to CompareOptions.
or we can have only adding Logical option to CompareOptions and no change in stringComparer
If you didn't add StringComparer.LogicalIgnoreCase
, I'd create a StringComparerEx.LogicalIgnoreCase
for sanity. 馃槃
Will mapping Logical
to SORT_DIGITSASNUMBERS
present a problem as it's only available on Windows 7+? Also, for Unix environments is it already supported in the Libraries.GlobalizationInterop dll and where is that source code, see https://github.com/dotnet/coreclr/blob/f355c5789d5d7cd6ea10bee326c66f04641cfc90/src/mscorlib/src/Interop/Unix/System.Globalization.Native/Interop.Collation.cs#L22?
Will mapping Logical to SORT_DIGITSASNUMBERS present a problem as it's only available on Windows 7+?
coreclr in general doesn't support pre-Windows 7. so we should be ok here.
It looks like the Unix source is located here. It doesn't look like it's currently supported.
Also, does SORT_DIGITSASNUMBERS
support any UTF-16 digits or just 0-9
?
It looks like the Unix source is located here. It doesn't look like it's currently supported.
we can support it as I know ICU is supporting such functionality.
Also, does SORT_DIGITSASNUMBERS support any UTF-16 digits or just 0-9?
I'll try to find out.
@TylerBrinkley do you want update the proposal?
If we'd ever want to support NumberStyles
in the future this would have to be entirely re-implemented in .NET
with the exact same behavior as CompareStringEx
. That seems unlikely to ever happen if it's implemented as SORT_DIGITSASNUMBERS
.
@TylerBrinkley do you want update the proposal?
Sure.
I double checked Windows behavior and it is supporting the following digits:
0x0030, 0x0039, // ASCII
0x0660, 0x0669, // Arabic-Indic
0x06f0, 0x06f9, // Eastern Arabic-Indic
0x0966, 0x096f, // Devanagari
0x09e6, 0x09ef, // Bengali
0x0a66, 0x0a6f, // Gurmukhi
0x0ae6, 0x0aef, // Gujarati
0x0b66, 0x0b6f, // Oriya
0x0c66, 0x0c6f, // Telugu
0x0ce6, 0x0cef, // Kannada
0x0d66, 0x0d6f, // Malayalam
0x0e50, 0x0e59, // Thai
0x0ed0, 0x0ed9, // Lao
0x0f20, 0x0f29, // Tibetan
0x1040, 0x1049, // Myanmar
0x17e0, 0x17e9, // Khmer
0x1810, 0x1819, // Mongolian
0x1946, 0x194f, // Limbu
0xff10, 0xff19 // Fullwidth
0x1d7ce-0x1d7d7 Math Bold
0x1d7d8-0x1d7e1 Math Double
0x1d7e2-0x1d7eb Math SansSerif
0x1d7ec-0x1d7f5 Math SS Bold
0x1d7f6-0x1d7ff Math Monosp
@tarekgh Great, thanks. I've updated the proposal to include the following API.
c#
namespace System {
public class StringComparer {
public static StringComparer Create(CultureInfo culture, CompareOptions options);
public static StringComparer Logical { get; }
public static StringComparer LogicalIgnoreCase { get; }
}
}
namespace System.Globalization {
public enum CompareOptions {
Logical = 0x00000020
}
}
I don't think we need to have
C#
public static StringComparer Create(CultureInfo culture, CompareOptions options);
after last proposed change. right?
I'll keep what you proposed but we need to be clear all StringComparer changes are only there for convenience and discovery.
Perhaps not, but it could serve as a replacement for
c#
public static StringComparer Create(CultureInfo culture, bool ignoreCase)
and is more visible to consumers.
Another potential issue with implementing this by mapping Logical
to SORT_DIGITSASNUMBERS
came to me. CompareOptions.Ordinal
and CompareOptions.OrdinalIgnoreCase
are specially handled by the framework and aren't compatible with the other values of CompareOptions
. As such we would not be able to handle Ordinal
and Logical
together.
As such we would not be able to handle Ordinal and Logical together.
We'll need to allow having ordinal with Logical. we'll just need to provide some code handle such case. The good thing here is we'll not lose a perf as we don't have search for numbers before performing the comparison. We can just do both operations together.
I'm not sure how that would work but I'll trust you do.
When we get to the implementation we can have more details. what I meant here is we'll handle the cases like (CompareOptions.Ordinal | CompareOptions.Logical) as special case that we will not call the OS for it but instead we'll provide the code that go through the string and compare it ordinarily and take care if we find numbers. if there is anything specific not clear I can try to clarify more.
Okay, that makes sense. It would be a re-implementation of the logic provided by the OS.
I'm wondering why we need to use the OS implementation here. It means the various platforms may differ, have different performance, etc...
The ANSI implementation itself isn't difficult (compare as ordinal until you hit a number in both strings, then parse out all sequential numerical digits and compare, repeat until you hit the end of the string if the sequence of digits is the same).
The Unicode implementation would likely be similar with a decision only needing to be made w.r.t how sequences of numeric digits from different cultures are compared. For example: if you have 1
(West Arabic) and 佟
(East Arabic), do they compare as equal and are sorted ordinarily (佟
would come after 1
due to it being the same value but it coming after 1
ordinarily and it would come before 2
due to having a lesser value) or are they just compared ordinarily due to being different number systems (佟
would come after all strings of West Arabic digits, as it would in normal ordinal compare).
@tannergooding are you talking generally or are you talking about the ordinal case only?
I was talking about implementing this entirely in .NET vs implementing it using the existing underlying OS functionality.
The algorithm to do this is fairly trivial (as listed above) with the only real decision being how identical numbers from different cultures are compared. I am also not sure I buy the argument that it will be less performant. We will have spans shortly and fixed buffers in the meantime, both of which would allow us to compare substrings without needing to worry about expensive allocations or copying. Even a naive implementation shouldn't require any recursing, lookback, etc (comparing two strings should require parsing no more characters than the length of both strings, assuming they both end in a string of digits and are otherwise identical).
The hardest part of the implementation would be ensuring we have the ability to parse strings of digits other than [0..9] (but we should probably have this support somewhere in the framework anyways).
API review:
We should split it into 2 independent features:
namespace System {
public class StringComparer {
+ public static StringComparer Create(CultureInfo culture, CompareOptions options);
}
}
This one is for discoverability. Approved.
Next:
namespace System {
public class StringComparer {
//REJECTED: public static StringComparer Logical { get; }
//REJECTED: public static StringComparer LogicalIgnoreCase { get; }
}
}
namespace System.Globalization {
public enum CompareOptions {
+ NumericOrdering = 0x00000020
}
}
We rejected the 2 Local StringComparer.Logical*
statics, because we would have to have more combinations (also with Ordinal
). We do not believe this option needs such promotion. It is doable via flags.
Discussion about naming -- options: Logical
(proposal & Windows API) vs. DigitsAsNumbers
(Windows API) vs. NumericOrdering
(ICU) vs. NaturalOrdering
(Java) vs. NumericCollation
(ICU under the hood).
Logical
seems very general and doesn't quite capture what it does (although Windows has API called strCmpLogicalW
).DigitsAsNumbers
explains slightly better what it does (used by Windows in SORT_DIGITSASNUMBERS
) - Digits and numbers are the same thing, so it is weird.NaturalOrdering
is too close to Logical
- unclear meaningNumericCollation
- just weird, no one likes itNumericOrdering
We should make sure the implementation is flushed out, if it doesn't call into OS (maybe experiment in another library, e.g. CoreFXLab). We should not take half-working API into CoreFX.
Top-post updated with approved API shape
@tannergooding let's take an example for what I mean by the performance issue if we implement it without calling the OS:
Imagine you are comparing the following 2 strings (using CurrentCulture for instance)
var s1 = "Strasse 10";
var s2 = "Stra脽e 2"
"Strasse" is equals to "Stra脽e"
To do what you have suggested, you'll need first to search the string for numeric (which is 10 in the first string and 2 in the second string) then call the OS for the string comparisons of "Strasse" and "Stra脽e". We have to call the OS because the framework cannot perform such Linguistic comparisons. then we need to parse the numbers and then compare them. we have to repeat this process till we find a difference or reach the end of one of the strings.
If we call the OS, we'll not have to do that and the OS will just go through the string as one pass. so I expect the performance would be much better.
Let me know if this answer your question.
For the record, adding the ICU info we can use:
http://icu-project.org/apiref/icu4c/ucol_8h.html
When turned on, this attribute makes substrings of digits sort according to their numeric values.
This is a way to get '100' to sort AFTER '2'. Note that the longest digit substring that can be treated as a single unit is 254 digits (not counting leading zeros). If a digit substring is longer than that, the digits beyond the limit will be treated as a separate digit substring.
A "digit" in this sense is a code point with General_Category=Nd, which does not include circled numbers, roman numerals, etc. Only a contiguous digit substring is considered, that is, non-negative integers without separators. There is no support for plus/minus signs, decimals, exponents, etc.
Is this okay to work on? I'm interested, and I think I have figured out mostly.
Also, could there be potential issues with the length of the numbers inside the string? ICU spec says:
Note that the longest digit substring that can be treated as a single unit is 254 digits (not counting leading zeros) If a digit substring is longer than that, the digits beyond the limit will be treated as a separate digit substring.
I do not see any digit limits for the Windows CompareStringEx
function; I know 254 digits is an extreme case... but it just makes me wonder.
@Gnbrkm41 thanks for willing to help with this issue. Yes, you can start working on it. Please make sure you follow the final approved proposal https://github.com/dotnet/corefx/issues/395#issuecomment-297115553
Note that, the first part of the approved proposal
```C#
public class StringComparer {
public static StringComparer Create(CultureInfo culture, CompareOptions options);
}
is already implemented and exposed so you don't have to worry about this part. you need to focus on supporting the part
```C#
namespace System.Globalization {
public enum CompareOptions {
NumericOrdering = 0x00000020
}
}
For digits length limit 254, I would say we already expect some different collation behavior between ICU and Windows. So, let's stick with whatever the underlying OS behavior is even if it is different. As you mentioned this is extreme case anyway and we can change the behavior as needed if we get any complains about it.
@tarekgh, where should the tests for this particular StringComparer
go? src/System.Runtime/tests/StringComparer.cs
? src/System.Runtime/tests/StringComparer.netcoreapp.cs
? I'm guessing it should go to netcoreapp
since this only exists in .NET Core... but just asking to make sure.
When we get to the implementation we can have more details. what I meant here is we'll handle the cases like (CompareOptions.Ordinal | CompareOptions.Logical) as special case that we will not call the OS for it but instead we'll provide the code that go through the string and compare it ordinarily and take care if we find numbers. if there is anything specific not clear I can try to clarify more.
Almost missed this; should this go in as well? At the moment I merely added more code for calling the OS API.
Should comparers created with the NumericSorting flag work with GetHashCode
? it doesn't work with StringSort
at the moment it seems, so wondering if same should happen with NumericSorting.
if the Ordinal | Numeric is allowed as a special case, should other flags be allowed as well, such as IgnoreCase
? Ordinal | Numeric | IgnoreCase sounds something that might people want to me...
where should the tests for this particular StringComparer go?
corefx\src\System.Runtime.Extensions\tests\SystemStringComparer.netcoreapp.cs. The other string comparisons tests should go to corefx\src\System.Globalization\tests
Almost missed this; should this go in as well? At the moment I merely added more code for calling the OS API.
Yes, we need to have the full implementation of the feature. Let me know if you need any help in this part.
Should comparers created with the NumericSorting flag work with GetHashCode? it doesn't work with StringSort at the moment it seems, so wondering if same should happen with NumericSorting.
Yes, NumericSorting should be included and supported in getting the hashcode. It should be working with StringSort too.
https://github.com/dotnet/coreclr/blob/eab2a618963fc4e193501fb4e1c3fe0809c5de6d/src/System.Private.CoreLib/shared/System/StringComparer.cs#L132
if the Ordinal | Numeric is allowed as a special case, should other flags be allowed as well, such as IgnoreCase? Ordinal | Numeric | IgnoreCase sounds something that might people want to me...
CompareOptions already has OrdinalIgnoreCase. so we should allow only
OrdinalIgnoreCase | NumericSorting
and Ordinal | NumericSorting
and we can document the behavior.
Yes, NumericSorting should be included and supported in getting the hashcode. It should be working with StringSort too.
https://github.com/dotnet/coreclr/blob/eab2a618963fc4e193501fb4e1c3fe0809c5de6d/src/System.Private.CoreLib/shared/System/StringComparer.cs#L132
I thought that changing this would do, but it seems that there's additional checks when it calls the CompareInfo's ones, which makes StringComparer.Create(InvariantCulture, StringSort).GetHashCode()
fail:
https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.cs#L23-L40
This makes it throw when the internal GetHashCodeOfString
is called.
What would be the desired behaviour for the ordinal, case sensitive, numeric sorting comparer in case of comparer.Compare("Windows 7", "windows 10")
? should it consider the case difference first or the number difference at the end?
@Gnbrkm41 I believe you compare letters first and then whenever you see a number you start comparing by numbers.
Some test cases can look like versions, i.e.:
A.1.2.3
A.12.3.4
A.1.23.4
...
I have long written my own StrCmpLogicalW wrapper for use in WPF, so I support this direction.
@krwq, yes, but what I meant was whether the numbers have more significance. For example, consider these strings: file 1
and File 2
. If we abide by that rule (compare letters first), ordinal sorting would put File 2
first then file 1
since it considers the case difference first.
It seems that CompareStringEx function with InvariantCulture and SORT_DIGITSASNUMBERS
flag sort file 1
first then File 2
, which suggests that it prioritises the numeric difference rather than the casing, even when it's case sensitive. This doesn't apply to StrCmpLogicalW since it is case insensitive by default.
I was wondering, for the ordinal implementation (which we'll take care of ourselves, rather than calling the OS; anything else is handled by the OS.), whether we should do the same as CompareStringEx, which is to consider numbers first if there's only case difference, or to consider the casings first. Personally casings first makes more sense to me since it's ordinal.
Yes, NumericSorting should be included and supported in getting the hashcode. It should be working with StringSort too.
https://github.com/dotnet/coreclr/blob/eab2a618963fc4e193501fb4e1c3fe0809c5de6d/src/System.Private.CoreLib/shared/System/StringComparer.cs#L132I thought that changing this would do, but it seems that there's additional checks when it calls the CompareInfo's ones, which makes
StringComparer.Create(InvariantCulture, StringSort).GetHashCode()
fail:
https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.cs#L23-L40This makes it throw when the internal
GetHashCodeOfString
is called.
How about NumericSorting | Ordinal
or NumericSorting | OrdinalIgnoreCase
? should we allow GetHashCode to be called for these as well?
@Gnbrkm41 intuitively "File 2" should be first not sure why would we compare in the middle first or how would that be different from sorting "B1" and "A2".
For GetHashCode question: as long as HashCode(stringA) == HashCode(stringB) for every Equals(stringA, stringB) for a given setting then implementation is correct (although hash should also be reasonable since always returning 0 as hash code would also meet the criteria)
Let's think more how we can handle the ordinal with the NumericSorting from design and implementation point of view. Ordinal and OrdinalIgnoreCase always treated as exclusive flags which cannot be combined with any other flags. changing that would require a lot changes in all places handle these flags. We may think if there is a better way we can handle this.
So I did some testing on both Windows and Linux, and it seems that there are some interesting differences:
00000011
and 11
compares different on Windows (longer one is considered to come earlier)AA_11_00123A
< AA_11_123
on Windows, AA_11_00123A
> AA_11_123
on LinuxShould we try to consolidate this behaviour so that they're platform independent, or would it be okay to just leave it as-is?
It is expected we can find differences in the sort behavior among Windows and ICU. This is happen in other cases and not with numbers only. we can report the issue to Windows if we think ICU behavior is better.
I personally think that the current Windows behaviour is better, since it would result in consistent sorting output regardless of the unsorted state of the collection or algorithm... but on the other hand different representations equaling others do make sense as well.
considering this is numeric sorting I think 011 should equal 11 (from sorting point of view consistent sorting makes it more convenient but you can say the same with comparing case insensitive "aA" with "Aa"). Perhaps there are some switches somewhere to unify this behavior.
I think, logically, the value of 011 and 11 should be equivalent, but they would then be sorted among themselves (likely ordinally), to avoid mixing a011b
, a11c
, a011d
, etc
I agree with @tannergooding the ICU behavior looks more correct to me.
CC @ShawnSteele if he has any comment on Windows behavior.
Still working on this; Just wondering, can anyone point me to any reliable string hashing algorithms? making it to compare different representations of strings equal would mean that I'd have to implement another hashing algorithm for it instead of using existing ones.
@Gnbrkm41 why you need a new hashing algorithm? why can't using whatever used today in the string comparers?
For example, if I were to treat "11" and "0011" equal with the comparer, it should produce the same hashcode, no? None of the existing comparers do that.
If I were to do that myself with the ordinal numeric comparer that I'll have to implement myself, it doesn't look like I can reuse current implementations of existing comparers. the Ordinal/OrdinalIgnoreCase
comparers seem to use Marvin implemented in managed code, and all the other ones seem to call the underlying OS (for windows it seem to use LCMapStringEx which does not compare such strings equal).
When you enable the NumericOrder option through the underlying native OS APIs (ICU or NLS), then the hashing should automatically work without need to do anything. I agree this will need to do more work for ordinal cases as you pointed. That is why I previously mentioned https://github.com/dotnet/corefx/issues/395#issuecomment-525974010 we need to rethink about the current proposed design as it can have some issues when using ordinal.
UCOL_NUMERIC_COLLATION
attribute of ICU or SORT_DIGITSASNUMBERS
flag of NLS (Windows)Note that the longest digit substring that can be treated as a single unit is 254 digits (not counting leading zeros) If a digit substring is longer than that, the digits beyond the limit will be treated as a separate digit substring.
), whereas NLS does not specify one.Ordinal
or OrdinalIgnoreCase
to be specified with NumericOrdering
(NumericOrdering | Ordinal
, NumericOrdering | OrdinalIgnoreCase
)\x30
- \x39
) will be considered as valid digits"042".Compare("42")
will return 0GetHashCode
because now numbers will compare equal with leading zeroesIs there any other suggestions, or potentially points that are missed out?
@Gnbrkm41 thanks for your follow up here.
Your suggested plan looks reasonable but I have 2 concerns with it, which is mainly related to the ordinal comparisons:
I didn't have time to investigate more, but my thoughts to solve these 2 concerns was:
currently we have a lot of code which assume Ordinal and OrdinalIgnoreCase are exclusives options. touching all places handle such assumptions will be risky for causing functional or performance regression. That is why I was trying to think in avoiding that.
maybe it is worth to have OrdinalNumericOrdering and OrdinalIgnoreCaseNumericOrdering options. having these will avoid us touching a lot of current code.
Fair point. I could perhaps take a look at the list of methods that take CompareOptions
and see if it'd be a huge effort to implement that; Not entirely sure about the performance implications though.
handling ordinal with only 0-9 ASCII numbers is not good thing to start with. it is easy to run into cases that require more than that.
Also fair. If that's the case, what set of characters should we consider as valid digits? Windows and ICU seem to consider different sets of characters as digits. Perhaps we could take from both?
Also repeating @tannergooding's opinion:
The Unicode implementation would likely be similar with a decision only needing to be made w.r.t how sequences of numeric digits from different cultures are compared. For example: if you have
1
(West Arabic) and佟
(East Arabic), do they compare as equal and are sorted ordinarily (佟
would come after1
due to it being the same value but it coming after1
ordinarily and it would come before2
due to having a lesser value) or are they just compared ordinarily due to being different number systems (佟
would come after all strings of West Arabic digits, as it would in normal ordinal compare).
I agree with Tanner that if we compare different number groups I think we should just compare ordinarily rather than by the value.
We need to look at Windows and ICU how they support the ordinal with the numeric ordering. if they offer any help with these.
While this would be nice, I'm worried that their behaviour might differ across OSes (just like how ICU and NLS support different set of numeric characters), which might lead to cases where the sorted order differ. Even if this is by design this would effectively mean we would not have a way to get consistent results across OSes.
While this would be nice, I'm worried that their behaviour might differ across OSes
This is the general case we already have today so we don't have to worry much about that especially we are going to support using ICU on Windows too so we already have a solution for the users regarding this issue.
CompareInfo.Compare(string?, string?, CompareOptions)
CompareInfo.Compare(string?, int, string?, int, CompareOptions)
CompareInfo.Compare(string?, int, int, string?, int, int, CompareOptions)
CompareInfo.IsPrefix(string, string, CompareOptions)
CompareInfo.IsSuffix(string, string, CompareOptions)
CompareInfo.IndexOf(string, char, CompareOptions)
CompareInfo.IndexOf(string, string, CompareOptions)
CompareInfo.IndexOf(string, char, int, CompareOptions)
CompareInfo.IndexOf(string, string, int, CompareOptions)
CompareInfo.IndexOf(string, char, int, int, CompareOptions)
CompareInfo.IndexOf(string, string, int, int, CompareOptions)
CompareInfo.LastIndexOf(string, char, CompareOptions)
CompareInfo.LastIndexOf(string, string, CompareOptions)
CompareInfo.LastIndexOf(string, char, int, CompareOptions)
CompareInfo.LastIndexOf(string, string, int, CompareOptions)
CompareInfo.LastIndexOf(string, char, int, int, CompareOptions)
CompareInfo.LastIndexOf(string, string, int, int, CompareOptions)
CompareInfo.GetSortKey(string, CompareOptions)
CompareInfo.GetHashCode(string, CompareOptions)
CompareInfo.GetHashCode(ReadOnlySpan<char>, CompareOptions)
GlobalizationExtensions.GetStringComparer(CompareInfo, CompareOptions)
string.Compare(string?, string?, CultureInfo?, CompareOptions)
string.Compare(string?, int, string?, int, int, CultureInfo?, CompareOptions)
StringComparer.Create(CultureInfo, CompareOptions)
While the amount of public APIs that accept CompareOptions seem like a lot, most of them end up at internal CultureAwareComparer
and shared implementation in CompareInfo
where the enum is validated; It feels like the amount of work needed to be done will be about the same for introducing a dedicated member for it, when compared to allowing a different combination with existing members.
Is this useful?: https://natsort.readthedocs.io/en/master/index.html
Most helpful comment
I think, logically, the value of 011 and 11 should be equivalent, but they would then be sorted among themselves (likely ordinally), to avoid mixing
a011b
,a11c
,a011d
, etc