Runtime: Breaking change with string.IndexOf(string) from .NET Core 3.0 -> .NET 5.0

Created on 22 Oct 2020  Â·  76Comments  Â·  Source: dotnet/runtime

Description

I'm extending a package to support .NET 5.0 and ran into a breaking change. Given the console application:

using System;

namespace ConsoleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            var actual = "Detail of supported commands\n============\n## Documentation produced for DelegateDecompiler, version 0.28.0 on Thursday, 22 October 2020 16:03\n\r\nThis file documents what linq commands **DelegateDecompiler** supports when\r\nworking with [Entity Framework Core](https://docs.microsoft.com/en-us/ef/core/) (EF).\r\nEF has one of the best implementations for converting Linq `IQueryable<>` commands into database\r\naccess commands, in EF's case T-SQL. Therefore it is a good candidate for using in our tests.\r\n\r\nThis documentation was produced by compaired direct EF Linq queries against the same query implemented\r\nas a DelegateDecompiler's `Computed` properties. This produces a Supported/Not Supported flag\r\non each command type tested. Tests are groups and ordered to try and make finding things\r\neasier.\r\n\r\nSo, if you want to use DelegateDecompiler and are not sure whether the linq command\r\nyou want to use will work then clone this project and write your own tests.\r\n(See [How to add a test](HowToAddMoreTests.md) documentation on how to do this). \r\nIf there is a problem then please fork the repository and add your own tests. \r\nThat will make it much easier to diagnose your issue.\r\n\r\n*Note: The test suite has only recently been set up and has only a handful of tests at the moment.\r\nMore will appear as we move forward.*\r\n\r\n\r\n### Group: Unit Test Group\n#### [My Unit Test1](../TestGroup01UnitTestGroup/Test01MyUnitTest1):\n- Supported\n  * Good1 (line 1)\n  * Good2 (line 2)\n\r\n#### [My Unit Test2](../TestGroup01UnitTestGroup/Test01MyUnitTest2):\n- Supported\n  * Good1 (line 1)\n  * Good2 (line 2)\n\r\n\r\n\nThe End\n";

            var expected = "\n#### [My Unit Test2](";

            Console.WriteLine($"actual.Contains(expected): {actual.Contains(expected)}");
            Console.WriteLine($"actual.IndexOf(expected): {actual.IndexOf(expected)}");
        }
    }
}

I get different results based on the runtime from .NET Core 3.0 -> .NET 5.0:

.NET Core 3.0:

actual.Contains(expected): True
actual.IndexOf(expected): 1475

.NET 5.0:

actual.Contains(expected): True
actual.IndexOf(expected): -1

Configuration

Windows 10 Pro Build 19041 x64
.NET Core 3.1.9
.NET 5.0.0-rc.2.20475.5

Regression?

Yes, it worked through .NET Core 3.1.9

area-System.Globalization question

Most helpful comment

@tarekgh, I agree that the different results between Contains and IndexOf aren't the problem per se.

The problem is clearly IndexOf which is unable to find one ASCII-only string inside of another ASCII-only string (I'm not sure there ever was any locale-dependent behavior imposed on ASCII-only strings!).

This isn't something I would expect from any locale/NLS/ICU-related changes; in fact, I couldn't think of any other programming language/runtime behaving like that.

Here's a simplified test case, broken (I mean, giving me totally unexpected result) on .NET 5 RC 2:

var actual = "\n\r\nTest";
var expected = "\nTest";

Console.WriteLine($"actual.IndexOf(expected): {actual.IndexOf(expected)}"); // => -1

Should it really work like that? Also, why? What does it trying to do actually?

Was it really a planned change?

Yes switching to ICU is intentional change for different reasons.

I'm sorry, but I don't believe this was a planned change, so I'd like to emphasize: I couldn't imagine anyone _planning_ such a change. Like, folks from .NET team sat together and discussed:

Does string "\n\r\nTest" contain "\nTest" with ICU enabled? No, clearly doesn't!

And nobody complained? Not a chance!

This doesn't look like a planned or expected change, and instead looks like a very serious bug, a big compatibility blocker. Because of that, new and ported .NET applications won't properly work on the new runtime, because they won't be able to find substrings inside of string!

Why does ICU care about the line endings anyway? Do some locales have their own locale-specific line endings?

P.S. Yes, you could argue that one should really always call some variant of culture-independent IndexOf, like with the ordinal flag. But, if you've decided to break it _that_ hard in .NET 5, couldn't you just make it to use the sane, ordinal default then? I think it would break less applications than the current change we're seeing in .NET 5 RC 2.

Also, I think we all understand that, despite IndexOf always behaving in a culture-specific manner, there're _tons_ of code in the wild that use IndexOf without the ordinal flags, and that code _used to work_ (in some/most cases, at least). And it will stop working after .NET 5 update.

All 76 comments

@tarekgh

Tagging subscribers to this area: @tarekgh, @safern, @krwq
See info in area-owners.md if you want to be subscribed.

This is by design as in .NET 5.0 we have switched using ICU instead of NLS. You can look at https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu for more details.

You have the option to use the config switch System.Globalization.UseNls to go back to old behavior but we don't recommend doing that as ICU is more correct and moving forward using ICU will give consistency across OS's.

forgot to say, if you want IndexOf behave as Contains, you should use Ordinal comparisons at that time.

C# actual.IndexOf(expected, StringComparison.Ordinal)

You have the option to use the config switch System.Globalization.UseNls to go back to old behavior but we don't recommend doing that as ICU is more correct and moving forward using ICU will give consistency across OS's.

Yeah, if you run this code in Unix targeting netcoreapp3.1 you will see this same behavior:

 santifdezm  DESKTOP-1J7TFMI  ~  experimental  indexof  $   /home/santifdezm/experimental/indexof/bin/Debug/netcoreapp3.1/linux-x64/publish/indexof
actual.Contains(expected): True
actual.IndexOf(expected): -1

and as @tarekgh with Ordinal comparison it returns the expected result.

 santifdezm  DESKTOP-1J7TFMI  ~  experimental  indexof  $  /home/santifdezm/experimental/indexof/bin/Debug/net5.0/linux-x64/publish/indexof
actual.Contains(expected): True
actual.IndexOf(expected): 1475

I think this is failing because the mix of \r\n and \n in the source string. If I replace all instances of \r\n with \n it works. Same is true if I make everything \r\n. It's just the mix that is resulting in different results from ICU's comparison.

Issue as reported on Twitter was that within the 5.0 runtime, repeated calls to string.IndexOf with the same inputs was giving different results on each call.

https://twitter.com/jbogard/status/1319381273585061890?s=21

Edit: Above was a misunderstanding.

@GrabYourPitchforks can we update the issue title then? As this is technical a breaking change, but this happens on Windows and Unix... right?

I pinged Jimmy offline for clarification. It's possible I misunderstood his original issue report. 280-char forums aren't always efficient at communicating bugs clearly. ;)

Just to clarify, Contains API is performing ordinal operation. IndexOf without any string comparison flags is linguistic operation and not ordinal. If want to compare Contains behavior with IndexOf, need to use IndexOf(expected, StringComparison.Ordinal).
If need to learn more about the difference, https://docs.microsoft.com/en-us/dotnet/csharp/how-to/compare-strings is useful link.

I received clarification on Twitter. The app isn't calling IndexOf in a loop. This is just a standard 3.0 vs. 5.0 behavioral difference report.

@GrabYourPitchforks could you share the link https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu on your twitter replies and mention we have a config switch to go back to old behavior?

I received clarification on Twitter. The app isn't calling IndexOf in a loop. This is just a standard 3.0 vs. 5.0 behavioral difference report.

Thanks, @GrabYourPitchforks... based on this, closing it as by design.

To add more here, if you want to get the old behavior without switching back to NLS, you can do

```C#
CultureInfo.CurrentCulture.CompareInfo.IndexOf(actual, expected, CompareOptions.IgnoreSymbols)


or 

```C#
    actual.IndexOf(expected, StringComparison.Ordinal)

instead of

C# actual.IndexOf(expected)

and you should get the desired behavior.

I can't see anything about \r\n vs \n in the linked ICU-related documentation (https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu).

Was it really a planned change?

@ForNeVeR it will be hard to list every single difference between ICU and NLS. the document is talking about the main change of switching to ICU. As I pointed before earlier, it is not right to compare the results of Contains with IndexOf without the StringComparison parameters. I have listed above a couple of ways you can get the previous behavior if desired to. From the report of this issue, it looks to me the usage of IndexOf should use Ordinal option and it is incorrect to use the linguistic comparisons in such case. Using linguistic comparison in such case can depend on the current culture which possible to give different results on different environments.

Was it really a planned change?

Yes switching to ICU is intentional change for different reasons. Windows currently promoting using ICU over NLS. ICU is the future anyway. Also, ICU will give the opportunity to have consistent behaviors across Windows/Linux/OSX or any supported platforms. Using ICU will give the opportunity to the apps to customize the globalization behavior if they desired to.

As the document indicated, you still have the option to switch back to old behavior if you want to.

Ouch, the referenced doc says that ICU/NLS behavior on Windows might silently switch based on the icu.dll availability in the actual environment. This might come as a big surprise for published self-contained apps. I'd expect .NET to ship ICU to work around this issue if the switch were decided on, and as ICU is not available on all target environments. This optional runtime dependency makes things even funnier.

I'd expect .NET to ship ICU to work around this issue if the switch were decided on, and as ICU is not available on all target environments. This optional runtime dependency makes things even funnier.

The ICU now is published as NuGet package. The apps can use such packages to have the self contained app ensure having ICU. look at the app-local section in the doc. In short, the app has full control on the behavior want to get.

@tarekgh, I agree that the different results between Contains and IndexOf aren't the problem per se.

The problem is clearly IndexOf which is unable to find one ASCII-only string inside of another ASCII-only string (I'm not sure there ever was any locale-dependent behavior imposed on ASCII-only strings!).

This isn't something I would expect from any locale/NLS/ICU-related changes; in fact, I couldn't think of any other programming language/runtime behaving like that.

Here's a simplified test case, broken (I mean, giving me totally unexpected result) on .NET 5 RC 2:

var actual = "\n\r\nTest";
var expected = "\nTest";

Console.WriteLine($"actual.IndexOf(expected): {actual.IndexOf(expected)}"); // => -1

Should it really work like that? Also, why? What does it trying to do actually?

Was it really a planned change?

Yes switching to ICU is intentional change for different reasons.

I'm sorry, but I don't believe this was a planned change, so I'd like to emphasize: I couldn't imagine anyone _planning_ such a change. Like, folks from .NET team sat together and discussed:

Does string "\n\r\nTest" contain "\nTest" with ICU enabled? No, clearly doesn't!

And nobody complained? Not a chance!

This doesn't look like a planned or expected change, and instead looks like a very serious bug, a big compatibility blocker. Because of that, new and ported .NET applications won't properly work on the new runtime, because they won't be able to find substrings inside of string!

Why does ICU care about the line endings anyway? Do some locales have their own locale-specific line endings?

P.S. Yes, you could argue that one should really always call some variant of culture-independent IndexOf, like with the ordinal flag. But, if you've decided to break it _that_ hard in .NET 5, couldn't you just make it to use the sane, ordinal default then? I think it would break less applications than the current change we're seeing in .NET 5 RC 2.

Also, I think we all understand that, despite IndexOf always behaving in a culture-specific manner, there're _tons_ of code in the wild that use IndexOf without the ordinal flags, and that code _used to work_ (in some/most cases, at least). And it will stop working after .NET 5 update.

The problem is clearly IndexOf which is unable to find one ASCII-only string inside of another ASCII-only string (I'm not sure there ever was any locale-dependent behavior imposed on ASCII-only strings!).

It is not true ASCII is locale-independent. look at the link http://userguide.icu-project.org/collation/concepts as an example how the behavior of ASCII characters can differs for different cultures.

For example, in the traditional Spanish sorting order, "ch" is considered a single letter. All words that begin with "ch" sort after all other words beginning with "c", but before words starting with "d".
Other examples of contractions are "ch" in Czech, which sorts after "h", and "lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n" respectively.

Also, I want to clarify that ICU picks its data and behavior from Unicode Standard which is well thought by many experts. @GrabYourPitchforks is going to post more details about \r\n\ case we are talking about here. but meanwhile, you may familiarize yourself with the doc https://unicode.org/reports/tr29/ especially in the sections mentioning the following:

Do not break between a CR and LF. Otherwise, break before and after controls.
--
GB3 | CR | Ă— | LF
GB4 | (Control \| CR \| LF) | ÷ |  
GB5 |   | ÷ | (Control \| CR \| LF)

Why does ICU care about the line endings anyway? Do some locales have their own locale-specific line endings?

This is addressed in last paragraph.

I'm sorry, but I don't believe this was a planned change, so I'd like to emphasize: I couldn't imagine anyone planning such a change. Like, folks from .NET team sat together and discussed:
This doesn't look like a planned or expected change, and instead looks like a very serious bug, a big compatibility blocker. Because of that, new and ported .NET applications won't properly work on the new runtime, because they won't be able to find substrings inside of string!

This is well planned worked and thought deeply in it. you may look at the issue https://github.com/dotnet/runtime/issues/826 which we published long ago and shared it publicly.
I want to emphasize too that globalization behavior can change at anytime not only for the .NET but for the OS's and other platforms. This is also why we supported ICU app-local feature to allow apps use specific ICU version to ensure the behavior they use is not going to change. Another thing, Windows itself is in the process promoting using ICU and one day ICU behavior will be what the majority of the users is going to use.

like with the ordinal flag. But, if you've decided to break it that hard in .NET 5, couldn't you just make it to use the sane, ordinal default then? I think it would break less applications than the current change we're seeing in .NET 5 RC 2.

Actually we have tried before to make the Ordinal behavior as the default before during Silverlight releases and that caused much more problems than what reported here. We are looking at more ways to help the developers to be conscious when calling something like IndexOf and be intentionally provide StringComparison flags to express the intention. We welcome any ideas you may come up with too.

Also, I think we all understand that, despite IndexOf always behaving in a culture-specific manner, there're tons of code in the wild that use IndexOf without the ordinal flags, and that code used to work (in some/most cases, at least). And it will stop working after .NET 5 update.

That is why we are providing a config switch to switch back to old behavior if you desire too. look at System.Globalization.UseNls

@tarekgh, thanks for thorough explanation!

For now, I think it's better for me to wait for details on this particular \r\n behavior. It isn't clear to me how IndexOf uses "Grapheme Cluster Boundary Rules" (and whether it should do that) when performing this particular search).

Also, even if the Unicode spec is relevant here (which may very well be!), from reading https://unicode.org/reports/tr29/, I'm not sure it forbids to match that last \n. As I read the spec, it says CR | Ă— | LF, where Ă— means "No boundary (do not allow break here)". So, when breaking a sequence \r\n\n, it only forbids to place the "break" between the first and the second characters, but it should be okay to place the "break" before the third one, no? So, I read \r\n\n as two separate "grapheme clusters", and, even if IndexOf has to only match full grapheme clusters and never touch parts of clusters, it should still find substring \nTest inside of string \n\r\nTest.

Also, are you telling that other runtimes/programming languages relying on ICU and/or Unicode spec should behave the same w.r.t. this particular example?

We are looking at more ways to help the developers to be conscious when calling something like IndexOf and be intentionally provide StringComparison flags to express the intention. We welcome any ideas you may come up with too.

_(A necessary disclaimer: I work for JetBrains on several projects, including ReSharper.)_

I didn't initially wanted to bring this point here to not sound like an advertisement, but I guess this is very relevant, so I'll have to do it. ReSharper by default will show the following warning for the user code calling IndexOf:
image

_(please note that I wasn't aware of this particular ReSharper diagnostic before being involved in this thread, so I'm not participating here just to bring this argument)_

So, I think it would be a very good idea to show such notification by default in every other tool, too, or maybe even totally deprecate this bogus method with such notice.

@ForNeVeR

Also, even if the Unicode spec is relevant here (which may very well be!), from reading unicode.org/reports/tr29, I'm not sure it forbids to match that last \n. As I read the spec, it says CR | Ă— | LF, where Ă— means "No boundary (do not allow break here)". So, when breaking a sequence \r\n\n, it only forbids to place the "break" between the first and the second characters, but it should be okay to place the "break" before the third one, no? So, I read \r\n\n as two separate "grapheme clusters"

That is correct. \r\n\n will be 2 clusters as \r\n and \n.

and, even if IndexOf has to only match full grapheme clusters and never touch parts of clusters, it should still find substring \nTest inside of string \n\r\nTest.

That is incorrect. \n\r\nTest will be split into the parts. \n, \r\n, and Test. it is obvious \nTest cannot part of this string. think about it of replacing the cluster \r\n with some symbol X. now the source string will be \nXTest which doesn't contains \nTest.

Also, are you telling that other runtimes/programming languages relying on ICU and/or Unicode spec should behave the same w.r.t. this particular example?

if using default collation strength level, then the answer is yes. ICU can allow changing the strength level which can affect the result. For example, as I mentioned earlier, doing something like:

CultureInfo.CurrentCulture.CompareInfo.IndexOf(actual, expected, CompareOptions.IgnoreSymbols)

will change the strength level and will make the operation ignore the symbols (which will change the behavior of \n and \r as it will be ignored at that time)

Also, I have wrote pure ICU native C app and ran the same case:

void SearchString(const char *target, int32_t targetLength, const char *source, int32_t sourceLength)
{
    static UChar usource[100];
    static UChar utarget[100];

    u_charsToUChars(source, usource, sourceLength);
    u_charsToUChars(target, utarget, targetLength);

    UErrorCode status = U_ZERO_ERROR;
    UStringSearch* pSearcher = usearch_open(utarget, targetLength, usource, sourceLength, "en_US", nullptr, &status);
    if (!U_SUCCESS(status))
    {
        printf("usearch_open failed with %d\n", status);
        return;
    }

    int32_t index = usearch_next(pSearcher, &status);
    if (!U_SUCCESS(status))
    {
        printf("usearch_next failed with %d\n", status);
        return;
    }

    printf("search result = %d\n", index);
    usearch_close(pSearcher);
}


int main()
{
    SearchString("\nT", 2, "\r\nT", 3);
    SearchString("\nT", 2, "\n\nT", 3);
}

this app will output the results:

search result = -1
search result = 1

which is identical to the behavior you are seeing with .NET.

For now, I think it's better for me to wait for details on this particular \r\n behavior. It isn't clear to me how IndexOf uses "Grapheme Cluster Boundary Rules" (and whether it should do that) when performing this particular search).

For sure clustering is affecting the collation operation. If you look at http://unicode.org/reports/tr29/tr29-7.html, it is clearly states the following:

Grapheme clusters include, but are not limited to, combining character sequences such as (g + °), digraphs such as Slovak “ch”, and sequences with letter modifiers such as kw. Grapheme cluster boundaries are important for collation, regular-expressions, and counting “character” positions within text. Word boundaries, line boundaries and sentence boundaries do not occur within a grapheme cluster. In this section, the Unicode Standard provides a determination of where the default grapheme boundaries fall in a string of characters. This algorithm can be tailored for specific locales or other customizations, which is what is done in providing contracting characters in collation tailoring tables.

I am not sure if there will be more details but I'll let @GrabYourPitchforks comment if he has more to add here.

So, I think it would be a very good idea to show such notification by default in every other tool, too, or maybe even totally deprecate this bogus method with such notice.

Thanks!
This is same direction we are thinking in too.

For my own sake, I compared the various overloads between versions:

| Method | netcoreapp3.1 | net5.0 |
|----------------------------------------------------------------|-----------------|----------|
| actual.Contains(expected) | True | True |
| actual.IndexOf(expected) | 1475 | -1 |
| actual.Contains(expected, StringComparison.CurrentCulture) | True | False |
| actual.IndexOf(expected, StringComparison.CurrentCulture) | 1475 | -1 |
| actual.Contains(expected, StringComparison.Ordinal) | True | True |
| actual.IndexOf(expected, StringComparison.Ordinal) | 1475 | 1475 |
| actual.Contains(expected, StringComparison.InvariantCulture) | True | False |
| actual.IndexOf(expected, StringComparison.InvariantCulture) | 1475 | -1 |

Please do include an analyzer for this.

This seems like one of those changes that, while good in the long run, while create a massive amount of churn once .NET 5 launches. So if the behavior of these methods differs between .NET 5 and .NET Core 3.1, what's going to happen when a .NET 5 calls an object defined inside a .NET Standard 2.0 library that manipulates a string passed into it from the .NET callsite? Does the old behavior get used or the new behavior?

@Aaronontheweb new behavior. I saw this initially from an assertion in NUnit3, which targets netstandard2.0. After upgrading, my test started failing when I only changed the target framework.

That's not great - can't control what older libraries I'm referencing do if I want to upgrade.

How many apps will not detect that in unit tests and put that into production?
Did .NET team consider the pain and problems and cost this could cause?

That's not great - can't control what older libraries I'm referencing do if I want to upgrade.

nice Trap!
happy time wasting by hunting weird bugs 🎉
but why InvariantGlobalization doesn't help out with this problem ?

Please do include an analyzer for this.

Yeah. And don't forget F# support.

How many apps will not detect that in unit tests

Zero - since unit tests aren't meant to test external libs/frameworks like the .NET BCL itself.

My view is that there should be an Assembly-level Attribute which can control the mode of this behaviour change. At least then, you can opt-in/out of it at a per-assembly level. This then means that the .NET Standard problem goes away too.

That's not great - can't control what older libraries I'm referencing do if I want to upgrade.

I don't see why you can't control it. All string comparisons are either using ICU or NLS. You can opt-out of ICU using the compat switch if you would like, and all of your libraries will revert to the old behavior.

You cannot rely on globalization data staying stable over time. Take it from the Windows team that they are not afraid to break people who depend on stable globalization data. Globalization functions should be a black box; I do not think it makes sense for libraries (particularly ones that target .NET Standard) to say that they are depending on implementation details like this.

I'm sure that just as many people have complained about .NET globalization functions returning different results on Windows vs. Linux (and perhaps even more people haven't even noticed yet). It's better to unify the behavior between Windows and other platforms, and any correct code should not be relying on globalization data being immutable, regardless.

Would you consider to make a breaking change to also make StringComparison.Ordinal the default comparison strategy? Seeing as globalization is so unstable, it makes sense that at least the default implementation uses a stable algorithm instead. I'm willing to bet that 99.9% of people who use string.Equals(...) or string.Contains(...) etc. without passing StringComparison are not doing it with the explicit intention of handling weird quirks related to locales.

Edit: I guess my question has already been answered:

Actually we have tried before to make the Ordinal behavior as the default before during Silverlight releases and that caused much more problems than what reported here. We are looking at more ways to help the developers to be conscious when calling something like IndexOf and be intentionally provide StringComparison flags to express the intention. We welcome any ideas you may come up with too.

You can opt-out of ICU using the compat switch if you would like, and all of your libraries will revert to the old behavior.

I make libraries for a living, not applications. I'd prefer to have a compile-time way of handling this.

Most of the work we do is InvariantCulture, which per my previous understanding is supposed to be immutable by design. Looks like IndexOf's behavior is different between .NET 5.0 and .NET Core 3.1 under those circumstances too.

How can any analyzer help for existing projects?

@petarrepac that's also a C#-only thing.

@isaacabraham and VB too ;)

I'd prefer to have a compile-time way of handling this.

@Aaronontheweb you do have a compile-time way of handling this (when compiling apps). You can add this to your project:

<ItemGroup>
  <RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
</ItemGroup>

EDIT: this is only for apps consuming a library, unfortunately you can't control this when writing a library.

In the long run the Windows team is promoting to move to ICU, so at some point ICU will be the globalization story, and we're just a thin wrapper around OS libraries.

what's going to happen when a .NET 5 calls an object defined inside a .NET Standard 2.0 library that manipulates a string passed into it from the .NET callsite? Does the old behavior get used or the new behavior?

The new behavior will get used, because it depends on the runtime and .NET Standard is just a standard for runtimes to implement. However, notice that this is bringing platform consistency in between Unix and Windows, if you run this same libraries which people have concerns about in Unix, you will get the ICU result as ICU is the backing library in Unix.

@reflectronic is right in his all points https://github.com/dotnet/runtime/issues/43736#issuecomment-716681586.

to comment on @jbogard results mentioned here https://github.com/dotnet/runtime/issues/43736#issuecomment-716527590, you can just summarize the results by comparing between the Linguistic behavior between Windows and ICU. By the way, ICU now used by many applications on Windows and it is expected the usage to increase. Also, these results is excluding Linux with .NET Core 3.1 and below. which will show consistency between .NET 5.0 and previous versions on Linux.

The point about the library that used to work will be broken, this is not fully true because such libraries were already broken on Linux.

Would you consider to make a breaking change to also make StringComparison.Ordinal the default comparison strategy?

I mentioned earlier we already tried that before but the size of complains was very big that we couldn't apply it. We are looking at the ways to help developers to be conscious specifying the string comparison flags when calling the collation APIs.

I make libraries for a living, not applications. I'd prefer to have a compile-time way of handling this.

Yes, some analyzer can help in such case. usually look at the collation APIs calls and look which one was not showing the intent of using ordinal or linguistic operation.

How can any analyzer help for existing projects?

Analyzers scan the code and detect the calls of collations APIs that don't pass flags would help to look at such calls and fix if detected have a problem.

that's also a C#-only thing.

The change is inside the .NET runtime which should be global and not restricted to C#.

@tarekgh how does a C# analyser manifest itself in a VB or F# codebase?

Most of the work we do is InvariantCulture, which per my previous understanding is supposed to be immutable by design.

Critically, before this change, it was impossible to actually depend on this for cross-platform code; what NLS and ICU behave like, even when using invariant culture, are not always the same (as evidenced by this issue). Like @tarekgh said, this code has been acting differently on Linux this whole time. Now that this change has been made, for any up-to-date Windows install, what "invariant culture" means should _actually_ be consistent on all platforms.

This may come as a surprise, but we did eventually find and fix dozens of bugs related to platform differences over the years as a result of user reporting, as I'm sure many other library authors have done for years and years.

I'm just not excited at the prospect of .NET 5 introducing a new crop of unknown unknowns and revisiting those fixes for not just my libraries, but our downstream dependencies too. That's significant economic cost to us that doesn't create new productivity improvements for our users. Someone in MSFT ought to give that some consideration in their costs / benefits discussion on making this change.

Edit: like, I get the technical merits already - yes, thank you. Please help sell the non-obvious economic benefits of making this change as opposed to the costs. Why should users take the leap and upgrade to .NET 5 anyway given what a rat's nest this issue appears to be?

@tarekgh how does a C# analyser manifest itself in a VB or F# codebase?

We can cover C# and VB with a single analyzer (we strive to make our analyzers language-agnostic wherever possible), but we can't get analyzer coverage for F# at this time.

@Aaronontheweb anyone using cultural functionality and assume it is not going to change is already broken even if we didn't do this ICU changes. Look at the blog https://docs.microsoft.com/en-us/archive/blogs/shawnste/locale-culture-data-churn from Windows team which telling even NLS behavior is changing for the sake of improvements too. So, the issue here is not about moving to ICU more than just catching any wrong assumptions from the apps/libraries perspective. upgrading to 5.0 is same as upgrading to other previous versions. apps will get a lot of new cool features and apps need to be tested as could be some breaking changes between releases. I am not considering globalization behavior change is really breaking as we always telling globalization can change at anytime between OS's and OS versions. As indicated before, we have the config switch to choose to upgrade to 5.0 with keep using NLS. that will make ICU is not really a factor in the upgrade decision. now with ICU, the apps will have the opportunity to get more consistency across OS's and even can have more control on the globalization behavior if decided to use the app-local ICU. We are providing much more control to the apps than before.

Related: https://github.com/dotnet/runtime/issues/43802

(That issue doesn't track IndexOf _per se_. Rather, it discusses the unintended consequences of the the Compare routine defaulting to a culture-aware comparer.)

How can any analyzer help for existing projects?

Analyzers scan the code and detect the calls of collations APIs that don't pass flags would help to look at such calls and fix if detected have a problem.

I guess analyzers must be added to the csproj.
This will not happen automatically. So, a lot of existing projects will be moved to .NET 5 without those analyzers.
Plus, as already mentioned, it will not help for F# projects.

@jeffhandley thank you, that confirms what I already understood. So, it's important to acknowledge that the possible "workaround" will not help F# users (a small market but one that is nonetheless fully supported by MS as a first-class citizen of .NET). I have no idea what this means:

The change is inside the .NET runtime which should be global and not restricted to C#.

I have no idea what this means:
The change is inside the .NET runtime which should be global and not restricted to C#.

I meant any language using .NET runtime will be affected and not only C#.

I am going to strongly voice my opinion again that out of the box there should be an analyzer to uncover these gotchas in the new world, especially if the plan is to change current behavior.

I completely understand the technical merits of doing so and am not suggesting that you not make a change (long term it seems like this is the correct move). Nor am I saying that it was not already documented as best practice. What I am saying is that we really need this to be a big, red, flashing error to developers attempting to lift to .NET 5. Out of the box Developers are going to assume this "just works" otherwise.

Right now today you can use this Roslyn Analyzer Library from @meziantou to find areas that are affected: https://github.com/meziantou/Meziantou.Analyzer/tree/master/docs.

In this particular case this will throw a MA0074 - Avoid implicit culture-sensitive methods

image

That being said this really needs to be out of the box, I have opened up this Roslyn issue here: https://github.com/dotnet/roslyn-analyzers/issues/4367

@tarekgh Thanks for clarifying. My original point was that analysers are not the answer here if you are looking for a solution that works for all .NET users.

We are looking at more ways to help the developers to be conscious when calling something like IndexOf and be intentionally provide StringComparison flags to express the intention. We welcome any ideas you may come up with too.

How about deprecating the old methods (using an [Obsolete] attribute)?

What happens to .NET Standard compatibility when the API surface is the same but behaviour changes?

A change to honor grapheme clusters doesn't bother me, as long as the impact is well documented. A change that causes closely related members of the family of string methods to behave inconsistently from one another does.

Someone who is doing informal string munging, indifferent to the obscurities of characters, grapheme clusters, or locale, would take it as given that if str.Contains(whatever) succeeds, there is no need to inspect the result from str.IndexOf(whatever) because we were just told it is in there and therefore can be found. It doesn't matter which second parameter they don't care to know about is the default because defaulting is sure to behave the same across methods, freeing them from needing to study all the subtleties to use them at all.

Inconsistencies like this produce a language that can only be used successfully by experts and alienate the folks coming out of code camp. Don't raise the bar to entry this way.

I agree with @lupestro. The inconsistence in method behavior is very concerning. If you are going to have long standing methods act both differently and inconsistently there will be much sadness. People will run into this, they will feel betrayed by the API and then they will wonder what other timebombs are waiting to explode. Many would be .NET adoptees will write C# off over this sort of issue. It seems you should either, remove overloads that don't accept a locale, or normalize the default locale for methods. There is already a Compare and CompareOrdinal, perhaps a (Last)IndexOf(Any) and (Last)IndexOf(Any)Ordinal is needed. I don't like that solution but at least it would be consistent with what currently exists. Perhaps the idea of ordinal usage in string should be depreciated. If I have to choose between fast or right I'll choose 'right' every time. Inconsistent, non-intuitive behaviors are extremely frustrating.

I see this issue is already closed so I suppose it's already decided that this will move forward in .NET 5.0. I know this stuff is hard and Culture (like Time) info can change for all sorts of non-technical reasons. Developers need to be aware of this, but they also need to depend on their API's to be self consistent. There should at least be a warning (in 5.0) as pointed out by @aolszowka that indicates a problem ... and more importantly why it's a problem. Moving forward is important and sometimes that means you have to "break" old behaviors/assumptions. That does not imply that new inconsistencies need to be introduced. This change breaks expectations as well as code. If it is not possible to make the methods consistent then I'd much rather the CultureInfo be forced to be explicit (which I could then address via an extension method) than just have the possibility of a non-intuitive error blowing up my code during a stressful development cycle or worst at a customer installation.

TLDR: Change things that need changing but don't make the API's inconsistent to do so. If you are going to break it replace it with something better.

I’ve been using .NET for years and always default to InvariantCulture when using these functions. With regard to languages other than English, I’ve always been aware of character pairs that work as aliases for other language-specific letters, and the extra work that goes into checking for these pairs when doing comparisons with CurrentCulture as the default. This has bitten me, for example, when writing ASP.NET code that sets the CurrentCulture of the thread based on the user’s preferred language sent by the browser, and the comparisons for a user not using English makes the code break in subtle ways, especially when strings contain a mix of user-entered (linguistic) and ordinal text.

This Unicode grapheme cluster behavior is a new one on me, as I would have expected ordinal behavior for line feeds and carriage returns, even if I had used Invariant overloads as I typically do. It seems to me that the behavior that both does more work and requires expert knowledge should be opt-in behavior, regardless of technical righteousness. Perhaps that ship sailed long ago, but _this breaking change_ should be made as transparently as, for example, language changes, without having to read an obscure blog. My colleagues are barely aware of new C# 9.0 features, let alone some arcane ICU rule that may impact the code they write today which may someday be ported to and compiled in .NET 5 (or, more likely, .NET 6 or 7).

Obsoleting the old methods is one thing we're considering. I finished a draft of the proposal last night and am shopping it around for internal review, and I'll post it as a new issue here in a few hours.

The draft is posted at https://github.com/dotnet/runtime/issues/43956. It lists several alternatives for possible paths forward (including doing nothing) and weighs the pros and cons of each approach. Please feel free to leave any feedback on proposed courses of action there.

If you're reporting a bug, please file a new issue and use that new issue to describe the bug.

If you have comments on this particular issue ("\r\n" vs. "\n"), please keep responding in this thread. Thanks!

Reopening while we consider approaches described in @GrabYourPitchforks 's document.

We do hear you -- there's a lot of clear feedback here -- we're working to figure out the right way forward, and we'll keep this issue updated.

A change to honor grapheme clusters doesn't bother me, as long as the impact is well documented. A change that causes closely related members of the family of string methods to behave inconsistently from one another does.

Someone who is doing informal string munging, indifferent to the obscurities of characters, grapheme clusters, or locale, would take it as given that if str.Contains(whatever) succeeds, there is no need to inspect the result from str.IndexOf(whatever) because we were just told it is in there and therefore can be found. It doesn't matter which second parameter they don't care to know about is the default because defaulting is sure to behave the _same across methods_, freeing them from needing to study all the subtleties to use them at all.

Inconsistencies like this produce a language that can only be used successfully by experts and alienate the folks coming out of code camp. Don't raise the bar to entry this way.

Yes, this totally expressed my concerns. As a typical Chinese developer, we rarely put StringComparison or CultureInfo explicitly when calling string related methods in our application codes, and it just works. We defintely don't expect a different behaviour between IndexOf and Contains!
.net 5.0
image
.net core 3.1
image
.net framework
image

I agree with @lupestro. The inconsistence in method behavior is very concerning. If you are going to have long standing methods act both differently and inconsistently there will be much sadness.

Perhaps a key point here is that the two methods have always been inconsistent. They did not suddenly become inconsistent in .NET 5.0. If I'm following things correctly, IndexOf has always used current culture comparison, Contains has always used ordinal comparison. Granted .NET 5.0 adds more inconsistency. But the mistake here was in the original API design that allowed this inconsistency.

If I'm following things correctly, IndexOf has always used ordinal comparison, Contains has always used current culture comparison. Granted .NET 5.0 adds more inconsistency. But the mistake here was in the original API design that allowed this inconsistency.

That is correct but it is the other way around, IndexOf(string) uses current culture, IndexOf(char) uses Ordinal and Contains uses ordinal.

I'll elaborate briefly on the IndexOf vs. Contains differences that others alluded to recently.

IndexOf(string) has always assumed _CurrentCulture_ comparison, and Contains(string) has always assumed _Ordinal_ comparison. This discrepancy existed way back in .NET Framework. It is not a new discrepancy introduced in .NET 5. For example, on .NET Framework (which uses Windows's NLS facility), the ligature "æ" and the two-char string "ae" compare as equal under a linguistic comparer. This results in the following discrepancy:

// Sample on .NET Framework, showing Contains & IndexOf returning inconsistent results
Console.WriteLine("encyclopædia".Contains("ae")); // prints 'False'
Console.WriteLine("encyclopædie".IndexOf("ae")); // prints '8' (my machine is set to en-US)

This discrepancy has existed for over a decade. It's documented and there is guidance regarding it. Is it a misdesign? Perhaps. We're discussing over at https://github.com/dotnet/runtime/issues/43956 ways to make the ecosystem healthier moving forward.

For this specific issue, what we're really asking ourselves is: "Given that the discrepancy has existed forever, how far apart are these two methods allowed to deviate _in practice_ before it harms the larger ecosystem?" I think we're still trying to define where that threshold should be. Reports like this are _extremely_ helpful because they give us insight into people use these APIs in practice and what expectations customers have regarding their behavior. As stewards of the framework, we need to consider not just the technical documentation, but also the manner in which real-world apps consume these APIs.

This comment isn't really intended to defend any particular point of view. My intent is to clarify some misconceptions that I've seen and to help explain how we've framed the problem.

Even if InvariantCulture specified, is it correct behavior that \n doesn't match in the ICU version?

image
image

Maybe, the following code displays 5 on Linux and -1 on Windows if we use git and its default settings (autocrlf=true)?

using System;

var s = @"hello
world";
Console.WriteLine(s.IndexOf("\n", StringComparison.InvariantCulture));

@ufcpp Yes, per ICU that is the expected behavior, as discussed earlier in this thread. The two chars <CR><LF> when they occur adjacent to one another are considered an unbreakable unit for linguistic purposes. Searching for <LF> using a linguistic comparer (such as _InvariantCulture_) will produce no match since it would split this unbreakable unit.

I'd like to share an update with everyone: We've decided to keep ICU as the default for 5.0 GA. We'll look at improving the pitfall of accidentally using linguistic comparison when ordinal was intended, that's tracked in https://github.com/dotnet/runtime/issues/43956. We'll also reach out to some of the libraries we've identified as impacted by this issue. In the days to come we'll share more docs to accompany the upcoming 5.0 release that will help people better identify problematic code and fix it to avoid this issue.

This was a very difficult decision to make and we had to weigh the compatibility impact to the ecosystem against drive for standardization across platforms.

We will leave this issue open to consider further mitigation to the \r\n issue in servicing if it's determined that the current mitigations are insufficient.

I'd also like to encourage folks to continue to report issues that are encountered as a result of the ICU switch, even though the behavior might be of a known cause, that doesn't mean it is "by design". We will continue to investigate differences and understand the root cause and determine if we need to drive changes into ICU or .NET to address them.

There is an analyzer rule for always specifying StringComparison.

https://docs.microsoft.com/en-us/dotnet/fundamentals/code-analysis/quality-rules/ca1307

I'd like to share an update with everyone: We've decided to keep ICU as the default for 5.0 GA. We'll look at improving the pitfall of accidentally using linguistic comparison when ordinal was intended, that's tracked in #43956. We'll also reach out to some of the libraries we've identified as impacted by this issue. In the days to come we'll share more docs to accompany the upcoming 5.0 release that will help people better identify problematic code and fix it to avoid this issue.

This was a very difficult decision to make and we had to weigh the compatibility impact to the ecosystem against drive for standardization across platforms.

We will leave this issue open to consider further mitigation to the \r\n issue in servicing if it's determined that the current mitigations are insufficient.

I'd also like to encourage folks to continue to report issues that are encountered as a result of the ICU switch, even though the behavior might be of a known cause, that doesn't mean it is "by design". We will continue to investigate differences and understand the root cause and determine if we need to drive changes into ICU or .NET to address them.

I’d like to know if I can still use .NET Standard dependencies I have no control over.

I suspect I won’t be able to. How about making sure people transition to, I don’t know, .NET Standard 2.2? Change the API?

So that I can know for a fact that I get the expected behaviour.

I’m not sure how else we can avoid breaking the current ecosystem, which has its fair share of issues already.

I’d be glad to be proved wrong, please do so with prejudice - it would ease my mind :)

@rcollina

I’d like to know if I can still use .NET Standard dependencies I have no control over.

.Net Standard libraries are generally supposed to be cross-platform. And if some library works correctly on .Net Core 3 on Unix today (which uses ICU), then it will almost certainly also work on .Net 5 on Windows (which also uses ICU).

.NET Standard 2.2

.Net Standard vNext effectively exists, though it's called ".Net 5.0". (I.e. if you're writing a library and don't care about supporting older frameworks, today, you would target .Net Standard 2.1. In a month, you would target .Net 5.0 instead.)

@svick I get it. I believe I already understand how .NET Standard works. I understand .NET 5 is the new .NET Standard, as it were.

I apologize but I’m still unsure about what happens when I reference a .NET Standard 2.1 library that relied on a pre-existing behaviour inconsistency between IndexOf and Contains.

What’s changing here is something out-of-band, ICU versus NLS. This change widens the discrepancy we already had, and breaks expectations.
There’s nothing encoding this info into libs.

I don’t presume to understand all the implications, but I can’t shake off the feeling we’re being “technically correct” without a seamless onboarding path to the new normal. Which is sorely needed.

As someone else mentioned, most people aren’t even aware of new language features, let alone seismic changes such as this one discovered randomly by our best community members. What are the chances we can withstand this?

I’d like to know if I can still use .NET Standard dependencies I have no control over.

What’s changing here is something out-of-band, ICU versus NLS. This change widens the discrepancy we already had, and breaks expectations.

No, it doesn't. It can't be stressed enough here that ICU has _always_ been used on Unix. .NET Standard libraries are supposed to be portable by design and anything that previously worked on Linux .NET Core 3.x will work in .NET 5.

Most of the work we do is InvariantCulture, which per my previous understanding is supposed to be immutable by design.

Not true. InvariantCulture is only supposed to ignore the user's language setting. It can still change with updates to the unicode spec or to globalization libraries, etc.

Did .NET team consider the pain and problems and cost this could cause?

Remarks like this irk me to no end. Any code that suddenly breaks with this change was incorrect to begin with. How is the platform supposed to move forward if the .NET team have to retain the behaviour of user code that is incorrect or relies on implementation details? It's not like they didn't provide a compatibility switch. A big part of the reason .NET Core spun off from .NET Framework was to solve this problem through features like side-by-side installations and app-local runtime deployments. If you can't move to .NET 5 because of this, then don't move to .NET 5.

I'm just not excited at the prospect of .NET 5 introducing a new crop of unknown unknowns and revisiting those fixes for not just my libraries, but our downstream dependencies too.

If you've squashed all your platform-difference bugs as you claim, then you should have nothing to worry about.

Did .NET team consider the pain and problems and cost this could cause?

Remarks like this irk me to no end.

Millions of projects are running on .NET and string manipulation is a very frequent operation.
The effort that it takes for .NET team to fix/change this is minuscule comparing to effort needed to verify all existing code that will be migrated to .NET 5/6.

So, it's quite fair to ask about the "plan" to address this.
Is there any plan?
Was the effect of this change estimated?
Is it 0.001% of all projects? Is it 75%?
What are other similar changes that we do not know about?

Maybe this affects just a small number of projects.
But, was it estimated?

BTW, I'm all for breaking changes given good enough reason. But, we also need a migration path that is not too risky.

@petarrepac Don't get me wrong, I understand that. But as has been pointed out multiple times in this thread:

  1. There is a plan, and it is to provide a runtime configuration switch.
  2. The behaviour claimed to be breaking is the existing behaviour of .NET on all non-windows platforms.
  3. This should only effect code that performs culture-sensitive operations where ordinal was intended.

Given the last two points it is probably reasonable to assume this affects a fairly small percentage of projects.

100% it is fair to ask about this, but the people who write comments like the one I quoted are often just assuming that no consideration was put in and written before trying to understand the bigger picture behind the change.

Hello, all. We wanted to give a brief summary of the actions that we took when this issue was open and at the end why we decided to keep the default on Windows 10 May 2019 Update or later to be ICU for .NET 5.0.

When the issue was opened, we started some internal discussions about the potential impact and pain that this could've had with our customers given the inconsistency in between Contains(string) which is Ordinal and IndexOf(string) being culture-aware, plus having other APIs being culture-aware by default when operating over a string, but being Ordinal when operating over Span<char> or ReadOnlySpan<char>. So after discussions over this issue, we started doing analysis on NuGet packages that might be affected and gathered data to get a clear picture. This were our findings:

  • "\n" is #​30 on the "most-used string literals" passed to the IndexOf, LastIndexOf, EndsWith, StartsWith, Compare, and CompareTo.
  • 1% of callsites to String.EndsWith are with a search string that starts with \n. Any of these callsites where the string being tested contains "windows-style line endings" would be broken.
  • There are 2040 package ids hosted on NuGet.org where a version contains an assembly with an at-risk callsite.
  • The vast majority of these callsites are to EndsWith and IndexOf, which is consistent with the hypothesis that this pattern is used often for naive line-breaking algorithms.

From these 2040 packages that had an at-risk callsite, only 539 support some version of .NET Standard or .NET Core, so that means that only 0.54% packages listed in NuGet.org are likely to being exposed to the break.

We looked at packages in the list of 539 potentially-affected package ids to get an idea of the actual impact on those. We looked at the top 70 (by download counts), 20 didn't expose the pattern in the latest version, from the ones that exposed the pattern we could only look at 32 that had a permisive license:

  • 14 were not subject to the bug
  • 13 were potentially broken
  • 5 were uncertain (there was no clear indication of breaks due to defensive coding for diverse line break patterns or other mitigations).

So this means, that from the top 70 packages by download only 18% were potentially impacted.

These percentages are stacked and not against the total number of packages on NuGet.org which is 229,536. So if we used the total number of packages and total number of downloads in NuGet.org, we would be looking at 539 potentially affected packages out of 229,536 which is 0.24%.

And while it's great for us to analyze libraries, nuget represents only a small fraction of the C# code out there. And even if someone owns the code, a) bugs may not be easy to track down, and b) they may not actually have the source anymore.

However this was a good source of data to conclude that, while this could be a very notable change in behavior, it is a break that already happened on Unix when reading inputs that might contain Windows Line Endings (which might not be that common).

In .NET Core and .NET 5+, we're striving towards consistency across OSs and given the impact of this change, it felt like the right thing to do. We do care about compatibility and hence are providing a compat runtime switch for people to be able to go back to legacy behavior.

Also, a conclusion from the packages that we could inspect, given that the behavior is different on Unix already, we did see a lot of defensive programming against this issue, to mitigate potential breaks across OSs.

To add to this, globalization can change any time as we're a thin wrapper across the OS, so it felt like the right thing at the moment to just be the same wrapper on all OSs that we support.

As part of this we improved our documentation with practical examples, roslyn analyzer rules and how affected code can mitigate this.

Thank you for all your valuable feedback since it always takes us to a better place and we will try to keep improving this experience for .NET 6, as discussed on: https://github.com/dotnet/runtime/issues/43956

Since we understand the pain that this may cause because of the differences in between line endings across Unix and Windows, we're keeping this issue open and we will investigate a possible way to mitigate the \r\n case for .NET 5.0.x which should be part of a servicing release.

There is also a difference with char and string:

Console.WriteLine("com/test/test/test/a/b/Ęą$Ęą".IndexOf("$"));
Console.WriteLine("com/test/test/test/a/b/Ęą$Ęą".IndexOf('$'));
-1
24

@mattleibow when using character search, it perform ordinal search. The doc https://docs.microsoft.com/en-us/dotnet/api/system.string.indexof?view=net-5.0#System_String_IndexOf_System_Char_ which is stating This method performs an ordinal (culture-insensitive) search. If you do the string search with the ordinal option, you'll get the same result as the character.

C# Console.WriteLine("com/test/test/test/a/b/Ęą$Ęą".IndexOf("$", StringComparison.Ordinal));

~It seems that CA1307 only triggers on indexof(char) but not indexof(string). Is there a rule for the string version? This seems more important as the default of char automatically uses ordinal and the breaking change doesn't really affect this, while the behavior of string has changed and you need to specify ordinal to restore behavior in some cases.~

~How do you detect indexof(string)?~

Found it, it's rule CA1310. Our docs are wrong for https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu#use-nls-instead-of-icu and don't mention this specific variation. I'll update those docs.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

aggieben picture aggieben  Â·  3Comments

EgorBo picture EgorBo  Â·  3Comments

omariom picture omariom  Â·  3Comments

chunseoklee picture chunseoklee  Â·  3Comments

jzabroski picture jzabroski  Â·  3Comments