Runtime: Char.GetUnicodeCategory returns wrong category for certain Latin-1 characters

Created on 28 Aug 2018  路  10Comments  路  Source: dotnet/runtime

In a nutshell, there are certain characters where CharUnicodeInfo.GetUnicodeCategory returns the correct value, but Char.GetUnicodeCategory returns the wrong value. One such character is U+00B6 PILCROW SIGN, where CharUnicodeInfo returns _OtherPunctuation_ (which is correct) and where Char returns _OtherSymbol_ (which is incorrect). This also affects the behavior of dependent methods like Char.IsPunctuation and Char.IsLower.

MSDN says this behavior is intentional to preserve back-compat, but it is extraordinarily confusing to have two methods with the same name have different behavior.

One solution would be to update Char.GetUnicodeCategory to stay in sync with CharUnicodeInfo.GetUnicodeCategory. This is a breaking change, but it's the type of breaking change that is normally allowed in side-by-side major version updates.

An alternative is to mark Char.GetUnicodeCategory, Char.IsPunctuation, Char.IsLower, etc. as obsolete and to direct users to call into CharUnicodeInfo instead. This preserves existing behavior and provides a migration story to get developers on to the APIs which provide correct results.

area-System.Globalization breaking-change enhancement

Most helpful comment

I've run into this difference recently as well. It is quite confusing that the two methods return different results.

We're working on a major release, .NET 5. It is side-by-side. We are not and do not want to be bug-for-bug compatible. Why not just fix the methods to return the same results? Who are we helping by continuing to ship the discrepancy?

All 10 comments

update Char.GetUnicodeCategory to stay in sync with CharUnicodeInfo.GetUnicodeCategory

+1

Note that other things are also frozen to an old (pre-4.0) Unicode version for compatibility, like Regex:

Regex.IsMatch("\u00b6", @"\p{So}") // pilcrow
Regex.IsMatch("\u00ad", @"\p{Pd}") // soft-hyphen

We were avoiding updating the first 127 characters Unicode categories for compatibility reason because this can break things. And we always recommend using CharUnicodeInfo in general for that.

CharUnicodeInfo is not enough for the scenarios want to get the correct category?

And we always recommend using CharUnicodeInfo in general for that.

@tarekgh How would our developer audience know that? Yes, it says so on MSDN, but how would they even know that they'd need to consult MSDN on this particular issue? Hell, not even I knew this, and _I'm a partial owner of this feature area!_

If this API is only here for compat and all new code should be using the new API, this is the exact scenario that [Obsolete] was meant for. Otherwise we're just encouraging our developer audience to continue along writing new code against old, incorrect APIs.

@GrabYourPitchforks the point is we have the same thing in the full framework and the compatibility bar is very high there which I believe we cannot change that there. changing the behavior in .Net Core can be potentially introducing different behavior when the same code runs on the full framework. At the same time in my experience, we didn't get complains about that. I am not sure we are benefiting much of changing that now. anyone runs into issues because of that, can easily be corrected. I am not seeing a strong reason we need to change that now.

The Unicode categories evolve over time, like all other globalization data. I do not think we want to be freezing the globalization data in .NET Core. The globalization data in .NET Core should be always current. Yes, it will make the behavior different from .NET Framework and that is by design, I think.

The compatibility constrains of .NET Core are different from compatibility constrains of .NET Framework.

At the same time in my experience, we didn't get complains about that. I am not sure we are benefiting much of changing that now. anyone runs into issues because of that, can easily be corrected.

/me raises hand.

I ran into issues. I am complaining. I am making noise. I am proposing two solutions, one of which perfectly preserves compatibility while making sure that other customers don't run into the same issues I did.

The Unicode categories evolve over time, like all other globalization data. I do not think we want to be freezing the globalization data in .NET Core. The globalization data in .NET Core should be always current. Yes, it will make the behavior different from .NET Framework and that is by design, I think. The compatibility constrains of .NET Core are different from compatibility constrains of .NET Framework.

Just to clarify, we are talking here about a few characters in the ASCII range here which will never evolve again. we are not talking about the whole Unicode characters. We already update the categories for the rest of the Unicode characters. in another word CharUnicodeInfo.GetUnicodeCategory and Char properties are already identical.

This discrepancy in the ASCII range was introduced for a good reason too in the full framework.

I ran into issues. I am complaining. I am making noise. I am proposing two solutions, one of which perfectly preserves compatibility while making sure that other customers don't run into the same issues I did.

good and you know how to fix it now :-)

anyway, I'll not push back more than that. I am still not fully convinced we need to do that but if everyone else see we need to be aggressive changing that, I'll not resist more.

I've run into this difference recently as well. It is quite confusing that the two methods return different results.

We're working on a major release, .NET 5. It is side-by-side. We are not and do not want to be bug-for-bug compatible. Why not just fix the methods to return the same results? Who are we helping by continuing to ship the discrepancy?

Moving it to 5.0 release as this will be the chance to have such breaking change.

Was this page helpful?
0 / 5 - 0 ratings