Runtime: unable to parse '-1' string in Norwegian

Created on 18 Jan 2019 · 20 Comments · Source: dotnet/runtime

using System;
namespace wtfdot
{
    class Program
    {
        static void Main(string[] args)
        {
            System.Threading.Thread.CurrentThread.CurrentCulture = new System.Globalization.CultureInfo("nb-NO");
            int.Parse("-1");
        }
    }
}

produces

Unhandled Exception: System.FormatException: Input string was not in a correct format.
at System.Number.StringToNumber(ReadOnlySpan`1 str, NumberStyles options, NumberBuffer& number, NumberFormatInfo info, Boolean parseDecimal)
at System.Number.ParseInt32(ReadOnlySpan`1 s, NumberStyles style, NumberFormatInfo info)
at System.Int32.Parse(String s)

dotnet --version

2.2.103

lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic

All 20 comments

The negative sign is \u2212 for some reason, and they probably get this info from the current OS.

Is it possible to parse both - and \u2212 as the minus sign? I'm pretty sure Norwegians type "-" on the keyboard.

> The negative sign is \u2212 for some reason, and they probably get this info from the current OS.

Yes, we check for NumberFormatInfo.NegativeSign.
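To make that check concrete, here is a minimal sketch. It assumes a hand-built NumberFormatInfo whose NegativeSign is U+2212, standing in for the nb-NO data reported on Ubuntu in this thread; the class and method names are hypothetical. The result for the hyphen input is deliberately not stated, since it is the behavior under discussion and may vary by runtime version.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class NegativeSignCheck
{
    // Assumption for illustration: a format whose negative sign is U+2212 MINUS SIGN,
    // built by cloning the invariant format instead of loading OS culture data.
    public static NumberFormatInfo MinusSignFormat()
    {
        var nfi = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        nfi.NegativeSign = "\u2212";
        return nfi;
    }

    // Renders each UTF-16 code unit of a string as U+XXXX so invisible or
    // look-alike characters are easy to tell apart.
    public static string CodePoints(string s) =>
        string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));

    public static void Main()
    {
        NumberFormatInfo nfi = MinusSignFormat();
        Console.WriteLine("NegativeSign: " + CodePoints(nfi.NegativeSign));

        // Input that uses the same sign as the format.
        bool minusOk = int.TryParse("\u22121", NumberStyles.Integer, nfi, out _);
        Console.WriteLine("U+2212 input parsed: " + minusOk);

        // Input that uses the keyboard hyphen (U+002D).
        bool hyphenOk = int.TryParse("-1", NumberStyles.Integer, nfi, out _);
        Console.WriteLine("U+002D input parsed: " + hyphenOk);
    }
}
```

Swapping `MinusSignFormat()` for `new CultureInfo("nb-NO").NumberFormat` shows what the current OS actually reports for that culture.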

> I'm pretty sure Norwegians use "-" from keyboard

At least from looking at our docs, it does look like the Norwegian keyboard layout uses U+002D (Hyphen-Minus): https://docs.microsoft.com/en-us/globalization/windows-keyboard-layouts#N

It seems we are operating by design - we expect to be running int.Parse in the same culture that the input was formatted in.

We don't in general know all the values that NumberFormatInfo.NegativeSign may take today and in the future on various cultures so it seems like we could not reasonably be more tolerant.

@veonua is it possible to set the thread culture to match the culture the number was formatted in? If it might have either nb-NO or (say) en-US negative signs, you might have to use "TryParse" and fall back from one to the other if the first fails.
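The fallback suggested above could be sketched roughly as follows. `FallbackParser` and `TryParseInt` are hypothetical names, not framework APIs; the helper simply tries each provider in the order given and stops at the first success.

```csharp
using System;
using System.Globalization;

public static class FallbackParser
{
    // Tries each format provider in order and returns the first successful parse.
    public static bool TryParseInt(string text, out int value, params IFormatProvider[] providers)
    {
        foreach (IFormatProvider provider in providers)
        {
            if (int.TryParse(text, NumberStyles.Integer, provider, out value))
                return true;
        }

        value = 0;
        return false;
    }

    public static void Main()
    {
        string input = Console.ReadLine() ?? string.Empty;

        // Current culture first (for example nb-NO), then the invariant culture.
        if (TryParseInt(input, out int number, CultureInfo.CurrentCulture, CultureInfo.InvariantCulture))
            Console.WriteLine("Parsed: " + number.ToString(CultureInfo.InvariantCulture));
        else
            Console.WriteLine("Not a number in any of the tried cultures.");
    }
}
```

The order of providers matters: a string such as "1,5" or "1.000" can be valid in more than one culture with different meanings, so the first provider should be the one the input most likely came from.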

@danmosemsft If what 99 % of users of a culture type can't be parsed correctly, then I think it makes int.Parse nigh useless for that culture and I would consider that a bug.

The right fix might be to change CLDR data for the Norwegian culture, but I think the .Net team is in a better position to attempt to make that change, than a random person.

Yes, I think we should confirm that the actual behavior here matches what the docs are saying (i.e. that the Norwegian keyboard uses U+002D (Hyphen-Minus), but that the culture info uses U+2212 (Minus Sign)). If that is the case, I think it is a usability bug.

So basically ALL client apps _must_ have two (or maybe more) TryParse calls. Do you think application developers have more knowledge and skills to handle situations like this?

This should either be fixed in the framework or be written in every book and documentation.

My expectation is that the framework would isolate me from platform- and culture-specific issues.

> My expectation that Framework would isolate me from platform & culture specific issues

That's what CultureInfo.InvariantCulture is for. Pass that to your Parse call as the provider, or set it as the current culture, and regardless of locale you'll get the same parsing behavior.
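A minimal sketch of passing the invariant culture as the provider. The class and method names are hypothetical, and the current culture is replaced by a hand-built clone with a U+2212 negative sign purely to show that it does not influence the result.

```csharp
using System;
using System.Globalization;

public static class InvariantParsing
{
    // Parses with the invariant culture, independent of the thread's current culture.
    public static int ParseInvariant(string text) =>
        int.Parse(text, NumberStyles.Integer, CultureInfo.InvariantCulture);

    // Formats with the invariant culture so the text can always be parsed back
    // by ParseInvariant.
    public static string FormatInvariant(int value) =>
        value.ToString(CultureInfo.InvariantCulture);

    public static void Main()
    {
        // Assumption for illustration: a current culture whose negative sign is U+2212.
        var culture = (CultureInfo)CultureInfo.InvariantCulture.Clone();
        culture.NumberFormat.NegativeSign = "\u2212";
        CultureInfo.CurrentCulture = culture;

        int value = ParseInvariant("-1");
        Console.WriteLine(FormatInvariant(value)); // prints "-1"
    }
}
```

This fits machine-to-machine data such as config files, logs, and wire formats. It is not a substitute for culture-aware parsing of text that users type.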

> The right fix might be to change CLDR data for the Norwegian culture

Yes. I don't believe we should be second-guessing the data from ICU / the OS. If there's an issue with that data, we should work with the provider of it to fix it.

cc: @tarekgh, @krwq

> That's what CultureInfo.InvariantCulture is for. Pass that to your Parse call as the provider, or set it as the current culture, and regardless of locale you'll get the same parsing behavior.

So I have to keep in mind which culture to use for every parse? Even if my whole application uses nb-NO, int parsing must use CultureInfo.InvariantCulture?

And the minimal code I have to provide is:

// Requires: using System; using System.Globalization;
var str = Console.ReadLine();
// Try the current culture first, then fall back to the invariant culture.
if (int.TryParse(str, out int i))
{
    Console.WriteLine("it is " + i);
}
else if (int.TryParse(str, NumberStyles.Any, CultureInfo.InvariantCulture, out i))
{
    Console.WriteLine("You are Norwegian!");
}

There are many differences (not just NegativeSign) between Windows and Linux, see other issues here and in related .NET Core repos.

This prints nothing on Windows but prints a lot of text on Linux:

foreach (var c in CultureInfo.GetCultures(CultureTypes.AllCultures))
{
    if (c.NumberFormat.NegativeSign != "-")
        Console.WriteLine(c);
}

@0xd4d

I had a look at the strings used for NegativeSign on my Ubuntu 18.04. What I found:

| string | example culture |
|---|---|
| U+002D HYPHEN-MINUS | en English |
| U+061C ARABIC LETTER MARK, U+002D HYPHEN-MINUS | ar Arabic |
| U+200E LEFT-TO-RIGHT MARK, U+002D HYPHEN-MINUS | he Hebrew |
| U+200E LEFT-TO-RIGHT MARK, U+002D HYPHEN-MINUS, U+200E LEFT-TO-RIGHT MARK | ps Pashto |
| U+200F RIGHT-TO-LEFT MARK, U+002D HYPHEN-MINUS | ckb Central Kurdish |
| U+2212 MINUS SIGN | nb Norwegian Bokmål |
| U+200E LEFT-TO-RIGHT MARK, U+2212 MINUS SIGN | fa Persian |

So, all cultures use either U+002D HYPHEN-MINUS or U+2212 MINUS SIGN, though some surround them with additional marks, to make them display properly in that language. I haven't tested what effect those have on int.Parse.
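One way to test the effect on int.Parse would be a round-trip check: format -1 with a culture, then parse the result with the same culture. This is only a sketch with hypothetical names, and the list it prints depends on the culture data installed on the machine.

```csharp
using System;
using System.Globalization;

public static class RoundTripCheck
{
    // Returns true when a value formatted with a culture parses back to the
    // same value using that same culture.
    public static bool RoundTrips(CultureInfo culture, int value)
    {
        string text = value.ToString(culture);
        return int.TryParse(text, NumberStyles.Integer, culture, out int parsed)
            && parsed == value;
    }

    public static void Main()
    {
        foreach (CultureInfo culture in CultureInfo.GetCultures(CultureTypes.AllCultures))
        {
            if (!RoundTrips(culture, -1))
                Console.WriteLine(culture.Name + ": -1 does not round-trip");
        }
    }
}
```

A round trip only checks that the parser accepts what the formatter emits. It says nothing about whether the keyboard hyphen is accepted for that culture.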

@veonua to summarize this thread:

  • If the nb-NO culture has a negative sign other than the hyphen (0x002D), there is presumably a good reason for it. If you think this is wrong, you can raise your concern with CLDR to get it fixed.
  • If the culture didn't choose the hyphen (0x002D) as its negative sign, then we cannot assume it is one. The hyphen (0x002D) may have another meaning in that culture that we cannot know about. Imagine a culture that decides to use the hyphen (0x002D) as its thousands separator, for instance.
  • The best practice to guarantee parsing is to format the number with the invariant culture and then parse it with the invariant culture. If you don't have control over the source of the string you are parsing, then Invariant is still the best guess. You cannot magically make a number formatted with one culture parseable with another. Imagine you format a number with a thousands separator in an Arabic culture and then parse it using a Spanish culture: this is expected to fail. The parser cannot make assumptions that cover every culture's data, especially since any sign character can mean something else in a different culture.

With that, I am closing this issue, but feel free to reply if you have any more questions. Thanks.

@tarekgh

> If you think this is wrong, you can raise your concern to CLDR to fix that.

Like I said above, I think this is a bug in .Net. And I think fixing those should be a responsibility of the .Net team, even if the ultimate source of the bug is in a third-party dependency.

> If you don't have control over the source of the string you are parsing, then Invariant will still be the best guess to use. You cannot just magically make any number formatted with some culture parseable with another culture.

The invariant culture is not a good option for parsing strings from users. And it seems like you're saying that it's fine if using the actual culture also doesn't work correctly for that. This is not about having a string that uses one culture and parsing it with another culture. It's about parsing strings from Norwegian users using the Norwegian culture not working.

In my opinion, int.Parse should correctly parse what a regular user of a given culture is likely to type. It's not good enough if it's only supposed to parse the result of int.ToString().

@svick, I believe @tarekgh's point was that this is not something that .NET should fix or can reasonably workaround.

The issue (if there is one) is likely in the CLDR metadata published by the Unicode Consortium and a bug needs to be filed there: http://cldr.unicode.org/index/bug-reports.

@svick do you agree, that we cannot generally parse numbers correctly without knowing the culture? For example, 100,123 is a much smaller number in fr-FR than in en-US?
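The 100,123 example can be sketched without relying on installed culture data. `FrenchLike` below is a hypothetical stand-in that assumes a comma decimal separator and a no-break-space group separator. The invariant culture plays the en-US role, since it uses "," for grouping and "." for decimals.

```csharp
using System;
using System.Globalization;

public static class SeparatorDemo
{
    // Assumption for illustration: fr-FR-style separators, built by hand.
    public static NumberFormatInfo FrenchLike()
    {
        var nfi = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        nfi.NumberGroupSeparator = "\u00A0";
        nfi.NumberDecimalSeparator = ",";
        return nfi;
    }

    public static decimal Parse(string text, IFormatProvider provider) =>
        decimal.Parse(text, NumberStyles.Number, provider);

    public static void Main()
    {
        const string input = "100,123";

        // Comma is the decimal separator: one hundred point one two three.
        Console.WriteLine(Parse(input, FrenchLike()).ToString(CultureInfo.InvariantCulture)); // prints "100.123"

        // Comma is the group separator: one hundred thousand one hundred twenty-three.
        Console.WriteLine(Parse(input, CultureInfo.InvariantCulture).ToString(CultureInfo.InvariantCulture)); // prints "100123"
    }
}
```

The same text yields two valid but very different numbers, which is why the parser needs to know the culture the text was written in.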

In which case I think it comes down to a possible bug in the culture data, which is coming from CLDR. We do not want to get in the business of defining our own culture data as it is complex and ever changing.

@danmosemsft

> do you agree, that we cannot generally parse numbers correctly without knowing the culture? For example, 100,123 is a much smaller number in fr-FR than in en-US?

Of course.

> In which case I think it comes down to a possible bug in the culture data, which is coming from CLDR. We do not want to get in the business of defining our own culture data as it is complex and ever changing.

I understand that. My problem is that what I see here is:

> This is not a bug in our code, so we're going to close this issue and we will leave it to someone else to fix CLDR.

The attitude I would like to see is:

> This problem is affecting our customers, so we will keep this issue open and we will work with the maintainers of CLDR on resolving it.

@svick - ah, I see. We ask folks to report directly to CLDR because we do not have expertise in the culture in which the issue is being reported, and therefore only make things less efficient trying to be in the "middle" of any discussion.
I believe CLDR is the closest to a standard across Windows and Unix - it is not something niche that .NET chose to depend on. If the issue was something specific to .NET's use of ICU/CLDR then we probably would want to be involved - that's not the case for the choice of negative sign in nb-NO.
@tarekgh is that a reasonable summary?

It also looks like this might be an issue with the CLDR metadata included with Ubuntu 18.04.

On Windows 10 (Build 17763), the program given in the OP succeeds.
On Ubuntu 18.04, the program fails with a FormatException.

@danmosemsft yes this is a good summary.

To be more helpful, here is the link where you can report any issue to CLDR: http://cldr.unicode.org/index/bug-reports

Also, I want to mention that we have fixed different parsing issues for the nb-NO culture before. So we really do care about customers when we have more control over the issue. @svick sorry if I left a bad impression when I closed the issue, but what @danmosemsft mentioned explains why I closed it.

@tannergooding Windows is trying to get as close to CLDR as it can, but some differences are still to be expected (or were decided intentionally). I still think it would be best for this problem to be fixed in CLDR (if it is really considered a problem for that culture). Anyway, there is a lot of collaboration between CLDR and Windows, and it is going in a very good direction.
