Roslyn: VirtualChar system cannot handle 32bit wide characters

Created on 9 Aug 2018 · 10 Comments · Source: dotnet/roslyn

Original PR: https://github.com/dotnet/roslyn/pull/28927

The VirtualChar system can properly handle almost all C# escapes (including \n, \u0001, \xff, etc.). All these escapes map back down to a single 'char'.

One thing it can't handle is \UXXXXXXXX where the escape maps to a character larger than 16 bits. It's unclear whether we should bother making it support that case, but this issue tracks the limitation anyway.
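To make the limitation concrete, here is a minimal sketch of the difference between a 4-digit and an 8-digit escape. It uses only standard `string` and `char` members, not Roslyn's VirtualChar API; the class and member names are hypothetical and chosen for illustration.

```csharp
using System;

public static class EscapeWidthDemo
{
    // "\u0041" is a 4-digit escape: one escape, one UTF-16 code unit.
    public const string Narrow = "\u0041";

    // "\U0001F600" is an 8-digit escape for U+1F600, which lies above U+FFFF.
    // The compiler encodes it as a surrogate pair: one escape, two chars.
    public const string Wide = "\U0001F600";

    // Number of UTF-16 code units (chars) the string occupies.
    public static int CodeUnitCount(string s) => s.Length;

    // True when the string consists of exactly one high/low surrogate pair.
    public static bool IsSingleSurrogatePair(string s) =>
        s.Length == 2 && char.IsSurrogatePair(s[0], s[1]);
}
```

`CodeUnitCount(EscapeWidthDemo.Narrow)` is 1, while `CodeUnitCount(EscapeWidthDemo.Wide)` is 2, so a mapping that assumes each escape yields a single `char` has nowhere to put the second code unit.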

Area-IDE Bug Tenet-Localization help wanted

CyrusNajmabadi

All 10 comments

Is there a BCL datatype which Roslyn references which should be used, or is int/uint the path to take?

jnm2 on 21 Mar 2020

I couldn't find a BCL type (pity). I'm using uint myself.

CyrusNajmabadi on 21 Mar 2020

@CyrusNajmabadi BCL uses int internally - the valid range is U+0000..U+10FFFF.
https://github.com/dotnet/runtime/blob/16e325d2f8f0aa0f7ab390275525b569edac7d1f/src/libraries/Common/tests/CoreFx.Private.TestUtilities.Unicode/System/Text/Unicode/CodePoint.cs#L15

tmat on 21 Mar 2020

Lol... sigh.

CyrusNajmabadi on 21 Mar 2020

They use a signed value... to represent a character... thus also making it a super PITA to go between chars and CodePoints... sigh...

CyrusNajmabadi on 21 Mar 2020

I'm not sure what you mean. char is a different concept than code point. You can't just cast char to int to get a code point.

tmat on 21 Mar 2020

See char.ConvertToUtf32

tmat on 21 Mar 2020
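As an illustration of the two points above (casting a `char` versus calling `char.ConvertToUtf32`), here is a small sketch. `char.ConvertToUtf32` and `char.ConvertFromUtf32` are real BCL methods; the wrapper class and method names are hypothetical.

```csharp
using System;

public static class CodePointConversionDemo
{
    // Casting a char yields only its UTF-16 code unit. For a character above
    // U+FFFF this is one half of a surrogate pair, not the code point.
    public static int CodeUnitAt(string s, int index) => s[index];

    // char.ConvertToUtf32 combines a surrogate pair starting at index into
    // the full code point, returned as a (signed) int.
    public static int CodePointAt(string s, int index) =>
        char.ConvertToUtf32(s, index);

    // The reverse direction: a code point back to a one- or two-char string.
    public static string FromCodePoint(int codePoint) =>
        char.ConvertFromUtf32(codePoint);
}
```

For the string `"\U0001F600"`, `CodeUnitAt(s, 0)` gives 0xD83D (the high surrogate), whereas `CodePointAt(s, 0)` gives 0x1F600.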

> I'm not sure what you mean. char is a different concept than code point. You can't just cast char to int to get a code point.

I mean that there is no sense in which these are signed. There's no negative code point, just like there's no negative char. My sigh is about using an inappropriate type (likely for CLS reasons) to represent this instead of matching what char does and having a sane 0-N representation.

CyrusNajmabadi on 21 Mar 2020

👍1

.NET commonly uses int to store non-negative numbers that fit in 31 bits.
You could also argue that uint is not a good representation because it can represent numbers that are larger than 0x10ffff, which is the max value of a code point.

tmat on 22 Mar 2020

> You could also argue that uint is not a good representation because it can represent numbers that are larger than 0x10ffff, which is the max value of a code point.

Yes. But it's the best type we have, and also makes the math a ton simpler.

CyrusNajmabadi on 22 Mar 2020
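One possible reading of the "simpler math" point, sketched under assumptions: with an unsigned value the range check needs only an upper bound, and the surrogate arithmetic never has to consider negative inputs. This is not Roslyn's implementation; every name below is hypothetical.

```csharp
using System;

public static class CodePointMath
{
    // Highest valid Unicode code point, as stated in the thread.
    public const uint MaxCodePoint = 0x10FFFF;

    // Unsigned: a single comparison, since the value cannot be negative.
    public static bool IsInRangeUnsigned(uint value) => value <= MaxCodePoint;

    // Signed: the lower bound has to be checked as well.
    public static bool IsInRangeSigned(int value) =>
        value >= 0 && value <= 0x10FFFF;

    // Splits a code point above U+FFFF into its UTF-16 surrogate pair.
    public static (char High, char Low) ToSurrogates(uint codePoint)
    {
        if (codePoint < 0x10000u || codePoint > MaxCodePoint)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        uint offset = codePoint - 0x10000u;             // 20 significant bits
        char high = (char)(0xD800u + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00u + (offset & 0x3FFu)); // bottom 10 bits
        return (high, low);
    }
}
```

For example, `ToSurrogates(0x1F600)` yields the pair (0xD83D, 0xDE00). As noted in the thread, neither `int` nor `uint` restricts values to 0x10FFFF, so an explicit range check is needed with either type.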

👍2
