Roslyn: Surrogate pairs not recognized in identifiers.

Created on 31 Aug 2016  ·  6 Comments  ·  Source: dotnet/roslyn

I am working on GB18030 certification prep for NuGet Visual Studio UI.

It seems that characters from CJK Unified Ideographs Extension B are not accepted in C# namespace or class names in Visual Studio.

But according to the specs here, they should be accepted, since CJK Extension B falls into the Lo (Other Letter) class of Unicode.

Repro -

  1. Create new console app in VS.
  2. Change the namespace or class name to a combination of CJK Extension B characters, for example 𠀀𠀁𠀂𠀃 (the first four characters of the block).
  3. Observe that VS reports an error.

More ref - SO Question

//cc : @rrelyea

Area-Compilers Bug Language-C# Language-VB Tenet-Localization help wanted

All 6 comments

@mishra14 The Roslyn compiler doesn't accept surrogate pairs. As long as the compiler works on string, which is a UTF-16 based implementation, the cost of accepting them is too high.

There is an easy but dangerous way to accept them: add the following line to the UnicodeCharacterUtilities.IsLetterChar method.

                case UnicodeCategory.Surrogate:

after this line: https://github.com/dotnet/roslyn/blob/master/src/Compilers/Core/Portable/UnicodeCharacterUtilities.cs#L125

The Swift language takes this approach. However, it introduces another problem: you can then use any supplementary character in an identifier, including symbols. The following link shows valid Swift source code that uses Mathematical Alphanumeric Symbols as identifiers.

https://swiftlang.ng.bluemix.net/#/repl/57c4393a7adc02c56275043d
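To make the hazard concrete, here is a minimal sketch of a per-char letter check with the extra Surrogate case added. The NaiveLetterCheck class and its AllLetterChars helper are hypothetical stand-ins written for illustration; they are not the actual Roslyn source, and the list of letter categories is an assumption based on the C# spec's letter-character rule.

```csharp
using System.Globalization;

// Hypothetical stand-in for a per-char letter check; not the actual Roslyn source.
public static class NaiveLetterCheck
{
    public static bool IsLetterChar(char c)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            case UnicodeCategory.LetterNumber:
            case UnicodeCategory.Surrogate: // the "easy but dangerous" addition
                return true;
            default:
                return false;
        }
    }

    // True when every UTF-16 code unit of s passes the per-char check.
    public static bool AllLetterChars(string s)
    {
        foreach (char c in s)
        {
            if (!IsLetterChar(c)) return false;
        }
        return true;
    }
}
```

With this check, "\U00020000" (a CJK Extension B letter) passes, but so does "\U0001D11E" (musical symbol G clef, category So), because each half of any surrogate pair is classified only as Surrogate.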

Thus, the more correct way is to use UnicodeCategory GetUnicodeCategory(string, int) instead of UnicodeCategory GetUnicodeCategory(char), but its performance hit is not trivial.
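As a rough illustration of the difference between the two overloads, the sketch below calls both on U+20000 through CharUnicodeInfo in System.Globalization. The CategoryDemo class and its member names are hypothetical and exist only for this example.

```csharp
using System.Globalization;

// Hypothetical helper for illustration; not Roslyn code.
public static class CategoryDemo
{
    // U+20000, the first character of CJK Unified Ideographs Extension B.
    public const string ExtB = "\U00020000";

    // Classifies a single UTF-16 code unit; each half of a surrogate pair
    // is reported as Surrogate.
    public static UnicodeCategory ByCodeUnit(string s, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(s[index]);
    }

    // Classifies the character starting at index; a valid surrogate pair
    // is treated as one code point.
    public static UnicodeCategory ByCodePoint(string s, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(s, index);
    }
}
```

Under these assumptions, CategoryDemo.ByCodeUnit(CategoryDemo.ExtB, 0) should report Surrogate, while CategoryDemo.ByCodePoint(CategoryDemo.ExtB, 0) should report OtherLetter (Lo), which is what lets the string-based overload separate letters from symbols.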

Treating surrogates properly in identifiers is nontrivial, but it doesn't have to be much of a performance hit. Only when surrogates are actually used would any surrogate-related code path be taken.
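A minimal sketch of that idea, assuming a scanner that asks how many UTF-16 code units a letter occupies at a given position: the IdentifierScan class, LetterLength, and IsLetterCategory are hypothetical names, not the Roslyn implementation. The per-char path stays unchanged for BMP characters, and the pair-aware path runs only when a surrogate is actually encountered.

```csharp
using System.Globalization;

// Hypothetical sketch; not the Roslyn implementation.
public static class IdentifierScan
{
    // Returns the number of UTF-16 code units (1 or 2) occupied by a letter
    // starting at index, or 0 if no letter starts there.
    public static int LetterLength(string text, int index)
    {
        char c = text[index];

        // Fast path: ordinary BMP characters never reach surrogate handling.
        if (!char.IsSurrogate(c))
        {
            return IsLetterCategory(CharUnicodeInfo.GetUnicodeCategory(c)) ? 1 : 0;
        }

        // Slow path: taken only when the source actually contains surrogates.
        if (char.IsSurrogatePair(text, index))
        {
            return IsLetterCategory(CharUnicodeInfo.GetUnicodeCategory(text, index)) ? 2 : 0;
        }

        // Unpaired surrogate: never a letter.
        return 0;
    }

    // Classes Lu, Ll, Lt, Lm, Lo and Nl (assumed from the spec's letter-character rule).
    private static bool IsLetterCategory(UnicodeCategory category)
    {
        switch (category)
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            case UnicodeCategory.LetterNumber:
                return true;
            default:
                return false;
        }
    }
}
```

In this sketch, LetterLength("\U00020000", 0) should return 2 and LetterLength("a", 0) should return 1, while a supplementary-plane symbol such as "\U0001D11E" (category So) or an unpaired surrogate should return 0.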

@mishra14 I am all for supporting surrogate pairs and would have pointed out the specification myself, except that it explicitly refers to Unicode 3.0, where there were no such surrogates as far as I know...

@miloush More recent versions of the language specifications will refer to more recent versions of the Unicode specs. Sorry the C# 6 spec isn't out yet.

Actually, the older versions of the spec appear to say we should support surrogates. For example, see https://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx which implies that the following program should compile without error. Roslyn rejects it.

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    public static void Main(string[] args)
    {
        int \U00020000 = 10; // http://www.fileformat.info/info/unicode/char/20000/index.htm
        Console.WriteLine(\U00020000);
    }
}

This is also reported as #9371
