Runtime: Add UTF8 to CharSet

Created on 17 Apr 2016 · 50Comments · Source: dotnet/runtime

The CharSet enumeration is used to specify how strings should be marshaled:
https://github.com/dotnet/corefx/blob/master/src/System.Runtime/ref/System.Runtime.cs#L3019-L3023

``` C#
public enum CharSet
{
Ansi = 2,
Unicode = 3,
}

`Unicode` specifies that UTF16 should be used, regardless of platform, but `Ansi` is interpreted differently based on platform: on Windows it's interpreted to mean the ANSI format, whereas on Unix it's interpreted to mean UTF8.  This means that on Windows we lack the ability to specify UTF8 as the marshaling, and more generally we lack the ability to specify UTF8 marshaling regardless of platform, making writing cross-platform managed components more difficult.

We should add a new UTF8 enum value:

``` C#
public enum CharSet
{
    Ansi = 2,
    Unicode = 3,
    UTF8 = 5,
}

that when used will cause the runtime's marshaling to be done with UTF8, which is the standard for modern services.

[Added-by-Yi]

We should also add a new corresponding UnmanagedType enum in UnmanagedType for UTF8 as well, for finer control and parity (between UnmanagedType and CharSet):
https://github.com/dotnet/corefx/blob/master/src/System.Runtime.InteropServices.PInvoke/ref/System.Runtime.InteropServices.PInvoke.cs

``` C#
public enum UnmanagedType
{
LPUTF8Str = 0x30
}

And new Marshal helpers while we are at it:

``` C#
public class PInvokeMarshal
{
    public static string PtrToStringUTF8(System.IntPtr ptr);
    public static string PtrToStringUTF8(System.IntPtr ptr, int len);
    public static System.IntPtr StringToAllocatedMemoryUTF8(string s);
    public static System.IntPtr ZeroFreeMemoryUTF8(System.IntPtr s);
}

api-approved area-System.Runtime.InteropServices

Source

stephentoub

👍13 ❤3

Most helpful comment

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

But isn't that the whole point of warningaserror, declaring that they _want_ their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes?

masonwheeler on 11 May 2016

👍5

All 50 comments

Adding some extra information.

When these values are embedded in P/Invoke definitions on the metadata, in the "Flags for ImplMap" [PInvokeAttributes]. There is one unused bit there, that we could take (0x08) and we could take one value of those to mean Utf8, leaving some room for other things in the future as well.

migueldeicaza on 17 Apr 2016

I think we should also add UnmanagedType=LpUtf8Str (or UTF8String, name TBD). All the capabilities in CharSet should be available in UnmanagedType.

yizhang82 on 19 Apr 2016

The work is the same since we have to create a new ilmarshaler anyway.

yizhang82 on 19 Apr 2016

Added UnmanagedType into @stephentoub original proposal.

yizhang82 on 19 Apr 2016

Added new PInvokeMarshal helpers into the proposal as well.

yizhang82 on 19 Apr 2016

Yes, this has been needed for quite a while now. It definitely looks like a worthwhile proposal.

masonwheeler on 19 Apr 2016

Updated LpUTF8Str to LPUTF8Str according to @weshaggard 's feedback.

yizhang82 on 19 Apr 2016

@yizhang82
cc @weshaggard @stephentoub
Does the below API names sound reasonable? I know you named them StringToAllocatedMemoryUTF8 , but StringToCoTaskMemUTF8 sounds consistent with what we have now for ANSI and UniCode(aka UTF16).

unsafe public static String PtrToStringUTF8(IntPtr ptr)
unsafe public static String PtrToStringUTF8(IntPtr ptr,int len)
unsafe public static IntPtr StringToCoTaskMemUTF8(String s)
unsafe public static void ZeroFreeCoTaskMemUTF8(IntPtr s)

tijoytom on 10 May 2016

@tijoytom Please note that these are the new names we defined for PInvokeMarshal class where the naming is consistent and use AllocatedMemory (instead of CoTaskMem, to avoid windows-ness).

yizhang82 on 11 May 2016

Any chance we can call a duck a "duck" and change Unicode to Utf16? I'd even settle for Ucs2 at this point.

whoisj on 11 May 2016

@whoisj Unfortunately changing existing API contract is a breaking change. We can change the Unicode names in the new PInvokeMarshal class APIs, but the CharSet enum is going to be problematic. I think we can slowly introduce new API and enum with better names and slowly deprecate the old ones (LPWstr is another such example).

yizhang82 on 11 May 2016

@yizhang82 What about leaving Unicode and adding Utf16, with both having the same value of 3?

bendono on 11 May 2016

@bendono That was my thought too. Or possibly even marking Unicode as [Obsolete]. (Can you do that with enum values?)

masonwheeler on 11 May 2016

@whoisj

Unfortunately changing existing API contract is a breaking change. We can change the Unicode names in the new PInvokeMarshal class APIs, but the CharSet enum is going to be problematic. I think we can slowly introduce new API and enum with better names and slowly deprecate the old ones (LPWstr is another such example).

I guess I do not understand why the enumeration value cannot be overloaded for "correctness" and "back compat". Something along the lines of:

C# public enum CharSet { Ansi = 2, [Obsolete] Unicode = 3, UTF16 = 3, UTF8 = 5, }

whoisj on 11 May 2016

@whoisj @masonwheeler Thanks for your suggestions. Yes, this is exactly what I was thinking as well (introduce new API and deprecate old ones). In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode. It is something a simple search/replace could fix, but still potentially a big impact regardless.

yizhang82 on 11 May 2016

@yizhang82 as a customer who would be impacted by such a change in the way you describe, I appreciate your concern.

I propose the following resolution:

Add UTF16 = 3 to the enumeration, and pubically state that Unicode will be deprecated in the future.
In the follow up release to NetFx 4.7 / 5 / 7, whatever it'll be named, add the [Obsolete] decorator. to the Unicode value.
Sometime in the future, when appropriate, remove the Unicode value from the enumeration.

This seems to be a decent compromise.

whoisj on 11 May 2016

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

masonwheeler on 11 May 2016

👍5

@yizhang82
unsafe public static String PtrToStringUTF8(IntPtr ptr,int len)

Realized that the semantics of this method (ie for ANSI and UTF16) usually 'len' parameter is the length in number of characters , now for UTF8 the user won't know the number of character pointed to by ptr. Now even if the user somehow find the number of chars , we still need to get the nubmer of bytes to do the actual conversion.
Given this , i think it might be better to drop this overload of PtrToStringUTF8.

tijoytom on 12 May 2016

@tijoytom not every uft8 block is null terminated. Many use-cases will need the len value to be the number of utf8 bytes, but absolutely not the count of characters or code points in the block.

whoisj on 12 May 2016

@tijoytom The len here is not count of characters. It is byte len. Our Documentation on PtrToStringAnsi is incorrect. Let's change the name of PtrToStringUTF8 to be PtrToStringUTF8ByteLen

yizhang82 on 12 May 2016

Let's change the name of PtrToStringUTF8 to be PtrToStringUTF8ByteLen

Wouldn't it be better to instead change the name of the parameter from len to byteLength or something like that?

stephentoub on 12 May 2016

👍2

Yes, changed it to byteLen

tijoytom on 13 May 2016

@yizhang82 @tijoytom is this something you are currently working on?

AlexGhiondea on 29 Nov 2016

@masonwheeler

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

But isn't that the whole point of warningaserror, declaring that they want their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes?

This topic has come up several times here and in .NET design meetings. Being respectful of warnings-as-errors is a good thing, just as much for the sake of not introducing 1000 new build warnings even _without_ warnings-as-errors. But at the end of the day, if people opt into warnings as errors, they should expect to have their builds broken now and then. They want micromanagement, and they will get it.

I think obsoleting public API immediately is okay. If you don't do that, at least clearly document the value as deprecated in IntelliSense.

Personally I leave it off until the implementation is past primary development. I turn it on during the stage where I don't expect to be making more changes myself besides potential dependency updating.

jnm2 on 30 Nov 2016

@jnm2 the Visual Studio experience is the main concern. I can't speak for the compiler team but I understand they wish to avoid making new warnings appear in the VS error list on upgrade as over the years it has proven to deter people from upgrading their Visual Studio. Historically we have tried to encourage folks to disable specific warnings to get past this but it hasn't been sufficient. I completely agree with the sentiment because there are types we _really_ want to steer folks away from but that's the current policy.

danmosemsft on 3 Jan 2017

@danmosemsft but CoreFx and VS are decoupled now, no?

whoisj on 3 Jan 2017

@whoisj They are decoupled - but VS needs to have a policy decision on which version of CoreFX meta package as the default in VS projects, and this has wide impact to developers using VS. At the end of the day, I think there is a trade off to be made case-by-case. In this case I don't see a strong reason. There will be some confusion, yes, but given that rest of the .NET escosystem (char, string, etc) is pretty much UTF-16 based, it is probably not worth impacting anyone that has CharSet.Unicode in their code.

We could simply add a new value UTF16 that has the same value without deprecating the Unicode value.

yizhang82 on 3 Jan 2017

@whoisj Visual Studio does not run on CoreFX, but it does offer a tooling experience for CoreFX which is what matters here. Possibly I misunderstand.

danmosemsft on 3 Jan 2017

@yizhang82 I think this:

We could simply add a new value UTF16 that has the same value without deprecating the Unicode value.

Is the right answer. For those coming from non-Windows platforms, seeing the options Ascii, Unicode, and UTF8 as options is kind of mind boggling. For many, Unicode is UTF8. 😏

whoisj on 4 Jan 2017

👍2

Additionally I think it's valuable to make sure it's clear in the IntelliSense documentation for Unicode that UTF16 should be used instead.

jnm2 on 4 Jan 2017

👍1

@jnm2 We can also add EditorBrowsable(Never) on the Unicode enum value to hide it, if I remember correctly. That's probably for the better :)

yizhang82 on 4 Jan 2017

👍2

@yizhang82 Please do, and in addition, make it clear in the IntelliSense documentation for the ReSharper users who don't benefit from EditorBrowsable visibility settings and for those browsing MSDN, and for those who have just pasted or typed Unicode.

jnm2 on 4 Jan 2017

@yizhang82 are you planning to do this? just trying to clarify who is.

danmosemsft on 10 Jan 2017

@danmosemsft We'll get to it when we add CharSet.UTF8.
cc @tijoytom

yizhang82 on 12 Jan 2017

@yizhang82 is this something you or @tijoytom are working on for 2.0?
How long do you think this is going to take?

AlexGhiondea on 17 Mar 2017

Not for 2.0 because it is not there in desktop CLR. It'll be post 2.0 work.

yizhang82 on 17 Mar 2017

😕3

How would I implement a "UTF8" only on non-Windows platforms?

fubar-coder on 29 Aug 2017

@fubar-coder, the existing "Ansi" is implemented as UTF8 on Unix.

stephentoub on 9 Oct 2017

😕1

...but not on other platforms. Yay consistency!

masonwheeler on 9 Oct 2017

@jeffschwMSFT is Charset.UTF8 in plan somewhere? It is wonky to have to use Charset.ANSI for Unix - not guessable.

danmosemsft on 9 Feb 2018

I will follow-up and see where the broader plan is - particularly since we have a loose dependency on ensuring .NET Fx is inalignment.

jeffschwMSFT on 9 Feb 2018

assign this to @jeffschwMSFT and @luqunl

luqunl on 9 Feb 2018

I love that this is being looked at!

Wouldn't the correct casing be Utf8 in order to match Ansi? Keeping Ansi but adding UTF8 seems like an inconsistency.

brian-armstrong-discord on 6 Jun 2018

👍2

Re casing, The design guidelines require Utf8. Looking in CoreFXLabs, they consistently use Utf8.

However there is plenty of use of upper casings eg if you look in CoreFX public API and in https://apisof.net.

@stephentoub did you choose upper case to be consistent with UTF8Encoding?

danmosemsft on 6 Jun 2018

👍1

@stephentoub did you choose upper case to be consistent with UTF8Encoding?

With Encoding.UTF8

stephentoub on 6 Jun 2018

@brian-armstrong-discord , We are discussing whether it is necessary to add Charset.UTF8. Most of case(except array), User can just add MarshalAs(UnmanagedType.LPUTF8Str) to each appropriate arguments/fields to use UTF8.

luqunl on 6 Jun 2018

@luqun Besides char, char[], CharSet also affects default String/StringBuilder marshaling behavior.

yizhang82 on 6 Jun 2018

👍1

@jeffschwMSFT what's the current status of this approved API? do you need anything from the corefx team to continue making progress here?

/cc @layomia and @GrabYourPitchforks

Thanks