Runtime: Add UTF8 to CharSet

Created on 17 Apr 2016  路  50Comments  路  Source: dotnet/runtime

The CharSet enumeration is used to specify how strings should be marshaled:
https://github.com/dotnet/corefx/blob/master/src/System.Runtime/ref/System.Runtime.cs#L3019-L3023

``` C#
public enum CharSet
{
Ansi = 2,
Unicode = 3,
}

`Unicode` specifies that UTF16 should be used, regardless of platform, but `Ansi` is interpreted differently based on platform: on Windows it's interpreted to mean the ANSI format, whereas on Unix it's interpreted to mean UTF8.  This means that on Windows we lack the ability to specify UTF8 as the marshaling, and more generally we lack the ability to specify UTF8 marshaling regardless of platform, making writing cross-platform managed components more difficult.

We should add a new UTF8 enum value:

``` C#
public enum CharSet
{
    Ansi = 2,
    Unicode = 3,
    UTF8 = 5,
}

that when used will cause the runtime's marshaling to be done with UTF8, which is the standard for modern services.

[Added-by-Yi]

We should also add a new corresponding UnmanagedType enum in UnmanagedType for UTF8 as well, for finer control and parity (between UnmanagedType and CharSet):
https://github.com/dotnet/corefx/blob/master/src/System.Runtime.InteropServices.PInvoke/ref/System.Runtime.InteropServices.PInvoke.cs

``` C#
public enum UnmanagedType
{
LPUTF8Str = 0x30
}

And new Marshal helpers while we are at it:

``` C#
public class PInvokeMarshal
{
    public static string PtrToStringUTF8(System.IntPtr ptr);
    public static string PtrToStringUTF8(System.IntPtr ptr, int len);
    public static System.IntPtr StringToAllocatedMemoryUTF8(string s);
    public static System.IntPtr ZeroFreeMemoryUTF8(System.IntPtr s);
}
api-approved area-System.Runtime.InteropServices

Most helpful comment

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

But isn't that the whole point of warningaserror, declaring that they _want_ their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes?

All 50 comments

Adding some extra information.

When these values are embedded in P/Invoke definitions on the metadata, in the "Flags for ImplMap" [PInvokeAttributes]. There is one unused bit there, that we could take (0x08) and we could take one value of those to mean Utf8, leaving some room for other things in the future as well.

I think we should also add UnmanagedType=LpUtf8Str (or UTF8String, name TBD). All the capabilities in CharSet should be available in UnmanagedType.

The work is the same since we have to create a new ilmarshaler anyway.

Added UnmanagedType into @stephentoub original proposal.

Added new PInvokeMarshal helpers into the proposal as well.

Yes, this has been needed for quite a while now. It definitely looks like a worthwhile proposal.

Updated LpUTF8Str to LPUTF8Str according to @weshaggard 's feedback.

@yizhang82
cc @weshaggard @stephentoub
Does the below API names sound reasonable? I know you named them StringToAllocatedMemoryUTF8 , but StringToCoTaskMemUTF8 sounds consistent with what we have now for ANSI and UniCode(aka UTF16).

unsafe public static String PtrToStringUTF8(IntPtr ptr)
unsafe public static String PtrToStringUTF8(IntPtr ptr,int len)
unsafe public static IntPtr StringToCoTaskMemUTF8(String s)
unsafe public static void ZeroFreeCoTaskMemUTF8(IntPtr s)

@tijoytom Please note that these are the new names we defined for PInvokeMarshal class where the naming is consistent and use AllocatedMemory (instead of CoTaskMem, to avoid windows-ness).

Any chance we can call a duck a "duck" and change Unicode to Utf16? I'd even settle for Ucs2 at this point.

@whoisj Unfortunately changing existing API contract is a breaking change. We can change the Unicode names in the new PInvokeMarshal class APIs, but the CharSet enum is going to be problematic. I think we can slowly introduce new API and enum with better names and slowly deprecate the old ones (LPWstr is another such example).

@yizhang82 What about leaving Unicode and adding Utf16, with both having the same value of 3?

@bendono That was my thought too. Or possibly even marking Unicode as [Obsolete]. (Can you do that with enum values?)

@whoisj

Unfortunately changing existing API contract is a breaking change. We can change the Unicode names in the new PInvokeMarshal class APIs, but the CharSet enum is going to be problematic. I think we can slowly introduce new API and enum with better names and slowly deprecate the old ones (LPWstr is another such example).

I guess I do not understand why the enumeration value cannot be overloaded for "correctness" and "back compat". Something along the lines of:

C# public enum CharSet { Ansi = 2, [Obsolete] Unicode = 3, UTF16 = 3, UTF8 = 5, }

@whoisj @masonwheeler Thanks for your suggestions. Yes, this is exactly what I was thinking as well (introduce new API and deprecate old ones). In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode. It is something a simple search/replace could fix, but still potentially a big impact regardless.

@yizhang82 as a customer who would be impacted by such a change in the way you describe, I appreciate your concern.

I propose the following resolution:

  1. Add UTF16 = 3 to the enumeration, and pubically state that Unicode will be deprecated in the future.
  2. In the follow up release to NetFx 4.7 / 5 / 7, whatever it'll be named, add the [Obsolete] decorator. to the Unicode value.
  3. Sometime in the future, when appropriate, remove the Unicode value from the enumeration.

This seems to be a decent compromise.

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

But isn't that the whole point of warningaserror, declaring that they _want_ their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes?

@yizhang82
unsafe public static String PtrToStringUTF8(IntPtr ptr,int len)

Realized that the semantics of this method (ie for ANSI and UTF16) usually 'len' parameter is the length in number of characters , now for UTF8 the user won't know the number of character pointed to by ptr. Now even if the user somehow find the number of chars , we still need to get the nubmer of bytes to do the actual conversion.
Given this , i think it might be better to drop this overload of PtrToStringUTF8.

@tijoytom not every uft8 block is null terminated. Many use-cases will need the len value to be the number of utf8 bytes, but absolutely not the count of characters or code points in the block.

@tijoytom The len here is not count of characters. It is byte len. Our Documentation on PtrToStringAnsi is incorrect. Let's change the name of PtrToStringUTF8 to be PtrToStringUTF8ByteLen

Let's change the name of PtrToStringUTF8 to be PtrToStringUTF8ByteLen

Wouldn't it be better to instead change the name of the parameter from len to byteLength or something like that?

Yes, changed it to byteLen

@yizhang82 @tijoytom is this something you are currently working on?

@masonwheeler

@yizhang82

In the CharSet case, the potential concern is deprecating would introduce warnings in existing code and will break them if they have warningaserror, for something common as CharSet.Unicode.

But isn't that the whole point of warningaserror, declaring that they want their code to break if even something minor enough to be considered an error ends up being off about their codebase, including future changes?

This topic has come up several times here and in .NET design meetings. Being respectful of warnings-as-errors is a good thing, just as much for the sake of not introducing 1000 new build warnings even _without_ warnings-as-errors. But at the end of the day, if people opt into warnings as errors, they should expect to have their builds broken now and then. They want micromanagement, and they will get it.

I think obsoleting public API immediately is okay. If you don't do that, at least clearly document the value as deprecated in IntelliSense.

Personally I leave it off until the implementation is past primary development. I turn it on during the stage where I don't expect to be making more changes myself besides potential dependency updating.

@jnm2 the Visual Studio experience is the main concern. I can't speak for the compiler team but I understand they wish to avoid making new warnings appear in the VS error list on upgrade as over the years it has proven to deter people from upgrading their Visual Studio. Historically we have tried to encourage folks to disable specific warnings to get past this but it hasn't been sufficient. I completely agree with the sentiment because there are types we _really_ want to steer folks away from but that's the current policy.

@danmosemsft but CoreFx and VS are decoupled now, no?

@whoisj They are decoupled - but VS needs to have a policy decision on which version of CoreFX meta package as the default in VS projects, and this has wide impact to developers using VS. At the end of the day, I think there is a trade off to be made case-by-case. In this case I don't see a strong reason. There will be some confusion, yes, but given that rest of the .NET escosystem (char, string, etc) is pretty much UTF-16 based, it is probably not worth impacting anyone that has CharSet.Unicode in their code.

We could simply add a new value UTF16 that has the same value without deprecating the Unicode value.

@whoisj Visual Studio does not run on CoreFX, but it does offer a tooling experience for CoreFX which is what matters here. Possibly I misunderstand.

@yizhang82 I think this:

We could simply add a new value UTF16 that has the same value without deprecating the Unicode value.

Is the right answer. For those coming from non-Windows platforms, seeing the options Ascii, Unicode, and UTF8 as options is kind of mind boggling. For many, Unicode is UTF8. 馃槒

Additionally I think it's valuable to make sure it's clear in the IntelliSense documentation for Unicode that UTF16 should be used instead.

@jnm2 We can also add EditorBrowsable(Never) on the Unicode enum value to hide it, if I remember correctly. That's probably for the better :)

@yizhang82 Please do, and in addition, make it clear in the IntelliSense documentation for the ReSharper users who don't benefit from EditorBrowsable visibility settings and for those browsing MSDN, and for those who have just pasted or typed Unicode.

@yizhang82 are you planning to do this? just trying to clarify who is.

@danmosemsft We'll get to it when we add CharSet.UTF8.
cc @tijoytom

@yizhang82 is this something you or @tijoytom are working on for 2.0?
How long do you think this is going to take?

Not for 2.0 because it is not there in desktop CLR. It'll be post 2.0 work.

How would I implement a "UTF8" only on non-Windows platforms?

@fubar-coder, the existing "Ansi" is implemented as UTF8 on Unix.

...but not on other platforms. Yay consistency!

@jeffschwMSFT is Charset.UTF8 in plan somewhere? It is wonky to have to use Charset.ANSI for Unix - not guessable.

I will follow-up and see where the broader plan is - particularly since we have a loose dependency on ensuring .NET Fx is inalignment.

assign this to @jeffschwMSFT and @luqunl

I love that this is being looked at!

Wouldn't the correct casing be Utf8 in order to match Ansi? Keeping Ansi but adding UTF8 seems like an inconsistency.

Re casing, The design guidelines require Utf8. Looking in CoreFXLabs, they consistently use Utf8.

However there is plenty of use of upper casings eg if you look in CoreFX public API and in https://apisof.net.

@stephentoub did you choose upper case to be consistent with UTF8Encoding?

@stephentoub did you choose upper case to be consistent with UTF8Encoding?

With Encoding.UTF8

@brian-armstrong-discord , We are discussing whether it is necessary to add Charset.UTF8. Most of case(except array), User can just add MarshalAs(UnmanagedType.LPUTF8Str) to each appropriate arguments/fields to use UTF8.

@luqun Besides char, char[], CharSet also affects default String/StringBuilder marshaling behavior.

@jeffschwMSFT what's the current status of this approved API? do you need anything from the corefx team to continue making progress here?

/cc @layomia and @GrabYourPitchforks

Thanks

Any progress on this? The current behavior is confusing.

This is not currently on our .NET Core 3.0 list of final items. We can take a look and see if this would align with 3.1.

cc @jkoritzinsky

Was this page helpful?
0 / 5 - 0 ratings