Runtime: Add a GetEncodings method to System.Text.EncodingProvider to support enumerating available character encodings

Created on 9 Apr 2018  路  25Comments  路  Source: dotnet/runtime

This API proposal arose out of dotnet/runtime#25804.

This proposal was approved before but we discovered the original proposal was returning EncodingInfo object while we didn't expose any public constructor for this type.
The delta change in this proposal is exposing a public constructor for EncodingInfo to allow creating such objects

Rationale and Use Cases

As of this writing:

However, it is desirable to have the ability to enumerate _all_ available encodings:

  • Even in the abstract it seems like an awkward omission not to be able to reflect on the available encodings; more concretely, consider the following use cases, both related to PowerShell Core:

  • PowerShell Core commands such as Get-Content for reading text files support an -Encoding parameter that directly accepts System.Text.Encoding instances, so users expect to be able to _discover_ all available encodings and/or be assisted in selecting a specific encoding with _tab completion_:

Get-Content file.txt -Encoding <tab-keypress>  # should show all available encodings
  • Similarly, a green-lighted, but not yet implemented Convert-TextFile command will allow conversion between all available character encodings, so their discovery / ease of selection is important there too.

Proposed API

Add a GetEncodings() method to abstract class System.Text.EncodingProvider:

namespace System.Text
{
    public abstract class EncodingProvider
    {
      // ...
      // New virtual method, analogous to System.Text.Encoding.GetEncodings()
      public virtual System.Text.EncodingInfo[] GetEncodings ();
      // ...
    }

    public sealed class EncodingInfo // already exposed class 
    {
       // This is internal constructor and we are making it public
        public EncodingInfo(int codePage, string name, string displayName); 
    }
}

And System.Text.CodePagesEncodingProvider would then implement that method to return the encodings it provides.

Specifically, a call to System.Text.CodePagesEncodingProvider.Instance.GetEncodings() would then return the array of System.Text.EncodingInfo instances representing that provider's encodings.

As with System.Text.Encodings.GetEncodings(), no particular enumeration order is guaranteed.


Integration into System.Text.Encodings.GetEncodings()

Once System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance) has been called, System.Text.Encodings.GetEncodings() will enumerate the additional encodings too, using the newly introduced method on the provider _if_ registered.

In other words: System.Text.Encodings.GetEncodings() will return whatever encodings are _currently available_ - whether just the default set by default or the _union_ of the default set and the additional encodings after provider registration.

As with the existing enumeration, the encodings (EncodingInfo instances) are returned in no guaranteed order.

If there is more than one registered provider that supports a given encoding, the returned list will contain only one entry for it.

Updates (most recent ones first)

  • Clarification re returning only distinct encodings.
  • Return behavior clarified again - no particular enumeration order is guaranteed
  • Made new GetEncodings method virtual rather than abstract.
  • Open question removed in favor of unconditionally outputting all then-available encodings.
  • Return behavior clarified (ordering).
api-approved area-System.Text.Encoding

Most helpful comment

Video

  • EncodingProvider.GetEncodings() should be IEnumerable<EncodingInfo>
  • EncodingProvider.GetEncodings() should be virtual and return an empty list (instead of throwing). While this isn't correct it means that one provider that isn't updated doesn't spoil the enumeration for everyone else.
  • Encoding.GetEncodings() should return all registered encodings across all providers and de-dupe them if necessary.

All 25 comments

public abstract System.Text.EncodingInfo[] GetEncodings()

This cannot be abstract method. Adding abstract method to existing type is a breaking change.

Either way, a new overload is needed for GetEncodings()

I do not think we need a new overload for GetEncodings(). GetEncodings() should just return all encodings available.

@mklement0 thanks for the proposal.

For the open question, System.Text.Encodings.GetEncodings() should report all encodings including the provider encodings. I am not sure why this will be a compatibility issue? I cannot think in any breaking scenario here as we are going to return a superset from what we used to return. could you please elaborate more on that?

@tarekgh, @jkotas:

To be clear, my preference too is for System.Text.Encodings.GetEncodings() to return all _available_ encodings, at least by default.

If there's no concern about backward compatibility (see below) and no need to support discovering _just_ the default encodings _while additional ones happen to be registered_, then I'm happy to

  • make do without a new System.Text.Encodings.GetEncodings() overload.

  • let System.Text.Encodings.GetEncodings() unconditionally return whatever encodings are _currently_ available.

I'm personally not worried about backward compatibility in this case, but _conceivably_ there is code out there that relies on System.Text.Encodings.GetEncodings() only ever returning the _default_ encodings; in other words: someone may rely on System.Text.Encodings.GetEncodings() as a reliable way to discover the _default encodings_.

So: Do wee need to worry about retaining a predictable, invariant way to discover the set of default encodings?

@jkotas:

This cannot be abstract method. Adding abstract method to existing type is a breaking change.

Thanks for pointing that out; I need guidance as to what to propose instead.

please remove the open question then.

Thanks for pointing that out; I need guidance as to what to propose instead.

make the method virtual instead of abstract.

@tarekgh: Thanks - done. I've also added details re return behavior (enumeration order); let me know if that makes sense.

@mklement0

System.Text.CodePagesEncodingProvider would then have to implement that method to return the encodings it provides, in ascending order of the CodePage property values of the EncodingInfo instances, analogous to System.Text.Encodings.GetEncodings().

System.Text.Encodings.GetEncodings documentation doesn't promise for any sort order. From where you got that?
https://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings(v=vs.110).aspx#Remarks

@tarekgh:

You're right - I didn't check the docs, I just inferred it from the de facto behavior.

Do you want me to say that the order is unspecified? What about when forming the union between default and registered encodings?

I prefer not mentioning anything about the order. we just promise to return the whole set supported by the core and the providers.

@tarekgh: Done (I've added a note that no particular enumeration order is guaranteed).

Thanks @mklement0

I have added the following statement too to the proposal integration section. let me know if you have any concern with that.

If there is more than one registered provider support same specific encoding, the returned list will contain only one entry for that specific encoding.

CC @krwq

Thanks, @tarekgh - good addition (I've streamlined the wording a bit).

There may be another scenario if the user wants to get only the default list:
```c#
public virtual System.Text.EncodingInfo[] GetEncodings (bool IncludeDefaultsOnly = false);

@iSazonov: This was discussed above, but both @jkotas and @tarekgh think it's not necessary.

(On a side note: My guess is that _separate method overloads_ are preferred to _optional parameters_; may be worth adding a clarification to the coding style document.)

Video

  • EncodingProvider.GetEncodings() should be IEnumerable<EncodingInfo>
  • EncodingProvider.GetEncodings() should be virtual and return an empty list (instead of throwing). While this isn't correct it means that one provider that isn't updated doesn't spoil the enumeration for everyone else.
  • Encoding.GetEncodings() should return all registered encodings across all providers and de-dupe them if necessary.

@mklement0 do you want to help in the implementation?

@tarekgh: Sounds good; I'll get started soon (unlike last time, where Anipik thankfully eventually jumped in).

@mklement0 Ping me too if you want.

I gave it a shot, but hit roadblocks (see below - some may require renewed discussion and sign-off).

Given my inexperience, I suggest someone more experienced take this on - @iSazonov, please feel free to take over.


Implementing this API requires CoreCLR (CoreLib) changes too, and I don't know how to coordinate that (even just adding a new virtual System.Text.EncodingProvider.GetEncodings() method breaks CoreFX-only builds):

  • Encoding.cs, whose Encoding.GetEncodings() method needs modifying, is part of CoreLib.

  • Encoding.GetEncodings() would need to iterate over all registered encoding providers in order to ask each for the encodings it provides, but EncodingProvider currently exposes no method for enumerating them.

On the CoreFX side:

@mklement0 thanks for trying. we'll try to get into this at some point.

t's unclear how to instantiate EncodingInfo: There's only an internal constructor

I think you have raised a good point too. we may need to revisit discussing EncodingInfo class to enable constructing it outside coreclr and provide the needed functionality. we'll try to think more about this one.

C# public sealed partial class EncodingInfo { internal EncodingInfo() { } public int CodePage { get { throw null; } } public string DisplayName { get { throw null; } } public string Name { get { throw null; } } public override bool Equals(object value) { throw null; } public System.Text.Encoding GetEncoding() { throw null; } public override int GetHashCode() { throw null; } }

@mklement0 I haven't an expirience to contribute in CoreCLR/CoreFX :confused:

breaks CoreFX-only builds

I believe it will be two PRs - in CoreCLR repo and in CoreFX repo. After the first one will be merged it will be automatically replicated in Core FX repo. After that the second PR in CoreFX can be continue.

@tarekgh Have you any news about EncodingInfo?

Have you any news about EncodingInfo?

No update yet because it is kind of low priority for now. But I suggest adding the following public constructor to EncodingInfo

C# public EncodingInfo(int codePage, string name, string displayName) { }

Video

  • EncodingProvider.GetEncodings() should be IEnumerable<EncodingInfo>
  • EncodingProvider.GetEncodings() should be virtual and return an Enumerable.Empty<EncodingInfo>() (instead of throwing). While this isn't correct it means that one provider that isn't updated doesn't spoil the enumeration for everyone else.
  • Encoding.GetEncodings() should return all registered encodings across all providers and de-dupe them if necessary.
  • EncodingInfo.GetEncoding() calls the static Encoding.GetEncoding() method, which means getting encoding infos and using EncodingInfo.GetEncoding() might create an encoding that isn't tied to that encoding provider.

    • EncodingInfo should take an EncodingProvider that we'll use in EncodingInfo.GetEncoding() and call EncodingProvider.GetEncoding(codePage).

```C#
namespace System.Text
{
public partial class EncodingProvider
{
public virtual IEnumerable GetEncodings();
}

public partial class EncodingInfo
{
    public EncodingInfo(EncodingProvider provider, int codePage, string name, string displayName);
}

}
```

Was this page helpful?
0 / 5 - 0 ratings

Related issues

EgorBo picture EgorBo  路  3Comments

bencz picture bencz  路  3Comments

matty-hall picture matty-hall  路  3Comments

iCodeWebApps picture iCodeWebApps  路  3Comments

nalywa picture nalywa  路  3Comments