Standard: What is the default encoding in .NET?

Created on 27 Mar 2017 · 28 Comments · Source: dotnet/standard

Currently, the default encoding on Windows is UTF-16 LE; on Unix it is UTF-8 without a BOM.

What is the default encoding in .NET Standard 1.0 and 2.0 (.NET Core 1.0, 1.1, 2.0)?

All 28 comments

What do you mean by "default encoding"? For example, if you don't specify an encoding, StreamWriter will use UTF-8 without BOM on all OSes. As another example, Encoding.Default is not exposed on .Net Core 1.x and, as far as I can tell, it will return UTF-8 on .Net Core 2.0 on all OSes.

Depending on Encoding.Default is generally a bad idea, especially if you are trying to use it to encode/decode some payload you exchange across machines.

As @svick points out we generally use UTF-8 without a BOM when no encoding is specified but you generally need to specify the encoding based on the data you want to decode/encode.
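The behavior described in the two comments above can be checked with a short sketch. The class and method names (DefaultWriteDemo, FirstBytes) are illustrative only; the sketch writes a temporary file once without specifying an encoding and once with Encoding.UTF8, then prints the first three bytes of each.

```csharp
using System;
using System.IO;
using System.Text;

public static class DefaultWriteDemo
{
    // Runs `write` against a temporary file and returns the file's
    // first (up to three) bytes as a hex string such as "EF-BB-BF".
    public static string FirstBytes(Action<string> write)
    {
        string path = Path.GetTempFileName();
        try
        {
            write(path);
            byte[] bytes = File.ReadAllBytes(path);
            return BitConverter.ToString(bytes, 0, Math.Min(3, bytes.Length));
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        // No encoding specified: StreamWriter uses UTF-8 without a BOM.
        Console.WriteLine(FirstBytes(p =>
        {
            using (var w = new StreamWriter(p)) { w.Write("h\u00E9llo"); }
        })); // 68-C3-A9

        // Encoding.UTF8 requested explicitly: the BOM is written first.
        Console.WriteLine(FirstBytes(p =>
        {
            using (var w = new StreamWriter(p, false, Encoding.UTF8)) { w.Write("h\u00E9llo"); }
        })); // EF-BB-BF
    }
}
```

The point of the comparison is that "no encoding specified" and "Encoding.UTF8" are not the same thing, which is why code that exchanges data across machines should name its encoding explicitly.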

@svick @weshaggard Many thanks! Clear.
Given that using UTF-8 "everywhere" is a "breaking change" relative to Windows, is there any public discussion or PG conclusion (blog post) about this? I need it as a reference for the PowerShell repo discussion about default encoding.

@tarekgh do we have any docs on this?

The Remarks section of the following page discusses, in general terms, the differences in encoding support between the desktop framework and .NET Core:

https://msdn.microsoft.com/en-us/library/system.text.encodingprovider(v=vs.110).aspx

We don't have a specific doc for Encoding.Default, though.

One note: if you want the default encoding to be the same as the one returned on the desktop, you can register the provider and then call Encoding.GetEncoding(0).
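A minimal sketch of that registration step, assuming the System.Text.Encoding.CodePages package is referenced. The class and method names (DesktopDefaultDemo, GetDesktopStyleDefault) are illustrative and not part of any API. The encoding returned after registration depends on the platform and the system locale, so no particular output is implied.

```csharp
using System;
using System.Text;

public static class DesktopDefaultDemo
{
    // Registers the code-pages provider, then asks for code page 0,
    // which the thread describes as the way to get the desktop-style default.
    public static Encoding GetDesktopStyleDefault()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(0);
    }

    public static void Main()
    {
        // Before registration: the framework's own default.
        Console.WriteLine("Before: " + Encoding.GetEncoding(0).WebName);

        // After registration: may differ, depending on OS and system locale.
        Console.WriteLine("After:  " + GetDesktopStyleDefault().WebName);
    }
}
```

Note that registration is process-wide, so every later Encoding.GetEncoding(0) call in the process sees the changed result.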

@tarekgh Thanks! In the PowerShell repo we are now discussing whether a move to UTF-8 is appropriate. I can see that .NET Core has already taken this step. It would be helpful to see your discussion and justification, because you have likely analyzed the impact on other applications in depth.

Here are some reasons:

  • We (and the Windows team) have long advocated moving to Unicode encodings and avoiding non-Unicode ones. This is the general best practice for apps that want good globalization support.
  • After we became open source, and given that Linux/OSX already use UTF-8, having .NET Core default to UTF-8 gives better consistency across operating systems while also applying that best practice.
  • While working on .NET Native, we looked at how Store apps use Encoding in general. We found that fewer than 5% of such apps really care about the encoding, and having UTF-8 as the default was reasonable even for those apps.
  • We still support the other encodings through provider registration, and an app can get the same default encoding as on the desktop if it wants to.
  • The Windows team is working hard to support UTF-8 in the console, which will give better support for .NET Core console apps in general.

@tarekgh Many thanks for great comments!

@svick: Please note that the upcoming Encoding.Default - which you can already access in the current release, and also already officially via Encoding.GetEncoding(0) - is the UTF-8 encoding _with BOM_, which is problematic - shouldn't that be the _BOM-less_ UTF-8 encoding, in the interest of cross-platform support?

Another thing worth pointing out - though I strongly suspect it was a conscious decision:

  • While you can get support for _Windows_ legacy code pages via System.Text.CodePagesEncodingProvider from the System.Text.Encoding.CodePages NuGet package, even on Unix,

  • there seems to be no corresponding support for _Unix_ legacy encodings (perhaps because the - justifiable - assumption is that the Unix world has moved to UTF-8 a long time ago); while I'm sure there's significant overlap between these legacy Unix encodings and the Windows legacy encodings, the Unix set is larger: Encoding.GetEncodings().Length yields 140 encodings on Windows 10, while locale -m on Ubuntu 16.04 returns 235.

@mklement0 Do you actually need, for example, all 16 variants of EBCDIC that are in that list? I seriously doubt that is a common need.

@svick: As I said: it's justifiable (and makes sense to me personally) not to support these, but it's worth having clarity on that decision and documenting it.

By contrast, I do think that Encoding.Default returning a UTF-8 encoding _with BOM_ is a real problem.

@weshaggard Please clarify about BOM: based on your "we generally use UTF-8 without a BOM when no encoding is specified" why is UTF8Encoding.Default left UTF8withBOM?
(I see multiple new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true) and new UTF8Encoding(false) in the repo.)
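The difference between these instances shows up in their preambles (the BOM bytes an encoding asks writers to emit). The following sketch, with an illustrative PreambleDemo class, prints the preamble length for the variants mentioned above.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Number of BOM bytes the encoding asks writers to emit (0 = no BOM).
    public static int PreambleLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }

    public static void Main()
    {
        // The shared Encoding.UTF8 instance carries the 3-byte BOM EF BB BF.
        Console.WriteLine(PreambleLength(Encoding.UTF8));            // 3

        // The parameterless constructor and UTF8Encoding(false) do not.
        Console.WriteLine(PreambleLength(new UTF8Encoding()));       // 0
        Console.WriteLine(PreambleLength(new UTF8Encoding(false)));  // 0

        // The strict variant quoted above: no BOM, and invalid bytes throw
        // instead of being replaced.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        Console.WriteLine(PreambleLength(strict));                   // 0
    }
}
```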

I was basing that on https://github.com/dotnet/corefx/blob/151917b//src/System.Runtime.Extensions/src/System/IO/StreamWriter.cs#L77. As for the defaults in other places, I will leave that to @tarekgh to explain, as he has a much better understanding of this space than I do.

We are just using the default UTF8 instance, which has the BOM. In Console we make sure to turn off the BOM, and it looks like we are doing the same for IO.

@iSazonov What exactly is the problem you are facing when the BOM is enabled in the default encoding?

@tarekgh I found only a few (3-4) places in the repo where UTF8withBOM is used. In all other cases UTF8NoBOM is used. That's why I'm surprised that the BOM variant was chosen for the defaults.

We are discussing encodings in the PowerShell repo:
https://github.com/PowerShell/PowerShell/issues/3248 (the RFC: https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0020-DefaultFileEncoding.md)

In short there are two problems:

  1. The PG wants PowerShell Core 6.0 to be backward compatible with previous versions (ideally, scripts should work without modification). The links above show that historically there has been no single default; different subsystems use their own values.
  2. PowerShell Core 6.0 has been ported to Unix. The Unix default is UTF8NoBOM; the Windows default is UTF16LEwithBOM.

So my question here was just to understand whether we can make everything easier by using UTF8NoBOM as the default "everywhere" (though that may be a breaking change).

Is it possible for you to control the default encoding in PS? I mean, if it is UTF8, just ensure you use the version that has no BOM. This will ensure compatibility. We used the same idea with the Console.

If you think we need to change the behavior in .NET Core, we need to know the general impact on apps. But we can discuss that if PS cannot handle the issue.

I believe we can control the default encoding.

@joeyaiello @SteveL-MSFT could you please take part in the discussion?

@tarekgh PM on PowerShell here, we can absolutely control the default BOM behavior. I think more than anything, @iSazonov and @mklement0 were trying to understand .NET's train of thought on setting certain defaults in .NET Core/Standard, the impact to back- and cross-compat, and whether the overall Windows ecosystem has enough momentum in their built-in tools (conhost, Notepad, etc.) to support UTF-8 (BOM or BOM-less) by default.

What I'm hearing is that:

  • Yes, there is a bunch of momentum there, and while UTF-8 BOM is probably better for some things (e.g. Notepad will read it as UTF-8 rather than reading BOM-less as ANSI), there's a strong case for either UTF-8 default.
  • Don't be naive and fall out to Encoding.Default everywhere. We're already not doing that, we just offer Default and Oem as options on the -Encoding parameter for many cmdlets (though Default is not actually the default for any of them).

Unfortunately, on the PowerShell side, we also have to consider:

  • People use tons of different languages/alphabets in their interactive shell, and that could impact our codepage story. There's a conversation going on in PowerShell/PowerShell#3248 about Serbian using cyrillic and latin alphabets that's been particularly educational for me. In any case, we can ship System.Text.CodePagesEncodingProvider with PowerShell if we need to. (Personally, I'm not concerned about the "legacy encoding" side on Linux as PowerShell doesn't have a legacy on Linux, we only support modern Linux versions that .NET Core supports, and I'm not seeing a ton of those legacy encodings out in the wild.)
  • We want to push PowerShell Core 6 as the de-facto vNext of PowerShell even on downlevel platforms (like Win7/Server2008R2). That means we can't necessarily depend on the Win10 conhost. Even then, I think a move to UTF-8 would probably be fine, as we'd be handling the actual content parsing before render-time in the PowerShell cmdlet layer.

Thanks for your explanations. I certainly appreciate it, and I expect our community does as well. :+1:

Thanks, @joeyaiello.

While I think that PowerShell has all the support it needs from .NET Core at this point to implement its own behavior, there are some fundamental points worth making, including problematic aspects of .NET Core:

  • The _true default behavior_ of the methods in System.IO.File - in the sense of _what happens if no encoding is specified at all_, is _BOM-less_ UTF-8 - which is great, has always been that way, since the inception of the .NET Framework, and is in line with the rest of the world.

  • There is the longstanding problematic discrepancy between Encoding.UTF8 (_with_ BOM) and an instance of UTF8Encoding with the default constructor (_no_ BOM), and that probably won't go away for reasons of backward compatibility, but at least it only comes into play when someone _explicitly_ requests either encoding (and knowing the difference can result in an informed choice).

  • Encoding.Default on Windows in the .NET Framework reflects the _legacy_ "ANSI" encoding, which is based on the system locale and therefore culture-dependent - and, as all legacy code pages are, incompatible with other code pages. (Calling it Default without qualification - especially given that it's at odds with the framework's own true default (BOM-less UTF-8) - is unfortunate, but that ship has sailed a long time ago).

    • .NET Core has (justifiably) chosen not to support the legacy Windows code pages when running on Windows, which raises the question what Encoding.Default should represent in _Core_:

    • If it should - as a generalization of the .NET Framework meaning - be _the respective platform's default_, _if we take legacy encodings out of the picture, both on Windows and Unix_:

      • Unix: The choice is clear: UTF-8 _without BOM_
      • Windows: _Hypothetically_, UTF-16 LE, the officially recommended successor to "ANSI" code pages (which in the Windows world is regrettably conflated with _Unicode_ per se, failing to make the distinction between the abstract standard and a specific _encoding_ of that standard). In practice, we know that few Windows applications, especially console programs, are equipped to handle UTF-16 LE.
    • Another option is to take the opportunity to have Encoding.Default reflect the _framework's_ actual default: UTF-8 _without BOM_

    • Either way, having Encoding.Default return a UTF-8 encoding _with BOM_ is pointless and confusing, because it neither represents any platform's nor the framework's default.

  • The same goes for the equivalent Encoding.GetEncoding(0) call, which is problematic in another respect:

    • Changing the return value of Encoding.GetEncoding(0) from the .NET Core default to the active "ANSI" legacy code page after having called Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) from the System.Text.Encoding.CodePages package is confusing.

    • Why would the _platform_ / _framework_'s default change by registering additional character encodings? This side effect is unexpected.

@joeyaiello:

To address the _Windows_ side of things (to other readers: paragraphs specific to PowerShell are prefixed with "[PS]").


while UTF-8 BOM is probably better for some things (e.g. Notepad will read it as UTF-8 rather than reading BOM-less as ANSI), there's a strong case for either UTF-8 default.

I don't think there's a strong case for UTF-8 _with_ BOM:

  • On Unix, utilities _do not expect a BOM_ (more properly called a _Unicode signature_) and treat it as _data_, leading to unexpected results.

  • On Windows, _some_ applications, such as Notepad, are equipped to handle a BOM, but it's by no means consistent or pervasive. Also, if you have to write a _BOM_, it's arguably _not a default_.
    Default, at least in the Unix sense means: given _a raw stream of bytes without any metadata_ (and a BOM would qualify as metadata) to be interpreted as _text_, what encoding should _blindly_ be assumed?
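To illustrate what "treating the BOM as data" means, here is a small sketch (BomAsDataDemo and its members are illustrative names). It decodes the same five bytes twice: once blindly, the way a consumer that does not look for a BOM would, and once through a StreamReader with BOM detection enabled.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomAsDataDemo
{
    // "hi" preceded by the UTF-8 BOM (EF BB BF).
    public static readonly byte[] WithBom = { 0xEF, 0xBB, 0xBF, (byte)'h', (byte)'i' };

    // Blind decode: Encoding.GetString never strips a BOM, so the three
    // BOM bytes surface as the character U+FEFF at the start of the string.
    public static string DecodeBlindly(byte[] bytes)
    {
        return new UTF8Encoding(false).GetString(bytes);
    }

    // BOM-aware decode: StreamReader detects the BOM and skips it.
    public static string DecodeWithDetection(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes),
                                             new UTF8Encoding(false),
                                             detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        Console.WriteLine(DecodeBlindly(WithBom).Length);       // 3 (U+FEFF, 'h', 'i')
        Console.WriteLine(DecodeWithDetection(WithBom).Length); // 2 ('h', 'i')
    }
}
```

The extra invisible character in the first result is the kind of "unexpected result" a BOM causes in tools that do not look for one, for example when comparing strings or concatenating files.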


[PS]

we just offer Default and Oem as options on the -Encoding parameter for many cmdlets (though Default is not actually the default for any of them).

Despite what the PS docs state, Default _is_ the actual default on some cmdlets, crucially Set-Content and Get-Content, which is what PowerShell/PowerShell#3248 is about.

Not only that, but Default is also the encoding PowerShell _itself_ uses to read _source code_.


@iSazonov:

When you say that "Windows default encoding is UTF16LE", that only applies to the _Windows APIs_ and _in-memory strings_, not to _file_ encodings in any meaningful way; certainly not in the same, standardized way that:

  • Unix utilities consult the LC_CTYPE locale category to know what character encoding to apply.
  • Unix utilities _blindly_ assume that _any_ file uses that character encoding, treating all bytes as _data_ (no BOM detection).

The reality of file encoding on Windows is (spoiler alert: heavily legacy-leaning confusion):

  • Legacy Windows console programs generally know about OEM encoding only.

  • Standard Windows GUI applications such as Notepad and WordPad _do_ know about _BOMs_, but, in their absence, default to "ANSI" encoding - both on reading and writing. You still have to go out of your way to save "Unicode" (UTF-16 LE) or UTF-8 (_with_ BOM) files.

    • On opening a BOM-less UTF-8-encoded file, Notepad tries to guess the encoding - but that's all it is: a _guess_. In the absence of definitive encoding information (or assumptions), you cannot reliably distinguish between an "ANSI" file (single-byte, 8-bit encoding) and a UTF-8 file, because any valid UTF-8 byte sequence can also be read as "ANSI" text - see this excellent blog post.
  • Applications based on the .NET Framework that chose no explicit character encoding (or explicitly chose UTF-8 without a BOM), assume _BOM-less_ UTF-8 encoding on input, and create the same on output.
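The ambiguity between BOM-less UTF-8 and a single-byte encoding can be shown with a short sketch. As an assumption for illustration, ISO-8859-1 (code page 28591) stands in for a Windows "ANSI" code page here, because it is available without registering the code-pages provider; AmbiguityDemo and DecodeAs are hypothetical names.

```csharp
using System;
using System.Text;

public static class AmbiguityDemo
{
    // Encodes `text` with one encoding and decodes the bytes with another,
    // mimicking a reader that assumes the wrong encoding.
    public static string DecodeAs(string text, Encoding writtenAs, Encoding readAs)
    {
        return readAs.GetString(writtenAs.GetBytes(text));
    }

    public static void Main()
    {
        Encoding utf8NoBom = new UTF8Encoding(false);
        Encoding latin1 = Encoding.GetEncoding(28591); // stand-in for "ANSI"

        // U+00E9 (e with acute) is C3 A9 in UTF-8; read as Latin-1, those
        // two bytes are two valid characters, U+00C3 and U+00A9.
        string misread = DecodeAs("\u00E9", utf8NoBom, latin1);
        Console.WriteLine(misread.Length); // 2

        // Pure ASCII text is identical in both encodings, so nothing
        // in the bytes reveals which encoding was intended.
        Console.WriteLine(DecodeAs("plain", utf8NoBom, latin1)); // plain
    }
}
```

Neither decode fails, which is why a reader without a BOM or other metadata can only guess.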

Why would the platform / framework's default change by registering additional character encodings? This side effect is unexpected.

It is expected that the encoding provider changes the default behavior of the encoding. The point of the provider is not only to provide extra encodings but also to override encoding behavior if needed.

Either way, having Encoding.Default return a UTF-8 encoding with BOM is pointless and confusing, because it neither represents any platform's nor the framework's default.

I am inclined to agree with you here. We have been using, as the default UTF8, the instance returned from Encoding.UTF8, which has the BOM. So this looks like an old decision, made when Encoding.UTF8 was introduced, and we have carried this legacy since. If you feel strongly about it you may open a new issue and we'll try to look at it and decide if we can change this behavior.

@tarekgh Please clarify: should we create the new issue in this repo or in the CoreFX repo?

@tarekgh:

It is expected that the encoding provider changes the default behavior of the encoding.

Thanks for clarifying.
That behavior isn't obvious to me from the current documentation, so I've opened doc issue dotnet/docs/issues/1837.

@tarekgh

if you feel strongly about it you may open a new issue and we'll try to look at it and decide if we can change this behavior

Thanks. I've opened #10643 in the CoreCLR repo

(@iSazonov: I found the source for Encoding.Default in the CoreCLR repo, so I've opened the issue there.)

Thanks guys for opening the tracking issue.

@iSazonov HAHA, sorry I didn't see this. I asked the same question a while ago. It's inconsistent. Glad you guys got them to do something about it!
