Runtime: Console encoding defaults to UTF-8

Created on 14 Jul 2016  ·  35 comments  ·  Source: dotnet/runtime

With .NET Core, the default encoding on my W10 box is now UTF-8.
In the full .NET Framework, this was IBM437.

Running the code fragment below gives the following results:

.NET Core:
Speed 24 116,00 KB/s
Encoding: Unicode (UTF-8) | utf-8

.NET Framework:
Speed 24 116,82 KB/s
Encoding: OEM United States | IBM437

I could of course change the code page that the command prompt uses, but I'm not sure everyone is willing to do this.

using System;
using System.Globalization;

namespace ConsoleApp2
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var culture = new CultureInfo("nl-BE");
            var speed = 24116m;
            var formatted = string.Format(culture, "Speed {0:N2} KB/s", speed);

            Console.WriteLine(formatted);
            Console.WriteLine("Encoding: {0} | {1}", Console.OutputEncoding.EncodingName, Console.OutputEncoding.ToString());
        }
    }
}
area-System.Console

Most helpful comment

@couven92 this is expected, as discussed in this thread. If you want to get the desktop behavior, you can do so by adding the line

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

in your initialization code
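For context, a minimal sketch of what that opt-in looks like in a complete program (assuming the System.Text.Encoding.CodePages package is referenced on .NET Core 1.x; the namespace and class names are illustrative):

```csharp
using System;
using System.Text;

namespace ConsoleApp2
{
    public class Program
    {
        public static void Main(string[] args)
        {
            // Opt back in to the legacy code pages. On .NET Core 1.x this
            // requires a reference to the System.Text.Encoding.CodePages package.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

            // With the provider registered, Console.OutputEncoding resolves to
            // the console's actual OEM code page (e.g. IBM437) instead of
            // falling back to UTF-8.
            Console.WriteLine("Encoding: {0} | {1}",
                Console.OutputEncoding.EncodingName, Console.OutputEncoding);
        }
    }
}
```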

All 35 comments

This change was purposeful but I don't know much about the history of it. It was made in this commit, which says: "To make Console use the intended encoding in this scenario, the user will have to register CodePagesEncodingProvider or support codepage 0 in their custom provider."

This issue is mostly a duplicate of https://github.com/dotnet/corefx/issues/5704.

@stephentoub may have better context.

@stephentoub may have better context.

I don't unfortunately. @pallavit, could you shed some more light on this?

@pallavit isn't here anymore, so we'll need to guess at it ourselves.

Changing this behavior to allow 437 as the encoding would require we add 437 support to the Encoding class. When we determine the Console encoding, we have a helper that calls Encoding.GetEncoding(437), which throws a NotSupportedException which we catch and instead return a UTF8Encoding.

Considering the prevalence of CP437, this seems like bad behavior to me. IMO we should at least add some special casing for 437 so we don't have to try/catch every time.

Adding actual CP437 support to Encoding would be a different issue we'd need to resolve in /dotnet/coreclr/. I've got very little context there, unfortunately.

Another use case for CP437 is zip files: the zip format uses the CP437 code page as its default (see appendix D of the specification)

(The ZipArchive in .NET Core uses Encoding.Default, which seems incorrect)

(The ZipArchive in .NET Core uses Encoding.Default, which seems incorrect)

This isn't true anymore for writing. In https://github.com/dotnet/corefx/pull/9004 I added some checks to handle the unicode/CP437 file names issue. If a file name doesn't fit within the range of characters shared between CP437 and ASCII then we write it in unicode and set the zip unicode bit. Otherwise we write it in ASCII and leave the unicode bit unset.

However, there is still the potential for discrepancies when reading a zip. If the unicode bit is unset and the file name contains values outside of the shared ASCII/CP437 range, then those values will be interpreted as their numerical Unicode equivalents. So byte value 3 (normally ♥ in CP437) would be incorrectly read as U+0003 (a control character).

As to Encoding support, I followed up with pallavit and confirmed the removal of CP437 was intentional. This doc page describes the reasoning, but the gist of it is that .NET Core removed most of the encodings to save space and CP437 is one of the ones removed because of its similarities to ASCII. That page also has instructions on what to do to workaround the change.

@ianhays Thanks for the pointer. This line of code indeed gets you the CP437 encoding on .NET Core: CodePagesEncodingProvider.Instance.GetEncoding(437);, provided you reference the System.Text.Encoding.CodePages NuGet package.

For writing zip files, using UTF-8 if a non-ASCII character is present is indeed an elegant solution; for reading zip files, things are a bit messier in European languages: they tend to use a lot of the characters which are present in CP437 but not in ASCII, so you may get inconsistent behavior there.

Add to that that Windows still defaults to CP437 where possible: you can verify this by creating a file called Über.txt and zipping it using File Explorer. The file name will be encoded using CP437, but the ZipArchive class decodes it as �ber.txt.
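As a sketch of the reading-side workaround discussed here: ZipArchive has a constructor overload that takes an entryNameEncoding, which is applied to entries whose unicode bit is unset (the archive.zip path is illustrative; CP437 requires the System.Text.Encoding.CodePages package):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class Program
{
    static void Main()
    {
        // Make CP437 available (System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        using (var stream = File.OpenRead("archive.zip"))
        // Entry names without the unicode bit set are decoded as CP437,
        // so a name like "Über.txt" round-trips instead of becoming "�ber.txt".
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Read,
                                        leaveOpen: false,
                                        entryNameEncoding: Encoding.GetEncoding(437)))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
                Console.WriteLine(entry.FullName);
        }
    }
}
```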

@ianhays Would it help if I create a separate issue for this?

@ianhays (hope this doesn't sound too negative) How come plenty of APIs are re-introduced for the sake of compatibility, while "runtime" compatibility here is ignored?

The doc page that you linked to does not describe the rationale to me. I don't see any motivation as to why support for some encodings was removed.

things are a bit messier in European languages:

That's a good point. Language names with tilde'd characters and those sideways colons and whatnot will be marred.

@ianhays Would it help if I create a separate issue for this?

Yeah, we should consider doing something for this in ZipArchive. Maybe when reading we could try to use CP437 if it's registered via EncodingProvider when the unicode bit is unset? That's a fairly specific case that I doubt will be that common, but at least you would have the _option_ of making CP437 zip entry names unzip correctly. Ping me in the new issue and we can continue the discussion.

@ianhays (hope this doesn't sound too negative) How come plenty of APIs are re-introduced for the sake of compatibility, while "runtime" compatibility here is ignored?

The doc page that you linked to does not describe the rationale to me. I don't see any motivation as to why support for some encodings was removed.

@tarekgh @danmosemsft

@drieseng

The doc page that you linked to does not describe the rationale to me. I don't see any motivation as to why support for some encodings was removed.

We didn't really remove the support; instead we made it opt-in through the encoding provider, as the doc says. By default, apps should be using Unicode, and this is the case for the majority of apps. Apps that care about the non-Unicode code pages can still get the support by adding one line to their code

 Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

Having apps use Unicode by default is a good thing in general, and it keeps people from mistakenly using non-Unicode encodings without being aware of it (we have seen cases like this before). Whoever cares about non-Unicode code pages still has an easy way to get them. The other, secondary reason is that code page support has some cost too (both data and code), so enabling it by default means all apps pay that cost even though most apps don't really care about these encodings.

The question now is: do you think the opt-in solution is not good enough for scenarios like yours? And why?

Thanks,
Tarek

@tarekgh I don't have strong feelings - or a strong motivation - to change the current behavior. I just don't see why you'd want API compatibility, and at the same time expect applications to introduce specific code for .NET Core.

In the example I provided, I don't explicitly decide to do anything encoding related (and yes, I know it's definitely affected by the default encoding). To me, it's just the simplest application possible. And yet it doesn't yield the correct result.

How many Windows systems default to one of the (by default) supported encodings?

@tarekgh Well, I guess you could look at it this way, too:

  1. Shouldn't the code sample above "just work"?
  2. On most systems, Windows console doesn't use the UTF-8 encoding by default, yet .NET Core insists on sending it UTF-8 encoded data instead.

@ianhays (hope this doesn't sound too negative) How come plenty of APIs are re-introduced for the sake of compatibility, while "runtime" compatibility here is ignored?

@drieseng We care about both, but at this moment there is a big push to grow our API surface area up again.

@drieseng

In the example I provided, I don't explicitly decide to do anything encoding related (and yes, I know it's definitely affected by the default encoding). To me, it's just the simplest application possible. And yet it doesn't yield the correct result.

Actually, what you have said is a good example: if the default encoding affects your app's behavior, then you have to be conscious about the encoding. Imagine we supported 437 as the default encoding and you think your app is working fine, and then someone runs it with some other configuration and the app breaks because you didn't pay much attention to the encoding.

How many Windows systems default to one of the (by default) supported encodings?

The problem is not the default supported encoding. The problem is: does the default encoding support the characters you are using in the output string, and does the console display those characters correctly? Even supporting this default encoding doesn't guarantee that all characters in your output are displayed.

@qmfrederik

Shouldn't the code sample above "just work"?

Well, this is exactly my point. The code will work only with that specific configuration. With some other configuration on the system (I mean a different ACP), the problem can show up again even if we enable such encodings by default.

just to reiterate what really the problem is:

The problem here is not the UTF-8 encoding itself but how the console displays characters from this encoding. For the sample code mentioned in this issue, the problem is that the string contains the character "\u00A0", which is a no-break space. When using UTF-8, this character gets converted correctly to 0xC2 0xA0, but the console doesn't display it right. So this is a kind of console rendering issue. I'll talk to the Windows Console owners to find out if they can fix such issues.

I'll look at whether supporting the non-Unicode encodings by default would help much. But on your side, if you want to guarantee your app's behavior, you'll need to set the desired encoding on the console output and make sure those characters get displayed as expected in the console; that will make your app work seamlessly regardless of the machine configuration.
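Pinning the console output encoding from the app itself, as suggested here, could look like this minimal sketch (assuming the console font can actually render the characters):

```csharp
using System;
using System.Globalization;
using System.Text;

class Program
{
    static void Main()
    {
        // Force UTF-8 on stdout so the app's output does not depend on the
        // machine's configured console code page.
        Console.OutputEncoding = Encoding.UTF8;

        var culture = new CultureInfo("nl-BE");
        // N2 with nl-BE inserts a no-break space (U+00A0) as the group
        // separator, which is exactly the character that triggered this issue.
        Console.WriteLine(string.Format(culture, "Speed {0:N2} KB/s", 24116m));
    }
}
```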

thanks for your feedback.

@drieseng @qmfrederik Jumping in here as @tarekgh kindly alerted me to the conversation underway.

I own the Windows Console and Bash on Windows and wanted to provide a little perspective, assurance and background re. the Windows Console and text encoding:

The Windows Console has one of the longest-serving code-bases in all of Windows, but hasn't had consistent ownership for many years. That changed ~2 years ago when a new team was formed to own and improve the Windows Console.

Over the last couple of years, the team gradually started making improvements to the console as our understanding of the Console's internals increased. Initial changes included line-selection, transparent backgrounds, fixes for many rendering and layout issues, significant performance improvements, etc.

In Windows 10 Anniversary Update, the team made HUGE improvements to the Console in order to support Unix and Linux applications running on the new Ubuntu for Windows atop the Windows Subsystem for Linux (WSL). These changes included adding extensive virtual-terminal (VT) sequence support, improved color handling, big improvements to better handling international characters & globalized text, improved font selection, etc.

By the time the team delivered our final bits for Win10AU, we had a comprehensive understanding of how the Console works, including its edge-cases.

The team has now begun the biggest overhaul to date of the Windows Console. We've many, MANY improvements planned, including better support for how we render and handle various text encodings, character sets, fonts, layout, rendering performance, input methods, etc.

The changes we have planned will dramatically improve how the Console handles UTF-8 text in particular. You'll see far fewer issues with rendering of UTF-x encoded text and should enjoy a far richer, more capable Console experience over-all.

You'll start to see these improvements arrive in the coming weeks and months as we drive towards our next major Windows 10 release. If you're interested in being among the first to try-out our changes, be sure to join the Windows Insider program and fire-up your fast-ring updates. We appreciate any and all feedback you can share to make sure we're delivering the Console you need and deserve.

The issue also occurs for Nordic languages. For:

Console.WriteLine("Ålesund");
Console.WriteLine(Console.OutputEncoding.WebName);
Console.WriteLine($"Code Page: {Console.OutputEncoding.CodePage}");

Output in .NET Framework 4.6.1:

ibm850
Code Page: 850

In .NET Core:

Ålesund
utf-8
Code Page: 65001

@couven92 this is expected, as discussed in this thread. If you want to get the desktop behavior, you can do so by adding the line

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

in your initialization code

Closing issue as expected behavior.

I'm seeing a related problem, where although it is set to UTF-8, it doesn't actually work as UTF-8. The issue can be demonstrated as follows:

  1. Start app
  2. Read Console.OutputEncoding. It says it's UTF-8
  3. Display output. It does not come out in UTF-8, even though that was reported as the current encoding.
  4. Set the encoding to UTF-8. (I.e. set it to its existing value).
  5. Display output again. Now it does come out in UTF-8.

This definitely seems like a bug to me, since it's a case where setting a property to its existing value changes behavior. In this case, the bug is present before the "set", since at that point in the process the getter is returning a value which clearly is not actually being used.
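A sketch of the repro being described (the exact mangling depends on OS and terminal; this only illustrates the sequence of steps):

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Step 2: the getter already reports UTF-8...
        Console.WriteLine(Console.OutputEncoding.WebName);

        // Step 3: ...yet non-ASCII output may still come out mangled here.
        Console.WriteLine("héllo \u263A");

        // Step 4: assign the property its existing value...
        Console.OutputEncoding = Encoding.UTF8;

        // Step 5: ...after which the same string renders correctly.
        Console.WriteLine("héllo \u263A");
    }
}
```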

From https://github.com/dotnet/corefx/issues/15986#issuecomment-313146731

@JohnRusk which version of .NET Core are you using? And which version and build of the OS are you using?

@tarekgh I've repro'd it on both Windows and Ubuntu 16.04. On Windows here's my version info:

.NET Command Line Tools (1.0.4)

Product Information:
Version: 1.0.4
Commit SHA-1 hash: af1e6684fd

Runtime Environment:
OS Name: Windows
OS Version: 10.0.14393
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\1.0.4

And on Ubuntu:

.NET Command Line Tools (1.0.4)

Product Information:
Version: 1.0.4
Commit SHA-1 hash: af1e6684fd

Runtime Environment:
OS Name: ubuntu
OS Version: 16.04
OS Platform: Linux
RID: ubuntu.16.04-x64
Base Path: /usr/share/dotnet/sdk/1.0.4

Microsoft .NET Core Shared Framework Host

Version : 2.0.0-preview2-25407-01
Build : 40c565230930ead58a50719c0ec799df77bddee9

@JohnRusk just to confirm, is the value of the "TargetFramework" element in your csproj 1.1? If so, could you please try the 2.0 version.

https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/dogfooding.md

Couple of notes here:

  • On Windows build 14393, the console window has some problems with the UTF-8 encoding, so the displayed text can be rendered wrong. You can also run the chcp command to find out the default code page of that console.
  • On Linux, please confirm that the default encoding on your system is UTF-8 and not set to something else.
  • Before you try .NET Core 2.0 as I suggested, could you please add the following line to your app initialization code:
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
    and let me know if you notice a difference when reading Console.OutputEncoding the first time.

I appreciate your help here.

Thanks @tarekgh. Yes, my target framework is 1.1.

What do you mean by "On Linux, please confirm your default encoding in the system is UTF8 and not set to something else."? If you mean "check that the initial value of Console.OutputEncoding, before I set it, really is UTF-8", then I'm fairly sure I've already done that test.

I'm afraid I don't think I'll have time to run the other tests soon, I'm sorry. This month is a bit busy for me...

What do you mean by "On Linux, please confirm your default encoding in the system is UTF8 and not set to something else."

What I mean is: in your Linux terminal, run the command "locale" and look at the value of the LC_CTYPE property, which tells you which default encoding is used on your system.

I'm aware this thread seems closed, but the issue still exists. Yesterday I tried a freshly downloaded VS2017 with the ubiquitous HelloWorld and encountered this behavior in .NET Core, while .NET Framework works fine. I should stress what @JohnRusk already said:

string Message = "Привет, .NET Core!"; // a localized string containing Cyrillic
Console.WriteLine(Message); // prints garbage instead of non-ASCII letters
Console.WriteLine(Console.OutputEncoding); // outputs System.Text.UTF8Encoding
Console.OutputEncoding = System.Text.Encoding.UTF8; // seems unnecessary, but it corrects everything!
Console.WriteLine(Console.OutputEncoding); // again outputs System.Text.UTF8Encoding, as if nothing changed
Console.WriteLine(Message); // surprisingly prints the correctly localized string

Again: the same HelloWorld with a localized string compiled on .NET Framework works fine and does not require any workarounds with encoding.
It seems that the initial default encoding set by the runtime works incorrectly in .NET Core.

@aalub This issue is closed with milestone set to 2.0.0. That means that .Net Core 1.x still has the issue (and requires a workaround). It will be fixed in .Net Core 2.0, which means you can either wait until that's released later this quarter or you can try a preview version.

@aalub I assume you are running on Windows. please confirm. Did you use the following line in your app initialization?

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

Also, it will be helpful to print Console.OutputEncoding before you print the first message.

@aalub I assume you are running on Windows. please confirm.

Yes, confirm. Windows 8.1 Enterprise x64, localized.

Did you use the following line in your app initialization?

No. The code quoted above is the entire contents of the Main() method. It was just a "HelloWorld", you know. And my .NET Core instance (neither 1.0 nor 1.1) won't find the name CodePagesEncodingProvider, sorry.

Also, it will be helpful to print Console.OutputEncoding before you print the first message.

Tested. The same. It outputs System.Text.UTF8Encoding even if it's the only statement in Main().

@aalub This issue is closed with milestone set to 2.0.0. That means that .Net Core 1.x still has the issue (and requires a workaround). It will be fixed in .Net Core 2.0

@svick, thank you very much! You are helpful.

It will be fixed in .Net Core 2.0

@svick, yes, dotnet/cli shows the issue is gone in .NET Core 2.0 preview 2: a console program prints correct text without any workarounds. Still, I can't see 2.0 in Visual Studio 2017: I use version 15.2, which they say is hard-coded for 1.x, and 2.0 can only be installed on VS2017 version 15.3, which is a preview. So I'll wait. Till then, Console.OutputEncoding = System.Text.Encoding.UTF8 is simple and corrects things.

Thanks again.

And my .NET Core instance (neither 1.0 nor 1.1) won't find the name CodePagesEncodingProvider, sorry.

Could you please add a reference to the package System.Text.Encoding.CodePages and then add the line

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

in your app initialization, and let us know if you see any difference.

Could you please add a reference to the package System.Text.Encoding.CodePages and then add the line

Yes, I see now. After adding the System.Text.Encoding.CodePages NuGet package the name got resolved, and there it is:

static void Main(string[] args)
{
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    Console.WriteLine(Console.OutputEncoding); // prints System.Text.SBCSCodePageEncoding
    string Message = "Привет, .NET Core!";
    Console.WriteLine(Message); // prints correctly localized text
    Console.WriteLine(Console.OutputEncoding); // prints System.Text.SBCSCodePageEncoding again
    Console.OutputEncoding = System.Text.Encoding.UTF8;
    Console.WriteLine(Console.OutputEncoding); // prints System.Text.UTF8Encoding
    Console.WriteLine(Message); // again correct Cyrillic text
}

It seems more consistent than the aforementioned superstitious Console.OutputEncoding = System.Text.Encoding.UTF8, but it takes one or two more small steps and much more knowledge and experience.

Thank you very much!

I use .Net Core 2.0 but it doesn't have a definition for CodePagesEncodingProvider, so I can't use the workaround suggested before:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

I tried using System.Text.Encoding.CodePages; but it is also absent from Encoding.

@algot you also need to add a package reference to the System.Text.Encoding.CodePages package.

Ok, I'm lost. I'm using dotnet version 2.1.2 on Windows 10 with VS2017 v15.5.2.
I was trying to print the smiley face using '\u263A' but was getting a question mark instead. After adding the following line as the first statement of my Main method, the smiley face printed correctly.
Console.OutputEncoding = System.Text.Encoding.UTF8;

Would I have to add that line to every console application that uses special characters? Please explain what is going on like I'm 5 years old, thank you.

@navarrorc Okay, first you need to add the System.Text.Encoding.CodePages NuGet package to your netstandard or netcoreapp project.
Then add the following line to the first line of your Main method:

System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);

Normally, you should not set the OutputEncoding of your Console (unless you redirect the output of your app to a file that requires a specific encoding, which in most cases will be very unusual, and typically you'd do some other tricks instead in such cases).

But with the line above, your app will be able to correctly convert .NET's native UTF-16 strings to whatever Encoding your OS would like to use.

On Linux and Mac OSX a console will mostly use UTF-8 encoding, or whatever some magic environment variables like LANG dictate. This will also be different depending on what shell (i.e. bash, sh, dash or zsh) you are using.
On Windows, because of backward compatibility, the terminal will use a very weird old so-called code page encoding that is specific to the operating system language. For example, on Windows with European language settings (like German, French or Scandinavian) it will use the IBM-850 code page.

On newer Windows (i.e. from Windows 10) the terminal comes with TrueType font support, which means it is now capable of displaying UTF-8 encoded text in emulation mode. So yes, you can change the output encoding to UTF-8 on Windows 10, but on all older Windows versions this will not work and will just cause the terminal to mangle the output.

Note that when you're using some other console, like Cygwin, Msys Shell or Git Bash, you might be subject to a weird combination of some or all these effects, since these shells try to merge Windows and Linux behaviour.

