PowerShell: Make console windows fully UTF-8 by default on Windows, in line with the behavior on Unix-like platforms

Created on 5 Jul 2018 · 12 Comments · Source: PowerShell/PowerShell

PowerShell Core now commendably defaults to UTF-8 encoding, including when sending strings _to_ _external_ programs, as reflected in $OutputEncoding's default value.

However, because the console-window shortcut file / taskbar entry still defaults to the OEM code page implied by the legacy system locale (e.g. 437 on US-English systems), it misinterprets strings _from_ external programs; e.g., with Node.js installed:

PSCoreOnWin> $captured = '€' | node -pe "require('fs').readFileSync(0).toString().trim()"; $captured
Γé¼    # !! node's UTF-8 output was misinterpreted.
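The garbled result can be reproduced without Node.js. The following is a minimal sketch, assuming OEM code page 437 as in the US-English example above; the variable names are illustrative only. It decodes the euro sign's three UTF-8 bytes once as CP437 and once as UTF-8:

```powershell
# Build the euro sign from its code point so the sketch does not depend
# on the encoding of the file it is saved in.
$euro = [string][char]0x20AC

# The bytes a UTF-8 program such as node emits for it: 0xE2 0x82 0xAC.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($euro)

# Decoding those bytes as OEM code page 437 maps each byte to its own character.
$garbled = [System.Text.Encoding]::GetEncoding(437).GetString($utf8Bytes)

# Decoding them as UTF-8 recovers the original character.
$correct = [System.Text.Encoding]::UTF8.GetString($utf8Bytes)

$garbled   # Γé¼
$correct   # €
```

Which decoder PowerShell applies to captured output of external programs is determined by [console]::OutputEncoding, which is why the setting matters here.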

Avoiding this misinterpretation currently requires the following _workaround_ (the console window must also use a TrueType font, which is the default on Windows 10):

[console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding

Prepend $OutputEncoding = to make a _Windows PowerShell_ console fully UTF-8-aware.

The above implicitly switches to the UTF-8 code page (65001), as then reflected in chcp.

This obscure workaround shouldn't be necessary, and I think it would make sense for PowerShell to automatically set [console]::InputEncoding and [console]::OutputEncoding to (BOM-less) UTF-8 on startup.
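Until PowerShell does this itself, the workaround can also be applied only for the duration of a single call. The following is a sketch under stated assumptions: Invoke-WithUtf8Console is a hypothetical helper name, not an existing PowerShell command, and it assumes the session is attached to a console so that the encodings can be changed.

```powershell
# Hypothetical helper (not part of PowerShell): runs a script block with the
# console encodings temporarily set to BOM-less UTF-8, then restores them.
function Invoke-WithUtf8Console {
    param(
        [Parameter(Mandatory)]
        [scriptblock] $ScriptBlock
    )

    # Remember the current encodings so they can be restored afterwards.
    $previousInput  = [console]::InputEncoding
    $previousOutput = [console]::OutputEncoding
    try {
        # The parameterless constructor creates a UTF-8 encoding without BOM.
        [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
        & $ScriptBlock
    }
    finally {
        [console]::InputEncoding  = $previousInput
        [console]::OutputEncoding = $previousOutput
    }
}
```

The Node.js call from the example above could then be wrapped in Invoke-WithUtf8Console { ... }. In Windows PowerShell, $OutputEncoding would additionally have to be set for data sent _to_ the external program.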

_Update_: When this issue was originally created, there was no mechanism for presetting code page 65001 (UTF-8) _system-wide_, which necessitated the awkward workaround. In recent versions of Windows 10 it _is_ now possible to switch to code page 65001 _as the system locale_ and therefore system-wide, although as of Windows 10 version 1909 that feature is still in _beta_ - see this SO answer.

  • _Caveat_: In addition to defaulting the _OEM_ code page to 65001 in all console windows (including cmd.exe windows), this invariably also makes _Windows PowerShell_'s _ANSI_-encoding-default cmdlets default to UTF-8, notably Get-Content and Set-Content, which can be problematic from a backward-compatibility perspective.
    Additionally, there is a _bug_ - see below.

The change, which can also be made programmatically (see below), requires administrative privileges and a reboot.

Environment data

PowerShell Core 7.1.0-preview.3 on Windows 10
WG-Interactive-Console

All 12 comments

It is a platform default:
https://source.dot.net/#System.Console/System/Console.cs,a570cd79bd33ceab
https://source.dot.net/#System.Console/System/ConsolePal.Windows.cs,c997db0e94f0d1cc
https://source.dot.net/#System.Console/Common/Interop/Windows/Interop.GetConsoleOutputCP.cs,f028312cfc964730

So we need to do [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding at PowerShell Core startup. @mklement0 Is this the right fix for all platforms and Windows versions (Windows 7?)?

Thanks for the sleuthing, @iSazonov.

Yes, I think the fix is also appropriate for Windows 7:

While you're more likely to run into problems with standard console programs there that can even _break_ with UTF-8 input, I think it's more important for PowerShell Core to exhibit consistent encoding behavior and to support modern, cross-platform utilities that natively speak UTF-8 by default.

@iSazonov: Forgot to clarify: It is only the right fix for _Windows_ - on Unix-like platforms the CoreFx default should be used, as discussed in #7634 (even though there's a CoreFx fix pending).

I hope @JamesWTruher could comment. I think he considered this while writing and implementing the Encoding RFC.

Since Windows 7 is EOL and the community is migrating to Windows 10, it seems time to switch the console default to UTF-8 on Windows.

/cc @SteveL-MSFT

@nu8, are you using Windows PowerShell? In PowerShell Core, the default encoding for Get-Content on files has been UTF8NoBOM since https://github.com/PowerShell/PowerShell/pull/5080.

@nu8, your problem is unrelated to the active OEM code page, and it is what @KalleOlaviNiemitalo states: if you have a _BOM-less_ UTF-8 file, _Windows PowerShell_ interprets it as ANSI-encoded.

The OEM code page only matters with respect to _external programs_, and only if their output is captured or redirected.

Furthermore, to make the [console]::InputEncoding = [console]::OutputEncoding = ... solution complete in _Windows PowerShell_ (not needed in Core), you must also set $OutputEncoding to [System.Text.UTF8Encoding]::new(), so as to make PowerShell also use UTF-8 when piping data _to_ an external program:

$OutputEncoding = [console]::InputEncoding =
                  [console]::OutputEncoding =
                  [System.Text.UTF8Encoding]::new()

While your registry-based way of setting the OEM code page to UTF-8 (65001) obviates the need for setting [console]::InputEncoding and [console]::OutputEncoding, in _Windows PowerShell_ the need for setting $OutputEncoding remains.

In _PowerShell Core_, the registry-based approach is sufficient (except for possibly switching to a font with more complete Unicode support).

Note that the registry-based approach is the (potentially programmatic) equivalent of the aforementioned GUI method (via Control Panel, intl.cpl, tab Administrative, Change system locale...); as previously noted, the GUI still labels this setting as Beta as of Windows 10 release 1909, though I suspect it will work fine as long as you use only modern command-line utilities.

I'm talking about external programs, because that is what this issue is about, and it is what the [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding workaround is for.

Get-Content is indeed not an external program, which is why I pointed out that your problem is indeed unrelated to this issue (in the remainder of my response I've merely provided more context to your attempt to solve the original, external-program-related issue).

I've also explained the reason for your unrelated Get-Content problem.
To spell out the solution: you must create your UTF-8 files _with a BOM_ in order for _Windows PowerShell_ to recognize them as UTF-8-encoded; for more information, see this Stack Overflow answer.

If you have further questions, I suggest you use the following community resources:

@nu8

Yes, of course you can always use -Encoding explicitly to specify the source encoding (which is ignored if a BOM indicates a different encoding, as an aside).

My intent was to show you how to create UTF-8-encoded files that Windows PowerShell recognizes as such _by default_.
The previously linked SO answer also demonstrates how you can use the $PSDefaultParameterValues hash table to _preset_ defaults ($PSDefaultParameterValues['*:Encoding'] = 'utf8'), though that needs careful managing to ensure that it only applies where you want it to.
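One way to keep such presets narrow is to name the cmdlets explicitly instead of using the '*:Encoding' wildcard. The following is a sketch only; the file content and variable names are illustrative, and the presets are removed again at the end so that they do not leak into the rest of the session:

```powershell
# Preset UTF-8 only for the cmdlets named here, not for every command
# that happens to have an -Encoding parameter.
$PSDefaultParameterValues['Get-Content:Encoding'] = 'utf8'
$PSDefaultParameterValues['Set-Content:Encoding'] = 'utf8'

$tempFile = [System.IO.Path]::GetTempFileName()
try {
    # Write a BOM-less UTF-8 file via .NET; the "é" is built from its code
    # point so the sketch does not depend on this file's own encoding.
    $text = 'caf' + [char]0xE9
    [System.IO.File]::WriteAllText($tempFile, $text, [System.Text.UTF8Encoding]::new($false))

    # With the preset in place, no explicit -Encoding is needed here.
    $roundTripped = Get-Content -LiteralPath $tempFile
}
finally {
    Remove-Item -LiteralPath $tempFile
    $PSDefaultParameterValues.Remove('Get-Content:Encoding')
    $PSDefaultParameterValues.Remove('Set-Content:Encoding')
}

$roundTripped -ceq $text   # whether the text survived the round trip
```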

Changing the legacy system locale (language for non-Unicode programs) to 65001 does _not_ help with that, because it doesn't change the _ANSI_ code page (only the _OEM_ code page): Even after making that change, with chcp then reporting 65001, (Get-Culture).TextInfo.ANSICodePage continues to report the culture-appropriate code page, e.g., 1252 on US-English systems, so Get-Content a.txt will continue to misinterpret your file in Windows PowerShell.

While there _is_ an ACP registry value that controls the ANSI code page - just like OEMCP controls the OEM code page - using the GUI method changes only the OEMCP value - for good reasons, I suspect.
I personally wouldn't even attempt to set ACP to 65001 - I don't think that is supported, and I suspect it'll either wreak havoc or will be ignored.

To summarize:

  • This issue is about _external (console) programs_ in _console windows_, as clearly mentioned in the body of the OP, and as explained later; it is about making PowerShell Core's UTF-8 support _complete_ by also using UTF-8 consistently when communicating with external console programs. PowerShell-native commands already do consistently default to UTF-8.

  • Your Get-Content problem is unrelated; it only surfaces in Windows PowerShell, and it is only related to the system's active _ANSI_ code page and how PowerShell - which is internally UTF-16-based (.NET strings) - interprets content read from _files_; the behavior is independent of the PowerShell host application (whether it is a console or not).

  • Please generally note that this repo is for reporting PowerShell _Core_ issues only - for _Windows PowerShell_ issues, use the UserVoice forum.

You should probably stop suggesting people use BOM

I've suggested a BOM to _work around_ an issue with the _legacy_ shell _Windows PowerShell_.
Yes, avoiding a BOM is the better approach, especially in cross-platform code, and, fortunately, _PowerShell Core_ now defaults to UTF-8.

(For _cross-edition_ PowerShell source code, UTF-8 _with BOM_ continues to be your only option if the source code contains (runtime-relevant) non-ASCII characters.)

Locale has to do with Language, not code page.

In Windows, locales _imply_ code pages. In the GUI, you pick a language (+ region/country), which makes the code pages associated with the chosen locale the active ones.

So I don't see what the point would be in changing an ANSI code page, as we are
dealing with Unicode here.

My point was that just as you're changing the OEM code page to 65001 (UTF-8), you may be tempted to set the ANSI code page to this value in order to change Get-Content's default (see below) - but that isn't supported.

Get-Content a.txt works just fine after making the Registry change I
suggested.

Much to my surprise, this is indeed the case - even though the ANSI code page _isn't_ changed by your registry approach (or via intl.cpl).

I therefore recommend against setting the OEM code page to 65001 _system-wide_ if you're using _Windows PowerShell_, because you'll break any existing code that was written based on the assumption that Get-Content defaults to the active ANSI code page.

In other words: all (by definition BOM-less) ANSI-encoded files (which is historically common on Windows) will then be misinterpreted as UTF-8 by Get-Content (in the absence of -Encoding).
Similarly, Set-Content would then create BOM-less UTF-8 files rather than ANSI files by default.

If you use the workaround suggested in the OP, you won't have that problem (but you'll need to use something like $PSDefaultParameterValues['*:Encoding'] = 'utf8' for _your_ code to default to UTF-8.)

Of course, if you're willing to assume that you'll only ever come across BOM-less files that are UTF-8-encoded (rather than ANSI-encoded) and that no (third-party) code ever runs that relies on ANSI being the default, your approach would work.

(It is a separate issue that it is currently needlessly difficult to read ANSI-encoded files in _PowerShell Core_, because the -Encoding parameter supports no ANSI value - see #6562).

Then the title should reflect that.

hoo boy.

I mean really, how common is that going to be?

The first commandment of _Windows PowerShell_ has always been: _backward compatibility_.
(That commandment is fraying around the edges in _PowerShell [Core]_.)

Your solution breaks backward compatibility, and I pointed out how.


I notice you didn't give a single example
perhaps you can point out some high profile ANSI files that are included with Windows, or some that I am likely to come across online?
Yes, I am willing to do that

hoo boy.

Let me remind you again: This repo is for questions about _PowerShell [Core]_, not _Windows PowerShell_.

Your concerns are a non-issue in PowerShell [Core], which strikes me as a good reason to switch to it. If there's something holding you back - such as the startup-performance issue you linked to (#6443) - I suggest you focus on solving that.

Let me try to summarize, now that we (hopefully) have the full picture:

I've hidden my previous comments in favor of this one, @nu8 - I encourage you to do the same, as appropriate. This comment also corrects my incorrect earlier claim that you cannot set the ANSI code page to 65001.

This issue is about making UTF-8 support in PowerShell on Windows _complete_, by making sure that PowerShell also uses UTF-8 when communicating with _external programs_ (the built-in cmdlets already default to UTF-8, invariably so), which requires setting [console]::InputEncoding and [console]::OutputEncoding to (BOM-less) UTF-8 (possibly indirectly).


Currently, in the absence of PowerShell doing that itself, there are two workarounds:

Option 1: Put the following statement in your $PROFILE:

# In *Windows PowerShell*, prepend `$OutputEncoding = `
[console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

Pros and cons:

  • Doesn't require administrative privileges and takes effect in new windows without the need for a reboot.
  • Requires modifying $PROFILE
  • Is bypassed if the CLI is used as pwsh -noprofile ...

Note: In Windows PowerShell, you must prepend $OutputEncoding = to the above command, in order to also make Windows PowerShell send UTF-8 _to_ external programs. (In PowerShell [Core], this preference variable commendably defaults to (BOM-less) UTF-8.)
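If the same $PROFILE content is shared between both editions, the edition check can be made part of the statement. A minimal sketch, assuming a PowerShell version that populates $PSVersionTable.PSEdition (5.1 or later); the variable name $utf8NoBom is illustrative:

```powershell
# The parameterless constructor creates a UTF-8 encoding without BOM.
$utf8NoBom = [System.Text.UTF8Encoding]::new()

# Decode output from, and console input for, external programs as UTF-8.
[console]::InputEncoding = [console]::OutputEncoding = $utf8NoBom

# Only Windows PowerShell needs $OutputEncoding set explicitly;
# PowerShell [Core] already defaults it to BOM-less UTF-8.
if ($PSVersionTable.PSEdition -ne 'Core') {
    $OutputEncoding = $utf8NoBom
}
```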


Option 2: Change the active code pages to 65001 _system-wide_ (W10+):

  • GUI method: via intl.cpl (Control Panel), tab Administrative, Change system locale...; as previously noted, the GUI still labels this setting as Beta as of Windows 10 release 1909, though I suspect it will work fine as long as you use only modern command-line utilities.

  • Equivalent programmatic method, based on @nu8's approach:

# Requires ELEVATION and a REBOOT
'ACP', 'OEMCP', 'MACCP' | Set-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage -Name { $_ } 65001
# Restart-Computer
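Before overwriting the values, it may be worth recording the current ones so that the change can be reverted later. The following read-only sketch needs no elevation; the variable names are illustrative, and the guard merely skips the registry access on non-Windows platforms:

```powershell
$nlsKey = 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage'

if ($env:OS -eq 'Windows_NT') {
    # Read the three values that the statement above overwrites.
    $codePages = Get-ItemProperty -LiteralPath $nlsKey -Name ACP, OEMCP, MACCP |
        Select-Object ACP, OEMCP, MACCP
    $codePages
}
else {
    Write-Warning 'The code-page registry values exist on Windows only.'
}
```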

Pros and cons:

  • Due to a .NET bug still present in the .NET version underlying PowerShell Core 7.1.0-preview.3, [console]::InputEncoding and [console]::OutputEncoding are mistakenly set to UTF-8 encoding _with BOM_, which causes follow-on bugs; notably, it breaks Start-Job in PowerShell. See https://github.com/dotnet/runtime/issues/28929. Option 1 above doesn't have this problem.

    • Curiously, by contrast, the [System.Text.Encoding]::Default encoding that reflects the active ANSI code page contains a _BOM-less_ UTF-8 encoding after the system-wide change (see below).

    • Note that the bug can also manifest without the system-wide change, namely if you manually run chcp 65001 from cmd.exe, for instance, _before_ invoking PowerShell (running chcp from _inside_ PowerShell isn't supported and requires Option 1 instead).

  • Requires administrative privileges and a reboot.

  • Takes effect _system-wide_: it applies to all console / Windows Terminal windows, notably including those running cmd.exe

  • Invariably also uses UTF-8 as the _ANSI_ code page (not just the _OEM_ code page), as reflected in [System.Text.Encoding]::Default (note that this also applies if you set _only_ the OEMCP registry value to 65001; (Get-Culture).TextInfo.ANSICodePage, by contrast, continues to report the locale-appropriate code page, e.g. 1252).

    • If you're (also) running _Windows PowerShell_, this means that the setting invariably makes _Windows PowerShell_'s _ANSI_-encoding-default cmdlets default to UTF-8, notably Get-Content and Set-Content, which, depending on your backward-compatibility needs:

      • may be desirable for consistent UTF-8 use across both PowerShell editions.

      • Note: short of placing $PSDefaultParameterValues['*:Encoding'] = 'utf8' in your $PROFILE, this is the only way to get Windows PowerShell to consistently default to UTF-8.

        Curiously, the system-wide change causes Windows PowerShell to then create _BOM-less_ UTF-8 files by default with Set-Content, something that cannot otherwise be achieved, except with direct use of .NET.

      • may be undesired, if you have existing code that uses Get-Content and Set-Content without -Encoding and you need to process BOM-less files that are ANSI- rather than UTF-8-encoded.

    • Also, in _Windows PowerShell_ only, you must additionally still run
      $OutputEncoding = [System.Text.Utf8Encoding]::new() (via $PROFILE) in order to also make Windows PowerShell send UTF-8 _to_ external programs.
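To see which of these settings a given session actually ended up with, the values mentioned above can be collected side by side. This is only an inspection sketch; the property names of the report object are made up for illustration:

```powershell
$report = [pscustomobject]@{
    # Encodings used when talking to external programs.
    ConsoleInputCodePage   = [console]::InputEncoding.CodePage
    ConsoleOutputCodePage  = [console]::OutputEncoding.CodePage
    OutputEncodingCodePage = $OutputEncoding.CodePage

    # Code pages implied by the current culture.
    CultureAnsiCodePage    = (Get-Culture).TextInfo.ANSICodePage
    CultureOemCodePage     = (Get-Culture).TextInfo.OEMCodePage

    # Reflects the active ANSI code page in Windows PowerShell (.NET Framework);
    # on .NET Core, which underlies PowerShell [Core], it is always UTF-8.
    DotNetDefaultCodePage  = [System.Text.Encoding]::Default.CodePage
}
$report
```

A value of 65001 denotes UTF-8; running the sketch before and after the system-wide change shows which entries it affects.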

A note on file encoding:

If making _Windows PowerShell_ also default to UTF-8 via the system-wide change is not an option, BOM-less UTF-8 files will only be read correctly under one of the following conditions:

  • you use -Encoding Utf8 with file-handling cmdlets.
  • you convert your BOM-less UTF-8 files to have a BOM

    • such files can be problematic in cross-platform use; on Unix-like platforms, a UTF-8 BOM can be misinterpreted as data
    • conversely, if you write PowerShell code that contains (runtime-relevant) non-ASCII characters and needs to run in both editions, saving your source code files as UTF-8 _with BOM_ is a must (though you could also use UTF-16).
  • you preset the default encoding via $PSDefaultParameterValues['*:Encoding'] = 'utf8', but you'll have to _scope_ this setting if you don't want _all_ code to use these defaults.

Note that Windows PowerShell - curiously, _except_ if the system-wide change is made - only ever creates UTF-8 files _with BOM_ (whereas PowerShell [Core] defaults to _BOM-less_ UTF-8 and has an -Encoding utf8BOM opt-in); direct use of .NET is required to work around that - see this SO answer.
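The direct use of .NET mentioned above can be sketched as follows. Test-Utf8Bom is a hypothetical helper name, and the temporary file exists only for the demonstration. Note that .NET methods resolve relative paths against the process's working directory rather than PowerShell's current location, so full paths are the safe choice:

```powershell
# Hypothetical helper: reports whether a file starts with the UTF-8 BOM (EF BB BF).
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string] $LiteralPath)

    # Reads the whole file, which is fine for a sketch; pass a full path.
    $bytes = [System.IO.File]::ReadAllBytes($LiteralPath)
    $bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF
}

$tempFile = [System.IO.Path]::GetTempFileName()
try {
    # $false: do not emit a BOM. Works the same in both PowerShell editions.
    [System.IO.File]::WriteAllText($tempFile, 'text', [System.Text.UTF8Encoding]::new($false))
    $withoutBom = Test-Utf8Bom -LiteralPath $tempFile

    # $true: emit a BOM, which Windows PowerShell needs to detect UTF-8 by default.
    [System.IO.File]::WriteAllText($tempFile, 'text', [System.Text.UTF8Encoding]::new($true))
    $withBom = Test-Utf8Bom -LiteralPath $tempFile
}
finally {
    Remove-Item -LiteralPath $tempFile
}

$withoutBom   # False
$withBom      # True
```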
