PowerShell Core now commendably defaults to UTF-8 encoding, including when sending strings _to external_ programs, as reflected in `$OutputEncoding`'s default value.
However, because the console window (shortcut file / taskbar entry) still defaults to the OEM code page implied by the legacy system locale (e.g., `437` on US-English systems), PowerShell misinterprets strings _from_ external programs; e.g., with Node.js installed:
```
PSCoreOnWin> $captured = '€' | node -pe "require('fs').readFileSync(0).toString().trim()"; $captured
Γé¼   # !! node's UTF-8 output was misinterpreted.
```
This currently requires the following _workaround_ (in addition to the console window using a TrueType font, which is the default on Windows 10):

```powershell
[console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
```
Prepend `$OutputEncoding = ` to make a _Windows PowerShell_ console fully UTF-8-aware.
The above implicitly switches to the UTF-8 code page (`65001`), as then reflected in `chcp`'s output.
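A quick verification sketch (assuming Node.js is installed, as in the example above):

```powershell
# After running the [console]:: workaround in the same session:
chcp        # should now report: Active code page: 65001

# The round trip from the example above should now preserve the character:
'€' | node -pe "require('fs').readFileSync(0).toString().trim()"   # -> €
```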
This obscure workaround shouldn't be necessary, and I think it would make sense for PowerShell to automatically set `[console]::InputEncoding` and `[console]::OutputEncoding` to (BOM-less) UTF-8 on startup.
_Update_: When this issue was originally created, there was no mechanism for presetting code page `65001` (UTF-8) _system-wide_, which necessitated the awkward workaround. In recent versions of Windows 10 it _is_ now possible to switch to code page `65001` _as the system locale_ and therefore system-wide, although as of Windows 10 version 1909 that feature is still in _beta_ - see this SO answer.
Caveats: While this change activates code page `65001` in all console windows (including `cmd.exe` windows), it invariably also makes _Windows PowerShell_'s _ANSI_-encoding-default cmdlets, notably `Get-Content` and `Set-Content`, default to UTF-8, which can be problematic from a backward-compatibility perspective. The change, which can also be made programmatically (see below), requires administrative privileges and a reboot.
PowerShell Core 7.1.0-preview.3 on Windows 10
It is a platform default:
https://source.dot.net/#System.Console/System/Console.cs,a570cd79bd33ceab
https://source.dot.net/#System.Console/System/ConsolePal.Windows.cs,c997db0e94f0d1cc
https://source.dot.net/#System.Console/Common/Interop/Windows/Interop.GetConsoleOutputCP.cs,f028312cfc964730
So we would need to run `[console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding` at PowerShell Core startup. @mklement0 Is this the right fix for all platforms and Windows versions (Windows 7?)?
Thanks for the sleuthing, @iSazonov.
Yes, I think the fix is also appropriate for Windows 7:
While you're more likely to run into problems with standard console programs there that can even _break_ with UTF-8 input, I think it's more important for PowerShell Core to exhibit consistent encoding behavior and to support modern, cross-platform utilities that natively speak UTF-8 by default.
@iSazonov: Forgot to clarify: It is only the right fix for _Windows_ - on Unix-like platforms the CoreFx default should be used, as discussed in #7634 (even though there's a CoreFx fix pending).
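Until PowerShell does this at startup itself, the Windows-only scoping could be approximated from a profile; a minimal sketch (using the `$IsWindows` automatic variable, which exists only in PowerShell Core):

```powershell
# Apply the UTF-8 console encodings on Windows only;
# on Unix-like platforms, leave the CoreFX defaults in place (see #7634).
if ($IsWindows) {
    [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
}
```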
I hope @JamesWTruher can comment. I think he considered this while writing and implementing the Encoding RFC.
Since Windows 7 has reached EOL and the community is migrating to Windows 10, it seems like a good time to switch the console default to UTF-8 on Windows.
/cc @SteveL-MSFT
@nu8, are you using Windows PowerShell? In PowerShell Core, the default encoding for `Get-Content` on files has been `UTF8NoBOM` since https://github.com/PowerShell/PowerShell/pull/5080.
@nu8, your problem is unrelated to the active OEM code page, and it is what @KalleOlaviNiemitalo states: if you have a _BOM-less_ UTF-8 file, _Windows PowerShell_ interprets it as ANSI-encoded.
The OEM code page only matters with respect to _external programs_, and only if their output is captured or redirected.
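To illustrate the misinterpretation (a minimal sketch, assuming a US-English system whose ANSI code page is Windows-1252; the file name is arbitrary):

```powershell
# Windows PowerShell: create a BOM-less UTF-8 file via .NET, then read it back.
[System.IO.File]::WriteAllText("$PWD\t.txt", '€', [System.Text.UTF8Encoding]::new($false))

Get-Content .\t.txt                 # -> â‚¬  (the UTF-8 bytes are misread as ANSI)
Get-Content .\t.txt -Encoding UTF8  # -> €   (correct, with an explicit -Encoding)
```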
Furthermore, to make the `[console]::InputEncoding = [console]::OutputEncoding = ...` solution complete in _Windows PowerShell_ (not needed in Core), you must also set `$OutputEncoding` to `[System.Text.UTF8Encoding]::new()`, so as to make PowerShell also use UTF-8 when piping data _to_ an external program:

```powershell
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
```
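With all three settings in place, a round trip through an external program should then also work in _Windows PowerShell_ (a sketch that reuses the OP's example and therefore assumes Node.js is installed):

```powershell
# Windows PowerShell, after running the combined workaround above:
'€' | node -pe "require('fs').readFileSync(0).toString().trim()"   # -> € (no longer mangled)
```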
While your registry-based way of setting the OEM code page to UTF-8 (`65001`) obviates the need for setting `[console]::InputEncoding` and `[console]::OutputEncoding`, in _Windows PowerShell_ the need for setting `$OutputEncoding` remains.
In _PowerShell Core_, the registry-based approach is sufficient (except for possibly switching to a font with more complete Unicode support).
Note that the registry-based approach is the (potentially programmatic) equivalent of the aforementioned GUI method (via Control Panel, `intl.cpl`, tab `Administrative`, `Change system locale...`); as previously noted, this feature is still labeled as _Beta_ as of Windows 10 release 1909, though I suspect it will work fine as long as you use only modern command-line utilities.
I'm talking about external programs, because that is what this issue is about, and it is what the `[console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding` workaround is for.
`Get-Content` is not an external program, which is why I pointed out that your problem is indeed unrelated to this issue (in the remainder of my response I merely provided more context for your attempt to solve the original, external-program-related issue). I've also explained the reason for your unrelated `Get-Content` problem.
To spell out the solution: you must create your UTF-8 files _with a BOM_ in order for _Windows PowerShell_ to recognize them as UTF-8-encoded; for more information, see this Stack Overflow answer.
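For instance (a minimal sketch; the file name is arbitrary):

```powershell
# In Windows PowerShell, -Encoding UTF8 writes UTF-8 *with* a BOM,
# which both Windows PowerShell and PowerShell Core then read correctly by default:
'€' | Set-Content -Encoding UTF8 .\withbom.txt
Get-Content .\withbom.txt   # -> €
```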
If you have further questions, I suggest you use the community support resources.
@nu8
Yes, of course you can always use `-Encoding` explicitly to specify the source encoding (which, as an aside, is ignored if a BOM indicates a different encoding). My intent was to show you how to create UTF-8-encoded files that Windows PowerShell recognizes as such _by default_.
The previously linked SO answer also demonstrates how you can use the `$PSDefaultParameterValues` hash table to _preset_ defaults (`$PSDefaultParameterValues['*:Encoding'] = 'utf8'`), though that needs careful managing to ensure that it only applies where you want it to.
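One way to scope it is to assign a _local_ copy of the hash table inside a script or function, so that the preset only affects commands invoked in that scope; a sketch (the function name is made up for illustration):

```powershell
function Read-Utf8File {
    param([string] $Path)

    # A local assignment shadows (and here replaces, rather than merges with)
    # any session-wide $PSDefaultParameterValues for the duration of this function only.
    $PSDefaultParameterValues = @{ 'Get-Content:Encoding' = 'utf8' }

    Get-Content -LiteralPath $Path
}
```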
Changing the legacy system locale (language for non-Unicode programs) to `65001` does _not_ help with that, because it doesn't change the _ANSI_ code page (only the _OEM_ code page): even after making that change, with `chcp` then reporting `65001`, `(Get-Culture).TextInfo.ANSICodePage` continues to report the culture-appropriate code page, e.g., `1252` on US-English systems, so `Get-Content a.txt` will continue to misinterpret your file in Windows PowerShell.
While there _is_ an `ACP` registry value that controls the ANSI code page - just like `OEMCP` controls the OEM code page - the GUI method changes only the `OEMCP` value, for good reasons, I suspect. I personally wouldn't even attempt to set `ACP` to `65001` - I don't think that is supported, and I suspect it would either wreak havoc or be ignored.
To summarize:
This issue is about _external (console) programs_ in _console windows_, as clearly mentioned in the body of the OP, and as explained later; it is about making PowerShell Core's UTF-8 support _complete_ by also using UTF-8 consistently when communicating with external console programs. PowerShell-native commands already do consistently default to UTF-8.
Your `Get-Content` problem is unrelated; it only surfaces in Windows PowerShell, and it is only related to the system's active _ANSI_ code page and to how PowerShell - which is internally UTF-16-based (.NET strings) - interprets content read from _files_; the behavior is independent of the PowerShell host application (whether it is a console or not).
Please generally note that this repo is for reporting PowerShell _Core_ issues only - for _Windows PowerShell_ issues, use the UserVoice forum.
> You should probably stop suggesting people use BOM
I've suggested a BOM to _work around_ an issue with the _legacy_ shell _Windows PowerShell_.
Yes, avoiding a BOM is the better approach, especially in cross-platform code, and, fortunately, _PowerShell Core_ now defaults to UTF-8.
(For _cross-edition_ PowerShell source code, UTF-8 _with BOM_ continues to be your only option if the source code contains (runtime-relevant) non-ASCII characters.)
> Locale has to do with Language, not code page.
In Windows, locales _imply_ code pages. In the GUI, you pick a language (+ region/country), which makes the code pages associated with the chosen locale the active ones.
> So I don't see what the point would be in changing an ANSI code page, as we are dealing with Unicode here.
My point was that just as you're changing the OEM code page to `65001` (UTF-8), you may be tempted to set the ANSI code page to that value as well, in order to change `Get-Content`'s default (see below) - but that isn't supported.
> Get-Content a.txt works just fine after making the Registry change I suggested.
Much to my surprise, this is indeed the case - even though the ANSI code page _isn't_ changed by your registry approach (or via `intl.cpl`).
I therefore recommend against setting the OEM code page to `65001` _system-wide_ if you're using _Windows PowerShell_, because you'll break any existing code that was written based on the assumption that `Get-Content` defaults to the active ANSI code page. In other words: all (by definition BOM-less) ANSI-encoded files - historically common on Windows - will then be misinterpreted as UTF-8 by `Get-Content` (in the absence of `-Encoding`).
Similarly, `Set-Content` would then create BOM-less UTF-8 files rather than ANSI files by default. If you use the workaround suggested in the OP, you won't have that problem (but you'll need to use something like `$PSDefaultParameterValues['*:Encoding'] = 'utf8'` for _your_ code to default to UTF-8).
Of course, if you're willing to assume that you'll only ever come across BOM-less files that are UTF-8-encoded (rather than ANSI-encoded) and that no (third-party) code ever runs that relies on ANSI being the default, your approach would work.
(It is a separate issue that it is currently needlessly difficult to read ANSI-encoded files in _PowerShell Core_, because the `-Encoding` parameter supports no `ANSI` value - see #6562.)
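Until such a value exists, a possible workaround - a sketch that assumes a recent PowerShell (Core) version, where `-Encoding` accepts `System.Text.Encoding` instances and the code-pages encoding provider is registered - is to pass the ANSI code page's encoding explicitly:

```powershell
# Read a legacy ANSI-encoded file by passing an Encoding instance directly
# (the file name is just an example).
$ansi = [System.Text.Encoding]::GetEncoding((Get-Culture).TextInfo.ANSICodePage)
Get-Content .\legacy.txt -Encoding $ansi
```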
> Then the title should reflect that.
hoo boy.
> I mean really, how common is that going to be?
The first commandment of _Windows PowerShell_ has always been: _backward compatibility_.
(That commandment is fraying around the edges in _PowerShell [Core]_.)
Your solution breaks backward compatibility, and I pointed out how.
> I notice you didn't give a single example
>
> perhaps you can point out some high profile ANSI files that are included with Windows, or some that I am likely to come across online?
>
> Yes, I am willing to do that
hoo boy.
Let me remind you again: This repo is for questions about _PowerShell [Core]_, not _Windows PowerShell_.
Your concerns are a non-issue in PowerShell [Core], which strikes me as a good reason to switch to it. If there's something holding you back - such as the startup-performance issue you linked to (#6443) - I suggest you focus on solving that.
Let me try to summarize, now that we (hopefully) have the full picture:
I've hidden my previous comments in favor of this one, @nu8 - I encourage you to do the same, as appropriate. This comment also corrects my earlier, incorrect claim that you cannot set the ANSI code page to `65001`.
This issue is about making UTF-8 support in PowerShell on Windows _complete_, by making sure that PowerShell also uses UTF-8 when communicating with _external programs_ (the built-in cmdlets already default to UTF-8, invariably so), which requires setting `[console]::InputEncoding` and `[console]::OutputEncoding` to (BOM-less) UTF-8 (possibly indirectly).
Currently, in the absence of PowerShell doing that itself, there are two workarounds:
Option 1: Add the following to your `$PROFILE`:

```powershell
# In *Windows PowerShell*, prepend `$OutputEncoding = `
[console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
```

Pros and cons:

- Requires a `$PROFILE` file, and therefore doesn't take effect in sessions started with `pwsh -noprofile ...`
- Note: In Windows PowerShell, you must prepend `$OutputEncoding = ` to the above command, in order to also make Windows PowerShell send UTF-8 _to_ external programs. (In PowerShell [Core], this preference variable commendably defaults to (BOM-less) UTF-8.)
Option 2: Switch to code page `65001` (UTF-8) _system-wide_ (Windows 10+):

- GUI method: via `intl.cpl` (Control Panel), tab `Administrative`, `Change system locale...`; as previously noted, this feature is still labeled as _Beta_ as of Windows 10 release 1909, though I suspect it will work fine as long as you use only modern command-line utilities.
- Equivalent programmatic method, based on @nu8's approach:

```powershell
# Requires ELEVATION and a REBOOT
'ACP', 'OEMCP', 'MACCP' | Set-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage -Name { $_ } 65001
# Restart-Computer
```
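Optionally, you can verify the values after the reboot (a small sketch):

```powershell
# All three values should now report 65001:
Get-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage |
  Select-Object ACP, OEMCP, MACCP
```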
Pros and cons:
- Due to a .NET bug still present in the .NET version underlying PowerShell Core 7.1.0-preview.3, `[console]::InputEncoding` and `[console]::OutputEncoding` are mistakenly set to UTF-8 encoding _with BOM_, which causes follow-on bugs; notably, it breaks `Start-Job` in PowerShell. See https://github.com/dotnet/runtime/issues/28929. Option 1 above doesn't have this problem.
- Curiously, by contrast, the `[System.Text.Encoding]::Default` encoding that reflects the active ANSI code page contains a _BOM-less_ UTF-8 encoding after the system-wide change (see below).
- Note that the bug can also manifest without the system-wide change, namely if you manually run `chcp 65001` from `cmd.exe`, for instance, _before_ invoking PowerShell (running `chcp` from _inside_ PowerShell isn't supported and requires Option 1 instead).
- Requires administrative privileges and a reboot.
- Takes effect _system-wide_: it applies to all console / Windows Terminal windows, notably including those running `cmd.exe`.
- Invariably also uses UTF-8 as the _ANSI_ code page (not just the _OEM_ code page), as reflected in `[System.Text.Encoding]::Default` (note that this also applies if you set _only_ the `OEMCP` registry value to `65001`); `(Get-Culture).TextInfo.ANSICodePage`, by contrast, continues to report the locale-appropriate code page, e.g. `1252` - see the verification sketch after this list.
- In _Windows PowerShell_, this also changes the default encoding of the ANSI-encoding-based cmdlets, notably `Get-Content` and `Set-Content`, which, depending on your backward-compatibility needs, is a pro or a con:
  - Pro: combined with `$PSDefaultParameterValues['*:Encoding'] = 'utf8'` in your `$PROFILE`, this is the only way to get Windows PowerShell to consistently default to UTF-8. It also makes `Set-Content` create _BOM-less_ UTF-8 files by default, something that cannot otherwise be achieved, except with direct use of .NET.
  - Con: if existing code relies on `Get-Content` and `Set-Content` without `-Encoding` and you need to process BOM-less files that are ANSI- rather than UTF-8-encoded, that code will break.
- In Windows PowerShell, you additionally need `$OutputEncoding = [System.Text.UTF8Encoding]::new()` (via `$PROFILE`) in order to also make Windows PowerShell send UTF-8 _to_ external programs.
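A quick way to inspect the resulting state after the system-wide switch (a verification sketch; the expected values follow from the points above):

```powershell
[System.Text.Encoding]::Default         # now a *BOM-less* UTF-8 encoding
(Get-Culture).TextInfo.ANSICodePage     # still reports the locale-appropriate code page, e.g. 1252
[console]::OutputEncoding               # UTF-8, but *with* BOM, due to the .NET bug noted above
chcp                                    # Active code page: 65001
```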
A note on file encoding:

If making _Windows PowerShell_, too, default to UTF-8 via the system-wide change is not an option, BOM-less UTF-8 files will only be read correctly under one of the following conditions:

- you use `-Encoding Utf8` explicitly with file-handling cmdlets;
- you convert your BOM-less UTF-8 files to have a BOM;
- you preset the default encoding via `$PSDefaultParameterValues['*:Encoding'] = 'utf8'`, but you'll have to _scope_ this setting if you don't want _all_ code to use these defaults.
Note that Windows PowerShell - curiously, _except_ if the system-wide change is made - only ever creates UTF-8 files _with BOM_ (whereas PowerShell [Core] defaults to _BOM-less_ UTF-8 and offers an `-Encoding utf8BOM` opt-in); direct use of .NET is required to work around that - see this SO answer.
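A minimal sketch of that .NET-based workaround in Windows PowerShell (the file name is arbitrary; note the full path, because .NET's current directory can differ from PowerShell's):

```powershell
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)   # $false = no BOM
[System.IO.File]::WriteAllLines("$PWD\out.txt", [string[]] ('line 1', 'line 2'), $utf8NoBom)
```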