PowerShell: Default and OEM character encodings in the Core edition should be Windows-1252, not ISO-8859-1

Created on 4 Mar 2017  ·  7 Comments  ·  Source: PowerShell/PowerShell

As of alpha16, ISO-8859-1 is the default character encoding, and it is also the encoding used when the explicit encoding specifiers Default and OEM are passed - see here.

This choice is problematic, because, with respect to printable characters, ISO-8859-1 is a _subset_ of the commonly used Windows-1252 encoding.
(The two encodings are often conflated, but they are _not_ the same.)

Specifically, using ISO-8859-1 makes the following characters - the printable characters in the codepoint range 0x80 - 0x9F - _unavailable_:

€ ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ

Note that the € character is part of that list.
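One rough way to check this from PowerShell Core is to try encoding a few of these characters with both encodings. The sketch below is only an illustration, not what the cmdlets do internally: it requests ISO-8859-1 with exception fallbacks (an assumption made purely so that unmappable characters fail loudly instead of being replaced), and the variable names are made up for the example.

```powershell
# Windows-1252 via the code-pages provider, as in the snippet from this issue.
$enc1252 = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252)

# ISO-8859-1 with exception fallbacks, so that unmappable characters raise an
# error. Assumption: strict mode is used here only to make the gap visible.
$encLatin1Strict = [System.Text.Encoding]::GetEncoding(
    'iso-8859-1',
    [System.Text.EncoderFallback]::ExceptionFallback,
    [System.Text.DecoderFallback]::ExceptionFallback)

# A few of the affected characters, given by Unicode codepoint so that the
# script does not depend on its own file encoding:
# euro sign, ellipsis, S with caron, trademark sign, oe ligature
$sample = [char[]] (0x20AC, 0x2026, 0x0160, 0x2122, 0x0153)

$report = foreach ($ch in $sample) {
    $latin1Ok = $true
    try { $null = $encLatin1Strict.GetBytes([string] $ch) } catch { $latin1Ok = $false }
    [pscustomobject] @{
        Char       = $ch
        Cp1252Byte = '0x{0:X2}' -f $enc1252.GetBytes([string] $ch)[0]
        InLatin1   = $latin1Ok
    }
}
$report | Format-Table
```

Each sampled character should show a single Windows-1252 byte in the 0x80 - 0x9F range, while the strict ISO-8859-1 encoder rejects it.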

You can verify the problematic behavior as follows:

> '€' | Set-Content tmp.txt; Get-Content tmp.txt
?

Because € cannot be represented in ISO-8859-1, it was quietly converted to a _literal_ ?.
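To see that the loss happens at encoding time, so that the original character can no longer be recovered from the file, the bytes can be inspected in memory. This is a minimal sketch that assumes the cmdlets behave like .NET's ISO-8859-1 encoding with its default fallback; the variable names are illustrative.

```powershell
# ISO-8859-1 with its default (non-throwing) fallback behavior.
$encLatin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')

$euro  = [string] [char] 0x20AC      # the euro sign, given by codepoint
$bytes = $encLatin1.GetBytes($euro)  # the bytes that would end up in the file

# Hex dump of the encoded bytes, then the text obtained by decoding them again.
$hex       = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$roundTrip = $encLatin1.GetString($bytes)

"bytes: $hex"
"text : $roundTrip"
```

Going by the behavior reported above, the byte shown should be that of a literal question mark, and decoding it again cannot bring the euro sign back.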

Contrast this with use of Windows-1252:

> $enc1252 = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252); [IO.File]::WriteAllText('tmp.txt', '€', $enc1252); [IO.File]::ReadAllText('tmp.txt', $enc1252)
€

The € char. - codepoint 0x80 in Windows-1252 (but not ISO-8859-1) - was correctly preserved.


Also, please note that in order to fully emulate _Windows_ PowerShell behavior, using a _fixed_ encoding in Core is _not_ sufficient.

Instead, the encoding would have to be locale-dependent, as on Windows:
Unix locales would have to be mapped to the Windows legacy codepages - see here.
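As a sketch of what a locale-dependent lookup could look like, .NET exposes the legacy code pages associated with a culture through `TextInfo.ANSICodePage` and `TextInfo.OEMCodePage`. The helper below is hypothetical (`Get-LegacyEncoding` is not part of PowerShell), and both the values reported on Unix platforms and the fallback to Windows-1252 are assumptions of this example rather than established behavior.

```powershell
# Make legacy code pages resolvable through [System.Text.Encoding]::GetEncoding().
[System.Text.Encoding]::RegisterProvider([System.Text.CodePagesEncodingProvider]::Instance)

# Hypothetical helper, not part of PowerShell: returns the legacy "ANSI" or
# "OEM" encoding that .NET associates with a culture.
function Get-LegacyEncoding {
    param(
        [cultureinfo] $Culture = [cultureinfo]::CurrentCulture,
        [ValidateSet('Ansi', 'Oem')] [string] $Kind = 'Ansi'
    )
    $codePage = if ($Kind -eq 'Oem') {
        $Culture.TextInfo.OEMCodePage
    } else {
        $Culture.TextInfo.ANSICodePage
    }
    # Assumption: cultures without a usable legacy code page (reported as 0 or 1)
    # fall back to Windows-1252.
    if ($codePage -le 1) { $codePage = 1252 }
    [System.Text.Encoding]::GetEncoding($codePage)
}

# Look up both encodings for the current culture.
$ansi = Get-LegacyEncoding
$oem  = Get-LegacyEncoding -Kind Oem
"ANSI: $($ansi.WebName) (code page $($ansi.CodePage))"
"OEM : $($oem.WebName) (code page $($oem.CodePage))"
```

A specific culture can be passed instead, for example `Get-LegacyEncoding -Culture 'ru-RU' -Kind Oem`, provided that culture data is available on the system.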

Resolution-Duplicate WG-Engine

All 7 comments

Looks like the Content cmdlets used ISO-8859-1 as default to align with HTTP1.1, but HTML5 now uses Windows-1252 as equivalent due to mislabeling of sites using ISO-8859-1. Seems like using Windows-1252 would be ideal.

@SteveL-MSFT

Thanks for that. To make it even more explicit:

Mistaking what was actually Windows-1252 for ISO-8859-1 became so commonplace that the HTML5 specification, which links to this page about encoding, decided to treat label "iso-8859-1" as an _alias for_ windows-1252 (the link lists all chars. with the high bit set, i.e., starting at 0x80), which is also reflected in this living WHATWG document.

@SteveL-MSFT: I suspect that the reason it breaks is that the Windows-1252 code page is not available _by default_ in Core, but you can load it via the System.Text.Encoding.CodePages NuGet package, as demonstrated in this Stack Overflow answer.

Curiously, from PowerShell Core itself, that package (assembly) _is_ available by default, which is what the PowerShell snippet in the initial post takes advantage of ([System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252)).

@mklement0 thanks!

@SteveL-MSFT

Looks like the Content cmdlets used ISO-8859-1 as default to align with HTTP1.1, but HTML5 now uses Windows-1252 as equivalent due to mislabeling of sites using ISO-8859-1. Seems like using Windows-1252 would be ideal.

The defaults for the Content cmdlets and the Web cmdlets are different things. I believe this issue is about the Content cmdlets and is a duplicate of #3248. I suggest discussing the default for the Web cmdlets in #3267.

@SteveL-MSFT: @iSazonov is correct, so I'm closing this issue.
