ISO-8859-1 is currently (alpha16) the default character encoding; it is also used when the explicit encoding specifiers Default and OEM are passed - see here.
This choice is problematic, because ISO-8859-1 is a _subset_ of the commonly used Windows-1252 encoding.
(The two encodings are often conflated, but they are _not_ the same.)
Specifically, using ISO-8859-1 makes the following characters - the printable characters in the codepoint range 0x80 - 0x9F - _unavailable_:
€ ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
Note that the € character is part of that list.
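To reproduce the list above rather than take it on faith, here is a minimal sketch that decodes every byte in the range 0x80 - 0x9F as Windows-1252 and keeps only the printable results. It assumes PowerShell Core, where [System.Text.CodePagesEncodingProvider] is available by default; the variable names are just for illustration.

```powershell
# Assumption: PowerShell Core, where the code-pages provider type is available.
$enc1252 = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252)

# Decode each byte in 0x80 - 0x9F and skip results that are control characters
# (the few byte values that Windows-1252 leaves undefined decode to controls).
$printable = foreach ($b in 0x80..0x9F) {
    $ch = $enc1252.GetString([byte[]] $b)
    if (-not [char]::IsControl($ch, 0)) { $ch }
}

# Join the surviving characters into one space-separated line.
$printable -join ' '
```

None of the characters this prints can be written through ISO-8859-1, which has no printable characters in that byte range.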
You can verify the problematic behavior as follows:
> '€' | Set-Content tmp.txt; Get-Content tmp.txt
?
Because € cannot be represented in ISO-8859-1, it was quietly converted to a _literal_ ?.
Contrast this with use of Windows-1252:
> $enc1252 = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252); [IO.File]::WriteAllText('tmp.txt', '€', $enc1252); [IO.File]::ReadAllText('tmp.txt', $enc1252)
€
The € char. - codepoint 0x80 in Windows-1252 (but not ISO-8859-1) - was correctly preserved.
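The difference is perhaps easiest to see at the byte level, without going through a file. The following sketch encodes the same string with both encodings and prints the resulting byte; again it assumes PowerShell Core, and the variable names are mine.

```powershell
# ISO-8859-1 is built in; Windows-1252 comes from the code-pages provider
# (assumption: PowerShell Core).
$latin1  = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$enc1252 = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252)

# Encode the euro sign with each encoding.
$latin1Bytes = $latin1.GetBytes('€')
$cp1252Bytes = $enc1252.GetBytes('€')

# Show the single resulting byte in hex.
'ISO-8859-1:   0x{0:X2}' -f $latin1Bytes[0]
'Windows-1252: 0x{0:X2}' -f $cp1252Bytes[0]
```

The ISO-8859-1 line should show 0x3F, the byte for a literal ?, while the Windows-1252 line should show 0x80.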
Also, please note that in order to fully emulate _Windows_ PowerShell behavior, using a _fixed_ encoding in Core is _not_ sufficient.
Instead, the encoding would have to be locale-dependent, as on Windows:
Unix locales would have to be mapped to the Windows legacy codepages - see here.
Looks like the Content cmdlets used ISO-8859-1 as default to align with HTTP1.1, but HTML5 now uses Windows-1252 as equivalent due to mislabeling of sites using ISO-8859-1. Seems like using Windows-1252 would be ideal.
Unfortunately, just changing https://github.com/PowerShell/PowerShell/blob/fc30ae1d8713d930b7301bd6d9a85c77256f8669/src/System.Management.Automation/utils/ClrFacade.cs#L385 causes PowerShell to stop working.
@SteveL-MSFT
Thanks for that. To make it even more explicit:
Mistaking what was actually Windows-1252 for ISO-8859-1 became so commonplace that the HTML5 specification, which links to this page about encoding, decided to treat the label "iso-8859-1" as an _alias for_ windows-1252 (the link lists all chars. with the high bit set, i.e., starting at 0x80), which is also reflected in this living WHATWG document.
@SteveL-MSFT: I suspect that the reason it breaks is that the Windows-1252 code page is not available _by default_ in Core, but you can load it via the System.Text.Encoding.CodePages NuGet package, as demonstrated in this Stack Overflow answer.
Curiously, from PowerShell Core itself, that package (assembly) _is_ available by default, which is what the PowerShell snippet in the initial post takes advantage of ([System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252)).
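For reference, the registration step can be sketched in PowerShell as well. This is only an illustration of the general .NET Core pattern, not a tested fix for ClrFacade.cs; it assumes PowerShell Core, where the provider type can be resolved.

```powershell
# Register the code-pages provider so that Encoding.GetEncoding() can resolve
# legacy code pages such as 1252 (registering the same provider again is harmless).
[System.Text.Encoding]::RegisterProvider([System.Text.CodePagesEncodingProvider]::Instance)

# After registration, the regular lookup by code page number works.
$enc1252 = [System.Text.Encoding]::GetEncoding(1252)
$enc1252.WebName
```

Once the provider is registered, lookups by number or by name go through the usual [System.Text.Encoding]::GetEncoding() call, so code that obtains its encoding that way would not need to reference the provider directly.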
@mklement0 thanks!
@SteveL-MSFT
Looks like the Content cmdlets used ISO-8859-1 as default to align with HTTP1.1, but HTML5 now uses Windows-1252 as equivalent due to mislabeling of sites using ISO-8859-1. Seems like using Windows-1252 would be ideal.
The defaults for the Content cmdlets and the Web cmdlets are different things. I believe this issue is about the Content cmdlets and is a duplicate of #3248. I suggest discussing the default for the Web cmdlets in #3267.
@SteveL-MSFT: @iSazonov is correct, so I'm closing this issue.