PowerShell: Set-Content/Add-Content/Get-Content use an 8-bit character encoding by default, but the help topics state ASCII; problematic Core default file encoding

Created on 3 Mar 2017 · 29 Comments · Source: PowerShell/PowerShell

This issue has two distinct aspects:

  • discussion of an existing _documentation bug_
  • discussion of the problematic fixed default file encoding currently (alpha16) chosen for _Core_.

Steps to reproduce

'ö' | Set-Content -NoNewline -Encoding ASCII tmp.txt 
'ö' | Add-Content -Encoding ASCII -NoNewline tmp.txt 
Get-Content -Encoding ASCII tmp.txt
(Get-Content -Encoding Byte -TotalCount 2 tmp.txt) | % { '0x{0:x}' -f $_ }
'--'
'ö' | Set-Content -NoNewline tmp.txt   # use default encoding
'ö' | Add-Content -NoNewline tmp.txt   # use default encoding
Get-Content tmp.txt                    # use default encoding
(Get-Content -Encoding Byte -TotalCount 2 tmp.txt) | % { '0x{0:x}' -f $_ }

Expected behavior

??
0x3f
0x3f
--
??
0x3f
0x3f

Actual behavior

??
0x3f
0x3f
--
öö
0xf6
0xf6

That is, ASCII encoding turns a non-ASCII character into _literal_ ? (0x3f)

The fact that Set-Content without an -Encoding argument resulted in ö on reading implies that ASCII encoding wasn't used, and the specific byte value of 0xf6 further implies that a single-byte, extended-ASCII encoding was used:

  • For _Windows_ PowerShell, it is the _respective_ system's legacy codepage ("ANSI"), such as Windows-1252 on US-English systems, or Windows-1251 on Russian systems. In other words: the specific encoding is, to put it in Unix terms, _locale-dependent_.

  • For PowerShell _Core_, _as of alpha 16_, it is ISO-8859-1, as @iSazonov helpfully points out (see his comment below for the source-code links).
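The byte values seen in the repro can be reproduced without touching the file system. The following is a minimal sketch that uses only the .NET encoding classes PowerShell exposes; the variable names are illustrative, and the character is built from its code point so that the result doesn't depend on how the script file itself is encoded:

```powershell
# 'ö' (U+00F6), built from its code point rather than typed literally
$text = [string][char]0x00F6

# ASCII has no mapping for U+00F6, so the encoder substitutes a literal '?' (0x3f).
$asciiBytes = [System.Text.Encoding]::ASCII.GetBytes($text)

# ISO-8859-1 (code page 28591) maps U+00F6 to the single byte 0xf6.
$latin1Bytes = [System.Text.Encoding]::GetEncoding(28591).GetBytes($text)

# BOM-less UTF-8 needs two bytes for the same character: 0xc3 0xb6.
$utf8Bytes = (New-Object System.Text.UTF8Encoding $false).GetBytes($text)

# Render each byte array the same way the repro does.
$asciiHex  = ($asciiBytes  | ForEach-Object { '0x{0:x2}' -f $_ }) -join ' '
$latin1Hex = ($latin1Bytes | ForEach-Object { '0x{0:x2}' -f $_ }) -join ' '
$utf8Hex   = ($utf8Bytes   | ForEach-Object { '0x{0:x2}' -f $_ }) -join ' '

"ASCII:      $asciiHex"
"ISO-8859-1: $latin1Hex"
"UTF-8:      $utf8Hex"
```

A single 0xf6 byte on disk therefore rules out both ASCII and UTF-8 as the encoding that was actually used.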

In contrast, Get-Help Set-Content, Get-Help Add-Content, and Get-Help Get-Content state for parameter -Encoding:

Specifies the file encoding. The default is ASCII.

The help-topic sources (branch live) for the relevant cmdlets can be found here.

Additionally:

  • While these cmdlets _accept_ an encoding identifier Default, as used in other cmdlets, the help only mentions String.

  • Given that the two appear to result in the same encoding - what is their relationship?

  • The description for encoding String in the online help is inadequate:

Uses the encoding type for a string.

Environment data

PowerShell Core v6.0.0-alpha (v6.0.0-alpha.16) on Microsoft Windows 10 Pro (64-bit; v10.0.14393)
Issue-Bug Issue-Discussion Resolution-Fixed

All 29 comments

The default has been changed to ISO-8859-1 for PowerShell Core.
The same goes for OEM: ISO-8859-1 too.

File provider
GetContentReader
GetContentWriter

This is a real regression (breaking change) and we should fix it.

The Default has been changed to ISO-8859-1 for PowerShell Core.
The same goes for OEM: ISO-8859-1.

Let's look at PowerShell FullCLR.
Default uses the GetACP function ("Retrieves the current Windows ANSI code page identifier for the operating system."). See the .NET Framework Reference.

OEM exists only in PowerShell and uses the GetOEMCP function ("Returns the current original equipment manufacturer (OEM) code page identifier for the operating system."). See the PowerShell code.

I don't know why both encodings were added to PowerShell. Maybe the PowerShell PG can comment on this.

Default is still not released in CoreCLR, so it is an external issue. Maybe the PowerShell PG can find this out internally with the .NET team. And should we use the waiting-netstandart20 label?
OEM is not in CoreCLR at all, so it seems to be an internal issue. Does it make sense to fix it?

@mklement0 See .Net Framework Reference
We cannot use Windows-1252 as the default.

Here, Default is not the PowerShell default; it is the OS default. Every system can have its own default. The original .NET code uses GetACP() to get the OS default code page. Modern Unix systems use en_US.UTF-8 as the default.

cc @JamesWTruher @BrucePay

@mklement0 Current Windows PowerShell behavior is based on the .NET Framework and is dynamic:

System | FileSystemCmdletProviderEncoding.Default (GetACP()) | FileSystemCmdletProviderEncoding.OEM (GetOEMCP())
-|-|-
Windows English | 1252 | 437
Windows Russian | 1251 | 866

Current PowerShell Core behavior is hard-coded to ISO-8859-1. If we change it to Windows-1252, we still won't get Windows PowerShell behavior:

System | FileSystemCmdletProviderEncoding.Default (GetACP()) | FileSystemCmdletProviderEncoding.OEM (GetOEMCP())
-|-|-
Windows English | 1252 | 437
Windows Russian | 1252 | 437

Preferred solution is to fix this in CoreCLR.
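Why the code page has to follow the system rather than being hard-coded can be seen by decoding one and the same byte under the four code pages from the tables above. This is only a sketch: it assumes that [System.Text.CodePagesEncodingProvider] is loadable in the session (as it is in PowerShell Core), and the variable names are illustrative:

```powershell
# Assumption: the code-pages provider type is available in this session.
$provider = [System.Text.CodePagesEncodingProvider]::Instance

# The byte that Set-Content wrote for 'ö' in the original repro.
$bytes = [byte[]](0xF6)

# Decode that byte under each code page and remember the resulting character.
$decoded = @{}
foreach ($codePage in 1252, 1251, 437, 866) {
    $char = $provider.GetEncoding($codePage).GetString($bytes)
    $decoded[$codePage] = $char
    'CP {0,-4} : 0x{1:x2} -> U+{2:X4} {3}' -f $codePage, $bytes[0], [int][char]$char, $char
}
```

Under 1252 the byte decodes to ö (U+00F6), whereas under 1251 it decodes to the Cyrillic letter ц (U+0446), so a file written with one ANSI code page is misread under another.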

As you can see, with that fix we could properly read and write files on Russian Windows, with 1251 as the system default code page, by using:
Get-Content -Encoding Default and Set-Content -Encoding Default

As for mapping Unix locale names: Unix uses standard names; see Table of locales (this is not a complete sample). Therefore, the fix in CoreCLR should not be too difficult.

We should wait for the MSFT experts' conclusion; then it will be clear what we need to fix in the code and documentation.

@mklement0 We have an RFC draft for this. You're welcome to join the discussion: https://github.com/PowerShell/PowerShell-RFC/issues/71

@iSazonov: Good idea. I've cleaned up my comments here, and I've revised the original post to point to the relevant other issues / comments / RFC.

@mklement0

  • Are the help topics open-sourced too?

See the full PowerShell repo list: https://github.com/PowerShell/

@mklement0 more specifically https://github.com/powershell/powershell-docs

Thanks - I've added a link to the original post that links specifically to the parent folder of the relevant cmdlets' help-topic sources.

@iSazonov do you know of an issue where .NET Core is tracking this problem? You're right, ideally the fix is made in .NET Core.

@joeyaiello:

  • I don't think there is a fix to be made in .NET Core, or any variant of .NET, for that matter.

  • PowerShell, from v1 on, decided to do its own thing, _separate from the .NET framework_, which at its core (no pun intended) has always been UTF-8-without-BOM-based - and .NET Core is no exception.

  • Therefore, aligning PS with .NET's defaults - if chosen - must be a very deliberate act, carefully weighing potentially breaking backward-compatibility against the gains (I do think it's worth doing, however).

  • Similarly, with respect to providing _legacy Windows_ PowerShell behavior on _Unix_ platforms, the onus is on _PowerShell_, not .NET Core.

@joeyaiello I have not found such an issue in the CoreFX repo. We should create a new one. I'd prefer that the PowerShell team create it, because it may trigger a large discussion.

Wow, there's a lot there. Let me clarify a few things:

First, my ask to @iSazonov was purely around the codepage issue he referenced above. I admit I skimmed the problem a little too quickly and mistook FileSystemCmdletProviderEncoding for a CoreCLR type rather than for our type. I had assumed this was similar to #2009 where we were reading an inaccurate value given to us by CoreCLR. If "current PowerShell Core behavior is hard-coded to ISO-8859-1" means that we're hardcoding a value in FileSystemCmdletProviderEncoding, let's fix that. (I'm hoping this one is noncontroversial because the behavior in Windows PowerShell is already the correct behavior.)

Before diving into the rest of the problems you raised, I should also note that I did not intend to put the onus on .NET Core for addressing the myriad of other problems we have around file encodings. As I've read it, the heart of this issue (#3248, not #707 which talks about the problem more generally) is that we're not following the same behavior as Windows PowerShell today because we're not respecting the codepage associated with a given machine's locale. What I still don't fully understand (and what we need to answer) is whether that is happening because of PowerShell or .NET Core. No matter what we do in the rest of the encoding space, Default and OEM need to work properly. If everyone agrees that I'm capturing the essence of this issue properly, I'll change the title to reflect it.

Now, to the more general problem: I am the first to admit that PowerShell's approach to encoding is a horrible mish-mash of inconsistent behaviors. That's why we have an RFC out that's intended to create more sane defaults on Linux (and for those people willing to change their defaults on Windows) while also maintaining the legacy mish-mash for those who have written scripts already to work around it. As @mklement0, @iSazonov, and others have already been doing, I highly encourage anyone who cares about the encoding problem to give us feedback on that RFC.

Additional side notes:

  • Unfortunately, after a deep-dive investigation, we know that the behavior in .NET Framework has not always been purely non-BOM (and in fact, CoreCLR still maintains this inconsistency for back-compat).
  • To give you historical context as I understand it from those who have been on the team for a long time, Microsoft thought UTF-16 was the future and little thought was given to BOM or multi-byte standards around UTF-8. Given our heavy bend towards localization, and (at least as perceived by MSFT at the time) a lack of prevalence of multi-byte UTF-8, we were all in on UTF-16 (hence why Unicode is the enum value associated with UTF-16, despite the technical inaccuracy of that label).

@joeyaiello: Thanks for that detailed response.

What I still don't fully understand (and what we need to answer) is whether that is happening because of PowerShell or .NET Core

The _Windows_ perspective:

I may have spoken too soon when I said that no .NET Core fix is needed, though PowerShell could do its own implementation, if needed:

The current PS Core source code contains this comment:

#if CORECLR // Encoding.Default is not in CoreCLR

            // As suggested by CoreCLR team (tarekms), use latin1 (ISO-8859-1, CodePage 28591) as the default encoding.
            // We will revisit this if it causes any failures when running tests on Core PS.
            s_defaultEncoding = Encoding.GetEncoding(28591);

and a similar comment re OEM encoding.

(As has been discussed, using ISO-8859-1 is inadequate, primarily because it doesn't respect the _variable_ Windows legacy system locale, and secondarily because it doesn't even cover all characters in the most widely used "ANSI" codepage, Windows-1252.)
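The second point, that ISO-8859-1 doesn't cover everything in Windows-1252, can be checked with a round trip of the euro sign. A minimal sketch, assuming the code-pages provider is loadable in the session; the variable names are illustrative:

```powershell
# EURO SIGN (U+20AC): present in Windows-1252, absent from ISO-8859-1
$euro = [string][char]0x20AC

# Assumption: the code-pages provider type is available (as in PowerShell Core).
$cp1252 = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252)
$latin1 = [System.Text.Encoding]::GetEncoding(28591)

# Windows-1252 stores the euro sign as the single byte 0x80 ...
$cp1252Bytes = $cp1252.GetBytes($euro)
$survives1252 = $cp1252.GetString($cp1252Bytes) -ceq $euro

# ... whereas ISO-8859-1 has no slot for it, so the character is lost on encoding.
$survivesLatin1 = $latin1.GetString($latin1.GetBytes($euro)) -ceq $euro

'Windows-1252 byte: 0x{0:x2}; round trip OK: {1}' -f $cp1252Bytes[0], $survives1252
'ISO-8859-1 round trip OK: {0}' -f $survivesLatin1
```

The same loss applies to the other characters that Windows-1252 places in the 0x80-0x9f range, where ISO-8859-1 has control codes.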

While [System.Text.Encoding]::Default isn't part of the .NET contract, it _is_ available, and so is the in-contract equivalent, [System.Text.Encoding]::GetEncoding(0).

What .NET Core returns is a _UTF-8_ encoding _with_ BOM, however, even on Windows, whereas on _Windows_ it arguably should return the "ANSI" encoding (the active code page implied by the system locale), as the .NET _Framework_ does.

However, the majority of the _Windows_ code pages are _not_ part of .NET Core, notably missing Windows-1252 and any of the OEM code pages.

An optional NuGet package does make them all available in .NET Core, even on Unix, however (as demonstrated here).

That package already seems to be part of PS Core _at runtime_, actually, as evidenced by [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252) succeeding, even on Unix.

With this package, PS Core could fix the issue even without any changes to .NET Core, by passing the code-page identifiers returned by [cultureinfo]::CurrentCulture.TextInfo.ANSICodePage and
[cultureinfo]::CurrentCulture.TextInfo.OEMCodePage
(which presumably call the GetACP ("ANSI") and GetOEMCP (OEM) Windows API functions that @iSazonov mentions) to [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding().
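The lookup described above can be sketched as follows. This is not the fix that was implemented, only an illustration of the idea; it assumes the code-pages provider is loadable, and it guards against platforms where the culture reports no code page:

```powershell
$textInfo = [cultureinfo]::CurrentCulture.TextInfo
$provider = [System.Text.CodePagesEncodingProvider]::Instance

$report = foreach ($name in 'ANSICodePage', 'OEMCodePage') {
    # Code-page identifier that the current culture reports for this property
    $codePage = $textInfo.$name

    # GetEncoding() returns $null for code pages the provider doesn't supply;
    # skip the lookup entirely if no code page is reported.
    $encoding = if ($codePage -gt 0) { $provider.GetEncoding($codePage) }
    $webName  = if ($encoding) { $encoding.WebName } else { '(not supplied by provider)' }

    [pscustomobject]@{ Source = $name; CodePage = $codePage; Encoding = $webName }
}

$report | Format-Table -AutoSize
```

The values reported depend on the current culture and platform, so no particular output should be expected.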


As for the _Unix_ perspective:

First and foremost: Is it worth trying to emulate the _Windows_ encoding behavior on Unix, using _Windows_ encodings, by mapping the Unix locales onto Windows code pages?

@iSazonov links to an (incomplete) table from the Moodle (a CMS) docs that seemingly provides this kind of mapping, but I see a problem with that:

  • The mapping cannot be unambiguous, because there are locales that have variations that use _different scripts_ (character sets). For instance, Bosnian can be written in both the Latin alphabet and the Cyrillic alphabet, and on _Windows_ these variants - of necessity - use _distinct code pages_.
    Say you're on a Unix platform whose active locale is bs_BA.UTF-8 - this locale, due to using UTF-8, is capable of representing _both_ writing systems. To map that onto a Windows code page, you must _choose_ between Windows-1250 (Latin) and Windows-1251 (Cyrillic), and there's no obvious choice.

Perhaps a compromise is to expose the Windows codepages as distinct -Encoding values, so that an explicit choice is possible (e.g., -Encoding 1250).

@joeyaiello: As for the side notes:

Unfortunately, after a deep-dive investigation, we know that the behavior in .NET Framework has not always been purely non-BOM (and in fact, CoreCLR still maintains this inconsistency for back-compat).

While the differing BOM behavior between [System.Text.Encoding]::UTF8 and [System.Text.UTF8Encoding] is regrettable, the true _default_ behavior has always been _BOM-less_ UTF-8.

By true default behavior I mean what happens when you use [System.IO.File] methods _without specifying an encoding_:

  • On _writing_, a BOM-less UTF-8 file is created.

  • On _reading_, a BOM-less file is interpreted as UTF-8.

This is in line with what modern Unix platforms do (with a now-ubiquitous UTF-8-based locale in effect).
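Both points can be checked with a temporary file. A minimal sketch using only [System.IO.File] and the UTF-8 encoding classes; the variable names are illustrative:

```powershell
# 'ö' (U+00F6), built from its code point
$text = [string][char]0x00F6
$path = [System.IO.Path]::GetTempFileName()

try {
    # No encoding argument: [System.IO.File] writes UTF-8 without a BOM ...
    [System.IO.File]::WriteAllText($path, $text)
    $fileBytes = [System.IO.File]::ReadAllBytes($path)

    # ... and reads a BOM-less file back as UTF-8.
    $textBack = [System.IO.File]::ReadAllText($path)
}
finally {
    Remove-Item -LiteralPath $path
}

# The regrettable difference: the static UTF8 instance advertises a BOM,
# while a default-constructed UTF8Encoding does not.
$staticPreambleLength = [System.Text.Encoding]::UTF8.GetPreamble().Length
$ctorPreambleLength   = (New-Object System.Text.UTF8Encoding).GetPreamble().Length

'File bytes: ' + (($fileBytes | ForEach-Object { '0x{0:x2}' -f $_ }) -join ' ')
"Preamble length, [Encoding]::UTF8:     $staticPreambleLength"
"Preamble length, new UTF8Encoding():   $ctorPreambleLength"
```

The file contains only the two bytes of the character itself, with no 0xef 0xbb 0xbf prefix.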

I collected all links for more easy review.

File encoding

  1. FileSystemCmdletProviderEncoding _enum_
    Used by the FileSystemProvider cmdlets.
    Contains a special member, Byte, for binary streams. It is processed by separate code; the other members are converted to the System.Text.Encoding type.
    The code is spread across two files:
    FileSystemProvider.cs
    Utils.cs
  2. EncodingConversion _class_
    Used by non-FileSystemProvider cmdlets (plus the web cmdlets).
    Doesn't contain a Byte member.
    Unlike FileSystemCmdletProviderEncoding, EncodingConversion can easily be enhanced (to support a dynamic list of all installed codepages).
    All members are converted to the System.Text.Encoding type.
    There is an ArgumentToEncodingNameTransformationAttribute in Send-MailMessage.
    It would be good to have a ValidateEncoding attribute for cmdlet parameters.
    I believe we could enhance the class with Byte and drop the FileSystemCmdletProviderEncoding enum.

Both types call ClrFacade.GetDefaultEncoding() and ClrFacade.GetOEMEncoding().

Default and OEM

  1. Default
    For PowerShell CoreCLR: Encoding.GetEncoding(28591), i.e. ISO-8859-1.
    For PowerShell FullCLR: Encoding.Default (Encoding.Default and CreateDefaultEncoding()), which is Win32Native.GetACP() in the .NET Framework.
    (So Windows PowerShell uses a native call and works well.)
    The .NET Framework uses UTF-8 as the default for Silverlight.

  2. OEM
    The .NET Framework does not support it as an encoding.
    For PowerShell CoreCLR: GetDefaultEncoding(), i.e. ISO-8859-1.
    For PowerShell FullCLR: NativeMethods.GetOEMCP(). PowerShell already has the P/Invoke for CoreCLR too, so we can fix OEM for PowerShell Core on Windows. For Unix, however, we need an answer to this question: what should GetOEMCP() return? (E.g., GetACP() returns the "system/machine locale" and GetOEMCP() returns the "current session locale".)
    It seems we cannot use [cultureinfo]::CurrentCulture.TextInfo.OEMCodePage (from @mklement0) because it always returns null on Unix (I tested on WSL only). (It is in CoreCLR's CultureInfo and CultureData.)

Also, we never talked about IOS, which has its own codepages.

So you can see that Default is an external issue that should be fixed in CoreFX, while OEM is an internal issue that should be fixed in the PowerShell repo.

About Web cmdlets

PowerShell Core's default is ISO-8859-1.
The HTML5 default is UTF-8; see the HTML 5 rules for determining content type.
It seems HTTP/1.1 is UTF-8 too: https://tools.ietf.org/html/rfc7231
CoreFX currently already uses UTF-8 as the default.

@iSazonov: Great compilation of links, thanks.

Yes, [cultureinfo]::CurrentCulture.TextInfo.ANSICodePage and [cultureinfo]::CurrentCulture.TextInfo.OEMCodePage appear to be empty on all Unix platforms at the moment - though it appears that the groundwork has been laid in LocaleDataUnix.cs and CultureDataUnix.cs - including mapping to Windows LCIDs (which is the part I think that can't be done unambiguously - unless all Unix locale identifiers, across all platforms, have a @<script> suffix that denotes the charset).

Also, what I didn't consider earlier is that, of course, all culture/locale aspects _aside_ from legacy character encoding (date format, number format, ...) _must_ work in .NET Core on Unix platforms too, and it appears that that's already the case.

I don't know what the .NET Core team's plans are, but _conceivably_ - as suggested by not including the majority of Windows code pages - the intent is to _not_ support legacy Windows code pages - and let's not forget that they're _legacy_ for a reason.

In today's world, with Unicode as the lingua franca, the question really shifts from what the active, _culture-specific, mutually incompatible code page_ is to what _encoding_ of the _universal alphabet (Unicode)_ should be the default.

In other words: using Unicode has taken culture specificity out of the equation and reduced the question to _which encoding of Unicode_ should be assumed _by default_ when faced with an unmarked byte stream (file).

Windows - sorta, kinda, but not yet, really - settling on UTF-16 LE is, unfortunately, at odds with the Unix world, which has chosen UTF-8 (which allowed for a much smoother transition from legacy encodings, though it is Western-centric).

So if we take _legacy_ behavior out of the picture, what _default encoding of Unicode_ that is applied _in the absence of a BOM_ should officially be used, per each platform's policies?

  • Unix: UTF-8 - and the transition from legacy single-byte encodings _has_ happened.

  • Windows: UTF-16 LE - but the transition from legacy single-byte encodings has _not_ happened, to this day.

Perhaps the fact that Windows never truly transitioned to UTF-16 LE - at least in terms of _file_ encoding - is an _opportunity_ to align the two worlds, by both making them speak (by definition BOM-less) UTF-8 _by default_.

If so,

  • the only change to .NET Core that's required is to change [System.Text.Encoding]::GetEncoding(0) to _BOM-less_ UTF-8, on both Windows and Unix - as stated, the [System.IO.File] type already does that by default.

  • anyone wishing to use _legacy_-encoded files in PowerShell Core needs to re-encode them to _some_ encoding of Unicode: either BOM-less UTF-8, or one of the BOM-prefixed standard Unicode encoding forms (UTF-8, UTF-16, UTF-32).
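Such a one-time re-encoding could look roughly like the sketch below. Convert-LegacyFileToUtf8 is a hypothetical helper name, not an existing cmdlet, and the sketch assumes the code-pages provider is loadable so that the legacy code page can be resolved:

```powershell
# Hypothetical helper: re-encodes a legacy-encoded text file as BOM-less UTF-8.
# Pass full paths: [System.IO.File] resolves relative paths against the process's
# working directory, which may differ from PowerShell's current location.
function Convert-LegacyFileToUtf8 {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Destination,
        [Parameter(Mandatory)] [int]    $SourceCodePage
    )

    # Resolve the legacy encoding; fall back to the built-in lookup if the
    # provider doesn't supply this code page.
    $source = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding($SourceCodePage)
    if ($null -eq $source) { $source = [System.Text.Encoding]::GetEncoding($SourceCodePage) }

    # Decode with the legacy encoding, then write UTF-8 without a BOM.
    $text = [System.IO.File]::ReadAllText($Path, $source)
    [System.IO.File]::WriteAllText($Destination, $text, (New-Object System.Text.UTF8Encoding $false))
}

# Usage: a one-byte Windows-1252 file containing 'ö' (0xf6) becomes 0xc3 0xb6.
$legacyFile = [System.IO.Path]::GetTempFileName()
$utf8File   = [System.IO.Path]::GetTempFileName()
try {
    [System.IO.File]::WriteAllBytes($legacyFile, [byte[]](0xF6))
    Convert-LegacyFileToUtf8 -Path $legacyFile -Destination $utf8File -SourceCodePage 1252
    $converted = [System.IO.File]::ReadAllBytes($utf8File)
}
finally {
    Remove-Item -LiteralPath $legacyFile, $utf8File
}

'Converted bytes: ' + (($converted | ForEach-Object { '0x{0:x2}' -f $_ }) -join ' ')
```

The caller has to know the source code page, because a BOM-less legacy file carries no marker that identifies its encoding.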

Do we really need to support _legacy_ Windows code pages in PowerShell _Core_ (whether on Unix or Windows)?

Do we really need to support legacy Windows code pages in PowerShell Core (whether on Unix or Windows)?

There is a huge amount of software that only works with legacy code pages, although the amount has decreased since 2003.

@iSazonov: I hear you, but do users who rely on this kind of software need this legacy support in PowerShell _Core_? Aren't users who rely on legacy _Windows_ applications likely to stay in the realm of _Windows_ PowerShell?

I see the following fundamental options (from a purely _conceptual_ standpoint - ease of implementation is not being considered here).

I think it's important to get clarity with respect to what _should_ be implemented - as opposed to what _can_ be.
Based on the appropriate way forward, we can assess what aspects should be implemented by the underlying .NET Core rather than in PowerShell Core.

  • (a) Core-NoWinLegacy:

    Forgo all Windows legacy support in PowerShell _Core_ - only implement it in _Windows_ PowerShell.

  • (b) Core-WinLegacy-onWindows:

    Provide Windows legacy support in PowerShell _Core_ too, but only _on Windows_.

  • (c) Core-WinLegacy-Everywhere:

    Provide Windows legacy support in PowerShell Core on _all_ platforms, including Unix.

(c) may be what @JamesWTruher had in mind in his RFC, though the RFC mistakenly assumes only having to deal with _1_, _fixed_ single-byte legacy behavior: ASCII - whereas the truth is that a whole raft of _culture-specific_ "ANSI" codepages must be emulated.

—

As for how these options relate to things needed from .NET Core:

(a) clearly needs no additional effort.

(b) also needs no additional effort - the aforementioned NuGet package can bring in all Windows legacy code pages, and [cultureinfo]::CurrentCulture.TextInfo.ANSICodePage and [cultureinfo]::CurrentCulture.TextInfo.OEMCodePage _on Windows_ already tell us what code pages to use as the legacy defaults.

(c) _Will be implemented in .NET CoreCLR v2_: The current pre-v2 version already reports values for [cultureinfo]::CurrentCulture.TextInfo.ANSICodePage and [cultureinfo]::CurrentCulture.TextInfo.OEMCodePage on _Unix too_, performing the aforementioned mapping of Unix locale identifiers to legacy Windows code pages.
There are still _edge cases_ - see below.
That said, if the consensus is to implement (c), these edge cases (see bottom) should be considered an acceptable price to pay for Windows legacy support on Unix.


Note that currently, _Unix_ legacy support is not even part of the debate - while UTF-8-based locales are near-ubiquitous nowadays, Unix platforms still support legacy locales based on single-byte charmaps comparable to Windows legacy code pages.
To be legacy-friendly on Unix platforms, these charmaps would have to be respected too.

.NET Core, with the limited set of code pages it comes with, is _not_ capable of this legacy Unix support, and, as of this writing, the preview version of v2 doesn't change that.


Edge cases when mapping Unix locales to Windows legacy code pages:

As mentioned before, some cultures use _more than one_ script (alphabet); gotta love Wikipedia:

Serbian is practically the only European standard language with complete synchronic digraphia,[15] using both Cyrillic and Latin alphabets;

Thus, when mapping a Serbian Unix locale to a legacy Windows code page, _one or the other_ alphabet must be chosen, because no single legacy code page can represent _both_ alphabets.

_This choice isn't unambiguous._

Here is the list of Serbian (sr_*) locales available on two sample Unix platforms:

  • macOS 10.12:
sr_YU
sr_YU.ISO8859-2
sr_YU.ISO8859-5
sr_YU.UTF-8
  • Ubuntu 16.04:
sr_ME UTF-8
sr_RS UTF-8
sr_RS@latin UTF-8

macOS is stuck in the past (YU representing the long-defunct former Yugoslavia); the .ISO8859-2 and .ISO8859-5 unambiguously imply the Latin and Cyrillic alphabet respectively, but what sr_YU.UTF-8 should map to - given that UTF-8 is capable of encoding _both_ alphabets - is open to interpretation.
If there's always a _preferred_ alphabet for such cultures, perhaps that's a non-issue - I don't know the answer to that.

Ubuntu faces the same issues, with only sr_RS@latin UTF-8 _implying_ the alphabet via the embedded @latin script/variant identifier.

In practice, as of .NET Core v2.0.0-beta-001836-00, _all_ of the above cases default to the _Latin_ alphabet.
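On the .NET side, the script only becomes unambiguous when the culture name carries an explicit script subtag. A small sketch that queries the legacy code pages for the two Serbian variants; it assumes that the culture names sr-Latn-RS and sr-Cyrl-RS are known to the runtime's globalization data, and what it reports may differ by platform and .NET version:

```powershell
# Assumption: both culture names are available in this runtime's culture data.
$scriptVariants = foreach ($name in 'sr-Latn-RS', 'sr-Cyrl-RS') {
    $textInfo = [cultureinfo]::GetCultureInfo($name).TextInfo

    # Report the legacy code pages the runtime associates with each script variant.
    [pscustomobject]@{
        Culture      = $name
        ANSICodePage = $textInfo.ANSICodePage
        OEMCodePage  = $textInfo.OEMCodePage
    }
}

$scriptVariants | Format-Table -AutoSize
```

A Unix locale such as sr_RS UTF-8 carries no such script information, which is why mapping it to a culture requires a default choice.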

@joeyaiello @SteveL-MSFT @mklement0 I opened an issue (as a question) in the .NET Core repo and got great comments about the defaults: https://github.com/dotnet/standard/issues/260#issuecomment-289549508

First just want to say, thanks for the extended explanations, everyone. This has been thoroughly useful. And thanks, @iSazonov, for looping in the dotnet/standard folks.

Now, next steps:

  • I want to tease the encoding issue into as many smaller chunks as possible. This thread was originally about the OEM/Default codepage issue. If we can fix that in Windows with a platform-specific guard today, and fix it later on Linux with .NET Standard and System.Text.Encoding.CodePages, we should do that and close out this issue.
  • As I said in the other thread, I'm not concerned about the legacy Linux encoding scenario. It's uncommon, and we don't have baggage there.
  • The other questions should all be addressed in PowerShell/PowerShell-RFC#71 and fixed in the RFC so that everyone can have a central view of how we're addressing encodings and so the @PowerShell/powershell-committee can sign off on that view.

Sound reasonable?

I pushed a PR with a fix for Default/OEM.

@joeyaiello: Thanks, that indeed sounds reasonable (and I hope it's OK that, as an outsider, I'm being vocal here - I love PowerShell and I think getting the encoding right is crucial to PowerShell's cross-platform success).

@iSazonov Sweet!

@mklement0 I don't consider you an outsider, you're here constructively voicing opinions. Keep it coming :+1:

And for what it's worth, I absolutely agree. We need to nail encodings.

#3467 has been merged; we could close this issue or leave it open while waiting for our Encoding RFC.

Per the plan I posted above, let's close this particular issue as resolved, and drive the RFC to completion via the normal process. If other issues arise after that, we should open new issues.
