This issue has two distinct aspects:
```powershell
'ö' | Set-Content -NoNewline -Encoding ASCII tmp.txt
'ö' | Add-Content -Encoding ASCII -NoNewline tmp.txt
Get-Content -Encoding ASCII tmp.txt
(Get-Content -Encoding Byte -TotalCount 2 tmp.txt) | % { '0x{0:x}' -f $_ }
'--'
'ö' | Set-Content -NoNewline tmp.txt # use default encoding
'ö' | Add-Content -NoNewline tmp.txt # use default encoding
Get-Content tmp.txt # use default encoding
(Get-Content -Encoding Byte -TotalCount 2 tmp.txt) | % { '0x{0:x}' -f $_ }
```

```
??
0x3f
0x3f
--
??
0x3f
0x3f
??
0x3f
0x3f
--
öö
0xf6
0xf6
```
That is, ASCII encoding turns a non-ASCII character into _literal_ `?` (`0x3f`).
The fact that `Set-Content` without an `-Encoding` argument resulted in `ö` on reading implies that ASCII encoding wasn't used, and the specific byte value of `0xf6` further implies that a single-byte, extended-ASCII encoding was used:
- For _Windows_ PowerShell, it is the _respective_ system's legacy codepage ("ANSI"), such as Windows-1252 on US-English systems, or Windows-1251 on Russian systems. In other words: the specific encoding is, to put it in Unix terms, _locale-dependent_ (see the sketch below).
- For PowerShell _Core_, _as of alpha 16_, it is ISO-8859-1, as @iSazonov helpfully points out (see his comment below for the source-code links).
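For reference, a quick way to see this locale-dependent default in _Windows_ PowerShell (a minimal sketch; the specific names and code pages in the comments are assumptions for a US-English system):

```powershell
# Windows PowerShell (.NET Framework): Encoding.Default reflects the system's legacy "ANSI" code page.
[System.Text.Encoding]::Default.EncodingName   # e.g., 'Western European (Windows)' on a US-English system
[System.Text.Encoding]::Default.CodePage       # e.g., 1252 (1251 on a Russian system)
```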
In contrast, `Get-Help Set-Content`, `Get-Help Add-Content`, and `Get-Help Get-Content` state for parameter `-Encoding`:

> Specifies the file encoding. The default is ASCII.
The help-topic sources (branch `live`) for the relevant cmdlets can be found here.
Additionally:
While these cmdlets _accept_ an encoding identifier `Default`, as used in other cmdlets, the help only mentions `String`.
Given that the two appear to result in the same encoding - what is their relationship?
The description for encoding `String` in the online help is inadequate:

> Uses the encoding type for a string.
PowerShell Core v6.0.0-alpha (v6.0.0-alpha.16) on Microsoft Windows 10 Pro (64-bit; v10.0.14393)
The default is changed for PowerShell Core to ISO-8859-1; for OEM it is ISO-8859-1 too (file provider: `GetContentReader`, `GetContentWriter`). This is a real regression (breaking change) and we should fix it.
The `Default` is changed for PowerShell Core to ISO-8859-1; for `OEM` it is ISO-8859-1 too.

Let's look at PowerShell FullCLR:
- `Default` uses the GetACP function ("Retrieves the current Windows ANSI code page identifier for the operating system."). See the .Net Framework Reference.
- `OEM` exists only in PowerShell and uses the GetOEMCP function ("Returns the current original equipment manufacturer (OEM) code page identifier for the operating system."). See the PowerShell code.

I don't know why both encodings were added in PowerShell. Maybe the PowerShell PG can comment on this.
- `Default` is still not released in CoreCLR, so it is an external issue. Maybe the PowerShell PG can find this out internally with the .Net team. And should we use the waiting-netstandart20 label?
- `OEM` is not in CoreCLR at all, so it seems to be an internal issue. Does it make sense to fix it?
@mklement0 See the .Net Framework Reference. We cannot use Windows-1252 as the default. Here `Default` is not the PowerShell default, it is the OS default; every system can have its own default. The original .Net code uses `GetACP()` to get the OS default code page. Modern Unix uses `en_US.UTF-8` as the default.
cc @JamesWTruher @BrucePay
@mklement0 Current Windows PowerShell behavior is based on the .Net Framework and it is dynamic:
System | FileSystemCmdletProviderEncoding.Default (GetACP()) | FileSystemCmdletProviderEncoding.OEM (GetOEMCP())
-|-|-
Windows English | 1252 | 437
Windows Russian | 1251 | 866
Current PowerShell Core behavior is hard-coded to ISO-8859-1. If we change it to Windows-1252 we still won't get Windows PowerShell behavior:
System | FileSystemCmdletProviderEncoding.Default (GetACP()) | FileSystemCmdletProviderEncoding.OEM (GetOEMCP())
-|-|-
Windows English | 1252 | 437
Windows Russian | 1252 | 437
Preferred solution is to fix this in CoreCLR.
As you can see, we can then properly read and write files on Windows Russian with the 1251 (system default) code page via `Get-Content -Encoding Default` and `Set-Content -Encoding Default`.
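For example, a round trip on such a system might look like this (a sketch; it assumes a Russian-locale Windows machine whose ANSI code page is 1251):

```powershell
# Sketch: round-trip a Cyrillic string via the system's "ANSI" code page using -Encoding Default.
'привет' | Set-Content -NoNewline -Encoding Default tmp.txt
Get-Content -Encoding Default tmp.txt                                        # -> привет
(Get-Content -Encoding Byte -TotalCount 2 tmp.txt) | % { '0x{0:x}' -f $_ }   # single-byte CP1251 values
```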
As for Unix locale-name mapping, Unix uses standard names - see the Table of locales (this is not a complete sample). Therefore, the fix in CoreCLR will not be too difficult.
We should wait for the MSFT experts' conclusion; then it will be clear what we need to fix in the code and documentation.
@mklement0 We have a RFC draft for this. Welcome to discuss https://github.com/PowerShell/PowerShell-RFC/issues/71
@iSazonov: Good idea. I've cleaned up my comments here, and I've revised the original post to point to the relevant other issues / comments / RFC.
@mklement0

> Are the help topics open-sourced too?

See the full PowerShell repo list: https://github.com/PowerShell/

@mklement0 more specifically https://github.com/powershell/powershell-docs
Thanks - I've added a link to the original post that links specifically to the parent folder of the relevant cmdlets' help-topic sources.
@iSazonov do you know of an issue where .NET Core is tracking this problem? You're right, ideally the fix is made in .NET Core.
@joeyaiello:
I don't think there is a fix to be made in .NET Core, or any variant of .NET, for that matter.
PowerShell, from v1 on, decided to do its own thing, _separate from the .NET framework_, which at its core (no pun intended) has always been UTF-8-without-BOM-based - and .NET Core is no exception.
Therefore, aligning PS with .NET's defaults - if chosen - must be a very deliberate act, carefully weighing potentially breaking backward-compatibility against the gains (I do think it's worth doing, however).
Similarly, with respect to providing _legacy Windows_ PowerShell behavior on _Unix_ platforms, the onus is on _PowerShell_, not .NET Core.
@joeyaiello I have not found such an issue in the CoreFX repo. We should create a new one. I would prefer that the PowerShell team create it, because it can cause a large discussion.
Wow, there's a lot there. Let me clarify a few things:
First, my ask to @iSazonov was purely around the codepage issue he referenced above. I admit I skimmed the problem a little too quickly and mistook `FileSystemCmdletProviderEncoding` for a CoreCLR type rather than for our type. I had assumed this was similar to #2009 where we were reading an inaccurate value given to us by CoreCLR. If "current PowerShell Core behavior is hard-coded to ISO-8859-1" means that we're hardcoding a value in `FileSystemCmdletProviderEncoding`, let's fix that. (I'm hoping this one is noncontroversial because the behavior in Windows PowerShell is already the correct behavior.)
Before diving into the rest of the problems you raised, I should also note that I did not intend to put the onus on .NET Core for addressing the myriad of other problems we have around file encodings. As I've read it, the heart of this issue (#3248, not #707 which talks about the problem more generally) is that we're not following the same behavior as Windows PowerShell today because we're not respecting the codepage associated with a given machine's locale. What I still don't fully understand (and what we need to answer) is whether that is happening because of PowerShell or .NET Core. No matter what we do in the rest of the encoding space, `Default` and `OEM` need to work properly. If everyone agrees that I'm capturing the essence of this issue properly, I'll change the title to reflect it.
Now, to the more general problem: I am the first to admit that PowerShell's approach to encoding is a horrible mish-mash of inconsistent behaviors. That's why we have an RFC out that's intended to create more sane defaults on Linux (and for those people willing to change their defaults on Windows) while also maintaining the legacy mish-mash for those who have written scripts already to work around it. As @mklement0, @iSazonov, and others have already been doing, I highly encourage anyone who cares about the encoding problem to give us feedback on that RFC.
Additional side notes:
- `Unicode` is the enum value associated with UTF-16, despite the technical inaccuracy of that label.

@joeyaiello: Thanks for that detailed response.

> What I still don't fully understand (and what we need to answer) is whether that is happening because of PowerShell or .NET Core
The _Windows_ perspective:
I may have spoken too soon when I said that no .NET Core fix is needed, though PowerShell could do its own implementation, if needed:
The current PS Core source code contains this comment:
```csharp
#if CORECLR
    // Encoding.Default is not in CoreCLR
    // As suggested by CoreCLR team (tarekms), use latin1 (ISO-8859-1, CodePage 28591) as the default encoding.
    // We will revisit this if it causes any failures when running tests on Core PS.
    s_defaultEncoding = Encoding.GetEncoding(28591);
```
and a similar comment re OEM encoding.
(As has been discussed, using ISO-8859-1 is inadequate, primarily because it doesn't respect the _variable_ Windows legacy system locale, and secondarily because it doesn't even cover all characters in the most widely used "ANSI" codepage, Windows-1252.)
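To illustrate the coverage gap (a sketch; it assumes the code-pages provider discussed below is available): the euro sign is representable in Windows-1252 but not in ISO-8859-1, which falls back to a literal `?`:

```powershell
# Sketch: Windows-1252 encodes '€' as 0x80; ISO-8859-1 (code page 28591) has no mapping and substitutes '?' (0x3f).
$cp1252 = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252)
$latin1 = [System.Text.Encoding]::GetEncoding(28591)
$cp1252.GetBytes('€') | % { '0x{0:x}' -f $_ }   # 0x80
$latin1.GetBytes('€') | % { '0x{0:x}' -f $_ }   # 0x3f
```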
While `[System.Text.Encoding]::Default` isn't part of the .NET contract, it _is_ available, and so is the in-contract equivalent, `[System.Text.Encoding]::GetEncoding(0)`.
What .NET Core returns for it, however, is a _UTF-8_ encoding _with_ BOM, even on Windows - whereas on _Windows_ it arguably should return the "ANSI" encoding (the active code page implied by the system locale), as the .NET _Framework_ does.
However, the majority of the _Windows_ code pages are _not_ part of .NET Core, notably missing Windows-1252 and any of the OEM code pages.
An optional NuGet package does make them all available in .NET Core, even on Unix, however (as demonstrated here). That package already seems to be part of PS Core _at runtime_, actually, as evidenced by `[System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252)` succeeding, even on Unix.
With this package, PS Core could fix the issue even without any changes to .NET Core, by passing the code-page identifiers returned by `[cultureinfo]::CurrentCulture.TextInfo.ANSICodePage` and `[cultureinfo]::CurrentCulture.TextInfo.OEMCodePage` (which presumably call the `GetACP` ("ANSI") and `GetOEMCP` (OEM) Windows API functions that @iSazonov mentions) to `[System.Text.CodePagesEncodingProvider]::Instance.GetEncoding()`.
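A minimal sketch of that approach (assuming the code-pages package is loadable and that the culture reports non-zero code pages, which currently it only does on Windows):

```powershell
# Sketch: derive the legacy "ANSI" and OEM encodings from the current culture
# via the optional code-pages provider, instead of a hard-coded ISO-8859-1.
$ti   = [cultureinfo]::CurrentCulture.TextInfo
$ansi = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding($ti.ANSICodePage)   # e.g., 1252 / 1251
$oem  = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding($ti.OEMCodePage)    # e.g., 437 / 866
'{0} / {1}' -f $ansi.EncodingName, $oem.EncodingName
```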
As for the _Unix_ perspective:
First and foremost: Is it worth trying to emulate the _Windows_ encoding behavior on Unix, using _Windows_ encodings, by mapping the Unix locales onto Windows code pages?
@iSazonov links to an (incomplete) table from the Moodle (a CMS) docs that seemingly provides this kind of mapping, but I see a problem with that: consider `bs_BA.UTF-8` - Bosnian is written in both the Latin and Cyrillic alphabets, and this locale, due to using UTF-8, is capable of representing _both_ writing systems. To map that onto a Windows code page, you must _choose_ between Windows-1250 (Latin) and Windows-1251 (Cyrillic), and there's no obvious choice. Perhaps a compromise is to expose the Windows codepages as distinct `-Encoding` values, so that an explicit choice is possible (e.g., `-Encoding 1250`).
@joeyaiello: As for the side notes:
> Unfortunately, after a deep-dive investigation, we know that the behavior in .NET Framework has not always been purely non-BOM (and in fact, CoreCLR still maintains this inconsistency for back-compat).
While the differing BOM behavior between `[System.Text.Encoding]::UTF8` and `[System.Text.UTF8Encoding]` is regrettable, the true _default_ behavior has always been _BOM-less_ UTF-8. By true default behavior I mean what happens when you use `[System.IO.File]` methods _without specifying an encoding_:
- On _writing_, a BOM-less UTF-8 file is created.
- On _reading_, a BOM-less file is interpreted as UTF-8.
This is in line with what modern Unix platforms do (with a now-ubiquitous UTF-8-based locale in effect).
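A quick way to verify that default (a sketch, reusing the 'ö' example from the top of this issue):

```powershell
# Sketch: System.IO.File without an explicit encoding writes BOM-less UTF-8 and reads BOM-less files as UTF-8.
$path = Join-Path ([IO.Path]::GetTempPath()) 'bomtest.txt'
[System.IO.File]::WriteAllText($path, 'ö')                        # no encoding specified
[System.IO.File]::ReadAllBytes($path) | % { '0x{0:x}' -f $_ }     # 0xc3 0xb6 - no EF BB BF BOM prefix
[System.IO.File]::ReadAllText($path)                              # -> ö
```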
I collected all links for easier review.

- `FileSystemCmdletProviderEncoding` _enum_ - `Byte` is for BinaryStreams and is processed by separate code; the other members are converted to the `System.Text.Encoding` type.
- `EncodingConversion` _class_ - unlike `FileSystemCmdletProviderEncoding`, the `EncodingConversion` can be easily enhanced (to support a dynamic list of all installed codepages); its values are likewise converted to the `System.Text.Encoding` type.
- `ArgumentToEncodingNameTransformationAttribute` in Send-MailMessage.
- `ValidateEncoding` attribute for cmdlet parameters.
- `Byte` and exclude the `FileSystemCmdletProviderEncoding` enum.

Both types use `ClrFacade.GetDefaultEncoding()` and `ClrFacade.GetOEMEncoding()`.
`Default`
- For PowerShell CoreCLR: Encoding.GetEncoding(28591) - ISO-8859-1.
- For PowerShell FullCLR: Encoding.Default (Encoding.Default and CreateDefaultEncoding()) - it is Win32Native.GetACP() for the .Net Framework. (So Windows PowerShell uses a native call and works well.)
- The .Net Framework uses UTF-8 as the default for Silverlight.

`OEM`
- For the .Net Framework it is not a supported encoding.
- For PowerShell CoreCLR: GetDefaultEncoding() - ISO-8859-1.
- For PowerShell FullCLR: NativeMethods.GetOEMCP() - PowerShell already has the P/Invoke for CoreCLR too. So we can fix OEM for PowerShell Core on Windows. But for Unix we should get an answer: what should GetOEMCP() return? (E.g.: `GetACP()` returns the "system/machine locale", `GetOEMCP()` returns the "current session locale".)
It seems we cannot use `[cultureinfo]::CurrentCulture.TextInfo.OEMCodePage` (from @mklement0) because it always returns null on Unix (I tested on WSL only). (It is in CoreCLR CultureInfo and CultureData.)
Also, we never talked about iOS, which has its own codepages.
Here you can see that `Default` is an external issue and should be fixed in CoreFX, while `OEM` is an internal issue and should be fixed in the PowerShell repo.
- The PowerShell Core default is ISO-8859-1.
- The HTML5 default is UTF-8 (see the HTML 5 rules for determining content type).
- It seems HTTP/1.1 is UTF-8 too: https://tools.ietf.org/html/rfc7231
- Currently CoreFX already uses UTF-8 as the default.
@iSazonov: Great compilation of links, thanks.
Yes, `[cultureinfo]::CurrentCulture.TextInfo.ANSICodePage` and `[cultureinfo]::CurrentCulture.TextInfo.OEMCodePage` appear to be empty on all Unix platforms at the moment - though it appears that the groundwork has been laid in LocaleDataUnix.cs and CultureDataUnix.cs - including mapping to Windows LCIDs (which is the part I think can't be done unambiguously - unless all Unix locale identifiers, across all platforms, have a `@<script>` suffix that denotes the charset).
Also, what I didn't consider earlier is that, of course, all culture/locale aspects _aside_ from legacy character encoding (date format, number format, ...) _must_ work in .NET Core on Unix platforms too, and it appears that that's already the case.
I don't know what the .NET Core team's plans are, but _conceivably_ - as suggested by not including the majority of Windows code pages - the intent is to _not_ support legacy Windows code pages - and let's not forget that they're _legacy_ for a reason.
In today's world, with Unicode as the lingua franca, the question really shifts from what the active, _culture-specific, mutually incompatible code page_ is to what _encoding_ of the _universal alphabet (Unicode)_ should be the default.
In other words: using Unicode has taken culture specificity out of the equation and reduced the question to _which encoding of Unicode_ should be assumed _by default_ when faced with an unmarked byte stream (file).
Windows - sorta, kinda, but not yet, really - settling on UTF-16 LE is, unfortunately, at odds with the Unix world, which has chosen UTF-8 (which allowed for a much smoother transition from legacy encodings, though it is Western-centric).
So if we take _legacy_ behavior out of the picture, what _default encoding of Unicode_ that is applied _in the absence of a BOM_ should officially be used, per each platform's policies?
Unix: UTF-8 - and the transition from legacy single-byte encodings _has_ happened.
Windows: UTF-16 LE - but the transition from legacy single-byte encodings has _not_ happened, to this day.
Perhaps the fact that Windows never truly transitioned to UTF-16 LE - at least in terms of _file_ encoding - is an _opportunity_ to align the two worlds, by both making them speak (by definition BOM-less) UTF-8 _by default_.
If so,
- the only change to .NET Core that's required is to change `[System.Text.Encoding]::GetEncoding(0)` to _BOM-less_ UTF-8, on both Windows and Unix - as stated, the `[System.IO.File]` type already does that by default.
- anyone wishing to use _legacy_-encoded files in PowerShell Core needs to re-encode them to _some_ encoding of Unicode: either BOM-less UTF-8, or one of the BOM-prefixed standard Unicode encoding forms (UTF-8, UTF-16, UTF-32) - see the sketch below.
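Such a re-encoding could look like this (a sketch; it assumes the legacy file is Windows-1252-encoded and that the code-pages provider is available):

```powershell
# Sketch: read a legacy Windows-1252 ("ANSI") file and rewrite it as BOM-less UTF-8.
$cp1252 = [System.Text.CodePagesEncodingProvider]::Instance.GetEncoding(1252)
$text   = [System.IO.File]::ReadAllText('legacy.txt', $cp1252)
[System.IO.File]::WriteAllText('legacy.txt', $text)   # the default write encoding is BOM-less UTF-8
```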
Do we really need to support _legacy_ Windows code pages in PowerShell _Core_ (whether on Unix or Windows)?
> Do we really need to support legacy Windows code pages in PowerShell Core (whether on Unix or Windows)?
There is a huge amount of software that only works with legacy code pages, although the number has decreased since 2003.
@iSazonov: I hear you, but do users who rely on this kind of software need this legacy support in PowerShell _Core_? Aren't users who rely on legacy _Windows_ applications likely to stay in the realm of _Windows_ PowerShell?
I see the following fundamental options (from a purely _conceptual_ standpoint - ease of implementation is not being considered here).
I think it's important to get clarity with respect to what _should_ be implemented - as opposed to what _can_ be.
Based on the appropriate way forward, we can assess what aspects should be implemented by the underlying .NET Core rather than in PowerShell Core.
(a) Core-NoWinLegacy:
Forgo all Windows legacy support in PowerShell _Core_ - only implement it in _Windows_ PowerShell.
(b) Core-WinLegacy-onWindows:
Provide Windows legacy support in PowerShell _Core_ too, but only _on Windows_.
(c) Core-WinLegacy-Everywhere:
Provide Windows legacy support in PowerShell Core on _all_ platforms, including Unix.
(c) may be what @JamesWTruher had in mind in his RFC, though the RFC mistakenly assumes only having to deal with a _single_, _fixed_ single-byte legacy encoding - ASCII - whereas the truth is that a whole raft of _culture-specific_ "ANSI" codepages must be emulated.
—
As for how these options relate to things needed from .NET Core:
(a) clearly needs no additional effort.
(b) also needs no additional effort - the aforementioned NuGet package can bring in all Windows legacy code pages, and `[cultureinfo]::CurrentCulture.TextInfo.ANSICodePage` and `[cultureinfo]::CurrentCulture.TextInfo.OEMCodePage` _on Windows_ already tell us what code pages to use as the legacy defaults.
(c) _Will be implemented in .NET CoreCLR v2_: The current pre-v2 version already reports values for `[cultureinfo]::CurrentCulture.TextInfo.ANSICodePage` and `[cultureinfo]::CurrentCulture.TextInfo.OEMCodePage` on _Unix too_, performing the aforementioned mapping of Unix locale identifiers to legacy Windows code pages.
There are still _edge cases_ - see below.
That said, if the consensus is to implement (c), these edge cases (see bottom) should be considered an acceptable price to pay for Windows legacy support on Unix.
Note that currently, _Unix_ legacy support is not even part of the debate - while UTF-8-based locales are near-ubiquitous nowadays, Unix platforms still support legacy locales based on single-byte charmaps comparable to Windows legacy code pages.
To be legacy-friendly on Unix platforms, these charmaps would have to be respected too.
.NET Core, with the limited set of code pages it comes with, is _not_ capable of this legacy Unix support, and, as of this writing, the preview version of v2 doesn't change that.
Edge cases when mapping Unix locales to Windows legacy code pages:
As mentioned before, some cultures use _more than one_ script (alphabet); gotta love Wikipedia:

> Serbian is practically the only European standard language with complete synchronic digraphia,[15] using both Cyrillic and Latin alphabets;
Thus, when mapping a Serbian Unix locale to a legacy Windows code page, _one or the other_ alphabet must be chosen, because no single legacy code page can represent _both_ alphabets.
_This choice isn't unambiguous._
Here is the list of Serbian (`sr_*`) locales available on two sample Unix platforms:

macOS:
sr_YU
sr_YU.ISO8859-2
sr_YU.ISO8859-5
sr_YU.UTF-8

Ubuntu:
sr_ME UTF-8
sr_RS UTF-8
sr_RS@latin UTF-8
macOS is stuck in the past (`YU` representing the long-defunct former Yugoslavia); the `.ISO8859-2` and `.ISO8859-5` suffixes unambiguously imply the Latin and Cyrillic alphabets respectively, but what `sr_YU.UTF-8` should map to - given that UTF-8 is capable of encoding _both_ alphabets - is open to interpretation. If there's always a _preferred_ alphabet for such cultures, perhaps that's a non-issue - I don't know the answer to that.
Ubuntu faces the same issues, with only `sr_RS@latin UTF-8` _implying_ the alphabet via the embedded `@latin` script/variant identifier.
In practice, as of .NET Core v2.0.0-beta-001836-00, _all_ of the above cases default to the _Latin_ alphabet.
@joeyaiello @SteveL-MSFT @mklement0 I opened an issue/question in the .Net Core repo and got great comments about the defaults: https://github.com/dotnet/standard/issues/260#issuecomment-289549508
First just want to say, thanks for the extended explanations, everyone. This has been thoroughly useful. And thanks, @iSazonov, for looping in the dotnet/standard folks.
Now, next steps:
The `OEM`/`Default` codepage issue: If we can fix that in Windows with a platform-specific guard today, and fix it later on Linux with .NET Standard and `System.Text.Encoding.CodePages`, we should do that and close out this issue. Sound reasonable?
I pushed a PR with a fix for Default/OEM.
@joeyaiello: Thanks, that indeed sounds reasonable (and I hope it's OK that, as an outsider, I'm being vocal here - I love PowerShell and I think getting the encoding right is crucial to PowerShell's cross-platform success).
@iSazonov Sweet!
@mklement0 I don't consider you an outsider, you're here constructively voicing opinions. Keep it coming :+1:
And for what it's worth, I absolutely agree. We need to nail encodings.
Per the plan I posted above, let's close this particular issue as resolved, and drive the RFC to completion via the normal process. If other issues arise after that, we should open new issues.