PowerShell: Resolve UTF-8, UTF-16, ASCII inconsistencies

Created on 20 Mar 2016 · 22 comments · Source: PowerShell/PowerShell

```
PS> ls > test
BASH> ls >> test
BASH> cat test # Content looks correct
PS> cat test # Content is wrong at the end
```
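A quick way to see what went wrong (a sketch; on Windows PowerShell, `>` writes UTF-16LE with a BOM, while bash appends raw UTF-8 bytes, leaving a mixed-encoding file):

```powershell
# Inspect the first bytes of the mixed file
Format-Hex test | Select-Object -First 2
# Expect FF FE (the UTF-16LE BOM) and two-byte characters from the PowerShell
# half; the bytes bash appended later are single-byte UTF-8.
```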

Committee-Reviewed Issue-Enhancement Resolution-Fixed WG-DevEx-Portability WG-Engine


All 22 comments

Need to discuss this at greater length. Some thoughts:

  • There should probably be a global PS variable for setting the encoding to use
  • This global variable might be different across platforms

@joeyaiello do we need to fix this as a bug? A mixed-encoding file is just... wrong. It's bad input. I'm not really sure what Get-ChildItem should even do about it.

CAT works so why would it be OK for Get-ChildItem to not work?
"If you want things to work, avoid the PS tools?"


  • Linux uses UTF-8 without a BOM by default
  • .NET Core's implementation does not use a BOM
  • Open question: what's the default encoding on OS X? (one way to check is sketched below)
  • PS does add a BOM to output files (we need to investigate exactly where)
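One way to inspect what .NET considers the platform default (a sketch; on .NET Core, `Encoding.Default` reports BOM-less UTF-8, and running this on OS X would answer the open question above):

```powershell
# Report the runtime's default encoding and whether it emits a BOM
[System.Text.Encoding]::Default.EncodingName          # 'Unicode (UTF-8)' on .NET Core
[System.Text.Encoding]::Default.GetPreamble().Length  # 0 means no BOM is emitted
```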

Linux UX sync agrees that "doing the right thing" means making sure that PowerShell's default behavior works best with the tooling and ecosystem where it exists. Therefore, we shouldn't break the existing behavior on Windows, and we _should_ change the behavior on Linux. To "do the right thing" on Linux, we have to make sure we don't ever add a BOM on Linux.

We have to test that a non-BOM UTF-8 file generated on Linux can be read properly on Windows.

These changes would need to be made in the following cmdlets:

  • Out-File
  • Set-Content

We should also create an environment variable or PS variable like $DefaultFileEncoding or $FileEncoding that changes the default behavior of the above cmdlets (a rough approximation with today's tooling is sketched below).
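Until such a variable exists, the effect can be approximated per-session with `$PSDefaultParameterValues` (a sketch, not the proposed mechanism; `utf8` here is just an example value):

```powershell
# Approximate a session-wide default file encoding for the cmdlets above.
# Note: on Windows PowerShell 5.1, 'utf8' still writes a BOM.
$PSDefaultParameterValues['Out-File:Encoding']    = 'utf8'
$PSDefaultParameterValues['Set-Content:Encoding'] = 'utf8'
"hello" > file.txt   # > routes through Out-File, so it picks this up too
```

A later comment in this thread uses the same trick to fix the original redirection repro.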

I validated that .NET _does_ put a BOM in the file (opened in a similar way to how we open files), and that our cmdlets do not do anything specific to add the BOM:

```powershell
PS# $tfile = "$PWD\tfile.txt"
PS# $utf32enc = [text.encoding]::UTF32
PS# $fw = [io.filestream]::new($tfile, ([io.filemode]::CreateNew))
PS# $sw = [io.streamwriter]::new($fw, $utf32enc)
PS# $sw.Flush()
PS# $sw.Dispose()
PS# $fw.Dispose()
PS# format-hex $tfile

           Path: F:\e\rs1d\admin\monad\src\tfile.txt

           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000   FF FE 00 00 00 00 00 00 00 00 00 00 00 00 00 00  .þ..............
```

If you change to UTF-8 encoding, the appropriate BOM is likewise written to the file.
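Conversely, writing UTF-8 _without_ a BOM requires constructing the encoding explicitly (a sketch; the file name is arbitrary):

```powershell
# UTF8Encoding($false) means: do not emit a BOM
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)
[System.IO.File]::WriteAllText("$PWD\nobom.txt", "hello world`n", $utf8NoBom)
format-hex "$PWD\nobom.txt"   # begins 68 65 6C 6C 6F ... - no EF BB BF
```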

@JamesWTruher the results are different on Linux with .NET Core:

Out-File

PowerShell, on Linux, prepends a BOM when using UTF-8 encoding:

> "hello world" | Out-File -Encoding utf8 utf8
> file utf8
utf8: UTF-8 Unicode (with BOM) text

.NET Encoding

``` C#
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            String[] lines =
            {
                $"BufferHeight: {Console.BufferHeight}",
                $"BufferWidth: {Console.BufferWidth}",
                $"WindowHeight: {Console.WindowHeight}",
                $"WindowWidth: {Console.WindowWidth}",
                $"LargestWindowHeight: {Console.LargestWindowHeight}",
                $"LargestWindowWidth: {Console.LargestWindowWidth}",
                $"IsErrorRedirected: {Console.IsErrorRedirected}",
                $"IsOutputRedirected: {Console.IsOutputRedirected}",
                $"IsInputRedirected: {Console.IsInputRedirected}",
                ""
            };

            // Write once with File.CreateText to capture the runtime default
            using (var stream = File.CreateText("Default.log"))
            {
                foreach (var line in lines) { stream.WriteLine(line); }
            }

            // Write the same lines once per explicitly chosen encoding
            var encodings = new Dictionary<String, Encoding>()
            {
                { "UTF8-Default.log", new UTF8Encoding() },
                { "ASCII.log", new ASCIIEncoding() },
                { "UTF8-ExplicitBOM.log", new UTF8Encoding(true) },
                { "Unicode.log", new UnicodeEncoding() },
            };

            foreach (var encoding in encodings)
            {
                using (var file = new FileStream(encoding.Key, FileMode.Create))
                using (var stream = new StreamWriter(file, encoding.Value))
                {
                    foreach (var line in lines) { stream.WriteLine(line); }
                }
            }
        }
    }
}
```

``` sh
$ dotnet run
$ for i in *.log; do file $i; done
ASCII.log: ASCII text
Default.log: ASCII text
Unicode.log: Little-endian UTF-16 Unicode text
UTF8-Default.log: ASCII text
UTF8-ExplicitBOM.log: UTF-8 Unicode (with BOM) text
```

Obviously that text was from other experiments :smile: If you include some non-ASCII text:

``` c#
String[] lines =
{
    "Normal ASCII text",
    "你好"
};
```

You get:

```
ASCII.log: ASCII text
Default.log: UTF-8 Unicode text
Unicode.log: Little-endian UTF-16 Unicode text
UTF8-Default.log: UTF-8 Unicode text
UTF8-ExplicitBOM.log: UTF-8 Unicode (with BOM) text
```

Which makes sense, as .NET properly gives up on encoding non-ASCII characters as ASCII, but encodes them as UTF-8 when it's supposed to:

``` sh
$ cat ASCII.log
Normal ASCII text
??

$ cat UTF8-Default.log
Normal ASCII text
你好
```

.NET Core's UTF8Encoding is specifically set to _not_ emit a BOM by default. (Link obtained from the API browser.)

I think we're just using https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Text/Encoding.cs/#L1542 which passes true so the behavior should be the same on Linux.

OMG the fact that UTF8Encoding and Encoding.UTF8 have different explicit default behaviors is absurd.

I just checked with @eerhardt, and he confirmed this is the same behavior as .NET 4.6.1. The two different "default" UTF-8 constructors differ in BOM usage.
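The difference is easy to confirm from PowerShell, since `GetPreamble()` returns the BOM bytes an encoding emits (a quick sketch):

```powershell
# Two "default" UTF-8 encodings, two different BOM behaviors
[System.Text.UTF8Encoding]::new().GetPreamble().Length  # 0 - new UTF8Encoding() emits no BOM
[System.Text.Encoding]::UTF8.GetPreamble().Length       # 3 - Encoding.UTF8 emits EF BB BF
```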

So, there is no single answer to what .NET defaults to with respect to UTF-8 using a BOM. I logged dotnet/coreclr/issues/5000; but it is highly unlikely to be changed (and wouldn't really matter for us anyway).

I don't really know where this lands us; but at least we know what's going on :smile:

The issue is related to binary pipelines.

After you call a native utility (e.g., ls) in PowerShell, the output is converted to strings -- it's no longer a byte stream. You cannot expect anything good after that.

The best you can do is use PowerShell utilities, since they produce objects rather than byte streams. And the planned binary pipeline will solve these problems.

Q: Why is guessing the encoding a bad idea?
A: A guessed encoding can be wrong. And it can be NOT EVEN WRONG. Imagine a utility that produces a binary stream that cannot be interpreted as a string at all, e.g., one containing bytes that will be read as NUL, or a bitmap.

See #1908 and #1975.
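To illustrate why guessing is unreliable: the same bytes decode to different, equally "valid" strings under different encodings, with nothing in the data to say which was intended (a sketch):

```powershell
# 'hi' encoded as UTF-16LE - every other byte is 0x00
$bytes = [byte[]](0x68, 0x00, 0x69, 0x00)
[System.Text.Encoding]::Unicode.GetString($bytes)  # 'hi'
[System.Text.Encoding]::UTF8.GetString($bytes)     # 'h', NUL, 'i', NUL - also decodes
```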

To handle simple redirection, the following would be possible:

```powershell
$PSDefaultParameterValues['out-file:encoding'] = "ascii"
```

Then the following works just fine (on Linux):

```powershell
gci > test
bash -c 'ls >> test'
bash -c "cat test"
gci test
gc test
```

Having different settings on Linux and Windows would solve the problem, since most Linux apps are just going to output ASCII. I don't believe there's any hope for combinations of apps that mix Unicode and ASCII when attempting to do redirection.

I think we probably need an RFC of some kind for this behavior that answers some of my open questions (although I think they're worth discussing today, because the answers to a few of them may determine the level of abstraction at which that RFC should be written):

  • Given the scarcity of BOM usage w/ UTF-8 text files on Linux, should we have different behavior for *-Content cmdlets?
  • Related: are there any scenarios where reading an ASCII file as a UTF-8 file without a BOM is incorrect?
  • What are some cross-platform scenarios/demos that demonstrate success in this space?
  • Can we break existing Windows behavior here?
  • Do we need some sort of $EncodingPreference (or something like that)?
  • Do redirection operators need to do something special here based on said $EncodingPreference?
  • Should we pull in Invoke-WebRequest/Invoke-RestMethod as part of this work?
  • Do we have an uber-level of platform-specific design behavior that we need to abstract behind some global setting (potentially w/ different platform-specific defaults)?

Other:

  • $OutputEncoding is only used for pipes of native commands currently. It also requires that you do $outputencoding = [System.Text.Encoding]::UTF8 in order to change it. That's not an awesome UX.
  • On Windows, Set-Content uses ASCII by default, Out-File uses UTF-16 by default.

Open still:

  • Do we use a cmdlet or a preference variable?

Closed on:

  • Okay to have different default behaviors on different systems.
  • Right default behavior on Linux is UTF-8 with no BOM.
  • Okay to change all encoding behavior on a single machine (like $OutputEncoding)
  • No change to current default Windows behavior.
  • Need some way to change back to the "WindowsDefault" Encoding value that mixes default behavior for Set-Content and Out-File.

Came here looking for an issue related to this, because > producing UTF-16 by default is maddening.

> Do we need some sort of $EncodingPreference (or something like that)?

Please!

> Can we break existing Windows behavior here?

Less necessary if there's a preference, IMO, though I have never seen a developer respond positively to a file being encoded with UTF-16. If PowerShell 6+ can lead the charge on moving Windows toward UTF-8-without-BOM everywhere that would be amazing.

> Do redirection operators need to do something special here based on said $EncodingPreference?

This seems to me to be the primary purpose of $EncodingPreference in the first place: changing the encoding used by >. Changing the default encoding for Out-File, Set-Content, etc. is a bonus, but at least those support a parameter (unlike >).

Not sure if >> already delegates to Add-Content, but it should behave the same: preserve encoding for an existing file, otherwise create a file with the specified (or default) encoding.
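For illustration, encoding-preserving append could look roughly like this (a sketch, not PowerShell's actual implementation; `Add-LineKeepEncoding` is a hypothetical helper, and only the UTF-16LE BOM is checked, for brevity):

```powershell
function Add-LineKeepEncoding {
    param([string]$Path, [string]$Line)
    # Default for new files: UTF-8 without BOM
    $enc = [System.Text.UTF8Encoding]::new($false)
    if (Test-Path $Path) {
        $bytes = [System.IO.File]::ReadAllBytes($Path)
        # FF FE is the UTF-16LE BOM: keep the existing file's encoding
        if ($bytes.Length -ge 2 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) {
            $enc = [System.Text.Encoding]::Unicode
        }
    }
    [System.IO.File]::AppendAllText($Path, "$Line`n", $enc)
}
```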

We have an RFC open for this work (please add your comments in the issue discussion @dahlbyk) but it will likely not land in beta1.

My humble opinion matches some of those above: the default for Out-File should be UTF-8 without BOM on every system.

For 6.0.0 release we'll be doing a reduced scope version of this https://github.com/PowerShell/PowerShell/issues/4878

@JamesWTruher could you clarify here what wasn't achieved by the implementation that we're shipping in 6.0?

The original issue from @jpsnover no longer repros with RC2 with @JamesWTruher's changes. I believe the change from Jim is that we default $OutputEncoding to utf-8 NoBOM instead of ASCII.
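A quick way to verify on a 6.0 build (a sketch; the reported name may vary by version):

```powershell
# In PowerShell 6+, $OutputEncoding defaults to BOM-less UTF-8
$OutputEncoding.EncodingName           # 'Unicode (UTF-8)'
$OutputEncoding.GetPreamble().Length   # 0 - no BOM
```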
