PowerShell: Resolve UTF-8, UTF-16, ASCII inconsistencies

Created on 20 Mar 2016 · 22 comments · Source: PowerShell/PowerShell

```
PS> ls > test
BASH> ls >> test
BASH> cat test # Content looks correct
PS> cat test # Content is wrong at the end
```
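A quick way to see what went wrong (a sketch; on Windows PowerShell, `>` writes UTF-16LE with a BOM, while bash appends raw UTF-8 bytes, leaving a mixed-encoding file):

```powershell
# Inspect the first bytes of the mixed file
Format-Hex test | Select-Object -First 2
# Expect FF FE (the UTF-16LE BOM) and two-byte characters from the PowerShell
# half; the bytes bash appended later are single-byte UTF-8.
```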

Committee-Reviewed Issue-Enhancement Resolution-Fixed WG-DevEx-Portability WG-Engine


All 22 comments

Need to discuss this at greater length. Some thoughts:

  • There should probably be a global PS variable for setting the encoding to use
  • This global variable might be different across platforms

@joeyaiello do we need to fix this as a bug? A mixed-encoding file is just... wrong. It's bad input. I'm not really sure what Get-ChildItem should even do about it.

CAT works so why would it be OK for Get-ChildItem to not work?
"If you want things to work, avoid the PS tools?"


  • Linux uses UTF-8 without a BOM by default
  • .NET Core's implementation does not use a BOM
  • Open question: what's the default encoding on OS X? (one way to check is sketched below)
  • PS does add a BOM to output files (we need to investigate exactly where)
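One way to inspect what .NET considers the platform default (a sketch; on .NET Core, `Encoding.Default` reports BOM-less UTF-8, and running this on OS X would answer the open question above):

```powershell
# Report the runtime's default encoding and whether it emits a BOM
[System.Text.Encoding]::Default.EncodingName          # 'Unicode (UTF-8)' on .NET Core
[System.Text.Encoding]::Default.GetPreamble().Length  # 0 means no BOM is emitted
```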

Linux UX sync agrees that "doing the right thing" means making sure that PowerShell's default behavior works best with the tooling and ecosystem where it exists. Therefore, we shouldn't break the existing behavior on Windows, and we _should_ change the behavior on Linux. To "do the right thing" on Linux, we have to make sure we don't ever add a BOM on Linux.

We have to test that a non-BOM UTF-8 file generated on Linux can be read properly on Windows.

These changes would need to be made in the following cmdlets:

  • Out-File
  • Set-Content

We should also create an environment variable or PS variable like $DefaultFileEncoding or $FileEncoding that changes the default behavior of the above cmdlets (a rough approximation with today's tooling is sketched below).
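Until such a variable exists, the effect can be approximated per-session with `$PSDefaultParameterValues` (a sketch, not the proposed mechanism; `utf8` here is just an example value):

```powershell
# Approximate a session-wide default file encoding for the cmdlets above.
# Note: on Windows PowerShell 5.1, 'utf8' still writes a BOM.
$PSDefaultParameterValues['Out-File:Encoding']    = 'utf8'
$PSDefaultParameterValues['Set-Content:Encoding'] = 'utf8'
"hello" > file.txt   # > routes through Out-File, so it picks this up too
```

A later comment in this thread uses the same trick to fix the original redirection repro.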

I validated that .NET _does_ put a BOM in the file (opened in a similar way to how we open files), and that our cmdlets do not do anything specific to add the BOM:

```powershell
PS# $tfile = "$PWD\tfile.txt"
PS# $utf32enc = [text.encoding]::UTF32
PS# $fw = [io.filestream]::new($tfile, ([io.filemode]::CreateNew))
PS# $sw = [io.streamwriter]::new($fw, $utf32enc)
PS# $sw.Flush()
PS# $sw.Dispose()
PS# $fw.Dispose()
PS# format-hex $tfile

           Path: F:\e\rs1d\admin\monad\src\tfile.txt

           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000   FF FE 00 00 00 00 00 00 00 00 00 00 00 00 00 00  .þ..............
```

If you change to UTF-8 encoding, the appropriate BOM is likewise written to the file.
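Conversely, writing UTF-8 _without_ a BOM requires constructing the encoding explicitly (a sketch; the file name is arbitrary):

```powershell
# UTF8Encoding($false) means: do not emit a BOM
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)
[System.IO.File]::WriteAllText("$PWD\nobom.txt", "hello world`n", $utf8NoBom)
format-hex "$PWD\nobom.txt"   # begins 68 65 6C 6C 6F ... - no EF BB BF
```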

@JamesWTruher the results are different on Linux with .NET Core:

Out-File

PowerShell, on Linux, prepends a BOM when using UTF-8 encoding:

> "hello world" | Out-File -Encoding utf8 utf8
> file utf8
utf8: UTF-8 Unicode (with BOM) text

.NET Encoding

``` C#
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            String[] lines =
            {
                $"BufferHeight: {Console.BufferHeight}",
                $"BufferWidth: {Console.BufferWidth}",
                $"WindowHeight: {Console.WindowHeight}",
                $"WindowWidth: {Console.WindowWidth}",
                $"LargestWindowHeight: {Console.LargestWindowHeight}",
                $"LargestWindowWidth: {Console.LargestWindowWidth}",
                $"IsErrorRedirected: {Console.IsErrorRedirected}",
                $"IsOutputRedirected: {Console.IsOutputRedirected}",
                $"IsInputRedirected: {Console.IsInputRedirected}",
                ""
            };

            // Write once with File.CreateText to capture the runtime default
            using (var stream = File.CreateText("Default.log"))
            {
                foreach (var line in lines) { stream.WriteLine(line); }
            }

            // Write the same lines once per explicitly chosen encoding
            var encodings = new Dictionary<String, Encoding>()
            {
                { "UTF8-Default.log", new UTF8Encoding() },
                { "ASCII.log", new ASCIIEncoding() },
                { "UTF8-ExplicitBOM.log", new UTF8Encoding(true) },
                { "Unicode.log", new UnicodeEncoding() },
            };

            foreach (var encoding in encodings)
            {
                using (var file = new FileStream(encoding.Key, FileMode.Create))
                using (var stream = new StreamWriter(file, encoding.Value))
                {
                    foreach (var line in lines) { stream.WriteLine(line); }
                }
            }
        }
    }
}
```

``` sh
$ dotnet run
$ for i in *.log; do file $i; done
ASCII.log: ASCII text
Default.log: ASCII text
Unicode.log: Little-endian UTF-16 Unicode text
UTF8-Default.log: ASCII text
UTF8-ExplicitBOM.log: UTF-8 Unicode (with BOM) text
```

Obviously that text was from other experiments :smile: If you include some non-ASCII text:

``` c#
String[] lines =
{
    "Normal ASCII text",
    "你好"
};
```

You get:

```
ASCII.log: ASCII text
Default.log: UTF-8 Unicode text
Unicode.log: Little-endian UTF-16 Unicode text
UTF8-Default.log: UTF-8 Unicode text
UTF8-ExplicitBOM.log: UTF-8 Unicode (with BOM) text
```

Which makes sense, as .NET properly gives up on encoding non-ASCII characters as ASCII, but encodes them as UTF-8 when it's supposed to:

``` sh
$ cat ASCII.log
Normal ASCII text
??

$ cat UTF8-Default.log
Normal ASCII text
你好
```

.NET Core's UTF8Encoding is specifically set to _not_ emit a BOM by default. (Link obtained from the API browser.)

I think we're just using https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Text/Encoding.cs/#L1542 which passes true so the behavior should be the same on Linux.

OMG the fact that UTF8Encoding and Encoding.UTF8 have different explicit default behaviors is absurd.

I just checked with @eerhardt, and he confirmed this is the same behavior as .NET 4.6.1. The two different "default" UTF-8 constructors differ in BOM usage.
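The difference is easy to confirm from PowerShell, since `GetPreamble()` returns the BOM bytes an encoding emits (a quick sketch):

```powershell
# Two "default" UTF-8 encodings, two different BOM behaviors
[System.Text.UTF8Encoding]::new().GetPreamble().Length  # 0 - new UTF8Encoding() emits no BOM
[System.Text.Encoding]::UTF8.GetPreamble().Length       # 3 - Encoding.UTF8 emits EF BB BF
```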

So, there is no single answer to what .NET defaults to with respect to UTF-8 using a BOM. I logged dotnet/coreclr/issues/5000; but it is highly unlikely to be changed (and wouldn't really matter for us anyway).

I don't really know where this lands us; but at least we know what's going on :smile:

The issue is related to binary pipelines.

After you call a native utility (e.g., ls) in PowerShell, the output is converted to strings -- it's no longer a byte stream. You cannot expect anything good after that.

The best you can do is use PowerShell utilities, since they produce objects rather than byte streams. And the planned binary pipeline will solve these problems.

Q: Why is guessing the encoding a bad idea?
A: A guessed encoding can be wrong. And it can be NOT EVEN WRONG. Imagine a utility that produces a binary stream that cannot be interpreted as a string at all, e.g., one containing bytes that will be read as NUL, or a bitmap.

See #1908 and #1975.
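To illustrate why guessing is unreliable: the same bytes decode to different, equally "valid" strings under different encodings, with nothing in the data to say which was intended (a sketch):

```powershell
# 'hi' encoded as UTF-16LE - every other byte is 0x00
$bytes = [byte[]](0x68, 0x00, 0x69, 0x00)
[System.Text.Encoding]::Unicode.GetString($bytes)  # 'hi'
[System.Text.Encoding]::UTF8.GetString($bytes)     # 'h', NUL, 'i', NUL - also decodes
```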

To handle simple redirection, the following would be possible:

```powershell
$PSDefaultParameterValues['out-file:encoding'] = "ascii"
```

Then the following works just fine (on Linux):

```powershell
gci > test
bash -c 'ls >> test'
bash -c "cat test"
gci test
gc test
```

Having different settings on Linux and Windows would solve the problem, since most Linux apps are just going to output ASCII. I don't believe there's any hope for combinations of apps that mix Unicode and ASCII when attempting to do redirection.

I think we probably need an RFC of some kind for this behavior that answers some of my open questions (although I think they're worth discussing today, because the answers to a few of them may determine the level of abstraction at which that RFC should be written):

  • Given the scarcity of BOM usage w/ UTF-8 text files on Linux, should we have different behavior for *-Content cmdlets?
  • Related: are there any scenarios where reading an ASCII file as a UTF-8 file without a BOM is incorrect?
  • What are some cross-platform scenarios/demos that demonstrate success in this space?
  • Can we break existing Windows behavior here?
  • Do we need some sort of $EncodingPreference (or something like that)?
  • Do redirection operators need to do something special here based on said $EncodingPreference?
  • Should we pull in Invoke-WebRequest/Invoke-RestMethod as part of this work?
  • Do we have an uber-level of platform-specific design behavior that we need to abstract behind some global setting (potentially w/ different platform-specific defaults)?

Other:

  • $OutputEncoding is only used for pipes of native commands currently. It also requires that you do $outputencoding = [System.Text.Encoding]::UTF8 in order to change it. That's not an awesome UX.
  • On Windows, Set-Content uses ASCII by default, Out-File uses UTF-16 by default.

Open still:

  • Do we use a cmdlet or a preference variable?

Closed on:

  • Okay to have different default behaviors on different systems.
  • Right default behavior on Linux is UTF-8 with no BOM.
  • Okay to change all encoding behavior on a single machine (like $OutputEncoding)
  • No change to current default Windows behavior.
  • Need some way to change back to the "WindowsDefault" Encoding value that mixes default behavior for Set-Content and Out-File.

Came here looking for an issue related to this, because > producing UTF-16 by default is maddening.

> Do we need some sort of $EncodingPreference (or something like that)?

Please!

> Can we break existing Windows behavior here?

Less necessary if there's a preference, IMO, though I have never seen a developer respond positively to a file being encoded with UTF-16. If PowerShell 6+ can lead the charge on moving Windows toward UTF-8-without-BOM everywhere that would be amazing.

> Do redirection operators need to do something special here based on said $EncodingPreference?

This seems to me to be the primary purpose of $EncodingPreference in the first place: changing the encoding used by >. Changing the default encoding for Out-File, Set-Content, etc. is a bonus, but at least those support a parameter (unlike >).

Not sure if >> already delegates to Add-Content, but it should behave the same: preserve encoding for an existing file, otherwise create a file with the specified (or default) encoding.
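For illustration, encoding-preserving append could look roughly like this (a sketch, not PowerShell's actual implementation; `Add-LineKeepEncoding` is a hypothetical helper, and only the UTF-16LE BOM is checked, for brevity):

```powershell
function Add-LineKeepEncoding {
    param([string]$Path, [string]$Line)
    # Default for new files: UTF-8 without BOM
    $enc = [System.Text.UTF8Encoding]::new($false)
    if (Test-Path $Path) {
        $bytes = [System.IO.File]::ReadAllBytes($Path)
        # FF FE is the UTF-16LE BOM: keep the existing file's encoding
        if ($bytes.Length -ge 2 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) {
            $enc = [System.Text.Encoding]::Unicode
        }
    }
    [System.IO.File]::AppendAllText($Path, "$Line`n", $enc)
}
```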

We have an RFC open for this work (please add your comments in the issue discussion @dahlbyk) but it will likely not land in beta1.

My humble opinion matches some of those above: the default for Out-File should be UTF-8 without BOM on every system.

For 6.0.0 release we'll be doing a reduced scope version of this https://github.com/PowerShell/PowerShell/issues/4878

@JamesWTruher could you clarify here what wasn't achieved by the implementation that we're shipping in 6.0?

The original issue from @jpsnover no longer repros with RC2 with @JamesWTruher's changes. I believe the change from Jim is that we default $OutputEncoding to utf-8 NoBOM instead of ASCII.
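A quick way to verify on a 6.0 build (a sketch; the reported name may vary by version):

```powershell
# In PowerShell 6+, $OutputEncoding defaults to BOM-less UTF-8
$OutputEncoding.EncodingName           # 'Unicode (UTF-8)'
$OutputEncoding.GetPreamble().Length   # 0 - no BOM
```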
