PS> ls > test
BASH> ls >> test
BASH> cat test # Content looks correct
PS> cat test # Content is wrong at the end
Need to discuss this at greater length. Some thoughts:
@joeyaiello do we need to fix this as a bug? A mixed-encoding file is just... wrong. It's bad input. I'm not really sure what Get-ChildItem should even do about it.
CAT works so why would it be OK for Get-ChildItem to not work?
"If you want things to work avoid the PS tools?"
Linux UX sync agrees that "doing the right thing" means making sure that PowerShell's default behavior works best with the tooling and ecosystem where it exists. Therefore, we shouldn't break the existing behavior on Windows, and we _should_ change the behavior on Linux. To "do the right thing" on Linux, we have to make sure we don't ever add a BOM on Linux.
We have to test that the non-BOM UTF-8 file generated on Linux can be read properly on Windows.
These changes would need to be made in the following cmdlets:
We should also create an environment or PS variable like $DefaultFileEncoding or $FileEncoding that changes the default behavior of the above cmdlets.
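As a rough sanity check for the "written BOM-less on Linux, readable on Windows" requirement above, here is a minimal C# sketch (the file name and strings are made up for this example) showing that new UTF8Encoding(false) writes no preamble and that a default StreamReader still reads the text back correctly:

```csharp
using System;
using System.IO;
using System.Text;

class BomlessRoundTrip
{
    static void Main()
    {
        const string path = "no-bom.txt";          // hypothetical test file
        string original = "héllo wörld 你好";

        // Write UTF-8 *without* a BOM, which is what Linux tooling expects.
        // UTF8Encoding(false) explicitly suppresses the byte order mark.
        using (var writer = new StreamWriter(path, append: false, encoding: new UTF8Encoding(false)))
        {
            writer.WriteLine(original);
        }

        // Read it back the way a Windows-side consumer would: StreamReader
        // falls back to UTF-8 when no BOM is present, so the text survives.
        using (var reader = new StreamReader(path, detectEncodingFromByteOrderMarks: true))
        {
            string roundTripped = reader.ReadLine();
            Console.WriteLine(roundTripped == original
                ? "Round-trip OK (no BOM needed)"
                : "Round-trip FAILED");
        }

        // Confirm the file starts with the text itself, not EF BB BF.
        byte[] head = File.ReadAllBytes(path);
        Console.WriteLine($"First byte: 0x{head[0]:X2}");   // 'h' == 0x68, not 0xEF
    }
}
```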
I validated that .NET _does_ put a BOM in the file (in a similar way that we open files), and that our cmdlets do not do anything specific to add the BOM.
PS# $tfile = "$PWD\tfile.txt"
PS# $utf32enc = [text.encoding]::UTF32
PS# $fw = [io.filestream]::New($tfile,([io.filemode]::CreateNew))
PS# $sw = [io.streamwriter]::New($fw,$utf32enc)
PS# $sw.flush()
PS# $sw.dispose()
PS# $fw.dispose()
PS# format-hex $tfile
Path: F:\e\rs1d\admin\monad\src\tfile.txt
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 FF FE 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .þ..............
It is also the case that if you change to UTF-8 encoding, the appropriate BOM is written to the file.
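For comparison with the UTF-32 example above, here is a minimal C# sketch (hypothetical file name) showing that a StreamWriter constructed with the Encoding.UTF8 property writes the EF BB BF preamble before any content:

```csharp
using System;
using System.IO;
using System.Text;

class Utf8BomDemo
{
    static void Main()
    {
        const string path = "utf8-bom.txt";   // hypothetical output file

        // Encoding.UTF8 (the static property) carries a BOM preamble,
        // so StreamWriter emits EF BB BF before any content.
        using (var fs = new FileStream(path, FileMode.Create))
        using (var sw = new StreamWriter(fs, Encoding.UTF8))
        {
            sw.Write("hello");
        }

        byte[] bytes = File.ReadAllBytes(path);
        // Expected: EF-BB-BF-68-65-6C-6C-6F
        Console.WriteLine(BitConverter.ToString(bytes));
    }
}
```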
@JamesWTruher the results are different on Linux with .NET Core:
PowerShell, on Linux, prepends a BOM when using UTF-8 encoding:
> "hello world" | Out-File -Encoding utf8 utf8
> file utf8
utf8: UTF-8 Unicode (with BOM) text
```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            String[] lines =
            {
                $"BufferHeight: {Console.BufferHeight}",
                $"BufferWidth: {Console.BufferWidth}",
                $"WindowHeight: {Console.WindowHeight}",
                $"WindowWidth: {Console.WindowWidth}",
                $"LargestWindowHeight: {Console.LargestWindowHeight}",
                $"LargestWindowWidth: {Console.LargestWindowWidth}",
                $"IsErrorRedirected: {Console.IsErrorRedirected}",
                $"IsOutputRedirected: {Console.IsOutputRedirected}",
                $"IsInputRedirected: {Console.IsInputRedirected}",
                ""
            };

            using (var stream = File.CreateText("Default.log"))
            {
                foreach (var line in lines) { stream.WriteLine(line); }
            }

            var encodings = new Dictionary<String, Encoding>()
            {
                { "UTF8-Default.log", new UTF8Encoding() },
                { "ASCII.log", new ASCIIEncoding() },
                { "UTF8-ExplicitBOM.log", new UTF8Encoding(true) },
                { "Unicode.log", new UnicodeEncoding() },
            };

            foreach (var encoding in encodings)
            {
                using (var file = new FileStream(encoding.Key, FileMode.Create))
                using (var stream = new StreamWriter(file, encoding.Value))
                {
                    foreach (var line in lines) { stream.WriteLine(line); }
                }
            }
        }
    }
}
```

```sh
$ dotnet run
$ for i in *.log; do file $i; done
ASCII.log: ASCII text
Default.log: ASCII text
Unicode.log: Little-endian UTF-16 Unicode text
UTF8-Default.log: ASCII text
UTF8-ExplicitBOM.log: UTF-8 Unicode (with BOM) text
```

Obviously that text was from other experiments :smile: If you include some non-ASCII text:
```csharp
String[] lines =
{
    "Normal ASCII text",
    "你好"
};
```
You get:
ASCII.log: ASCII text
Default.log: UTF-8 Unicode text
Unicode.log: Little-endian UTF-16 Unicode text
UTF8-Default.log: UTF-8 Unicode text
UTF8-ExplicitBOM.log: UTF-8 Unicode (with BOM) text
Which makes sense, as .NET is properly giving up on encoding non-ASCII characters as ASCII, but encodes them as UTF-8 when it's supposed to:
``` sh
$ cat ASCII.log
Normal ASCII text
??
$ cat UTF8-Default.log
Normal ASCII text
你好
```

.NET Core's UTF8Encoding is specifically set to _not_ emit a BOM by default. (Link obtained from the API browser.)
I think we're just using https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Text/Encoding.cs/#L1542 which passes true, so the behavior should be the same on Linux.
OMG the fact that UTF8Encoding and Encoding.UTF8 have different explicit default behaviors is absurd.
I just checked with @eerhardt, and he confirmed this is the same behavior as .NET 4.6.1. The two different "default" UTF-8 constructors differ in BOM usage.
So, there is no single answer to what .NET defaults to with respect to UTF-8 using a BOM. I logged dotnet/coreclr/issues/5000, but it is highly unlikely to be changed (and it wouldn't really matter for us anyway).
I don't really know where this lands us, but at least we know what's going on :smile:
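To make the two "defaults" concrete, here is a tiny C# sketch that just prints the preamble lengths: the constructor default carries no BOM, while the explicit opt-in and the Encoding.UTF8 property do.

```csharp
using System;
using System.Text;

class PreambleCheck
{
    static void Main()
    {
        // The constructor default: no BOM.
        Console.WriteLine(new UTF8Encoding().GetPreamble().Length);        // 0
        // Explicit opt-in: BOM.
        Console.WriteLine(new UTF8Encoding(true).GetPreamble().Length);    // 3 (EF BB BF)
        // The static property: BOM, because it is constructed with the
        // "emit UTF-8 identifier" flag set internally.
        Console.WriteLine(Encoding.UTF8.GetPreamble().Length);             // 3
    }
}
```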
The issue is related to binary pipelines. After you call a native utility (e.g., ls) in PowerShell, the output is converted to a string -- it's no longer a byte stream. You cannot expect anything good after that.
The best you can do is to use PowerShell utilities, since they produce objects rather than byte streams. And the expected binary pipeline will solve these problems.
Q: Why is guessing the encoding a bad idea?
A: Guessing the encoding can be wrong. And it can be NOT EVEN WRONG. Imagine a utility that produces a binary stream that cannot be interpreted as a string, e.g., one that contains what will be interpreted as 0, or that is a bitmap.
See #1908 and #1975.
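A small C# sketch of the "not even wrong" point (the byte values are arbitrary examples): once a byte stream that isn't valid text has been forced through a string, re-encoding it does not recover the original bytes.

```csharp
using System;
using System.Text;

class BytesAreNotStrings
{
    static void Main()
    {
        // Pretend this is the raw output of a native utility: it is not
        // valid UTF-8 text (it contains 0x00 and an invalid byte 0xFF).
        byte[] original = { 0x50, 0x00, 0xFF, 0x42 };

        // Decoding to a string forces an interpretation...
        string asText = Encoding.UTF8.GetString(original);

        // ...and re-encoding does not give the original bytes back:
        // the invalid 0xFF became U+FFFD, which is EF BF BD in UTF-8.
        byte[] roundTripped = Encoding.UTF8.GetBytes(asText);

        Console.WriteLine(BitConverter.ToString(original));      // 50-00-FF-42
        Console.WriteLine(BitConverter.ToString(roundTripped));  // 50-00-EF-BF-BD-42
    }
}
```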
To handle simple redirection, the following would be possible:
$PSDefaultParameterValues['out-file:encoding'] = "ascii"
then the following works just fine (on Linux):
gci > test
bash -c 'ls >> test'
bash -c "cat test"
gci test
Having different settings on Linux and Windows would solve the problem, since most Linux apps are just going to output ASCII. I don't believe there's any hope for combinations of apps that mix Unicode and ASCII output when attempting to do redirection.
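One possible shape of "different settings on Linux and Windows", as a hedged C# sketch; the helper name and the choice of UTF-16 as the Windows default are assumptions for illustration, not a decided design:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class DefaultFileEncoding
{
    // Hypothetical helper: pick the default file encoding per platform,
    // keeping the historical Windows behavior and using BOM-less UTF-8 on Linux.
    public static Encoding Resolve() =>
        RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
            ? Encoding.Unicode               // UTF-16 LE, the legacy Out-File default
            : new UTF8Encoding(false);       // UTF-8 without BOM for Linux tooling

    static void Main() =>
        Console.WriteLine(Resolve().WebName);  // "utf-16" on Windows, "utf-8" elsewhere
}
```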
I think we probably need an RFC of some kind for this behavior that answers some of my open questions (although I think they're worth discussing today, because the answer to a few of them may determine the level of abstraction at which that RFC should be written):

- Can we break existing Windows behavior here?
- Do we need a breaking change to the *-Content cmdlets?
- Do we need some sort of $EncodingPreference (or something like that)?
- Do redirection operators need to do something special here based on said $EncodingPreference?
- Should we touch Invoke-WebRequest/Invoke-RestMethod as part of this work?

Other:

- $OutputEncoding is only used for pipes of native commands currently. It also requires that you do $outputencoding = [System.Text.Encoding]::UTF8 in order to change it. That's not an awesome UX.
- Set-Content uses ASCII by default, Out-File uses UTF-16 by default.

Open still:

Closed on: ($OutputEncoding)

Came here looking for the issue related to this because > producing UTF-16 by default is maddening.
> Do we need some sort of $EncodingPreference (or something like that)?

Please!

> Can we break existing Windows behavior here?

Less necessary if there's a preference, IMO, though I have never seen a developer respond positively to a file being encoded with UTF-16. If PowerShell 6+ can lead the charge on moving Windows toward UTF-8-without-BOM everywhere, that would be amazing.

> Do redirection operators need to do something special here based on said $EncodingPreference?

This seems to me to be the primary purpose of $EncodingPreference in the first place: changing the encoding used by >. Changing the default encoding for Out-File, Set-Content, etc. is a bonus, but at least those support a parameter (unlike >).

Not sure if >> already delegates to Add-Content, but it should behave the same: preserve encoding for an existing file, otherwise create a file with the specified (or default) encoding.
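As a rough illustration of that append behavior, here is a hedged C# sketch (names like AppendLine and DetectFromBom are made up for this example) that sniffs an existing file's BOM before appending and otherwise falls back to a caller-supplied default:

```csharp
using System;
using System.IO;
using System.Text;

static class AppendPreservingEncoding
{
    static void Main()
    {
        // Usage sketch: appends in whatever encoding test.log already uses,
        // creating it as BOM-less UTF-8 if it doesn't exist yet.
        AppendLine("test.log", "appended line", new UTF8Encoding(false));
    }

    // Sketch of the ">>" behavior suggested above: if the target file already
    // exists, keep appending in its existing encoding; otherwise create it
    // with the caller's default.
    public static void AppendLine(string path, string line, Encoding defaultEncoding)
    {
        Encoding encoding = File.Exists(path) && new FileInfo(path).Length > 0
            ? DetectFromBom(path) ?? defaultEncoding
            : defaultEncoding;

        using var writer = new StreamWriter(path, append: true, encoding: encoding);
        writer.WriteLine(line);
    }

    // Only handles the BOMs relevant to this thread; BOM-less files fall
    // through to the caller's default.
    private static Encoding DetectFromBom(string path)
    {
        byte[] head = new byte[4];
        using var fs = File.OpenRead(path);
        int read = fs.Read(head, 0, head.Length);

        if (read >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return new UTF8Encoding(true);     // UTF-8 with BOM
        if (read >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;           // UTF-16 LE
        if (read >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode;  // UTF-16 BE
        return null;                           // no BOM found
    }
}
```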
We have an RFC open for this work (please add your comments in the issue discussion @dahlbyk) but it will likely not land in beta1.
My humble opinion matches some of the opinions above: the default for Out-File should be UTF-8 without BOM on every system.
For the 6.0.0 release we'll be doing a reduced-scope version of this: https://github.com/PowerShell/PowerShell/issues/4878
@JamesWTruher could you clarify here what wasn't achieved by the implementation that we're shipping in 6.0?
The original issue from @jpsnover no longer repros with RC2 with @JamesWTruher's changes. I believe the change from Jim is that we now default $OutputEncoding to UTF-8 without BOM instead of ASCII.