PowerShell: Don’t parse the pipeline as text when it is directed from an EXE to another EXE or file. Keep the bytes as-is.

Created on 18 Aug 2016 · 20 Comments · Source: PowerShell/PowerShell

Currently PowerShell parses STDOUT as a string when piping from an EXE, but in some cases the output should be preserved as a byte stream, as in this scenario:

curl.exe http://whatever/a.png > a.png

or

node a.js | gzip -c > out.gz

Affected patterns include: native | native, native > file and (maybe) cat file | native.

Issue-Bug WG-Engine WG-Engine-Performance

All 20 comments

@vors @lzybkr
The current NativeCommandProcessor breaks:

  • LF line endings.
  • Non-ASCII text within UTF-8 without BOM header.
  • Binary file redirects (like curl.exe’s output).
  • Width: the > operator lays text out into 80 columns by default.

Maybe add a cmdlet/operator that calls a native command and returns its raw output (as a byte array / stream?), something like this:

# Consider ^& operator is an alias for Get-CommandRawOutputStream; this is just an example syntax
$output = ^& curl.exe http://whatever/a.png # $output now is a byte array or stream
$output > C:\Temp\file.png # file.png now is a valid image file

# This should be valid, too:
^& curl.exe http://whatever/a.png > C:\Temp\file.png

This opens an opportunity for some additional usage patterns (you can put this raw content into variables, and pipe raw content from native commands to managed cmdlets).

But maybe we could add a special kind of redirection operator (like 2>&1, 3>&1, *>&1 we already have), something like this (where %>&1 is a new redirection operator that redirects command "raw output" without processing it as a string):

$output = curl.exe http://whatever/a.png %>&1
$output > C:\Temp\file.png

# Or even this:
curl.exe http://whatever/a.png %> C:\Temp\file.png # which is just awesome

Overall: I don't think that this kind of redirection should be tied to only native commands or some limited list of usage patterns (e.g. native | native).

@ForNeVeR My proposal is that:

  1. For native | native, keep the bytes as-is. This is already proposed by @vors.
  2. For ps | native, add a set of cmdlets which encodes PS objects into bytes, perhaps ps | encode-text utf-8 | native.
  3. For native | ps, we can use the type system to identify whether a cmdlet accepts “raw input”. Cmdlets like out-file or maybe decode-text will receive the bytes from native unchanged, and other cmdlets will use the parsed string as their input.

@be5invis okay, it seems like this proposal also supports all the relevant use cases I can imagine.

Shouldn't this open up an RFC since this is a breaking change (changes the observed behaviour)?

A workaround for this is to provide a cmdlet that stores the content in a temporary file. A working example is Use-RawPipeline in PowerShell Gallery. The current implementation is to store the file, but it could also be streamlined so that the file doesn't have to be stored.

See also #559, where this appears to be actively discussed and worked on by @vors on the PowerShell team.

Great discussion! Thank you all for the feedback.

I'd like to share my plans about this work:

  • In the scope of this issue we will address only the native | native and native > file behavior. Note that although this could be seen as a breaking change, it is not one for text output, whose behavior is preserved. Byte output becomes much more reliable once bytes are no longer wrapped in PS strings. We agreed with @lzybkr that it's not breaking, hence no RFC process will be applied.
  • I don't see an immediate need to enhance the native | ps case, since PS can consume only strings from native commands. Still, somebody may want to write a function like
function foo
{
  param([byte[]]$rawBytes)
}

they can achieve that with a temp file or some other technique, as @GeeLaw pointed out.

  • Similarly, the ps | native case has a well-established pattern: when PS objects need to be passed to a native command, we apply an implicit Out-String and pass everything as text.
    Because PS doesn't use byte streams as a pipeline primitive, I don't think we should develop special sugar to support them in the language directly. If a case comes up where this is needed, similar workarounds can be used.
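
As a rough illustration of the "some other technique" route for a function that takes [byte[]], the sketch below reads a native command's stdout through the .NET Process API instead of the PowerShell pipeline. Get-NativeOutputBytes is a hypothetical helper name, not an existing cmdlet; the curl.exe call reuses the placeholder URL from the issue description; and [type]::new() assumes PowerShell 5.0 or later.

```powershell
# Hypothetical helper (not a built-in cmdlet): run a native command and
# return its stdout as a byte array, bypassing PowerShell's text decoding.
function Get-NativeOutputBytes {
    param(
        [Parameter(Mandatory = $true)] [string] $FilePath,
        [string] $Arguments = ''
    )

    $startInfo = [System.Diagnostics.ProcessStartInfo]::new($FilePath, $Arguments)
    $startInfo.UseShellExecute = $false          # required for stream redirection
    $startInfo.RedirectStandardOutput = $true

    $process = [System.Diagnostics.Process]::Start($startInfo)
    $buffer = [System.IO.MemoryStream]::new()
    try {
        # Copy the raw stdout bytes; no string conversion takes place.
        $process.StandardOutput.BaseStream.CopyTo($buffer)
        $process.WaitForExit()
        # The unary comma keeps the array from being unrolled on output.
        return ,$buffer.ToArray()
    }
    finally {
        $buffer.Dispose()
        $process.Dispose()
    }
}

# The function shape from the comment above, with a trivial body added.
function foo {
    param([byte[]] $rawBytes)
    "Received $($rawBytes.Length) bytes"
}

foo -rawBytes (Get-NativeOutputBytes -FilePath 'curl.exe' -Arguments 'http://whatever/a.png')
```

This holds the whole output in memory, so it only suits outputs of modest size; a temp file would be the safer choice for large downloads.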

We can revisit the last two parts later, but I'd like to set expectations about scope of this issue.

@vors However the current “>” is identical to out-file, so you have to add a special version of out-file which takes raw bytes. So why don’t you give the ability to everyone?

@vors The change in #2450 greatly improves the experience, but the design still feels a little awkward and inconsistent. It seems to be based on arbitrary patterns rather than consistent behavior of operators and cmdlets with respect to input arguments of particular .NET types.

As a user I would expect binary operators like | to behave consistently given a LHS expression that evaluates to a byte stream (or some appropriate choice of byte stream-ish object), regardless of whether it is produced by invocation of a native executable, piping from a file, invocation of a PowerShell function/script/cmdlet, or .NET FFI.

Similarly, I would expect the | operator to behave consistently given a RHS that "can accept stdin", for some meaning of accepting stdin appropriate to the RHS expression in question. For native executables and files this is just sending the bytes to the correct file descriptor, for PowerShell invocables perhaps it would be param([Stream]$rawBytes) as you suggested.

If there is no way to overload |'s behavior so that this is not a breaking change, then we should have a different piping operator for raw streams, and cmdlets for converting between byte streams and guessed-encoding-decoded lines (similar to $ and ~ from @GeeLaw's Use-RawPipeline project).

Oh, https://psguy.me/modules/Use-RawPipeline/ is very interesting, thank you for the link.

There are 2 conversations going on here:

  1. just native | native (or native > file)
  2. native | ps and ps | native

They are highly related and it's true that solving (2) in a general way will buy us (1) automatically.
However, the scope of the work for (2) is much broader and includes an RFC and whatnot, while (1) is low-hanging fruit: it can be done in a non-breaking manner, it greatly improves perf for common cases, and the changes themselves are very modest. Note that to achieve perf parity with bash, we would need (1) in one form or another.

That's why I think that it makes sense to separate these two tasks.

@vors What I mean is that IMHO native, ps etc. shouldn't be distinct, first class concepts in the first place. It makes the language conceptually simpler if it consists only of expressions that can produce and consume .NET values and operators that can wire such expressions together. PowerShell is after all strongly typed, if not statically typed.

Changing the language so that (ping.exe 1.2.3.4).GetType() is Stream (or some similar type with suitable metadata about the process) would be a way to integrate native commands with the rest of PowerShell in a more consistent way.

"That's why I think that it makes sense to separate these two tasks."

Do you know if there's already an RFC or issue for the latter task?

FYI, Use-RawPipeline has been reworked to stream data between processes, instead of storing the content in a file and waiting until the previous process ends before starting the next piped process. Its source code is available from https://github.com/GeeLaw/Use-RawPipeline

There seems to be an assumption throughout this that PowerShell's philosophy that "everything is a pipeline" is OK. However, I think there might be some value in thinking of the use case of "PowerShell as a legacy native command launcher" as distinct. Would it be possible to allow the redirection operators to have their traditional meaning of redirecting the native command's stdout directly to the raw file instead of piping its output back to PowerShell? Even if all of the encoding issues are resolved for general piping, giving a command a pipe when it expects a physical file is still a semantic change.

Requiring a user to know that they need to specify special obscure options to say "don't change the output of this command" seems error-prone at best. I'm arguing that redirecting a simple command directly to a file should be data-preserving by default.

Or, is the philosophy that if someone wants to use non-text native commands that they should just switch back to a traditional cmd window?

@SteveL-MSFT

The following command is broken by this issue:
docker save microsoft/windowsservercore:ltsc2016 > msft_wsc_ltsc2016.tar

You have to use docker save -o instead.
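
For reference, a minimal sketch of that form, using the same image and file name as above. The -o (--output) option makes docker write the archive itself, so the bytes never pass through PowerShell's redirection:

```powershell
# docker writes the tar file itself; no PowerShell redirection is involved.
docker save -o msft_wsc_ltsc2016.tar microsoft/windowsservercore:ltsc2016
```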

I have a hard time understanding what #2450 actually fixes. A command like ping.exe github.com | grep.exe Reply works, but the command:
curl.exe "https://i.redd.it/dntes9fqy3x11.jpg" > test1.jpg
still works only in cmd/WSL's bash/git bash.

I was trying to use git show ref:path/to/file.png > file.png and it looks like this is still not possible in PowerShell. Are there any serious plans to fix it?

@mpawelski

#2450 fixes the problem of requiring the upstream native command to finish before its output is piped downstream. It does not address PowerShell parsing the byte stream from native output into objects and then reserializing those objects into a byte stream. Please take some time to learn how object-oriented pipes work and you will see that the problem is really hard to solve consistently. For byte-preserving redirection, please use a native redirection/piping utility, e.g., Command Prompt, Start-Process, or Use-RawPipeline.
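
A sketch of two of the workarounds named above, using the placeholder URL from the issue description and made-up output file names. Both hand the redirection to something other than the PowerShell pipeline. That Start-Process passes the bytes through untouched is an assumption based on the comment above and may depend on platform and version, so the last line compares the two results. cmd.exe exists only on Windows.

```powershell
# Option 1: let cmd.exe perform the redirection. PowerShell only launches
# cmd.exe and never sees the bytes that curl.exe writes to stdout.
cmd.exe /c 'curl.exe http://whatever/a.png > a-cmd.png'

# Option 2: Start-Process with -RedirectStandardOutput (assumed here to
# write the process output to the file without text conversion).
Start-Process -FilePath curl.exe -ArgumentList 'http://whatever/a.png' `
    -RedirectStandardOutput a-sp.png -NoNewWindow -Wait

# Sanity check: identical hashes mean both files hold the same bytes.
(Get-FileHash a-cmd.png).Hash -eq (Get-FileHash a-sp.png).Hash
```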

@GeeLaw So you think PS should break simple, native-exe piping, on purpose??????

@be5invis which post are you replying to?

If you are replying to a post 2+ months ago, I pointed out the necessity of RFC and developed a workaround.

If you are replying to a post ~1 month ago, I was explaining to @mpawelski that #2450 does not address this specific issue. The additional point (this issue is hard to solve consistently) reinforces the necessity of an RFC (and probably additional documentation explaining the new/old behaviors), and provides pointers to workarounds in current versions of PowerShell.

As for breaking exe piping, I interpret the current implementation as breaking it by mistake, with some intention behind the scenes (for other use cases). Improvement (coming up with a more intuitive conversion rule, implementing it, and documenting it, which are the RFC part) and education (making people aware of the nuances) are both important: one should know about the difference between native utilities and cmdlets, and currently one cannot choose to ignore it.

Edited by @joeyaiello: As a reminder, please be respectful and follow our Code of Conduct when commenting on issues or PRs.
