PowerShell: UTF-16 pipe => string (or string[]) variable corrupted

Created on 21 Jan 2020 · 12 Comments · Source: PowerShell/PowerShell

Maybe related: #1908

It's unfortunate that there is no way to convert a UTF-16LE (or other variable-length encoding) output stream from an external program into a string using the correct encoding, or to receive its bytes as-is, via the pipeline.

# May be corrupted when the native locale is not English or the ANSI code page has not been switched to UTF-8
$description = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -property description -utf8

Steps to reproduce

cmd /c "echo 漢字"
cmd /u /c "echo 漢字"
chcp
$success = cmd /c "echo 漢字"
echo $success
$fails = cmd /u /c "echo 漢字"
echo $fails

Expected behavior

漢字
漢字
Active code page: 65001
漢字
漢字

Actual behavior

漢字
漢字
Active code page: 65001
漢字
"oW[


Environment data

Name                           Value
----                           -----
PSVersion                      6.2.3
PSEdition                      Core
GitCommitId                    6.2.3
OS                             Microsoft Windows 10.0.19041
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

I hope something like the following becomes possible (ConvertFrom-RawStream is a hypothetical cmdlet name):

$description = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -property description -utf8 | [[[ConvertFrom-RawStream]]] -Encoding UTF8
$kanji = cmd /u /c "echo 漢字" | [[[ConvertFrom-RawStream]]] -Encoding Unicode
Issue-Question

All 12 comments

Does setting [Console]::OutputEncoding to a Unicode encoding not allow PS to read the incoming text properly? 🤔

PowerShell 6.2.3

https://aka.ms/pscore6-docs
Type 'help' to get help.

Loading personal and system profiles took 3772ms.
PS> [Console]::OutputEncoding

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : False
CodePage          : 65001


PS> [System.Text.Encoding]::UTF8

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 65001


[Console]::OutputEncoding is already UTF-8, so that doesn't seem to be the problem. This also happens in Legacy PowerShell (5.1).

@vexx32 is correct in principle: [Console]::OutputEncoding is the encoding PowerShell uses to decode output from external programs such as cmd.

If you use cmd /u (for "Unicode", i.e. UTF-16LE output), you'd have to use
[Console]::OutputEncoding = [Text.Encoding]::Unicode

Unfortunately, doing so doesn't work in v6.2.3, due to a bug that has since been fixed (as of v7.0.0-preview.6): #10789

It worked in Legacy PowerShell 5.1.
I wonder why [Console]::OutputEncoding must be changed when piping in PowerShell, even though that isn't necessary for direct output.

The _bug_ wasn't there in 5.1, but you still had to change [Console]::OutputEncoding to match the command's output encoding; e.g.:

# Does NOT work by default - note the extra "spaces" representing NULs
PS> $out = cmd /u /c echo hi; $out
h i
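
The stray characters can be reproduced without any external program. The following is a minimal sketch, assuming only the standard .NET System.Text.Encoding classes; the variable names are illustrative. It encodes text as UTF-16LE (what cmd /u emits) and then decodes the bytes as UTF-8, which stands in here for any ASCII-compatible console code page that doesn't match the producer.

```powershell
# Build the test string from code points so the script file's own encoding doesn't matter.
$kanji = -join ([char]0x6F22, [char]0x5B57)   # the two characters from the repro steps

# cmd /u emits UTF-16LE ("Unicode" in .NET terms).
$hiBytes    = [Text.Encoding]::Unicode.GetBytes('hi')
$kanjiBytes = [Text.Encoding]::Unicode.GetBytes($kanji)

# Decoding those bytes as UTF-8 mimics a mismatched [Console]::OutputEncoding.
$hiMisread    = [Text.Encoding]::UTF8.GetString($hiBytes)      # 'h', NUL, 'i', NUL
$kanjiMisread = [Text.Encoding]::UTF8.GetString($kanjiBytes)   # '"oW['

# Decoding with the encoding that matches the producer round-trips correctly.
$kanjiCorrect = [Text.Encoding]::Unicode.GetString($kanjiBytes)
```

The NULs in the first case are what show up as the extra "spaces" in `h i`, and the second case yields the same `"oW[` reported under "Actual behavior".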

@mklement0 Of course. What I meant is that Legacy PowerShell (and the upcoming PowerShell 7) can convert output from cmd /u to a string correctly as long as [Console]::OutputEncoding has been changed. What I don't understand is why cmd /c "echo 漢字"; cmd /u /c "echo 漢字" succeeds without any extra steps, whereas $success = cmd /c "echo 漢字"; $fails = cmd /u /c "echo 漢字" fails unless [Console]::OutputEncoding is changed before the cmd /u call.

This issue can be split into a bug that has been fixed (so far only in the development version) and a missing feature, so I'll close this one and open a separate feature request for the latter later.

Yes, the bug is specific to 6.x.

cmd /u /c "echo 漢字" works (if you have the right font selected), because it prints _directly to the console_, presumably using the Unicode version of the WriteConsole WinAPI function - PowerShell passes the output _through_, and performs _no decoding_.

In other words: the need to set [Console]::OutputEncoding only arises if you _capture, pipe, or redirect_ the external program's output.

Given the explanation above, do you still think there is a missing feature?

If you want to _interpret_ the output from an external program _as text_, there is no way around telling PowerShell what the source character encoding is - I don't think we should be employing heuristics there.

  1. Can't PowerShell tell whether an external program writes its text via WriteConsoleW or WriteConsoleA?
  2. If not, does PowerShell reinterpret the UTF-16LE wchar_t* text from WriteConsoleW as multibyte char* text encoded in OutputEncoding, and then convert it back to a UTF-16LE string (PowerShell's native string type) according to OutputEncoding?

WriteConsole() is truly just for writing to the _console_ (from the linked docs):

WriteConsole fails if it is used with a standard handle that is redirected to a file. [...]
If the handle is a console handle, call WriteConsole. If the handle is not a console handle, the output is redirected and you should call WriteFile to perform the I/O.

That is, it doesn't apply in the capture/redirect/pipe case.

Also, more generally, not all console programs can be assumed to use the WinAPI, and there's no telling from the outside how a given console program is implemented.

Generally, console programs (when not printing directly to the screen) are expected to use the active code page's encoding, as reflected in the output of chcp.com and in [Console]::OutputEncoding.

If a given program uses a different encoding (as is the case when you use cmd /u ...), you need to (temporarily) set [Console]::OutputEncoding to that encoding in order for PowerShell to interpret the output correctly.

WriteConsole is more complex than I thought. Thank you for the detailed explanation. Anyway, it seems I have to write:

$enc = [Console]::OutputEncoding
[Console]::OutputEncoding = [Text.Encoding]::Unicode
$result = cmd /u /c "foo bar"
[Console]::OutputEncoding = $enc

I wish it were just one line.
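
One way to get closer to a single line is to wrap the save/set/restore pattern in a function. The following is a minimal sketch; Invoke-WithOutputEncoding and its parameter names are hypothetical and not part of PowerShell. It relies only on [Console]::OutputEncoding being settable, as discussed above, and uses try/finally so the previous encoding is restored even if the command fails.

```powershell
function Invoke-WithOutputEncoding {
    # Hypothetical helper (not a built-in cmdlet): runs a script block with
    # [Console]::OutputEncoding temporarily set to the given encoding.
    param(
        [Parameter(Mandatory)] [System.Text.Encoding] $Encoding,
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock
    )

    $previous = [Console]::OutputEncoding
    try {
        [Console]::OutputEncoding = $Encoding
        & $ScriptBlock
    }
    finally {
        # Runs even if the script block throws, so the previous encoding is restored.
        [Console]::OutputEncoding = $previous
    }
}
```

With that in place, the four lines above would reduce to: $result = Invoke-WithOutputEncoding -Encoding ([Text.Encoding]::Unicode) -ScriptBlock { cmd /u /c "foo bar" }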
