PowerShell: Start-Transcript uses lossy ASCII character encoding instead of BOM-less UTF-8 if the target file happens to exist

Created on 23 Sep 2020 · 22 comments · Source: PowerShell/PowerShell

Note:

  • Windows PowerShell is _not_ affected - UTF-8 with BOM is used consistently there.
  • A fix for the closely related #13678 could probably encompass a fix for this issue as well.

When the target file _happens to exist and doesn't have a BOM_ - even without -Append - Start-Transcript mistakenly uses ASCII encoding to write the file.

Without -Append, Start-Transcript shouldn't even look at the existing file - it should simply replace it, and use the default encoding (BOM-less UTF-8). See #13678 for a closely related bug _with_ -Append.

Instead, Start-Transcript apparently looks for a BOM at the start of an existing file and uses ASCII encoding if it doesn't find one.

Steps to reproduce

# To surface the bug, make sure that the target file exists and doesn't have a BOM.
$null > temp:/$PID.txt

$null = Start-Transcript temp:/$PID.txt
'ü' # output a string with a non-ASCII-range character
$null = Stop-Transcript

Select-String -Quiet 'ü' temp:/$PID.txt | Should -BeTrue

Remove-Item temp:/$PID.txt

Expected behavior

The test should succeed.

Actual behavior

The test fails: the 'ü' was transliterated to a verbatim ?, indicating that ASCII encoding was used to write the file.

Environment data

PowerShell Core 7.1.0-preview.7
Labels: Area-Cmdlets, First-Time-Issue, Issue-Enhancement, Up-for-Grabs

All 22 comments

We get the file encoding in this method: https://github.com/PowerShell/PowerShell/blob/10237bdeb37718595d475bd0014b005a035b34df/src/System.Management.Automation/engine/Utils.cs#L1395

Its last resort there is ASCII. It seems the method is used only for transcripts, so we could safely change that default.
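To make the failure mode concrete, here is a language-agnostic sketch (in Python, purely for illustration; the function name is hypothetical and this is not the actual `GetEncoding` code) of BOM-based detection with an ASCII last resort, the pattern being described:

```python
# Hypothetical sketch of BOM-based encoding detection with an ASCII
# last resort -- the pattern the transcript code is described as using.
def detect_encoding(first_bytes: bytes) -> str:
    """Return an encoding name based on a leading BOM; fall back to ASCII."""
    if first_bytes.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # UTF-8 BOM
    if first_bytes.startswith(b"\xff\xfe"):
        return "utf-16-le"   # UTF-16 LE BOM
    if first_bytes.startswith(b"\xfe\xff"):
        return "utf-16-be"   # UTF-16 BE BOM
    return "ascii"           # last resort: the problematic default

# A BOM-less UTF-8 file therefore gets ASCII, and non-ASCII characters
# such as 'ü' are later transliterated to '?' on write.
print(detect_encoding("ü".encode("utf-8")))  # → ascii
```

With this fallback, any preexisting BOM-less file silently downgrades subsequent writes to ASCII; changing the last line to return a BOM-less UTF-8 default removes the lossy path.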

Thanks, @iSazonov.

What you're describing is the right fix for the closely related #13678, whereas the fix for this issue is to not consider any preexisting file at all.

I suspect both issues can easily be fixed together - not sure how you want to handle that in terms of tagging.

whereas the fix for this issue is to not consider any preexisting file at all.

I don't remember, but perhaps we use the same code there too and fall back to ASCII.

That's what it looks like; so, to summarize:

  • _This_ issue should be fixed by _always_ using BOM-less UTF-8, because a preexisting file is simply _replaced_, so its encoding is irrelevant.

  • #13678, which applies to -Append, should be fixed by falling back to (by definition BOM-less) UTF-8 rather than ASCII in the absence of a BOM in the preexisting file.

I think we can change the ASCII default to UTF8-NoBOM in GetEncoding() - it is used only in transcript code.

Yes, but additionally we shouldn't even call GetEncoding() _unless -Append is specified_.

Has work on this issue already started? If not, I'd be interested to try.

@mklement0 I've changed the default encoding to UTF-8 no BOM in GetEncoding in PR #13732. But not calling GetEncoding when -Append is specified is a bit trickier: the call is made in the setter of the Path property, which I feel might not be the best place for it. I'll take a bit more time to see exactly where the Append parameter is handled and whether I can move the GetEncoding call out of the setter, but if you already have an idea about it, let me know.

Thanks for tackling this, @Gimly - I haven't even looked at the original code yet, but @iSazonov has, so perhaps he has thoughts.

@mklement0
It seems I found the root of the bug.
We use System.IO.File.WriteAllText(effectiveFilePath, string.Empty); - its default encoding is Utf8NoBOM.
https://source.dot.net/#System.IO.FileSystem/System/IO/File.cs,300
https://source.dot.net/#System.Private.CoreLib/StreamWriter.cs,139

Windows PowerShell is not affected - UTF-8 with BOM is used consistently there.

Are you sure? See https://referencesource.microsoft.com/#mscorlib/system/io/file.cs,848

The pattern seems popular: https://stackoverflow.com/questions/4999988/how-can-i-clear-the-content-of-a-file
You could add a warning comment there.

Since the pattern is so popular, we could ask the .NET team to add a new System.IO.File.Clear(string path) method to avoid such bugs.

@Gimly I think we could use WriteAllText(string path, string? contents, Encoding encoding) overload here
https://github.com/PowerShell/PowerShell/blob/e107ecd08d68cc67512cd833a355dca868bd5a03/src/Microsoft.PowerShell.ConsoleHost/host/msh/StartTranscriptCmdlet.cs#L228-L232

For encoding detection in the place I suggest to use
StreamReader(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks)


Are you sure? See https://referencesource.microsoft.com/#mscorlib/system/io/file.cs,848

That's interesting, but _in practice_ Start-Transcript without a preexisting file on Windows PowerShell v5.1.18362.752 on Microsoft Windows 10 Pro (64-bit; Version 1909, OS Build: 18363.1082) creates UTF-8 files _with_ BOM.
Is the reference source outdated, or is this not the relevant code?

Also, System.IO.File.WriteAllText(effectiveFilePath, string.Empty) acts like a .Clear() method: given that BOM-less UTF-8 is the default encoding, the file is reset to 0 bytes (and closed again).

So the problem isn't with this call - it must be with later write operations.
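The truncation point can be illustrated with a short sketch (Python as a stand-in for `File.WriteAllText(path, string.Empty)`; the file path is made up for the demonstration):

```python
# Sketch: writing an empty string truncates the file to 0 bytes, so
# whatever encoding/BOM the file had before is gone rather than preserved.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "transcript.txt")

# Start with a UTF-16 file (2-byte BOM plus content).
with open(path, "w", encoding="utf-16") as f:
    f.write("hello")
assert os.path.getsize(path) > 0  # BOM + UTF-16 payload on disk

# Overwrite with empty content -- the analogue of WriteAllText(path, "").
with open(path, "w") as f:
    f.write("")
print(os.path.getsize(path))  # → 0
```

Since zero bytes remain, there is no BOM left to detect, which is why the encoding of the replaced file cannot influence later writes through this call alone.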

Is the reference source outdated, or is this not the relevant code?

I guess it is outdated (they made .NET Framework compatible with .NET Core).

Also, System.IO.File.WriteAllText(effectiveFilePath, string.Empty) effectively acts like a .Clear() method: given that BOM-less UTF-8 is the default encoding, the file is effectively reset to 0 bytes (and closed again).

But if the source file is UTF-16 with a BOM, the method resets it to Utf8NoBOM.

But if the source file is UTF-16 with a BOM, the method resets it to Utf8NoBOM.

If -Append isn't specified, it doesn't matter what the previously existing file's encoding was. The cmdlet's default encoding should apply (which in PS Core is always BOM-less UTF-8), as it does if there is no preexisting file.

We call WriteAllText() only in the Append scenario.

@iSazonov I'm getting a bit lost in the discussion. If the default for WriteAllText is indeed Utf8NoBOM, isn't that what we expected it to be? You'd like to call the overload that takes an encoding, passing Utf8NoBOM explicitly, so that we don't rely on the default?

@Gimly My expectation is that we don't change the encoding of the transcript file if it exists.

  1. If the file does not exist - create it with the default encoding. The fix is to change that default from ASCII to Utf8NoBOM.
  2. If the file exists:

    • if no -Append - clear the file ~but preserve its encoding~ and write with the default encoding

    • if -Append is present - open the file with its current encoding

To detect the encoding, it is better to follow StreamReader.

@iSazonov

Re 1: agreed

Re 2:

  • Without -Append: We do _not_ want to preserve any existing encoding if the file exists. Just like Set-Content and Out-File quietly replace an existing file's contents using their _default encoding_ (unless -Encoding is explicitly used), so should Start-Transcript. The existing file's content simply does not matter.

  • With -Append: whatever implementation we choose, what matters is: if there's a BOM, respect it; if there's none, use (BOM-less) UTF-8.
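The agreed-upon behavior can be summarized in a small sketch (Python for illustration only; the function name is hypothetical and this is not the actual cmdlet code):

```python
# Hedged sketch of the behavior agreed on above:
# - without -Append: always use the default, BOM-less UTF-8
# - with -Append: respect a BOM if present, else fall back to BOM-less UTF-8
import os

def choose_transcript_encoding(path: str, append: bool) -> str:
    """Pick the encoding a transcript writer should use (illustrative)."""
    if not append or not os.path.exists(path):
        return "utf-8"              # default: BOM-less UTF-8
    with open(path, "rb") as f:
        head = f.read(3)            # longest BOM considered here is 3 bytes
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return "utf-8"                  # no BOM -> BOM-less UTF-8, not ASCII
```

The key difference from the buggy behavior is the final fallback: a BOM-less preexisting file yields UTF-8 rather than ASCII, and without -Append the preexisting file is never consulted at all.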

Without -Append: We do not want to preserve any existing encoding if the file exists.

I have no strong objection.

@mklement0 I've opened a PR that, I think, fixes this issue. It's quite simple: I believe the only part not working as you described was that the "clear file" step performed when the -Append flag isn't set wasn't pinning a specific encoding, but simply reusing whatever encoding the existing file had.

Now I've fixed the encoding (using the static readonly instance present in Utils) to UTF-8 no BOM, as we had done in PR #13732.

Thank you, @Gimly.
