PowerShell: Start-Transcript uses lossy ASCII character encoding instead of BOM-less UTF-8 if the target file happens to exist

Created on 23 Sep 2020 · 22 comments · Source: PowerShell/PowerShell

Note:

  • Windows PowerShell is _not_ affected - UTF-8 with BOM is used consistently there.
  • A fix for the closely related #13678 could probably encompass a fix for this issue as well.

When the target file _happens to exist and doesn't have a BOM_ - even without -Append - Start-Transcript mistakenly uses ASCII encoding to write the file.

Without -Append, Start-Transcript shouldn't even look at the existing file - it should simply replace it, and use the default encoding (BOM-less UTF-8). See #13678 for a closely related bug _with_ -Append.

Instead, Start-Transcript apparently looks for a BOM at the start of an existing file and uses ASCII encoding if it doesn't find one.

Steps to reproduce

# To surface the bug, make sure that the target file exists and doesn't have a BOM.
$null > temp:/$PID.txt

$null = Start-Transcript temp:/$PID.txt
'ü' # output a string with a non-ASCII-range character
$null = Stop-Transcript

Select-String -Quiet 'ü' temp:/$PID.txt | Should -BeTrue

Remove-Item temp:/$PID.txt

Expected behavior

The test should succeed.

Actual behavior

The test fails: the 'ü' was transliterated to a verbatim ?, indicating that ASCII encoding was used to write the file.

Environment data

PowerShell Core 7.1.0-preview.7
Labels: Area-Cmdlets, First-Time-Issue, Issue-Enhancement, Up-for-Grabs

All 22 comments

We get the file encoding in this method: https://github.com/PowerShell/PowerShell/blob/10237bdeb37718595d475bd0014b005a035b34df/src/System.Management.Automation/engine/Utils.cs#L1395

Its last resort there is ASCII. It seems the method is used only for transcripts, so we could safely change that default.
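To make the failure mode concrete, here is a language-agnostic sketch (in Python, purely for illustration; the function name is hypothetical and this is not the actual `GetEncoding` code) of BOM-based detection with an ASCII last resort, the pattern being described:

```python
# Hypothetical sketch of BOM-based encoding detection with an ASCII
# last resort -- the pattern the transcript code is described as using.
def detect_encoding(first_bytes: bytes) -> str:
    """Return an encoding name based on a leading BOM; fall back to ASCII."""
    if first_bytes.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # UTF-8 BOM
    if first_bytes.startswith(b"\xff\xfe"):
        return "utf-16-le"   # UTF-16 LE BOM
    if first_bytes.startswith(b"\xfe\xff"):
        return "utf-16-be"   # UTF-16 BE BOM
    return "ascii"           # last resort: the problematic default

# A BOM-less UTF-8 file therefore gets ASCII, and non-ASCII characters
# such as 'ü' are later transliterated to '?' on write.
print(detect_encoding("ü".encode("utf-8")))  # → ascii
```

With this fallback, any preexisting BOM-less file silently downgrades subsequent writes to ASCII; changing the last line to return a BOM-less UTF-8 default removes the lossy path.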

Thanks, @iSazonov.

What you're describing is the right fix for the closely related #13678, whereas the fix for this issue is to not consider any preexisting file at all.

I suspect both issues can easily be fixed together - not sure how you want to handle that in terms of tagging.

whereas the fix for this issue is to not consider any preexisting file at all.

I don't remember, but perhaps we use the same code there too and fall back to ASCII.

That's what it looks like; so, to summarize:

  • _This_ issue should be fixed by _always_ using BOM-less UTF-8, because a preexisting file is simply _replaced_, so its encoding is irrelevant.

  • #13678, which applies to -Append, should be fixed by falling back to (by definition BOM-less) UTF-8 rather than ASCII in the absence of a BOM in the preexisting file.

I think we can change the ASCII default to UTF8-NoBOM in GetEncoding() - it is used only in transcript code.

Yes, but additionally we shouldn't even call GetEncoding() _unless -Append is specified_.

Has work on this issue already started? If not, I'd be interested to try.

@mklement0 I've changed the default encoding to UTF-8 no BOM in GetEncoding in PR #13732. But not calling GetEncoding when -Append is specified is a bit trickier: the call is made in the setter of the Path property, which I feel might not be the best place for it. I'll take a bit more time to see exactly where the Append parameter is handled and whether I can move the GetEncoding call out of the setter, but if you already have an idea about it, let me know.

Thanks for tackling this, @Gimly - I haven't even looked at the original code yet, but @iSazonov has, so perhaps he has thoughts.

@mklement0
It seems I found the root of the bug.
We use System.IO.File.WriteAllText(effectiveFilePath, string.Empty); - its default encoding is Utf8NoBOM.
https://source.dot.net/#System.IO.FileSystem/System/IO/File.cs,300
https://source.dot.net/#System.Private.CoreLib/StreamWriter.cs,139

Windows PowerShell is not affected - UTF-8 with BOM is used consistently there.

Are you sure? See https://referencesource.microsoft.com/#mscorlib/system/io/file.cs,848

The pattern seems popular: https://stackoverflow.com/questions/4999988/how-can-i-clear-the-content-of-a-file
You could add a warning comment there.

Since the pattern is so popular, we could ask the .NET team to add a new System.IO.File.Clear(string path) method to avoid such bugs.

@Gimly I think we could use WriteAllText(string path, string? contents, Encoding encoding) overload here
https://github.com/PowerShell/PowerShell/blob/e107ecd08d68cc67512cd833a355dca868bd5a03/src/Microsoft.PowerShell.ConsoleHost/host/msh/StartTranscriptCmdlet.cs#L228-L232

For encoding detection in the place I suggest to use
StreamReader(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks)


Are you sure? See https://referencesource.microsoft.com/#mscorlib/system/io/file.cs,848

That's interesting, but _in practice_ Start-Transcript without a preexisting file on Windows PowerShell v5.1.18362.752 on Microsoft Windows 10 Pro (64-bit; Version 1909, OS Build: 18363.1082) creates UTF-8 files _with_ BOM.
Is the reference source outdated, or is this not the relevant code?

Also, System.IO.File.WriteAllText(effectiveFilePath, string.Empty) acts like a .Clear() method: given that BOM-less UTF-8 is the default encoding, the file is reset to 0 bytes (and closed again).

So the problem isn't with this call - it must be with later write operations.
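The truncation point can be illustrated with a short sketch (Python as a stand-in for `File.WriteAllText(path, string.Empty)`; the file path is made up for the demonstration):

```python
# Sketch: writing an empty string truncates the file to 0 bytes, so
# whatever encoding/BOM the file had before is gone rather than preserved.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "transcript.txt")

# Start with a UTF-16 file (2-byte BOM plus content).
with open(path, "w", encoding="utf-16") as f:
    f.write("hello")
assert os.path.getsize(path) > 0  # BOM + UTF-16 payload on disk

# Overwrite with empty content -- the analogue of WriteAllText(path, "").
with open(path, "w") as f:
    f.write("")
print(os.path.getsize(path))  # → 0
```

Since zero bytes remain, there is no BOM left to detect, which is why the encoding of the replaced file cannot influence later writes through this call alone.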

Is the reference source outdated, or is this not the relevant code?

I guess it is outdated (they made .NET Framework compatible with .NET Core).

Also, System.IO.File.WriteAllText(effectiveFilePath, string.Empty) effectively acts like a .Clear() method: given that BOM-less UTF-8 is the default encoding, the file is effectively reset to 0 bytes (and closed again).

But if the source file is UTF-16 with a BOM, the method resets it to Utf8NoBOM.

But if the source file is UTF-16 with a BOM, the method resets it to Utf8NoBOM.

If -Append isn't specified, it doesn't matter what the previously existing file's encoding was. The cmdlet's default encoding should apply (which in PS Core is always BOM-less UTF-8), as it does if there is no preexisting file.

We call WriteAllText() only in the Append scenario.

@iSazonov I'm getting a bit lost in the discussion. If the default for WriteAllText is indeed Utf8NoBOM, isn't that what we expected it to be? You'd like to call the overload that takes an encoding, passing Utf8NoBOM explicitly, so that we don't rely on the default?

@Gimly My expectation is that we don't change the encoding of the transcript file if it exists.

  1. If the file does not exist - create it with the default encoding. The fix is to change that default from ASCII to Utf8NoBOM.
  2. If the file exists:

    • if no -Append - clear the file ~but preserve its encoding~ and write with the default encoding

    • if -Append is present - open the file with its current encoding

To detect the encoding, it is better to follow StreamReader.

@iSazonov

Re 1: agreed

Re 2:

  • Without -Append: We do _not_ want to preserve any existing encoding if the file exists. Just like Set-Content and Out-File quietly replace an existing file's contents using their _default encoding_ (unless -Encoding is explicitly used), so should Start-Transcript. The existing file's content simply does not matter.

  • With -Append: whatever implementation we choose, what matters is: if there's a BOM, respect it; if there's none, use (BOM-less) UTF-8.
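The agreed-upon behavior can be summarized in a small sketch (Python for illustration only; the function name is hypothetical and this is not the actual cmdlet code):

```python
# Hedged sketch of the behavior agreed on above:
# - without -Append: always use the default, BOM-less UTF-8
# - with -Append: respect a BOM if present, else fall back to BOM-less UTF-8
import os

def choose_transcript_encoding(path: str, append: bool) -> str:
    """Pick the encoding a transcript writer should use (illustrative)."""
    if not append or not os.path.exists(path):
        return "utf-8"              # default: BOM-less UTF-8
    with open(path, "rb") as f:
        head = f.read(3)            # longest BOM considered here is 3 bytes
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return "utf-8"                  # no BOM -> BOM-less UTF-8, not ASCII
```

The key difference from the buggy behavior is the final fallback: a BOM-less preexisting file yields UTF-8 rather than ASCII, and without -Append the preexisting file is never consulted at all.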

Without -Append: We do not want to preserve any existing encoding if the file exists.

I have no strong objection.

@mklement0 I've opened a PR that, I think, fixes this issue. It's quite simple: I believe the only part not working as you described was that the "clear file" step performed when the -Append flag isn't set wasn't pinning a specific encoding, but simply reusing whatever encoding the existing file had.

Now I've fixed the encoding (using the static readonly instance present in Utils) to UTF-8 no BOM, as we had done in PR #13732.

Thank you, @Gimly.
