PowerShell: Invoke-WebRequest includes byte order mark in content for files encoded as UTF-8-BOM (a la ISE)

Created on 4 Oct 2017 · 10 Comments · Source: PowerShell/PowerShell

Steps to reproduce

Create new file in ISE. Save as utf-8-bom.txt. Commit and push to github. Download and inspect the file as follows:

$content = Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt |
    % Content
[int]$content[0]

Expected behavior

116

116 is the decimal integer representation of the character "t".

Actual behavior

65279

65279 is the decimal integer representation of the Unicode byte order mark.

Environment data

> $PSVersionTable

Name                           Value                                           
----                           -----                                           
PSVersion                      6.0.0-beta                                      
PSEdition                      Core                                            
GitCommitId                    v6.0.0-beta.7                                   
OS                             Microsoft Windows 6.3.9600                      
Platform                       Win32NT                                         
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}                         
PSRemotingProtocolVersion      2.3                                             
SerializationVersion           1.1.0.1                                         
WSManStackVersion              3.0                                             
Area-Cmdlets-Utility Resolution-Answered

All 10 comments

Since the file has a BOM, the output should have a BOM, right? The first file here is currently saved as UTF-8 w/ BOM and the second as UTF-8 no BOM:

PS C:\> [int](iwr https://raw.githubusercontent.com/PowerShell/PowerShell/master/test/powershell/Modules/Microsoft.PowerShell.Utility/WebCmdlets.Tests.ps1).Content[0]
65279
PS C:\> [int](iwr https://raw.githubusercontent.com/PowerShell/PowerShell/master/test/powershell/Modules/Microsoft.PowerShell.Utility/Write-Error.Tests.ps1).Content[0]
68

Seems to be working as expected to me? Note that this test will fail in the future as we'll be converting all our sources on GitHub to have no BOM.

Hmm I can't repro this with the example provided either.

$content = Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt |
% Content
[int]$content[0]

On Windows 10 this gives me 65279.

edit: oh, I see. I had it backwards. I thought they said the BOM was missing. This looks like it's working as intended to me.

@alx9r please add more details if you think it's behaving incorrectly. Thanks!

"Since the file has a BOM, the output should have a BOM, right?"

I suppose that depends on what, exactly, is meant by "Content" for a text file. I don't have a strong opinion about whether "Content" should include the BOM when present or not. I do feel fairly strongly, however, that the inclusion or exclusion of a BOM should be consistent across the PowerShell built-in APIs. Consider that

[System.IO.Path]::GetTempFileName() |
    % {
        Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt -OutFile $_
        [int](Get-Content $_)[0]
    }

outputs 116. In other words, if you download the file to the file system and read it back, the "Contents" do not include a BOM, but if you download the file to memory and inspect the "Contents", there is a BOM. It seems like either both should include the BOM or neither should.

That's Get-Content not returning the BOM. The BOM should be there in the file:

Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt -OutFile c:\temp\utf-8-bom.txt
$fileStream = [System.IO.FileStream]::new("c:\temp\utf-8-bom.txt", "Open", "Read")
[byte[]]$UTF8BOM = 0xEF, 0xBB, 0xBF
[byte[]]$bytes = [byte[]]::New(3)
$null = $fileStream.Read($bytes,0,$bytes.Length)
# Release the file handle once the first three bytes have been read
$fileStream.Dispose()
Compare-Object $UTF8BOM $bytes -PassThru

This should come back with no output if $bytes and $UTF8BOM are the same.

You can get it with Get-Content using -Encoding Byte:

Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt -OutFile c:\temp\utf-8-bom.txt
[byte[]]$UTF8BOM = 0xEF, 0xBB, 0xBF
[byte[]]$bytes = Get-Content c:\temp\utf-8-bom.txt -Encoding Byte -TotalCount 3
Compare-Object $UTF8BOM $bytes -PassThru
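
The same check can be made without relying on Get-Content's byte-reading parameter, which may differ between PowerShell versions. This is a minimal sketch that reads the leading bytes with [System.IO.File]::ReadAllBytes. The temp file and the names $path, $head and $hasBom are assumptions made so the example runs without a download; they are not from the thread.

```powershell
# Hypothetical temp file standing in for the downloaded utf-8-bom.txt:
# a UTF-8 BOM (EF BB BF) followed by the letter "t" (0x74).
$path = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllBytes($path, [byte[]](0xEF, 0xBB, 0xBF, 0x74))

# Read the raw bytes and keep the first three.
[byte[]]$UTF8BOM = 0xEF, 0xBB, 0xBF
[byte[]]$head = [System.IO.File]::ReadAllBytes($path)[0..2]

# Compare byte for byte by joining both arrays into comparable strings.
$hasBom = ($head -join ',') -eq ($UTF8BOM -join ',')
$hasBom  # True

Remove-Item $path
```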

@markekraus I think you've missed my point. My point isn't about whether the BOM does or does not exist in the file's byte stream. My point is that two built-in APIs treat BOM byte sequences that appear in the byte stream differently:

  • Get-Content omits the BOM character on decoding
  • HtmlWebResponseObject.Content includes the BOM character on decoding

The reason I opened this issue is that this inconsistency has consequences for user code. The inconsistency means that user code has to handle the same file arriving by web request differently from reading it from the file system.
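
To illustrate the extra handling this pushes onto user code, here is a minimal sketch of a normalizing helper. Remove-LeadingBom is a hypothetical name, not a built-in cmdlet. It compares the code point numerically because culture-sensitive string comparisons may treat U+FEFF as ignorable.

```powershell
function Remove-LeadingBom {
    # Hypothetical helper: drops one leading U+FEFF so that text from
    # .Content and text from Get-Content can be handled the same way.
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string]$Text
    )
    if ($Text.Length -gt 0 -and [int]$Text[0] -eq 0xFEFF) {
        return $Text.Substring(1)
    }
    return $Text
}

# Simulated .Content value: U+FEFF followed by "test"
$withBom = [string][char]0xFEFF + 'test'
$clean = Remove-LeadingBom $withBom
[int]$clean[0]  # 116
```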

@alx9r we have two issues that I think address your valid concern. One is for file encoding to default to UTF8NoBOM https://github.com/PowerShell/PowerShell/issues/4878, the second is for outputencoding to default to UTF8NoBOM https://github.com/PowerShell/PowerShell/issues/4681. However, Invoke-WebRequest should return whatever the server returned so if it has a BOM, it should be there.

I thought #4878 and #4681 would only alter the way files are encoded. Am I misunderstanding those? This issue is about decoding files. I can't really control the encoding I encounter in files in the wild which is why consistent decoding matters.

"...Invoke-WebRequest should return whatever the server returned so if it has a BOM, it should be there."

The Content property that Invoke-WebRequest returns is rather far from "whatever the server returned":

  • the headers are stripped (compare with RawContent)
  • the byte stream is decoded to what seems to be a UTF-16-like .NET string

It seems strange to me to do all that and leave the byte order mark.

If Invoke-WebRequest "should return whatever the server returned" then by the same doctrine, Get-Content should return "whatever is in the file". I don't think that makes sense in either case.

The Content property is the file that is being served which is what I meant by "whatever the server returned" which would not contain the headers. The file contains a BOM (see my examples above). In the case where the file itself doesn't contain a BOM, no BOM is returned. Perhaps what you want is the ability to re-encode to some target encoding.

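It looks like the discrepancy between the decoding behavior of Invoke-WebRequest and Get-Content lies in the way that they decode the byte stream to a string. Specifically, Get-Content appears to decode through StreamReader, while the Content property of Invoke-WebRequest's result is produced by StreamToString: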

StreamReader is aware of the byte-order mark and seems to honor endianness during conversion. StreamToString, on the other hand, uses System.Text.Encoding. While System.Text.Encoding includes support for decoding different endianness, StreamToString does not interpret a byte order mark at the beginning of the stream to adjust the decoding to match.
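
The difference can be reproduced offline. The following is a minimal sketch, assuming that System.Text.Encoding.GetString is a fair stand-in for what StreamToString does; the byte array and variable names are illustrative and not taken from the cmdlet source.

```powershell
# UTF-8 BOM (EF BB BF) followed by the ASCII letter "t" (0x74)
[byte[]]$bytes = 0xEF, 0xBB, 0xBF, 0x74

# Encoding.GetString decodes every byte it is given, so the BOM
# survives as the character U+FEFF at index 0.
$viaEncoding = [System.Text.Encoding]::UTF8.GetString($bytes)

# StreamReader detects the BOM while reading and leaves it out of the result.
$stream = [System.IO.MemoryStream]::new($bytes)
$reader = [System.IO.StreamReader]::new($stream)
try { $viaReader = $reader.ReadToEnd() } finally { $reader.Dispose() }

[int]$viaEncoding[0]  # 65279
[int]$viaReader[0]    # 116
```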

Workaround

Byte-order mark aware conversion using Invoke-WebRequest and System.IO.StreamReader can be achieved as follows:

$response = Invoke-WebRequest https://raw.githubusercontent.com/PowerShell/PowerShell/404e876740aa65b1bdd17ce614060eb88e3e7da9/test/powershell/Modules/Microsoft.PowerShell.Utility/WebCmdlets.Tests.ps1

# This uses StreamToString, which does not interpret the byte-order mark, so the BOM is the first character
[int]$response.Content[0] # 65279

# StreamReader interprets the byte-order mark and strips it, so the first character is the pound symbol
[int][System.IO.StreamReader]::new($response.RawContentStream).ReadToEnd()[0] # 35
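
If the workaround is needed in several places, it could be wrapped in a small function. This is only a sketch: Read-StreamBomAware is a hypothetical name, and rewinding the stream is a precaution in case something has already read from it, not a documented requirement of RawContentStream.

```powershell
function Read-StreamBomAware {
    # Hypothetical helper, not part of PowerShell.
    param(
        [Parameter(Mandatory)]
        [System.IO.Stream]$Stream
    )
    # Rewind when possible so decoding starts at the first byte.
    if ($Stream.CanSeek) { $Stream.Position = 0 }

    # Arguments: stream, fallback encoding, detect BOM, buffer size, leave stream open.
    $reader = [System.IO.StreamReader]::new(
        $Stream, [System.Text.Encoding]::UTF8, $true, 4096, $true)
    try { $reader.ReadToEnd() } finally { $reader.Dispose() }
}

# Simulated response body: UTF-8 BOM followed by "#" (0x23)
$body = [System.IO.MemoryStream]::new([byte[]](0xEF, 0xBB, 0xBF, 0x23))
$text = Read-StreamBomAware $body
[int]$text[0]  # 35
```

With a real response, the call would be Read-StreamBomAware $response.RawContentStream.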

Implications for Endianness Mismatch

There seems to be another implication of Invoke-WebRequest's use of StreamToString. Because StreamToString decodes without considering the byte-order mark, the .Content property of the object returned by Invoke-WebRequest should be expected to contain incorrect data whenever the endianness of the computer that wrote the served file differs from that of the computer invoking Invoke-WebRequest.
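
What such a mismatch would look like can be sketched offline with UTF-16, where byte order matters. This assumes the decoder has settled on little-endian UTF-16 while the bytes are big-endian; whether Invoke-WebRequest ends up in that situation for a given response is not established here.

```powershell
# Big-endian UTF-16: BOM (FE FF) followed by "t" encoded as 00 74
[byte[]]$bytes = 0xFE, 0xFF, 0x00, 0x74

# Decoding as little-endian UTF-16 without consulting the BOM swaps each byte pair.
$mismatched = [System.Text.Encoding]::Unicode.GetString($bytes)

# StreamReader reads the BOM, switches to big-endian UTF-16 and strips the BOM.
$reader = [System.IO.StreamReader]::new([System.IO.MemoryStream]::new($bytes))
try { $detected = $reader.ReadToEnd() } finally { $reader.Dispose() }

[int]$mismatched[1]  # 29696 (0x7400, not the letter "t")
[int]$detected[0]    # 116
```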
