Create a new file in ISE. Save it as utf-8-bom.txt. Commit and push to GitHub. Download and inspect the file as follows:
$content = Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt |
% Content
[int]$content[0]
Expected:
116
116 is the decimal integer representation of the character "t".
Actual:
65279
65279 is the decimal integer representation of the Unicode byte order mark (U+FEFF).
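For reference, a quick sketch of where those two numbers come from. It is not part of the original report, and the variable name $preamble is only illustrative.

```powershell
# U+FEFF is the code point of the byte order mark; "t" is U+0074.
[int][char]0xFEFF    # 65279
[int][char]'t'       # 116

# In UTF-8 the BOM is serialized as the three bytes EF BB BF,
# which is what .NET reports as the UTF-8 preamble.
$preamble = [System.Text.Encoding]::UTF8.GetPreamble()
($preamble | ForEach-Object { $_.ToString('X2') }) -join ' '   # EF BB BF
```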
> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      6.0.0-beta
PSEdition                      Core
GitCommitId                    v6.0.0-beta.7
OS                             Microsoft Windows 6.3.9600
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
Since the file has a BOM, the output should have a BOM, right? The first file here is currently saved as UTF-8 w/ BOM and the second as UTF-8 no BOM:
PS C:\> [int](iwr https://raw.githubusercontent.com/PowerShell/PowerShell/master/test/powershell/Modules/Microsoft.PowerShell.Utility/WebCmdlets.Tests.ps1).Content[0]
65279
PS C:\> [int](iwr https://raw.githubusercontent.com/PowerShell/PowerShell/master/test/powershell/Modules/Microsoft.PowerShell.Utility/Write-Error.Tests.ps1).Content[0]
68
Seems to be working as expected to me? Note that this test will fail in the future as we'll be converting all our sources on GitHub to have no BOM.
Hmm I can't repro this with the example provided either.
$content = Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt |
% Content
[int]$content[0]
On Windows 10 this gives me 65279.
edit: oh, I see. I had it backwards. I thought they said the BOM was missing. This looks like it's working as intended to me.
@alx9r please add more details if you think it's behaving incorrectly. Thanks!
Since the file has a BOM, the output should have a BOM, right?
I suppose that depends on what, exactly, is meant by "Content" for a text file. I don't have a strong opinion about whether "Content" should include the BOM when present or not. I do feel fairly strongly, however, that the inclusion or exclusion of a BOM should be consistent across the PowerShell built-in APIs. Consider that
[System.IO.Path]::GetTempFileName() |
% {
Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt -OutFile $_
[int](Get-Content $_)[0]
}
outputs 116. In other words, if you download the file to the file system and read it back, the "Content" does not include a BOM, but if you download the file to memory and inspect the "Content", there is a BOM. It seems like either both should include the BOM or neither should.
That's Get-Content not returning the BOM. The BOM should be there in the file:
Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt -OutFile c:\temp\utf-8-bom.txt
$fileStream = [System.IO.FileStream]::new("c:\temp\utf-8-bom.txt", "Open", "Read")
[byte[]]$UTF8BOM = 0xEF, 0xBB, 0xBF
[byte[]]$bytes = [byte[]]::New(3)
$null = $fileStream.Read($bytes,0,$bytes.Length)
Compare-Object $UTF8BOM $bytes -PassThru
This should come back with no output if $bytes and $UTF8BOM are the same.
You can get it with Get-Content using -Encoding Byte:
Invoke-WebRequest https://raw.githubusercontent.com/alx9r/BootstraPS/master/Resources/utf-8-bom.txt -OutFile c:\temp\utf-8-bom.txt
[byte[]]$UTF8BOM = 0xEF, 0xBB, 0xBF
[byte[]]$bytes = Get-Content c:\temp\utf-8-bom.txt -Encoding Byte -TotalCount 3
Compare-Object $UTF8BOM $bytes -PassThru
@markekraus I think you've missed my point. My point isn't about whether the BOM does or does not exist in the file's byte stream. My point is that two built-in APIs treat BOM byte sequences that appear in the byte stream differently:
- Get-Content omits the BOM character on decoding
- HtmlWebResponseObject.Content includes the BOM character on decoding
The reason I opened this issue is that this inconsistency has consequences for user code. The inconsistency means that user code has to handle the same file arriving by web request differently from reading it from the file system.
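To make the consequence concrete, here is a minimal sketch of the kind of normalization user code ends up needing. Remove-LeadingBom is a hypothetical helper, not a built-in cmdlet, and the sample strings are stand-ins for the real file content.

```powershell
# Hypothetical helper: remove any U+FEFF characters from the start of a string.
function Remove-LeadingBom {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string]$Text
    )
    $Text.TrimStart([char]0xFEFF)
}

# Stand-ins: the web response kept the BOM, the file read did not.
$fromWeb  = [string][char]0xFEFF + 'test'
$fromFile = 'test'

# Compare ordinally; culture-aware comparisons may treat U+FEFF as ignorable.
[string]::Equals($fromWeb, $fromFile, [System.StringComparison]::Ordinal)   # False
[string]::Equals((Remove-LeadingBom $fromWeb), (Remove-LeadingBom $fromFile), [System.StringComparison]::Ordinal)   # True
```

The ordinal comparison is deliberate: it makes the stray character visible instead of letting a linguistic comparison hide it.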
@alx9r we have two issues that I think address your valid concern. One is for file encoding to default to UTF8NoBOM (https://github.com/PowerShell/PowerShell/issues/4878), the second is for output encoding to default to UTF8NoBOM (https://github.com/PowerShell/PowerShell/issues/4681). However, Invoke-WebRequest should return whatever the server returned, so if it has a BOM, it should be there.
I thought #4878 and #4681 would only alter the way files are encoded. Am I misunderstanding those? This issue is about decoding files. I can't really control the encoding I encounter in files in the wild which is why consistent decoding matters.
...Invoke-WebRequest should return whatever the server returned so if it has a BOM, it should be there.
The Content property that Invoke-WebRequest returns is rather far from "whatever the server returned":
It seems strange to me to do all that and leave the byte order mark.
If Invoke-WebRequest "should return whatever the server returned" then by the same doctrine, Get-Content should return "whatever is in the file". I don't think that makes sense in either case.
The Content property is the file that is being served, which is what I meant by "whatever the server returned"; it would not contain the headers. The file contains a BOM (see my examples above). In the case where the file itself doesn't contain a BOM, no BOM is returned. Perhaps what you want is the ability to re-encode to some target encoding.
It looks like the discrepancy between the decoding behavior of Invoke-WebRequest and Get-Content lies in the way that they decode the byte stream to a string. Specifically, the difference is as follows:
- Get-Content is decoded by System.IO.StreamReader
- Invoke-WebRequest is decoded by the PowerShell project's StreamToString
StreamReader is aware of the byte-order mark and seems to honor endianness during conversion. StreamToString, on the other hand, uses System.Text.Encoding. While System.Text.Encoding includes support for decoding different endianness, StreamToString does not interpret a byte order mark at the beginning of the stream to adjust the decoding to match.
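The difference between the two .NET decoding paths can be reproduced without any network access. This is only a sketch of the general mechanism using in-memory bytes; it does not call the PowerShell project's StreamToString itself, and the variable names are illustrative.

```powershell
# UTF-8 BOM (EF BB BF) followed by the ASCII letter "t" (0x74).
[byte[]]$bytes = 0xEF, 0xBB, 0xBF, 0x74

# Encoding.GetString decodes every byte, so the BOM survives as U+FEFF.
$viaEncoding = [System.Text.Encoding]::UTF8.GetString($bytes)
[int]$viaEncoding[0]     # 65279

# StreamReader detects the BOM by default and consumes it.
$stream = [System.IO.MemoryStream]::new($bytes)
$reader = [System.IO.StreamReader]::new($stream)
$viaReader = $reader.ReadToEnd()
$reader.Dispose()
[int]$viaReader[0]       # 116
```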
Byte-order mark aware conversion using Invoke-WebRequest and System.IO.StreamReader can be achieved as follows:
$response = Invoke-WebRequest https://raw.githubusercontent.com/PowerShell/PowerShell/404e876740aa65b1bdd17ce614060eb88e3e7da9/test/powershell/Modules/Microsoft.PowerShell.Utility/WebCmdlets.Tests.ps1
# .Content is produced by StreamToString, which does not interpret the byte-order mark, so the BOM is the first character
[int]$response.Content[0] # 65279
# StreamReader interprets the byte-order mark and strips it, so the first character is the pound symbol
[int][System.IO.StreamReader]::new($response.RawContentStream).ReadToEnd()[0] # 35
There seems to be another implication of Invoke-WebRequest's use of StreamToString. Because StreamToString decodes without considering the byte-order mark, the .Content property of the object returned by Invoke-WebRequest can be expected to contain incorrect data when there is an endianness mismatch between the computer that wrote the served file and the computer invoking Invoke-WebRequest.
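A small sketch of what such a mismatch looks like at the .NET level, assuming a UTF-16 big-endian payload decoded as little-endian. It illustrates the mechanism only; it has not been verified against Invoke-WebRequest itself, whose behavior also depends on which encoding it selects for a given response.

```powershell
# UTF-16 big-endian BOM (FE FF) followed by "t" encoded big-endian (00 74).
[byte[]]$bytes = 0xFE, 0xFF, 0x00, 0x74

# Decoding as little-endian UTF-16 without looking at the BOM swaps every byte pair.
$wrong = [System.Text.Encoding]::Unicode.GetString($bytes)
[int]$wrong[1]    # 29696 (0x7400), not 116

# StreamReader reads the BOM, switches to big-endian UTF-16, and yields "t".
$reader = [System.IO.StreamReader]::new([System.IO.MemoryStream]::new($bytes))
$right = $reader.ReadToEnd()
$reader.Dispose()
[int]$right[0]    # 116
```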