PowerShell: [My bug report] irm, iwr get XML problem

Created on 10 Jan 2020 · 10 Comments · Source: PowerShell/PowerShell

The XML response is not recognized and processed as XML, and its non-ASCII text comes back garbled.
Example

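$url = 'https://storage.live.com/items/A78ACCAEBB24EDD7!37945?&authkey=!APfFKTYtceWCfG0'
$g   = './xmltest'
$reg = 'pN|utf'
# Decode the response in memory and show the matching lines (these come back garbled)
((irm $url) -split "[`r`n]+") -match $reg
# Save the raw response to a file and read it back (these lines are intact)
irm $url -OutFile $g
(Get-Content $g) -match $reg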

Expected

PS /sh> irm $URL

xml                            Folder
---                            ------
version="1.0" encoding="utf-8" Folder

PS /sh> (irm $URL).Folder.Items.Document

ItemType ResourceID             RelationshipName
-------- ----------             ----------------
Document A78ACCAEBB24EDD7!37948 测试.json

Results

PS /sh> (iwr $URL).Headers.'Content-Type'
text/xml
PS /sh> ((irm $URL) -split "[`r`n]+") -match $reg
<?xml version="1.0" encoding="utf-8"?>
      <RelationshipName>æµè¯.json</RelationshipName>
  <RelationshipName>BingClients</RelationshipName>

Reading the saved file shows no problem.

PS /s> (get-content  ../aa/irm )-match 'pN|utf'
 <?xml version="1.0" encoding="utf-8"?>
<RelationshipName>测试.json</RelationshipName>
<RelationshipName>BingClients</RelationshipName>
PS /sdcard/Documents/sh>

curl

PS /sdcard/Documents/sh> ((curl $URL) -split "[`r`n]+") -match $reg
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2693  100  2693    0     0   2170      0  0:00:01  0:00:01 --:--:--  2170
<?xml version="1.0" encoding="utf-8"?>
      <RelationshipName>测试.json</RelationshipName>
  <RelationshipName>BingClients</RelationshipName>

Labels: Area-Cmdlets-Utility, Issue-Question, Up-for-Grabs

Most helpful comment

Note: I don't know what the _intended_ behavior is, but here is what seems to be happening:

Because the response doesn't indicate a character encoding (charset) in its Content-Type header field (text/xml rather than text/xml; charset=utf-8), PowerShell defaults to ISO-8859-1, in accordance with the - obsolete since 2014 - RFC 2616.

Because it blindly assumes ISO-8859-1, the UTF-8 BOM is read as _data_, and the payload is therefore not recognized as XML, which falls back to a(n incorrectly decoded) string instead of returning an XmlDocument instance.
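
The effect described above can be reproduced offline. The following is a minimal sketch, not the cmdlets' actual code path: it assumes the server sends a UTF-8 BOM followed by UTF-8 text, and the variable names and sample payload are illustrative only. It also shows a possible workaround, which relies on ISO-8859-1 mapping every byte to exactly one character so that the mis-decoded string can be turned back into the original bytes.

```powershell
$utf8   = [System.Text.Encoding]::UTF8
$latin1 = [System.Text.Encoding]::GetEncoding('ISO-8859-1')

# The file name from the report, built from code points (U+6D4B, U+8BD5)
# so that the encoding of this script file itself does not matter.
$name     = "$([char]0x6D4B)$([char]0x8BD5).json"
$original = "<?xml version=`"1.0`" encoding=`"utf-8`"?><RelationshipName>$name</RelationshipName>"

# Assumed wire format: UTF-8 BOM (EF BB BF) followed by the UTF-8 payload.
[byte[]] $wire = $utf8.GetPreamble() + $utf8.GetBytes($original)

# Decoding every byte as ISO-8859-1 turns the BOM into three ordinary characters
# and each multi-byte UTF-8 sequence into several unrelated characters.
$misdecoded = $latin1.GetString($wire)

# With those three characters in front of the XML declaration, the [xml] cast fails.
$parsedAsXml = $true
try { $null = [xml] $misdecoded } catch { $parsedAsXml = $false }

# Possible workaround: recover the original bytes, decode them as UTF-8,
# and drop the BOM character (U+FEFF) that GetString leaves in place.
$repaired = $utf8.GetString($latin1.GetBytes($misdecoded)).TrimStart([char]0xFEFF)
$doc      = [xml] $repaired
$doc.RelationshipName
```

The same re-encoding step can be applied to a string returned by irm, as long as the response really was UTF-8 that got decoded as ISO-8859-1.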

Note that the current RFC, RFC 7231, no longer mandates an overall default and instead defers to the default encoding of the given media type.
For XML, RFC 7303 mandates looking at the BOM first and, if there is none, at the charset attribute in the Content-Type header. If that isn't present either, respect the encoding specified in the XML declaration, and if there is none, default to UTF-8.

Given that HTML5 now also defaults to UTF-8 and given that RFC 2616 is obsolete, we should consider implementing the following logic in both Invoke-WebRequest and Invoke-RestMethod:

  • respect a BOM, if present
  • if there is no BOM, respect a charset attribute in Content-Type
  • otherwise, for XML and HTML, respect the encoding specified in the XML declaration (e.g. <?xml version="1.0" encoding="ISO-8859-1" ?>) / HTML <meta> element, if present (green-lit in #3267)
  • If none of the above applies, default to UTF-8.
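
The precedence order listed above could be sketched roughly as follows. This is an illustration only, not code from PowerShell: the function name Resolve-PayloadEncoding and its parameters are hypothetical, only UTF-8 and UTF-16 BOMs are checked, and the HTML <meta> case is left out.

```powershell
function Resolve-PayloadEncoding {
    # Hypothetical helper: picks an encoding for a response body.
    param(
        [Parameter(Mandatory)] [byte[]] $Bytes,
        [string] $ContentType
    )

    # 1. A BOM in the payload wins (UTF-32 BOMs are not handled in this sketch).
    if ($Bytes.Length -ge 3 -and $Bytes[0] -eq 0xEF -and $Bytes[1] -eq 0xBB -and $Bytes[2] -eq 0xBF) {
        return [System.Text.Encoding]::UTF8
    }
    if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFF -and $Bytes[1] -eq 0xFE) {
        return [System.Text.Encoding]::Unicode
    }
    if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFE -and $Bytes[1] -eq 0xFF) {
        return [System.Text.Encoding]::BigEndianUnicode
    }

    # 2. Otherwise, a charset attribute in the Content-Type header.
    if ($ContentType -match 'charset\s*=\s*"?([^";\s]+)') {
        return [System.Text.Encoding]::GetEncoding($Matches[1])
    }

    # 3. Otherwise, the encoding named in an XML declaration near the start of the payload.
    $head = [System.Text.Encoding]::ASCII.GetString($Bytes, 0, [Math]::Min($Bytes.Length, 256))
    if ($head -match '<\?xml[^>]*\bencoding\s*=\s*["'']([^"'']+)["'']') {
        return [System.Text.Encoding]::GetEncoding($Matches[1])
    }

    # 4. If none of the above applies, default to UTF-8.
    return [System.Text.Encoding]::UTF8
}

# Example: no BOM, no charset, but an XML declaration that names ISO-8859-1.
$sample = [System.Text.Encoding]::ASCII.GetBytes('<?xml version="1.0" encoding="ISO-8859-1" ?><a/>')
(Resolve-PayloadEncoding -Bytes $sample -ContentType 'text/xml').WebName
```

An encoding name that .NET does not know makes GetEncoding throw, so a real implementation would probably want to catch that and fall through to the next step.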

All 10 comments

The problem is that live.com is not returning the encoding it's using in its headers. PowerShell obeys the standard by assuming ISO-8859-1, but unfortunately the site is using UTF-8.


@he852100 Please add info about your PowerShell version. Can you repro with the latest PowerShell Core build?

PSVersion                      7.0.0-daily.20200110
PSEdition                      Core
GitCommitId                    7.0.0-daily.20200110
OS                             Linux 3.10.0-1062.9.1.el7.x86_64 …
Platform                       Unix
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
sh> Invoke-WebRequest 'https://pscoretestdata.blob.core.windows.net/v7-0-0-daily-20200110/powershell-7.0.0-daily.20200110-linux-arm64.tar.gz' -O ~/powershell.tar.gz -Resume
StatusCode        : 416                                           
StatusDescription : RequestedRangeNotSatisfiable                  
Content           : <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidRange</Code><Message>The rang
                    e specified is invalid for the current size of the resource.
                    RequestId:e8b88225-401e-0127-7cdc-c866f8000000

PS /root> $a.headers.GetEnumerator()

Key             Value
---             -----
Server          {Windows-Azure-Blob/1.0, Microsoft-HTTPAPI/2.0}
x-ms-request-id {322455bd-301e-008d-77e3-c8f642000000}
x-ms-version    {2014-02-14}
Date            {Sun, 12 Jan 2020 00:56:33 GMT}
Content-Length  {249}
Content-Type    {application/xml}
Content-Range   {bytes */46486387}


PowerShell obeys the standard by assuming ISO-8859-1, but unfortunately the site is using UTF-8.

@iSazonov It can be determined that PowerShell does not recognize the UTF-8 BOM.

@he852100 I guess it comes from .Net Core.

@he852100 I guess it comes from .Net Core.

That comes from PS5 and older. If the website says it is UTF-8, why does iwr return ASCII?

Currently we have many workarounds; I guess they come from PS 5.0.
Now we could use the HttpContent.ReadAsStringAsync() method, which seems to already have the decoding logic:
https://github.com/dotnet/runtime/blob/bd6cbe3642f51d70839912a6a666e5de747ad581/src/libraries/System.Net.Http/src/System/Net/Http/HttpContent.cs#L182
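
To see how that method treats a payload like the one in this issue, a small offline experiment may help. This is a sketch under assumptions, not a proposed change to the cmdlets: it wraps hand-built bytes in a ByteArrayContent instead of making a real request, declares text/xml without a charset as live.com does, and uses an illustrative sample payload.

```powershell
# Load System.Net.Http explicitly in case it is not already available in the session.
Add-Type -AssemblyName System.Net.Http

$utf8     = [System.Text.Encoding]::UTF8
$original = "<RelationshipName>$([char]0x6D4B)$([char]0x8BD5).json</RelationshipName>"

# Assumed wire format: UTF-8 BOM followed by the UTF-8 payload.
[byte[]] $wire = $utf8.GetPreamble() + $utf8.GetBytes($original)

# Stand-in for a response body with Content-Type: text/xml and no charset attribute.
$content = [System.Net.Http.ByteArrayContent]::new($wire, 0, $wire.Length)
$content.Headers.ContentType = [System.Net.Http.Headers.MediaTypeHeaderValue]::Parse('text/xml')

# Let HttpContent choose the encoding and decode the buffered bytes.
$decoded = $content.ReadAsStringAsync().GetAwaiter().GetResult()
$content.Dispose()

$decoded
```

With no charset in the header, the BOM is the only encoding hint the method has to work with here. The case where a charset attribute and a BOM disagree is not covered by this sketch.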


That's promising, @iSazonov, but it looks like the referenced method gives precedence to the charset attribute over the payload's BOM, correct?

This is the reverse of how XML data is supposed to be handled according to RFC 7303 (leaving the additional need to respect an encoding in the XML declaration aside), and, arguably, for _all_ textual media types, according to section "5. Security Considerations" of RFC 6657:

this document recommends the use of charset information that is more likely to be correct (for example, in-band over out-of-band).

A BOM is an instance of in-band information, whereas the charset header-field attribute is out-of-band information; therefore, the BOM should take precedence.

Therefore, the method you link to wouldn't solve the problem described in #12861, for instance.

the BOM should take precedence

It looks like a .Net bug. You could open new issue in .Net Runtime repo.

In general, I guess we could simplify the PowerShell code if we followed the .NET API.
