PowerShell: Web cmdlets should parse the <html><head><meta charset="foo"> attribute for the correct encoding if it is not in the HTTP header

Created on 6 Mar 2017 · 26 Comments · Source: PowerShell/PowerShell

Some websites do not populate the charset property of the Content-Type header, so characters aren't rendered correctly. One suggestion is to expose a -charset parameter; however, the user would still need to know the expected charset. Advanced users can already do the encoding translation in script. UTF-8 probably works in most cases, so I'm not entirely sure how useful a parameter would be if it expects the user to know the correct charset ahead of time.

See discussion from https://github.com/PowerShell/PowerShell/issues/3126 for details on how this came about

Area-Cmdlets-Utility Committee-Reviewed Issue-Enhancement

All 26 comments

I believe we should discuss a default encoding for the web cmdlets here.
Currently we use the RFC 2616 default, ISO-8859-1.
In practice, browsers have used Windows-1252 for a long time.
And the HTML5 default is UTF-8.

Perhaps we should also use Windows-1252.
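
To make the difference between these candidate defaults concrete, here is a small Python sketch (not PowerShell code, and the sample strings are invented for illustration). It decodes the same bytes under each of the three charsets mentioned above.

```python
# UTF-8 bytes for a word containing a non-ASCII character.
payload = "café".encode("utf-8")  # two bytes for the accented letter

as_utf8 = payload.decode("utf-8")         # decoded correctly
as_latin1 = payload.decode("iso-8859-1")  # each byte becomes its own character
as_cp1252 = payload.decode("cp1252")      # same result as ISO-8859-1 for these bytes

# ISO-8859-1 and Windows-1252 only differ in the 0x80-0x9F range.
euro_byte = b"\x80"
euro_cp1252 = euro_byte.decode("cp1252")      # the euro sign
euro_latin1 = euro_byte.decode("iso-8859-1")  # a C1 control character

print(repr(as_utf8), repr(as_latin1), repr(as_cp1252))
print(repr(euro_cp1252), repr(euro_latin1))
```

A wrong default does not raise an error here; it silently produces mojibake, which is why the choice of fallback matters.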

I looked at the web cmdlets and found that we use the ContentType parameter to encode a request (the Content-Type header can contain a charset value). If we don't specify a charset in ContentType, we use the default charset, ISO-8859-1.
For decoding a response, we use the charset from the response's Content-Type. If the server returns no Content-Type, we use the same default charset, ISO-8859-1.
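
As a rough illustration of that behavior, the following Python sketch reads the charset parameter from a Content-Type value and falls back to ISO-8859-1 when the header or the parameter is missing. The function name `charset_from_content_type` is hypothetical; this is not the web cmdlets' actual implementation.

```python
from email.message import Message

DEFAULT_CHARSET = "iso-8859-1"  # the default described in this comment


def charset_from_content_type(content_type, default=DEFAULT_CHARSET):
    """Return the charset parameter of a Content-Type value, or the default."""
    if not content_type:
        # No Content-Type header at all: use the default.
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns failobj if absent.
    return msg.get_content_charset(failobj=default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
print(charset_from_content_type(None))
```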

So we should treat -CharSet as -ResponseCharSet.

After some thought, I believe that using Windows 1252 as default is obsolete and we should aim at HTML5 and UTF-8 as defaults. https://w3techs.com/technologies/details/ml-html5/all/all

@PowerShell/powershell-committee reviewed this and agrees this is an issue for customers. The proposal is to parse the HTTP header first: if charset is in Content-Type, we use that. Otherwise, if the content type is HTML, we parse <meta charset="X"> for the charset attribute and use that.
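
The proposed lookup order can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `resolve_charset`, the two regular expressions, and the 1024-byte scan window are choices made for this sketch, not the cmdlets' code, and the ISO-8859-1 fallback simply mirrors the current default discussed in this thread.

```python
import re

# Best-effort patterns; they make no attempt to handle malformed HTML.
HEADER_CHARSET = re.compile(r"""charset\s*=\s*["']?([^\s;"']+)""", re.IGNORECASE)
META_CHARSET = re.compile(
    rb"""<meta\s[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def resolve_charset(content_type, body, fallback="iso-8859-1"):
    """Pick a charset: HTTP header first, then <meta> for HTML, then fallback."""
    is_html = False
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
        media_type = content_type.split(";")[0].strip().lower()
        is_html = media_type == "text/html"
    if is_html:
        # Only scan the start of the document (assumed window of 1024 bytes).
        match = META_CHARSET.search(body[:1024])
        if match:
            return match.group(1).decode("ascii").lower()
    return fallback


html = b'<html><head><meta charset="gb2312"></head><body></body></html>'
print(resolve_charset("text/html; charset=UTF-8", html))
print(resolve_charset("text/html", html))
print(resolve_charset("text/html", b"<html></html>"))
```

The same pattern also picks up the older `<meta http-equiv="Content-Type" content="text/html; charset=...">` form, since it only looks for a charset assignment inside a meta tag.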

In Windows PowerShell we use Internet Explorer to parse HTML. What portable parser can we use in PowerShell Core?

And what if the HTML doesn't contain <meta charset="foo">? What default should we use as the fallback?

@iSazonov the proposal is that we don't rely on any browser for the html parsing (if complete parsing is needed, I still think it would make more sense in a convertfrom-html cmdlet). I think "best effort" for this is sufficient (perhaps even just a regular expression) to cover the majority of cases and we wouldn't worry about malformed html.

@SteveL-MSFT The original Windows web cmdlet returns ParsedHtml : mshtml.HTMLDocumentClass. Do we want to lose that functionality?

@iSazonov .ParsedHtml relied on Internet Explorer. I don't think we can have a dependency on any particular web browser in the webcmdlets.

To answer your other question, which I missed: if <meta charset> doesn't exist and charset isn't specified in the Content-Type HTTP header, then we do what we do today, which is to assume ISO-8859-1.

@SteveL-MSFT

.ParsedHtml relied on Internet Explorer

If we use a ported library for HTML parsing, we will solve this issue, get ParsedHtml ported, and get a base for ConvertFrom-HTML.

I wonder about ISO-8859-1. Why don't we want to adopt the new HTML5 standard?

@iSazonov my understanding is that HTTP 1.1 still defaults to ISO-8859-1. If the content type is text/html, we should just follow the HTML5 rules for determining the encoding, ideally by using one of the OSS HTML parsing libraries you already found.

@SteveL-MSFT It seems that doc is very old. The newer one is http://www.w3.org/TR/html5/syntax.html#the-input-byte-stream and it doesn't mention ISO-8859-1 at all.

CoreFX already uses UTF-8 as the default.

@iSazonov that's HTML5; HTTP 1.1 defaults to ISO-8859-1 if charset is not specified. See section 3.4.1 in https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

These standards are too muddled 😕 From https://tools.ietf.org/html/rfc7231:

The default charset of ISO-8859-1 for text media types has been
removed; the default is now whatever the media type definition says.
Likewise, special treatment of ISO-8859-1 has been removed from the
Accept-Charset header field.

In any case we trust CoreFX. Yes?

Ideally, we should just leave this to corefx.

This is already in CoreFX, so we can.
The only question is whether it is in the CoreFX version we currently use, or whether we are blocked until NetStandard 2.0.

Marking as waiting on NetStandard20. Once we move to latest CoreClr, we can verify if this is still an issue.

@SteveL-MSFT Can you start an internal discussion to decide between HtmlAgilityPack and AngleSharp? Which one can we trust more?
Then I would try to replace IE with one of these parsers for PowerShell Core (and leave IE for FullCLR).

@PowerShell/powershell-committee believes AngleSharp is a better choice, as it seems to be more actively updated, and this should be a separate ConvertFrom-Html cmdlet/module. Parsing the charset attribute on the HTML meta tag should still be in the webcmdlets, as a simplified best effort rather than using a complete HTML parser.

@dantraMSFT can you look into this?

Looking now.

The problem occurred because Invoke-RestMethod was calling ContentHelper.GetEncoding. This returns a fallback encoding, which was defeating the checks for meta tags in the response. Explicit tests were added to cover the same variations as are tested for Invoke-WebRequest.
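
The shape of this kind of bug can be illustrated with a short Python sketch. All names here are hypothetical and this is not the actual ContentHelper.GetEncoding code; it only shows how a helper that always returns a fallback hides the "no charset declared" case from the caller, so a later meta-tag check never runs.

```python
FALLBACK = "iso-8859-1"


def header_charset_or_fallback(header_charset):
    """Always returns a name, so 'declared' and 'assumed' look the same."""
    return header_charset or FALLBACK


def pick_charset_buggy(header_charset, meta_charset):
    charset = header_charset_or_fallback(header_charset)
    if charset is None:  # never true: the fallback was already applied
        charset = meta_charset
    return charset


def pick_charset_fixed(header_charset, meta_charset):
    # Apply the fallback last, after both declared sources were consulted.
    return header_charset or meta_charset or FALLBACK


print(pick_charset_buggy(None, "gb2312"))
print(pick_charset_fixed(None, "gb2312"))
```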

I'm not seeing the problem with 'http://weibo.com'. Invoke-RestMethod is detecting the encoding correctly. Can you be more specific?

For tv.sohu.com and ip138.com, I found a bug in Invoke-RestMethod. It calls WriteVerbose with a message indicating the encoding name or header name, but Encoding.EncodingName is throwing. I'll need to change this and update the tests.

Any further work on this I'm deferring to 6.1.0

Submitted RFC for the creation of ConvertFrom-Html here: https://github.com/PowerShell/PowerShell-RFC/pull/137

We get a new HttpClient with .NET 3+, so I am removing the label.
