Powershell: Invoke-Webrequest is missing some properties, like .ParsedHtml and .AllElements

Created on 9 Dec 2016 · 28 Comments · Source: PowerShell/PowerShell

Hello Everyone
PowerShell Core 6.x's Invoke-WebRequest is missing one of the most 'loved' features of 'classic' PowerShell, i.e. the ability to parse web pages and turn them into explorable objects.

Steps to reproduce

$test = Invoke-WebRequest -Uri http://www.github.com

Expected behavior

$test.ParsedHtml
$test.AllElements
(among others) are missing:

Microsoft Powershell :
psclassic

Core PowerShell:
pscore

Actual behavior

these properties are missing

Environment data

Name                           Value
----                           -----
SerializationVersion           1.1.0.1
PSRemotingProtocolVersion      2.3
PSEdition                      Core
WSManStackVersion              3.0
BuildVersion                   3.0.0.0
GitCommitId                    v6.0.0-alpha.13
PSVersion                      6.0.0-alpha
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}

I have hope to have web parsing capabilities in PSCore!
Thanks everyone!

Area-Cmdlets-Utility Resolution-External

All 28 comments

@joeyaiello FYI: Full PowerShell essentially uses Internet Explorer to do the HTML parsing. This is likely to be problematic to replicate on .NET core.

@BrucePay I'd still like to investigate after we look at whatever ends up shipping .NET Standard 2.0

What's the best alternative if this is low priority? Any low effort workarounds for parsing HTML in core?

.Net Std 2.0 doesn't solve this. There is another issue where we discussed the need to leverage an OSS html parser rather than rely on a web browser being available (https://github.com/PowerShell/PowerShell/issues/2867); doing that work should also resolve this one.

May I propose using https://github.com/servo/html5ever? Loosely speaking, Servo is the next generation of Firefox written in Rust. You can write your bindings against https://github.com/utkarshkukreti/select.rs, which is written on top of html5ever.

Unlike some other parts of the OSS community, which in the past haven't collaborated beyond their own specific list of companies, the Rust community is incredibly open and collaborative. VSCode uses a library written in Rust, https://github.com/BurntSushi/ripgrep/, whose author (@BurntSushi) made changes specifically to accommodate them.

There's also https://github.com/google/gumbo-parser, which doesn't seem to be maintained right now, and Chromium, but I don't know how easy it is to pull in external code.

PS: I am an honorary member of the Rust action strike force, which pressures people into rewriting already existing codebases into Rust. No I'm not, lol, there's no such thing. I'm just a Linux user who wishes to be able to parse html with iwr.

@SRGOM I don't have any affinity to any particular html parser, but certainly one that is maintained is more desirable. I know that @iSazonov had proposed looking at some other OSS ones as well.

It was in #3267 - AngleSharp. I now believe that AngleSharp is a lightweight and sufficient solution to this Issue. The use of more powerful engines should be discussed further (For example, if we want to migrate completely to a similar engine).

Rather than having this as part of the webcmdlets, these should be separate as html cmdlets which can be used against local files.

I am guessing this is still considered for the future?

@MSAdministrator Yes, it is still considered for the future.

@MSAdministrator The Up-for-Grabs label delegates this to the community :-). Feel free to write an RFC to start the dev process.

Submitted RFC for ConvertFrom-Html here: https://github.com/PowerShell/PowerShell-RFC/pull/137

Why has the default been changed to -UseBasicParsing?
This is the cause of the issue. I was using the Scripts property to gather information from marketplace.visualstudio.com, and now that property is missing, causing my automation script to fail. This really shouldn't be an RFC but a request to restore the original functionality. All that needs to be done to fix this is to remove that hidden flag.

@wyzerd Windows PowerShell relied on Internet Explorer to parse the html. Since Internet Explorer isn't available on most platforms we support with PowerShell Core 6 (nanoserver, Linux, macOS), it made sense to default to -UseBasicParsing. @MSAdministrator's proposal for ConvertFrom-Html is a better solution than marrying the parsing capability to the web cmdlets, because it also covers cases like parsing a local html file.
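
To make the trade-off concrete: with basic parsing the response object still carries the raw Content plus lightweight Links, Images and InputFields collections, while the IE-backed members discussed in this thread (ParsedHtml, AllElements, Forms, Scripts) are absent. The sketch below simply reports what a given response exposes. Get-BasicResponseSummary is a hypothetical helper name, not a built-in cmdlet, and the URL in the usage comment is only a placeholder.

```powershell
# Hypothetical helper (not part of PowerShell): summarizes which HTML-related
# members an Invoke-WebRequest response exposes.
# Assumes $Response came from Invoke-WebRequest (or has the same property names).
function Get-BasicResponseSummary {
    param(
        [Parameter(Mandatory)]
        $Response
    )
    # List property names without touching their values.
    $names = $Response.PSObject.Properties.Name
    [pscustomobject]@{
        HasParsedHtml   = $names -contains 'ParsedHtml'
        HasAllElements  = $names -contains 'AllElements'
        LinkCount       = @($Response.Links).Count
        ImageCount      = @($Response.Images).Count
        InputFieldCount = @($Response.InputFields).Count
        ContentLength   = ([string]$Response.Content).Length
    }
}

# Usage (requires network access):
# $r = Invoke-WebRequest -Uri 'https://github.com'
# Get-BasicResponseSummary -Response $r
```

On PowerShell Core the two Has* flags would be expected to come back False, which is exactly the behavior this issue reports.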

MS is trusted for backwards compatibility. It's okay to miss features but please don't add or remove things silently. It breaks trust.

It wasn't removed silently. There were several announcement blog posts, including at least two or three from the main man behind the change, and it was documented in the patch notes and I believe is also in the updated help documentation.

@vexx32 You know what I'm trying to say, but yes, "silently" wasn't the right word to use. Please don't change default behavior that already works.

There wasn't really much of a choice given they wanted to support multiple platforms, really. It's unfortunate, but as Steve mentions it's probably best to look at alternate solutions than to keep it tied down to a past solution that would inevitably break.

MS as a whole is moving away from IE to Edge, so I'm sure that having IE as a dependency isn't desirable, regardless of how much it might be convenient for some.

@SteveL-MSFT,
Thanks for the explanation. It never occurred to me that IE would have been used as part of the library. Considering the evolution of PowerShell, I suppose it made sense in 3.0, when it was written specifically for Windows. I still think ConvertFrom-Html is sort of a hack (no offense intended). For maximum compatibility, it would be better to rewrite the library to parse the HTML in a manner similar to the IE library and return a compatible object type.

While -UseBasicParsing removes the one class defined by IE (the mshtml library), it also unnecessarily removes other objects that are readily available: Scripts, AllElements and Forms, to name a few. It seems it should be easier to override the ParsedHtml than to write a crutch app to append to the call.

@wyzerd This would mean porting IE, which is not realistic.

@wyzerd

it should be easier to override the ParsedHtml than to write a crutch app to append to the call.

This is an Open Source project. So if you feel this is a low level of effort, you are free to create a pull request to do so. 😃

@markekraus
To be fair, I didn't say low level. However, writing ConvertFrom-HTML is likely to be more complex than an override of the single element of the HtmlWebResponseObject. Saddest part about it is I don't even care about that single mshtml element. What I want is the scripts element that is completely compatible, but removed because of the implementation.

It is a good suggestion though and I really could use the time to sharpen my tools.
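
For readers who, like the comment above, only need the script blocks, one rough workaround is to pull them out of the raw Content string with a regular expression. This is a minimal sketch under stated assumptions, not the original Scripts implementation: Get-HtmlScriptBlock is a hypothetical helper name, the sample HTML is made up, and a regex is not a real HTML parser, so edge cases (for example a literal closing script tag inside a JavaScript string) will be mishandled.

```powershell
# Hypothetical helper (not part of PowerShell): extracts <script> elements
# from an HTML string using a regex. A rough workaround, not a real parser.
function Get-HtmlScriptBlock {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [string] $Html
    )
    process {
        # (?is): case-insensitive, and '.' also matches newlines.
        $pattern = '(?is)<script\b(?<attrs>[^>]*)>(?<body>.*?)</script\s*>'
        foreach ($m in [regex]::Matches($Html, $pattern)) {
            $attrs = $m.Groups['attrs'].Value
            # Pull a quoted src attribute out of the opening tag, if present.
            $src = $null
            if ($attrs -match '(?i)\bsrc\s*=\s*["'']([^"'']*)["'']') {
                $src = $Matches[1]
            }
            [pscustomobject]@{
                Src       = $src
                InnerText = $m.Groups['body'].Value.Trim()
            }
        }
    }
}

# Made-up sample input; in practice this would be (Invoke-WebRequest ...).Content.
$sampleHtml = @'
<html><head>
<script src="/static/app.js"></script>
<script type="text/javascript">
  var version = "1.2.3";
</script>
</head><body><p>Hello</p></body></html>
'@

$scripts = @(Get-HtmlScriptBlock -Html $sampleHtml)
$scripts | Format-List
```

Because the helper works on a plain string, the same call can be applied to a local file, e.g. by passing the output of Get-Content -Raw. For anything beyond simple extraction, a real parser such as AngleSharp (mentioned earlier in the thread) is the safer choice.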

However, writing ConvertFrom-HTML is likely to be more complex than an override of the single element of the HtmlWebResponseObject.

I'm not sure it is more complex. Besides, there are other needs for a ConvertFrom-HTML object besides Web Cmdlet results.

I also think separating HTML parsing from IWR is the right move even if IE was x-plat and we could still use the same underlying API on all supported systems. The default HTML parsing and the reliance on IE is one very common Windows PowerShell pain-point for using IWR in headless environments. (Because IE had to be configured before first use).

IMO, the design choices in the early IWR and IRM to break the single responsibility principle in favor of ease of use have hamstrung the cmdlets to this day. Decoupling these cmdlets from their dependencies and opening those dependencies to more general use is a good move, even if it is a breaking change to do so.

Hi,

Can I check if there is any progress on this out of interest? Will this be resolved/added into Powershell 7 and / or .Net Core 3? Perhaps I should raise the request with the .net Core team. This was a really handy feature.

Many thanks!
Steve

Since the new Microsoft Edge is based on Chromium, we could discover the Chromium engine on the system and utilize its API.

Ran into this today as well.

Name                           Value
----                           -----
PSVersion                      7.0.3
PSEdition                      Core
GitCommitId                    7.0.3
OS                             Microsoft Windows 10.0.19041
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

I think @kamome283's module, which uses AngleSharp, is a better approach than trying to communicate with an external process.
https://github.com/kamome283/AngleParse

It seems that the community has helped fill this gap with modules on the PowerShell Gallery that specifically handle parsing html.
