From https://github.com/whatwg/html/issues/1349#issuecomment-640734015, @chrishtr proposes spec edits to explicitly specify a preload scanner (as one part of a larger proposal on stylesheet loading behavior). But the proposal doesn't say which exact approach to standardize. I believe this is implemented in slightly different ways across engines:
- Gecko: https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading
- Chromium: https://github.com/whatwg/html/issues/5624#issuecomment-646949323
- WebKit: ?
Are there documents that explain how this is implemented in Chromium and WebKit?
Is there interest in aligning on the observable behavior here?
cc @whatwg/html-parser @othermaciej @hsivonen @richard-townsend-arm @lilles @mfreed7
This is also relevant for https://github.com/whatwg/fetch/issues/590 and `rel=preload`. I'm definitely in favor of getting this a bit more detailed, especially with how the fetches end up being constructed and how that affects future fetches, etc.
Here is a very high-level description of the Chromium preload scanner, from @nyaxt:
At a high level, there are two classes involved: `PreloadScanner` and `HTMLResourcePreloader`:

- `PreloadScanner` takes a decoded string stream from the `HTMLDocumentParser`, drives an `HTMLTokenizer` internally to generate the HTML token stream (tokenizer output), and emits `PreloadRequest`s.
- `HTMLResourcePreloader` consumes the `PreloadRequest`s and emits actual preload requests to the core/loader stack.

`PreloadScanner` consumes the token stream without actually constructing the DOM tree. It mostly only cares about start tag tokens (plus inline stylesheets), and checks whether there's an interesting URL to start preloading. Since "false-positive" preload requests (i.e. URLs that `PreloadScanner` noticed but that turn out not to actually be used) are very costly, the `PreloadScanner` logic has grown more and more complicated over time. For example, it simulates the media query match so that it won't request images of the wrong resolution.
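For illustration, here's a minimal sketch (with hypothetical file names) of the kind of markup the media query simulation described above is aimed at:

```html
<!-- A blocking external script keeps the real parser busy, so the
     preload scanner runs ahead over the markup below it. -->
<script src="app.js"></script>

<picture>
  <!-- The scanner simulates the media query match so that, on a
       narrow viewport, it preloads only small.jpg and avoids a
       false-positive fetch of large.jpg. -->
  <source media="(min-width: 800px)" srcset="large.jpg">
  <img src="small.jpg" alt="Example image">
</picture>
```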
`PreloadScanner` is invoked for:
+@richard-townsend-arm to comment on whether any of the above has changed with the new synchronous parser
Thanks! What is NoStatePrefetch?
I'm not sure it's a good idea to fully specify it. The preload scanner is a heuristic for faster page loading, and it needs to be possible to evolve it as implementations learn new things about performance, and possibly to evolve it in implementation-specific ways.
If observable behavior differences are causing real-world interop problems, then it may make sense to specify constraints on what can be done. Ideally, there would be no preloading that is mandatory, and as few cases as possible that are forbidden. It should be like the HTTP cache, where there are constraints but also enough flexibility to do innovative things like racing disk and network.
That said, I'm not the top expert on this area of WebKit. Tagging @cdumez as a person who may know more and @hober to make sure we follow up internally with relevant folks who are not on GitHub.
> Thanks! What is NoStatePrefetch?
https://developers.google.com/web/updates/2018/07/nostate-prefetch
> Is there interest in aligning on the observable behavior here?
My understanding is that Gecko's speculative parser is capable of looking further ahead than WebKit's that was inherited into Blink. I'm not interested in making Gecko's speculative parser less capable in order to match WebKit and Blink.
IIRC, the reported interop-sensitive problem has been rather niche and isolated: Trying to do server-side responsive images by setting resolution information in a cookie via JS and being unhappy that the browser already fetched the images whose URLs were in the HTML source. However, with a multiprocess architecture, it's increasingly problematic to give guarantees about when script-set cookies are reflected in the network requests that are made, so I think we shouldn't cater to tricks like this.
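For context, the trick in question looks roughly like this (a hedged sketch; the cookie name and URLs are made up):

```html
<script src="app.js"></script>
<script>
  // Intended to tell the server the device's resolution before any
  // images are requested, so it can serve appropriately sized files.
  document.cookie = "dpr=" + window.devicePixelRatio;
</script>
<!-- While app.js loads, the preload scanner may already fetch this
     image, so the request can go out before the dpr cookie is set. -->
<img src="/images/hero.jpg" alt="Hero">
```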
@othermaciej thanks for your thoughts. I agree that a full specification might not be desirable. I think the spec can still acknowledge that this optimization exists and specify some rules around how it should behave. This could serve web platform predictability and interoperability with regard to which URLs are speculatively prefetched.
@hsivonen thanks. Not wanting to regress on speculative parser capability is understandable. On the script-set cookie case, that's indeed something that doesn't work well together with the preload scanner optimization in particular.
For observable behavior, apart from the case @hsivonen mentioned, I'm aware of a few other cases that have been a source of complaints from web developers:
- `<script src>` with an unsupported `type` (gecko issue, chromium issue) — see the sketch after this list
- `<picture>` (webkit issue, another webkit issue, chromium issue)
- `<img>` when the response is not a valid image (gecko issue)

There are likely more issues; this was from a quick search.
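A minimal sketch of the unsupported-type case (hypothetical URLs): the "prepare a script" steps bail out when the `type` isn't a supported script type, so normal processing never fetches the resource, while a tokenizer-level scanner that ignores `type` fetches it speculatively.

```html
<script src="blocking.js"></script>
<!-- Normal processing never fetches template.js, because "prepare
     a script" aborts for unsupported script types. A preload
     scanner that ignores the type attribute fetches it anyway. -->
<script type="text/x-template" src="template.js"></script>
```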
To facilitate the discussion, I've written a few tentative tests today and a test generation script (so it's easy to add more tests): https://github.com/web-platform-tests/wpt/pull/24521
Test results: https://wpt.fyi/results/html/syntax/preload-scanner.tentative/page-load?sha=ed26ec1897&label=pr_head
Acknowledging that the optimization exists, and perhaps giving constraints that would resolve some specific differences in observable behavior, seems like probably a good idea.
I've written some more tests. Here are some things I learned:
- `img-src-referrerpolicy-no-referrer` fails because it includes the `Referer` header in the speculative fetch. (See the sketch after this list.)
- `picture-source-no-img` fails; it speculatively fetches even though there's no `img` element.
- `script-src-unsupported-type` fails; it speculatively fetches the script with `type=text/plain`.
- `picture-source-br-img` fails; it doesn't speculatively fetch when there's a `br` tag before the `img` in `picture`.
- `svg` or `math` tree builder rules for foreign content in the speculative parser: they tokenize with knowledge of some tags like `xmp` being RAWTEXT, but don't differentiate between HTML `script` and SVG `script`.
- `document.write()`: the `meta-charset-script-src` test fails; a meta charset declaration after a blocking script doesn't affect the encoding of the document.
- SVG `script` with `xlink:href` is speculatively fetched, but SVG `script` with `href` is not, even though `href` is normally supported.
- `img-srcset` and `picture-source-br-img` are flaky in Firefox for some reason. Other tests are not flaky as far as I can tell.

I've made the pass conditions assume that the correct behavior is to fetch (for a selective set of resource kinds) what would be fetched had the real HTML parser continued without the current script doing any "destructive" `document.write()`. This appears to be very close to what Firefox does.
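To make the referrerpolicy finding concrete, a minimal sketch (hypothetical URLs): the speculative fetch should construct its request the same way normal processing would, including the referrer policy.

```html
<script src="blocking.js"></script>
<!-- If the speculative fetch for photo.jpg includes a Referer
     header, it diverges from the request normal processing would
     make for this element. -->
<img src="photo.jpg" referrerpolicy="no-referrer" alt="">
```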
Latest test results: https://wpt.fyi/results/html/syntax/speculative-parsing/generated?label=pr_head&max-count=1&pr=24521
Now, I realize that not all of the cases I've tested have a basis of complaints from web developers. I wanted to find and highlight fundamental differences and possible bugs. What to specify is open for discussion.
I think it makes sense to specify the preloader, in order to make it easier to specify things that rely on it:
- The `media` attribute currently waits (in both Chromium and WebKit, IIRC) for the preloader to figure out the right viewport (from `<meta viewport>` tokens) before preloading the resource in question. Currently this is entirely unspecified. (See the sketch after this list.)
- While the preload scanner may not be a major point of interop pain, the differences in behavior are highly observable: through Resource Timing, Service Workers, and the servers fielding the actual requests.
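A sketch of the viewport dependency (hypothetical values): the preloader can't correctly evaluate the `media` attribute below until it knows whether a `<meta name=viewport>` token will change the viewport dimensions.

```html
<!-- The meta viewport token changes how the media query below
     evaluates on mobile, so the preloader waits for it before
     deciding whether to preload hero-small.jpg. -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="preload" as="image" href="hero-small.jpg"
      media="(max-width: 600px)">
```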
Different browsers potentially preload different things (e.g. import styles), in ways that may surprise developers. Defining a baseline may help reduce this divergence in behavior and reduce surprises.
I agree with @othermaciej and @hsivonen that we wouldn't want such a specification to limit future optimizations, but at this point, I feel this optimization is mature enough that we could specify its basics, and let user agents innovate on top of that.
Thanks @yoavweiss!
New info for me to consider and test:
- Link preload headers waiting for a meta viewport, potentially found by the speculative parser
- meta CSP can prevent speculative fetches that violate the CSP. (@yoavweiss mentioned this to me in IRC.)
> meta CSP can prevent speculative fetches that violate the CSP. (@yoavweiss mentioned this to me in IRC.)
I believe that to simplify the implementation, it prevents any speculative fetches, period (at least in Chromium). Might be good to see what other implementations are doing, and maybe do something smarter.
Here are two demos with meta CSP:
From what I can tell with the Network tab in devtools, in Safari and Firefox both cases speculatively fetch the image, ignoring the CSP meta directives. Chrome doesn't fetch the image in either case.
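For reference, a minimal sketch of this kind of demo (hypothetical URLs, not necessarily the exact demos referred to above):

```html
<meta http-equiv="Content-Security-Policy" content="img-src 'none'">
<script src="blocking.js"></script>
<!-- Normal processing blocks this fetch via the CSP. A speculative
     fetch that goes out anyway is observable by the server. -->
<img src="blocked.png" alt="">
```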
> - SVG `script` with `xlink:href` is speculatively fetched, but SVG `script` with `href` is not, even though `href` is normally supported.
Filed. Thanks! Also: the name of the image test suggests that it's testing `src` rather than `href`. The spec says `href`. Apparently the tests haven't landed yet, so I failed to locate the test source to check.
@hsivonen thank you!
The tests are here https://github.com/web-platform-tests/wpt/pull/24521
There are multiple image tests, both `svg-image-href` and `svg-image-src` (the latter expects no speculative fetch to happen).
I looked into specifying the integration points of speculative parsing in https://github.com/whatwg/html/pull/5959, and I have some questions.
Specifically, I'm a bit confused about how to specify speculative parsing for the `document.write()` string. From IRC:
I don't follow what the spec does for
`<script src=X></script>` where X is: `document.write("<script src=Y></script><img src=Z>");`
first there's https://html.spec.whatwg.org/multipage/parsing.html#scriptEndTag
"Prepare" will mark the script as the pending parser-blocking script, for the first script
"script nesting level" is zero, so the Otherwise steps are run
so far so good. This is where I want to kick off the (first) speculative parser
but then the script is loaded and executed (step 10), and the document.write will invoke the tree builder reentrantly, reaching this part again but with "script nesting level" being 1
where the spec says "Set the parser pause flag to true, and abort the processing of any nested invocations of the tokenizer, yielding control back to the caller. (Tokenization will resume when the caller returns to the "outer" tree construction stage.)"
and that's what I don't get. how is the Y script loaded and executed?
Where should I kick off a second speculative parser to find the `img` in the `document.write()` string?
Isn't it that once we're back in the "outer" tree construction stage, the parser reaches the `<script src=Y>` block, which then blocks the parser on loading and then executing Y?

I thought there was special handling for `document.write()` with the speculative parser.
In particular, the document write steps insert _input_ into the input stream, but if the speculative parser is already parsing, it will be beyond the insertion point and thus not parse _input_ without special handling.
> I thought there was special handling for `document.write()` with the speculative parser.
Are you speaking about the Chromium implementation?
> In particular, the document write steps insert _input_ into the input stream, but if the speculative parser is already parsing, it will be beyond the insertion point and thus not parse _input_ without special handling.
Agree that if the speculative parser has already gotten to the end of the input and then document.write happens, special handling will be needed. Is it any more complicated than something equivalent to "the speculative parser begins again at the new areas added to the input stream if it has completed already"?
> Agree that if the speculative parser has already gotten to the end of the input and _then_ document.write happens, special handling will be needed. Is it any more complicated than something equivalent to "the speculative parser begins again at the new areas added to the input stream if it has completed already"?
I think it’s more that the speculative parser might have to “rewind” back to the insertion point, since it might be past it.
@mfreed7 right. It will be past it as soon as it has speculatively parsed a single character.
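To illustrate the overshoot concretely (a sketch; file names are hypothetical):

```html
<!-- While x.js loads, the speculative parser scans ahead and
     tokenizes <img src="after.png">, moving past the insertion
     point (which sits right after the script element). -->
<script src="x.js"></script>
<img src="after.png">

<!-- If x.js then calls:
       document.write('<img src="written.png">');
     the written markup enters the input stream at the insertion
     point, i.e. behind the speculative parser's current position,
     so the speculative parser has to rewind (or restart) to see
     written.png. -->
```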
It's possible that Gecko has a different strategy with document.write compared to Chromium/WebKit. From https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading
> `nsHtml5Parser` has one tokenizer/tree builder pair for parsing data from `document.write()`. It also has another lazily initialized tokenizer/tree builder for speculatively scanning the tail of `document.write()` data when the parser blocks without parsing the `document.write()` data to completion.
There is now a draft specification for review in #5959. (Preview of the generated spec)
At a high level, the optimization itself is optional, but if it's implemented, it needs to speculatively parse HTML as if it were following the normal HTML parser rules, and only fetch resources that would be fetched by normal processing (if the blocking script does nothing). Which kinds of resources to fetch is implementation-defined.
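For instance, under that rule a speculative parser must not fetch anything from the following markup (a sketch with hypothetical URLs), because normal processing wouldn't either:

```html
<script src="blocking.js"></script>
<picture>
  <!-- With no <img> inside the <picture>, normal processing never
       fetches orphan.jpg, so a conforming speculative parser must
       not fetch it either. -->
  <source srcset="orphan.jpg">
</picture>
```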
We discussed this topic a few days ago in the WHATWG TPAC breakout session. Minutes here: https://www.w3.org/2020/10/26-whatwg-minutes.html#preload
Commentary inline:
zcorpan: I've been working on this. Tests demonstrate interop issues.
zcorpan: document.write() of external scripts causes performance problems, so the speculative parser tries to help. Gecko vs. Chromium/WebKit: Gecko more faithfully uses the real HTML parser to build a speculative tree; Chromium/WebKit only use the tokenizer, with some tree-building knowledge (e.g. they know about script elements and style elements, but don't know about the SVG namespace). This allows generating test cases that show interop differences. Don't have data on the web-developer pain these interop differences cause, but it'd be useful to reach interop. Question is: how much detail do we want in the spec, and which behavior do we want to specify? Also we need to speculatively parse the document.write() string if you're writing an external script.
hsivonen: have you found any cases where "the right thing" differs from "what Firefox does"?
zcorpan:
document.write("<meta charset>")plus a laterdocument.write("<meta charset>")which invalidates that does not work.
This bit about meta charset was both minuted inaccurately and affected by a mistake in my testing, which I've now fixed. There are a couple of interesting bits about meta charset in Gecko's speculative parser, though:
- …`document.write()`ing an encoding declaration.)
- In the generated test (/html/syntax/speculative-parsing/generated/page-load/meta-charset-script-src.tentative.html), Firefox does not speculatively fetch the script src after the encoding declaration. I asked @hsivonen about this in IRC, but he didn't have an answer to why this happens. (The "document-write" test does speculatively fetch the script.)

smfr: reiterate othermaciej's comments in the issue that this is just an optimization done by UAs, and maybe it's not necessary to fully specify it. If there are interop issues then maybe the spec should say enough to address those, but we're not convinced that it needs to be fully specified.
emilio: if we know of authors relying on particular bits, it'd be useful to document them; it's not clear whether it needs to be normative or formal.
zcorpan: so like a bullet point set of requirements or expectations
hsivonen: my main concern is not speccing things so that the less-accurate version is correct and Firefox's more-accurate version is incorrect. So either spec the correct thing or leave it hand-wavey enough that the correct version is allowed.
zcorpan: Another "Firefox is not totally correct" issue: it doesn't respect CSP. In Chromium the presence of a CSP meta tag entirely disables the speculative parser. (WebKit aligns with Firefox and ignores CSP.)
emilio: that seems like a bug we could probably fix. Is there a bug on file?
Henri filed https://bugzilla.mozilla.org/show_bug.cgi?id=1673407
zcorpan: I haven't filed one. I do envision listing this as a requirement in the spec and filing a bug.
zcorpan: subtopic: the speculative cache. Ideally the spec would say something about that.
emilio: Gecko reuses the existing caches: the per-document stylesheet cache and the image cache.
zcorpan: annevk brought this up in #5624.
(some discussion about stylesheet cache, subtleties, how it doesn't exist in the spec)
hsivonen: we have a hash table of URLs that have been speculatively fetched, so e.g. `<img>` and `<script>` pointing to the same URL won't speculatively fetch the same URL twice.
I haven't specified this hash table.
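A sketch of the observable effect (hypothetical URLs): with such a hash table, the two references below produce a single speculative fetch rather than two.

```html
<script src="blocking.js"></script>
<!-- Both elements reference the same URL; a hash table of already
     speculatively fetched URLs means shared.bin is requested
     speculatively only once. -->
<img src="shared.bin" alt="">
<script src="shared.bin"></script>
```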