From https://github.com/whatwg/html/issues/1349#issuecomment-640734015, @chrishtr proposes spec edits to explicitly specify a preload scanner (as one part of a larger proposal on stylesheet loading behavior). But the proposal doesn't say which exact approach to standardize. I believe this is implemented in slightly different ways across engines:
- Gecko: https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading
- Chromium: https://github.com/whatwg/html/issues/5624#issuecomment-646949323
- WebKit: ?
Are there documents that explain how this is implemented in Chromium and WebKit?
Is there interest in aligning on the observable behavior here?
cc @whatwg/html-parser @othermaciej @hsivonen @richard-townsend-arm @lilles @mfreed7
This is also relevant for https://github.com/whatwg/fetch/issues/590 and `rel=preload`. I'm definitely in favor of getting this a bit more detailed, especially with how the fetches end up being constructed and how that affects future fetches, etc.
Here is a very high-level description of the Chromium preload scanner, from @nyaxt:
At a high level, there are two classes involved: `PreloadScanner` and `HTMLResourcePreloader`:

- `PreloadScanner` takes a decoded string stream from the `HTMLDocumentParser`, drives an `HTMLTokenizer` internally to generate the HTML token stream (tokenizer output), and emits `PreloadRequest`s.
- `HTMLResourcePreloader` consumes the `PreloadRequest`s and emits actual preload requests to the core/loader stack.

`PreloadScanner` consumes the token stream without actually constructing the DOM tree. It mostly only cares about start tag tokens (plus inline stylesheets), and checks whether there's an interesting URL to start preloading. Since "false-positive" preload requests (i.e. URLs that `PreloadScanner` noticed but that turn out not to actually be used) are very costly, the `PreloadScanner` logic has grown more and more complicated over time. For example, it simulates the media query match so that it won't request images of the wrong resolution.
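For illustration, here's a minimal sketch (with hypothetical file names) of the kind of markup the media query simulation described above is aimed at:

```html
<!-- A blocking external script keeps the real parser busy, so the
     preload scanner runs ahead over the markup below it. -->
<script src="app.js"></script>

<picture>
  <!-- The scanner simulates the media query match so that, on a
       narrow viewport, it preloads only small.jpg and avoids a
       false-positive fetch of large.jpg. -->
  <source media="(min-width: 800px)" srcset="large.jpg">
  <img src="small.jpg" alt="Example image">
</picture>
```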
`PreloadScanner` is invoked for:
+@richard-townsend-arm to comment on whether any of the above has changed with the new synchronous parser
Thanks! What is NoStatePrefetch?
I'm not sure it's a good idea to fully specify it. The preload scanner is a heuristic for faster page loading, and it needs to be possible to evolve it as implementations learn new things about performance, and possibly to evolve it in implementation-specific ways.
If observable behavior differences are causing real-world interop problems, then it may make sense to specify constraints on what can be done. Ideally, there would be no preloading that is mandatory, and as few cases as possible that are forbidden. It should be like the HTTP cache, where there are constraints but also enough flexibility to do innovative things like racing disk and network.
That said, I'm not the top expert on this area of WebKit. Tagging @cdumez as a person who may know more and @hober to make sure we follow up internally with relevant folks who are not on GitHub.
> Thanks! What is NoStatePrefetch?
https://developers.google.com/web/updates/2018/07/nostate-prefetch
> Is there interest in aligning on the observable behavior here?
My understanding is that Gecko's speculative parser is capable of looking further ahead than WebKit's that was inherited into Blink. I'm not interested in making Gecko's speculative parser less capable in order to match WebKit and Blink.
IIRC, the reported interop-sensitive problem has been rather niche and isolated: Trying to do server-side responsive images by setting resolution information in a cookie via JS and being unhappy that the browser already fetched the images whose URLs were in the HTML source. However, with a multiprocess architecture, it's increasingly problematic to give guarantees about when script-set cookies are reflected in the network requests that are made, so I think we shouldn't cater to tricks like this.
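For context, the trick in question looks roughly like this (a hedged sketch; the cookie name and URLs are made up):

```html
<script src="app.js"></script>
<script>
  // Intended to tell the server the device's resolution before any
  // images are requested, so it can serve appropriately sized files.
  document.cookie = "dpr=" + window.devicePixelRatio;
</script>
<!-- While app.js loads, the preload scanner may already fetch this
     image, so the request can go out before the dpr cookie is set. -->
<img src="/images/hero.jpg" alt="Hero">
```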
@othermaciej thanks for your thoughts. I agree that a full specification might not be desirable. I think the spec can still acknowledge that this optimization exists and specify some rules around how it should behave. This could serve web platform predictability and interoperability with regard to which URLs are speculatively prefetched.
@hsivonen thanks. Not wanting to regress on speculative parser capability is understandable. On the script-set cookie case, that's indeed something that doesn't work well together with the preload scanner optimization in particular.
For observable behavior, apart from the case @hsivonen mentioned, I'm aware of a few other cases that have been a source of complaints from web developers:
- `<script src>` with an unsupported `type` (gecko issue, chromium issue) — see the sketch after this list
- `<picture>` (webkit issue, another webkit issue, chromium issue)
- `<img>` when the response is not a valid image (gecko issue)

There are likely more issues; this was from a quick search.
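A minimal sketch of the unsupported-type case (hypothetical URLs): the "prepare a script" steps bail out when the `type` isn't a supported script type, so normal processing never fetches the resource, while a tokenizer-level scanner that ignores `type` fetches it speculatively.

```html
<script src="blocking.js"></script>
<!-- Normal processing never fetches template.js, because "prepare
     a script" aborts for unsupported script types. A preload
     scanner that ignores the type attribute fetches it anyway. -->
<script type="text/x-template" src="template.js"></script>
```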
To facilitate the discussion, I've written a few tentative tests today and a test generation script (so it's easy to add more tests): https://github.com/web-platform-tests/wpt/pull/24521
Test results: https://wpt.fyi/results/html/syntax/preload-scanner.tentative/page-load?sha=ed26ec1897&label=pr_head
Acknowledging that the optimization exists, and perhaps giving constraints that would resolve some specific differences in observable behavior, seems like probably a good idea.
I've written some more tests. Here are some things I learned:
- `img-src-referrerpolicy-no-referrer` fails because it includes the `Referer` header in the speculative fetch. (See the sketch after this list.)
- `picture-source-no-img` fails; it speculatively fetches even though there's no `img` element.
- `script-src-unsupported-type` fails; it speculatively fetches the script with `type=text/plain`.
- `picture-source-br-img` fails; it doesn't speculatively fetch when there's a `br` tag before the `img` in `picture`.
- `svg` or `math` tree builder rules for foreign content in the speculative parser: they tokenize with knowledge of some tags like `xmp` being RAWTEXT, but don't differentiate between HTML `script` and SVG `script`.
- `document.write()`: the `meta-charset-script-src` test fails; a meta charset declaration after a blocking script doesn't affect the encoding of the document.
- SVG `script` with `xlink:href` is speculatively fetched, but SVG `script` with `href` is not, even though `href` is normally supported.
- `img-srcset` and `picture-source-br-img` are flaky in Firefox for some reason. Other tests are not flaky as far as I can tell.

I've made the pass conditions assume that the correct behavior is to fetch (for a selective set of resource kinds) what would be fetched had the real HTML parser continued without the current script doing any "destructive" `document.write()`. This appears to be very close to what Firefox does.
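To make the referrerpolicy finding concrete, a minimal sketch (hypothetical URLs): the speculative fetch should construct its request the same way normal processing would, including the referrer policy.

```html
<script src="blocking.js"></script>
<!-- If the speculative fetch for photo.jpg includes a Referer
     header, it diverges from the request normal processing would
     make for this element. -->
<img src="photo.jpg" referrerpolicy="no-referrer" alt="">
```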
Latest test results: https://wpt.fyi/results/html/syntax/speculative-parsing/generated?label=pr_head&max-count=1&pr=24521
Now, I realize that not all of the cases I've tested have a basis of complaints from web developers. I wanted to find and highlight fundamental differences and possible bugs. What to specify is open for discussion.
I think it makes sense to specify the preloader, in order to make it easier to specify things that rely on it:
- The `media` attribute currently waits (in both Chromium and WebKit, IIRC) for the preloader to figure out the right viewport (from `<meta viewport>` tokens) before preloading the resource in question. Currently this is entirely unspecified. (See the sketch after this list.)
- While the preload scanner may not be a major point of interop pain, the differences in behavior are highly observable: through Resource Timing, Service Workers, and the servers fielding the actual requests.
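A sketch of the viewport dependency (hypothetical values): the preloader can't correctly evaluate the `media` attribute below until it knows whether a `<meta name=viewport>` token will change the viewport dimensions.

```html
<!-- The meta viewport token changes how the media query below
     evaluates on mobile, so the preloader waits for it before
     deciding whether to preload hero-small.jpg. -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="preload" as="image" href="hero-small.jpg"
      media="(max-width: 600px)">
```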
Different browsers potentially preload different things (e.g. import styles), in ways that may surprise developers. Defining a baseline may help reduce this divergence in behavior and reduce surprises.
I agree with @othermaciej and @hsivonen that we wouldn't want such a specification to limit future optimizations, but at this point, I feel this optimization is mature enough that we could specify its basics, and let user agents innovate on top of that.
Thanks @yoavweiss!
New info for me to consider and test:
- Link preload headers waiting for a meta viewport, potentially found by the speculative parser
- meta CSP can prevent speculative fetches that violate the CSP. (@yoavweiss mentioned this to me in IRC.)
> meta CSP can prevent speculative fetches that violate the CSP. (@yoavweiss mentioned this to me in IRC.)
I believe that to simplify the implementation, it prevents any speculative fetches, period (at least in Chromium). Might be good to see what other implementations are doing, and maybe do something smarter.
Here are two demos with meta CSP:
From what I can tell with the Network tab in devtools, in Safari and Firefox both cases speculatively fetch the image, ignoring the CSP meta directives. Chrome doesn't fetch the image in either case.
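For reference, a minimal sketch of this kind of demo (hypothetical URLs, not necessarily the exact demos referred to above):

```html
<meta http-equiv="Content-Security-Policy" content="img-src 'none'">
<script src="blocking.js"></script>
<!-- Normal processing blocks this fetch via the CSP. A speculative
     fetch that goes out anyway is observable by the server. -->
<img src="blocked.png" alt="">
```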
> - SVG `script` with `xlink:href` is speculatively fetched, but SVG `script` with `href` is not, even though `href` is normally supported.
Filed. Thanks! Also: the name of the image test suggests that it's testing `src` rather than `href`. The spec says `href`. Apparently the tests haven't landed yet, so I failed to locate the test source to check.
@hsivonen thank you!
The tests are here https://github.com/web-platform-tests/wpt/pull/24521
There are multiple image tests, both `svg-image-href` and `svg-image-src` (the latter expects no speculative fetch to happen).
I looked into specifying the integration points of speculative parsing in https://github.com/whatwg/html/pull/5959, and I have some questions.
Specifically, I'm a bit confused about how to specify speculative parsing for the `document.write()` string. From IRC:
I don't follow what the spec does for
`<script src=X></script>` where X is: `document.write("<script src=Y></script><img src=Z>");`
first there's https://html.spec.whatwg.org/multipage/parsing.html#scriptEndTag
"Prepare" will mark the script as the pending parser-blocking script, for the first script
"script nesting level" is zero, so the Otherwise steps are run
so far so good. This is where I want to kick off the (first) speculative parser
but then the script is loaded and executed (step 10), and the document.write will invoke the tree builder reentrantly, reaching this part again but with "script nesting level" being 1
where the spec says "Set the parser pause flag to true, and abort the processing of any nested invocations of the tokenizer, yielding control back to the caller. (Tokenization will resume when the caller returns to the "outer" tree construction stage.)"
and that's what I don't get. how is the Y script loaded and executed?
Where should I kick off a second speculative parser to find the `img` in the `document.write()` string?
Isn't it that once we're back in the "outer" tree construction stage, the parser reaches the `<script src=Y>` block, which then blocks the parser on loading and then executing Y?

I thought there was special handling for `document.write()` with the speculative parser.
In particular, the document write steps insert _input_ into the input stream, but if the speculative parser is already parsing, it will be beyond the insertion point and thus not parse _input_ without special handling.
> I thought there was special handling for `document.write()` with the speculative parser.
Are you speaking about the Chromium implementation?
> In particular, the document write steps insert _input_ into the input stream, but if the speculative parser is already parsing, it will be beyond the insertion point and thus not parse _input_ without special handling.
Agree that if the speculative parser has already gotten to the end of the input and then document.write happens, special handling will be needed. Is it any more complicated than something equivalent to "the speculative parser begins again at the new areas added to the input stream if it has completed already"?
> Agree that if the speculative parser has already gotten to the end of the input and _then_ document.write happens, special handling will be needed. Is it any more complicated than something equivalent to "the speculative parser begins again at the new areas added to the input stream if it has completed already"?
I think it’s more that the speculative parser might have to “rewind” back to the insertion point, since it might be past it.
@mfreed7 right. It will be past it as soon as it has speculatively parsed a single character.
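To illustrate the overshoot concretely (a sketch; file names are hypothetical):

```html
<!-- While x.js loads, the speculative parser scans ahead and
     tokenizes <img src="after.png">, moving past the insertion
     point (which sits right after the script element). -->
<script src="x.js"></script>
<img src="after.png">

<!-- If x.js then calls:
       document.write('<img src="written.png">');
     the written markup enters the input stream at the insertion
     point, i.e. behind the speculative parser's current position,
     so the speculative parser has to rewind (or restart) to see
     written.png. -->
```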
It's possible that Gecko has a different strategy with document.write compared to Chromium/WebKit. From https://developer.mozilla.org/en-US/docs/Mozilla/Gecko/HTML_parser_threading
> `nsHtml5Parser` has one tokenizer/tree builder pair for parsing data from `document.write()`. It also has another lazily initialized tokenizer/tree builder for speculatively scanning the tail of `document.write()` data when the parser blocks without parsing the `document.write()` data to completion.
There is now a draft specification for review in #5959. (Preview of the generated spec)
At a high level, the optimization itself is optional, but if it's implemented, it needs to speculatively parse HTML as if it were following the normal HTML parser rules, and only fetch resources that would be fetched by normal processing (if the blocking script does nothing). Which kinds of resources to fetch is implementation-defined.
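For instance, under that rule a speculative parser must not fetch anything from the following markup (a sketch with hypothetical URLs), because normal processing wouldn't either:

```html
<script src="blocking.js"></script>
<picture>
  <!-- With no <img> inside the <picture>, normal processing never
       fetches orphan.jpg, so a conforming speculative parser must
       not fetch it either. -->
  <source srcset="orphan.jpg">
</picture>
```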
We discussed this topic a few days ago in the WHATWG TPAC breakout session. Minutes here: https://www.w3.org/2020/10/26-whatwg-minutes.html#preload
Commentary inline:
zcorpan: I've been working on this. Tests demonstrate interop issues.
zcorpan: document.write() of external scripts causes performance problems, so the speculative parser tries to help. Gecko vs. Chromium/WebKit: Gecko more faithfully uses the real HTML parser to build a speculative tree; Chromium/WebKit only use the tokenizer, with some tree-building knowledge (e.g. they know about script elements and style elements, but don't know about the SVG namespace). This allows generating test cases that show interop differences. Don't have data on the web-developer pain these interop differences cause, but it'd be useful to reach interop. Question is: how much detail do we want in the spec, and which behavior do we want to specify? Also we need to speculatively parse the document.write() string if you're writing an external script.
hsivonen: have you found any cases where "the right thing" differs from "what Firefox does"?
zcorpan:
document.write("<meta charset>")plus a laterdocument.write("<meta charset>")which invalidates that does not work.
This bit about meta charset was both minuted inaccurately and affected by a mistake in my testing, which I've now fixed. There are a couple of interesting bits about meta charset in Gecko's speculative parser, though:
- …`document.write()`ing an encoding declaration.)
- In the generated test (/html/syntax/speculative-parsing/generated/page-load/meta-charset-script-src.tentative.html), Firefox does not speculatively fetch the script src after the encoding declaration. I asked @hsivonen about this in IRC, but he didn't have an answer to why this happens. (The "document-write" test does speculatively fetch the script.)

smfr: reiterate othermaciej's comments in the issue that this is just an optimization done by UAs, and maybe it's not necessary to fully specify it. If there are interop issues then maybe the spec should say enough to address those, but we're not convinced that it needs to be fully specified.
emilio: if we know of authors relying on particular bits, it'd be useful to document them; it's not clear whether it needs to be normative or formal.
zcorpan: so like a bullet point set of requirements or expectations
hsivonen: my main concern is not speccing things so that the less-accurate version is correct and Firefox's more-accurate version is incorrect. So either spec the correct thing or leave it hand-wavey enough that the correct version is allowed.
zcorpan: Another "Firefox is not totally correct" issue: it doesn't respect CSP. In Chromium the presence of a CSP meta tag entirely disables the speculative parser. (WebKit aligns with Firefox and ignores CSP.)
emilio: that seems like a bug we could probably fix. Is there a bug on file?
Henri filed https://bugzilla.mozilla.org/show_bug.cgi?id=1673407
zcorpan: I haven't filed one. I do envision listing this as a requirement in the spec and filing a bug.
zcorpan: subtopic: the speculative cache. Ideally the spec would say something about that.
emilio: Gecko reuses the existing caches: the per-document stylesheet cache and the image cache.
zcorpan: annevk brought this up in #5624.
(some discussion about stylesheet cache, subtleties, how it doesn't exist in the spec)
hsivonen: we have a hash table of URLs that have been speculatively fetched, so e.g. `<img>` and `<script>` pointing to the same URL won't speculatively fetch the same URL twice.
I haven't specified this hash table.
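A sketch of the observable effect (hypothetical URLs): with such a hash table, the two references below produce a single speculative fetch rather than two.

```html
<script src="blocking.js"></script>
<!-- Both elements reference the same URL; a hash table of already
     speculatively fetched URLs means shared.bin is requested
     speculatively only once. -->
<img src="shared.bin" alt="">
<script src="shared.bin"></script>
```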