Html: Add a "modern" parsing API

Created on 1 Sep 2017 · 102 comments · Source: whatwg/html

TL;DR HTML should provide an API for parsing. Why? "Textual" HTML is a widely used syntax. HTML parsing is complex enough to want to use the browser's parser, plus browsers can do implementation tricks with how they create elements, etc.

Unfortunately, the way the HTML parser is exposed in the web platform is a hodge-podge. Streaming parsing is only available for main document loads; other things rely on strings, which puts pressure on memory. innerHTML is synchronous and could cause jank for large documents (although I would like to see data on this, because it is pretty fast.)

Here are some strawman requirements:

  • Should work with streams, and probably strings.
  • It should be asynchronous. HTML parsing is fast, but if you want to handle megabytes of data on phones while animating something, you probably can't do it synchronously.

Commentary:

One big question is when this API exposes the tree it is operating on. Main document parsing does expose the tree and handles mutations to it pretty happily; innerHTML parsing does not until the nodes are adopted into the target node's document (which may start running custom element stuff.)

One minor question is what to do with errors.

Being asynchronous has implications for documents and/or custom elements. If you allow creating stuff in the main document, then you have to run the custom element constructors sometime, so to avoid jank you probably can't run them all at once. This is probably a feature worth addressing.

See also:

Issue 2827



All 102 comments

cc @jakearchibald @whatwg/html-parser

One big question is when this API exposes the tree it is operating on.

I'd like this API to support progressive rendering, so I guess my preference is "as soon as possible".

const streamingFragment = document.createStreamingFragment();

const response = await fetch(url);
response.body
  .pipeThrough(new TextDecoderStream())
  .pipeTo(streamingFragment.writable);

document.body.append(streamingFragment);

I'd like the above to progressively render. The parsing would follow the "in template" insertion mode, although we may want options to handle other cases, like SVG.

One minor question is what to do with errors

What kinds of errors?

There are a few libraries that use tagged template literals to build HTML, I think their code would be simpler if they knew what state the parser was in at a given point. This might be an opportunity.

Eg:

const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;

These libraries allow someContent to be text, an element, a promise for text/element. someImgSrc would be text in this case, but may be a function if it's assigning to an event listener. Right now these libraries insert a UID, then crawl the created elements for those UIDs so they can perform the interpolation.
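For concreteness, the string-level half of that UID trick can be sketched like this (a minimal illustration, not any particular library's code; `markup` and the marker format are made up):

```javascript
// String-level sketch of the UID hack: interpolation points become
// unique comment markers that can be located after parsing.
// `markup` and the marker format are hypothetical.
const UID = `uid-${Date.now().toString(36)}`;

function markup(strings, ...values) {
  // Join the static parts with the marker; the real values are kept
  // aside and re-attached once the markers are found in the tree.
  return { html: strings.join(`<!--${UID}-->`), values };
}

const { html, values } = markup`<p>${'someContent'}</p>`;
// html looks like '<p><!--uid-...--></p>'; values is ['someContent']
```

A library would then parse `html` (e.g. via a `<template>`), crawl the result for comments whose text matches the UID, and swap in `values`.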

I wonder if something like streamingFragment could provide enough details to avoid the UID hack.

const streamingFragment = document.createStreamingFragment();
const writer = streamingFragment.writable.getWriter();

await writer.write('<p>');
let parserState = await streamingFragment.getParserState();
parserState.currentNode; // paragraph

await writer.write('</p><img src=');
parserState = await streamingFragment.getParserState();

…I guess this last bit is more complicated, but ideally it should know it's in the "before attribute value" state for "src" within tag "img". Ideally there should be a way to get the resulting attribute & element as a promise.

+@justinfagnani @webreflection

@dominiccooney HTML can have conformance errors, but there are recovery mechanisms for all of them, and user agents don't bail out on errors. So any input can be consumed by the HTML parser without a problem.

I like @jakearchibald's API. However, I wonder if we need to support a full-document streaming parser and what the API would look like for it. Also, in the streaming fragment approach, will it be possible to perform consecutive writes to the fragment (e.g. pipe one response to the fragment and afterwards another one)? If so, how will it behave: overwrite the content of the fragment or insert at the end of the fragment?

@jakearchibald

I think their code would be simpler if they knew what state the parser was in at a given point.

What do you mean by state here? Parser insertion mode, tokeniser state or something else?

@inikulin

I wonder if we need to support full document streaming parser

Hmm yeah. I'm not sure what the best pattern is to use for that.

will it be possible to perform consequent writes to fragment (e.g. pipe one response to fragment and afterwards another one). If so, how it will behave: overwrite content of fragment or insert it in the end of fragment?

Yeah, you can do this with streams. Either with individual writes, or piping with {preventClose: true}. This will follow the same rules as if you mess with elements' content during initial page load.

As in, if the parser eats:

<p>Hello

…then you:

document.querySelector('p').append(', how are you today?');

…you get:

<p>Hello, how are you today?

…if the parser then receives " everyone", I believe you get:

<p>Hello everyone, how are you today?

…as the parser has a pointer to the first text node of the paragraph.

@jakearchibald There is a problem with this approach. Consider we have two streams: one writes <div>Hey and the other one ya. Usually, when the parser encounters the end of a stream it finalises the AST, so the result of feeding the first stream to the parser will be <div>Hey</div> (the parser will emit an implied end tag here). So, when the second stream writes ya you'll get <div>Hey</div>ya as a result, which is pretty much the same as creating a second fragment and appending it to the first one. On the other hand, we could have an API that explicitly tells the parser to treat the second stream as a continuation of the first one.

Thanks @jakearchibald for thinking of us.

I can speak from my 6+ months on the template literals vs DOM pattern, so that maybe you can have as much info as possible about implementations/proposals/APIs etc.

I'll try to split this post in topics.


Not just a UID

I am not using just a UID, I'm using a comment that contains some UID.

// dumb example: join the static chunks with a comment placeholder
// (the interpolated values are discarded in this minimal sketch)
function tag(statics, ...interpolations) {
  const out = [statics[0]];
  for (let i = 1; i < statics.length; i++)
    out.push('<!-- MY UID -->', statics[i]);
  return out.join('');
}

tag`<p>a ${'b'} c</p>`;
// → '<p>a <!-- MY UID --> c</p>'

This gives me the ability to let the HTML parser split the text content into chunks for me: if the nodeType of the <p>'s childNodes[x] is Node.COMMENT_NODE and its textContent is my UID, I'm fine.

The reason I'm using comments, besides letting the browser do the splitting job for me, is that browsers that don't natively support HTMLTemplateElement will discard partial table, col, or option layouts, but they won't discard comments.

var brokenWorkAround = document.createElement('div');
brokenWorkAround.innerHTML = '<td>goodbye TD</td>';
brokenWorkAround.childNodes; // [#text]: the <td> tags were dropped
brokenWorkAround.outerHTML;  // '<div>goodbye TD</div>'

You can read about this issue throughout the webcomponents template polyfill issues:
https://github.com/webcomponents/template/issues

In summary, if every browser natively supported the template element, and the fact that it doesn't ignore any kind of node, the only thing parsers like mine would need is a way to know when the HTML engine encounters a "_special node_", in my case represented by a comment with special content.

Right now we all need to traverse the whole tree after creating it, in search of special placeholders.

This is fast enough as a one-off operation, and thank gosh template literals are unique so it's easy to perform the traversal only once, but it wouldn't scale on huge documents, especially now that I've learned that for browsers, due to legacy, simply checking nodeType is a hell of a performance nightmare!


Attributes are "doomed"

Now that I've explained the basics for the content, let's talk about attributes.

If you inject a comment as an attribute value with no quotes around it, the layout is destroyed.

<nope nopity=<!-- nope -->>nayh</nope>

So, for attributes, having a similar mechanism to define a unique entity/value to be notified about would be ace! Right now the content is injected sanitized upfront. It works darn well, but it's not ideal as a solution.

More on attributes

If you put a placeholder in attributes you have the following possible issues:

  • IE / Edge might throw random errors and break if the attribute is, for example, style, and the content does not contain colons (even if it's invalid): _some: uid;_ works, _shena-nigans_ wouldn't.
  • Some _not-so-smart_ browsers throw errors with invalid attributes. For example, <img src=uid> would throw an error about the resource without even bothering the network layer (which is smarter). This is Firefox.
  • Some nodes will log errors at first parse, without failing though (thank gosh). These are SVG nodes. If you have <rect x=uid y=uid />, before you set the right values it will show an error that x or y was not valid.

HTML is very forgiving in many parts; attributes are quite the opposite in various scenarios.

In summary, if some mechanism told the browser that any attribute with such special content should be ignored, all these problems would disappear.


Backward compatibility

As much as I'd love to have help from the platform itself regarding the template literals pattern, I'm afraid it won't ever land in production until all browsers out there support it (or there is a reliable polyfill for it).

That means that exposing the internal HTML parser through a new API can surely benefit future projects, but it's unlikely to land in all browsers within 5+ years.

This last point is just my consideration about effort / results ratio.

Thanks again for helping out regardless.

@inikulin

There is a problem with this approach

I don't think it's a problem. If you use {preventClose: true}, it doesn't encounter "end of stream". So:

await textStream1.pipeTo(streamingFragment.writable, { preventClose: true });
await textStream2.pipeTo(streamingFragment.writable);

The streaming fragment would consume the streams as if there were a single stream concatenated.

await textStream3.pipeTo(streamingFragment.writable);

The above would fail, as the writable has now closed.
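The `preventClose` behaviour itself is plain web streams and can be demonstrated today without any parser. The sketch below (assuming a runtime with `ReadableStream`/`WritableStream` globals, e.g. modern browsers or Node 18+) shows two sources feeding one writable as a single concatenated stream:

```javascript
// Two readable streams piped into one writable: with preventClose on
// the first pipe, the writable stays open and simply sees the chunks
// of both streams back to back.
function sourceFrom(...chunks) {
  return new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    }
  });
}

let received = '';
const sink = new WritableStream({
  write(chunk) { received += chunk; }
});

async function demo() {
  await sourceFrom('<div>Hey').pipeTo(sink, { preventClose: true });
  await sourceFrom(' everyone</div>').pipeTo(sink);
  return received; // '<div>Hey everyone</div>'
}
```

A third `pipeTo(sink)` after `demo()` would reject, since the writable closed when the second pipe completed.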

P.S. just in case my wishes come true ... what both me and (most likely) Justin would love to have natively exposed is a document.queryRawContent(UID) that would return, in linear order, attributes with such a value, or comment nodes with such a value.

<html lang=UID>
<body> Hello <!--UID-->! <p class=UID></p></body>

The JS counterpart would be:

const result = document.queryRawContent(UID);
// result would be, in linear order:
// [
//   the html lang attribute,
//   the comment childNodes[1] of the body,
//   the p class attribute
// ]

Now that, in core, would make my parser a no-brainer (besides the issue with comments in attributes, but RegExps upfront are very good at that and blazing fast).

[edit] even while streaming it would work; actually it'd be even better, since it's one pass for the browser.

Also, since I know that for many people code is better than a thousand words, this is the TL;DR version of what hyperHTML does.

function tag(statics, ...interpolations) {
  if (this.statics !== statics) {
    this.statics = statics;
    this.updates = parse.call(this, statics, '<!--WUT-->');
  }
  this.updates(interpolations);
}

function parse(statics, lookFor) {
  const updates = [];
  this.innerHTML = statics.join(lookFor);
  traverse(this, updates, lookFor);
  const update = (value, i) => updates[i](value);
  return interpolations => interpolations.forEach(update);
}

function traverse(node, updates, lookFor) {
  switch (node.nodeType) {
    case Node.ELEMENT_NODE:
      Array.prototype.forEach.call(node.attributes, attr => {
        if (attr.value === lookFor)
          updates.push(v => attr.value = v);
      });
      Array.prototype.forEach.call(node.childNodes,
        child => traverse(child, updates, lookFor));
      break;
    case Node.COMMENT_NODE:
      if (`<!--${node.textContent}-->` === lookFor) {
        const text = node.ownerDocument.createTextNode('');
        node.parentNode.replaceChild(text, node);
        updates.push(value => text.textContent = value);
      }
  }
}

const body = tag.bind(document.body);

setInterval(() => {
  body`
  <div class="${'my-class'}">
    <p> It's ${(new Date).toLocaleTimeString()} </p>
  </div>`;
}, 1000);

The slow path is the traverse function; the _not-so-cool_ part is the innerHTML injection (as a regular node, template, or whatever it is) without having the ability to intercept, while parsing the string, all placeholders / attributes and address them accordingly.

OK, I'll let you discuss the rest now :smile:

@WebReflection

I think the UID scanner you're talking about might not be necessary. Consider:

const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;

Where whatever could do something like this:

async function whatever(strings, ...values) {
  const streamingFragment = document.createStreamingFragment();
  const writer = streamingFragment.writable.getWriter();

  for (const str of strings) {
    // str is:
    // <p>
    // </p> <img src=
    // >
    // (with extra whitespace of course)
    await writer.write(str);
    let parserState = await streamingFragment.getParserState();

    if (parserState.tokenState == 'data') {
      // This is the case for <p>, and >
      await writer.write('<!-- -->');
      parserState.currentTarget.lastChild; // this is the comment you just created.
      // Swap it out for the interpolated value
    }
    else if (parserState.tokenState.includes('attr-value')) {
      // await the creation of this attr node
      parserState.attrNode.then(attr => {
        // Add the interpolated value, or remove it and add an event listener instead etc etc.
      });
    }
  }
}

Yes, that might work. As long as these scenarios are allowed:

const fragment = whatever`
  <ul>${...}</ul>
  ${...}
  <p data-a=${....} onclick=${....}>also ${...} and</p>
  <img a=${...} b=${...} src=${someImgSrc}>
  <table><tr>${...}</tr></table>
`;

which looks like it'd be the case.

@WebReflection Interpolation should be allowed anywhere.

whatever`
  <${'img'} src="hi">
`;

In the above case tokenState would be "tag-open" or similar. At this point you could either throw a helpful error, or just pass the interpolated value through.

@jakearchibald Do you expect tokenState to be one of the tokeniser states defined in https://html.spec.whatwg.org/multipage/parsing.html#tokenization? If so, I'm afraid we can't do that: they are part of the parser's intrinsics and are subject to change. Moreover, some of them can be meaningless to a user.

@inikulin yeah, that's what I was hoping to expose, or something equivalent. Why can't we expose it?

@jakearchibald

what about the following ?

whatever`
  <${'button'} ${'disabled'}>
`;

I actually don't mind having that possible, because boolean attributes need boolean values, so ${obj.disabled ? 'disabled' : ''} doesn't look like a great option to me; but I'd be curious to know if "attribute-name" would be exposed too.

Anyway, having my example covered would be already awesome.

@WebReflection The tokeniser calls that the "Before attribute name state", so if we could expose that, it'd be possible.

Not sure if this is just extra noise or something valuable, but if it can simplify anything: viperHTML uses a similar mechanism to parse once on the Node.js side.

The parser is the pretty awesome htmlparser2.

Probably inspiring as an API? I use the comment trick there, though; but since there is a .write mechanism, I believe it could be possible to make it incremental.

@jakearchibald These states are part of the intrinsic parser mechanism and are subject to change; we've even removed/introduced a few recently just to fix some conformance-error-related bugs in the parser. So, exposing them to end users would require us to freeze the current list of states, which would significantly complicate further development of the parser spec. Moreover, I believe some of them would be quite confusing for end users, e.g. the "Comment less-than sign bang dash dash state".

@inikulin would a subset be reasonable? For example, data and attr-value alone would cover 100% of hyperHTML's use cases, and I believe those two will never change in the history of HTML ... right?

I'm keen on exposing some parser state to help libraries, but I'm happy for us to add it later rather than block streaming parsing on it.

@WebReflection Yes, that could be a solution. But I have some use cases in mind that can be confusing for end users. Consider <div data-foo="bar". We'll emit the attr-value state in that case; however, this markup will not produce an attribute in the AST (it will not even produce a tag, since unclosed tags at the end of the input stream are omitted).

@inikulin if someone writes broken HTML I don't expect anything different than throwing errors and breaking everything right away (when using a new parser API)

Template literals are static; there's no way one of them would suddenly start failing the parser ... it either works or fails forever, since these are also frozen arrays.

Accordingly, I understand this API is not necessarily for template literals only, but if the streamer goes _bananas_ due to wrong output, it's the developer's fault.

Today it's the developer's fault regardless, but she'll never notice due to the silent failure.

if someone writes broken HTML I don't expect anything different than throwing errors and breaking everything right away.

You will be surprised looking at real-world markup around the web. Also, there is no such thing as "broken markup" anymore. There is non-conforming markup, but a modern HTML parser can swallow anything. So, to conclude, you suggest bailing out with an error in case of invalid markup in this new streaming API?

You will be surprised looking at real-world markup around the web.

you missed the edit: _when using a new parser API_

So, to conclude, you suggest to bail out with an error in case of invalid markup in this new streaming API?

If the alternative is to not have it, yes please.

I'm tired of missed opportunities due to lazy developers that need to be coddled by standards for their mistakes.

If the alternative is to not have it, yes please.

I'm tired of missed opportunities due to lazy developers that need to be coddled by standards about their mistakes.

I'm not keen on this approach, to be honest; it brings us back to the times of XHTML. One of the advantages of HTML5 was its flexibility regarding parse errors and, hence, document authoring.

This API's goal is different, and developers want to know if they wrote a broken template.

Not knowing hurts them, and since there is no HTML highlighting by default inside strings, it's also a safety belt for them.

So throw like any failed asynchronous operation would throw, and let them decide if they want to fall back to innerHTML or fix that template literal instead, and forever.

To be more explicit: nobody on earth would write the following or, if they did by accident, nobody would want it to succeed.

template`<div data-foo="bar"`;

so why is that a concern?

In JavaScript, something similar would be a SyntaxError, and it would break everything.

@inikulin

Consider <div data-foo="bar". We'll emit attr-value state in that case, however this markup will not produce attribute in AST (it will not even produce a tag, since unclosed tags in the end of the input stream are omitted).

FWIW this would be fine in an API like my example above. The promise that returns the currently-in-progress element/attribute would reject in this case, but the stream would still write successfully.

I agree that a radically different parsing style would be bad. I'd prefer it to be closer to the regular document parser than innerHTML.

I agree that a radically different parsing style would be bad. I'd prefer it to be closer to the regular document parser than innerHTML.

@jakearchibald They are pretty much the same, with the exception that for innerHTML the parser adjusts its state according to the context element before parsing.

@inikulin innerHTML behaves differently regarding script elements. I hope we could avoid those differences with this API.

@WebReflection

this API goal is different, and developers want to know if they wrote a broken template.
To be more explicit, nobody on earth would write the following, and nobody wants that to succeed.

This would only be true if templates were the sole use case for this API. What if I want to fetch some arbitrary content provided by a 3rd party? E.g. user-supplied comments or something else?

What if I want to fetch some arbitrary content provided by 3rd party? E.g. user-supplied comments or something else?

What about it? You'll never have partial content, just whole content. Or are you thinking about runtime evaluation of some user content that put some ${value} inside a comment?

In the latter case, I don't see a realistic scenario. In the "_just parse-stream it all_" case I don't see any issue; you'll never have a token in the first place.

Anyway, if it's about missed notifications due to silent failures and internal adjustments, I'm also OK with it. It'll heavily punish developers that don't test their templates, and I'm fine with that too.

@WebReflection To be clear, we are not talking about partial content only. There are many other cases where you can get non-conforming markup.

@inikulin I honestly see your argument like fetch(randomThing).then(b => b.text()).then(eval), which I fail to see as a desirable use case.

But like I've said, I wouldn't care if the silent failure/adjustment happens. I'm fine with the parser never breaking; it'll be somebody else's problem, as long as the parser can exist, exposing what it can, when it can, which covers 99% of the desired use cases for me.

Is this possible? Or is this a won't fix / won't implement?

This is the bit I'm not sure I understand from your answers. I read potential limits, but not proposed alternatives / solutions.

To be clear, we're not interested in introducing a new, third parser (besides the HTML and XML ones) that only accepts conforming content.

XML already accepts only conforming content, and I believe this parser would need to be compatible with SVG too.

However, like I've said, it works for me either way.

TL;DR can this parser expose data and attr-value tokens/states whenever these are valid?

If so, great, that solves everything.

All other cases are (IMO) irrelevant, but not having it because of a possible lack of tokens in broken layouts would be a hugely missed opportunity for the Web.

I hope I've also made my point of view clear.

Here are some requirements which I think sum up what's been discussed so far:

  • Support a stream as input.
  • Allow parsing a whole HTML/XML document.
  • Allow parsing a fragment of an HTML/XML document from a given starting point (similar to how createContextualFragment does it).
  • Investigate the benefits and API impact of making parsing off-thread.
  • Allow nodes to be generated before the incoming stream closes, enabling progressive rendering.
  • Behave as close as possible to the document parser, eg blocking further element creation until blocking scripts download and execute.
  • Don't prevent a future API addition that exposes some parser state during parsing.

@jakearchibald BTW, regarding script execution: maybe we can make it optional? For example, if I parse HTML from some untrusted source it would be nice to be able to prevent execution in the parsed fragment.

@inikulin I fear that may be false security. Although innerHTML doesn't download/execute script elements, it doesn't block attributes that are later executed (eg onclick attributes).

Seems safer to defer to existing methods that control script download & execution, like CSP and sandbox.

@jakearchibald Thinking of it a bit more, I wonder how the fragment approach is supposed to work, considering that when you append a fragment into a node, its children are adopted by the new parent node: https://dom.spec.whatwg.org/#concept-node-insert. So if we insert the fragment while content is still being piped into it, how should we behave? Make the parent node the receiver of all subsequent HTML content? In that case we'd need machinery to pipe HTML content into an element. In that regard, it would make more sense to implement a streaming parser API for elements and document fragments without introducing a new node type (something like element.writable and fragment.writable).

@inikulin In terms of adopting, how does https://jakearchibald.com/2016/fun-hacks-faster-content/#using-iframes-and-documentwrite-to-improve-performance work?

I don't like element.writable as it doesn't really fit with how writables can only be written to once. That's how I ended up with a special streaming fragment. It may be the same node type as a regular fragment though.

Hmm, it's a bit confusing that the fragment becomes some kind of proxy entity to pipe HTML into the element, considering that new nodes will not appear in the fragment. But maybe it's just my perception...

They'll appear in the fragment until the fragment is appended to an element.

It's no stranger than https://jakearchibald.com/2016/fun-hacks-faster-content/#using-iframes-and-documentwrite-to-improve-performance, but I guess that's pretty strange.

I very much share the goal of being able to streaming-parse HTML without blocking the main thread.

This goal is pretty connected to some of the goals I had in the DOMChangeList proposal (specifically the DOMTreeConstruction part of that proposal). Here's a sketch of how we could enhance that proposal to support these goals:

  • Create a new HTML parser that creates a stream of DOMTreeConstruction operations. Since DOMTreeConstruction is already a binary format, a stream of the binary blob might be sufficient.
  • Create a streaming version of insertTreeBefore that would take a stream of DOMTreeConstruction and append it into the DOM.

DOMTreeConstruction is already intended to provide a low-level API that can be used in a worker and transferred from a worker to the UI thread (without having to deal with the thorny questions of making a transferrable DOM available in workers). That makes it a nice fit for async parsing and possibly even streaming parsing.

This thread is really a missing piece of the other proposal: DOMChangeList provides a way to go from operations to actual DOM, but it doesn't provide a compliant way for going from HTML to operations. If we added a way from going from HTML to operations, we can break up the entire processing pipeline and do arbitrary parts of the process in workers (anything up to putting the operations in the real DOM).

As an unrelated aside, I would find it very helpful to have an API that provided a stream of tokenizer events that could be intercepted on the way to the parser. That would allow Glimmer to implement the {{ extension in user-space ({{ text isn't legal in all of the places where you would want it to be meaningful, and has different meaning in text positions vs. attributes). Today, we are forced to require a bundler for HTML, but I would love to be able to use more of the browser's infrastructure instead.

@domenic said:

To be clear, we're not interested in introducing a new, third parser (besides the HTML and XML ones) that only accepts conforming content.

Doesn't the existing HTML parser spec specifically describe a mode that aborts with an exception on the first error?

For non-streaming content, it would probably be sufficient just to expose whether an error had occurred at all (and then userspace could throw away the tree). For streaming content, it might also be sufficient (userspace could "roll back" by deleting any nodes that were already inserted?)

Wow, long thread is long. I had a busy morning, so I'll try to hit two points I caught just now.

  1. Async API: This would make it difficult to use this API in many scenarios. Right now when you create and attach an element, you may expect that the element has rendered synchronously. With an async parser API, if the element has to parse its template to render, that breaks. In essence, using an async parser API would be similar to using <template> today, but with asyncAppend instead of append. Lots of code would get more complex as element state itself becomes async and we don't have a standard way of waiting for an element to be "ready".

    Of course, if we had top-level await, we could hide that async API behind module initialization.

  2. Being able to get parser state while parsing fragments would be awesome, but in order to avoid inserting sentinels altogether, we'd need a few more features, like

    1. Get a reference to the under-construction element, or previously constructed text node.
    2. Prevent collapsing consecutive text nodes, i.e. if we parse <div>abc then def</div>, we'd need a way to get a reference to abc and def and not collapse them into a single node.

But stepping back, the real API I want is to be able to create a tree of DOM and easily get references to particular nodes cheaply. https://github.com/whatwg/html/issues/2254 (at least the "Template Parts" idea in there) would solve my use-case completely.

Another thing that would help is a variation on the TreeWalker API that didn't return a node from nextNode() so that I could navigate to a node without creating wrappers for all preceding nodes.

@justinfagnani I think @jakearchibald already solved your points 2.1 and 2.2:
https://github.com/whatwg/html/issues/2993#issuecomment-326552132

You can write a comment and retrieve it right away as your placeholder, so you'd have abc, then your content, later on whatever comes next, including def, and eventually more data wherever you add another comment:

`<div>a ${'b'} c ${'but also d'} e</div>`
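A minimal sketch of that comment-sentinel technique (the tag name and hole format here are made up for illustration): the template-literal tag joins the static chunks with uniquely numbered comments, so each hole can be found again once the string is parsed into DOM:

```js
// Hypothetical tag function: interleave static chunks with numbered
// comment sentinels that survive HTML parsing and mark the holes.
function sentinels(strings, ...values) {
  return strings.reduce(
    (out, chunk, i) =>
      i < values.length ? out + chunk + `<!--hole:${i}-->` : out + chunk,
    ''
  );
}

const html = sentinels`<div>a ${'b'} c ${'but also d'} e</div>`;
// html === '<div>a <!--hole:0--> c <!--hole:1--> e</div>'
```

After parsing, a TreeWalker filtered to comment nodes can map each `hole:N` comment back to the corresponding interpolated value.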

But stepping back, the real API I want is to be able to create a tree of DOM and easily get references to particular nodes cheaply

this is the same I've discussed already, but here they are proposing something better: while you are parsing, you can intercept and pollute the stream at runtime.

It's more powerful, but I understand the async painful point you made.

Here https://github.com/whatwg/html/issues/2993#issuecomment-326547102

I've proposed a way to retrieve a UID from either attributes or content, which is, I believe, similar to the API you want to retrieve particular templates.

It does seem this issue has kind of exploded. Everyone has interpreted the OP as if it's interested in helping with their particular problem. Roughly:

  • @jakearchibald is interested in using this for progressive rendering.
  • @WebReflection, @justinfagnani, and @wycats are interested in using this to support their templating engines
  • @wycats also wants it to help support and synergize with his DOMChangeList idea.

This is all interesting discussion, and I don't want to discourage it. But I do want to highlight that it's unlikely we'll end up solving all of these problems with a single API. The OP was specifically spun off of #2827, which is more about progressive rendering (thus, async, streams, and no template interpolation). I just want people to be aware that we may solve that separately, and leave templating and an integer-based instruction set for tree construction to other APIs.

Of course by now this thread is mostly about templating engines and their needs, so maybe it should be repurposed. We'll see where the discussion takes us :). Personally I am with @justinfagnani that template parts is the most promising direction so far for that particular problem.

I think @wycats has a reasonable point that if we want to provide a low-level parser API that can also be used in workers, the result of that needs to be some kind of changeset that can be applied to trees. That's also roughly how off-the-main-thread parsers in browsers need to be modeled today. Finding the primitives upon which the whole thing is built seems like the best idea to me given past frustration with higher-level alternatives that don't quite address the needs.

@annevk sure, if the point of this feature was workers, then a change list might make sense. But I am personally more interested in allowing the browser to do work in an asynchronous and streaming fashion, instead of focusing on workers. That async/streaming fashion could be potentially off-thread if that provides some benefit, but purely as an implementation detail.

For example, browsers are likely to start just using the main thread, just asynchronously to prevent jank. Then later they may investigate using native threads, which they can do much more effectively than JS. JS can only put binary data in shared memory, whereas browsers could put actual C++ Node objects. (Not that those are thread-safe today, but it's a possible future implementation strategy.)

Stated another way, I don't think it's correct to identify a change list as an underlying primitive. It's a new high-level API aimed at a very specific, new use case; it's not related to the primitives that currently underlie streaming HTML parsing.

What kinds of errors? @jakearchibald, @inikulin, et al.

I mean where the HTML spec says to "report an error." I strongly agree with @domenic et al that we should not make a different/strict syntax for HTML. Whether/where errors are reported is a relatively unimportant detail, let's worry about it later.

re: progressive rendering

@jakearchibald, I think we would need to build a controller which balances not only parsing and tree construction work, but append, style, layout and paint work. In the worst case, let's say the controller is terrible and behaves like innerHTML; is this feature still worth it? What's the point where it becomes compelling?

One benefit of keeping this API pretty high-level without hooks into substitution, states, etc. is that in future when there's something like "async append" the existing uses of the API could become async appends, just with the UA doing the commit ASAP.

re: templating/hole finding

Crudely, DOM Ranges can point to an element and an offset. The DOM doesn't have a way to point to tag names, attribute names or values. Additionally, tag names can't be changed later. So if hole finding is a thing, it needs to either happen as part of tokenization or have some restrictions placed on it to make sense.

The HTML templating system I worked on in 2007 got a lot of benefit out of requiring that a substitution did not flip the HTML parser into different modes. That system also cared about semantics, for example, that a given attribute value contains script. The HTML parser doesn't, but our (Blink's) built-in XSS protection knows which attribute names are event handlers and so on. A tokenization-level API supporting (or even requiring) those sorts of restrictions could be useful.

re: DOM construction lists

I think if you want to construct elements from a worker, HTML's text serialization is a hard format to write correctly and read correctly. This parsing API could help make the reading side of things better but doesn't do anything for writers. HTML's text serialization is primarily useful because there's a good chance you already have your data in this format.

@wycats ' proposal for DOM tree construction command buffers is reminiscent of the HTML parser's operations, but you have to squint (appendHTML is a bit like "inject document.written bytes".) I'm not sure how usable HTML's thing would be as an API. You have to do attributes before content, for example. There's also some things missing: HTML knows a context it is parsing into before it starts (like, "fragment parsing into a table", etc.) and has some wild operations (like "reconstruct the set of open formatting elements") and so on.

The thread here seems to assume a lot of context that is not stated explicitly. What should I read to learn about the use cases?

@dvoytenko , could you tell us more about your specific use case for the document.open/write/close code in #2827?

This thread has 3-4 different proposals, yet no clear goal or use cases for this API. What problem(s) are we trying to solve here?

@rniwa Quite a few sites, including GitHub, hijack link clicks and perform the navigation themselves to avoid reparsing/executing the same JavaScript on the next page. However, this can become a lot slower on long github pages, as you lose the benefit of streaming with innerHTML. See https://jakearchibald.com/2016/fun-hacks-faster-content/.

Thinking of the extensible web manifesto, a streaming parser would expose this existing browser behaviour to JavaScript, without having to resort to iframe hacks.

If it could also expose some parser state mid-parse, it would help (although not completely solve) some template cases. But helping somewhat unrelated cases feels like a win in terms of the extensible web.

@domenic

Personally I am with @justinfagnani that template parts is the most promising direction so far for that particular problem.

I agree that there are better targeted solutions for that particular case, but isn't this 'appcaching' it? Offering low-level parser details feels like it would help more use-cases.

@dominiccooney

I think we would need to build a controller which balances not only parsing and tree construction work, but append, style, layout and paint work

I think that's compatible. If this thing supports streams, it can support back-pressure. Also, the "get parser state" method could return a promise that waits for the queued HTML to flush.

Additionally, tag names can't be changed later. So if hole finding is a thing, it needs to either happen as part of tokenization or have some restrictions placed on it to make sense.

You could achieve this with some parser info:

```js
whatever`
  <${foo} src="hi">
`;
```

After flushing < the parser should know it's in the tag-open state.
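As a drastically simplified illustration (a toy three-state machine, nowhere near the full spec tokenizer, and createMiniTokenizer is invented for this sketch), a tag function could feed the static chunks to a tokenizer and inspect its state before deciding how to treat the next interpolation:

```js
// Toy tokenizer tracking just data / tagOpen / tagName states.
// The real HTML tokenizer has dozens of states; this only shows
// how state inspection after a flush could work.
function createMiniTokenizer() {
  let state = 'data';
  return {
    write(chunk) {
      for (const ch of chunk) {
        if (state === 'data' && ch === '<') state = 'tagOpen';
        else if (state === 'tagOpen' && /[a-zA-Z]/.test(ch)) state = 'tagName';
        else if ((state === 'tagOpen' || state === 'tagName') && ch === '>') state = 'data';
      }
    },
    get state() { return state; },
  };
}

const t = createMiniTokenizer();
t.write('<'); // after flushing "<" ...
// t.state === 'tagOpen': the splice ${foo} would be parsed as a tag name
```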

@hsivonen

https://jakearchibald.com/2016/fun-hacks-faster-content/ might help.

FWIW I see this pattern as a footgun

```js
whatever`
  <${foo} src="hi">
`;
```

Attributes have different meanings according to the kind of node you want to put them on.
It might play well with strings on the server side, but on the DOM side I don't see it as a must-have; quite the opposite.

What I mean is that the following, which at this point could also be allowed too, doesn't look good at all.

```js
whatever`
  <${foo} ${attr}=${value}></${foo}>
`;
```

What if foo is a br or any other void element, or vice-versa (you wrote <${foo}/> but it's not void)? IMO this goes a bit too far from what I (personally) ever needed from a template parser/engine.

@WebReflection I agree that this kind of interpolation wouldn't make sense for HyperHTML, but it would make it easy to detect this situation and throw a meaningful error.

I'm much more interested in exposing existing browser internals to create new possibilities and make existing things easier, than creating a new inflexible API that solves one use-case.

it wasn't about hyperHTML, it was more about common sense.

What does the following produce?

```js
whatever`
  <${bar} ${attr}=${value}>
  <${foo} ${attr}=${value}></${foo}>
`;
```

It's absolutely unpredictable and it's also XSS prone, IMO, but surely I don't want to block anyone exploring anything, I'm just thinking loudly about that pattern.

Any system that's piping text directly into a parser needs to be very careful with user input. Using parser state, the developer can pick the appropriate escaping method, or throw if it's a state they don't want/wish to support.

@jakearchibald from what I can tell you're still thinking of the parser states as a "real" thing, and as a low-level primitive. But as @inikulin pointed out, they're not really primitives, they're just implementation details and spec devices we use to navigate through the algorithm. I also don't think we should expose them.

@dominiccooney We use inactive document's open/write to stream shadow DOM. We display relatively big documents in shadow roots and streaming helps a lot with perceived latency. The way it works is this:

a. We create a buffer (inactive) document document.implementation.createHTMLDocument and call open on it.
b. For each new chunk arriving via XHR, we call document.write on the buffer document
c. We do some preprocessing on bufferDocument.head
d. For each new bufferDocument.body.firstChild we move it to the real attached shadow root.

These steps achieve a rather good perceived streaming performance. Once a node is moved, the subsequent streaming happens in the shadow root. It works much like https://jakearchibald.com/2016/fun-hacks-faster-content/ suggests, but we don't want to use iframes.

Use Cases

@dvoytenko Thanks for those details. Roughly how much content are we talking about here?

@jakearchibald Your "fun hacks faster content" is awesome—this is the kind of scenario I have in mind. (Actually I read "fun hacks" with interest in 2016 and it has been irritating me ever since. It bothers me how hard it is to do this; that you have to break up chunks yourself; that Blink runs script elements adopted from the iframe when the spec says don't do that; etc.)

@rniwa, @justinfagnani I think the template filling is meeting a different set of use cases. The abstraction and staging is different: Template filling seems more focused on DOM, whereas this is about splicing characters without breaking tokenization; template filling seems more focused on having an artifact which is instantiated, maybe multiple times, whereas this is about streaming: taking bytes from somewhere and rehydrating them exactly once. I could even envisage these things being used together, for example, you stream a component definition and use the API proposed here to parse it; that includes a template you fill when an instance of the component is newed up.

@inikulin, @rniwa, is that satisfactory? Do you have any follow up questions about use cases?

@jakearchibald wrote:

I think that's compatible. If this thing supports streams, it can support back-pressure. Also, the "get parser state" method could return a promise that waits for the queued HTML to flush.

I agree! I'm just worried that there's a path dependence here. How naive could an implementation of this API be and still be useful?

@WebReflection, below is an extended meditation on how we ameliorate the XSS problem you mention in your example. This doesn't solve the problem of self closing tags causing the structure to be unpredictable, though. I don't think that is a terrible problem. A conservative author could just always write immediate closing tags for any spliced tagnames; I believe it is always safe to write closing tags, even for self-closing tags. (@domenic?)

Exposing parser states

I agree that exposing HTML parser states is a bad idea because it will limit parser evolution and probably just annoy authors anyway ("oh, I handled the comment state but forgot to handle the comment end dash state".)

What if we exposed a smaller set of states?

For example, we could map the _tag open state_ and _tag name state_ into one "abstract" state, say, _tag name_. After feeding the slice to the parser we require the parser to be in the HTML spec _tag name state_; if not, then that might be a hard error.

We could start conservatively by allowing splicing in a small set of states—tag names, attribute names, attribute values, and text—and impose restrictions, for example, maybe parsing the splice in html`<div>${thing}</div>` is only allowed in the HTML spec data, character reference, named character reference, numeric character reference, hexadecimal character reference, ... etc. states and must end in the data state. This would allow _thing_ to be "fun &amp; games" but not "fun <script>alert('and games')" (we would abort when we hit the < and try to transition into tag open state) or "fun &" (we would abort when we finished parsing the splice and find ourselves in the character reference state and not the data state.)

I expect the implementation would carry around a bitset of allowed states which it tests on transitions. There's a bunch of states but many could be collapsed because we never allow splicing near the DOCTYPE states and so on. This could slow main document parsing, but making the parser yield more often probably means we're on a slow path anyway. I think it's probably fine.
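A hedged sketch of the allowed-states idea for text splices (the state machine below models only a data state and a character-reference state; the real tokenizer has far more): walk the splice, reject any transition toward tag-open, and require ending back in the data state:

```js
// Toy validator for splices in text position. Returns false if the
// splice would try to enter tag-open state, or if it ends mid-way
// through a character reference.
function spliceAllowedInText(splice) {
  let state = 'data';
  for (const ch of splice) {
    if (state === 'data') {
      if (ch === '<') return false;   // would transition to tag open
      if (ch === '&') state = 'charRef';
    } else if (state === 'charRef') {
      if (ch === ';') state = 'data'; // reference terminated
      else if (!/[#a-zA-Z0-9]/.test(ch)) return false;
    }
  }
  return state === 'data';            // must not end mid-reference
}

spliceAllowedInText('fun &amp; games');        // true
spliceAllowedInText("fun <script>alert('x')"); // false: hits "<"
spliceAllowedInText('fun &');                  // false: ends in charRef
```

This matches the examples above: "fun &amp; games" passes, while splices that open a tag or leave a dangling reference are rejected before they ever reach the real parser.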

We also have the option of implementing different syntax for splices so you can splice a string and not worry about whether it's being spliced into text or an attribute, and whether that attribute was single quoted, double quoted, or unquoted.

But say in future we want to allow arbitrary markup there. We can do this with a set of functions authors use to communicate how they want the splice handled; these return an object that the outer _html_ function interprets, for example html`<div>${hi_parser_trust_me`${thing}`}</div>` where _hi_parser_trust_me_ is another platform function which returns an object that the outer _html_ function knows to interpret with relaxed parsing rules. Of course we'd need to take care with the design and design a useful set of those functions with intuitive names and make shorthands like html`<div>${hi_parser_trust_me(thing)}</div>` also work.

I still don't understand what the use cases of this feature are. If Github is using a single call to innerHTML, and it's slow, the correct fix is to break that up into chunks and then process each chunk separately. I have a hard time believing that Github's scenario warrants a completely new streaming HTML parser.

Please give us a list of concrete use cases for which this feature is required. This is a massive feature which requires a ton of engineering effort to implement in browser engines, and I'd like to have clear-cut important use cases that can't be satisfied without it; not something websites can very easily work around.

If Github is using a single call to innerHTML, and it's slow, the correct fix is to break that up into chunks and then process each chunk separately.

It is hard to break up a chunk of HTML without parsing it.

I have a hard time believing that Github's scenario warrants a completely new streaming HTML parser.

There's already a streaming HTML parser—the main document parser. It is just hardwired to work in certain settings and not others.

I'd like to have clear-cut important use cases that can't be satisfied without it; not something websites can very easily workaround.

I think it is helpful to have use cases, so yeah, let's sharpen them up. What is "clear-cut" and "important" might be a bit subjective; what's your standard?

I want to push back on this idea that workarounds are OK. If authors end up having to rely on lots of workarounds, the accumulated burden can be significant. I think @jakearchibald's post about streaming load performance is worth studying: How long does it take authors to discover this iframe, document.write hack? How resource intensive is spinning up an iframe? How bad is it to enshrine that Safari/Chrome/Edge script running bug?

If Github is using a single call to innerHTML, and it's slow, the correct fix is to break that up into chunks and then process each chunk separately.

It is hard to break up a chunk of HTML without parsing it.

As far as I can tell, Github is generating HTML for the entire comment section & sending it over XHR. Unless their backend somehow parses HTML each time it has to modify an issue page, they should have a mechanism to generate HTML per comment. At that point, they could be splitting up markup via comments and batch them up and send it via XHR.

Also, browser engines could implement an optimization to speculatively tokenize & parse DOM nodes when a content with text/html MIME type is fetched.

I have a hard time believing that Github's scenario warrants a completely new streaming HTML parser.

There's already a streaming HTML parser—the main document parser. It is just hardwired to work in certain settings and not others.

I'm saying that exposing and maintaining that as a JS API without introducing a major security vulnerability would require a significant engineering effort.

I want to push back on this idea that workarounds are OK. If authors end up having to rely on lots of workarounds, the accumulated burden can be significant.

It all depends on the cost. It would be great if we could make the DOM thread-safe and expose it to Web workers without perf & security issues, but the engineering effort required to do that is akin to rewriting WebKit from scratch, so we wouldn't lightly propose such a thing. That's a bit of an extreme case, but there's always a cost & benefit trade-off in every feature we're adding to the Web platform, and there's also an opportunity cost. The time I spend implementing this feature is time I could spend fixing other bugs and implementing other features in WebKit.

Since this feature has a significant implementation cost (at least to WebKit), the justification for it needs to be correspondingly strong.

to speculatively tokenize & parse DOM nodes when a content with text/html MIME type is fetched

That doesn't help with actually adding these elements to the DOM in a streaming fashion, though.

Let's, as the first step, minimize the required API for @jakearchibald's original use case to something like:

document.getElementById('div').appendChildStream(respStream);

What are the new security implications of this that are not already present for innerHTML? What is the added implementation cost that is not covered by main document parser and/or iframe hack?

I understand @rniwa argument, which is why I hoped for a very simple scenario that I believe would already solve 99% of use cases: attribute value, content chunk.

I also agree with @RReverser this should start as small as possible or it won't ever land.

```js
const tag = document.createStreamTag((node, value) => {
  if (node.nodeType === Node.ATTRIBUTE_NODE) {
    // we have an attribute. We can reach its owner
    // we can deal with its name and the value as content
  } else {
    // we have a Node.ELEMENT_NODE
    // it's still open with N childNodes
    // we can append a new node, discard the value, do whatever
  }
});

// parse & stream
tag`<div class=${'name'} onclick=${fn}>
  ${'some content'}
</div>`;
```

The tag stream will always return a DocumentFragment (in this case containing a div), and the above example will invoke the callback 3 times:

  • the first time with the class attribute node, and the value "name";
  • the second time with the onclick attribute node, and fn as-is as the listener (no implicit .toString() or anything);
  • the third time with the div node itself, with childNodes.length equal to 1, which is the text before the chunk.

The value of the third invocation will be the text "some content", but it could also be anything else, including a Promise object.

If this was possible through the platform, it'd be quite revolutionary.

All primitives to enrich the logic on top would be there. The only missing bit to cover all my use cases (already implemented and available to check out if you want) is the fact that HTML is case-insensitive, so an attribute like onCustomEvent would result in a DOM attribute named oncustomevent instead.

The latter is not a huge limitation, but maybe somebody has an idea of how that could be solved.

@domenic

from what I can tell you're still thinking of the parser states as a "real" thing, and as a low-level primitive. But as @inikulin pointed out, they're not really primitives, they're just implementation details and spec devices we use to navigate through the algorithm.

If browsers don't implement it & don't intend to, what's the point of having it in a spec? I realise that browsers may use different terms internally, but unless they're implementing something wildly different to the spec, and intend to continue doing so, those states could be mapped to something standard.

I also don't think we should expose them.

Why?

@dominiccooney

What if we exposed a smaller set of states?

Agreed. We could even start by exposing nothing, but design the parser in a way that allows this in future.

@jakearchibald

but unless they're implementing something wildly different to the spec

Sometimes they do, e.g. Blink doesn't use the states from the spec that are dedicated to entity parsing and uses a custom state machine for that: https://chromium.googlesource.com/chromium/blink/+/master/Source/core/html/parser/HTMLEntityParser.cpp

Right, generally specifications define some kind of process that brings you from A to B. The details of that process are not important and implementations are encouraged to compete in that area. The moment you want to expose more details of that process to the outside world it starts mattering a whole lot more what those details are and how they function, as the moment you expose them you prevent all kinds of optimizations and code refactoring that could otherwise take place.

Fair enough. It'd be good to expose these states at some point, but it doesn't need to be v1.

@WebReflection I agree having events for separate pieces of the HTML as it goes through would be quite nice, but I'd say it's already a little bit more advanced than the "as small as possible", more like version 2. For version 1, it would be nice at least to be able to insert streaming content into the DOM even without hooks for separate parts of it.

events are just attributes ... what I've written intercepts/pauses at dom chunks and / or attributes, no matter which attribute it is or what it does ... attributes :smile:

@WebReflection Sure, but as I said, it's a bit more advanced because it requires providing hooks from inside of the parser. I want to start with something that will be definitely possible to get implemented by vendors with pretty much no changes or hooks that are not already there, and then iterate on top of that.

@dominiccooney

Thanks for those details. Roughly how much content are we talking about here?

This is really full-size docs. Anywhere between 10K and 200K. I don't know what averages are, tbh.

https://github.com/whatwg/html/issues/2142 – previous issue where a streaming parsing API was discussed

Another important question: do we want it to behave like a streaming innerHTML? If so, such functionality can't be achieved with the fragment approach, since we don't know the context of parsing ahead of time. Consider we have a <textarea> element. With the innerHTML setter, the parser knows that content will be parsed in the context of a <textarea> element and switches the tokeniser to text parsing mode. So, e.g. <div></div> will be parsed as text content. Whereas with a fragment, we'll parse it as a div tag. If we use the same machinery for the fragment parsing approach as we use for <template> parsing, we can work around some of the cases, such as parsing table content (however, e.g. foster parenting will not work), but everything that involves adjustment of the tokeniser state will be a problem.
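For reference, the context-dependent tokenizer setup that the fragment-parsing algorithm performs can be sketched roughly like this (state names follow the HTML spec; the mapping is simplified and initialTokenizerState is just an illustrative helper, not a real API):

```js
// Simplified version of the fragment-parsing steps that pick the
// tokenizer's initial state from the context element's tag name.
function initialTokenizerState(contextTagName) {
  switch (contextTagName) {
    case 'title':
    case 'textarea':
      return 'RCDATA';       // entities decoded, no tags recognised
    case 'style':
    case 'xmp':
    case 'iframe':
    case 'noembed':
    case 'noframes':
      return 'RAWTEXT';      // no entities, no tags
    case 'script':
      return 'script data';
    case 'plaintext':
      return 'PLAINTEXT';
    default:
      return 'data';         // normal markup
  }
}

initialTokenizerState('textarea'); // 'RCDATA': <div></div> becomes text
initialTokenizerState('div');      // 'data':   <div></div> becomes elements
```

This is exactly why a streaming fragment parser can't start tokenizing until it knows its context: the same input bytes tokenize differently under each of these states.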

@inikulin The fragment could buffer text until it's appended, at which point it knows its context. Although I guess it's a bit weird that you wouldn't be able to look at stuff in the fragment.

The API could take an option that would give it context ahead of time, so nodes could be created before insertion.

@jakearchibald What if we modify the API a bit? We'll introduce a new entity, let's call it StreamingParser for now:
```js
// If we provide context element, then content is streamed directly to it.
let parser = new StreamingParser(contentElement);

let response = await fetch(url);
response.body
.pipeTo(parser.stream);

// You can examine parsed content at any moment using parser.fragment
// property which is a fragment mapped to the parsed content in context element
console.log(parser.fragment.childNodes.length);

// If context element is not provided, we don't stream content anywhere,
// however you can still use parser.fragment to examine content or attach it to some node
parser = new StreamingParser();

// ...
```

If you don't provide the content element, how is the content parsed?

In that case parser.fragment (or, even better, call it parser.target) will be a DocumentFragment implicitly created by the parser.

Is that a valid context for a parser?

As in, if I push <path/> to the parser, what ends up in parser.fragment?

DocumentFragment itself is not a valid context for the parser. I forgot to elaborate here: in case we don't provide a content element for the parser, it creates a <template> element under the hood and pipes content into it; parser.target will be template.content in this case.

It'd still be nice to have the nodes created before the target. A "context" option could do this. The option could take a Range, an Element (treated like a range that starts within the element), or a DOMString, which is treated as an element that would be created by document.createElement(string).

How it will behave if we pass a Range as a context?

@jakearchibald Seems like I got it: in the case of a Range, we'll stream to all elements in the Range? If so, we'll need a separate instance of the parser for each element in the Range.

@inikulin whoa, I really thought I'd replied to this, sorry. Range would simply be used to figure out the context, like https://w3c.github.io/DOM-Parsing/#idl-def-range-createcontextualfragment(fragment). There'd only be one parser instance.

@jakearchibald Thanks for the clarification. We've just discussed possible behaviours with @RReverser and we were wondering if parsing should affect the context element's ambient context: e.g. in case we stream inside a <table> and the provided markup contains text outside a table cell, should we move this text above the context <table> element (foster-parent it) as is done in full document parsing? Or should we behave exactly like innerHTML and keep the text inside the <table>?

Hmm, that's a tough one. It'd be difficult to do what the parser does while giving access to the nodes before they're inserted. As in:

```js
const streamingFragment = document.createStreamingFragment({context: 'table'});
const writer = streamingFragment.writable.getWriter();
await writer.write('hello');

// Is 'hello' anywhere in streamingFragment.childNodes?
```

In cases where the node would be moved outside of the context, we could do the innerHTML thing, or discard the node (it's been moved outside of the fragment, to nowhere).

I'd want to avoid as many of the innerHTML behaviours as possible, but I guess it isn't possible here.

Another concern we discussed with @inikulin (also related to the discussion in the last few comments) is that content being parsed might contain closing tags and so leave the parent context. In that regard, the behaviour of innerHTML or createContextualFragment seems better in that it keeps the content isolated, although we're still not sure how stable the machinery for the latter API is (given that it does more than innerHTML, e.g. executing scripts is allowed).

In an offline discussion, @sebmarkbage brought up the helpful point that if we added Response-accepting srcObject to iframe (see https://github.com/whatwg/html/issues/3972), this would also serve as a streaming parsing API, albeit only in iframes.

@domenic Hmm, I'm not sure how it would help with streaming parsing? Seems to mostly help with streaming generation of content?

@RReverser The parsing would also be done in a streaming fashion, just like it is currently done for iframes loaded from network-derived lowercase-"r" responses.

What I mean is, I don't see how this helps with actually parsing HTML from JS side (and getting tokens etc.), it rather seems to help with generating and delivering HTML to the renderer.

Actually nevermind, I realised that half of this old thread was already about the "delivery to the renderer" problem and not actual parsing. Which is useful too, but seems confusing to mix both in the same discussion.

So I have an idea: how about something like this on HTMLElement: Promise<void> replaceChildrenWithHTML((ReadableStream or DOMString) stream)? This would lock the children list (attempts to read the children fail with an error) and return a promise resolved once it's unlocked and ready to manipulate again. This in effect would be an asynchronous elem.innerHTML = ..., and would be easy to make efficient with background DOM parsing. Note that the browser can append elements at any time, and while you can't manipulate elements themselves, addition can still be detected by properties like outerHeight. (This is so they can pipeline it - it makes for a better user experience.)

As for why a generic readable stream? Such an API could be immensely useful for not just things like displaying Markdown documents from the server, but also for things like displaying large CI logs and large files, where in more advanced cases, a developer might choose to use the scroll position plus the current outer height to determine a threshold to render more items, simply buffering the logs until they're ready to render them. (I could totally see a browser doing this for displayed text files whose sizes are over 100MB - they might even choose to buffer the rest to disk to save memory and just read from there after they've received everything from network, only pre-loading things that are remotely close to the viewport.)


I'm aware of how old this bug is. I still want to resurrect it.
