It was suggested that I post this here, so here goes. Sorry if this is the wrong place to ask. I've only recently started reading the specification, mostly the parts on tokenization, because I'm building a parser for an HTML-based templating language.
I reference W3C's specification here, but I've checked WHATWG as well and the relevant parts seem to be the same (at a quick glance the only difference is better documentation of parse errors).
I'm having trouble understanding the point of the bogus comment state. One way to get there is by encountering the characters <?, at which point we "[c]reate a comment token whose data is the empty string". While in the bogus comment state, we pretty much only append to that comment token and then "[e]mit the comment token".
What I do not understand is how this (the fact that we were in the bogus comment state) is reflected in the emitted stream of tokens. There is no flag on the token; the spec is clear on what a comment token has: "Comment and character tokens have data."
It certainly seems like it shouldn't result in a regular comment, based on the description in § 8.1.6 Comments, where it's clearly stated that a comment must start with <!--.
I've searched for the word "bogus" across the whole chapter and some other chapters, but it always appeared in the context of which _state_ to switch to. (I couldn't find a way to search across the whole document, so I just used the browser's search.) The only other place where I found this mentioned is this example, but I don't think it's relevant.
Maybe I'm missing something obvious, but I was really confused. There is also very little to find online about the bogus comment state, probably because browsers just swallow these comments.
I haven't checked your issue in detail, so apologies if this is off the mark, but one thing that jumped out at me is this distinction:
8.1.6 specifies what authors must do to write HTML that conforms to the spec.
The parser spec explains how browsers handle all input, including invalid input like <? Some bogus comment ?>. The word "bogus" is specifically used to connote that these comments are not generated by valid markup.
The bogus-comment state exists mainly for the benefit of error-reporting parsers, meaning those that choose to report every case the spec defines as a “parse error”:
https://html.spec.whatwg.org/multipage/parsing.html#parse-errors
> This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.
So if you’re implementing an error-reporting parser, the point of the bogus-comment state is to allow your implementation to identify bogus comments so you can report errors for them.
But if you’re not implementing an error-reporting parser, the point of the bogus-comment state is to allow you the option to make your implementation abort parsing if it encounters a bogus comment. You’re not required to abort for parse errors, but you’re also not prohibited from aborting (see above).
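A minimal sketch of those two options, assuming a hypothetical `ErrorSink` class and `ParseAborted` exception (neither name comes from the spec) and a deliberately naive scan for `<?`. The error code string is the name the WHATWG spec gives to this particular parse error; everything else here is illustrative only.

```python
from dataclasses import dataclass, field


class ParseAborted(Exception):
    """Hypothetical exception: the strict policy stopped at a parse error."""


@dataclass
class ErrorSink:
    # True: abort at the first parse error. False: record it and carry on.
    abort_on_error: bool = False
    errors: list = field(default_factory=list)

    def parse_error(self, code: str, offset: int) -> None:
        self.errors.append((code, offset))
        if self.abort_on_error:
            raise ParseAborted(f"{code} at offset {offset}")


def report_bogus_comment_openers(html: str, sink: ErrorSink) -> None:
    # Naive on purpose: a real tokenizer only sees "<?" as an error when it is
    # in the tag open state, so "<?" inside a comment, a script, or an
    # attribute value would not be reported.
    pos = html.find("<?")
    while pos != -1:
        sink.parse_error("unexpected-question-mark-instead-of-tag-name", pos)
        pos = html.find("<?", pos + 2)
```

With `ErrorSink()` every occurrence is recorded and scanning continues; with `ErrorSink(abort_on_error=True)` the first occurrence raises `ParseAborted`. Either way the error travels through the sink, not through the token.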
> What I do not understand is how this (the fact that we were in the bogus comment state) is reflected in the emitted stream of tokens.
It’s not.
> There is no flag on the token; the spec is clear on what a comment token has: "Comment and character tokens have data."
Right. Bogus comments from the source markup just end up in the resulting document as regular comments syntactically indistinguishable from any other comments.
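To make that concrete, here is a rough sketch, assuming a hypothetical `Comment` token type and a toy `comment_tokens` function that handles only `<!--...-->` and `<?...>` and ignores every other construct (including edge cases such as `<!-->`). It is not the spec's state machine, only an illustration of the resulting tokens.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    data: str  # the only field; nothing records which state produced it


def comment_tokens(html: str) -> list:
    tokens = []
    pos = 0
    while pos < len(html):
        if html.startswith("<!--", pos):
            # Regular comment: data is everything up to "-->" (or end of input).
            end = html.find("-->", pos + 4)
            end = len(html) if end == -1 else end
            tokens.append(Comment(html[pos + 4:end]))
            pos = end + 3
        elif html.startswith("<?", pos):
            # Bogus comment: the "?" is reconsumed in the bogus comment state,
            # so it becomes part of the data; ">" (or end of input) ends it.
            end = html.find(">", pos + 1)
            end = len(html) if end == -1 else end
            # NULL characters are replaced with U+FFFD in this state.
            tokens.append(Comment(html[pos + 1:end].replace("\0", "\ufffd")))
            pos = end + 1
        else:
            pos += 1  # everything else is skipped in this toy version
    return tokens


def serialize(comment: Comment) -> str:
    return f"<!--{comment.data}-->"
```

Under these assumptions, `comment_tokens("<? Some bogus comment ?>")` and `comment_tokens("<!--? Some bogus comment ?-->")` return equal token lists, and serializing the token gives `<!--? Some bogus comment ?-->` in both cases.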
> It certainly seems like it shouldn't result in a regular comment, based on the description in § 8.1.6 Comments, where it's clearly stated that a comment must start with <!--.
As @domenic points out, that part of the spec states requirements for authoring, not for parsing. Note again the part of the spec paragraph cited above that says:
> This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not.
> I reference W3C's specification here
It’s imprudent to be using the https://w3c.github.io/html/ spec for implementing a conformant parser. Among other reasons, it’s not the spec that other implementors work from.
> but I've checked WHATWG as well and the relevant parts seem to be the same (at a quick glance the only difference is better documentation of parse errors)
There are other differences. If you’re building a parser, you want to be working from the https://html.spec.whatwg.org/multipage/ spec.
Thanks for the replies everyone, it's all clear now. In fact, after sleeping on it, I'm not even sure why I expected the info about the bogus comment state to be in the token -- the spec clearly states when a parse error should be thrown (or recorded).
> If you’re building a parser, you want to be working from the https://html.spec.whatwg.org/multipage/ spec.
Thanks for the pointer, @sideshowbarker. I've read more about the difference between the two now (never had a chance -- or a reason -- before). I'll be consulting the WHATWG spec from now on :slightly_smiling_face: