Csswg-drafts: [selectors-4] [css-syntax-3] Does the U+0000 → U+FFFD rule apply to selector parsing?

Created on 7 Apr 2020 · 9 Comments · Source: w3c/csswg-drafts

I just added a test at http://wpt.live/dom/nodes/ParentNode-querySelector-escapes.html which boils down to

const el = document.createElement("span");
el.id = "\u{fffd}";
document.body.append(el);

assert_equals(document.body.querySelector("#\u{0}"), el);

i.e. which assumes the selector "#\u{0}" gets parsed as "#\u{fffd}". (Note: this is distinct from the selector represented in JS code as "\\u000"; we are using JS escapes, not CSS escapes, so the string contains a literal U+0000 code point, not a backslash, u, etc.)
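To make the distinction in the note concrete, here is a minimal sketch that lists the code points of two JS strings. The helper `codePointsOf` and the second string literal are illustrative assumptions, not part of the test; the sketch only shows what each string contains and says nothing about how a CSS parser treats it.

```javascript
// Hypothetical helper: list each code point of a string as 4-digit hex.
function codePointsOf(str) {
  return Array.from(str, (ch) =>
    ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// JS escape: the string holds "#" followed by a literal U+0000.
const jsEscaped = "#\u{0}";

// Escaped backslash (illustrative): the string holds "#", a backslash,
// "u", and four "0" digits. No U+0000 is present.
const backslashed = "#\\u0000";

console.log(codePointsOf(jsEscaped));   // [ '0023', '0000' ]
console.log(codePointsOf(backslashed)); // [ '0023', '005C', '0075', '0030', '0030', '0030', '0030' ]
```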

Firefox passes this test, while Chrome and Safari fail it.

However, I'm unsure whether the spec actually implies this. "Parse a selector" links to <selector-list>, which bottoms out in <hash-token>. Nothing there explicitly handles U+0000, and it has an inclusive-sounding sentence: "value composed of zero or more code points".

The tokenization section of css-syntax-3 makes a U+0000 → U+FFFD replacement. However, it's not clear to me whether this applies to selector parsing when the selectors enter via DOM APIs and the "parse a selector" algorithm. My best guess is that it applies to CSS style sheets only.
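For reference, the replacement in question can be sketched as a string-to-string function. This is a rough approximation of the css-syntax-3 input preprocessing step as I understand it, operating on an already-decoded JS string; the function name `preprocessInput` is hypothetical, and only newline normalization and the U+0000 replacement are covered here.

```javascript
// Sketch of css-syntax-3 "input preprocessing" (assumed reading of the spec).
// Hypothetical name; not an API of any browser or library.
function preprocessInput(input) {
  return input
    // CR LF pairs, lone CR, and FF each become a single LF.
    .replace(/\r\n|\r|\f/g, "\n")
    // U+0000 NULL becomes U+FFFD REPLACEMENT CHARACTER.
    .replace(/\u0000/g, "\uFFFD");
}

console.log(preprocessInput("#\u{0}") === "#\u{fffd}"); // true
```

If this step applies to strings passed to querySelector(), the selector in the test above would reach the tokenizer as "#\u{fffd}", which is what the test expects.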

If the intention is to apply the tokenization rules to selector parsing, I'd suggest linking the word parse in step 1 of parse a selector (and parse a relative selector) to some algorithm which feeds a string as the input stream into the tokenizer, then into the grammar. If the intention is to apply the tokenization rules only to CSS style sheets, then maybe a note in that section would be better.

Closed as Question Answered Commenter Satisfied css-syntax-3 selectors-4

domenic

All 9 comments

In Firefox that rule is built into the tokenizer, which is why it applies regardless of where the selector comes from.

I don't know why the rule wouldn't apply for the selectors coming from querySelector / other DOM APIs.

cc @SimonSapin

emilio on 7 Apr 2020

I'd suggest linking the word parse in step 1 of parse a selector (and parse a relative selector) to some algorithm which feeds a string as the input stream into the tokenizer, then into the grammar.

They both link to specific rules inside the grammar at https://drafts.csswg.org/selectors-4/#grammar. That section starts with:

Selectors are parsed according to the following grammar:

Which links to https://drafts.csswg.org/css-syntax-3/#css-parse-something-according-to-a-css-grammar which is ultimately based on the tokenizer.

Perhaps it should be clearer that https://drafts.csswg.org/css-syntax-3/#input-preprocessing is a required step at some point, CC @tabatkins

SimonSapin on 7 Apr 2020

Thanks. It sounds like this is a Chrome/Safari bug then, but I do think the spec could be clearer.

domenic on 7 Apr 2020

Yes, that's a Chrome/Safari bug.

I'm not sure how I could actually make the spec clearer; it seems to be laid out pretty explicitly to me:

As Simon says, Selectors links to https://drafts.csswg.org/css-syntax-3/#css-parse-something-according-to-a-css-grammar.
That's section 5.3.1; the upper-level section 5.3 https://drafts.csswg.org/css-syntax/#parser-entry-points says:

[The algorithms] assume that they are invoked on a token stream, but they may also be invoked on a string; if so, first perform input preprocessing to produce a code point stream, then perform tokenization to produce a token stream.

Afaict, that's completely explicit about what to do here.
If you skipped that and tried to follow the algorithm itself, step 1 of the algo links you to https://drafts.csswg.org/css-syntax/#parse-a-list-of-component-values, which links to https://drafts.csswg.org/css-syntax/#consume-a-component-value, whose first instruction is "consume the next input token"; if you're just assuming you have a string here, there's no such thing as an "input token" and you're in completely undefined territory.

(Closing as Question Answered, but if you can see any holes in this that I could patch up, feel free to comment or re-open.)

tabatkins on 7 Apr 2020

Ah, I do see that Selectors doesn't link to the "parse" algorithm in that algo, but only thru the indirection of the statement at the start of the grammar section. I could add a link to the instances of "parse" in those algos, no problem.

tabatkins on 7 Apr 2020

Thanks; that definitely helps. Without the link it seemed to me like we might be using some undefined notion of parse that was basically "match the grammar", or otherwise entered at some later stage in the css-syntax spec. The link helps a lot.

Ideally, parsing would invoke https://drafts.csswg.org/css-syntax-3/#input-preprocessing explicitly, instead of having that section effectively monkeypatch invocations of parse in a vague way (and in a way that requires finding https://drafts.csswg.org/css-syntax/#parser-entry-points before realizing that all the talk of bytes in that section is a distraction). But at least now it's clear that the tokenizer is involved in querySelector()-mediated selector parsing, so I can verify the answer, even if implementing it from scratch would not be possible by following the algorithm chain.

domenic on 7 Apr 2020

#1933 was similar; there I already suggested linking to https://drafts.csswg.org/css-syntax-3/#css-parse-something-according-to-a-css-grammar to avoid confusion.

Loirooriol on 8 Apr 2020

Just closing the loop on this issue: in addition to fixing Selectors to explicitly invoke the CSS/parse algo, I also rewrote the parsing algos to explicitly put the input normalization into their algo steps, rather than relying on a blanket instruction for how to do normalization at the beginning of the section (requiring you to know that such an instruction exists if you just follow a link straight to the algo).
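As a rough illustration of that restructuring, the sketch below has the entry point normalize its own input as an explicit first step, accepting either a string or a token list. Everything here is an assumption for illustration: the names `preprocess`, `tokenize`, `normalizeToTokenStream`, and `parseIdSelector` are hypothetical, and the tokenizer is a toy that knows only hash and delim tokens, far simpler than the real css-syntax-3 tokenizer or the Selectors grammar.

```javascript
// Toy preprocessing: newline normalization plus U+0000 -> U+FFFD.
function preprocess(input) {
  return input.replace(/\r\n|\r|\f/g, "\n").replace(/\u0000/g, "\uFFFD");
}

// Toy tokenizer: "#" followed by one or more non-whitespace code points
// becomes a hash token; any other code point becomes a delim token.
function tokenize(text) {
  const tokens = [];
  for (const m of text.matchAll(/#(\S+)|([\s\S])/gu)) {
    tokens.push(
      m[1] !== undefined
        ? { type: "hash", value: m[1] }
        : { type: "delim", value: m[2] }
    );
  }
  return tokens;
}

// Explicit normalization step: token lists pass through unchanged,
// strings are preprocessed and then tokenized.
function normalizeToTokenStream(input) {
  if (Array.isArray(input)) return input;
  return tokenize(preprocess(String(input)));
}

// Toy entry point: accepts only a single hash token (an ID selector).
function parseIdSelector(input) {
  const tokens = normalizeToTokenStream(input); // step 1 of the algorithm
  if (tokens.length === 1 && tokens[0].type === "hash") {
    return { id: tokens[0].value };
  }
  return null;
}

console.log(parseIdSelector("#\u{0}").id === "\u{fffd}"); // true
```

Because normalization lives inside the entry point, a reader who follows a link straight to `parseIdSelector` sees the preprocessing without needing to know about a blanket instruction elsewhere.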

tabatkins on 6 May 2020

Oh, and Domenic was happy with the more-explicit normalization when I asked him about it in another channel, so recording this as "Commenter Satisfied".

tabatkins on 6 May 2020
