Hi 👋
This is a follow-up to #5414, specifically about 4.3.4. Consume an ident-like token, in the case where the resulting identifier string is a match for url.
Quote from PR ( @tabatkins ):
Ah, looking at the history, I changed _from_ your suggested text to the current text in csstools@5f67386, and now that I see it, I understand why - if there is any whitespace between the open-paren and the string, I need to preserve it, so the next token produced is a whitespace token.
(I can't just emit the function-token immediately and let normal whitespace processing handle that; I need to scan forward and see if it's going to be a normal function (containing a string) or if it needs to be specially parsed as a url-token (unquoted).)
From specs:
If string's value is an ASCII case-insensitive match for "url", and the next input code point is U+0028 LEFT PARENTHESIS ((), consume it. **While the next two input code points are whitespace, consume the next input code point.** If the next one or two input code points are U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), or whitespace followed by U+0022 QUOTATION MARK (") or U+0027 APOSTROPHE ('), then create a <function-token> with its value set to string and return it. Otherwise, consume a url token, and return it.
I just want to clarify if I understood it correctly and if maybe the wording could be improved.
From my understanding, this can return either a <function-token> or a <url-token> (or <bad-url-token>). What's a bit confusing while reading this is that, in the text highlighted in bold, it _seems_ that white space should be consumed regardless of whether a <function-token> or a <url-token> is returned. For a <url-token> that makes sense, but if this turns out to be a <function-token> (because the next input code point is U+0022 QUOTATION MARK (") or U+0027 APOSTROPHE (')), then the white space shouldn't have been consumed. In other words, if it turns out to be a <function-token>, the tokenizer should reconsume from the code point right after U+0028 LEFT PARENTHESIS ((). Right? That way the white space after url( (if any) could be consumed next as a <whitespace-token>, and the U+0022 QUOTATION MARK (") or U+0027 APOSTROPHE (') as a <string-token> (or <bad-string-token>).
The slightly awkward algorithm ensures that, if it turns out that it needs to emit a function-token, and there was whitespace between the ( and the ", it'll leave one character of whitespace for the tokenizer to pick up on the next pass so it can emit a whitespace token.
The tokenizer already collapses runs of adjacent whitespace into a single whitespace token, so the fact that I consumed a bunch of whitespace characters as part of producing the preceding token isn't observable. The benefit of this is that I don't need to do arbitrary lookahead from the ( to discover if, after an arbitrary number of whitespace characters, I eventually run into a "; instead I only need to look two characters ahead.
(Overall, the tokenizer requires three characters of lookahead, and the parser requires one token of lookahead; keeping that minimal is good for the efficiency of implementations.)
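To make the two-code-point lookahead concrete, here is a minimal sketch of just the decision described above. It is not taken from the spec or from any implementation: the function name `classify_after_url_paren`, the `(kind, position)` return shape, and the assumption that the input is already preprocessed (so whitespace is only space, tab, or newline) are all hypothetical choices made for illustration.

```python
from typing import Tuple

# Assumption: input is already preprocessed, so whitespace is only these three.
WHITESPACE = {" ", "\t", "\n"}
QUOTES = {'"', "'"}


def classify_after_url_paren(css: str, pos: int) -> Tuple[str, int]:
    """Decide what follows `url(`, looking at most two code points ahead.

    `pos` is the index of the first code point after the `(`.
    Returns (kind, new_pos): `kind` is "function-token" or "url-token",
    and `new_pos` is where tokenizing would continue from.
    (Hypothetical helper; a real tokenizer would go on to consume the url.)
    """

    def peek(offset: int) -> str:
        i = pos + offset
        return css[i] if i < len(css) else ""  # "" stands in for EOF

    # While the next two code points are whitespace, consume one of them.
    # This always leaves at least one whitespace code point unconsumed.
    while peek(0) in WHITESPACE and peek(1) in WHITESPACE:
        pos += 1

    first, second = peek(0), peek(1)
    if first in QUOTES or (first in WHITESPACE and second in QUOTES):
        # Quoted argument: emit the function token now. Any remaining single
        # whitespace code point becomes a whitespace token on the next pass.
        return "function-token", pos
    # Unquoted argument: this is where "consume a url token" would start.
    return "url-token", pos
```

For `url(   "a")` the sketch stops with exactly one space left in front of the quote, so the next pass can still produce a whitespace token followed by a string token. That matches the point above: skipping the earlier spaces is not observable because a run of whitespace collapses into one token anyway.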