Html: [Parser] Make `/` in unquoted attribute value a parse error.

Created on 11 May 2017  路  17Comments  路  Source: whatwg/html

@diervo discovered an interesting case of ambiguity in https://github.com/HTMLParseErrorWG/meta/issues/12:

<input data-attr=foo/>

will be parsed as

<input data-attr="foo/">

I believe we should report parse error in that case like we do for ", <, =, etc., since inclusion of / in an attribute value can be confusing here. There could be an argument against that it will make unquoted URLs with path a parse error, however we already report errors on URLs with query parameters (due to =). Another (a bit more tricky) option can be reporting error if we are in Attribute value (unquoted) state and we hit the > branch and previously consumed character is /.

document conformance parser

Most helpful comment

I think I prefer not fixing this. We've always regarded the / in <img/> as a talisman and we've advocated against "polyglot" on the basis that pretending / and other XML things have meaning will result in more confusion (it's much better to know <input> doesn't have a closing tag).

This seems more in the realm of lint tools and coding guidelines.

All 17 comments

Hypothesis: <a href=https://example.org/> is pretty common. It seems annoying to make that invalid.

Although the case you bring up does seem a bit like a footgun, especially for inline SVG.

Testing my hypothesis with httparchive...

SELECT * FROM (
SELECT page, url, REGEXP_EXTRACT(LOWER(body), r'(<[a-z][a-z0-9-]*\s(?:\s*[a-z0-9-:_]\s*(?:=\s*(?:"[^"]+"|\'[^\']+\'|(?:[^\s/>]+(?:/[^\s>])*)+))?)*[a-z0-9-:_]\s*=\s*[^"\'>\s]/>)') AS match
FROM [httparchive:har.2017_04_15_chrome_requests_bodies]
) WHERE match != "null"

(I was getting cross-eyed while writing that regexp, but then tried running it and it seems like it picked up what I was looking for!)

2693 rows. A quick look suggests many are like this one:

<img src="//bn.maist.jp/img/filler/maist_06.gif" border=0/>

Which elements and what's the last attribute?

  REGEXP_EXTRACT(match, r'^<([^\s]+)') AS elm,
  REGEXP_EXTRACT(match, r'\s(\S+)\s*=\s*\S+>$') AS attr
FROM [pelagic-breaker-785:test.unquoted_slash_gt]

Result: https://gist.github.com/zcorpan/e8735aee4f73a5b416f70e62ec80e7ef

$ grep ",src" results-20170511-171859.csv | wc -l
       3
$ grep ",href" results-20170511-171859.csv | wc -l
      20
$ grep ",border" results-20170511-171859.csv | wc -l
    2188
$ grep "img," results-20170511-171859.csv | wc -l
    2463

Well, it appears src/href are pretty rare, and img,border is the most common. Although border is itself invalid, I think we could do something about this.

Note that I didn't look for unquoted src/href attributes in general, this only tried to pick up the last attribute ending in />. So it's still possible that it is common to have unquoted src/href attributes with slashes, just not as the last attribute without space before >.

cc @sideshowbarker @hsivonen

@zcorpan

Note that I didn't look for unquoted src/href attributes in general, this only tried to pick up the last attribute ending in />. So it's still possible that it is common to have unquoted src/href attributes with slashes, just not as the last attribute without space before >.

Yes, exactly for this scenario we can use the second options from the OP post. Others are fine, I guess.

Second option SGTM.

I'm probably missing something obvious, but what the "second option"?

I agree that unquoted attributes interact with SVG integration in a way that's probably surprising to people who don't know the spec well.

However, complaining about <a href=https://github.com/>GitHub</a> would be bad. Here's why:

In the early days of HTML5, many people were advocating that one should always write HTML the way one writes XML: with all the quotes and all the end tags. And we were like "You really don't need to. It's well-defined. Type less and don't feel bad about it." We've had this position for over a decade, and the "You should always suspect everything and write like XML just in case." position has mostly gone quiet.

If we now recanted on this, we'd undermine the WHATWG's credibility as source of reliable requirements than don't change on whim and that don't need "defensive coding" in case they change. If we now changed this, we'd vindicate the story that you should always do something extra (quotes, end tags), just in case, because you never know how things work out if you do something easy and convenient and you can't trust things to actually work the way specs say.

I'm probably missing something obvious, but what the "second option"?

I think that means from https://github.com/whatwg/html/issues/2665#issue-227925394 this:

Another (a bit more tricky) option can be reporting error if we are in Attribute value (unquoted) state and we hit the > branch and previously consumed character is /.

For the record, I agree with @hsivonen鈥檚 conclusion and rationale in https://github.com/whatwg/html/issues/2665#issuecomment-303049936 that complaining about <a href=https://github.com/>GitHub</a> would be bad

OK. I agree in principle that complaining about such a link is bad, although the data suggests it is rare compared to genuine mistakes (mostly on img).

Here's option (3): like (2), but only for void elements and foreign elements.

I think I prefer not fixing this. We've always regarded the / in <img/> as a talisman and we've advocated against "polyglot" on the basis that pretending / and other XML things have meaning will result in more confusion (it's much better to know <input> doesn't have a closing tag).

This seems more in the realm of lint tools and coding guidelines.

Hmm. I prefer fixing this, as it's pretty clearly incorrect code.

In my opinion it should be an error whenever you intended the URL to be https://example.com/ but it ended up being https://example.com. I don't fully understand the various different options (or even how e.g. <a> vs. <img> parse those cases) but the more narrowly we can focus on that case the better.

The fact that the data shows it to be rare implies counteracts @hsivonen's concern in my opinion.

The above was written with an incorrect understanding of the problem; please ignore.

How would it end up without the slash?

Oh, I see, I lost the context of the original post -_-. Please ignore the above.

I guess now I have no opinion. I am OK with not protecting people who throw in extra /s from the consequences of their bad life choices ;).

@annevk I don't object to not fixing this, but what is your reasoning for inline SVG?

In my mind, not having a parse error here means it's prone to silent errors to use unquoted attributes in inline SVG.

@zcorpan wouldn't <image src=/> end up being parsed as <image src="/"> and therefore result in an error due to lack of a closing tag?

Yes. I guess there would usually be an error, but not pointing to the actual cause. But I suppose conformance checkers could special-case this and provide a helpful error message... it just wouldn't fall out from implementing the spec's parse errors.

I filed https://github.com/validator/validator/issues/528 to follow up on the better error message idea, but since there isn't overwhelming support for changing the standard, let's leave the standard as-is.

(At least normatively; possibly we could add something non-normative hinting about this issue. Comment or reopen if you think that should be done.)

Was this page helpful?
0 / 5 - 0 ratings