HTML: Decreasing verbosity of HTML

Created on 22 Mar 2020 · 7 comments · Source: whatwg/html

The HTML standard requires that almost every element be closed with a full end tag such as </head>, </body>, or </style>, even when which element is being ended is obvious.

What I propose is that, rather than repeating the tag name, a bare </> could be used, as below:

<html>
 <head>
  <style>body{background-color:yellow;}</>
 </>
 <body>
  <h1>heading</>
  <p>text <b>bold text</> text</>
 </>
</>

The benefits of this method are as follows. HTML code will be:

  • easier to type,
  • smaller in size,
  • easier for computers to parse, without needing to check for errors such as closing the wrong element.

Most helpful comment

If the new feature is made optional then there will be no incompatibility problem.

If existing pages contain the text </>, which the parser currently throws away as a parse error, then their behavior will change.

That is, if you today type something like <div>foo</>bar</div>, you'll get a div containing the text "foobar". After this feature is added, you'll instead get a div containing "foo", and "bar" as a text-node sibling after it, as if you'd typed <div>foo</div>bar</div>. That is the backwards-incompatible change.
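For concreteness, here is that difference sketched out (the comments simply annotate the two parse results described above):

<!-- Today: the stray </> is a parse error and gets thrown away. -->
<div>foo</>bar</div>
<!-- Resulting DOM: one div whose text content is "foobar". -->

<!-- After the proposed change: </> would close the open div. -->
<div>foo</>bar</div>
<!-- Resulting DOM: a div containing "foo", with the text node "bar"
     as its following sibling, as if you'd typed <div>foo</div>bar</div>. -->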

Seem unlikely? Perhaps, but there are many trillions of HTML pages out there with all manner of badly-authored code, and we can't predict ahead of time how likely it is that this sort of content exists. We'd have to test, which takes some time and effort, and we only want to do that if the improvement is worth the cost.

Further, as stated, there are also lots of sanitization libraries that might (correctly, today) not strip that sort of tag from content; after this change, pages relying on them would become vulnerable to markup injection. Unlike the last point, we can't easily measure this, and it's potentially a more serious issue as well.


Separately, I'll also note that this clashes with another existing HTML feature, the ability to omit end tags from many elements. Without a tagname, the end tag becomes ambiguous in these situations as to which element should be closed. For example:

<div>
  <p>Here's some text, no need to close the paragraph element.
</div>
<div>Here's another div.</div>

This is valid today, and produces the obvious markup structure: a div containing a p. If you switch to the more concise end tag, though:

<div>
  <p>Here's some text, no need to close the paragraph element.
</>
<div>Here's another div.</>

Do you have the same structure, meaning the markup is now less obvious and you have to memorize the end-tag omission rules to know when you're allowed to use </>? Or does the first </> close the p, leaving the first div unclosed, so that the second div is now its child rather than its sibling? There was no question before, even if you wrote the code oddly; you'd just see a </div> and know it was closing the div, not the p.
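To make the ambiguity concrete, here are the two plausible readings of the snippet above, written out with explicit end tags (this simply restates the two structures described in the preceding paragraph):

<!-- Reading 1: </> honors the end-tag omission rules and closes the div. -->
<div>
  <p>Here's some text, no need to close the paragraph element.</p>
</div>
<div>Here's another div.</div>

<!-- Reading 2: </> closes the innermost open element, the p, leaving the
     first div open, so the second div becomes its child. -->
<div>
  <p>Here's some text, no need to close the paragraph element.</p>
  <div>Here's another div.</div>
</div>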

All 7 comments

<head> and <body> do not need to be closed.

<head> and <body> do not need to be closed.

I think you did not understand the main idea. Don't go deep into the details; just catch the main idea.

Furthermore, I think it should be made mandatory to close every element, as in XHTML, because not everyone can memorize which elements must be closed and which may be left open.

Unfortunately, such a change to the HTML parser would not be backwards compatible and might also have security implications.

Unfortunately, such a change to the HTML parser would not be backwards compatible and might also have security implications.

  1. If the new feature is made optional then there will be no incompatibility problem.
  2. I cannot imagine a security problem if closing tags can be shortened this way. Can you give me an example of such a "security implication"?
  3. I also did not understand why you closed this issue so quickly...

I cannot imagine a security problem if closing tags can be shortened this way. Can you give me an example of such a "security implication"?

Say you have an application that uses one of the various popular HTML “sanitization” libraries to transform user-input HTML content. These are usually whitelist based. Although generally such a whitelist omits both <script> and <style>, I’m certain some applications disallow the former and not the latter. Assuming that’s the case here, consider this user input:

<style>
  x::before {
    content: "</>";
  }

  y::before {
    content: "<script>alert('womp womp');</script>";
  }
</style>

Currently, this will be parsed as one style element with one text child. Even if one is additionally parsing and reserializing the stylesheet text (perhaps stripping any invalid selectors or unrecognized properties), the attack would end up getting preserved; it's presently legitimate CSS. In agents where </> could end the <style> element's content, though, the text from the second to the third quotation mark is character data, and <script> begins a script element. Existing sanitizers would fail. HTML can't introduce a feature knowing that it would create new XSS opportunities for every application that doesn't update to match the change (including “resanitizing” any existing content in storage).
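As a sketch of that failure mode, here is roughly how the same input would split apart under the hypothetical </> rule (the interspersed comments are annotations, not part of the input, and this is an illustration rather than a real parser trace):

<style>
  x::before {
    content: "</>";
<!-- under the new rule, the style element would end at the </> above;
     everything after it, starting with the stray "; is parsed as markup -->
  }

  y::before {
    content: "<script>alert('womp womp');</script>";
<!-- the <script> here becomes a live script element: the injection -->
  }
</style>
<!-- now a stray end tag with no open style element; ignored -->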

To be fair, HTML's parsing of the content of <script>, <style>, and a few other elements is special, so one could say these cases remain as they are and can never use </>. That would probably shut down many problems, but ...

This belongs to a class of issues that tend to arise any time one of these things is true:

  • previously non-fatal input is given a new, different meaning
  • a language has multiple interpretive “modes,” some source texts are valid examples of both modes, and interpreting some of those source texts in each mode produces different results

In HTML, there is no fatal input at all, so the surface for syntactic extension is very limited, pretty much absent: new features build on existing syntax, so that older agents still produce the same document structure when new elements or attributes are introduced. HTML also has a fragment parse goal, and fragments are often treated as portable (as in the example above), so one would need a reliable way to communicate the intended interpretive mode within the source text even when it is not a full document, e.g. a magic comment, or something that builds on the “bogus comment” productions and looks like a processing instruction or doctype declaration.
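Purely to illustrate that last point, a sketch of what such a marker could look like. HTML has no real processing instructions; input beginning with <? is tokenized through the bogus-comment path and becomes an inert comment node, which is roughly the shape an opt-in marker would need. The name short-end-tags is invented here for illustration, not an actual proposal:

<?short-end-tags?>
<!-- An old parser turns the line above into a comment node and renders
     nothing, so the document is unaffected; a hypothetical new parser
     could read it as opting this fragment into </> handling. -->
<div><b>bold</> text</>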
