Respec: HTML Entities should be converted when generating XHTML

Created on 19 Feb 2020 · 16 Comments · Source: w3c/respec

Important info

URL to affected spec: https://w3c.github.io/pub-manifest/ (just as an example, it is a general issue)
ReSpec version: 25.2.01
[x] I did a "hard refresh", but it's still busted.

Description of problem

If I generate XHTML (as opposed to HTML), all browsers hit an XML error and stop. The reason: the HTML file contains HTML entities, like &nbsp;, which are kept verbatim in the XHTML. Browsers do not understand these in XML; they need the numeric equivalent (in this case &#160;).
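The failure described above can be reproduced outside a browser with any strict XML parser. The following is a minimal sketch in Python, assuming the standard library's `xml.etree.ElementTree` as a stand-in for the browser's XML parser; the helper name `parses_as_xml` and the sample markup are made up for illustration and are not taken from ReSpec.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if the markup is well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for, among other things, an entity that XML does not define.
        return False


# Illustrative fragments: the same text with a named and a numeric reference.
named = "<p>ReSpec&nbsp;[1]</p>"
numeric = "<p>ReSpec&#160;[1]</p>"

print(parses_as_xml(named))    # the named HTML entity is undefined in plain XML
print(parses_as_xml(numeric))  # the numeric character reference is always valid
```

Only `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;` are predefined in XML, which is why the named form fails unless a DTD declares it.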

Source

iherman

All 16 comments

Um... Ivan... serious question: isn't xhtml deprecated? Or is this for epub (which I think also now supports plain old HTML)?

marcoscaceres on 19 Feb 2020

Well... I do not know whether xhtml is deprecated or not, but I don't think so. Looking at §1.6 of the latest spec suggests otherwise. (I agree, though, that most, if not almost all, people producing Web pages these days use HTML.)

Yes, at present, it is important for EPUB. At this moment, the official EPUB3 relies on XHTML, per Core Media Types and also EPUB Content Documents 3.2. Many reading systems do accept HTML too, but that only means HTML is tolerated. I have not checked, but I would expect that the EPUB checker software, which most publishers use extensively, rejects HTML or at least issues a warning (@dauwhe, is that correct?). I would hope that this changes in an upcoming release, but that is where we are now.

iherman on 19 Feb 2020

I can't reproduce this. First, the suggested affected spec does not include &nbsp;, and second, the exporter correctly converts entities into the actual characters (e.g., &nbsp; into a non-breaking space character and &copy; into ©).

saschanaz on 19 Feb 2020

@saschanaz, thanks for checking. Here is what I realized:

The &nbsp; is indeed not in the original respec code. Respec adds a non-breaking space for each reference (which is great!)
You are right that on Chrome and Firefox the results are non-breaking space characters. However, on Safari (on a Mac) the resulting file does include the &nbsp; entity and not the Unicode character.

I.e., my original assumption about the bug was indeed wrong; the problem seems to be that something does not work as expected on Safari...
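For output like the one Safari produces here, one conceivable stopgap is to post-process the serialized string and rewrite named HTML entities as numeric character references. The sketch below is only an illustration of that idea, not something ReSpec does: the function name `named_to_numeric` is hypothetical, it relies on Python's `html.entities.name2codepoint` table (HTML 4 names only), and a plain regular expression like this would also touch entity-like text inside CDATA sections or scripts.

```python
import re
from html.entities import name2codepoint

# The five entities XML itself predefines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named references such as &nbsp; but not numeric ones such as &#160;.
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup: str) -> str:
    """Rewrite HTML named entities as numeric character references (hypothetical helper)."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Keep XML's own entities and any unknown names exactly as written.
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return _ENTITY_RE.sub(replace, markup)


print(named_to_numeric("<p>a&nbsp;b &copy; 2020 &amp; &lt;tag&gt;</p>"))
# prints: <p>a&#160;b &#169; 2020 &amp; &lt;tag&gt;</p>
```

Numeric references are used rather than the literal characters so that the result stays independent of the file's declared encoding.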

(I use the latest released version of Safari)

iherman on 19 Feb 2020

That sounds like an XMLSerializer implementation bug in Safari. I don't have a Mac, so my ability to test it is limited; could you file an issue on the WebKit side?

saschanaz on 19 Feb 2020

Hm. Is it 'just' an XMLSerializer bug or something in the context? I must admit I am not familiar with that level of APIs, so it is a bit awkward for me to raise a bug whose details I do not really understand...

iherman on 19 Feb 2020

In that case I can do it myself. Could you confirm that this minimal repro shows &nbsp; on Safari? It does on my Epiphany but not on Firefox or Chrome, so it should definitely be something about XMLSerializer.

saschanaz on 19 Feb 2020

Confirmed, it's a Safari bug.

marcoscaceres on 19 Feb 2020

@saschanaz can you help me write the bug report for WebKit? I'm happy to file it. Does this sound ok?

Steps to reproduce:

Open https://codepen.io/SaschaNaz/pen/zYGKvOQ

Expected:
Serializing the &nbsp; should result in a non-breaking space character in the output. See Chrome and Firefox, which replace the &nbsp; with the correct code point.
Actual:
The serializer spits out:
&nbsp;

I'm a bit unsure if the above is correct... as when I check Firefox, it is correct:
[Screenshot: Firefox]

But Chrome outputs:
[Screenshot 2020-02-20 07 55 54]

marcoscaceres on 19 Feb 2020

(we will probably need to consult the HTML spec to see what is supposed to happen to entities upon XML serialization... @travisleithead, as editor of the serialization spec, maybe you can save us a bit of time? Is Chrome/Safari right? or is Firefox right?)

marcoscaceres on 19 Feb 2020

Well, the spec aligns with Firefox/Chrome behavior.
https://w3c.github.io/DOM-Parsing/#xml-serializing-a-text-node
Since &nbsp; is just a Text node in the DOM, it is serialized as a space since there's no need to entity-encode anything there.

Having said that, a lot of this spec was fiction, and there's work to be done (by someone at some point) to try to align it with implementations :) But to me, Firefox/Chrome's behavior makes the most sense since it doesn't require a special case. (E.g., I assume not all spaces get translated to &nbsp; when serialized.)
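As a rough illustration of the Text node rule in the linked section: only the characters that are markup-significant in XML text get escaped, and everything else, including U+00A0, is written out as-is. The sketch below is a Python paraphrase of that step under this reading of the spec; the function name `serialize_text_node` is made up for the example and is not a browser API.

```python
def serialize_text_node(data: str) -> str:
    """Escape only the characters that are markup-significant in XML text."""
    # "&" is replaced first so that the "&" introduced by "&lt;" and "&gt;"
    # is not escaped a second time.
    return (
        data.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )


# Text node data as the DOM holds it: &nbsp; has already become U+00A0.
text = "a\u00a0b < c & d"
serialized = serialize_text_node(text)

# The non-breaking space passes through as a literal character.
print(serialized == "a\u00a0b &lt; c &amp; d")  # prints: True
```

Under this rule there is no step at which a serializer would turn U+00A0 back into a named entity, which matches the Firefox/Chrome output discussed above.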

travisleithead on 20 Feb 2020

👍1

@marcoscaceres That's just the DOM inspector. Try $0.textContent on your console and it will show " " as expected.

Edit: I mean your draft looks good 👍

saschanaz on 20 Feb 2020

🎉2

ah! thanks for checking/explaining that @saschanaz. You are the best! Ok, so WebKit bug it is.

marcoscaceres on 20 Feb 2020

Ok, filed:
https://bugs.webkit.org/show_bug.cgi?id=207976

Closing, as it's not something we can fix here :) XHTML lives to fight another day!

marcoscaceres on 20 Feb 2020

Thanks also @travisleithead for the explanation and helping understand the state of that spec. Hopefully we can get around to updating it!

marcoscaceres on 20 Feb 2020

@marcoscaceres @saschanaz @travisleithead thanks for carrying this through. @saschanaz I only saw your note when starting my day a few minutes ago (the joys of cooperating over diverse time zones 😄)

iherman on 20 Feb 2020

👍1
