Cleanup around the HTML parser and how it sniffs encodings

Created on 20 Apr 2016 · 12 comments · Source: whatwg/html

Context
From the current prose I can't determine the behavior of Document.characterSet|charset|inputEncoding when loading a document. In addition, @annevk indicates that "getting an output encoding" should be used only when encoding, not when decoding, but algorithms that cooperate with the parser use it. It would be clearer to write "getting an output encoding to encode", because "encoding" means both the thing and the action. It looks like an unnecessary change was made in https://github.com/whatwg/html/pull/679/commits/6a31c26cf12e39dab1a488e75dd56c03d6786d39. You can revert it, or define "getting an output encoding to decode" (which maps UTF-16 to UTF-8 and x-user-defined to windows-1252) and use that where necessary.

But I'm still wondering where exactly the document's encoding is set when the input has a BOM. Where is the prose for that action? The encoding sniffing algorithm doesn't mention the BOM at all, and decode doesn't change the document's encoding.

clarification

All 12 comments

I created a fix for the regression.

The other problem is that the Encoding Standard ultimately decides what encoding is used for decoding, but has no means to communicate that back to the caller (the caller just gets a stream of code points).

Not sure how this would be described ideally. @hsivonen @inexorabletash, opinions?

In HTML, decode is used only by the HTML parser and when fetching a script. In the first case the BOM should correct the document's encoding, but for scripts I don't know whether the BOM affects any properties or slots.
Basically you could just make this prose clearer, e.g. add "Also, when decode changes the encoding (due to a BOM), then...".

Decode is also used in XHR (text response), and once again I don't know whether the BOM affects any properties or slots. But more interesting is the document response; I wonder whether the BOM has any effect there. I need to find some tests.

Re: Encoding - can the "decode" hook just return a tuple, i.e. "Return _output_ and _encoding_"? I see more work in HTML to refer to e.g. the "sniffed encoding" vs. the "used encoding" and to actually assign the latter to the document's character encoding, as @ArkadiuszMichalski mentions.

And then update XHR to ignore/use the returned encoding (depending on the results of @ArkadiuszMichalski's tests). Or, if we don't want to touch other dependent specs, do the same thing we'd do in a real language: rename "decode" to "detect encoding and decode" and add a "decode" with the old signature.

(But that seems obvious, so I feel like I'm missing something.)
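For concreteness, here is a minimal sketch of what a tuple-returning decode could look like (TypeScript; decodeWithEncoding is a hypothetical name, and only the three BOMs from the Encoding Standard are handled):

```ts
// Hypothetical tuple-returning "decode": sniff the BOM, decode, and report
// which encoding was actually used back to the caller.
function decodeWithEncoding(
  bytes: Uint8Array,
  fallback: string,
): { output: string; encoding: string } {
  let encoding = fallback;
  let start = 0;
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    encoding = "utf-8";
    start = 3;
  } else if (bytes[0] === 0xfe && bytes[1] === 0xff) {
    encoding = "utf-16be";
    start = 2;
  } else if (bytes[0] === 0xff && bytes[1] === 0xfe) {
    encoding = "utf-16le";
    start = 2;
  }
  const output = new TextDecoder(encoding).decode(bytes.subarray(start));
  return { output, encoding };
}
```

A caller such as the HTML parser or XHR could then assign the returned encoding to the document's character encoding instead of re-deriving it.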

@inexorabletash I guess it is rather obvious; I was just curious whether you had a preference.

Coming back to XHR: the text response is correct. Browsers recognize a BOM at the beginning, remove it, and change the encoding. Firefox has a bug with the UTF-16 BOM (the UTF-8 BOM works), and a test was created.

For the document response (step 5), Firefox and Chrome behave the same as the HTML parser: they respect <meta>, and when a BOM exists they change the encoding and set the document's encoding to the correct value. In short:

* UTF-8 (encoding in editor), iso-8859-16 (<meta>) = document.characterSet: iso-8859-16
* UTF-8 (encoding in editor), iso-8859-16 (<meta>), UTF-16LE BOM = document.characterSet: UTF-16LE
* UTF-16LE (encoding in editor), UTF-16LE (<meta>) = document.characterSet: UTF-8
* UTF-16LE (encoding in editor), UTF-16LE (<meta>), UTF-16LE BOM = document.characterSet: UTF-16LE
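The precedence these results show can be expressed as a small function. A hedged sketch in TypeScript (the function and parameter names are illustrative, not spec hooks); the UTF-16-to-UTF-8 mapping reflects the spec's rule that a UTF-16 label extracted from <meta> is treated as UTF-8:

```ts
// Sketch of the precedence observed above for document responses: a BOM
// wins over <meta>; otherwise the <meta> label is used, except that a
// UTF-16 label is mapped to UTF-8 (bytes readable by the ASCII-compatible
// prescan cannot actually be UTF-16); otherwise the fallback applies.
function sniffedDocumentEncoding(
  bomEncoding: string | null, // result of BOM sniffing, if any
  metaEncoding: string | null, // label extracted from <meta>, if any
  fallback: string, // e.g. from Content-Type, or the default
): string {
  if (bomEncoding !== null) return bomEncoding;
  if (metaEncoding !== null) {
    return /^utf-16(be|le)?$/i.test(metaEncoding) ? "utf-8" : metaEncoding;
  }
  return fallback;
}
```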

FWIW, "encoding in editor" has no effect here. It's not an input to the algorithm since it cannot be determined.

> Not sure how this would be described ideally. @hsivonen @inexorabletash, opinions?

I think the main problem with the integration of HTML and the Encoding Standard is that the Encoding Standard doesn't accommodate 1) incremental decoding and 2) internal encoding declarations.

The Encoding Standard makes BOM sniffing part of "decode", and the HTML Standard does not sniff for the BOM on its own, so spec-wise the UA is supposed to buffer up to 1024 bytes and sniff them for an internal encoding declaration before letting the "decode" step sniff for the BOM.

This does not match a reasonable implementation, such as Gecko. Gecko sniffs for the BOM incrementally as bytes arrive (it doesn't even wait for 3 bytes, as the Encoding Standard says to, when fewer bytes already make a decision possible) and skips the rest of the sniffing if a BOM was found.
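To illustrate, a minimal sketch (TypeScript; makeBomSniffer is a hypothetical name, and this is not Gecko's actual code) of a BOM sniffer that is fed one byte at a time and decides as early as possible:

```ts
// Illustrative incremental BOM sniffer: the returned function is fed bytes
// as they arrive and decides as early as possible.
//   string    -> BOM found, with its encoding
//   null      -> definitely no BOM
//   undefined -> undecided, feed another byte
function makeBomSniffer(): (byte: number) => string | null | undefined {
  const seen: number[] = [];
  return (byte) => {
    seen.push(byte);
    if (seen[0] === 0xef) {
      // Possible UTF-8 BOM (EF BB BF): decide at the first mismatch.
      if (seen.length === 2 && seen[1] !== 0xbb) return null;
      if (seen.length < 3) return undefined;
      return seen[2] === 0xbf ? "utf-8" : null;
    }
    if (seen[0] === 0xfe) {
      // Possible UTF-16BE BOM (FE FF).
      return seen.length < 2 ? undefined : seen[1] === 0xff ? "utf-16be" : null;
    }
    if (seen[0] === 0xff) {
      // Possible UTF-16LE BOM (FF FE).
      return seen.length < 2 ? undefined : seen[1] === 0xfe ? "utf-16le" : null;
    }
    // Any other first byte already rules out every BOM.
    return null;
  };
}
```

Feeding it 0x3C ("<") as the first byte returns null immediately, with no waiting for three bytes, matching the behavior described for Gecko.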

I think the Encoding Standard should cater to the existence of languages like HTML and CSS that support non-BOM internal encoding declarations and allow the language-specific sniffing algorithms to be plugged in between BOM sniffing and the actual decode phase.

Spec-wise, if you want the Encoding Standard to continue to own BOM sniffing, I suggest extending the current "decode" algorithm so that it takes three inputs: a byte stream, a fallback encoding, and a (potentially null) sniffing state machine for an internal encoding declaration. Then make it output a stream of characters and the encoding actually used. Require that the sniffing state machine be able to operate incrementally, so that it yields an encoding as soon as it has seen enough, and make the "decode" algorithm feed it up to 1024 bytes as soon as the bytes arrive, not waiting for 1024 bytes first. (It would also be nice for the BOM sniffing not to buffer 3 bytes first if the first byte already can't be the first byte of a UTF-8 BOM.)

Then have HTML and CSS specify their calls to decode as passing in a sniffing state machine.
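A non-streaming sketch of that extended signature (TypeScript; Sniffer, extendedDecode and the interface shape are all hypothetical, just one possible reading of the proposal):

```ts
// Hypothetical extended "decode": a byte stream, a fallback encoding, and
// an optional incremental sniffer for an internal (non-BOM) encoding
// declaration. Returns the characters plus the encoding actually used.
type Sniffer = (byte: number) => string | null | undefined;

function extendedDecode(
  bytes: Uint8Array,
  fallback: string,
  sniffer?: Sniffer,
): { output: string; encoding: string } {
  // 1. BOM sniffing happens first and always wins.
  let encoding: string | null = null;
  let start = 0;
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    encoding = "utf-8";
    start = 3;
  } else if (bytes[0] === 0xfe && bytes[1] === 0xff) {
    encoding = "utf-16be";
    start = 2;
  } else if (bytes[0] === 0xff && bytes[1] === 0xfe) {
    encoding = "utf-16le";
    start = 2;
  }
  // 2. No BOM: feed the language-specific sniffer up to 1024 bytes,
  //    stopping as soon as it reaches a decision (undefined = undecided).
  if (encoding === null && sniffer) {
    for (let i = 0; i < Math.min(bytes.length, 1024); i++) {
      const decision = sniffer(bytes[i]);
      if (decision !== undefined) {
        encoding = decision;
        break;
      }
    }
  }
  // 3. Decode and report the encoding that was actually used.
  const used = encoding ?? fallback;
  return {
    output: new TextDecoder(used).decode(bytes.subarray(start)),
    encoding: used,
  };
}
```

(A streaming implementation would instead drive the sniffer incrementally as bytes arrive, as the next paragraph describes.)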

Implementation-wise, I wouldn't make users of a streaming encoding conversion API pass in the state machine, but instead make them perform BOM sniffing first, search for an internal encoding declaration second, and then invoke an incremental decoder.

FWIW, encoding-rs provides a low-level streaming API for encoders and decoders as well as high-level non-streaming convenience methods. The convenience methods decode and decode_with_replacement return a pair whose second item is the encoding that was actually used. (Likewise with encode_with_replacement. There's currently no convenience method for encoding without replacement, because I'm not aware of a use case in the Web Platform considering that encoding to UTF-8 is the same in both cases as UTF-8 has no unmappables.)

> Firefox has a bug with the UTF-16 BOM (the UTF-8 BOM works)

I checked recently and the UTF-8 BOM didn't work either. It's one of the Gecko bugs I hope encoding-rs will fix by providing BOM sniffing right in the decoder objects, for callers like text/plain XHR.

Oh, and the above doesn't account for the HTML spec currently allowing heuristic sniffing. If you still want to allow that, it's probably easier to make the HTML spec use a distinct BOM sniffing hook than to let BOM sniffing in the HTML context be baked into "decode".

> I checked recently and the UTF-8 BOM didn't work either. It's one of the Gecko bugs I hope encoding-rs will fix by providing BOM sniffing right in the decoder objects, for callers like text/plain XHR.

Ah, you're right, no BOM worked. But when I fetched a UTF-8 file with a UTF-8 BOM (the first 3 bytes), the BOM was somehow removed, because the first character when reading responseText was not a BOM (this confused me). So I set the header to another encoding (like ISO-8859-2), and when reading responseText the BOM was there, but the encoding was not changed to UTF-8 (it stayed ISO-8859-2).

I was recently pointed to this issue, and I find it surprising that this problem hasn't yet been fixed in the specifications, even though it's not an issue in any of the three major browser engines, where the document's character encoding follows the BOM.

To recap, the several fixes proposed, from easiest to hardest to implement, are:

  1. Add a new hook to the Encoding Standard that takes a byte stream and returns the BOM's encoding, if any (see the sketch after this list). This could then be used in both Encoding's decode and HTML's encoding sniffing algorithms. @domenic implemented this in jsdom (https://github.com/whatwg/html/issues/1910#issuecomment-254017369), and from a conversation on IRC he still considers it a valid solution for the specifications.
  2. Have Encoding's decode algorithm return a tuple of the decoded stream and an encoding, as proposed by @inexorabletash, for example.
  3. Finally, @hsivonen proposes that decode take an algorithm / state machine as an additional optional parameter, so that the sniffing for format-specific encoding declarations takes place at the same time as the BOM sniffing, in order to better match implementations.
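The hook in option 1 is small; a hedged sketch (TypeScript, with a hypothetical name):

```ts
// Sketch of option 1's hook: given the first bytes of a stream, return the
// BOM's encoding, if any. Both Encoding's decode and HTML's encoding
// sniffing algorithm could then call this single definition.
function bomSniff(
  bytes: Uint8Array,
): "utf-8" | "utf-16be" | "utf-16le" | null {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "utf-8";
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  return null;
}
```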

While it's worth matching the internal behavior of implementations, any of these options would agree with the observable behavior. Furthermore, option 3 fails to take into account MIME Sniffing's algorithm to sniff a mislabeled feed, which should be implemented in parallel with HTML's encoding sniffing (though I haven't checked what actual browsers do).

In any case, if a BOM is found, the encoding's confidence should be certain. While there is no real need to prevent user agents from changing the encoding while parsing, since the effective encoding would be identical, this helps make clear that there's no need to reparse.

I like 1. It adds some redundancy, but that's okay as these are legacy callers. New places would exclusively use UTF-8 anyway.

I'm working on PRs for fix 1, which I'll open in a few days if no one disagrees. @inexorabletash? @hsivonen?
