Toml: Properly define Unicode scalar format

Created on 30 Mar 2020 · 5 Comments · Source: toml-lang/toml

Any Unicode character may be escaped with the \uXXXX or \UXXXXXXXX forms. The escape codes must be valid Unicode scalar values.

Which links to a further explanation...

the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
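
The quoted range can be written as a small check. This is only a minimal Python sketch: the function name `is_unicode_scalar_value` is hypothetical and does not come from the TOML spec or any library.

```python
def is_unicode_scalar_value(n: int) -> bool:
    """Return True if n is a Unicode scalar value.

    Scalar values are the code points 0..D7FF and E000..10FFFF (hex),
    inclusive, which leaves out the surrogate range D800..DFFF.
    """
    return 0 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF


print(is_unicode_scalar_value(0x41))      # True: 'A'
print(is_unicode_scalar_value(0xD800))    # False: a surrogate
print(is_unicode_scalar_value(0x110000))  # False: past the last code point
```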

I can't find a clear definition of the format the escape codes should be in, which makes writing a conformant parser impossible.

Within the tests, the X characters (the digits) are always within the range [0-9A-Z].

It seems they are hex-encoded byte sequences containing Unicode byte sequences, with the prefix selecting the encoding:

| Prefix | Encoding Scheme |
| ------ | ------------------- |
| \u | UTF-16BE |
| \U | UTF-32BE |

It's worth noting that many parsers (e.g. toml-rs) also permit a-z as hex characters.

Plecra

Most helpful comment

@Plecra My point about endianness was just referring to the order in which you write a hex scalar sequence; the least-significant bytes in the scalar sequence go at the least significant end of the written number (the right-hand-side), just like the way you write conventional decimal numbers. I guess if you think about it in terms of reading order, that does make it big-endian O_o.

Having said that, it seems in trying to disambiguate I've actually added unnecessary ambiguity, and it's totally irrelevant. Key takeaway is that the scalar sequence has nothing to do with the endianness of the integers, per @BurntSushi's point.

marzer on 30 Mar 2020

❤1 👍1

All 5 comments

The linked document you're referring to specifies hexadecimal using the 16 subscript. The X characters must be [a-fA-F0-9].

marzer on 30 Mar 2020

👍1

As for the meaning of the two separate forms, it has nothing to do with encoding schemes: Unicode scalar escape sequences always refer to a 32-bit code point, and are always written as little-endian.

The two different forms are just different digit allowances: you can use \uXXXX if you don't need more than 4 hex digits (in which case \UXXXXXXXX would be annoying overkill). It's the same as in C and many other languages with string escape sequences.

marzer on 30 Mar 2020

Good to know! I hope we can pop that explanation of the hexadecimal encoding in the documentation and add some tests for [a-z] (typo: [a-f]).

One little thing: what do you mean by little-endian there? A has the scalar value 65, or 41 in base 16, so I'd expect to escape it like \u0041, which I think is big-endian? It'd be parsed as 4 * (16 ^ 1) + 1 * (16 ^ 0), with higher significance given to the earlier address. Maybe I'm misunderstanding endianness.
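
The arithmetic in that question can be checked directly. The following is a rough Python sketch (the variable names are made up for illustration) that reads the digits of \u0041 left to right as an ordinary positional hex number:

```python
digits = "0041"  # the four hex digits from the escape \u0041

# Accumulate left to right: each new digit shifts the running value
# up by one hex place, so the leftmost digit is the most significant.
value = 0
for d in digits:
    value = value * 16 + int(d, 16)

assert value == 4 * 16**1 + 1 * 16**0 == 65
assert value == int(digits, 16)  # the built-in conversion agrees
assert chr(value) == "A"
```

Nothing here depends on byte order: the digits are just a written number, most significant digit first.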

Plecra on 30 Mar 2020

Endianness has nothing to do with this. Encodings have nothing to do with this either. The format is a normal hex number with 4 or 8 digits. That's it.

And it's not [a-z]. It's [a-f].
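
Putting the points from this thread together (a plain hex number, exactly 4 or 8 digits, digits in [0-9A-Fa-f], result must be a Unicode scalar value), a decoder for a single escape might look like the sketch below. It is an illustration in Python, not the reference behaviour of any TOML parser, and `decode_unicode_escape` is a hypothetical helper name.

```python
import re

# Exactly 4 hex digits after \u, or exactly 8 after \U.
# Both upper- and lower-case a-f are accepted; g-z are not.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escape(escape: str) -> str:
    """Decode one \\uXXXX or \\UXXXXXXXX escape into a single character.

    Hypothetical helper for illustration. Raises ValueError if the escape
    is malformed or its value is not a Unicode scalar value.
    """
    m = _ESCAPE.fullmatch(escape)
    if m is None:
        raise ValueError(f"malformed escape: {escape!r}")
    # Whichever alternative matched, its digits are a normal hex number.
    value = int(m.group(1) or m.group(2), 16)
    # Reject surrogates (D800..DFFF) and anything above 10FFFF.
    if 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
        raise ValueError(f"not a Unicode scalar value: {value:#x}")
    return chr(value)


print(decode_unicode_escape(r"\u0041"))      # A
print(decode_unicode_escape(r"\U00000041"))  # A
```

Both forms give the same character for the same number; the only difference is how many digits must be written. Inputs such as `\uD800` or `\U00110000` raise `ValueError` under this sketch.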