Zig: allow unicode characters in character literals

Created on 23 Mar 2019 · 30 Comments · Source: ziglang/zig

While solving #2088 I am about to push a change that makes this test pass:

test "unicode escape in character literal" {
    var a: u24 = '\U01f4a9';
    expect(a == 128169);
}

This makes sense since character literals are just comptime_int. There's no footgun here because you can't accidentally misuse it:

test "aoeu" {
    var str = "hello";
    str[1] = '\U01f4a9';
}

```
/home/andy/dev/zig/build/test.zig:5:14: error: integer value 128169 cannot be implicitly casted to type 'u8'
    str[1] = '\U01f4a9';
             ^
```

With that in mind, I think it makes sense to allow utf-8 characters in character literals, since [we have UTF-8 source encoding](https://ziglang.org/documentation/master/#Source-Encoding). I propose this test should pass:

```zig
const std = @import("std");

test "utf8 character literal" {
    const x = '💩';
    std.testing.expect(x == 128169);
}
```

accepted contributor friendly proposal

andrewrk

👍4

All 30 comments

wait since we have utf8 source encoding can 💩, or あい be a valid identifier? or was this already discussed in a previous issue?

emekoi on 23 Mar 2019

@emekoi that is explicitly rejected, except when using the C ABI, in which case you can use @"any utf-8 string, with arbitrary bytes using \x00 syntax"

shawnl on 23 Mar 2019

\U01f4a9

Why did we pick an uppercase U for this rather than lowercase? Most languages use \u for unicode characters.

While on this topic; was there any consideration of \u{01f4a9} with braces?

daurnimator on 24 Mar 2019

While on this topic; was there any consideration of \u{01f4a9} with braces?

It isn't necessary. There is a single quote ' at the end, so it is unambiguous. However, there is a big difference: \u covers the Basic Multilingual Plane (BMP), where the most useful stuff is, while \U covers the supplementary planes (such as Egyptian hieroglyphs and emoji). See this chart: https://www.unicode.org/roadmaps/bmp/

shawnl on 29 Mar 2019

It isn't necessary. There is a single quote ' at the end, so it is unambiguous. However, there is a big difference: \u covers the Basic Multilingual Plane (BMP), where the most useful stuff is, while \U covers the supplementary planes (such as Egyptian hieroglyphs and emoji). See this chart: https://www.unicode.org/roadmaps/bmp/

I think having two different escapes for this is really confusing. Instead there should just be one form of unicode escape.

The \u{} syntax is used by JavaScript (since ES6), Lua (since 5.3), and Swift (which switched away from our current syntax!), and seems to be an accepted recent improvement in languages.

daurnimator on 29 Mar 2019

👍3

Oh, yeah, I wasn't seeing that it needs to be consistent with string literal escape syntax. Yeah, I agree \u{} is clearer, but \x still has to be supported as is (where it is the same as in C), because \x80 is different from \u{80}. \u{} should accept 1, 2, or 3 bytes.

shawnl on 29 Mar 2019

@shawnl agreed, \x should be kept, \uXXXX and \UXXXXXX should be replaced with \u{X}.

daurnimator on 29 Mar 2019
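
To illustrate why \x has to stay alongside \u{}, here is a minimal sketch of the difference between \x80 and \u{80}. It assumes a recent Zig release in which the \u{...} escape is available and `std.testing.expect` returns an error union; the test name is made up for this example.

```zig
const std = @import("std");

test "raw byte escape versus unicode escape" {
    // "\x80" is a single raw byte; on its own it is not valid UTF-8.
    const raw = "\x80";
    try std.testing.expectEqualSlices(u8, &[_]u8{0x80}, raw);

    // "\u{80}" is code point U+0080, stored as its two-byte UTF-8 encoding.
    const encoded = "\u{80}";
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xC2, 0x80 }, encoded);

    // In a character literal both escapes yield the same integer, 128.
    try std.testing.expect('\x80' == '\u{80}');
}
```

Inside a string literal the two escapes produce different bytes, while inside a character literal they produce the same number.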

@andrewrk can we get an approved on the new character literal syntax, which will also apply for escape sequences inside strings? i'd like to code it up.

shawnl on 29 Mar 2019

@shawnl you've got your approved, see #2129 :)

daurnimator on 29 Mar 2019

This makes sense since character literals are just comptime_int. There's no footgun here because you can't accidentally misuse it:

I think I found a problem:

const std = @import("std");

test "character literal" {
    std.debug.warn("{}\n", .{'\xfd'});
    std.debug.warn("{}\n", .{'ý'});
}

This prints 253 twice.

daurnimator on 13 Apr 2020

Yeah, the latin-1 feature of unicode is weird.

shawnl on 13 Apr 2020

Yeah, the latin-1 feature of unicode is weird.

The issue is not the latin-1 set, but that character literals are defined to be both bytes and unicode codepoints.

daurnimator on 13 Apr 2020

There is only an ambiguity with the latin-1 set, and the unicode codepoint matches the bytes of that set. This is a unicode feature. If you want this to be non-ambiguous we could switch to using a utf-8-as-32-bit-number representation of these unicode points.

shawnl on 13 Apr 2020

@shawnl Unless I'm missing something, it's not true in general that byte values in the range [0x00, 0xFF] map to Unicode codepoints in the range [U+000000, U+0000FF]. That happens to be true for ISO-8859-1, but in UTF-8 the correct byte sequence for U+0000FD is 0xC3 0xBD.

@daurnimator I agree that you found a problem. There definitely shouldn't be an implicit conversion from unicode codepoints in the range [U+0080, U+10FFFF] to u8 in any context where UTF-8 is understood to be the relevant encoding.

CantrellD on 4 May 2020
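
The distinction between the code point U+00FD and its UTF-8 bytes can be checked with the standard library. This is a minimal sketch, assuming a recent Zig release and a UTF-8 encoded source file; it uses `std.unicode.utf8Encode` and `std.unicode.utf8ValidateSlice`, and the test name is made up for this example.

```zig
const std = @import("std");

test "U+00FD is one code point but two UTF-8 bytes" {
    // The character literal is the code point value, 253.
    const cp: u21 = 'ý';
    try std.testing.expect(cp == 0xFD);

    // Encoding that code point as UTF-8 takes two bytes: 0xC3 0xBD.
    var buf: [4]u8 = undefined;
    const len = try std.unicode.utf8Encode(cp, &buf);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xC3, 0xBD }, buf[0..len]);

    // The string literal "ý" stores those same two bytes.
    try std.testing.expectEqualSlices(u8, "\xc3\xbd", "ý");

    // The single byte 0xFD is not valid UTF-8 on its own.
    try std.testing.expect(!std.unicode.utf8ValidateSlice("\xfd"));
}
```

So 'ý' and '\xfd' agree as integers, but only the two-byte sequence is a valid UTF-8 encoding of the character.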

Unless I'm missing something,

You are.

shawnl on 4 May 2020

@shawnl Please elaborate? The relevant Unicode code pages seem to be Basic Latin and Latin-1 Supplement. Basic Latin only defines code points up to U+00007F. And the UTF-8 encoding for U+000080 is 0xC2 0x80. I'd like to understand what my mistake is, so I can avoid it in the future.

CantrellD on 4 May 2020

U+80 through U+FF correspond to bytes 80 through FF of latin-1 supplement, which is exactly the feature this bug is complaining about. Sorry for coming across snarky.

shawnl on 4 May 2020

U+80 through U+FF correspond to bytes 80 through FF of

ISO-8859-1 code page.

shawnl on 4 May 2020

ISO-8859-1 can only encode the first 256 Unicode codepoints, but character literals in Zig can be used for codepoints outside that range. Does the encoding change from ISO-8859-1 to UTF-8 for codepoints in the range [U+000100, U+10FFFF]? That doesn't seem right, but I don't know how else the current behavior could be correct.

CantrellD on 4 May 2020

ISO-8859-1 can only encode the first 256 Unicode codepoints

The problem is the overlap (which is a feature). If you want to avoid that, then instead of representing the literal as a Unicode code point, you have to represent it as zero-padded UTF-8.

shawnl on 4 May 2020
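
One possible reading of the "zero-padded UTF-8" idea, as a rough sketch only: pack the UTF-8 bytes of a code point into a u32, most significant byte first. The helper name `utf8AsU32` and the byte order are assumptions made for this example; the thread does not specify either. It assumes a recent Zig release.

```zig
const std = @import("std");

/// Hypothetical helper: packs the UTF-8 encoding of `cp` into a u32,
/// first byte in the most significant position, zero-padded on the left.
fn utf8AsU32(cp: u21) !u32 {
    var buf: [4]u8 = undefined;
    const len = try std.unicode.utf8Encode(cp, &buf);
    var result: u32 = 0;
    for (buf[0..len]) |byte| {
        result = (result << 8) | byte;
    }
    return result;
}

test "zero-padded UTF-8 keeps 'ý' apart from the raw byte 0xFD" {
    // ASCII is unchanged.
    try std.testing.expect((try utf8AsU32('A')) == 0x41);
    // U+00FD becomes 0xC3BD, which no longer collides with the byte 0xFD.
    try std.testing.expect((try utf8AsU32('ý')) == 0xC3BD);
    // U+1F4A9 fills all four bytes.
    try std.testing.expect((try utf8AsU32('💩')) == 0xF09F92A9);
}
```

Under this representation the value of 'ý' would differ from '\xfd', at the cost of character literals no longer being plain code point numbers.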

@shawnl Oh, you mean that character literals represent Unicode codepoints? Sorry, I guess it's obvious that they do, currently. But there's still an issue, as far as I can tell, because transforming a Unicode codepoint into a byte always requires some kind of encoding scheme. If the encoding scheme is implicitly UTF-8 (and I believe it is) then transforming codepoints in the range [U+000080, U+10FFFF] into a single byte is impossible.

You could compensate for that problem by assuming ISO-8859-1 is the encoding for codepoints in the range [U+000080, U+0000FF], but I'm not yet convinced that such an assumption is an intended feature of the language. I'm hoping to get confirmation one way or another about that specific question, but it is of course possible that I'm trying to answer the wrong question.

Edit: I'm assuming that character literals can be implicitly transformed into bytes, BTW. If that's wrong then I need to reevaluate everything.

CantrellD on 4 May 2020

Actually, I've also been assuming that there's an intended semantic difference between codepoints and integers. And I finally realize that's a dubious assumption. But the original post in this issue did say that there's no footgun from this feature, and I think that may need to be reconsidered given the ambiguous semantics for codepoints in the range [U+000080, U+0000FF]. Please excuse me if that ambiguity was already considered when this feature was implemented.

Edit: And to clarify one more point: I do believe that it would be more natural for character literals to represent byte arrays, using the UTF-8 encoding, as you suggested. Though that's conditional on string literals being UTF-8 encoded byte arrays.

CantrellD on 4 May 2020

This prints 253 twice.

What is the problem?

253 is correct

andrewrk on 4 May 2020

Now here is what I would expect to work, but this is a standard library issue, not a problem with the zig syntax:

const std = @import("std");

test "character literal" {
    std.debug.warn("{c}\n", .{'💩'});
}

currently yields this compile error:

/home/andy/Downloads/zig/lib/std/fmt.zig:525:13: error: Cannot print integer that is larger than 8 bits as a ascii
            @compileError("Cannot print integer that is larger than 8 bits as a ascii");
            ^
/home/andy/Downloads/zig/lib/std/fmt.zig:497:52: note: called from here
        .Int, .ComptimeInt => return formatIntValue(value, fmt, options, out_stream),
                                                   ^
/home/andy/Downloads/zig/lib/std/fmt.zig:334:31: note: called from here
            return formatValue(value, fmt, options, out_stream);
                              ^
/home/andy/Downloads/zig/lib/std/fmt.zig:213:35: note: called from here
                    try formatType(
                                  ^
/home/andy/Downloads/zig/lib/std/io/out_stream.zig:28:34: note: called from here
            return std.fmt.format(self, format, args);
                                 ^
/home/andy/Downloads/zig/lib/std/debug.zig:65:25: note: called from here
    noasync stderr.print(fmt, args) catch return;
                        ^
./test3.zig:4:19: note: called from here
    std.debug.warn("{c}\n", .{'💩'});
                  ^
./test3.zig:3:26: note: called from here
test "character literal" {
                         ^

But I would expect {c} to try to print the number as UTF-8.

andrewrk on 4 May 2020
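
Until the formatter handles this, the code point can be encoded by hand and printed as a byte slice. This is a minimal sketch, assuming a recent Zig release; `encodeCodepoint` is a hypothetical helper written for this example, built on `std.unicode.utf8Encode` and `std.fmt.bufPrint`.

```zig
const std = @import("std");

/// Hypothetical helper: encodes one code point as UTF-8 into `buf`
/// and returns the slice of bytes that were written.
fn encodeCodepoint(cp: u21, buf: *[4]u8) ![]const u8 {
    const len = try std.unicode.utf8Encode(cp, buf);
    return buf[0..len];
}

test "format a code point as UTF-8 text" {
    var utf8: [4]u8 = undefined;
    const bytes = try encodeCodepoint('💩', &utf8);
    // U+1F4A9 needs four bytes in UTF-8.
    try std.testing.expectEqualSlices(u8, "\xf0\x9f\x92\xa9", bytes);

    // Format the encoded bytes with {s} instead of formatting the integer.
    var out: [16]u8 = undefined;
    const line = try std.fmt.bufPrint(&out, "{s}\n", .{bytes});
    try std.testing.expectEqualStrings("💩\n", line);
}
```

The integer is never passed to the formatter directly, so the 8-bit restriction shown in the compile error above does not come into play.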

What is the problem?

@andrewrk One is two bytes, UTF-8 encoded, the other is a single byte.

shawnl on 4 May 2020

const std = @import("std");

pub fn main() anyerror!void {
    std.debug.warn("{c}\n", .{'ý'});
}

This prints ² instead of ý

Edit: Sorry, took me a while to fix the formatting. I don't use markdown much.

CantrellD on 4 May 2020

I'm actually not sure what's going on there, since storing the value 253 in a byte array produces something that isn't valid UTF-8. So whatever gets printed is probably undefined behavior, I guess.

Edit: I guess I'm not sure whether that code actually stores the value 253 in a byte array. I'll try to test things a bit more carefully.

CantrellD on 4 May 2020

I have been reading this thread with popcorn for a while.

A byte is 8 bits.

shawnl on 4 May 2020

If std.debug.warn uses the same logic as std.fmt.bufPrint, then trying to store the character literal 'ý' in a string like "foo{c}bar" does seem to do the obvious thing and insert the value 253. Obviously that produces an array that isn't valid UTF-8, which is surprising if and only if you think in terms of characters and strings as opposed to bytes and byte arrays. The point I'm trying to make is that this line...