While solving #2088 I am about to push a change that makes this test pass:
test "unicode escape in character literal" {
var a: u24 = '\U01f4a9';
expect(a == 128169);
}
This makes sense since character literals are just comptime_int. There's no footgun here because you can't accidentally misuse it:
test "aoeu" {
var str = "hello";
str[1] = '\U01f4a9';
}
```
/home/andy/dev/zig/build/test.zig:5:14: error: integer value 128169 cannot be implicitly casted to type 'u8'
    str[1] = '\U01f4a9';
             ^
```
With that in mind, I think it makes sense to allow utf-8 characters in character literals, since [we have UTF-8 source encoding](https://ziglang.org/documentation/master/#Source-Encoding). I propose this test should pass:
```zig
const std = @import("std");
test "utf8 character literal" {
    const x = '💩';
    std.testing.expect(x == 128169);
}
```
Wait, since we have UTF-8 source encoding, can 💩 or あい be a valid identifier? Or was this already discussed in a previous issue?
@emekoi that is explicitly rejected, unless you are using the C ABI, in which case you can use @"any utf-8 string, with arbitrary bytes using \x00 syntax"
> \U01f4a9

Why did we pick an uppercase U for this rather than lowercase? Most languages use \u for unicode characters.
While on this topic; was there any consideration of \u{01f4a9} with braces?
It isn't necessary. There is a single quote ' at the end, so it is unambiguous. However, there is a big difference: \u covers the Basic Multilingual Plane (BMP), where the most useful stuff is, while \U covers the supplementary planes (such as Egyptian Hieroglyphs and Emoji). See this chart: https://www.unicode.org/roadmaps/bmp/
I think having two different escapes for this is really confusing. Instead there should just be one form of unicode escape.
The \u{} syntax is used by JavaScript (since ES6), Lua (since 5.3), and Swift (which switched away from our current syntax!), and it seems to be an accepted recent improvement in languages.
Oh, yeah, I wasn't seeing that it needs to be consistent with string literal escape syntax. Yeah, I agree \u{} is clearer, but \x still has to be supported as is (where it is the same as in C), because \x80 is different from \u{80}. \u{} should accept 1, 2, or 3 bytes.
@shawnl agreed, \x should be kept, \uXXXX and \UXXXXXX should be replaced with \u{X}.
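To make the difference between `\x80` and `\u{80}` concrete, here is a minimal sketch. It assumes a Zig release that already implements the `\u{...}` escape proposed above, and one where `std.testing.expect` returns an error union (hence the `try`); neither holds for the compiler being discussed at this point in the thread.

```zig
const std = @import("std");

test "raw byte escape vs unicode escape" {
    // "\x80" is exactly one byte with the value 0x80.
    const raw = "\x80";
    try std.testing.expect(raw.len == 1);
    try std.testing.expect(raw[0] == 0x80);

    // "\u{80}" is the codepoint U+0080, stored as the UTF-8 sequence C2 80.
    const encoded = "\u{80}";
    try std.testing.expect(encoded.len == 2);
    try std.testing.expect(encoded[0] == 0xC2);
    try std.testing.expect(encoded[1] == 0x80);
}
```

So inside a string the two escapes cannot be merged: one names a byte, the other names a codepoint that is then UTF-8 encoded.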
@andrewrk can we get an approved on the new character literal syntax, which will also apply for escape sequences inside strings? i'd like to code it up.
@shawnl you've got your approved, see #2129 :)
> This makes sense since character literals are just comptime_int. There's no footgun here because you can't accidentally misuse it:

I think I found a problem:
const std = @import("std");
test "character literal" {
std.debug.warn("{}\n", .{'\xfd'});
std.debug.warn("{}\n", .{'ý'});
}
This prints 253 twice.
Yeah, the latin-1 feature of unicode is weird.
> Yeah, the latin-1 feature of unicode is weird.

The issue is not the latin-1 set; but that character literals are defined to be both bytes and unicode codepoints.
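A small sketch of that double meaning, assuming a recent Zig release (where `std.testing.expect` returns an error union) and a UTF-8 encoded source file:

```zig
const std = @import("std");

test "byte escape and codepoint share one value" {
    // Both literals are the comptime_int 253, so they compare equal.
    try std.testing.expect('\xfd' == 'ý');

    // Because 253 fits in a u8, the codepoint literal coerces to a byte
    // without any error, even though U+00FD takes two bytes in UTF-8.
    const b: u8 = 'ý';
    try std.testing.expect(b == 0xfd);
    try std.testing.expect("ý".len == 2);
}
```

The compile error shown in the original post only protects values above 255; for U+0080 through U+00FF the coercion to `u8` goes through silently.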
There is only an ambiguity with the latin-1 set, and the unicode codepoint matches the bytes of that set. This is a unicode feature. If you want this to be non-ambiguous we could switch to using a utf-8-as-32-bit-number representation of these unicode points.
@shawnl Unless I'm missing something, it's not true in general that byte values in the range [0x00, 0xFF] map to Unicode codepoints in the range [U+000000, U+0000FF]. That happens to be true for ISO-8859-1, but in UTF-8 the correct byte sequence for U+0000FD is 0xC3 0xBD.
@daurnimator I agree that you found a problem. There definitely shouldn't be an implicit conversion from unicode codepoints in the range [U+0080, U+10FFFF] to u8 in any context where UTF-8 is understood to be the relevant encoding.
> @shawnl Unless I'm missing something,

You are.
@shawnl Please elaborate? The relevant Unicode code pages seem to be Basic Latin and Latin-1 Supplement. Basic Latin only defines code points up to U+00007F. And the UTF-8 encoding for U+000080 is 0xC2 0x80. I'd like to understand what my mistake is, so I can avoid it in the future.
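The byte sequences quoted above can be checked with `std.unicode.utf8Encode`. This is only a sketch, assuming a recent Zig standard library in which that function takes a codepoint and an output slice and returns the number of bytes written:

```zig
const std = @import("std");

test "UTF-8 encoding of U+00FD and U+0080" {
    var buf: [4]u8 = undefined;

    // U+00FD encodes to the two bytes C3 BD, not the single byte FD.
    const n1 = try std.unicode.utf8Encode(0xFD, &buf);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xC3, 0xBD }, buf[0..n1]);

    // U+0080 encodes to C2 80.
    const n2 = try std.unicode.utf8Encode(0x80, &buf);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xC2, 0x80 }, buf[0..n2]);

    // Only codepoints up to U+007F encode to a single, identical byte.
    const n3 = try std.unicode.utf8Encode(0x7F, &buf);
    try std.testing.expectEqualSlices(u8, &[_]u8{0x7F}, buf[0..n3]);
}
```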
U+80 through U+FF correspond to bytes 80 through FF of latin-1 supplement, which is exactly the feature this bug is complaining about. Sorry for coming across snarky.
> U+80 through U+FF correspond to bytes 80 through FF of

ISO-8859-1 code page.
ISO-8859-1 can only encode the first 256 Unicode codepoints, but character literals in Zig can be used for codepoints outside that range. Does the encoding change from ISO-8859-1 to UTF-8 for codepoints in the range [U+000100, U+10FFFF]? That doesn't seem right, but I don't know how else the current behavior could be correct.
> ISO-8859-1 can only encode the first 256 Unicode codepoints

The problem is the overlap (which is a feature). If you want to avoid that, then instead of representing as unicode point, you have to represent as zero-padded UTF-8.
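For illustration, here is one way the "zero-padded UTF-8" representation could look. `utf8AsU32` is a hypothetical helper, not anything in the compiler or standard library, and the byte order (first UTF-8 byte most significant) is an assumption, since the thread does not specify one. It relies on `std.unicode.utf8Encode` from a recent Zig standard library.

```zig
const std = @import("std");

/// Hypothetical helper: packs the UTF-8 bytes of `codepoint` into a u32,
/// first byte most significant, leaving the unused high bytes zero.
fn utf8AsU32(codepoint: u21) !u32 {
    var buf: [4]u8 = undefined;
    const len = try std.unicode.utf8Encode(codepoint, &buf);
    var result: u32 = 0;
    for (buf[0..len]) |byte| {
        result = (result << 8) | byte;
    }
    return result;
}

test "zero-padded UTF-8 removes the overlap" {
    // ASCII is unchanged.
    try std.testing.expect((try utf8AsU32('A')) == 0x41);
    // U+00FD becomes 0xC3BD, which no longer collides with the byte 0xFD.
    try std.testing.expect((try utf8AsU32('ý')) == 0xC3BD);
    // U+1F4A9 becomes its four UTF-8 bytes F0 9F 92 A9.
    try std.testing.expect((try utf8AsU32(0x1f4a9)) == 0xF09F92A9);
}
```

Under this scheme a literal like 'ý' would no longer fit in a `u8`, so the implicit cast would be rejected the same way it already is for '💩'.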
@shawnl Oh, you mean that character literals represent Unicode codepoints? Sorry, I guess it's obvious that they do, currently. But there's still an issue, as far as I can tell, because transforming a Unicode codepoint into a byte always requires some kind of encoding scheme. If the encoding scheme is implicitly UTF-8 (and I believe it is) then transforming codepoints in the range [U+000080, U+10FFFF] into a single byte is impossible.
You could compensate for that problem by assuming ISO-8859-1 is the encoding for codepoints in the range [U+000080, U+0000FF], but I'm not yet convinced that such an assumption is an intended feature of the language. I'm hoping to get confirmation one way or another about that specific question, but it is of course possible that I'm trying to answer the wrong question.
Edit: I'm assuming that character literals can be implicitly transformed into bytes, BTW. If that's wrong then I need to reevaluate everything.
Actually, I've also been assuming that there's an intended semantic difference between codepoints and integers. And I finally realize that's a dubious assumption. But the original post in this issue did say that there's no footgun from this feature, and I think that may need to be reconsidered given the ambiguous semantics for codepoints in the range [U+000080, U+0000FF]. Please excuse me if that ambiguity was already considered when this feature was implemented.
Edit: And to clarify one more point: I do believe that it would be more natural for character literals to represent byte arrays, using the UTF-8 encoding, as you suggested. Though that's conditional on string literals being UTF-8 encoded byte arrays.
Now here is what I would expect to work, but this is a standard library issue, not a problem with the zig syntax:
const std = @import("std");
test "character literal" {
std.debug.warn("{c}\n", .{'💩'});
}
currently yields this compile error:
/home/andy/Downloads/zig/lib/std/fmt.zig:525:13: error: Cannot print integer that is larger than 8 bits as a ascii
@compileError("Cannot print integer that is larger than 8 bits as a ascii");
^
/home/andy/Downloads/zig/lib/std/fmt.zig:497:52: note: called from here
.Int, .ComptimeInt => return formatIntValue(value, fmt, options, out_stream),
^
/home/andy/Downloads/zig/lib/std/fmt.zig:334:31: note: called from here
return formatValue(value, fmt, options, out_stream);
^
/home/andy/Downloads/zig/lib/std/fmt.zig:213:35: note: called from here
try formatType(
^
/home/andy/Downloads/zig/lib/std/io/out_stream.zig:28:34: note: called from here
return std.fmt.format(self, format, args);
^
/home/andy/Downloads/zig/lib/std/debug.zig:65:25: note: called from here
noasync stderr.print(fmt, args) catch return;
^
./test3.zig:4:19: note: called from here
std.debug.warn("{c}\n", .{'💩'});
^
./test3.zig:3:26: note: called from here
test "character literal" {
^
But I would expect {c} to try to print the number as UTF-8.
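Until the formatter does that itself, the codepoint can be encoded by hand before it is written out. A minimal sketch, assuming a recent Zig standard library (`std.unicode.utf8Encode`, `std.testing.expectEqualStrings`); it does not reflect how `std.fmt` handles `{c}`:

```zig
const std = @import("std");

test "encode a codepoint before printing it" {
    var buf: [4]u8 = undefined;
    // Encode the comptime_int 128169 (U+1F4A9) into UTF-8 bytes.
    const len = try std.unicode.utf8Encode('💩', &buf);
    // buf[0..len] is now a UTF-8 byte slice that can be written out as text.
    try std.testing.expectEqualStrings("💩", buf[0..len]);
}
```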
What is the problem?
@andrewrk One is two bytes, UTF-8 encoded, the other is a single byte.
pub fn main() anyerror!void {
std.debug.warn("{c}\n", .{'ý'});
}
This prints ² instead of ý
Edit: Sorry, took me a while to fix the formatting. I don't use markdown much.
I'm actually not sure what's going on there, since storing the value 253 in a byte array produces something that isn't valid UTF-8. So whatever gets printed is probably undefined behavior, I guess.
Edit: I guess I'm not sure whether that code actually stores the value 253 in a byte array. I'll try to test things a bit more carefully.
I have been reading this thread with popcorn for a while.
A byte is 8 bits.
If std.debug.warn uses the same logic as std.fmt.bufPrint, then trying to store the character literal 'ý' in a string like "foo{c}bar" does seem to do the obvious thing and insert the value 253. Obviously that produces an array that isn't valid UTF-8, which is surprising if and only if you think in terms of characters and strings as opposed to bytes and byte arrays. The point I'm trying to make is that this line...
> There's no footgun here because you can't accidentally misuse it

...from the original post seems to be incorrect for character literals corresponding to codepoints in the range [U+000080, U+0000FF].
> A byte is 8 bits.

Okay, but what happens when you invoke std.debug.warn with a byte array that isn't valid UTF-8?
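Whether such a byte array is valid UTF-8 can at least be checked directly. This sketch assumes a recent Zig standard library where `std.unicode.utf8ValidateSlice` takes a byte slice and returns a `bool`; it says nothing about what a terminal will display when handed the invalid bytes.

```zig
const std = @import("std");

test "a lone 0xFD byte is not valid UTF-8" {
    // The single byte FD, which is what {c} inserts for 'ý'.
    try std.testing.expect(!std.unicode.utf8ValidateSlice("\xfd"));
    // The two-byte sequence C3 BD is the valid encoding of U+00FD.
    try std.testing.expect(std.unicode.utf8ValidateSlice("\xc3\xbd"));
}
```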