This feature exists in many other languages and aims to make longer literals easier to read at a glance by grouping logical units within numbers. It is especially useful for the 128-bit and larger literals that are available in Zig.
I propose allowing a _ separator anywhere in a number literal, since that is the simplest rule to understand. Numeric literals are parsed into values as if the separators were not present.
Examples:
const a = 0x1234_2839_1083_1928;
const b = 0x123_190.109_038_018p102;
const c = 0_x0123; // Not allowed, cannot insert separator on radix prefix
const d = _1238; // This is parsed as an identifier
const e = 9_________123123; // Multiple separators are allowed in sequence.
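To illustrate the "parsed as if the separators were not present" rule, here is a minimal sketch as a Zig test. It assumes a compiler that accepts a single _ between digits (the stricter rule that later comments in this thread say was eventually implemented), so it avoids the repeated-separator form shown above. The test name is only illustrative.

```zig
const std = @import("std");

test "separators do not change the value" {
    // Grouped and ungrouped spellings denote the same integer.
    const grouped = 0x1234_2839_1083_1928;
    const plain = 0x1234283910831928;
    try std.testing.expect(grouped == plain);

    // The same holds for decimal and binary literals.
    try std.testing.expect(1_000_000 == 1000000);
    try std.testing.expect(0b1111_0000 == 0xF0);
}
```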
A more in-depth reference of other implementations can be found in the JavaScript proposal.
Pushed an initial implementation to a new branch here.
A note: it may be useful to allow other common numeral separators - only per project, not universally.
For example, Indian numbers use comma:
3,00,00,000
https://en.wikipedia.org/wiki/Indian_numbering_system
Trailing underscores could be allowed, for situations like:
arr[0] = 73__;
arr[1] = 8655;
arr[2] = 1___;
arr[3] = 0___;
arr[4] = 12__;
arr[5] = 987_;
which gives a hint about expected max range for a group of numbers.
Or for alignment:
arr[8_] = 1;
arr[9_] = 2;
arr[10] = 1;
arr[11] = 2;
Not sure if there would be much benefit for customisable separators. These are only really for visual grouping and not so much for accurate numeric localization.
The current implementation does allow trailing underscores as part of the literal. Those examples of yours should work.
@tiehuis: a program with a lot of hardcoded numbers (e.g. ballistic tables) may be more readable, and typos are more likely to be caught, when the numbers appear in a familiar style.
But this is not a feature for everyone, and if ever implemented, it should allow ad-hoc project customisation. I imagine wild things like the ability to avoid repeated numbers:
...
80482.23
....3.23
....4.93
80493.22
...
I have a hope that Zig's metaprogramming will enable these "tricks".
The current implementation does allow trailing underscores as part of the literal. Those examples of yours should work.
Great.
I think this has pros and cons. Reading 1_000_000 is a little better than reading 1000000. The only reason I hesitate to merge this right away is that 1______0__00__0___0________0 is much worse than 1000000 and now it becomes possible to have working, compiling code that looks like that. Even though 1000000 could look a little better, it's at least reasonable, and the only way you can write that number.
Another thing to think about is that a bare number literal is a math construct. 1000000 means the same thing in every region. However once you start introducing separators, regional differences creep into code. Some people might want 100_00_00. Maybe it's better to avoid that whole class of problems.
Here's my alternative proposal that uses status quo:
To compete with this Java:
long hexBytes = 0xFF_EC_DE_5E;
long hexWords = 0xCAFE_F00D;
long maxLong = 0x7fff_ffff_ffff_ffffL;
byte nybbles = 0b0010_0101;
long bytes = 0b11010010_01101001_10010100_10010010;
Here's the Zig:
//                  ++--++--
const hex_in_8s = 0xFFECDE5E;
//                   ++++----
const hex_in_16s = 0xCAFEF00D;
//                /--\/--\/--\/--\
const max_s64 = 0x7fffffffffffffff;
//                hhhhllll
const nybbles = 0b00100101;
//              33333333222222221111111100000000
const bytes = 0b11010010011010011001010010010010;
Takes up extra space, but the free-form nature of comments means you can write whatever you want there, which is arguably more powerful than only being able to group digits.
I admit the _ grouping looks nicer in some cases. But I'm not sure that's a compelling reason to make the language more complicated.
I think my biggest objection to this proposal is that it introduces language complexity in the form of syntactic sugar without encouraging any different semantics. This proposal enables the subjective concept of making long literals easier to read by grouping digits in some way.
Consider that there are even more ways to write a literal in Zig that allow even more expression of intent than this proposal. For example:
const max_s64 = (1 << 63) - 1;
const bytes =
    (0b11010010 << 24) |
    (0b01101001 << 16) |
    (0b10010100 << 8) |
    (0b10010010 << 0);
// see https://github.com/gcc-mirror/gcc/blob/61eae75c6230c7df9fa3e935b2efadda61667c5f/libiberty/crc32.c#L70
const crc32_table: [256]u32 = comptime generate_crc32_table();
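As a sanity check on the expression-based spellings above, the following sketch compares them against the plain literals from the earlier comment. It relies only on comptime integer arithmetic and std.math.maxInt; the test name is illustrative.

```zig
const std = @import("std");

const max_s64 = (1 << 63) - 1;
const bytes =
    (0b11010010 << 24) |
    (0b01101001 << 16) |
    (0b10010100 << 8) |
    (0b10010010 << 0);

test "expression spellings match the plain literals" {
    // Both forms are evaluated at compile time as comptime_int.
    try std.testing.expect(max_s64 == 0x7fffffffffffffff);
    try std.testing.expect(max_s64 == std.math.maxInt(i64));
    try std.testing.expect(bytes == 0b11010010011010011001010010010010);
}
```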
1______0__00__0___0________0
I have yet to see people doing things like this outside IOCCC.
OTOH they go to great lengths to create helpful visual artifacts in the code, like column alignment. Technical authors do the same with numeric tables or math heavy texts.
Why not make it a per-project option? If someone fears it, they can switch it off.
@andrewrk
Instead of allowing a separator everywhere, we could be more restrictive and only allow single separators between digits. This actually seems to be pretty normal in other languages. Ada, C++, Ruby and Julia use this method.
Regarding different region details. If there are implicit semantics behind the meaning of a number literal, separators actually may help convey to a reader that there are some implied extra details. Of course if they are just separating something without any specific meaning then that is a valid concern.
@thejoshwolfe
Valid alternative.
I see two main draws over just comments. First, a standardized way of doing this is a bonus and means we don't get different competing styles to represent the same thing (only a minor point). Second, because literal separators are much easier to insert (one character vs. annotating an entire line), they are probably more likely to be used than a comment-based approach, which helps code readability. It also allows one to leave comments for more important details like why the particular value may have been chosen, for example.
@tiehuis I really appreciate the writeup, and especially the fact that you went off and coded it. Your arguments are reasonable. But I'm going to have to go with keeping the language small and only 1 way to do things.
But I'm going to have to go with keeping the language small and only 1 way to do things.
If you use that reasoning, I'd personally drop the 0o and 0b prefixes as well.
The only use case I've ever seen for octal is in unix file permissions which is something you could easily handle using constants.
Anything you can represent in binary, you can just represent in hexadecimal. For example compare 0xff == 0b11111111, or 0x8000 == 0b1000000000000000. Personally, binary literals are very hard to read without the _ separator, and even C rejected binary literals due to lack of precedent and insufficient utility. cf. 6.4.4.1
Just an FYI: this can be implemented with pretty tiny changes to the syntax and lexer. However, it would require an extra sentence to explain it in the documentation, and it does mean there are more ways to write equivalent integer literals than the already existing decimal, hex, octal, and binary literals. Personally, I think it would be worth it.
The grammar changes from this:
FLOAT
<- "0x" hex+ "." hex+ ([pP] [-+]? hex+)? skip
/ [0-9]+ "." [0-9]+ ([eE] [-+]? [0-9]+)? skip
/ "0x" hex+ "."? [pP] [-+]? hex+ skip
/ [0-9]+ "."? [eE] [-+]? [0-9]+ skip
INTEGER
<- "0b" [01]+ skip
/ "0o" [0-7]+ skip
/ "0x" hex+ skip
/ [0-9]+ skip
to this:
hex_ <- [0-9a-fA-F_]
FLOAT
<- "0x" hex_+ "." hex_+ ([pP] [-+]? hex_+)? skip
/ [0-9] [0-9_]* "." [0-9_]+ ([eE] [-+]? [0-9_]+)? skip
/ "0x" hex_+ "."? [pP] [-+]? hex_+ skip
/ [0-9] [0-9_]* "."? [eE] [-+]? [0-9_]+ skip
INTEGER
<- "0b" [01_]+ skip
/ "0o" [0-7_]+ skip
/ "0x" hex_+ skip
/ [0-9] [0-9_]* skip
And the lexer (or parser, depending on implementation) just needs to skip the underscores when evaluating the number.
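A rough sketch of what "skip the underscores when evaluating the number" could look like. The function name evalDigits and its shape are hypothetical, not taken from the actual compiler; it assumes the radix prefix has already been stripped by the lexer and uses std.fmt.charToDigit plus the overflow-checked std.math.mul and std.math.add.

```zig
const std = @import("std");

/// Hypothetical helper: evaluates the digits of an integer literal
/// (prefix already stripped), ignoring `_` separators.
fn evalDigits(digits: []const u8, radix: u8) !u64 {
    var value: u64 = 0;
    var seen_digit = false;
    for (digits) |c| {
        if (c == '_') continue; // separators carry no value
        const d = try std.fmt.charToDigit(c, radix);
        value = try std.math.mul(u64, value, radix);
        value = try std.math.add(u64, value, d);
        seen_digit = true;
    }
    // A literal made only of separators is not a number.
    if (!seen_digit) return error.InvalidCharacter;
    return value;
}

test "evalDigits ignores separators" {
    try std.testing.expect((try evalDigits("1234_5678", 16)) == 0x12345678);
}
```

Note that this only covers evaluation; where separators may appear is decided by the grammar, not by this loop.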
@andrewrk could you consider reopening this?
After this issue was closed in 2017, almost every mainstream language has come to support this feature. If Zig's goal is to replace C and become the new lingua franca, it makes sense to adopt the syntax that other languages are using.
I've compiled an extensive list of languages that support _ as a digit separator:
… ' as the separator, but it would have used _ if it didn't conflict with the grammar)
I'd just like to note that C isn't on that list.
One of the things that differentiates C from most languages is, IMO, its simplicity. While this proposal is, itself, not complex, I find that languages aren't generally brought down by a few major changes, but by many minor ones.
Imagine if a dozen similar changes to the grammar were made. Each one, on its own, is relatively benign; together, they remove everything that makes Zig what it is. If Zig were to adopt every minor change that "every mainstream" language supports, there wouldn't really be a point to Zig at all.
@pixelherodev Also note that C doesn't have binary literals 0b. Most of these languages added binary literals and _ separators at the same time. People have longer response times when counting more than 4 objects and binary literals have a large number of elements, so they are hard for people to parse without a visual separator. It's hard to tell the difference between 0b11111111 and 0b111111111. Using visual separators makes the code much easier to read: 0b1111_1111 and 0b1_1111_1111.
That's true, but it's also still possible to write out, say, 0xFF or 0x1FF, is it not? If you're writing out large binary strings manually, maybe you should switch to hex.
Or, if numerical separators are important, here's an alternate proposal: a comptime function in zag (which, in case you haven't come across me using that term elsewhere, is what I've started calling the Zig standard library) which takes a string literal such as "393_219_293_192", strips the separators, and parses the result as an integer. This also has the advantage of allowing a single function that supports every base by simply passing the base on to parseInt in std.fmt.
Usage:
const a = std.fmt.parseSeparatedInt("1f3a_3904_a9ca_299c", 16);
This leaves the grammar as is, provides most (albeit not all) of the advantages of implementing it as a language feature, and slightly reduces how large a Zig compiler needs to be to compete with the current stage1.
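For concreteness, here is one way such a function might be sketched. parseSeparatedInt is hypothetical and not part of std.fmt; this version also takes an explicit result type, which the usage line above does not, and the 128-byte scratch buffer is an arbitrary assumption. It strips the separators itself and then defers to std.fmt.parseInt.

```zig
const std = @import("std");

/// Hypothetical: not part of std.fmt. Copies `str` without `_` into a
/// fixed buffer, then defers to std.fmt.parseInt.
fn parseSeparatedInt(comptime T: type, str: []const u8, base: u8) !T {
    var buf: [128]u8 = undefined; // arbitrary upper bound on digit count
    var len: usize = 0;
    for (str) |c| {
        if (c == '_') continue;
        if (len == buf.len) return error.Overflow;
        buf[len] = c;
        len += 1;
    }
    return std.fmt.parseInt(T, buf[0..len], base);
}

test "parseSeparatedInt" {
    const a = try parseSeparatedInt(u64, "1f3a_3904_a9ca_299c", 16);
    try std.testing.expect(a == 0x1f3a3904a9ca299c);
}
```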
@pixelherodev I think encouraging parsing functions for something so elementary is the wrong way to go. Using something like std.fmt.parseSeparatedInt is cumbersome, so people will be encouraged to take short cuts. However, different people are going to take different shortcuts:
Person A might do this:
const p = std.fmt.parseSeparatedInt;
// ...
const y = switch (x) {
    p("0010_1111", 2) ... p("0011_1111", 2) => symbol_1(x),
    p("1111_0000", 2) ... p("1111_1111", 2) => symbol_2(x),
    // ...
};
Person B might do this:
fn b(comptime str: []const u8) comptime_int {
    return std.fmt.parseSeparatedInt(str, 2);
}
// ...
const y = switch (x) {
    b("0010_1111") ... b("0011_1111") => symbol_1(x),
    b("1111_0000") ... b("1111_1111") => symbol_2(x),
    // ...
};
However, as the reader of this code, how am I supposed to know what p and b do? I can't be sure what they do, so I have to check the definitions. Having builtin syntax introduces much less cognitive overhead:
const y = switch (x) {
    0b0010_1111 ... 0b0011_1111 => symbol_1(x),
    0b1111_0000 ... 0b1111_1111 => symbol_2(x),
    // ...
};
The list of languages that @momumi compiled that support this feature is the most compelling argument I've seen.
I don't think it's fair to support 0b literals and not separators; it seems like an arbitrary design decision to me. Both have their niche uses; both give more than one obvious way to do things; both are unsupported in C; both are supported by most major modern languages (I think?).
Had a go at implementing this in #4741
This implementation is similar to the JavaScript version, where _ may only be placed between two digits.
So these are valid:
1_000_000
1_0_0_0_0_0_0
0x1234_5678
0x12_34_56_78
1_000.000_001e1_000
These are invalid:
1__0
10_
0_b10
0b_10
1_.0
1._0
1.0_e1
1.0e_1
1.0e1_
1.0e+_1
Implemented by @momumi in #4741, landed in 13d04f9963be930360ab728edd47f1a6ecfb1777.
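The "only between two digits" rule can be sketched as a small checker. This is a hypothetical helper, not the code from #4741, and it deliberately covers integer literals only (floats, with their dot and exponent, are out of scope). It assumes the 0x / 0o / 0b prefixes and uses std.fmt.charToDigit to test digit validity.

```zig
const std = @import("std");

/// Hypothetical checker for integer literals only: every `_` must sit
/// between two digits that are valid in the literal's radix.
fn separatorsWellPlaced(lit: []const u8) bool {
    var radix: u8 = 10;
    var digits = lit;
    if (lit.len >= 2 and lit[0] == '0') {
        switch (lit[1]) {
            'x' => radix = 16,
            'o' => radix = 8,
            'b' => radix = 2,
            else => {},
        }
        if (radix != 10) digits = lit[2..];
    }
    if (digits.len == 0) return false;
    var prev_was_digit = false;
    for (digits) |c| {
        if (c == '_') {
            if (!prev_was_digit) return false; // leading or doubled separator
            prev_was_digit = false;
        } else {
            _ = std.fmt.charToDigit(c, radix) catch return false;
            prev_was_digit = true;
        }
    }
    return prev_was_digit; // rejects a trailing separator
}

test "integer examples from the lists above" {
    try std.testing.expect(separatorsWellPlaced("1_000_000"));
    try std.testing.expect(separatorsWellPlaced("0x12_34_56_78"));
    try std.testing.expect(!separatorsWellPlaced("1__0"));
    try std.testing.expect(!separatorsWellPlaced("0b_10"));
}
```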