Zig: Proposal: Number literal separators

Created on 29 Sep 2017 · 20 Comments · Source: ziglang/zig

Many other languages support this feature, which makes longer literals easier to read at a glance by grouping logical units within a number. It is especially useful for the 128-bit and larger literals that are available in Zig.

I propose allowing a _ separator anywhere in a number literal, since that is the simplest rule to understand. Numeric literals are parsed into values as if the separators were not present.

Examples:

const a = 0x1234_2839_1083_1928;
const b = 0x123_190.109_038_018p102;
const c = 0_x0123; // Not allowed: a separator cannot be inserted into the radix prefix
const d = _1238; // This is parsed as an identifier
const e = 9_________123123; // Multiple separators are allowed in sequence.
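
The rule that separators do not affect the value can be checked directly. This is a minimal sketch, assuming a Zig compiler that already includes the separator support that landed later in this thread (#4741), which only accepts a single _ between two digits. For that reason the `9_________123123` form from the proposal is left out. The constant names are illustrative only.

```zig
const std = @import("std");

// Illustrative names; the grouped and plain spellings should denote one value.
const grouped_hex = 0x1234_2839_1083_1928;
const plain_hex = 0x1234283910831928;

const grouped_dec = 1_000_000;
const plain_dec = 1000000;

test "separators do not change the value of a literal" {
    try std.testing.expect(grouped_hex == plain_hex);
    try std.testing.expect(grouped_dec == plain_dec);
}
```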

A more in-depth reference of other implementations can be found in the JavaScript proposal.

Labels: accepted, contributor friendly, proposal

tiehuis

👍6

All 20 comments

Pushed an initial implementation to a new branch here.

tiehuis on 29 Sep 2017

A note: it may be useful to allow other common numeral separators - only per project, not universally.

For example, Indian numbers use comma:

3,00,00,000

https://en.wikipedia.org/wiki/Indian_numbering_system

Trailing underscores could be allowed, for situations like:

arr[0] = 73__;
arr[1] = 8655;
arr[2] = 1___;
arr[3] = 0___;
arr[4] = 12__;
arr[5] = 987_;

which gives a hint about expected max range for a group of numbers.

Or for alignment:

arr[8_] = 1;
arr[9_] = 2;
arr[10] = 1;
arr[11] = 2;

PavelVozenilek on 29 Sep 2017

👎1

Not sure if there would be much benefit to customisable separators. These are only really for visual grouping and not so much for accurate numeric localization.

The current implementation does allow trailing underscores as part of the literal. Those examples of yours should work.

tiehuis on 29 Sep 2017

@tiehuis: a program with a lot of hardcoded numbers (e.g. ballistic tables) may be more readable, and its typos easier to catch, thanks to the familiar style.

But this is not a feature for everyone, and if ever implemented, it should allow ad-hoc project customisation. I imagine wild things like the ability to avoid repeated numbers:

...
    80482.23
    ....3.23
    ....4.93
    80493.22
...

I have a hope that Zig's metaprogramming will enable these "tricks".

> The current implementation does allow trailing underscores as part of the literal. Those examples of yours should work.

Great.

PavelVozenilek on 29 Sep 2017

👎1

I think this has pros and cons. Reading 1_000_000 is a little better than reading 1000000. The only reason I hesitate to merge this right away is that 1______0__00__0___0________0 is much worse than 1000000 and now it becomes possible to have working, compiling code that looks like that. Even though 1000000 could look a little better, it's at least reasonable, and the only way you can write that number.

andrewrk on 29 Sep 2017

Another thing to think about is that a bare number literal is a math construct. 1000000 means the same thing in every region. However once you start introducing separators, regional differences creep into code. Some people might want 100_00_00. Maybe it's better to avoid that whole class of problems.

andrewrk on 29 Sep 2017

Here's my alternative proposal that uses status quo:

To compete with this Java:

long hexBytes = 0xFF_EC_DE_5E;
long hexWords = 0xCAFE_F00D;
long maxLong = 0x7fff_ffff_ffff_ffffL;
byte nybbles = 0b0010_0101;
long bytes = 0b11010010_01101001_10010100_10010010;

Here's the Zig:

//                  ++--++--
const hex_in_8s = 0xFFECDE5E;

//                   ++++----
const hex_in_16s = 0xCAFEF00D;

//                /--\/--\/--\/--\
const max_s64 = 0x7fffffffffffffff;

//                hhhhllll
const nybbles = 0b00100101;

//              33333333222222221111111100000000
const bytes = 0b11010010011010011001010010010010;

Takes up extra space, but the free-form nature of comments means you can write whatever you want there, which is arguably more powerful than only being able to group digits.

I admit the _ grouping looks nicer in some cases. But I'm not sure that's a compelling reason to make the language more complicated.

I think my biggest objection to this proposal is that it introduces language complexity in the form of syntactic sugar without encouraging any different semantics. This proposal enables the subjective concept of making long literals easier to read by grouping digits in some way.

Consider that there are even more ways to write a literal in Zig that allow even more expression of intent than this proposal. For example:

const max_s64 = (1 << 63) - 1;
const bytes =
    (0b11010010 << 24) |
    (0b01101001 << 16) |
    (0b10010100 << 8) |
    (0b10010010 << 0);
// see https://github.com/gcc-mirror/gcc/blob/61eae75c6230c7df9fa3e935b2efadda61667c5f/libiberty/crc32.c#L70
const crc32_table: [256]u32 = comptime generate_crc32_table();
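
To make the comparison concrete, the shift-based spellings and the flat literals from the examples above can be checked against each other. This is only a sketch: the `_shifted` and `_literal` names are made up for the illustration, and the values are copied from the snippets in this comment.

```zig
const std = @import("std");

// Hypothetical names; values copied from the examples in this comment.
const max_s64_shifted = (1 << 63) - 1;
const max_s64_literal = 0x7fffffffffffffff;

const bytes_shifted =
    (0b11010010 << 24) |
    (0b01101001 << 16) |
    (0b10010100 << 8) |
    (0b10010010 << 0);
const bytes_literal = 0b11010010011010011001010010010010;

test "shift-based and flat spellings denote the same value" {
    try std.testing.expect(max_s64_shifted == max_s64_literal);
    try std.testing.expect(bytes_shifted == bytes_literal);
}
```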

thejoshwolfe on 29 Sep 2017

👍1

> 1______0__00__0___0________0

I have yet to see people doing things like this outside the IOCCC.

OTOH they go to great lengths to create helpful visual artifacts in the code, like column alignment. Technical authors do the same with numeric tables or math heavy texts.

Why not make it a per-project option? If someone fears it, they can switch it off.

PavelVozenilek on 29 Sep 2017

@andrewrk

Instead of allowing a separator everywhere, we could be more restrictive and only allow single separators between digits. This actually seems to be pretty normal in other languages. Ada, C++, Ruby and Julia use this method.

Regarding different region details. If there are implicit semantics behind the meaning of a number literal, separators actually may help convey to a reader that there are some implied extra details. Of course if they are just separating something without any specific meaning then that is a valid concern.

@thejoshwolfe

Valid alternative.

I see two main draws over just comments. First, a standardized way of doing this is a bonus and means we don't get different competing styles to represent the same thing (only a minor point). Second, because literal separators are much easier to insert (one character vs. annotating an entire line), they are more likely to be used than a comment-based approach, which helps code readability. It also leaves comments free for more important details, such as why a particular value was chosen.

tiehuis on 30 Sep 2017

@tiehuis I really appreciate the writeup, and especially the fact that you went off and coded it. Your arguments are reasonable. But I'm going to have to go with keeping the language small and only 1 way to do things.

andrewrk on 8 Dec 2017

> But I'm going to have to go with keeping the language small and only 1 way to do things.

If you use that reasoning, I'd personally drop the 0o and 0b prefixes as well.

The only use case I've ever seen for octal is in unix file permissions which is something you could easily handle using constants.

Anything you can represent in binary, you can just represent in hexadecimal. For example compare 0xff == 0b11111111, or 0x8000 == 0b1000000000000000. Personally, I find binary literals very hard to read without the _ separator, and even C rejected binary literals due to lack of precedent and insufficient utility (cf. 6.4.4.1).

momumi on 26 Dec 2019

Just an FYI: this can be implemented with pretty tiny changes to the syntax and lexer. However, it would require an extra sentence to explain it in the documentation, and it does mean there are more ways to write equivalent integer literals than the already existing decimal, hex, octal, and binary literals. Personally, I think it would be worth it.

The grammar changes from this:

FLOAT
    <- "0x" hex+   "." hex+   ([pP] [-+]? hex+)?   skip
     /      [0-9]+ "." [0-9]+ ([eE] [-+]? [0-9]+)? skip
     / "0x" hex+   "."? [pP] [-+]? hex+   skip
     /      [0-9]+ "."? [eE] [-+]? [0-9]+ skip
INTEGER
    <- "0b" [01]+  skip
     / "0o" [0-7]+ skip
     / "0x" hex+   skip
     /      [0-9]+ skip

to this:

hex_ <- [0-9a-fA-F_]
FLOAT
    <- "0x" hex_+   "." hex_+   ([pP] [-+]? hex_+)?   skip
     /      [0-9] [0-9_]* "." [0-9_]+ ([eE] [-+]? [0-9_]+)? skip
     / "0x" hex_+   "."? [pP] [-+]? hex_+   skip
     /      [0-9] [0-9_]* "."? [eE] [-+]? [0-9_]+ skip
INTEGER
    <- "0b" [01_]+  skip
     / "0o" [0-7_]+ skip
     / "0x" hex_+   skip
     /      [0-9] [0-9_]* skip

And the lexer (or parser, depending on implementation) just needs to skip the underscores when evaluating the number.
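
The "skip the underscores when evaluating" step could look roughly like the following. This is a sketch under assumptions, not the Zig compiler's code: `evalDigits` is a hypothetical helper that takes the digits of an integer literal with the radix prefix already removed, and it does no validation of where the underscores sit.

```zig
const std = @import("std");

/// Hypothetical helper (not compiler code): evaluates the digits of an
/// integer literal, radix prefix already stripped, skipping every `_`.
fn evalDigits(digits: []const u8, base: u8) !u64 {
    var value: u64 = 0;
    for (digits) |c| {
        if (c == '_') continue; // separators carry no value
        const d = try std.fmt.charToDigit(c, base);
        value = try std.math.mul(u64, value, base); // reports overflow as an error
        value = try std.math.add(u64, value, d);
    }
    return value;
}

test "underscores are skipped during evaluation" {
    try std.testing.expect((try evalDigits("1_000_000", 10)) == 1000000);
    try std.testing.expect((try evalDigits("CAFE_F00D", 16)) == 0xCAFEF00D);
}
```

Because the sketch accepts underscores anywhere, it matches the permissive grammar in this comment rather than a stricter "between two digits" rule.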

scottjmaddox on 8 Jan 2020

@andrewrk could you consider reopening this?

Since this issue was closed in 2017, almost every mainstream language has come to support this feature. If Zig's goal is to replace C and become the new lingua franca, it makes sense to adopt the syntax that other languages are using.

I've compiled an extensive list of languages that support _ as a digit separator:

java (SE 7)
javascript (planned, already implemented in browsers and nodejs)
python (3.6)
C# (7.0)
C++ (C++14 actually uses ' as the separator, but it would have used _ if it didn't conflict with the grammar)
php (7.4)
go (1.13)
rust (1.0)
ruby (1.0)
Visual Basic (Visual Basic 2017)
perl (2.0)
D (1.0)
swift (1.0)
kotlin (1.1)
haskell (8.6.1)
F# (4.1)
assembly (nasm 0.99.06, fasm 1.71.56)
Verilog (95)
VHDL (1993 cf. §13.4)
julia (1.0)
erlang (eep 51 accepted)
octave (4.2)
typescript (2.7)
elixir (1.1)
ocaml (3.07)
scheme (srfi-169)
eiffel (5.6)
ada (1983)

momumi on 19 Jan 2020

❤4

I'd just like to note that C isn't on that list.

One of the things that differentiates C from most languages is, IMO, its simplicity. While this proposal is, itself, not complex, I find that languages aren't generally brought down by a few major changes, but by many minor ones.

Imagine if a dozen similar changes to the grammar were made. Each one, on its own, is relatively benign; together, they remove everything that makes Zig what it is. If Zig were to adopt every minor change that "every mainstream" language supports, there wouldn't really be a point to Zig at all.

pixelherodev on 19 Jan 2020

👎1

@pixelherodev Also note that C doesn't have binary literals 0b. Most of these languages added binary literals and _ separators at the same time. People have longer response times when counting more than 4 objects and binary literals have a large number of elements, so they are hard for people to parse without a visual separator. It's hard to tell the difference between 0b11111111 and 0b111111111. Using visual separators makes the code much easier to read: 0b1111_1111 and 0b1_1111_1111.

momumi on 19 Jan 2020

👍6

That's true, but it's also still possible to write out, say, 0xFF or 0x1FF, is it not? If you're writing out large binary strings manually, maybe you should switch to hex.

Or, if numerical separators are important, here's an alternate proposal: a comptime function in zag (which, in case you haven't come across me using that term elsewhere, is what I've started calling the Zig standard library) which takes a string literal, say "393_219_293_192", strips the separators, and parses the result as an integer. This also has the advantage of allowing a single function that supports every base by simply passing the base on to parseInt in std.fmt.

Usage:

const a = std.fmt.parseSeparatedInt("1f3a_3904_a9ca_299c", 16);

This leaves the grammar as is, provides most (albeit not all) of the advantages of implementing it as a language feature, and slightly reduces how large a Zig compiler needs to be to compete with the current stage1.
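
For illustration, the suggested function might be sketched as below. Note that `parseSeparatedInt` is hypothetical: it does not exist in std.fmt, and the name is taken from the usage line above. The sketch is a plain function returning u64 rather than the comptime function described, and the 64-byte buffer is an assumption sized for a 64-bit value in base 2.

```zig
const std = @import("std");

/// Hypothetical: std.fmt has no parseSeparatedInt. This sketch copies the
/// non-separator characters into a buffer and delegates to std.fmt.parseInt.
fn parseSeparatedInt(text: []const u8, base: u8) !u64 {
    var buf: [64]u8 = undefined; // assumption: at most 64 digits
    var len: usize = 0;
    for (text) |c| {
        if (c == '_') continue; // drop separators
        if (len == buf.len) return error.Overflow;
        buf[len] = c;
        len += 1;
    }
    return std.fmt.parseInt(u64, buf[0..len], base);
}

test "separated hex string parses to the same value as the literal" {
    const a = try parseSeparatedInt("1f3a_3904_a9ca_299c", 16);
    try std.testing.expect(a == 0x1f3a3904a9ca299c);
}
```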

pixelherodev on 19 Jan 2020

👎1

@pixelherodev I think encouraging parsing functions for something so elementary is the wrong way to go. Using something like std.fmt.parseSeparatedInt is cumbersome, so people will be encouraged to take shortcuts. However, different people are going to take different shortcuts:

Person A might do this:

const p = std.fmt.parseSeparatedInt;

// ...

const y = switch (x) {
    p("0010_1111", 2) ... p("0011_1111", 2) => symbol_1(x),
    p("1111_0000", 2) ... p("1111_1111", 2) => symbol_2(x),
    // ...
};

Person B might do this:

fn b(comptime str: []const u8) comptime_int {
    return std.fmt.parseSeparatedInt(str, 2);
}

// ...

const y = switch (x) {
    b("0010_1111") ... b("0011_1111") => symbol_1(x),
    b("1111_0000") ... b("1111_1111") => symbol_2(x),
    // ...
};

However, as the reader of this code, how am I supposed to know what p and b do? I can't guarantee what it does, so I have to check the definition. Having builtin syntax introduces much less cognitive overhead:

const y = switch (x) {
    0b0010_1111 ... 0b0011_1111 => symbol_1(x),
    0b1111_0000 ... 0b1111_1111 => symbol_2(x),
    // ...
};
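
A self-contained version of the built-in form might look like this. It is a sketch that assumes a Zig compiler with separator support. `classify` and its return codes are invented for the example and stand in for `symbol_1` and `symbol_2`, which are not defined in this thread.

```zig
const std = @import("std");

/// Hypothetical stand-in for the symbol_1 / symbol_2 dispatch above:
/// returns 1 or 2 for the two ranges and 0 for everything else.
fn classify(x: u8) u8 {
    return switch (x) {
        0b0010_1111 ... 0b0011_1111 => 1,
        0b1111_0000 ... 0b1111_1111 => 2,
        else => 0,
    };
}

test "ranges written with separators match as expected" {
    try std.testing.expect(classify(0b0011_0000) == 1);
    try std.testing.expect(classify(0xF0) == 2);
    try std.testing.expect(classify(0) == 0);
}
```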

momumi on 19 Jan 2020

👍1

The list of languages that @momumi compiled that support this feature is the most compelling argument I've seen.

I don't think it's fair to support 0b literals and not separators; it seems like an arbitrary design decision to me. Both have their niche uses; both give more than one obvious way to do things; both are unsupported in C; both are supported by most major modern languages (I think?).

thejoshwolfe on 19 Jan 2020

👍3

Had a go at implementing this in #4741.

This implementation is similar to the JavaScript version, where _ may only be placed between two digits.

So these are valid:

1_000_000
1_0_0_0_0_0_0
0x1234_5678
0x12_34_56_78
1_000.000_001e1_000

These are invalid:

1__0
10_
0_b10
0b_10
1_.0
1._0
1.0_e1
1.0e_1
1.0e1_
1.0e+_1
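
The "between two digits" rule can be sketched for the simplest case, a single run of decimal digits. This is not the code from #4741: `separatorsWellPlaced` is a hypothetical helper, and it deliberately ignores radix prefixes, fractions, and exponents.

```zig
const std = @import("std");

/// Hypothetical check (not the #4741 implementation): true when the input is
/// a run of decimal digits in which every `_` sits between two digits.
fn separatorsWellPlaced(digits: []const u8) bool {
    var prev_was_digit = false;
    for (digits) |c| {
        if (c == '_') {
            if (!prev_was_digit) return false; // leading `_` or doubled `__`
            prev_was_digit = false;
        } else if (std.ascii.isDigit(c)) {
            prev_was_digit = true;
        } else {
            return false; // anything else is outside this sketch
        }
    }
    return prev_was_digit; // rejects empty input and a trailing `_`
}

test "separator placement for plain decimal digit runs" {
    try std.testing.expect(separatorsWellPlaced("1_000_000"));
    try std.testing.expect(separatorsWellPlaced("1_0_0_0_0_0_0"));
    try std.testing.expect(!separatorsWellPlaced("1__0"));
    try std.testing.expect(!separatorsWellPlaced("10_"));
}
```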

momumi on 15 Mar 2020

👍6 🚀4

Implemented by @momumi in #4741, landed in 13d04f9963be930360ab728edd47f1a6ecfb1777.

andrewrk on 23 Mar 2020
