Language: Digit separators in number literals.

Created on 29 Jun 2018 · 12 Comments · Source: dart-lang/language

Solution to #1.

To make long number literals more readable, allow authors to inject digit group separators inside numbers.
Examples with different possible separators:

100 000 000 000 000 000 000  // space 
100,000,000,000,000,000,000  // comma
100.000.000.000.000.000.000  // period
100'000'000'000'000'000'000  // apostrophe (C++)
100_000_000_000_000_000_000  // underscore (many programming languages).

The syntax must work even with just a single separator, so the separator can't be anything that can already validly separate two expressions (which excludes all infix operators and the comma), and it should not already be part of a number literal (which excludes the decimal point).
So the comma and the period are probably never going to work, even though they are already the standard "thousands separator" in text in different parts of the world.

Space separation is dangerous because it's hard to see whether it's just a space or an accidental tab character. If we allow spacing, should we allow arbitrary whitespace, including line terminators? If so, then this suddenly becomes quite dangerous: forget a comma at the end of a line in a multiline list, and two adjacent integers are automatically combined (we already have that problem with strings). So, probably not a good choice, even if it is the preferred formatting for printed text.
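The string version of that problem is easy to reproduce in Dart as it stands, because adjacent string literals are concatenated. A minimal sketch, with a made-up list name and contents:

```dart
// Adjacent string literals are concatenated in Dart, so a missing comma
// silently merges two list elements instead of causing a compile error.
const colors = [
  'red',
  'green' // <- comma forgotten here
  'blue',
];

void main() {
  print(colors.length); // 2
  print(colors[1]); // greenblue
}
```

Whitespace-separated digits would give integer lists the same failure mode.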

The apostrophe is also the string single-quote character. We don't currently allow adjacent numbers and strings, but if we ever do, then this syntax becomes ambiguous. It's still possible (we disambiguate by assuming it's a digit separator). It is used by C++14 as a digit group separator, so it is definitely possible.

That leaves underscore, which could be the start of an identifier. Currently 100_000 would be tokenized as "integer literal 100" followed by "identifier _000". However, users would never write an identifier adjacent to another token that contains identifier-valid characters (unlike strings, which have clear delimiters that do not occur anywhere else), so this is unlikely to happen in practice. Underscore is already used by a large number of programming languages including Java, Swift, and Python.
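The tokenization described above can be illustrated with a toy scanner. This is only a sketch and not the real Dart lexer: the `tokenize` function and its regular expression are assumptions made for illustration, and it knows just two token kinds.

```dart
// Toy scanner, NOT the real Dart lexer. It recognizes only decimal integers
// and identifiers, and matches each one greedily.
final _token = RegExp(r'[0-9]+|[A-Za-z_$][A-Za-z0-9_$]*');

List<String> tokenize(String source) =>
    [for (final m in _token.allMatches(source)) m.group(0)!];

void main() {
  print(tokenize('100_000')); // [100, _000]
  print(tokenize('_100')); // [_100]
}
```

Without separator support, the digits stop at the first underscore and the rest is read as an identifier.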

We also want to allow multiple separators for higher-level grouping, e.g.:

100__000_000_000__000_000_000

For this purpose, the underscore extends gracefully. So does space, but it has the disadvantage that consecutive spaces collapse when inserted into HTML, whereas a doubled apostrophe ('') looks odd.

For ease of reading and ease of parsing, we should only allow a digit separator that actually separates digits - it must occur between two digits of the number, not at the end or beginning, and if used in double literals, not adjacent to the . or e{+,-,} characters, or next to an x in a hexadecimal literal.

Examples

100__000_000__000_000__000_000  // one hundred million million millions!
0x4000_0000_0000_0000
0.000_000_000_01
0x00_14_22_01_23_45  // MAC address
555_123_4567  // US Phone number

Invalid literals:

100_
0x_00_14_22_01_23_45 
0._000_000_000_1
100_.1
1.2e_3

An identifier like _100 is a valid identifier, and _100._100 is a valid member access. If users learn the "separator only between digits" rule quickly, this will likely not be an issue.
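The "separator only between digits" rule can be written down as a regular expression. The following is a rough sketch, not the proposed grammar: the name `isValidNumberLiteral` is hypothetical, and it only covers decimal integers, hexadecimal integers, and doubles with a leading digit.

```dart
// One or more digits, optionally followed by groups of "underscore(s) + digits".
// Every run of underscores is therefore surrounded by digits on both sides.
const _dec = r'[0-9]+(?:_+[0-9]+)*';
const _hex = r'[0-9a-fA-F]+(?:_+[0-9a-fA-F]+)*';

// Either a hex literal, or a decimal literal with an optional fraction and
// an optional exponent. No underscore may touch 'x', '.', 'e', '+' or '-'.
final _numberLiteral = RegExp(
  '^(?:0[xX]$_hex|$_dec(?:\\.$_dec)?(?:[eE][+-]?$_dec)?)\$',
);

bool isValidNumberLiteral(String lexeme) => _numberLiteral.hasMatch(lexeme);

void main() {
  print(isValidNumberLiteral('100__000_000')); // true
  print(isValidNumberLiteral('0x00_14_22_01_23_45')); // true
  print(isValidNumberLiteral('0.000_000_000_01')); // true
  print(isValidNumberLiteral('100_')); // false
  print(isValidNumberLiteral('0x_00_14')); // false
  print(isValidNumberLiteral('1.2e_3')); // false
}
```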

Implementation issues

Should be trivial to implement at the parsing level. The only issue is that a parser might need to copy the digits (without the separators) before calling a parse function, where currently it might get away with pointing a native parse function directly at its input bytes.
This should have no effect after the parsing.

Style guides might introduce a preference for digit grouping (say, numbers with more than six digits should use separators) so a formatter or linter may want access to the actual source as well as the numerical value. The front end should make this available for source processing tools.
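As a sketch of why such a tool needs the source text: a hypothetical lint check (the name `shouldSuggestSeparators` and the threshold parameter are made up here) can only tell 1000000 from 1_000_000 by looking at the lexeme, since both have the same value.

```dart
// Hypothetical lint rule: decimal integer literals with more than
// `maxUngroupedDigits` digits should contain at least one separator.
// It inspects the source lexeme, because the numeric value is the same
// with or without separators.
bool shouldSuggestSeparators(String lexeme, {int maxUngroupedDigits = 6}) {
  final isPlainDecimal = RegExp(r'^[0-9]+$').hasMatch(lexeme);
  return isPlainDecimal && lexeme.length > maxUngroupedDigits;
}

void main() {
  print(shouldSuggestSeparators('1000000')); // true
  print(shouldSuggestSeparators('1_000_000')); // false
  print(shouldSuggestSeparators('100000')); // false
}
```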

Library issues

Should int.parse/double.parse accept inputs with underscores? I think it's fine to not accept such input. It is not generated by int.toString(), and if a user has a string containing such an input, they can remove underscores manually before calling int.parse. That is not an option for source code literals.
I'd prefer to keep int.parse as efficient as possible, which means not adding a special case in the inner loop.
In JavaScript, parsing uses the built-in parseInt or Number functions, which do not accept underscores, so it would add (another) overhead for JavaScript compiled code.
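Removing the underscores manually could look roughly like this. The helper name `tryParseWithSeparators` is hypothetical and not part of dart:core; it is deliberately lenient and does not enforce the "only between digits" rule.

```dart
// Hypothetical helper, not part of dart:core. It strips all underscores and
// then defers to int.tryParse, so int.parse itself stays unchanged.
// Note: this is lenient and would also accept inputs such as '_100_'.
int? tryParseWithSeparators(String input) =>
    int.tryParse(input.replaceAll('_', ''));

void main() {
  print(tryParseWithSeparators('1_000_000')); // 1000000
  print(tryParseWithSeparators('0x00_ff')); // 255
  print(tryParseWithSeparators('12_ab')); // null
}
```

The extra pass over the string is paid only by callers that opt in, which keeps the inner loop of int.parse free of a special case.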

Related work

Java digit separators.

feature small-feature state-backlog

Source

lrhn

👍43

All 12 comments

+1!

eernstg on 29 Jun 2018

👍5

_ seems to be least confusing and non-intrusive syntax.

tejainece on 29 Jun 2018

👍8

My feeling has always been that if you need separators in your number literal, you have likely already done something wrong. Instead of separators, create a const expression that shows where that large number is coming from.

Instead of:

const largeThing = 100000000000000000000;
const bigHex = 0x4000000000000000;

Consider, say:

const msPerSecond = 1000;
const nsPerMs = 1000000;
const largeThing = 100000000 * nsPerMs * msPerSecond;

const bigHex = 1 << 62;

This has the advantage of being easier to read and showing why these constants have these values. You do sometimes run into big arbitrary literals coming from empirical measurements or other things, but those tend to be fairly rare.

Given that number separators add confusion around how things like int.parse() behave, and there are "workarounds" that actually lead to clearer code, I've never felt they carried their weight.

munificent on 24 Jul 2018

👎13 👍2

@munificent How many digits are there in 100000000000000000000 and 0x4000000000000000? You gotta get a cursor and count. Instead, if you put an _ before every 4 digits, you can say x parts * 4 (for hex; 3 for currency, etc.).

It is not always possible to decompose a number into its composite parts.

tejainece on 24 Jul 2018

👍4

FWIW, I'd like to suggest using quotes, like we already have for string literals: 100__000_000__000_000__000_000 might instead be expressed as `100 000,000 000,000 000,000`. If you'd rather reserve the ` character for some future use, one could consider n'100 000,000 000,000 000,000', modeled after raw strings. Either way, there are more choices for the separator characters inside quotes.

If this repo is not the right place for unsolicited opinions from non-Dart team members, sorry to bother you.

pschiffmann on 2 Aug 2018

👍1

"It is not always possible to decompose a number into its composite parts."

I think one of these is usually true:

The number can be decomposed into smaller meaningful parts.
The number is some arbitrary empirical constant in which case a human will rarely need to scrutinize the individual digits.

So, in either case, I don't think it's a high priority to be able to easily read very large number literals.

munificent on 3 Aug 2018

👎3 👍2

I agree with munificent. But I think Dart needs the exponentiation operator ** for some cases:

const largeThing = 10**14;

Dart should then allow exponentiation of constant numbers to be a constant value, which is not possible with pow(10,14) at the moment.

kasperpeulen on 15 Aug 2018

👍2 👎1

While an exponentiation operator can solve some issues, it won't make me get 0x7FFFFFFFFFFFFFFF right. That is a valid 64-bit integer literal (at least if I counted the F's correctly).

(There is also the option of exponential notation for integers: 1p22 or 0x1p62 as short for 1 * 10**22 and 0x1 * 16 ** 62, like we have for doubles using e).

lrhn on 27 May 2019

Maybe it's not high priority, but it would definitely be a useful thing. See an example:

log.d("Built in ${stopwatch.elapsedMicroseconds / 1000000} s");

and compare with the snippet below:

log.d("Built in ${stopwatch.elapsedMicroseconds / 1_000_000} s");

Just a simple and readable one-liner. No need for adding multiplication or extra variables for readability, like:

log.d("Built in ${stopwatch.elapsedMicroseconds / (1000 * 1000)} s");

var multiplier = 1000 * 1000;
log.d("Built in ${stopwatch.elapsedMicroseconds / multiplier} s");

AdamskiMarcin on 7 Oct 2019

Any updates on this?

Levi-Lesches on 1 Jun 2020

I'd love to see this. Particularly with colours in Flutter, where I usually have something like const Color(0xff3F4E90). I would much prefer const Color(0xff_3F4E90), because the 0xff makes it more difficult to read the actual color at a glance.