Roslyn: Proposal: Digit separators

Created on 3 Feb 2015 · 41 Comments · Source: dotnet/roslyn

Being able to group digits in large numeric literals would have great readability impact and no significant downside.

Adding binary literals (#215) would increase the likelihood of numeric literals being long, so the two features enhance each other.

We would follow Java and others, and use an underscore _ as a digit separator. It would be able to occur everywhere in a numeric literal (except as the first and last character), since different groupings may make sense in different scenarios and especially for different numeric bases:

    int bin = 0b1001_1010_0001_0100;
    int hex = 0x1b_a0_44_fe;
    int dec = 33_554_432;
    int weird = 1_2__3___4____5_____6______7_______8________9;
    double real = 1_000.111_1e-1_000;

Any sequence of digits may be separated by underscores, possibly more than one underscore between two consecutive digits. They are allowed in decimals as well as exponents, but following the previous rule, they may not appear next to the decimal (10_.0), next to the exponent character (1.1e_1), or next to the type specifier (10_f). When used in binary and hexadecimal literals, they may not appear immediately following the 0x or 0b.

The syntax is straightforward, and the separators have no semantic impact - they are simply ignored.
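As a small illustration of "simply ignored", here is a sketch that assumes a compiler accepting binary literals and digit separators as described in this proposal (C# 7.0 and later do). The class and member names are made up for the example, and the real literal is shortened to avoid the extreme exponent.

```csharp
using System;

// Hypothetical demo class; the names are illustrative only.
public static class DigitSeparatorDemo
{
    // Literals taken from the proposal. The underscores only group digits.
    public const int Bin = 0b1001_1010_0001_0100;
    public const int Hex = 0x1b_a0_44_fe;
    public const int Dec = 33_554_432;
    public const int Weird = 1_2__3___4____5_____6______7_______8________9;
    public const double Real = 1_000.111_1;

    // Forms the rules above exclude: 10_.0, 1.1e_1, 10_f and,
    // as specified in this proposal, 0x_ff.

    public static void Main()
    {
        // Each comparison is against the same value written without separators.
        Console.WriteLine(Bin == 0b1001101000010100);
        Console.WriteLine(Hex == 0x1ba044fe);
        Console.WriteLine(Dec == 33554432);
        Console.WriteLine(Weird == 123456789);
        Console.WriteLine(Real == 1000.1111);
    }
}
```

Every comparison should print True, since the separators do not take part in the value.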

This has broad value and is easy to implement.

Area-Language Design Feature Request Feature Specification Language-C# Resolution-External

MadsTorgersen

👍7

Most helpful comment

Using space as a separator would probably be a bad idea, because it would cause hard-to-spot mistakes. For instance, int[] numbers = { 1 2 } _looks_ like an array with the numbers 1 and 2, but it would actually be an array with only the number 12. Forgetting a comma would silently change the meaning of the code, instead of causing an error.

thomaslevesque on 7 Feb 2015

👍14

All 41 comments

Does this apply to real literals as well? For example, would 1_0_._5_e_-_1_6_m_ be valid?

I have no idea if this would be useful, just curious.

svick on 3 Feb 2015

👍1

👍

monoman on 4 Feb 2015

Don't shoot me, but would it be too hard to parse "space" as a separator? Or does that make the grammar ambiguous?

    int two = 0b 10;
    short max = 0x ffff;
    long oneMillion = 1 000 000;

Just thinking out loud.

chrisaut on 4 Feb 2015

👍11

Digit separators were included in VB.NET (vNext CTP). Would it be beneficial to also describe what was allowed in VB? I think 1_000 was allowed but 1__000 wasn't.
Comma usage could be an issue as it would clash with array literals: you couldn't tell what was a number and what was an array element.

AdamSpeight2008 on 4 Feb 2015

I agree, I like space more than underscore. It's generally easier to type and makes it easier when working with something like hex numbers, e.g.
0x8080 8080 8080 8080UL is so much easier to read and to check that I've filled all the slots, compared with something like 0x8080808080808080UL, where I have to sit and count to see whether I got 16 characters or only typed 14 or something. How about ' as well?

mburbea on 4 Feb 2015

👍2

I don't see how you could use a space as the separator because numeric literals would then potentially consist of not one but several tokens. This would make them very difficult to parse.

The underscore seems the best choice of separator to me, particularly as it's already used by several other languages.

I'm not so keen on allowing multiple consecutive underscores but I suppose it does no harm.

alanfo on 6 Feb 2015

This grammar wouldn't allow consecutive separators.

  digit ::= '0' - '9'
  sep   ::= '_'
 prefix ::=
literal ::= prefix (sep? digit)+

I think spaces could also be possible

    digit ::= '0'-'9'
separator ::= ' '
  literal ::= digit (separator? digit)*
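To make the two grammars easier to try out, here is a rough translation into regular expressions. It is only a sketch: the prefix rule is left out because it is empty above, only decimal digits are handled, and the class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of any real API.
public static class SeparatorGrammar
{
    // literal ::= (sep? digit)+      with sep = '_'
    private static readonly Regex Underscore = new Regex(@"^(?:_?[0-9])+\z");

    // literal ::= digit (separator? digit)*      with separator = ' '
    private static readonly Regex Space = new Regex(@"^[0-9](?: ?[0-9])*\z");

    public static bool MatchesUnderscoreGrammar(string s)
    {
        return Underscore.IsMatch(s);
    }

    public static bool MatchesSpaceGrammar(string s)
    {
        return Space.IsMatch(s);
    }

    public static void Main()
    {
        Console.WriteLine(MatchesUnderscoreGrammar("1_000"));  // single separators are accepted
        Console.WriteLine(MatchesUnderscoreGrammar("1__000")); // consecutive separators are rejected
        Console.WriteLine(MatchesSpaceGrammar("1 000 000"));
    }
}
```

One thing the translation makes visible: the first grammar puts the optional separator before each digit, so it also accepts a leading separator such as `_1000`, while the second grammar always starts with a digit.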

AdamSpeight2008 on 6 Feb 2015

I think it would be very hard to use spaces.

I haven't looked at the parser, but it's probably doing something like breaking the text at whitespace, parentheses, braces, whatever, and analyzing the tokens from there. Assuming that the rest of a numeric literal might follow after a space is doable, but I don't think it is worth the cost.

And what next? This?

var a = 1111
        1111
        1111
        1111;

Or this?

var a = 1111    // comment
        1111    // comment
        1111    // comment
        1111;   // comment

Although it might be a little harder to type on most keyboard layouts, the semantic break in the numeric literal is the same with the _, and I would argue that it's even better because it gives both separation and cohesion.

paulomorgado on 6 Feb 2015

Wonder if the parser supports significant whitespace?

AdamSpeight2008 on 6 Feb 2015

The VB implementation of digit group separators prototyped last year actually supported three different separators originally: underscore, back tick, and space. So you could write &B1111 0010 or 1_000_000 or 3`600. We quickly decided that back tick didn't make enough sense to anyone and cut it. The VB preview still supported both underscores and spaces. The biggest motivation for spaces was binary literals, another feature prototyped at the same time, because binary numbers are conventionally separated with spaces.

As to implementation, it's not hard at all really - at least in VB, particularly when you don't allow multiple consecutive separators. Normally the scanner encounters a digit and starts scanning an integral literal one character at a time until it encounters a character that's not a digit for the base being used (decimal, hex, octal); then it stops. We changed it so that if the non-digit character is an underscore or a space, it peeks one more character ahead, and if that character is a digit it keeps scanning as a single token. There are some corner cases you have to put extra recovery around, but it's not very complicated, particularly because in VB it's not valid to have two integer literals follow one another, so it's non-breaking to interpret 1 1 as 11. I think C# is the same here, though in C# we were pretty settled that underscore would be the sole separator.

I think the biggest concern about that is that tools would be confused thinking the space was a word boundary (not VS, the editor is smart enough in VS to handle space) and we just couldn't foresee what havoc spaces would be unleashing on the world (if any).

Another more minor concern was complexity - would users benefit more from having a single consistent separator used everywhere? If we decided to pick one it would likely be the underscore so space was only a possibility if we were ok with having two separators which was an open question.
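The peek-ahead change described in this comment can be sketched roughly as follows. This is not the VB or Roslyn scanner: it is a minimal illustration that handles decimal digits only, allows a single separator between digits, and uses invented names (`SeparatorScanner`, `ScanDecimal`, `allowSpace`).

```csharp
using System;
using System.Text;

// Hypothetical scanner sketch; not the actual VB or Roslyn implementation.
public static class SeparatorScanner
{
    private static bool IsDecimalDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    // Scans one decimal integer token starting at text[start].
    // Returns the digits with separators dropped; 'end' is the index just past the token.
    public static string ScanDecimal(string text, int start, bool allowSpace, out int end)
    {
        var digits = new StringBuilder();
        int i = start;
        while (i < text.Length)
        {
            char c = text[i];
            bool isSeparator = c == '_' || (allowSpace && c == ' ');
            if (IsDecimalDigit(c))
            {
                digits.Append(c);
                i++;
            }
            else if (isSeparator && digits.Length > 0
                     && i + 1 < text.Length && IsDecimalDigit(text[i + 1]))
            {
                // Peek one character ahead: a digit follows, so stay in the same token.
                i++;
            }
            else
            {
                break; // anything else ends the token
            }
        }
        end = i;
        return digits.ToString();
    }

    public static void Main()
    {
        int end;
        Console.WriteLine(ScanDecimal("1_000_000;", 0, false, out end));
        // With spaces allowed, the two numbers in "{ 1 2 }" are scanned as one token.
        Console.WriteLine(ScanDecimal("{ 1 2 }", 2, true, out end));
    }
}
```

With `allowSpace` set, scanning `{ 1 2 }` from the first digit yields the single token 12, which is the behavior the array example in the most helpful comment warns about.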

-ADG

AnthonyDGreen on 7 Feb 2015

thomaslevesque on 7 Feb 2015

👍14

@thomaslevesque that is a very good point, before I suggested it I quickly tried to think of places where two numbers would follow each other, but I had totally missed this obvious one. I think that is probably a deal breaker.

Seems generally people are not for using space, and I think I have come to agree with this point. Still don't like how "1_000" looks, but it might be the best and easiest option.

chrisaut on 7 Feb 2015

Isn't this proposal about digit separators for literals that have a prefix?

AdamSpeight2008 on 7 Feb 2015

@AdamSpeight2008, no, it's for all numeric literals.

thomaslevesque on 7 Feb 2015

@AdamSpeight2008, we did consider restricting space in particular to its most obvious use case - binary literals. It would be unusual, but I think it's worth considering if it gives us more confidence in the feature.

@thomaslevesque, @chrisaut, I find that developers tend to bias negatively on what would confuse other developers and how often. Just about every feature ever proposed or introduced has someone saying "this will cause hard to spot mistakes for everyone ever". There are also features which at first seem harmless - then later turn out to be pits of failure. Fortunately, with "Roslyn" and a managed code base it's much easier to quickly prototype language features - even the scary ones and experiment and make decisions after making observations. I think that will give us the most room to explore the full potential of the language without being committed to doing or not doing a feature a particular way too early. It's still very _very_ early in the design of VB15 (this idea has 0% chance of making it into C#) and given how often space has been proposed or preferred by different VB users we've spoken to I'd hate to cut the idea down prematurely if it could actually produce a better experience for those users.

Regards,

-ADG

AnthonyDGreen on 8 Feb 2015

I'd say ' or ` are better choices than _:

- They're easier to type (single keystroke instead of combination)
- Even though they sit at the top of the line, they look more like the commas and dots that are used as digit separators in various cultures
- The _ might be useful for other features, such as user-defined literals.

mikedn on 8 Feb 2015

❤1 👍1

@mikedn

> They're easier to type (single keystroke instead of combination)

Sadly this holds true only for the US keyboard layout. At least in the German layout, all three require two keystrokes. Only space is one keystroke here, too.

d-kr on 8 Feb 2015

👍2

Agreed ` or ' are undesirable for the reasons already mentioned. I actually don't mind using _ as a separator at all, and, frankly, anything here is better than nothing :)

Using space seems like a recipe for conflicts all over the place, and I don't see it adding that much value. I dislike the idea of allowing multiple, alternative separators: while anyone reusing Roslyn wouldn't care, other tools doing their own lexing of C# code would have to do much more work.

tomasr on 8 Feb 2015

' is used for a comment in VB.net

AdamSpeight2008 on 9 Feb 2015

In VB.net _ is also used as a line continuation.
Would that cause a misread of the user's intent?

AdamSpeight2008 on 9 Feb 2015

@mikedn, @tomasr: ` or ' is good only for decimal digits. Let's look at other cases:

    int bin = 0b`1001`1010`0001`0100;
    int hex = 0x1b`a0`44`fe;
    int dec = 33`554`432;
    int weird = 1`2``3```4````5`````6``````7```````8````````9`````````;

I think _ is better because it is more universal.

ViIvanov on 10 Feb 2015

@ViIvanov ` and ' make it look like the numbers are indicating degrees, or feet and inches.

AdamSpeight2008 on 10 Feb 2015

@AdamSpeight2008, in VB the explicit line continuation is actually to ensure that the underscore is never a trailing character of an identifier or other token so it wouldn't be a problem.

I agree that ` and ' look more like units of measurement. _ has a precedent in identifiers as a chunk separator. Space is used for binary numbers in particular and has been recommended by various bodies as a standard separator alternative to either comma or period (http://en.wikipedia.org/wiki/Decimal_mark#Digit_grouping)

I haven't seen a good scenario for multiple consecutive separators yet and am likely to advocate disallowing them.

AnthonyDGreen on 11 Feb 2015

Just to reinforce what @AnthonyDGreen and @d-kr said, the Portuguese keyboard layout requires me to type **[SHIFT]**+**[]** followed by [SPACE] if the following character is a vowel.

You couldn't possibly imagine how hard it was for me to type code in Markdown.

paulomorgado on 11 Feb 2015

I like this proposal. But I don't know why this restriction is necessary: "When used in binary and hexadecimal literals, they may not appear immediately following the 0x or 0b."
I feel like
int bin = 0b_1001_1010_0001_0100;
is much better than
int bin = 0b1001_1010_0001_0100;
and I can't imagine any problem with allowing this.

jveselka on 26 Jun 2015

👍12

@jveselka Me too, especially since the general grammar would be literal ::= prefix (sep? digit)+

AdamSpeight2008 on 26 Jun 2015

@gafter So the final decision is disallowing separators immediately after prefixes?

yume-chan on 3 Jun 2016

@CnSimonChan I think it is implemented in the Future branch, but it needs the feature flag to be set (or the language version to be VB15). I'm not sure if these features are available by default in that version (15) of the language.

AdamSpeight2008 on 3 Jun 2016

@zippec: Completely agree. @jaredpar should we break out the feature request for 0x_1001_1000 to be valid into a separate issue?

jskeet on 22 Jul 2016

👍1

@jskeet yes let's use a separate issue since this feature is implemented as spec'd here. We can use the new issue to track changing to allow that syntax.

jaredpar on 22 Jul 2016

Would be nice. Although space feels less C#ish, I still vote for spaces, I mean can it go wrong as long as we're expecting a ;?
Anyway, I think it should only be allowed in binary/hex/o̷c̷t̷ etc.?

weitzhandler on 6 Mar 2017

@weitzhandler, I think that changing C# 7 and Visual Studio for tomorrow is, most probably, out of the question. 😄

paulomorgado on 6 Mar 2017

😄1

var a = 1                                                                                                                                                                                                                                      0;

is actually just ten?

alrz on 6 Mar 2017

👍1

@alrz, that's no worse than

var a = 1______________________________________________________________________________________________________________________________________________________________________________________________________________________________________0;

The greater issue here is that, in this particular case and only in this particular case, space is a special case for white spaces. And that's bad. Very bad.

paulomorgado on 6 Mar 2017

@paulomorgado

No, the space is worse because it's invisible. In your example it's impossible to overlook the zero because the literal goes on and on. and on.

alrz on 6 Mar 2017

👍1

Limit to single space (surely no line breaks 😡).

_ is definitely more C#ish anyway.
And separation only makes sense in binary/hex.

weitzhandler on 6 Mar 2017

@weitzhandler No it doesn't. C# doesn't mind how many spaces you are using between tokens at all.

alrz on 6 Mar 2017

We should keep the discussion here.

weitzhandler on 6 Mar 2017