Toml: Signed zero

Created on 9 Feb 2016 · 23 Comments · Source: toml-lang/toml

The section on Integers and Floats does not appear to specify whether zero may be signed, i.e., whether +0 or -0 are allowed, and if so what they mean.

For integers, they could be disallowed or ignored, as +0, 0 and -0 cannot be distinguished.

For floating point numbers some support is probably wise, as floating point numbers typically distinguish between +0 and -0.
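To make the distinction concrete, here is a minimal Python sketch. It assumes Python floats are IEEE 754 doubles (true on common platforms); the helper name `is_negative_zero` is made up for this illustration and is not part of TOML or any library.

```python
import math

def is_negative_zero(x: float) -> bool:
    """Return True only for the IEEE 754 value -0.0 (hypothetical helper)."""
    return x == 0.0 and math.copysign(1.0, x) < 0

# Integers have a single zero: the sign is accepted but carries no information.
int_zeros = [int("+0"), int("0"), int("-0")]

# Floats compare equal across signed zeros, yet the sign is still observable.
pos_zero = float("+0.0")
neg_zero = float("-0.0")
zeros_compare_equal = (pos_zero == neg_zero)
```

Note that `==` cannot tell the two float zeros apart; only sign-aware functions such as `math.copysign` can.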

jodastephen

👍1

Most helpful comment

This is a strange and somewhat dismaying turn of events. I cannot find a single programming language or data format where the syntax -0 is not allowed for integers:

$ irb
irb(main):001:0> -0
=> 0

$ python
Python 2.7.12 (default, Oct 11 2016, 05:20:59)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> -0
0

$ node
> -0
-0

$ julia -q
julia> -0
0

$ php
-0
-0

$ R
> -0
[1] 0

I could go on – it's allowed in C, C++, Java, JSON, YAML, etc. – literally every single programming language and data format I can find. In languages where 0 is a floating-point value, the - is meaningful and preserved, in languages where it's an integer, it's insignificant but allowed.

> Less cynically, I personally think it's best to not let someone write -0 or +0 since they're adding information that is anyways gonna be dropped.

The TOML spec explicitly allows writing +123 yet that + sign is completely redundant. How is zero special in this regard?

> The reason that programming languages allow +0 and -0 is because they need the operators to not raise an error for zero, which is not a concern here because TOML does not operate on these values.

I'm not sure what this means but this is not why programming languages allow the syntax – they allow it because it's much simpler and more orthogonal for the syntax to allow signs in front of numbers full stop instead of adding ad hoc syntax exceptions like this. This is the code smell I was referring to – this change makes parsing TOML harder, not easier:

before: all numbers may be prefixed with signs
after: all numbers may be prefixed with signs, except the number zero.
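The two rules can be compared side by side as regular expressions. This is only a sketch under simplifying assumptions: it covers decimal integers without underscores, keeps the existing no-leading-zeros rule, and the names `SIGNED_ANYWHERE`, `NO_SIGNED_ZERO` and `accepts` are invented for the example.

```python
import re

# Rule "before": an optional sign may precede any integer.
SIGNED_ANYWHERE = re.compile(r"[+-]?(?:0|[1-9][0-9]*)")

# Rule "after": zero is a special case that may not carry a sign,
# so the sign can no longer be factored out in front of the whole number.
NO_SIGNED_ZERO = re.compile(r"0|[+-]?[1-9][0-9]*")

def accepts(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None
```

Under the second rule `accepts(NO_SIGNED_ZERO, "-0")` is false while `accepts(NO_SIGNED_ZERO, "+123")` is still true, which is the asymmetry the comment objects to.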

> Also, even when disallowed directly, I suppose if you REALLY needed it, you could fake it by specifying something you know will underflow like -1e-1000.

-1e-1000 is allowed and parses as an IEEE -0.0? So you can write the value, but not with the standard syntax? Overall, this issue seems to confuse syntax and semantics considerably.

StefanKarpinski on 25 Nov 2017

👍4

All 23 comments

In a similar vein, please reconsider the decision not to allow signed infinities or not-a-number (NaN) values in the spec.

These are very important representations of IEEE floating-point numbers and are often used in configuration files to specify initial values or tolerances.

In fact there are many distinct NaN values. libc usually provides a single default one, but NaNs in general are complex and can be signaling or non-signaling.

That is why almost every serialization format for IEEE floating-point numbers also ensures that hex floating-point representations are accepted. There can be a lot of surprising and subtle nuances when converting from a decimal representation to a binary representation and vice versa. These conversions are so fraught with non-obvious traps that even LIBC added hex floats years ago!
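For readers unfamiliar with hex floats, a short Python sketch shows what they buy you: the hex form spells out the stored binary value exactly, whereas the decimal literal `0.1` is only an approximation of the stored double. This uses Python's built-in `float.hex` and `float.fromhex`; it says nothing about what TOML itself should accept.

```python
from decimal import Decimal

x = 0.1

# Hex notation writes the significand and binary exponent of the double exactly.
hex_text = x.hex()
from_hex = float.fromhex(hex_text)

# The decimal literal 0.1 is not exactly representable in binary:
# Decimal(x) exposes the value that is actually stored.
stored_exactly = Decimal(x)
is_exactly_one_tenth = (stored_exactly == Decimal("0.1"))
```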

adfernandes on 27 Dec 2016

Yes to: infinities, signed zeros and a single NaN value. No to different NaN values – this is rarely used for anything meaningful.

StefanKarpinski on 3 Mar 2017

👍3

Agreed about the NaNs - apologies if I wasn't clear. A single non-signalling canonical NaN is fine. That's what libc does, after all!

adfernandes on 6 Mar 2017

> For integers, they could be disallowed or ignored, as +0, 0 and -0 cannot be distinguished.

Made #492 for this.

pradyunsg on 17 Nov 2017

@StefanKarpinski Would you mind elaborating on the utility of signed zeros for integers? I'm curious what use cases you see.

mojombo on 23 Nov 2017

To clarify: I don't think signed zeros should be supported for integers at all, although allowing integer zero to be written as -0, 0 or +0 seems reasonable. I was referring to what floating-point values should be supported, which I still think is:

- all finite 64-bit binary IEEE floating-point values including distinct ±0.0
- ±Inf
- NaN – but only one of them

Accordingly, it makes sense to have literal syntax for all of these as float values. The finite values have the usual syntax as already spelled out in the spec – just make sure to distinguish 0.0 === +0.0 from -0.0. For non-finite values, it would make sense to allow all case-insensitive variations of -Inf, Inf, +Inf and NaN as floating-point literals. Case insensitivity is motivated by the fact that different systems tend to be inconsistent with the capitalization when printing these, but the names "inf" and "nan" are fairly standard (although some systems use "infinity" instead of "inf"); it also matches the case-insensitivity of parsing the e in float literals like 1e6 === 1E6.
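A minimal sketch of the case-insensitive non-finite literals described above, written in Python. The function name `parse_special_float` is hypothetical, and the accepted set (signed `inf`, unsigned `nan`, any capitalization) is this comment's proposal rather than anything the TOML spec says at this point in the thread.

```python
import math

def parse_special_float(text: str) -> float:
    """Parse a case-insensitive inf/nan literal (hypothetical helper)."""
    lowered = text.lower()
    if lowered in ("inf", "+inf"):
        return math.inf
    if lowered == "-inf":
        return -math.inf
    if lowered == "nan":
        return math.nan
    raise ValueError(f"not a special float literal: {text!r}")
```

For example, `parse_special_float("-INF")` yields negative infinity, while a longer spelling such as `"infinity"` is rejected by this sketch.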

The motivation for supporting non-finite values as TOML literals is that applications may expect and allow them as meaningful inputs and/or configuration parameters. Especially in numerical applications, ±Inf can be useful and -0.0 and +0.0 may be treated differently. There are technically a large number of distinct NaN values (with different payload bits), and the leading bit of the significand distinguishes between quiet and signaling NaNs. In practice, however, modern compilers and CPUs don't support the quiet/signaling distinction, almost no applications use NaN payloads, and there is no standard syntax for printing different NaN values, so only having a single NaN seems like the sane thing to do.
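The bit-level picture can be sketched in Python with the standard `struct` module. The helper names and the two NaN bit patterns below are chosen for illustration only; whether a given platform preserves payload bits through every operation is not something this sketch relies on.

```python
import math
import struct

def double_from_bits(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def bits_from_double(x: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# Exponent field all ones, leading significand bit set: a quiet NaN.
QUIET_NAN_BITS = 0x7FF8000000000000
# Same pattern with one extra payload bit set: a different quiet NaN.
QUIET_NAN_WITH_PAYLOAD_BITS = 0x7FF8000000000001

nan_a = double_from_bits(QUIET_NAN_BITS)
nan_b = double_from_bits(QUIET_NAN_WITH_PAYLOAD_BITS)

# The other special values each have exactly one bit pattern.
neg_zero_bits = bits_from_double(-0.0)
pos_inf_bits = bits_from_double(math.inf)
```

Both `nan_a` and `nan_b` satisfy `math.isnan`, and Python prints both as `nan`, which illustrates why a text format gains little from distinguishing them.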

StefanKarpinski on 23 Nov 2017

@StefanKarpinski Thanks for the clarification. I'm struggling a bit with this one. After reviewing the IEEE 754 spec, it does seem like TOML offers incomplete support for floats, and adding the special values would make sense. Though I think they might be rarely used, I can indeed think of scenarios in which ±inf would be useful (less so for -0.0 and nan). The problems arise when I try to pass them through the objective of minimalism that TOML strives for. Because you're right that if we add the special values, we should also add hex floating point as well, to truly and properly support IEEE 754 floats. That's a lot of weight to add to what could otherwise be a conceptually very simple float type.

As a bit of additional context, JSON does not support these special values, but YAML does.

I'd enjoy hearing from @pradyunsg @BurntSushi @alexcrichton on this.

mojombo on 24 Nov 2017

> Because you're right that if we add the special values, we should also add hex floating point as well, to truly and properly support IEEE 754 floats.

That was @adfernandes arguing for hex float syntax, not me – I agree that hex float syntax is unnecessary. It's rarely used and is redundant as long as you know the type of floating-point number that you're expressing (Float64) since enough decimal digits let you express any possible value precisely.
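The claim that enough decimal digits can express any double can be checked empirically. The sketch below assumes Python floats are IEEE 754 doubles and that Python's decimal formatting and parsing are correctly rounded; it samples random bit patterns and round-trips each finite value through 17 significant decimal digits. The helper name `random_finite_double` is made up for the example.

```python
import math
import random
import struct

def random_finite_double(rng: random.Random) -> float:
    """Draw a double from a random 64-bit pattern, skipping inf and NaN."""
    while True:
        bits = rng.getrandbits(64)
        x = struct.unpack("<d", struct.pack("<Q", bits))[0]
        if math.isfinite(x):
            return x

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [random_finite_double(rng) for _ in range(1000)]

# Write each value with 17 significant digits, parse it back, and compare.
all_round_trip = all(float(format(x, ".17g")) == x for x in samples)
```

A sample of this kind is evidence rather than proof, but it matches the well-known result that 17 significant digits suffice for 64-bit binary floats.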

That said, the above is not all-or-nothing since you can add support to newer TOML specs for ±Inf and NaN without breaking old documents since presumably those syntaxes will be invalid until then. So you could start by adding ±Inf and see if there's any need for NaN.

StefanKarpinski on 24 Nov 2017

> I'd enjoy hearing from @pradyunsg

Ahoy!

My inputs on this would be:

- disallow signed zero integers -0 and +0 (#492)
- disallow signed zero floats -0.0 and +0.0
- disallow NaN
- avoid hex float representation. It's useful in _much rarer_ cases than infinity.
- maybe... add a note in "Float" section suggesting to parse a string if one wants NaN/signed zeros
- add support for signed infinities with only one lowercase spelling, in line with true/false
  - inf because infinity is too long.

Reasons:

- This is simple to understand and consistent.
  - Everyone who's done some math (6th grade?) knows what infinities and decimal numbers are.
  - NaN and signed zeros are a nuanced topic in a computer's internal representation of data.
- Only one spelling? "There should be one-- and preferably only one --obvious way to do it."
- IEEE 754 is one function call away most of the time anyway today. You can use a string if you care about NaN/signed zeros.
  - Python: float(my_IEEE_754_str)
  - Javascript: parseFloat(my_IEEE_754_str)
  - C: atof(my_IEEE_754_str)
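The string workaround suggested here might look like the following Python sketch. The `config` dictionary stands in for an already-parsed TOML table and its keys are invented for the example; no TOML library is involved.

```python
import math

# Hypothetical parsed TOML table in which special values were stored as strings.
config = {"tolerance": "nan", "lower_bound": "-inf", "offset": "-0.0"}

# One function call per value turns the strings into IEEE 754 doubles.
values = {key: float(text) for key, text in config.items()}

offset_is_negative_zero = (
    values["offset"] == 0.0 and math.copysign(1.0, values["offset"]) < 0
)
```

One caveat with this approach is that the accepted spellings depend on the host language: Python's `float` takes `inf` and `nan`, whereas JavaScript's `parseFloat` recognizes `Infinity` but not `inf`, so documents relying on it are not portable between implementations.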

Aside, signed zeros: aren't they essentially for retaining sign information about a calculation once it goes outside of the range? Like rounded from below zero vs rounded from above zero. I don't see why someone would use them in a configuration.

pradyunsg on 24 Nov 2017

My view as the OP is that it should be possible to take unique representations and pass them to and fro via this format without losing meaning. Java for example supports -0.0, NaN, +Inf and -Inf, so those would be the four special cases I would argue for (I think that this is the standard set of special values normally seen in different languages). I also think it would be surprising to have support for infinity and not NaN. StefanKarpinski also expressed support for these four special values.

Hex representations and anything other than a single spelling seem unnecessary.

However, as per #390, I believe that parsers should be case insensitive (Postel's Law) wrt NaN and Inf. This doesn't mean that the spec can't recommend a certain case.

jodastephen on 24 Nov 2017

Having thought about it a bit more, I think that IEEE is a red herring here and non-finite and hex literals are independent features. Hex literals are not part of the IEEE standard and Inf and NaN are ubiquitous in non-IEEE floating-point formats (MPFR, GMP, ArbFloat, etc.). Decimal literals are sufficient to write any finite value up to a given precision, so hex literals don’t add any expressiveness. Without Inf and NaN literals, on the other hand, there’s no way to express non-finite values.

StefanKarpinski on 24 Nov 2017

> I agree that hex float syntax is unnecessary. It's rarely used and is redundant as long as you know the type of floating-point number that you're expressing (Float64) since enough decimal digits let you express any possible value precisely.

Ok fair enough. Then I think we can leave that out for sure.

> Aside, signed zeros: aren't they essentially for retaining sign information about a calculation once it goes outside of the range? Like rounded from below zero vs rounded from above zero. I don't see why someone would use them in a configuration.

This is an extremely good point. You end up with -0 as the result of an underflow (during a calculation or directly). As TOML is a configuration format, it doesn't make much sense to specify it directly. Also, even when disallowed directly, I suppose if you REALLY needed it, you could fake it by specifying something you know will underflow like -1e-1000.
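The underflow trick is easy to confirm with a parser that follows IEEE 754 rounding. This Python sketch assumes Python floats are IEEE 754 doubles; it also tries the mirror-image case of a literal too large to represent, which rounds to infinity.

```python
import math

# A magnitude far below the smallest subnormal double rounds to zero,
# and the sign of the literal survives the rounding.
tiny_negative = float("-1e-1000")
underflows_to_negative_zero = (
    tiny_negative == 0.0 and math.copysign(1.0, tiny_negative) < 0
)

# A magnitude far above the largest finite double rounds to infinity.
huge_positive = float("1e1000")
overflows_to_infinity = (huge_positive == math.inf)
```

So a format that accepts arbitrary decimal exponents and parses with the usual rules already admits -0.0 and ±inf through the back door, even if it rejects their direct spellings.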

@pradyunsg I like your reasoning on ±inf. I can see reasonable use cases for them as literal values in configuration. I also would prefer to only allow lowercase (same as true/false). Where TOML is case insensitive, it should be because some spec it's piggybacking on is case insensitive (datetime) or so common in practice as to be unavoidable (e in floats).

As for NaN, when considered as necessary for direct input in a config file, I'm coming up short with regards to use cases. Like -0, it is primarily the result of a computational error case and I don't know why you'd be specifying it as a config value.

So my latest thoughts are pretty much what @pradyunsg said, but I'll reiterate:

- disallow signed zero integers -0 and +0 (#492)
- disallow signed zero floats -0.0 and +0.0
- disallow NaN
- no hex float representation.
- add support for signed infinities: -inf, inf, +inf

mojombo on 24 Nov 2017

👍1

I’m not sure what the purpose of actively disallowing signed zeros in the integer syntax is. It’s completely standard in programming languages to allow the sign but only have a single integer zero value. As I commented in a code review on the grammar change, it’s fishy that this complicates the simple grammar for integers so much. Is the concern that it allows more than one way to write zero? If so, it seems that features like underscores between digits should be a much bigger source of concern since it leads to many ways to write every number.

StefanKarpinski on 25 Nov 2017

On the whole this seems like a direction that is going to lead to non-standard TOML dialects. Someone will have a NaN somewhere in their data and use TOML as a serialization format and be annoyed that they can’t save their data as-is. Why do they have a NaN? Who knows. It happens.

It strikes me as odd to say that the underlying format is C double and then not follow standard parsing and printing rules for that type. I suspect that will cause enough frustration that implementors will be tempted to have variations that “just work” the way these things do in their preferred programming language.

StefanKarpinski on 25 Nov 2017

> My view as the OP is that it should be possible to take unique representations and pass them to and fro via this format without losing meaning.

I disagree. I don't think TOML needs to be a round tripping format. FWIW, it isn't one today.

> what the purpose of actively disallowing signed zeros in the integer syntax is.

What is the purpose of allowing signed zeros? The phrases negative zero and positive zero do not make sense unless you know that signed zeros are allowed in the internal representation.

Less cynically, I personally think it's best to not let someone write -0 or +0 since they're adding information that is anyways gonna be dropped. The reason that programming languages allow +0 and -0 is because they need the _operators_ to not raise an error for zero, which is not a concern here because TOML does not operate on these values.

> As I commented in a code review on the grammar change

As I've said there before, let's keep the discussion here. It doesn't make sense to talk about how to do it until we decide what we want to do. As for the grammar complication, I think it's fine but I'd rather not digress right now. :)

> use TOML as a serialization format

It's not a serialisation format?

(edit: grammar stuff)

pradyunsg on 25 Nov 2017

> So -1e-1000 parses as an IEEE -0.0? So you can write the value, but not with the standard syntax?

This argument translates directly to ±inf too.

(edit: yeah I meant inf, not NaN)

lspitzner on 25 Nov 2017

Why not just say that TOML floating-point numbers represent IEEE binary 64-bit floating-point values, and leave the syntax as is, and consider adding ±inf? That seems like the simplest, least ambiguous specification and gives broad interoperability since that format is ubiquitous in hardware and software. TOML doesn't have to say or do anything about signed zeros, it just has to allow them in the syntax and the rest is up to the floating-point parser (which is beyond the scope of the standard). Similarly, allowing signs on integer zeros keeps the standard as simple as possible, and since integer types in languages don't have signed zeros, -0 and 0 represent the same value.

StefanKarpinski on 25 Nov 2017

👍1

Okay. I'm changing my position from "simple" floats to "complete" floats.

Something I'd missed before -- The spec says "64-bit (double) precision expected."

TOML is already (sort of) saying use this standard, so I do think it'll also be fine if it switches to "TOML has full support for IEEE 754 double precision floating point numbers (with NaN and inf)" and have a couple of lines about spelling inf and NaN. It'll be sort of like how TOML offloads definitions to RFC 3339 for datetimes.

integer = [ sign ] 1*DIGIT

float = integer [ "." 1*DIGIT ] [ "e" integer ] / named-float
named-float = [ sign ] %x69.6E.66 / %x4E.61.4E; +inf, -inf, NaN
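One way to read the ABNF draft above is as the following Python regular expression. This is a sketch of the proposal as written in this comment, not of the final TOML grammar: it ignores underscores and leading-zero rules, and the names `INTEGER`, `FLOAT` and `parse_float` are invented here. Note that ABNF quoted strings such as "e" are case-insensitive, while the `%x` byte sequences for inf and NaN are case-sensitive.

```python
import re

# integer = [ sign ] 1*DIGIT
INTEGER = r"[+-]?[0-9]+"

# float = integer [ "." 1*DIGIT ] [ "e" integer ] / named-float
# named-float = [ sign ] "inf" / "NaN"   (exact case, sign only on inf)
FLOAT = re.compile(
    rf"{INTEGER}(?:\.[0-9]+)?(?:[eE]{INTEGER})?"
    r"|[+-]?inf"
    r"|NaN"
)

def parse_float(text: str) -> float:
    """Validate text against the sketched grammar, then convert it."""
    if FLOAT.fullmatch(text) is None:
        raise ValueError(f"not a float under the sketched grammar: {text!r}")
    # Python's float() understands every spelling the pattern admits.
    return float(text)
```

With this reading, `parse_float("-0.0")` and `parse_float("-inf")` need no special handling, while `parse_float("Inf")` is rejected because the draft fixes the capitalization. As drafted, the rule also lets a bare integer such as `42` match `float`, so a real grammar would need to disambiguate the two types.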

> you can write the value, but not with the standard syntax?

This is indeed weird. If it's possible anyhow, I think it's better to treat it as first class.

> TOML doesn't have to say or do anything about signed zeros

Fair enough.

> -1e-1000 is allowed and parses as an IEEE -0.0?

If the world was decimal128, it'd have stored the value as is. Alas, the world isn't that good yet. :-(

pradyunsg on 25 Nov 2017

👍1

I like this direction much more :)

Piggybacking on IEEE 754 doubles (64-bit binary floats) seems like an excellent idea – you don't want to get into the business of making floating-point specs if you can avoid it. Despite its foibles (like -0.0 and many different NaN values and non-reflexive equality of NaN), IEEE 754 is solid, it's the workhorse of numerical work and it is absolutely everywhere.

If you're going to spell inf in lowercase, it seems sensible to spell nan in lowercase too. That's how C's printf functions print these values, so it seems as close to a standard choice as there is and matches the lowercase literals true and false.
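Python follows the same lowercase convention when printing, which a short sketch can show alongside the non-reflexive equality of NaN mentioned above. Nothing here is TOML-specific; it only illustrates the behavior of IEEE 754 doubles in one common language.

```python
import math

nan = float("nan")

# Non-finite values print in lowercase, both via repr() and %-formatting.
printed = [repr(math.inf), repr(-math.inf), repr(nan), "%f" % math.inf]

# NaN is the one value that does not compare equal to itself,
# so it has to be detected with math.isnan rather than ==.
nan_equals_itself = (nan == nan)
nan_detected = math.isnan(nan)
```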

StefanKarpinski on 25 Nov 2017

nan

:+1:

pradyunsg on 26 Nov 2017

> before: all numbers may be prefixed with signs
>
> after: all numbers may be prefixed with signs, except the number zero.

This is a compelling argument, and well stated. Thank you all for the reasoned discourse, this is why we do it! I've been moved by the discussion in the same way as @pradyunsg, and saying that TOML floats are IEEE 754 64-bit floats is an excellent idea and a natural, backwards compatible extension of the current spec. I'll work up a new PR with that direction and we can see how it feels.

mojombo on 27 Nov 2017

❤2

Please see #506 for the proposed changes as outlined earlier.

mojombo on 28 Nov 2017
