Nim: JSON parsing fails for integer values greater than BiggestInt.max

Created on 26 Sep 2020 · 29Comments · Source: nim-lang/Nim

The JSON parsing functions (like parseJson and parseFile) fail when trying to parse an integer which value exceeds the maximum value that BiggestInt can hold, even if that value could be represented by the BiggestUInt type.

Example

import json

let jsonWithBigInteger = "9223372036854775808"

echo parseJson(jsonWithBigInteger)

See https://play.nim-lang.org/#ix=2yON

Current Output

/usercode/in.nim(5)      in
/playground/nim/lib/pure/json.nim(931) parseJson
/playground/nim/lib/pure/json.nim(847) parseJson
/playground/nim/lib/pure/json.nim(783) parseJson
/playground/nim/lib/pure/strutils.nim(1094) parseBiggestInt
/playground/nim/lib/pure/parseutils.nim(443) parseBiggestInt
/playground/nim/lib/pure/parseutils.nim(423) rawParseInt
/playground/nim/lib/pure/parseutils.nim(397) integerOutOfRangeError
Error: unhandled exception: Parsed integer outside of valid range [ValueError]

Expected Output

Possible Solution

In json.nim integers are represented using the BiggestInt type (see https://github.com/nim-lang/Nim/blob/devel/lib/pure/json.nim#L184 and https://github.com/nim-lang/Nim/blob/devel/lib/pure/json.nim#L795), which is signed and thus cannot contain any values that are greater than the maximum value for BiggestInt. A solution would be to detect if BiggestInt can hold the integer value and if not, use BiggestUInt.

Additional Information

This issue is blocking my project Exercism in which I have to parse the following JSON: https://github.com/exercism/problem-specifications/blob/master/exercises/grains/canonical-data.json#L66

$ nim -v
Nim Compiler Version 1.2.6 [MacOSX: amd64]
Compiled at 2020-09-21
Copyright (c) 2006-2020 by Andreas Rumpf

active boot switches: -d:release

Feature Low Priority Stdlib

Source

ErikSchierboom

👎2 👍1

Most helpful comment

How does that help? Sounds like a terrible product to use.

We need BigInt and I think it’s pretty clear why; the JSON spec demands it. Anything else is just going to require fixing later.

disruptek on 4 Oct 2020

👍5

All 29 comments

A solution would be to detect if BiggestInt can hold the integer value and if not, use BiggestUInt.

This would need another JsonNodeKind and branch of JsonNode specifically for uints, which would have no other use outside this specific case (integers between [2^63, 2^64)). A simple alternative is parsing to a float, which is technically what JS does. A JBigNum branch that stores a raw string of the integer is also too specialized, since it is only for integers that are too big and not, for example, floats that have more than 18 decimal digits.

hlaaftana on 26 Sep 2020

I’ve had this problem for ages. As far as I can tell, JSON doesn’t define a range for integers and unfortunately, we cannot complete a parse successfully when our native type cannot hold the JSON integer value.

The right solution fixes our stdlib, I think, but someone who cares about our hooks system should weigh in with a sympathetic design.

What do our other 3rd-party JSON libraries do?

disruptek on 26 Sep 2020

We should parse to float for these cases. Or maybe pretend it's a string literal. UInt64 solves nothing.

Araq on 27 Sep 2020

Should be doable by fixing https://github.com/nim-lang/Nim/issues/14696#issue-640117621

juancarlospaco on 27 Sep 2020

It will be very tricky for caller to handle logic that normally a JInt is expected, but if the value greater than some value it is a JFloat and lose precision silently. IMHO, throwing an error make sense.

jackhftang on 4 Oct 2020

If the library was generic on the integer type then the user could choose appropriate type for their use case including a bigint from nimble. Is it possible to make it generic with a default to BiggestInt without breaking compatibility?

qbradley on 4 Oct 2020

D's std.json (see https://dlang.org/phobos/std_json.html) handles large uint's correctly; nim can handle them correctly too.

import std.stdio;
import std.json;
void main(string[]args){
  string s = "9223372036854775808";
  JSONValue j = parseJSON(s);
  assert(j.toString == "9223372036854775808");
  assert(j.type == JSONType.uinteger);
  assert(j.get!ulong == 9223372036854775808);
}

(note that https://github.com/dlang-community/std_data_json or https://vibed.org/api/vibe.data.json/ is more popular)

timotheecour on 4 Oct 2020

@timotheecour Whats JSONType.uinteger a Big Int basically?.
I think that silently losing information into floats is not good workaround, need Big Int on stdlib.

juancarlospaco on 4 Oct 2020

Whats JSONType.uinteger a Big Int basically?.

no, it's an enum member encoding the parsed type; see code here: https://github.com/dlang/phobos/blob/master/std/json.d

here are how these map to internal D types:

struct JSONValue
{
    import std.exception : enforce;

    union Store
    {
        string                          str;
        long                            integer;
        ulong                           uinteger;
        double                          floating;
        JSONValue[string]               object;
        JSONValue[]                     array;
    }
    private Store store;

and here's the parsing logic:

                    if (isNegative)
                    {
                        value.store.integer = parse!long(data);
                        value.type_tag = JSONType.integer;
                    }
                    else
                    {
                        // only set the correct union member to not confuse CTFE
                        ulong u = parse!ulong(data);
                        if (u & (1UL << 63))
                        {
                            value.store.uinteger = u;
                            value.type_tag = JSONType.uinteger;
                        }
                        else
                        {
                            value.store.integer = u;
                            value.type_tag = JSONType.integer;
                        }
                    }

which is hopefully self explanatory: it tries to fit into a integer (mapping to D's long, or nim's int) unless it's positive and doesn't fit into an integer, in which case it maps to uinteger (D's ulong, nim's uint)

I think that silently losing information into floats is not good workaround, need Big Int on stdlib.

float is a bad workaround here; and Big Int is not needed to solve this problem. We can adapt the above D solution which (hopefully) results in no breaking change; only parsed integers that don't fit in int would result in being encoded as uint (instead of raising ValueError as they do currently)

timotheecour on 4 Oct 2020

How does that help? Sounds like a terrible product to use.

We need BigInt and I think it’s pretty clear why; the JSON spec demands it. Anything else is just going to require fixing later.

disruptek on 4 Oct 2020

👍5

Json numbers are not BigInt, they are arbitrary precision decimal numbers. The problem here is the same with string and BigInt, they require an additional tiny allocation and pointer indirection making json parsing even slower than it already is. So my recommendation is to not write to the decimal field unless absolutely necessary.

We should parse to float for these cases. Or maybe pretend it's a string literal. UInt64 solves nothing

WTF float? convert the number to a lossy type and then back to uint. What could ever go wrong with that one. String would work I agree, but also an additional uint64 field would fix the problem.

krux02 on 9 Oct 2020

👍1

I have to agree with @krux02 here; Json numbers are not BigInt so using BigInt wouldn't solve the general problem of number, and most importantly would incur overhead in the common case; furthermore there is not ETA on std/bigints see https://github.com/nim-lang/Nim/issues/14696 (essentially, either it's a slow nim implementation/wrapper or it's a fast nim wrapper around GMP but which has LGPL licensing concerns, see https://github.com/nim-lang/Nim/issues/14696#issuecomment-703356285)

my proposal remains:

use uint64 exactly as done in D (for uint64 numbers too large to fit in int64) see https://github.com/nim-lang/Nim/issues/15413#issuecomment-703303598
for now users that need larger numbers than uint64/int64 range (ditto with large floats) must encode as string otherwise overflow exception is raised
eventually, once std/bigints materialize, the overflow exception would not be thrown for numbers that don't fit in uint64/int64
ditto with std/bigfloats (which could also be enabled by a GMP wrapper btw, since GMP implements both)

benefits:

can be implemented quickly, without waiting on std/bigints, std/bigfloats
no bigint overhead forced upon every number including the 99.99..% common case (in practice) of numbers that fit in int64/uint64
no breaking change nor performance breaking change once std/bigints becomes available

timotheecour on 9 Oct 2020

I don't know why @krux02 brought up the number type; this issue is about the integer type.

I'm against changing json twice -- once to accomodate one extra bit, and again when we have a proper solution that meets the spec. If you can provide the API that won't necessitate an additional change later, then I'd support this change.

I'm curious why you think that we'll have bigint overhead for all integers. It seems to me that we'll know how many digits the integer has, so most smaller integers should be quite easy to accomodate natively without any bigint machinery.

disruptek on 10 Oct 2020

👍1

uint64 in json are way more common than BigInt, for the simple reason that uint64 is a native type in most environments, so it makes sense to support it efficiently without incurring BigInt overhead, which can be large.

I don't see anything wrong with adding JsonNodeKind.JUInt for the range int64.high+..uint64.high and then once std/bigints is added, adding JsonNodeKind.JBigInt for the range of integers > uint64.high and < -int64.low.

Note: other libraries I found that support (at least) uint64 in json do as I suggest (more research welcome; by no means complete):

dlang does have a std.bigint module yet they opted for the JUInt approach I described, and for good reasons.
ditto with rust json: https://github.com/serde-rs/json/blob/master/src/number.rs

I'm against changing json twice

I don't follow this. No code will break and the mapping from a Number in any given range to a JsonNodeKind would not change as JUInt and JBigInt are added. Plus, to match the spec you'd anyway need to add a JBigFloat kind, whether you have BigInt or not.

PR

=> https://github.com/nim-lang/Nim/pull/15545

timotheecour on 10 Oct 2020

uint64 in json are way more common than BigInt, for the simple reason that uint64 is a native type in most environments, so it makes sense to support it efficiently without incurring BigInt overhead, which can be large.

I'm with you so far. :+1:

I don't see anything wrong with adding JsonNodeKind.JUInt for the range int64.high+..uint64.high and then once std/bigints is added, adding JsonNodeKind.JBigInt for the range of integers > uint64.high and < -int64.low.

I know. But that doesn't mean there's nothing wrong with it, right? It just means you don't see it.

Once you start changing the semantics, you ruin the point of JSON. If you're using it in any kind of way beyond that which it is specified for, then basically you are doing it wrong. The whole idea is that we agree on what we know and don't know about the data.

I'm against changing json twice

I don't follow this. No code will break and the mapping from a Number in any given range to a JsonNodeKind would not change as JUInt and JBigInt are added. Plus, to match the spec you'd anyway need to add a JBigFloat kind, whether you have BigInt or not.

Not my problem. I didn't invent JSON nor did I write our implementation of it. I'm just trying to explain that you're going down the wrong path if you decide to attempt to ascribe more precision to types that are intentionally vague. This isn't adding abstraction, it's removing it.

Now, how are you not going to break my code when you add a new JsonNodeKind? I have exhaustive case statements for this enum.

disruptek on 11 Oct 2020

🚀1

Now, how are you not going to break my code when you add a new JsonNodeKind? I have exhaustive case statements for this enum.

If the new enum value only occur if you encounter values that were impossible to use before, then I don't think it would break your code. As timothee suggested, the uint type should only be used for integer values that cannot be represented with int64.

I don't know why @krux02 brought up the number type; this issue is about the integer type.

Well to be precise, Json only has a one type for numbers. And these are arbitrary precision decimal types (I call it JNumber for now), because that is exactly what you can represent in ASCII. Nim's native types are fixed precision integer and floating point types with up to 64 bits of precision. On top of that Nim's floating point numbers (IEEE 754) also allow NaN and ±∞. This means neithe is JNumber a subset of any of the buildin Nim types, nor is float32/float64 a subset of JNumber and the implementation of Json now has to deal with this mess.

One advantage that we do know is, that almost all numbers that end up in json are serialized from some native number types, not hand written nor from some arbitrary precision library. So it totally makes sens to special case for all these builtin types to support them as good and fast as possible. So even if the Arbitrary precision number type would already be supported, it would still make sense to add this UInt special case for performance reasons.

krux02 on 13 Oct 2020

👍1

First, maybe you can explain what I'm missing in the following exhaustive case statement:

case kind
of JInt: discard
of JFloat: discard
of JString: discard
of JBool: discard
of JObject: discard
of JArray: discard
of JNull: discard

This will result in a compile-time error if you add a new value to JsonNodeKind.

I agree that the implementation is poor. Again, if you want my support, show me how the change does not break existing code. Alternatively, show how the change won't itself need to be changed in some breaking way in the future.

Honestly, I would sooner support a solution that is 100x slower, supports only one JNumber type, and does so for every conceivable JSON encoding. If you're going to break me, break me for the last time.

disruptek on 13 Oct 2020

👍1

Yes, sorry, you are right. For a moment I forgot that in Nim case statements fail to compile when they are not fully covered.

krux02 on 13 Oct 2020

What about JUndefined? It lets us change the spec just once. You can subsequently parse JUndefined with whatever assumption you wish to bring to bear upon the source.

disruptek on 13 Oct 2020

👍2

JavaScript doesn't have 64 bit unsigned numbers at all, this discussion is silly. If we look at the spec, https://www.json.org/json-en.html it seems clear to me that any number must be accepted, including numbers much larger than high(uint64). I propose we add a new JSON kind, JRawNumber and parse it into a string value. Then clients can use a custom parseNumber proc and their favorite bignum library (or restrict it to uint64).

Araq on 14 Oct 2020

👍2

Sure, but this is JSON, not JavaScript.

If we break with JUndefined then we may as well also consolidate into a single JNumber kind also.

disruptek on 14 Oct 2020

I like JRawNumber if done in addition to JUInt, uint64 is a native type on most platforms and shouldn't incur performance cost

JUInt shall still be introduced (see #15545) and ensure uint64 doesn't incur overhead, since it's arguably much more common than bigint/bigfloat
JRawNumber is also added as a fallback to represent all valid json numbers not already covered by the other number kinds (JInt, JUint, JFloat)
JRawNumber is represented as a string and json parser only checks that the string represent valid json number, no decoding happens at parse time
proc getRawNumberStr(j: JsonNode): string is added to json get APIs (no decoding happens here)
user can use parse the output of getRawNumberStr outside of std/json using a separate library for bigint/bigfloat

this gives best of both worlds:

performance is not sacrified for the common case (uint64 can be common for some applications eg uuids etc)
json spec is followed
json isn't bloated by an added dependency on some bigint/bigfloat library
only breaks the case statement once (JUInt + JRawNumber can be added in 1 PR or at least within 1 nim point release)

timotheecour on 14 Oct 2020

👎1 👍1

Not a fan, but I'm thinking I'll be writing my own JSON parser, so don't let me stand in the way of "progress".

disruptek on 15 Oct 2020

👍1

@disruptek well, if you do that, you should probably start here: https://github.com/Araq/packedjson

krux02 on 15 Oct 2020

Am I understanding correctly that your plan is to break code by introducing JRawNumber? Can we come up with a transition period for this at least?

dom96 on 18 Oct 2020

In the meantime I found a better solution: A parse flag so that "unparsable" numbers are mapped to JString.

Araq on 19 Oct 2020

👍1

but then you cannot distinguish

{ "n": 12345678901234567890 }

{ "n": "12345678901234567890" }

one way is to map to a new type JUnknown instead of JString, but I don't think it is a good solution.

jackhftang on 19 Oct 2020

I've updated https://github.com/nim-lang/Nim/pull/15545 as described in https://github.com/nim-lang/Nim/issues/15413#issuecomment-708678072

nim> parseJson(""" { "a1": 123456789012345676712345678} """)["a1"].kind
JNumber == type JsonNodeKind

nim> parseJson(""" { "a1": 123456789012345676712} """)["a1"].kind
JUInt == type JsonNodeKind

nim> parseJson(""" { "a1": 123456789012345676} """)["a1"].kind
JInt == type JsonNodeKind

EDIT: and see https://github.com/nim-lang/Nim/pull/15646 for my proposed approach to mitigate impact of enum changes.

timotheecour on 20 Oct 2020

but then you cannot distinguish

Yes true, but not really worth to worry about. "JString" is bad, but floating point numbers need to fall back to a string already anyway if they want to serialize infinity somehow. So I think "numbers might be in a string literal" is something I think is acceptable.

krux02 on 21 Oct 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

nimpretty generates broken formatting

SolitudeSF · 3Comments

Casting a pointer to an UncheckedArray[uint8] crashes the compiler

liquid600pgm · 3Comments

[style guide] [low priority] Naming conventions for modules?

awr1 · 3Comments

try expression doesn't work with return on expect branch

alaviss · 3Comments

[RFC] Error for exported procs that take or return private types

hlaaftana · 3Comments