Nim: JSON parsing fails for integer values greater than BiggestInt.max

Created on 26 Sep 2020  Â·  29Comments  Â·  Source: nim-lang/Nim

The JSON parsing functions (like parseJson and parseFile) fail when trying to parse an integer which value exceeds the maximum value that BiggestInt can hold, even if that value could be represented by the BiggestUInt type.

Example

import json

let jsonWithBigInteger = "9223372036854775808"

echo parseJson(jsonWithBigInteger)

See https://play.nim-lang.org/#ix=2yON

Current Output

/usercode/in.nim(5)      in
/playground/nim/lib/pure/json.nim(931) parseJson
/playground/nim/lib/pure/json.nim(847) parseJson
/playground/nim/lib/pure/json.nim(783) parseJson
/playground/nim/lib/pure/strutils.nim(1094) parseBiggestInt
/playground/nim/lib/pure/parseutils.nim(443) parseBiggestInt
/playground/nim/lib/pure/parseutils.nim(423) rawParseInt
/playground/nim/lib/pure/parseutils.nim(397) integerOutOfRangeError
Error: unhandled exception: Parsed integer outside of valid range [ValueError]

Expected Output


Possible Solution

  • In json.nim integers are represented using the BiggestInt type (see https://github.com/nim-lang/Nim/blob/devel/lib/pure/json.nim#L184 and https://github.com/nim-lang/Nim/blob/devel/lib/pure/json.nim#L795), which is signed and thus cannot contain any values that are greater than the maximum value for BiggestInt. A solution would be to detect if BiggestInt can hold the integer value and if not, use BiggestUInt.

Additional Information

$ nim -v
Nim Compiler Version 1.2.6 [MacOSX: amd64]
Compiled at 2020-09-21
Copyright (c) 2006-2020 by Andreas Rumpf

active boot switches: -d:release
Feature Low Priority Stdlib

Most helpful comment

How does that help? Sounds like a terrible product to use.

We need BigInt and I think it’s pretty clear why; the JSON spec demands it. Anything else is just going to require fixing later.

All 29 comments

A solution would be to detect if BiggestInt can hold the integer value and if not, use BiggestUInt.

This would need another JsonNodeKind and branch of JsonNode specifically for uints, which would have no other use outside this specific case (integers between [2^63, 2^64)). A simple alternative is parsing to a float, which is technically what JS does. A JBigNum branch that stores a raw string of the integer is also too specialized, since it is only for integers that are too big and not, for example, floats that have more than 18 decimal digits.

I’ve had this problem for ages. As far as I can tell, JSON doesn’t define a range for integers and unfortunately, we cannot complete a parse successfully when our native type cannot hold the JSON integer value.

The right solution fixes our stdlib, I think, but someone who cares about our hooks system should weigh in with a sympathetic design.

What do our other 3rd-party JSON libraries do?

We should parse to float for these cases. Or maybe pretend it's a string literal. UInt64 solves nothing.

Should be doable by fixing https://github.com/nim-lang/Nim/issues/14696#issue-640117621

It will be very tricky for caller to handle logic that normally a JInt is expected, but if the value greater than some value it is a JFloat and lose precision silently. IMHO, throwing an error make sense.

If the library was generic on the integer type then the user could choose appropriate type for their use case including a bigint from nimble. Is it possible to make it generic with a default to BiggestInt without breaking compatibility?

D's std.json (see https://dlang.org/phobos/std_json.html) handles large uint's correctly; nim can handle them correctly too.

import std.stdio;
import std.json;
void main(string[]args){
  string s = "9223372036854775808";
  JSONValue j = parseJSON(s);
  assert(j.toString == "9223372036854775808");
  assert(j.type == JSONType.uinteger);
  assert(j.get!ulong == 9223372036854775808);
}

(note that https://github.com/dlang-community/std_data_json or https://vibed.org/api/vibe.data.json/ is more popular)

@timotheecour Whats JSONType.uinteger a Big Int basically?.
I think that silently losing information into floats is not good workaround, need Big Int on stdlib.

Whats JSONType.uinteger a Big Int basically?.

no, it's an enum member encoding the parsed type; see code here: https://github.com/dlang/phobos/blob/master/std/json.d

here are how these map to internal D types:

struct JSONValue
{
    import std.exception : enforce;

    union Store
    {
        string                          str;
        long                            integer;
        ulong                           uinteger;
        double                          floating;
        JSONValue[string]               object;
        JSONValue[]                     array;
    }
    private Store store;

and here's the parsing logic:

                    if (isNegative)
                    {
                        value.store.integer = parse!long(data);
                        value.type_tag = JSONType.integer;
                    }
                    else
                    {
                        // only set the correct union member to not confuse CTFE
                        ulong u = parse!ulong(data);
                        if (u & (1UL << 63))
                        {
                            value.store.uinteger = u;
                            value.type_tag = JSONType.uinteger;
                        }
                        else
                        {
                            value.store.integer = u;
                            value.type_tag = JSONType.integer;
                        }
                    }

which is hopefully self explanatory: it tries to fit into a integer (mapping to D's long, or nim's int) unless it's positive and doesn't fit into an integer, in which case it maps to uinteger (D's ulong, nim's uint)

I think that silently losing information into floats is not good workaround, need Big Int on stdlib.

float is a bad workaround here; and Big Int is not needed to solve this problem. We can adapt the above D solution which (hopefully) results in no breaking change; only parsed integers that don't fit in int would result in being encoded as uint (instead of raising ValueError as they do currently)

How does that help? Sounds like a terrible product to use.

We need BigInt and I think it’s pretty clear why; the JSON spec demands it. Anything else is just going to require fixing later.

Json numbers are not BigInt, they are arbitrary precision decimal numbers. The problem here is the same with string and BigInt, they require an additional tiny allocation and pointer indirection making json parsing even slower than it already is. So my recommendation is to not write to the decimal field unless absolutely necessary.

We should parse to float for these cases. Or maybe pretend it's a string literal. UInt64 solves nothing

WTF float? convert the number to a lossy type and then back to uint. What could ever go wrong with that one. String would work I agree, but also an additional uint64 field would fix the problem.

I have to agree with @krux02 here; Json numbers are not BigInt so using BigInt wouldn't solve the general problem of number, and most importantly would incur overhead in the common case; furthermore there is not ETA on std/bigints see https://github.com/nim-lang/Nim/issues/14696 (essentially, either it's a slow nim implementation/wrapper or it's a fast nim wrapper around GMP but which has LGPL licensing concerns, see https://github.com/nim-lang/Nim/issues/14696#issuecomment-703356285)

my proposal remains:

  • use uint64 exactly as done in D (for uint64 numbers too large to fit in int64) see https://github.com/nim-lang/Nim/issues/15413#issuecomment-703303598
  • for now users that need larger numbers than uint64/int64 range (ditto with large floats) must encode as string otherwise overflow exception is raised
  • eventually, once std/bigints materialize, the overflow exception would not be thrown for numbers that don't fit in uint64/int64
  • ditto with std/bigfloats (which could also be enabled by a GMP wrapper btw, since GMP implements both)

benefits:

  • can be implemented quickly, without waiting on std/bigints, std/bigfloats
  • no bigint overhead forced upon every number including the 99.99..% common case (in practice) of numbers that fit in int64/uint64
  • no breaking change nor performance breaking change once std/bigints becomes available

I don't know why @krux02 brought up the number type; this issue is about the integer type.

I'm against changing json twice -- once to accomodate one extra bit, and again when we have a proper solution that meets the spec. If you can provide the API that won't necessitate an additional change later, then I'd support this change.

I'm curious why you think that we'll have bigint overhead for all integers. It seems to me that we'll know how many digits the integer has, so most smaller integers should be quite easy to accomodate natively without any bigint machinery.

uint64 in json are way more common than BigInt, for the simple reason that uint64 is a native type in most environments, so it makes sense to support it efficiently without incurring BigInt overhead, which can be large.

I don't see anything wrong with adding JsonNodeKind.JUInt for the range int64.high+..uint64.high and then once std/bigints is added, adding JsonNodeKind.JBigInt for the range of integers > uint64.high and < -int64.low.

Note: other libraries I found that support (at least) uint64 in json do as I suggest (more research welcome; by no means complete):

I'm against changing json twice

I don't follow this. No code will break and the mapping from a Number in any given range to a JsonNodeKind would not change as JUInt and JBigInt are added. Plus, to match the spec you'd anyway need to add a JBigFloat kind, whether you have BigInt or not.

PR

=> https://github.com/nim-lang/Nim/pull/15545

uint64 in json are way more common than BigInt, for the simple reason that uint64 is a native type in most environments, so it makes sense to support it efficiently without incurring BigInt overhead, which can be large.

I'm with you so far. :+1:

I don't see anything wrong with adding JsonNodeKind.JUInt for the range int64.high+..uint64.high and then once std/bigints is added, adding JsonNodeKind.JBigInt for the range of integers > uint64.high and < -int64.low.

I know. But that doesn't mean there's nothing wrong with it, right? It just means you don't see it.

Once you start changing the semantics, you ruin the point of JSON. If you're using it in any kind of way beyond that which it is specified for, then basically you are doing it wrong. The whole idea is that we agree on what we know and don't know about the data.

I'm against changing json twice

I don't follow this. No code will break and the mapping from a Number in any given range to a JsonNodeKind would not change as JUInt and JBigInt are added. Plus, to match the spec you'd anyway need to add a JBigFloat kind, whether you have BigInt or not.

Not my problem. I didn't invent JSON nor did I write our implementation of it. I'm just trying to explain that you're going down the wrong path if you decide to attempt to ascribe more precision to types that are intentionally vague. This isn't adding abstraction, it's removing it.

Now, how are you not going to break my code when you add a new JsonNodeKind? I have exhaustive case statements for this enum.

Now, how are you not going to break my code when you add a new JsonNodeKind? I have exhaustive case statements for this enum.

If the new enum value only occur if you encounter values that were impossible to use before, then I don't think it would break your code. As timothee suggested, the uint type should only be used for integer values that cannot be represented with int64.

I don't know why @krux02 brought up the number type; this issue is about the integer type.

Well to be precise, Json only has a one type for numbers. And these are arbitrary precision decimal types (I call it JNumber for now), because that is exactly what you can represent in ASCII. Nim's native types are fixed precision integer and floating point types with up to 64 bits of precision. On top of that Nim's floating point numbers (IEEE 754) also allow NaN and ±∞. This means neithe is JNumber a subset of any of the buildin Nim types, nor is float32/float64 a subset of JNumber and the implementation of Json now has to deal with this mess.

One advantage that we do know is, that almost all numbers that end up in json are serialized from some native number types, not hand written nor from some arbitrary precision library. So it totally makes sens to special case for all these builtin types to support them as good and fast as possible. So even if the Arbitrary precision number type would already be supported, it would still make sense to add this UInt special case for performance reasons.

First, maybe you can explain what I'm missing in the following exhaustive case statement:

case kind
of JInt: discard
of JFloat: discard
of JString: discard
of JBool: discard
of JObject: discard
of JArray: discard
of JNull: discard

This will result in a compile-time error if you add a new value to JsonNodeKind.

I agree that the implementation is poor. Again, if you want my support, show me how the change does not break existing code. Alternatively, show how the change won't itself need to be changed in some breaking way in the future.

Honestly, I would sooner support a solution that is 100x slower, supports only one JNumber type, and does so for every conceivable JSON encoding. If you're going to break me, break me for the last time.

Yes, sorry, you are right. For a moment I forgot that in Nim case statements fail to compile when they are not fully covered.

What about JUndefined? It lets us change the spec just once. You can subsequently parse JUndefined with whatever assumption you wish to bring to bear upon the source.

JavaScript doesn't have 64 bit unsigned numbers at all, this discussion is silly. If we look at the spec, https://www.json.org/json-en.html it seems clear to me that any number must be accepted, including numbers much larger than high(uint64). I propose we add a new JSON kind, JRawNumber and parse it into a string value. Then clients can use a custom parseNumber proc and their favorite bignum library (or restrict it to uint64).

Sure, but this is JSON, not JavaScript.

If we break with JUndefined then we may as well also consolidate into a single JNumber kind also.

I like JRawNumber if done in addition to JUInt, uint64 is a native type on most platforms and shouldn't incur performance cost

  • JUInt shall still be introduced (see #15545) and ensure uint64 doesn't incur overhead, since it's arguably much more common than bigint/bigfloat
  • JRawNumber is also added as a fallback to represent all valid json numbers not already covered by the other number kinds (JInt, JUint, JFloat)
  • JRawNumber is represented as a string and json parser only checks that the string represent valid json number, no decoding happens at parse time
  • proc getRawNumberStr(j: JsonNode): string is added to json get APIs (no decoding happens here)
  • user can use parse the output of getRawNumberStr outside of std/json using a separate library for bigint/bigfloat

this gives best of both worlds:

  • performance is not sacrified for the common case (uint64 can be common for some applications eg uuids etc)
  • json spec is followed
  • json isn't bloated by an added dependency on some bigint/bigfloat library
  • only breaks the case statement once (JUInt + JRawNumber can be added in 1 PR or at least within 1 nim point release)

Not a fan, but I'm thinking I'll be writing my own JSON parser, so don't let me stand in the way of "progress".

@disruptek well, if you do that, you should probably start here: https://github.com/Araq/packedjson

Am I understanding correctly that your plan is to break code by introducing JRawNumber? Can we come up with a transition period for this at least?

In the meantime I found a better solution: A parse flag so that "unparsable" numbers are mapped to JString.

but then you cannot distinguish

{ "n": 12345678901234567890 }

or

{ "n": "12345678901234567890" }

one way is to map to a new type JUnknown instead of JString, but I don't think it is a good solution.

I've updated https://github.com/nim-lang/Nim/pull/15545 as described in https://github.com/nim-lang/Nim/issues/15413#issuecomment-708678072

nim> parseJson(""" { "a1": 123456789012345676712345678} """)["a1"].kind
JNumber == type JsonNodeKind

nim> parseJson(""" { "a1": 123456789012345676712} """)["a1"].kind
JUInt == type JsonNodeKind

nim> parseJson(""" { "a1": 123456789012345676} """)["a1"].kind
JInt == type JsonNodeKind

EDIT: and see https://github.com/nim-lang/Nim/pull/15646 for my proposed approach to mitigate impact of enum changes.

but then you cannot distinguish

Yes true, but not really worth to worry about. "JString" is bad, but floating point numbers need to fall back to a string already anyway if they want to serialize infinity somehow. So I think "numbers might be in a string literal" is something I think is acceptable.

Was this page helpful?
0 / 5 - 0 ratings