Json: Parsing of floats locale dependent

Created on 2 Sep 2016 · 22 Comments · Source: nlohmann/json

With version 2.0.3, the following code:

std::locale l("");
std::string curr_locale = l.name();
float j = nlohmann::json::parse("0.1");
std::setlocale(LC_ALL, "en_US.UTF-8");
float k = nlohmann::json::parse("0.1");

yields the following results (read in the debugger):
curr_locale = "nb_NO.UTF-8"
j = 0
k = 0.100000001

Thus, the decimals are truncated in the Norwegian locale. This is rather unexpected, and should IMO not happen.

I think it originates from the parser calling strtod which is locale dependent.

confirmed bug

All 22 comments

This seems to be related to #228, but affects the parser. I'll look into it.

I can reproduce the error with

#include <clocale>
#include <iostream>
#include "src/json.hpp"

int main()
{
    std::setlocale(LC_ALL, "no_NO.UTF-8");
    float j = nlohmann::json::parse("0.1");

    std::setlocale(LC_ALL, "en_US.UTF-8");
    float k = nlohmann::json::parse("0.1");

    std::cerr << "j = " << j << ", k = " << k << std::endl;
}

The output is:

j = 0, k = 0.1

Edit: The reason is that internally std::strtold is called which is not locale-independent.

Thanks for your efforts and quickly looking into it!

Some systems have strtold_l, where you can pass in a locale to be used for the conversion. Unfortunately, it is not standardized in POSIX or similar. A first try at writing a replacement version of strtod_l can be found here.

Thanks for the link - unfortunately, it uses string streams which - though correct - perform much worse than strtold.

I copied the code from PR #337 into a feature branch and worked from there. I cleaned the code a bit and ran the benchmarks:

Before:

parse jeopardy.json                           5  3251660027 ns/op
parse canada.json                           100   139002149 ns/op
parse citm_catalog.json                     200    70643268 ns/op
parse twitter.json                          500    32565660 ns/op
parse numbers/floats.json                    10  1148783185 ns/op
parse numbers/signed_ints.json               20   893569712 ns/op
parse numbers/unsigned_ints.json             20   865393626 ns/op
dump jeopardy.json                           10  1092918030 ns/op
dump jeopardy.json with indent               10  1278158184 ns/op
./json_benchmarks_develop 235.942s

After:

parse jeopardy.json                           5  2758517489 ns/op
parse canada.json                           200    82824924 ns/op
parse citm_catalog.json                     200    67739617 ns/op
parse twitter.json                          500    31993207 ns/op
parse numbers/floats.json                    20   686197008 ns/op
parse numbers/signed_ints.json               20   880843166 ns/op
parse numbers/unsigned_ints.json             20   872672400 ns/op
dump jeopardy.json                           10  1237188893 ns/op
dump jeopardy.json with indent               10  1246099190 ns/op
./json_benchmarks_strtod 246.574s

For the floating-point-heavy tests (canada.json, floats.json), the time per parse drops by roughly 40% - a speedup of about 1.7×!

There are still 3 tests failing:

-------------------------------------------------------------------------------
regression tests
  issue #186 miloyip/nativejson-benchmark: floating-point parsing
-------------------------------------------------------------------------------
src/unit-regression.cpp:36
...............................................................................

src/unit-regression.cpp:325: FAILED:
  CHECK( j.get<double>() == 72057594037927928.0 )
with expansion:
  72057594037927936.0 == 72057594037927928.0

src/unit-regression.cpp:328: FAILED:
  CHECK( j.get<double>() == 9223372036854774784.0 )
with expansion:
  9223372036854775808.0
  ==
  9223372036854774784.0

-------------------------------------------------------------------------------
compliance tests from nativejson-benchmark
  doubles
-------------------------------------------------------------------------------
src/unit-testsuites.cpp:103
...............................................................................

src/unit-testsuites.cpp:113: FAILED:
  CHECK( json::parse(json_string)[0].get<double>() == Approx(expected) )
due to unexpected exception with message:
  json_string := [2
  .225073858507201136057409796709131975934819546351645648023426109724822222021-
  0769455165295239081350879141491589130396211068700864386945946455276572074078-
  2062174337998814106326732925355228688137214901298112245145188984905722230728-
  5255133155755015914397476397983411801999323962548289017107081850690630666655-
  9949382757725720157630626906633326475653000092458883164330377797918696120494-
  9739037782970490505108060994073026293712895895000358379996720725430436028407-
  8895771796150945516748243471030702609144621572289880258182545180325707018860-
  8721131280795122334262883686223215037756666225039825343359745688844239002654-
  9819838548794829220689472168983109969836584681402285424333066033985088644580-
  4001034933970427567186443383770486037861622771738545623065874679014086723327-
  636718751234567890123456789012345678901e-308]
  expected := 2,22507e-308
  type must be number, but is null

===============================================================================
test cases:   34 |   32 passed | 2 failed
assertions: 8974 | 8971 passed | 3 failed

I need to better understand how to detect/handle overflows.

Digging into http://www.exploringbinary.com/how-glibc-strtod-works/ I am thinking the current approach is too simple. @whackashoe, how would you propose we cope with rounding?

I sorta have that feeling too... I've thought a bit about whether we could do some trick with epsilon to predict precision loss during the multiplications and divisions and get correct rounding, but I'm not sure what that would look like or whether it's even feasible - I'm certainly a floating-point novice, heh. The alternative would be, like in your link, basically trashing this and doing some bigint sort of thing... I'd really like it if we could beat that performance-wise, though :)

Searching google for "c++ string to double locale independent", this issue is on the second page. :)

Maybe a hybrid approach:

  • If strtod_l (or _strtod_l on Windows) is available at compile time, use that.
  • If not, check at runtime whether the current locale is "C". If so, use strtod; otherwise, fall back to the slower implementation.

This could go in a single helper function so there aren't a lot of ifdefs or other forms of conditionals in the code.

I'm not sure what the performance of the latter would be like, or if it's even feasible.
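A sketch of that hybrid (parse_double is an illustrative name, not library code; for brevity this version uses a runtime decimal-point check plus a classic-locale stream as the fallback, and omits the platform #ifdefs the compile-time strtod_l branch would need):

```cpp
#include <clocale>
#include <cstdlib>
#include <locale>
#include <sstream>
#include <string>

// Fast path when the runtime locale already uses '.' as the decimal
// separator; locale-independent (but slower) stream fallback otherwise.
double parse_double(const std::string& s)
{
    const char* dp = std::localeconv()->decimal_point;
    if (dp[0] == '.' && dp[1] == '\0') {
        return std::strtod(s.c_str(), nullptr);  // fast path
    }

    // An istringstream imbued with the classic "C" locale parses '.'
    // regardless of the global locale.
    std::istringstream iss(s);
    iss.imbue(std::locale::classic());
    double result = 0.0;
    iss >> result;
    return result;
}
```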

@gregmarr I think having a 10× (or whatever it is) penalty when the locale doesn't match seems like a poor workaround. I ran a benchmark a while back and it was pretty bad (10× might be an exaggeration). Really, what we are doing doesn't match strtod - it just happens to provide all the functionality we need (and more...).

Correctness wins over performance, especially when there is a correct and performant option that is usually available.

If you can find a correct solution outside of that, that's great. Otherwise, you'll have to go with something slow when the correct solution isn't otherwise available.

Do we actually know that strtod_l is a problem in any C++11 conforming compiler/library?

http://lua-users.org/lists/lua-l/2016-04/msg00216.html

  • strtod_l is available on linux (at least both glibc and musl) if _GNU_SOURCE is defined
  • _strtod_l is available in MSVC since VS2005
  • strtod_l is available on OSX since Darwin 8.
  • strtod_l is available on FreeBSD since 9.1

I found one reference where mingw's C library has neither version above; in that case, the user simply searched the string, replaced '.' with the locale's decimal separator, and then called strtod.

Most of the slowness due to streams comes from initialization and locale-imbuing, so this can be mitigated by reusing an initialized stream per thread:

// Usage:
//   const int x         = to_num("42");
//   const long double d = to_num("0.25");
#include <cstring>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <typeinfo>

struct to_num
{
    const char* const data_ = nullptr;
    const size_t      len_  = 0;

    to_num(const to_num&)            = delete;
    to_num(to_num&&)                 = delete;
    to_num& operator=(const to_num&) = delete;
    to_num& operator=(to_num&&)      = delete;

    to_num(const char* const data, size_t len = std::string::npos)
        : data_{ data }
        , len_{ len == std::string::npos ? std::strlen(data) : len }
    {}

    to_num(const std::string& s)
        : data_{ s.data() }
        , len_{ s.size() }
    {}

    template<typename T,
             typename = typename std::enable_if<std::is_arithmetic<T>::value>::type >
    operator T() const
    {
        static thread_local std::unique_ptr<std::stringstream> sstr;

        if(!sstr) {
            sstr.reset(new std::stringstream);
            sstr->imbue(std::locale::classic());
        }

        if(len_ == 0) {
            throw std::runtime_error(
                    std::string("Can't parse empty string as numeric type=")
                    + typeid(T).name());
        }

        sstr->write(data_, static_cast<std::streamoff>(len_));
        T result;
        *sstr >> result;

        const bool could_not_parse = sstr->fail() || !sstr->eof();

        // fix-up sstr state regardless of whether
        // the exception is thrown below.
        sstr->clear();       // clear-out flags
        sstr->str(std::string()); // clear-out data

        if(could_not_parse) {
            throw std::runtime_error(
                    "Can't parse " + std::string(data_, len_)
                    + " as a number of type " + typeid(T).name());
        }

        return result;
    }
};

@TurpentineDistillery Thanks for the code. I shall have a look and see how it performs compared to the current solution.

I tried the stream-based parsing approach myself.
The performance is still garbage : (

There's another approach: query the current locale's decimal-separator character (maybe just once with static const initialization); if not '.', then preprocess the input to look the way the current locale expects (replace '.' with locale's decimal-separator) and then dispatch to strtold or appropriate variant.
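That preprocessing idea might look like this (parse_with_locale_separator is an illustrative name, not library code; a valid JSON number contains at most one '.'):

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Rewrite the JSON '.' into whatever decimal separator the current
// locale expects, then hand off to the locale-dependent strtod.
double parse_with_locale_separator(std::string s)
{
    const char sep = std::localeconv()->decimal_point[0];
    if (sep != '.') {
        const std::string::size_type pos = s.find('.');
        if (pos != std::string::npos) {
            s[pos] = sep;  // make the string match the locale's format
        }
    }
    return std::strtod(s.c_str(), nullptr);
}
```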

@TurpentineDistillery I mentioned that possibility at the end of my previous comment. Have we tried the strtod_l function to see what its performance is?

@gregmarr, indeed you have! I thought this approach was a bit of a hack when I first thought about it, but now I'm feeling better about it : )

Is it possible to store numbers like 1.000.000,00 in json? If so, the parsing as you outline might get more complicated since it will yield an invalid result if you just replace the comma with a dot. However, it still could be a viable way.

@jomade No, this is not valid JSON. Only these numbers are allowed: http://json.org/number.gif

@jomade and we wouldn't be replacing the comma with a dot. We'd be replacing the dot in 1000000.00 with a comma 1000000,00 so that the locale-specific function would parse properly.

@nlohmann The failing regression tests seem to be misleading as they do not take into account std::numeric_limits<double>::digits10. This is equal to 15 for me as it is most likely an IEEE 754 double [1], and we do see that the first 15 digits in both testcases are equal.

I ran std::strtod on strings from 72057594037927928.0 to 72057594037927936.0; the first four values in the range return 72057594037927928.0, while the last four return 72057594037927936.0. The testcase 7205759403792793199999e-5 does lean towards the smaller value, but I'm not sure if this should be relied on.

[1] https://en.wikipedia.org/wiki/IEEE_floating_point#Basic_and_interchange_formats

@qwename With 15 or fewer digits, you are guaranteed that a string to double to string conversion will produce the original string. With 17 digits, you are guaranteed that double to string to double conversion will produce the same double. With more than 15 digits, and an arbitrary sequence of digits, then string to double to string is not guaranteed to produce the original string. It will only do that if the string was produced by a conversion from double to string with 17 digits.

Closed with #450.
