With version 2.0.3, the following code:
std::locale l("");
std::string curr_locale = l.name();
float j = nlohmann::json::parse("0.1");
std::setlocale(LC_ALL, "en_US.UTF-8");
float k = nlohmann::json::parse("0.1");
yields the following results (read in the debugger):
curr_locale = "nb_NO.UTF-8"
j = 0
k = 0.100000001
Thus, the decimals are truncated in the Norwegian locale. This is rather unexpected and, IMO, should not happen.
I think it originates from the parser calling strtod, which is locale-dependent.
This seems to be related to #228, but affects the parser. I'll look into it.
I can reproduce the error with

#include <clocale>
#include <iostream>
#include "src/json.hpp"

int main()
{
    std::setlocale(LC_ALL, "no_NO.UTF-8");
    float j = nlohmann::json::parse("0.1");
    std::setlocale(LC_ALL, "en_US.UTF-8");
    float k = nlohmann::json::parse("0.1");
    std::cerr << "j = " << j << ", k = " << k << std::endl;
}
The output is:
j = 0, k = 0.1
Edit: The reason is that internally std::strtold is called, which is locale-dependent.
Thanks for your efforts and quickly looking into it!
Some systems have strtold_l, where you can pass in the locale to use for the conversion. Unfortunately, it is not standardized in POSIX or similar. A first try at writing a replacement version of strtod_l can be found here.
Thanks for the link. Unfortunately, it uses string streams, which, though correct, perform much worse than strtold.
I copied the code from PR #337 into a feature branch and worked from there. I cleaned the code a bit and ran the benchmarks:
Before:
parse jeopardy.json 5 3251660027 ns/op
parse canada.json 100 139002149 ns/op
parse citm_catalog.json 200 70643268 ns/op
parse twitter.json 500 32565660 ns/op
parse numbers/floats.json 10 1148783185 ns/op
parse numbers/signed_ints.json 20 893569712 ns/op
parse numbers/unsigned_ints.json 20 865393626 ns/op
dump jeopardy.json 10 1092918030 ns/op
dump jeopardy.json with indent 10 1278158184 ns/op
./json_benchmarks_develop 235.942s
After:
parse jeopardy.json 5 2758517489 ns/op
parse canada.json 200 82824924 ns/op
parse citm_catalog.json 200 67739617 ns/op
parse twitter.json 500 31993207 ns/op
parse numbers/floats.json 20 686197008 ns/op
parse numbers/signed_ints.json 20 880843166 ns/op
parse numbers/unsigned_ints.json 20 872672400 ns/op
dump jeopardy.json 10 1237188893 ns/op
dump jeopardy.json with indent 10 1246099190 ns/op
./json_benchmarks_strtod 246.574s
For floating-point-heavy tests (canada.json, floats.json), we see roughly a 1.7x speedup (about 40% less time per operation)!
There are still 3 assertions failing:
-------------------------------------------------------------------------------
regression tests
issue #186 miloyip/nativejson-benchmark: floating-point parsing
-------------------------------------------------------------------------------
src/unit-regression.cpp:36
...............................................................................
src/unit-regression.cpp:325: FAILED:
CHECK( j.get<double>() == 72057594037927928.0 )
with expansion:
72057594037927936.0 == 72057594037927928.0
src/unit-regression.cpp:328: FAILED:
CHECK( j.get<double>() == 9223372036854774784.0 )
with expansion:
9223372036854775808.0 == 9223372036854774784.0
-------------------------------------------------------------------------------
compliance tests from nativejson-benchmark
doubles
-------------------------------------------------------------------------------
src/unit-testsuites.cpp:103
...............................................................................
src/unit-testsuites.cpp:113: FAILED:
CHECK( json::parse(json_string)[0].get<double>() == Approx(expected) )
due to unexpected exception with message:
json_string := [2
.225073858507201136057409796709131975934819546351645648023426109724822222021-
0769455165295239081350879141491589130396211068700864386945946455276572074078-
2062174337998814106326732925355228688137214901298112245145188984905722230728-
5255133155755015914397476397983411801999323962548289017107081850690630666655-
9949382757725720157630626906633326475653000092458883164330377797918696120494-
9739037782970490505108060994073026293712895895000358379996720725430436028407-
8895771796150945516748243471030702609144621572289880258182545180325707018860-
8721131280795122334262883686223215037756666225039825343359745688844239002654-
9819838548794829220689472168983109969836584681402285424333066033985088644580-
4001034933970427567186443383770486037861622771738545623065874679014086723327-
636718751234567890123456789012345678901e-308]
expected := 2,22507e-308
type must be number, but is null
===============================================================================
test cases: 34 | 32 passed | 2 failed
assertions: 8974 | 8971 passed | 3 failed
I need to better understand how to detect/handle overflows.
Digging into http://www.exploringbinary.com/how-glibc-strtod-works/, I am thinking the current approach is too simple. @whackashoe, how would you propose to cope with rounding?
I sorta have that feeling too... I've had a bit of a think about whether we could do some trick with epsilon and predict the loss during the mults and divs to get correct rounding, but I'm not super certain how that would look or whether it's even feasible; I'm certainly a floating-point novice, heh. The alternative would be, like in your link, basically trashing this and doing some bigint sort of thing... I'd really like it if we could beat that performance-wise, though :)
Searching google for "c++ string to double locale independent", this issue is on the second page. :)
Maybe a hybrid approach:
1. If strtod_l (or _strtod_l on Windows) is available at compile time, use that.
2. If not, check at runtime whether the locale is "C". If so, use strtod.
3. Otherwise, fall back to the slower implementation.
This could go in a single helper function so there aren't a lot of ifdefs or other forms of conditionals in the code.
I'm not sure what the performance of the latter would be like, or whether it's even feasible.
@gregmarr I think having a 10x (or whatever it is) penalty when the locale doesn't match seems like a poor workaround. I did a bench a while back and it was pretty bad (10x might be an exaggeration). Really, what we are doing doesn't match strtod; it just happens to provide all the functionality (and more...).
Correctness wins over performance, especially when there is a correct and performant option that is usually available.
If you can find a correct solution outside of that, that's great. Otherwise, you'll have to go with something slow when the correct solution isn't otherwise available.
Do we actually know that strtod_l is a problem in any C++11 conforming compiler/library?
http://lua-users.org/lists/lua-l/2016-04/msg00216.html
I found one reference where mingw's C library doesn't have either version above; in that case, the user simply searched the string, replaced '.' with the locale's decimal separator, and then called strtod.
Most of the slowness due to streams comes from initialization and locale-imbuing, so this can be mitigated by reusing an initialized stream per thread:
// Usage:
//   const int x = to_num(...);
//   const long double d = to_num(...);
#include <cstring>
#include <locale>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <typeinfo>

struct to_num
{
    const char* const data_ = nullptr;
    const size_t len_ = 0;

    to_num(const to_num&) = delete;
    to_num(to_num&&) = delete;
    to_num& operator=(const to_num&) = delete;
    to_num& operator=(to_num&&) = delete;

    to_num(const char* const data, size_t len = std::string::npos)
        : data_{ data }
        , len_{ len == std::string::npos ? strlen(data) : len }
    {}

    to_num(const std::string& s)
        : data_{ s.data() }
        , len_{ s.size() }
    {}

    template<typename T,
             typename = typename std::enable_if<std::is_arithmetic<T>::value>::type>
    operator T() const
    {
        // One stream per thread, imbued once with the classic ("C") locale.
        static thread_local std::unique_ptr<std::stringstream> sstr;
        if (!sstr) {
            sstr.reset(new std::stringstream);
            sstr->imbue(std::locale::classic());
        }

        if (len_ == 0) {
            throw std::runtime_error(
                std::string("Can't parse empty string as numeric type=")
                + typeid(T).name());
        }

        sstr->write(data_, static_cast<std::streamoff>(len_));
        T result;
        *sstr >> result;
        const bool could_not_parse = sstr->fail() || !sstr->eof();

        // Fix up sstr state regardless of whether the exception is thrown below.
        sstr->clear();            // clear out flags
        sstr->str(std::string()); // clear out data

        if (could_not_parse) {
            throw std::runtime_error(
                "Can't parse " + std::string(data_, len_)
                + " as a number of type " + typeid(T).name());
        }
        return result;
    }
};
@TurpentineDistillery Thanks for the code. I shall have a look and see how it performs compared to the current solution.
I tried the stream-based parsing approach myself.
The performance is still garbage : (
There's another approach: query the current locale's decimal-separator character (maybe just once, with static const initialization); if it is not '.', preprocess the input to look the way the current locale expects (replace '.' with the locale's decimal separator) and then dispatch to strtold or the appropriate variant.
@TurpentineDistillery I mentioned that possibility at the end of my previous comment. Have we tried the strtod_l function to see what its performance is?
@gregmarr, indeed you have! I thought this approach was a bit of a hack when I first thought about it, but now I'm feeling better about it : )
Is it possible to store numbers like 1.000.000,00 in JSON? If so, the parsing as you outline it might get more complicated, since it will yield an invalid result if you just replace the comma with a dot. However, it could still be a viable way.
@jomade No, this is not valid JSON. Only these numbers are allowed: http://json.org/number.gif
@jomade and we wouldn't be replacing the comma with a dot. We'd be replacing the dot in 1000000.00 with a comma 1000000,00 so that the locale-specific function would parse properly.
@nlohmann The failing regression tests seem to be misleading, as they do not take into account std::numeric_limits<double>::digits10. This is 15 for me, as it is most likely an IEEE 754 double [1], and we do see that the first 15 digits in both test cases are equal.
I ran std::strtod on strings from 72057594037927928.0 to 72057594037927936.0; the first four values in the range return 72057594037927928.0, while the last four return 72057594037927936.0. The test case 7205759403792793199999e-5 does lean towards the smaller value, but I'm not sure whether this should be relied on.
[1] https://en.wikipedia.org/wiki/IEEE_floating_point#Basic_and_interchange_formats
@qwename With 15 or fewer significant digits, you are guaranteed that a string-to-double-to-string round trip will reproduce the original string. With 17 digits, you are guaranteed that a double-to-string-to-double round trip will reproduce the same double. With more than 15 digits and an arbitrary sequence of digits, string-to-double-to-string is not guaranteed to reproduce the original string; it will only do so if the string was produced by converting a double to a string with 17 digits.
Closed with #450.