Json: Add a SAX parser

Created on 13 Feb 2018 · 79 comments · Source: nlohmann/json

The library currently only supports DOM-like parsing. This does not scale when the input files are enormous (#927). I would like to discuss what a SAX-like parser could look like.

My proposal (heavily motivated by RapidJSON) is as follows:

struct SAX
{
    // a null value was read
    bool null();

    // a boolean value was read
    bool boolean(bool);

    // an integer number was read
    bool number_integer(number_integer_t);

    // an unsigned integer number was read
    bool number_unsigned(number_unsigned_t);

    // a floating-point number was read
    // the string parameter contains the raw number value
    bool number_float(number_float_t, const std::string&);

    // a string value was read
    bool string(const std::string&);

    // the beginning of an object was read
    // binary formats may report the number of elements
    bool start_object(std::size_t elements);

    // an object key was read
    bool key(const std::string&);

    // the end of an object was read
    bool end_object();

    // the beginning of an array was read
    // binary formats may report the number of elements
    bool start_array(std::size_t elements);

    // the end of an array was read
    bool end_array();

    // a binary value was read
    // examples are CBOR type 2 strings, MessagePack bin, and maybe UBJSON array<uint8_t>
    bool binary(const std::vector<uint8_t>& vec);

    // a parse error occurred
    // the byte position and the last token are reported
    bool parse_error(int position, const std::string& last_token);
};

Some remarks:

  • All functions return a bool: true if the parser should continue or false if the parser should stop processing the input.
  • The proposal covers parsing of JSON, but also of CBOR, MessagePack, and UBJSON. Therefore, it contains extensions like array or object sizes as well as a binary type which do not occur when parsing JSON.
  • The idea is that the user would implement the above struct (we need to discuss whether we make all functions virtual, define a default implementation, etc.) and pass it to new parse functions, e.g. void parse_json(SAX &sax); or void parse_ubjson(SAX &sax);.

What do you think?

Labels: enhancement/improvement, proposed fix

All 79 comments

The idea is that the user would implement the above struct (we need to discuss whether we make all functions virtual, define a default implementation, etc.) and pass it to new parse functions, e.g. void parse_json(SAX &sax); or void parse_ubjson(SAX &sax);.

For "define a default implementation", were you thinking that the library would include the struct declaration in the headers, and the implementation in a separate .cpp file that could be used or replaced as needed?

For "define a default implementation", were you thinking that the library would include the struct declaration in the headers, and the implementation in a separate .cpp file that could be used or replaced as needed?

I think I would rather say the library should define an interface that client code needs to implement.

Maybe the "default implementation" could be a template argument of some sort, like adl_serializer currently is?

I have no experience with SAX at all but it could be a good occasion to continue the discussion which began in #605 and #774 with @vinniefalco.

I think "default implementation" makes little sense here, because it is an extension point for user code after all. Sorry for the confusion.

So you could not use the SAX version without reimplementing the SAX struct?

I think pure virtuals would provide the most flexibility. That way a code base can have multiple objects that implement the SAX interface.

@theodelrieu Yes, the user code needs to implement the SAX interface and pass the object to the SAX parser.

I think an eof() event would be helpful to detect situations like false 10 which currently would just trigger boolean(false) and end. With an eof() event, one can detect that there is still some non-whitespace input.

I'm not sure where to write this, but I saw #601 #862 and thought that I would point out that there is a draft for typed arrays in cbor which might be interesting to implement. https://github.com/cbor-wg/array-tags

I am not a fan of this SAX struct approach (but I am not submitting a pull request so this is only my opinion). As a side point, in terms of C++ design I agree with gregmarr that pure virtuals would be a better approach.

I was using rapidjson and switched to this package because it's easier and more intuitive. (I have had to keep rapidJSON for our sax-based serializer/deserializer). I would hope any SAX implementation would keep the spirit of nlohmann/json.

As the 'S' in SAX is for "stream", why not use the stream model, which is more C++-ish anyway and fits what this package already does? That is, if the system encounters the key "foo" it could call std::istream& handle_key(std::istream& is, std::string const& key) (similar to the stream extraction operator, but notice the second argument is const), which could in most cases recursively call the appropriate stream functions and set stream errors on failure. This also means I could read an array of object handles (which already have from_json handlers) in an intuitive way, consistent with the non-SAX version (and consistent with reading objects from any other stream).

The downside of this, and all such parsers, is that handle_key has to know about every possible key, but the user can easily handle this by managing a list of expected keys and their handlers on the call stack.

For writing then you could perhaps do something like:

std::vector<things> the_things {thing_one, thing_two};
json_ostream["the_things"] << the_things;

OK, perhaps that's too much!

PS: I don't know what to call this package, since its name is generic; that's why I say "this package" or "nlohmann/json" vs. "rapidJSON".

Overview

So about the SAX parsing, or parsing in general:

The library currently supports any combination of the following items:

  1. input: string, input stream, iterator range, container
  2. strict parsing (parsing must end with EOF): yes/no
  3. exceptions on parse error: yes/no
  4. using a callback function to filter values: yes/no

In addition, we also have an acceptor which only returns a boolean to indicate whether the input was a valid JSON text (works with any of 1. and 2.).

SAX Parsing

The current take on a SAX parser can also be combined with 1., 2., and 3. I think this is not too far from the spirit of the library :-)

For testing purposes, I implemented a SAX parser which creates a basic_json value - so basically a "SAX-to-DOM" parser. Surprisingly (for me), its performance was at least as good as the original parser.

Planning

I am planning to replace the original parser (well, the part where no callback (4.) is used) with this SAX-DOM parser, and also use the SAX approach for the binary formats CBOR, MessagePack, and UBJSON. As a result, we would have SAX and DOM parsers for all formats without code duplication on the parser classes. Furthermore, I think we can also realize the acceptor with a SAX parser.

What do you think?

  • @dvhwgumby Would this be a feasible approach for you? If not, could you please give more details for your proposed syntax?
  • @dvhwgumby What do you propose instead of pure virtuals?
  • @gregmarr What do you think of RapidJSON's approach to use templates like struct MyHandler : public BaseReaderHandler<UTF8<>, MyHandler>?

What do you think of RapidJSON's approach to use templates like struct MyHandler : public BaseReaderHandler?

One downside of this is that the parse() functions become templates. They have to take the handler using a template as the handlers don't share a common base class. It is a tradeoff between potential runtime performance improvements due to inlining vs virtual function calls compared to the code duplication that results from the multiple template instantiations. That's something that would have to be measured.

@nlohmann:

Would this be a feasible approach for you? If not, could you please give more details for your proposed syntax?
What do you propose instead of pure virtuals?

Consider the tiny save file below, extracted and simplified from our save format:

    {
        "marker_names": {
            "names": [
                "Marker1",
                "Marker2"
            ],
            "objects": [
                1,
                2
            ]
        },
        "markers": [
            {
                "index": 1,
                "flags": 0,
                "dependent_nodes": [
                    1,
                    4,
                    5
                ]
            },
            {
                "index": 2,
                "flags": 0,
                "dependent_nodes": [
                    2,
                    3
                ]
            }
        ]
    }

I imagine a SAX reader something like the following pseudocode. I assume the whole point of SAX parsing is that you want to read the JSON stream right into machine objects; if you wanted to read and query JSON objects, you might as well just build a DOM. Note that my approach below supports that as well (which might make sense if you had several objects inside a large JSON): you would simply >> them into a json object, query it, discard it, and then read the next, if any.

/// These read a uint64_t and look up (or create) a slot in a table:
void from_json(nlohmann::json const&, typename indexed_object<marker>::handle&);
void from_json(nlohmann::json const&, typename indexed_object<node>::handle&);

/// helper type for recognizing labels:
using label_handlers = std::map<std::string, std::function<void(nlohmann::sax_istream& is)>>;

load_marker_names(nlohmann::sax_istream& is) {
  std::vector<indexed_object<marker>::handle> handles;
  std::vector<std::string> names;
  label_handlers ls {
    {"names",   [&](nlohmann::sax_istream& is) { is >> names; }},
    {"objects", [&](nlohmann::sax_istream& is) { is >> handles; }}
  };
  while (auto l = get_label(is))
    ls[l](is);
  match_up_handles_and_names(handles, names);
}

load_marker_definitions(nlohmann::sax_istream& is) {...}

void load_save_object(nlohmann::sax_istream& is) {
  label_handlers ls {{"markers", load_marker_definitions}, {"marker_names", load_marker_names}};

  while (auto l = get_label(is))
    ls[l](is);
}

slurp_stream(std::istream& is) {
  nlohmann::sax_read_object(nlohmann::sax_istream(is), load_save_object);
}

By the way how should a SAX reader deal with JSON pointers? I don't use them so don't care much but any library should have support.

@dvhwgumby I personally do not think this approach is simpler, but YMMV. The code has little to do with what I think is SAX (btw: the "S" stands for "Simple"). I also do not understand why a parser would need to take care of JSON Pointers.

Thanks, that's my conceptual error: I had always thought of SAX as streaming because I only consider it for processing large data sets as they come in rather than using an (expensive) systolic process of read->extract->discard.

What I have never liked about RapidJSON is that I feel like I spend more time thinking about the syntax of the JSON and less about my own data. In your library it feels the other way around, so I get my job done faster and my code is clearer and easier to understand.

As for pointers: when you read the whole JSON into core before examining the data a pointer is easy to resolve and a library can just as well handle json_obj["foo"]["bar"] and json_obj["foo/bar"] automatically. But when streaming, a pointer can be a forward reference or backwards reference to something already read. Unless someone (user code? library code?) keeps track of this it's impossible to resolve. Personally I don't need this feature so I don't care much except that I hope any consideration of it does not complicate the non-pointer case.

Will outputting large json files in a streaming way also be part of this proposal?

In the sense that the file is not stored, but only events are emitted by the SAX parser: yes.

Brief update:

  • We now have a SAX interface for JSON, CBOR, MessagePack, and UBJSON.
  • We have an implementation of the SAX interface that creates a JSON value. In principle, a "SAX-to-DOM" parser. This is used for all input file formats. As a side effect, exceptions can be switched off, allowing a non-throwing parser for the binary formats (new feature, previously only for JSON).
  • We also have an implementation of the SAX interface for an acceptor. This now allows performing a simple true/false syntax check for the binary formats (new feature, previously only available for JSON).
  • The JSON parser is now non-recursive - instead of using the call stack for structured values, a vector is used to keep track of the hierarchy. This brings a small performance gain, but also raises the limit on maximal nesting, and should make a fixed depth limit possible (see #832).

What is missing:

  • The callback mechanism for the JSON parser has not been ported into the SAX world yet. I really would like to also have an implementation that uses the SAX interface to get rid of a lot of code that now only exists to support the callback.
  • The interfaces need to be harmonized to accommodate the new features. This also includes another look at the input adapter interface (see #834).
  • I have some doubts whether it is a good idea to move the strings from the lexer to the string() and key() SAX events.

Any comments on this?

About your last bullet point, I think you can simply use const std::string&, I don't see a use case for std::string&&. Could you point to a part of the code where there is one?

I haven't looked at the code yet, but does that interact well with from/to_json?

@theodelrieu I changed the interface. Now the lexer returns a reference. Then it is up to the consumer of the string to decide whether to copy or move.

It should not affect any from_json or to_json code - it is just the chain between lexer -> parser -> SAX events.

I think that returning a non-const reference is quite dangerous.

However if you want to allow users to decide whether they want to move the string or not, one way is to overload on rvalue references:

const string_t& get_string() &
{
  return this->str;
}

string_t&& get_string() &&
{
  return std::move(this->str);
}

Then, to call the second overload:

auto str = std::move(lexer).get_string();

It might look Lovecraft-esque, but I think this is the most flexible and safe way.

When a string is parsed, the following steps are taken:

  • The lexer reads a string literal, performs the required UTF-8 magic, and adds characters to the token_buffer.
  • The lexer returns token_type::value_string to the parser.
  • The parser calls get_string() and gets the reference of the lexer's token_buffer.
  • When the lexer processes the next string or number literal (lexing a number literal also needs the string buffer to collect the digits), token_buffer.clear() is called.

From my understanding, calling clear() is not just handy to reset the buffer, but also required to make sure that token_buffer is in a valid state after moving. Did I miss something dangerous?

Returning a non-const reference to a class member is fine as long as the receiver understands that the reference lifetime is equal to or less than the lifetime of the containing object. Another possibility is that the lexer::get_string() function takes a std::string & and moves or swaps token_buffer into it.

I cannot look at the code at the moment, I'll check tomorrow morning!

FYI, I haven't looked at code either, just going by what was in @nlohmann's last comment. :)

Looking at commit 4f6b2b6, it seems there are only two places where you would want the string to be moved.

  • key = m_lexer.get_string() L242
  • result.m_value = m_lexer.get_string() L378

The other occurrences of get_string() are in if conditions.

I don't like to return a string_t&, the caller could silently modify the private value, plus you cannot pass it to a function that wants a const string_t& without a const_cast.

Going with the rvalue overloads would fit both use cases and avoid user mistakes (e.g. modifying the token_buffer accidentally).

I don't like to return a string_t&, the caller could silently modify the private value

I assume that's not a problem because it's an output of the lexer, not an input.

plus you cannot pass it to a function that wants a const string_t& without a const_cast.

You can pass string_t& to const string_t& without a cast. It's the other direction that needs the cast.

Going with the rvalue overloads would fit both use-cases,

But it requires that you lie by saying std::move(lexer) when you're not actually destroying it.

and avoid users' mistakes (e.g. modifying the token_buffer accidently).

In what case would anyone ever see the contents of the token_buffer again, and how is that different than modifying it by moving from it?

You can pass string_t& to const string_t& without a cast. It's the other direction that needs the cast.

Right, I should compile code on my laptop instead of using my head.

But it requires that you lie by saying std::move(lexer) when you're not actually destroying it.

You're not lying in this case; calling std::move simply casts to an rvalue reference. Assigning this reference to another object would "destroy" it. I recommend this article for more details; move semantics are tricky :p

In what case would anyone ever see the contents of the token_buffer again

In the library's code there are multiple if checks that call get_string() for example.

how is that different than modifying it by moving from it?

The difference is in the expressed intent: calling std::move explicitly is quite clear about what you're trying to do. I would expect the default behavior not to be surprising (i.e., get_string().clear() should not compile).

By the way, even with the string_t& you need std::move to move the string:

auto s = std::move(lexer.get_string());

In the end I guess it's more a matter of style between those two options (and the previous move_string function).

You're not lying in this case, calling std::move simply cast to an rvalue reference

I know the effect, I'm referring to the semantics. Using it like this is just ugly.

The difference is in the expressed intent, calling std::move explicitly is quite clear about what you're trying to do.

Yes, that's why I don't think using it on the lexer is a good thing; it's not quite clear.

By the way, even with the string_t& you need std::move to move the string:

Yes, but in this case, you're casting the string, not the lexer, which is much saner.

Is there a reason we can't have const string_t &get_string() and string_t &&move_string() to make it more obvious what's happening?

That would be the most straight-forward option.

I'm not quite sure what you mean by "SAX-to-DOM parser". I looked at the test code in the sax2 branch and it looks like you do in fact call user type handlers incrementally. This must be a conceptual misunderstanding by me.

It is quite cool that this cleanly handles all the save types (MessagePack _et al_) automatically.

I'm not quite sure what you mean by "SAX-to-DOM parser". I looked at the test code in the sax2 branch and it looks like you do in fact call user type handlers incrementally. This must be a conceptual misunderstanding by me.

It is an implementation of the SAX interface that creates a JSON value from the received events. This may not be a surprising result, but it allows decoupling the syntax check from the actual value creation.

Is there a reason we can't have const string_t &get_string() and string_t &&move_string() to make it more obvious what's happening?

Which function would we pass to the SAX events key() and string()?

Is there a reason we can't have const string_t &get_string() and string_t &&move_string() to make it more obvious what's happening?

Which function would we pass to the SAX events key() and string()?

For it to be useful, we'd probably need rvalue versions of those as well.

For it to be useful, we'd probably need rvalue versions of those as well.

You mean that the user should implement both versions? I do not really see the benefit of this. I understand that passing a reference can be seen as dangerous - but in this case, I think it is not.

I'm saying that in order for there to be any benefit from moving the string, the called function has to take an rvalue reference. Otherwise, it's going to get copied, which will actually be worse than passing a const &.

I have no judgement right now on what the performance gain would be for any particular JSON file. That's something that would need to be measured to see if it's worth the effort.

Update: I have now implemented the callback parser with the SAX interface.

Compared to https://github.com/nlohmann/json/issues/971#issuecomment-374888906 there are now only two open issues:

  • the interface of the parse functions across all formats and features
  • what the SAX interface should look like

For the second point, I am still confident that a reference is fine, because it brings the most freedom to the consumer of the SAX events; that is, whether to make a copy or move from the provided string.

I realized that the "to move or not to move" question is only relevant if we force users to inherit from an abstract base class.

Indeed, the signature is forced on them, which is another reason why I dislike that choice for the SAX interface.

If we instead turn it into a concept, we can simply require that the key method (and other methods too) is callable with an rvalue reference.

The library would then always call std::move on callback parameters.

// User defined SAX interface

bool key(std::string&&);
// or
bool key(const std::string&);


// Library code
sax.key(std::move(token_buffer));

As the functional part of the SAX parser is ready, we need to focus on the interfaces now.

  • I am still not sure why we should add two functions to the abstract base class (or avoid this if we choose a concept) to cope with whether the SAX client chooses to make a copy or move from the passed string. This feels like overhead to me. Why not pass a std::string& and let the user decide whether to copy or move?
  • I know that @theodelrieu is a SFINAE magician, but I am afraid of a complex is_sax_struct function to let the parse functions only accept the "right" SAX clients.

Furthermore, the input adapters are quite a construction site (#834, #1031), which influences the interface to the parsers. I am not sure about the best solution there - it is currently (also) some mixture between an abstract base class and SFINAE magic... :-/

As always, any help is greatly appreciated. Maybe a brief chat on Slack could speed things up.

I agree that meeting on Slack would be more productive than Github comments. I'm always available on Slack, maybe we can schedule a meeting on the #general channel, so that anyone interested can join?

I wouldn’t be sure what slack organization to join, but am also on slack most of the day.

I remain confused by the concept of SAX-to-DOM. In a reply to #927, you (@nlohmann) wrote,

Right now, we only support DOM-like parsing to memory. .... With [a SAX] approach, you may parse and process the input without the need of converting each element to a JSON value and storing it.

I have the same enormous-file issue as the submitter of issue #927, which is the efficient processing of large numbers of large structures into and out of files of JSON. Does the "to-DOM" process end up creating a lot of intermediate temporary data?

Once we merge this issue, we have a SAX interface which works with parse events. You are then free to decide what to do with such events. The parser side only takes care of syntax checking and conversion from JSON text to numbers. We have three implementations of that SAX interface bundled with the library:

  • SAX to DOM: this creates a JSON value in memory. That is, the events are collected and used to build a JSON value. This was the default behavior before we had the SAX interface.
  • SAX to DOM with callback: this is the same as above, but allows passing a callback function which can control whether to store values or not. See the documentation for more info.
  • Acceptor: This interface only returns true if the input is valid JSON text or false otherwise.

The parser has been reworked to be non-recursive and to use only a single bit (i.e., an entry in a std::vector<bool>) per recursion depth. This reduces the memory consumption quite a bit. All additional memory depends on how the SAX events are processed.

When it comes to large JSON files, the SAX interface now allows you to implement your own code to handle the events emitted by the parser. If building a DOM requires too much memory, you now have an option to deal with the input in the way your domain requires.

@nlohmann Ah, the same callback as the current implementation? Very nice!

I could be on Slack on Saturday or Sunday evening CET.

Saturday evening is great for me.

Ok, then tomorrow evening at 19:00 CET on nlohmannjson.slack.com

We are now live at https://nlohmannjson.slack.com

I had a chat with @theodelrieu yesterday. On the question whether to use an abstract base class or a concept, we decided to run benchmarks and see if we can measure differences. Based on that, we should go either way. I post results once I have them.

With concepts not part of the standard, is this a good idea?

(I wish concepts were already in the standard, I am just thinking of the consequences for a library).


In this case, concept is just a general term for allowing any class that has the required functions, using templates, rather than virtual functions. There may eventually be a Concept (the language term) for this, but not right away.

I started to implement the is_sax trait (https://github.com/theodelrieu/json/commit/01bb178be10f1b9b392189f61d3dab0cb00593d0), I shall continue later if we go the concept way.

PS: Some code could be refactored using the is_detected idiom, and the meta.hpp is starting to grow. I'll likely split it later on :)

FYI: I may not be able to finish the benchmarks this week (and maybe the week after), because I'm traveling.

I won't be available for the next 2 weeks either.

I did not get the code running with a template rather than the current approach. It seems I would have to touch every second class or function and make it templated. I am not sure how to proceed here.

Yes I think you have to do that unfortunately...

I think about merging the sax parser to develop and use this code as baseline for further discussions.

I just merged the current SAX state to develop as the core development is done and we were only discussing about the interface. Having this in develop now allows for easier PRs.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The SAX parser is ready and merged to develop. Before releasing it, I would like to get back to the discussion on the SAX interface. Personally, I am happy with the status quo (the abstract base class), and as described in https://github.com/nlohmann/json/issues/971#issuecomment-385260488, the approach of @theodelrieu seems to be a lot of work. So I would be happy if someone volunteered for this or could otherwise provide benchmarks on the different approaches. :)

If you can set up a benchmark which uses the current abstract base class interface, I can implement the template interface on a branch and run the same benchmark.

There are a number of benchmarks reading JSON files that you can execute with make run_benchmarks. You can select benchmarks by their name:

./json_benchmarks --benchmark_filter="ParseFile.*"

I should add CBOR/MessagePack/UBJSON to that benchmark eventually, but for the moment this should be sufficient.

Great, I'll try to get on it in the next days.

After refactoring just enough to make things compile and work (i.e. only for the benchmarks), here are the results I got:

Virtual functions

2018-06-29 18:05:46
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------
Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
ParseFileSAX/jeopardy             491 ms        491 ms          2   107.836MB/s
ParseFileSAX/canada                34 ms         34 ms         21   63.7264MB/s
ParseFileSAX/citm_catalog          11 ms         11 ms         65   153.655MB/s
ParseFileSAX/twitter                5 ms          5 ms        150    129.69MB/s
ParseFileSAX/floats               295 ms        295 ms          2   73.1813MB/s
ParseFileSAX/signed_ints          218 ms        218 ms          3   106.843MB/s
ParseFileSAX/unsigned_ints        208 ms        208 ms          3   111.951MB/s
ParseFile/jeopardy                729 ms        729 ms          1   72.6999MB/s
ParseFile/canada                   39 ms         39 ms         18   55.5084MB/s
ParseFile/citm_catalog             13 ms         13 ms         53   126.019MB/s
ParseFile/twitter                   6 ms          6 ms        110   95.0992MB/s
ParseFile/floats                  300 ms        300 ms          2   72.1386MB/s
ParseFile/signed_ints             221 ms        221 ms          3   105.054MB/s
ParseFile/unsigned_ints           217 ms        217 ms          3   107.192MB/s

Templates

2018-06-29 18:02:43
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------
Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
ParseFileSAX/jeopardy             487 ms        487 ms          2   108.807MB/s
ParseFileSAX/canada                33 ms         33 ms         21   64.8792MB/s
ParseFileSAX/citm_catalog          11 ms         11 ms         67   156.783MB/s
ParseFileSAX/twitter                5 ms          5 ms        151   131.045MB/s
ParseFileSAX/floats               288 ms        288 ms          2   75.0251MB/s
ParseFileSAX/signed_ints          214 ms        214 ms          3   108.891MB/s
ParseFileSAX/unsigned_ints        207 ms        207 ms          3   112.666MB/s
ParseFile/jeopardy                720 ms        720 ms          1   73.5445MB/s
ParseFile/canada                   38 ms         38 ms         18   56.3457MB/s
ParseFile/citm_catalog             13 ms         13 ms         54   128.212MB/s
ParseFile/twitter                   6 ms          6 ms        111   96.6999MB/s
ParseFile/floats                  297 ms        297 ms          2   72.7776MB/s
ParseFile/signed_ints             222 ms        222 ms          3   104.496MB/s
ParseFile/unsigned_ints           214 ms        214 ms          3   108.674MB/s

I changed the default output from nanoseconds to milliseconds.

Please note that it also improves the DOM parsing, since the SAX interface is also used internally.

The changes were not that long to make, though there is some refactoring to be done in binary_reader (e.g. adding a template argument for the SAX parser).

Also, the default arguments cannot be used with the template versions (the no_limit value).

There were fewer changes to make than I expected though.

That is very nice, though the differences are really small. Could you share your code so we can compare the APIs?

I don't have the code on my laptop, but basically it's just removing the virtual functions; the API is exactly the same (the struct you have to pass to sax_parse()).

I'll try to finish refactoring next week, it'll be easier to judge on a PR

Thanks a lot. Have a great weekend!

Confused that this is still open; isn't there a SAX parser now?

It's only in the develop branch, but not yet released. The interface still might change, see #1153.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Still in progress, not stale.

#1153 is still to be reviewed; it should be OK feature-wise.

I shall try to look into the PR this weekend.

Thanks a lot everybody! After merging #1153, all that is left for the issue is some documentation. I shall work on this and then we should be ready for the next release.

Quick nitpick on the release version, shouldn't it be a 3.2.0 instead?

(Spoiler alert: yes!)

I found a bug with the following example code:

#include <iostream>
#include <iomanip>
#include "json.hpp"

using json = nlohmann::json;

int main()
{
    // a JSON text
    auto text = R"(
    {
        "Image": {
            "Width":  800,
            "Height": 600,
            "Title":  "View from 15th Floor",
            "Thumbnail": {
                "Url":    "http://www.example.com/image/481989943",
                "Height": 125,
                "Width":  100
            },
            "Animated" : false,
            "IDs": [116, 943, 234, 38793]
        }
    }
    )";

    // define parser callback
    json::parser_callback_t cb = [](int depth, json::parse_event_t event, json & parsed)
    {
        // skip object elements with key "Thumbnail"
        if (event == json::parse_event_t::key and parsed == json("Thumbnail"))
        {
            return false;
        }
        else
        {
            return true;
        }
    };

    // parse (with callback) and serialize JSON
    json j_filtered = json::parse(text, cb);
    std::cout << std::setw(4) << j_filtered << '\n';
}

Skipping an object after parsing a key yields a situation where we have a null pointer on the reference stack. I need some time to fix this...

I fixed the issue described in https://github.com/nlohmann/json/issues/971#issuecomment-413678360 in e33b31e6aae013f011a6711d8e8ebca776b63013.

