In this RFC document, I propose that the XGBoost project eventually migrate to JSON for serializing decision tree models.
Special thanks to @trivialfis for initially coming up with the idea of using JSON.
The current method of serialization used in XGBoost is a binary dump of model parameters. For instance, the decision tree class (RegTree) is serialized as follows:
/* Adapted from include/xgboost/tree_model.h; simplified for presentation */
void Save(dmlc::Stream* fo) {
  // Save tree-wide parameters
  fo->Write(&param, sizeof(TreeParam));
  // Save tree nodes
  fo->Write(nodes_.data(), sizeof(Node) * nodes_.size());
  // Save per-node statistics
  fo->Write(stats_.data(), sizeof(NodeStat) * stats_.size());
  // Save optional leaf vector
  if (param.size_leaf_vector != 0) {
    fo->Write(leaf_vector_);
  }
}
void Load(dmlc::Stream* fi) {
  // Load tree-wide parameters
  fi->Read(&param, sizeof(TreeParam));
  nodes_.resize(param.num_nodes);
  stats_.resize(param.num_nodes);
  // Load tree nodes
  fi->Read(nodes_.data(), sizeof(Node) * nodes_.size());
  // Load per-node statistics
  fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
  // Load optional leaf vector
  if (param.size_leaf_vector != 0) {
    fi->Read(&leaf_vector_);
  }
}
The binary serialization method has a few benefits: it is straightforward to implement, has low memory overhead (since we can read from and write to file streams directly), is fast to run (no pre-processing required), and produces compact model binaries.
Unfortunately, the method has one significant shortcoming: it is not possible to add new fields without breaking backward compatibility. Backward compatibility refers to the ability on the part of a new version of XGBoost to read model files produced by older versions of XGBoost. To see why we can't add new fields, let's refer to the decision tree class example shown above. The current NodeStat class has four fields:
/* NodeStat: Class to store statistics for a node in a decision tree
 * Version 1 */
struct NodeStat {
  // loss change caused by current split
  float loss_chg;
  // sum of hessian values, used to measure coverage of data
  float sum_hess;
  // weight of current node
  float base_weight;
  // number of leaf nodes among the children
  int leaf_child_cnt;
};
The total byte size of NodeStat is 16 bytes. Now imagine a fictitious scenario where we want to add a new field called instance_cnt. This field would store the number of instances (data points) that are associated with the node:
/* NodeStat: Class to store statistics for a node in a decision tree
 * Version 2 */
struct NodeStat {
  // loss change caused by current split
  float loss_chg;
  // sum of hessian values, used to measure coverage of data
  float sum_hess;
  // weight of current node
  float base_weight;
  // number of leaf nodes among the children
  int leaf_child_cnt;
  // NEW: number of instances associated with the node
  int64_t instance_cnt;
};
Note that the new version of NodeStat is now 24 bytes. We have already broken backward compatibility: when the latest version of XGBoost runs the snippet
// Load per-node statistics
fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
it will attempt to read 24 * M bytes, where M is the number of nodes in the decision tree. However, if the saved model was serialized by an old version of XGBoost, there would be only 16 * M bytes to read in the serialized file! The program will either crash or exhibit some kind of undefined behavior.
What would be the work-around? We can add extra logic in the snippet above to check which version of XGBoost produced the serialized file:
if ( /* the model was generated with the old version */ ) {
  for (size_t i = 0; i < stats_.size(); ++i) {
    NodeStatOld tmp;  // Version 1 of NodeStat
    fi->Read(&tmp, sizeof(NodeStatOld));
    stats_[i].loss_chg = tmp.loss_chg;
    stats_[i].sum_hess = tmp.sum_hess;
    stats_[i].base_weight = tmp.base_weight;
    stats_[i].leaf_child_cnt = tmp.leaf_child_cnt;
    // instance count is missing
    stats_[i].instance_cnt = -1;
  }
} else {  /* the model was generated with the new version */
  fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
}
That's a lot of lines for reading a single C++ vector. In general, the extra logic for backward compatibility accumulates over time and inflicts a maintenance burden on future contributors. To make matters worse, the current XGBoost codebase does not save the version number, so there is no way to query the version of the serialized file. The proposed work-around is therefore not only messy but also not actually feasible.
JSON (JavaScript Object Notation) is a light-weight serialization format built on two fundamental structures: JSON objects (sets of key-value pairs) and JSON arrays (ordered lists of values). In this section, I will argue that it is possible to use JSON as a future-proof method for ensuring backward and forward compatibility, defined as follows:

- Backward compatibility: a new version of XGBoost can read model files produced by older versions.
- Forward compatibility: an old version of XGBoost can read model files produced by newer versions.

We will make use of two useful characteristics of JSON objects:

- Missing keys can be reliably detected and handled.
- Extra, unrecognized keys can simply be ignored.
Let's return to the NodeStat example from the previous section. Version 1 of NodeStat can be expressed as a JSON object as follows:
{
  "loss_chg": -1.5,
  "sum_hess": 12.4,
  "base_weight": 1.0,
  "leaf_child_cnt": 2
}
(The values are made up as an example.) Version 2 would be something like
{
  "loss_chg": -1.5,
  "sum_hess": 12.4,
  "base_weight": 1.0,
  "leaf_child_cnt": 2,
  "instance_cnt": 3230
}
Let's first check whether backward compatibility holds. The latest version of XGBoost will attempt to read the keys loss_chg, sum_hess, base_weight, leaf_child_cnt, and instance_cnt from the given JSON file. If the JSON file was produced by an old version, it will only have loss_chg, sum_hess, base_weight, and leaf_child_cnt. The JSON parser can reliably detect that instance_cnt is missing. This is much better than having the program crash upon encountering missing bytes, as the binary parser would do. Upon detecting missing keys, we can mark the corresponding values as missing. For the NodeStat example, we'd put in -1 to indicate the missing value, since -1 cannot be a valid instance count. For optional fields such as instance_cnt, this response is sufficient.
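To make the missing-key handling concrete, here is a minimal sketch (not the actual XGBoost implementation; the LoadNodeStat helper and the struct defaults are assumptions) of how a reader built on dmlc::JSONReader could fill a NodeStat while tolerating an absent instance_cnt:

#include <dmlc/json.h>
#include <cstdint>
#include <string>

struct NodeStat {
  float loss_chg{0.0f};
  float sum_hess{0.0f};
  float base_weight{0.0f};
  int leaf_child_cnt{0};
  int64_t instance_cnt{-1};  // -1 marks "missing" when reading an old model file
};

// Hypothetical helper: read one NodeStat object, key by key.
void LoadNodeStat(dmlc::JSONReader* reader, NodeStat* out) {
  std::string key;
  reader->BeginObject();
  while (reader->NextObjectItem(&key)) {
    if (key == "loss_chg") {
      reader->Read(&out->loss_chg);
    } else if (key == "sum_hess") {
      reader->Read(&out->sum_hess);
    } else if (key == "base_weight") {
      reader->Read(&out->base_weight);
    } else if (key == "leaf_child_cnt") {
      reader->Read(&out->leaf_child_cnt);
    } else if (key == "instance_cnt") {
      reader->Read(&out->instance_cnt);
    }
    // A production parser would also need to skip unrecognized keys here,
    // which is what gives us forward compatibility.
  }
  // Keys never seen (e.g. "instance_cnt" in an old file) keep the
  // "missing" defaults set in the struct definition above.
}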
On the other hand, some fields are not optional. For instance, a brand-new feature could be added that makes use of new key-value pairs. In this case, we will need to throw an error whenever the user attempts to use that particular feature. We are still doing better than before, however, since the JSON parser does not have to perform this kind of error handling: it simply marks missing key-value pairs as missing, and the later parts of the codebase that use the missing pairs may choose to report errors. We are providing what is known as graceful degradation, where users can bring old model files and still use basic functionality.
How about forward compatibility? A JSON file produced by a later version of XGBoost will contain the extra key instance_cnt, which can simply be ignored by the older version. So forward compatibility holds as well.
Freed from future compatibility issues, developers will be able to focus on more interesting development work.
Every JSON file produced by XGBoost should have major and minor version fields at the root level:
{
  "major_version": 1,
  "minor_version": 0,
  ... other key-value pairs ...
}
All versions that share the same major version are compatible with each other. For example, a program using format version 1.1 can read a model with any version of the form 1.x. The minor version should be increased whenever a new field is added to a JSON object.
The major version should be kept to 1, unless a major revision occurs that breaks backward and/or forward compatibility. So version 2.x will not be compatible with 1.x.
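As an illustration only (the function name and error text here are hypothetical, not part of the proposal), a loader could enforce this rule with a check like the following:

#include <sstream>
#include <stdexcept>

// Major version this build of XGBoost understands (per the proposal, kept at 1
// until a breaking revision occurs).
constexpr int kSupportedMajorVersion = 1;

void CheckFormatVersion(int major_version, int minor_version) {
  if (major_version != kSupportedMajorVersion) {
    std::ostringstream oss;
    oss << "Model format version " << major_version << "." << minor_version
        << " is not compatible with this build, which understands major version "
        << kSupportedMajorVersion;
    throw std::runtime_error(oss.str());
  }
  // Differing minor versions are fine: extra keys are ignored and
  // missing keys are treated as missing values.
}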
See Appendix A for the full schema of Version 1.0.
Compatibility is the largest motivator for me to propose JSON as the next-generation serialization format. However, there are other benefits too. For example, each parameter object (dmlc::Parameter) is represented neatly as a JSON object, since parameter objects are just collections of key-value pairs. We are now free to add more fields to DMLC parameter objects.

There are many excellent 3rd-party JSON libraries for C++, e.g. Tencent/RapidJSON. However, we will eschew 3rd-party libraries. As @tqchen pointed out, extra dependencies make it more difficult to port XGBoost to various platforms and targets. In addition, most JSON libraries assume free-form, schema-less objects, where JSON objects and arrays may nest each other arbitrarily (thanks @KOLANICH and @tqchen for pointing this out). This assumption adds unnecessary overhead, whereas our proposed JSON format for XGBoost has a regular schema.
Fortunately, the DMLC-Core repository comes with a built-in JSON parser and serializer. What's more, the DMLC JSON parser (dmlc::JSONReader) lets us make assumptions about how JSON objects and arrays are nested. For instance, if we expect a JSON object containing a single key-value pair whose key is foo and whose value is an array of integers, we'd write:
dmlc::JSONReader reader(&input_stream);
std::string key;
reader.BeginObject();                  // Assume: the top-level value is a JSON object, not an array
assert(reader.NextObjectItem(&key));   // Assume: there is at least one key
assert(key == "foo");                  // Assume: the key is "foo"
reader.BeginArray();                   // Assume: the value for "foo" is an array
std::vector<int> foo_values;
while (reader.NextArrayItem()) {
  int val;
  reader.Read(&val);                   // Assume: each element in the array is an integer
  foo_values.push_back(val);
}
assert(!reader.NextObjectItem(&key));  // Assume: there is only one key
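For symmetry, here is a sketch of writing the same structure with the DMLC JSON serializer (dmlc::JSONWriter), under the assumption that its BeginObject / WriteObjectKeyValue / EndObject interface behaves as in DMLC-Core:

#include <dmlc/json.h>
#include <sstream>
#include <string>
#include <vector>

std::string WriteFoo(const std::vector<int>& foo_values) {
  std::ostringstream os;
  dmlc::JSONWriter writer(&os);
  writer.BeginObject();
  // std::vector<int> is serialized as a JSON array of integers.
  writer.WriteObjectKeyValue("foo", foo_values);
  writer.EndObject();
  return os.str();  // e.g. {"foo": [1, 2, 3]}
}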
One tricky part is that the built-in helper class for reading structures will throw errors upon encountering missing or extra key-value pairs. I will write a custom helper class as part of the draft implementation of the RFC. The custom helper class will handle missing key-value pairs in a consistent way. Appendix B describes how the missing values are to be represented for each data type.
Q. How about space efficiency? JSON will result in bigger model files than the current binary serialization method.
A. There are a few 3rd-party libraries (e.g. MessagePack) that let us encode JSON into a more compact binary form. We will consider adding a plug-in for generating such a binary encoding, so that 3rd-party dependencies remain optional. There will still be some storage overhead for integers and floats, but the cost is, in my opinion, worthwhile; see the previous sections for the compatibility benefits of JSON.
Q. How about read/write performance? All the text operations will slow things down.
A. I admit that it is hard to beat the performance of binary serialization; how can you do better than simply dumping bit-patterns from memory? We are really making a trade-off here, trading read/write speed for future extensibility. However, we can soften the blow by serializing decision trees in parallel; see Microsoft/LightGBM#1083 for an example.
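As a rough sketch of that idea (the Tree type and ToJSON method below are placeholders, not existing XGBoost APIs): render each tree to its own string in parallel, then concatenate the pieces in order.

#include <string>
#include <vector>

// Placeholder tree type; in reality each RegTree would produce its own JSON.
struct Tree {
  std::string ToJSON() const { return "{}"; }
};

std::string SerializeForest(const std::vector<Tree>& trees) {
  std::vector<std::string> pieces(trees.size());
  // Each tree is independent, so the expensive text conversion parallelizes well.
  // (The pragma is a no-op if the compiler is not built with OpenMP support.)
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < static_cast<int>(trees.size()); ++i) {
    pieces[i] = trees[i].ToJSON();
  }
  std::string out = "[";
  for (size_t i = 0; i < pieces.size(); ++i) {
    out += pieces[i];
    if (i + 1 < pieces.size()) out += ",";
  }
  out += "]";
  return out;
}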
Q. Binary formats are not necessarily closed to future extension. With some care, it is possible to design a new binary format with future extensibility in mind.
A. It is certainly possible to design a binary format to enable future extensions. In fact, had the current version of XGBoost stored the version number in model artifacts, we could have added extra logic for handling multiple formats. However, I still think JSON is superior, for two reasons: 1) designing an extensible binary format takes great precision and care to get right; and 2) extra logic for managing multiple versions in a custom parser can be messy, as we saw in the first section.
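For illustration only (purely hypothetical, not a proposal): an extensible binary layout typically prefixes each section with an identifier and a byte length, so that readers can skip sections they do not recognize.

#include <dmlc/io.h>
#include <cstdint>
#include <vector>

// One length-prefixed section: readers that don't recognize `id` can skip
// `length` bytes and continue with the next section.
void WriteSection(dmlc::Stream* fo, uint32_t id, const std::vector<char>& payload) {
  uint64_t length = payload.size();
  fo->Write(&id, sizeof(id));
  fo->Write(&length, sizeof(length));
  fo->Write(payload.data(), payload.size());
}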
The full schema can be accessed at https://xgboost-json-schema.readthedocs.io/en/latest/. The source is hosted at https://github.com/hcho3/xgboost-json-schema.
""
nan
std::numeric_limits<int_type>::max()
null
(null object)I have a few more problems need to address, mostly related to parameters. Introducing JSON is kind of part of the plan. It won't have any conflict with current proposal(nice job). Will add proper comments as soon as possible. :) And it's really nice to have a organized RFC, it's much better than my original attempt.
(e.g. MessagePack)
... and CBOR
@hcho3 can you please elaborate a bit more on how we store the tree structure, and give a minimal example?
@tqchen I'll put up the full schema soon, and as part of that will describe how tree structures will be stored.
migrate to JSON for serializing decision tree models.
Linear models are also supported; they are actually much easier to handle than tree models. And I'm keen on adding new, interesting algorithms.
The RFC explicitly addresses adding new parameters. But in reality we might also deprecate parameters in the future, due to deprecated algorithm features, duplicated parameters across components (n_gpus and gpu_id come to mind), or broken features disabled temporarily. Luckily, handling removed parameters is the reverse of adding new parameters, which is addressed in this RFC.

Reading an old model with a newer version of XGBoost: extra parameters that were removed in the newer version but are present in the old model file can simply be ignored. This corresponds to the forward-compatibility case of adding new parameters.

Reading a new model with an older version of XGBoost: this corresponds to the backward-compatibility case of adding new parameters. Here the removed parameters, which are not present in the model file, show up as missing values. Since most parameters have default values, parameters not present in the model file (missing) can also simply be ignored.
| | Forward | Backward |
|:------:|:-----------------------------------------------------:|:---------------------------------------------:|
| Add | Extra keys can simply be ignored by the older version. | Use default values to handle missing values, or degrade gracefully. |
| Remove | Use default values to handle missing values, or degrade gracefully. | Extra keys can be ignored by the newer version. |
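Since the "use default values" cells in the table above rely on dmlc::Parameter defaults, here is a small sketch (the ExampleParam struct is hypothetical) of how keys absent from a model file fall back to their declared defaults:

#include <dmlc/parameter.h>
#include <string>
#include <utility>
#include <vector>

// Hypothetical parameter struct: fields absent from the model file simply
// keep their declared defaults.
struct ExampleParam : public dmlc::Parameter<ExampleParam> {
  int max_depth;
  float learning_rate;
  DMLC_DECLARE_PARAMETER(ExampleParam) {
    DMLC_DECLARE_FIELD(max_depth).set_default(6);
    DMLC_DECLARE_FIELD(learning_rate).set_default(0.3f);
  }
};

DMLC_REGISTER_PARAMETER(ExampleParam);

int main() {
  ExampleParam param;
  // Only one key is present (as if read from an old model file);
  // the other falls back to its default of 0.3.
  std::vector<std::pair<std::string, std::string>> kwargs = {{"max_depth", "8"}};
  param.InitAllowUnknown(kwargs);
  return 0;
}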
Most of the classes (in C++) in XGBoost have their own parameters, for example objectives and split-evaluators. Currently these parameters are not saved in a unified way and, as a matter of fact, are rarely saved at all. My hope is that we add an IO interface to these classes and recurse over all of them when saving/loading XGBoost. For example, the learner class can call the Save/Load methods of the updater, then the updater calls Save/Load on its split-evaluators, and so on. Once finished and the call stack returns, we will have a complete snapshot of XGBoost's state.
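A minimal sketch of that recursion, using hypothetical interfaces (these are not the current XGBoost classes): each component saves its own parameters and then delegates to its children, so the call stack mirrors the object graph.

#include <memory>
#include <vector>

class JSONWriter;  // stand-in for whichever JSON writer we settle on

class SplitEvaluator {
 public:
  virtual ~SplitEvaluator() = default;
  virtual void Save(JSONWriter* writer) const = 0;  // each evaluator knows its own fields
};

class TreeUpdater {
 public:
  void Save(JSONWriter* writer) const {
    // ... save the updater's own parameters first ...
    for (const auto& ev : split_evaluators_) {
      ev->Save(writer);  // delegate to each split-evaluator
    }
  }
 private:
  std::vector<std::unique_ptr<SplitEvaluator>> split_evaluators_;
};

class Learner {
 public:
  void Save(JSONWriter* writer) const {
    // ... save learner-level parameters first ...
    updater_->Save(writer);  // recurse into the updater, and so on down the graph
  }
 private:
  std::unique_ptr<TreeUpdater> updater_;
};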
This way we might actually need a schema-less representation. For example, MonotonicConstraint is one of the split-evaluators, and it contains array (std::vector) parameters; and we have a list of split-evaluators. It's not yet clear to me how to achieve this with dmlc::JSON* without a lot of boilerplate code. Please note that every added helper class is a fragment of code that's hard to integrate.
This is a future plan. If there's a typo or an unused parameter in the user-specified model, XGBoost simply ignores it, which might lead to degraded performance or unexpected behavior. @RAMitchell and I want to add extra checks for wrong/unused parameters. It's not yet clear to me how to achieve this, but the basic idea is to make each component responsible for checking/marking its own unique parameters, and let a manager class (maybe the learner) check for unused (not marked by any other component) parameters after all components are done with their configuration. Maybe this logic can be incorporated into handling missing/extra parameters during model IO?
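A rough sketch of that mark-then-sweep idea, using hypothetical types and function names (not the actual XGBoost configuration code):

#include <iostream>
#include <map>
#include <set>
#include <string>

using ParamMap = std::map<std::string, std::string>;

// Each component marks the keys it actually consumed and returns them.
std::set<std::string> ConfigureTreeParams(const ParamMap& params) {
  std::set<std::string> used;
  auto it = params.find("max_depth");
  if (it != params.end()) {
    // ... apply the parameter to the component ...
    used.insert(it->first);
  }
  return used;
}

// A manager class (e.g. the learner) sweeps for keys that nobody claimed.
void WarnUnusedParameters(const ParamMap& params, const std::set<std::string>& used) {
  for (const auto& kv : params) {
    if (used.count(kv.first) == 0) {
      std::cerr << "Warning: parameter '" << kv.first << "' was not used by any component\n";
    }
  }
}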
@hcho3 Feel free to let me know if I can help in any way.
@trivialfis I like the idea of a complete snapshot, having performed a messy hack to preserve GPU predictor parameters (#3856). That said, we want to do it in a way that minimizes boilerplate code. I will re-read your comment when I complete the draft of the 1.0 schema. Mainly, I'm documenting what's currently being saved.
@trivialfis Also, I used to be wary of saving more things into the model, because of the future maintenance liability caused by binary serialization. I may be more comfortable with saving everything now that JSON serialization could forestall compatibility headaches.
IMHO we should make XGBoost itself save the hyperparameters used to create and evaluate the model (with the possibility to disable that), and the column names and types, keeping their order (with the possibility to disable that and/or to discard the stored names and use integers, starting from 0, without reloading the model). Column names are needed in order to reshape a pandas DataFrame so that the columns have the same order the model was trained on (currently I store this info in separate files). But I still insist on keeping the data strictly schemed; no arbitrary stuff should be saved there, since that would result in bloated models.
@KOLANICH I'm not so sure about that, since column names and types are external to the core C++ codebase. So far, we've had the Python / R / Java wrappers save this information.
@trivialfis @RAMitchell What do you think?
@KOLANICH @hcho3
and column names and types, keeping their order
It's a viable suggestion that we manage all of this in C++; it may lead to cleaner data management. But that's a topic for another day, since we have to first implement this management code and then implement the IO. Adding the feature later won't bring too many hassles, since we will have very good backward compatibility, as explained by @hcho3. A new issue might be desirable once this gets sorted out.
no arbitrary stuff should be saved there, since it would cause bloat.
You can suggest which specific items should not be saved into the model file once we have an initial draft (I imagine there will be a long reviewing process; you are welcome to join); otherwise it's hard to argue what goes into the category "arbitrary stuff".
@trivialfis @tqchen I've uploaded the full schema as a PDF attachment.
I may want to put up the schema doc in a GitHub repo, so that we can update the LaTeX source upon every schema update and have it automatically compiled into PDF.
@trivialfis The PDF doc didn't address your suggestion. Can you take a look and see how the complete snapshot idea can be implemented?
@hcho3 If you put the doc in a GitHub repo I can make PRs. Then you can comment on my changes. :)
Let us just use markdown for documenting schema, it should be in docs eventually
otherwise it's hard to argue what goes into the category "arbitrary stuff".
I really meant arbitrary stuff there, for example the weather on Venus or a full multi-GiB dataset. If we allowed devs to do such things effortlessly, they would, and this would result in heavily bloated model files.
Let us just use markdown for documenting schema, it should be in docs eventually
Why not use JSON Schema? It is a machine-readable spec, it can be automatically validated, and it can be rendered into documentation, including a nice interactive one.
@KOLANICH
I really meant arbitrary stuff there. For example weather on Venus or full multi-GiB dataset.
You can't really prevent users from saving arbitrary things in their JSON files. What we can do is enumerate which fields are recognized by XGBoost. My schema document does this. Stuff that's not recognized by the schema will be ignored.
Why not to use JSONSchema?
I'm not sure how we could support the StringKeyValuePairCollection class, which allows arbitrary key-value pairs.
For now, I'll be typing up the schema doc in RST.
You can't really prevent users from saving arbitrary things in their JSON files.
Yes, we cannot. But we can discourage users from storing arbitrary data by not providing an API for adding and editing it. Then, if someone wants to do it, they have to parse the resulting blob, add the data manually, and serialize it back, so it may be easier for them to store it in a separate file.
@tqchen I've typeset the schema in RST: https://xgboost-json-schema.readthedocs.io/en/latest/
@trivialfis You can submit a pull request at https://github.com/hcho3/xgboost-json-schema
Update: See hcho3/xgboost-json-schema#3 for a discussion on serializing a more complete snapshot of XGBoost.
Let me try to keep the threads together. Copied from https://github.com/hcho3/xgboost-json-schema/issues/3 .
Before I start working on it, please consider deprecating the current way we save the model. Per the specified schema, the JSON file is a UTF-8 version of the current binary format. Can we open up the possibility of introducing a schema that matches the complete XGBoost itself, rather than the old binary format?
The // comments in the JSON snippets below are for demonstration purposes and should not show up in an actual model file. Let's take the Learner class in the draft as an example:
{
  "learner_model_param" : LearnerModelParam,
  "predictor_param" : PredictorParam,
  "name_obj" : string,
  "name_gbm" : string,
  "gbm" : GBM,
  "attributes" : StringKeyValuePairCollection,
  "eval_metrics" : [ array of string ],
  "count_poisson_max_delta_step" : floating-point
}
Here the draft specifies that we save predictor_param and count_poisson_max_delta_step, which don't belong to Learner itself. Instead, I propose we save something like this for Learner:
{
  // This belongs to learner, hence handled by learner
  "LearnerTrainParam": { LearnerTrainParam },
  // No `LearnerModelParameter`, we don't need it since JSON can save complete model.
  "predictor" : "gpu_predictor",
  "gpu_predictor" : {
    "GPUPredictionParam": { ... }
  },
  "gbm" : "gbtree",
  "gbtree" : { GBM },
  // This can also be an array, I won't argue which one is better.
  "eval_metrics" : {
    "merror": {...},
    "mae": {...}
  }
}
Update: Actually predictor should be handled inside gbm, but let's keep it here for the sake of demonstration. For the actual IO of GPUPredictionParam, we leave it to gpu_predictor. The same goes for GBM. As for how to do that, we can implement something like this in the Learner class:
void Learner::Load(KVStore& kv_store) {
  std::string predictor_name = kv_store["predictor"].ToString();  // say "gpu_predictor" or "cpu_predictor"
  auto p_predictor = Predictor::Create(predictor_name);
  p_predictor->Load(kv_store[predictor_name]);
  // (.. other io ...)
  KVStore& eval_metrics = kv_store["eval_metrics"];
  std::vector<Metric> metrics(eval_metrics.ToMap().size());
  for (auto& m : metrics) {
    m.Load(eval_metrics);  // each metric looks itself up by name
  }
}
Inside Metric, let's say mae:
void Load(KVStore& kv_store) {
  KVStore self = kv_store["mae"];  // Look up itself by name.
  // load parameters from `self`
  // load other stuff if needed.
}
The reasons I want to do it this way are:

1. extra_attributes: that was a remedy for not being able to add new fields. Now we are able to do so.
2. Learner doesn't get bloated.
3. Configuration: inside XGBoost, functions like Configure and Init are just another way of loading the class itself from parameters.

The most important one is (2).
As pointed out by @hcho3, we should draft a list of what goes into the final dump file. I will start working on it once these points are approved by the participants. I used split_evaluator as an example in https://github.com/dmlc/xgboost/issues/3980#issuecomment-445525072. It's possible that we (@RAMitchell) ~~remove~~ replace it with something simpler, due to it not being compatible with GPU. So we should still have a schema, just a slightly more complicated one than the current schema.

@hcho3 @KOLANICH @tqchen @thvasilo @RAMitchell Can you take a look if time allows?
@trivialfis I cast my vote +1 for your proposal. I think it is possible to use dmlc::JSONReader directly, without the intermediate KVStore structure, as long as we use a schema in a form similar to my current proposal. I will upload a trial implementation once I'm done getting ready for the holidays.
@hcho3 Let me take a shot at drafting the schema. :)
While there are many good ideas here, I don't really get the primary motivation for all of this:
Unfortunately, the method has one significant shortcoming: it is not possible to add new fields without breaking backward compatibility... Current XGBoost codebase does not save the version number. So there is no way to query the version of the serialized file.
Why is there no way? E.g., some distinct header bytes pattern may be inserted at the beginning (e.g., just the characters "xgboost") which would help to distinguish a new format (that would in addition have a saved version number) from the old legacy format that has no recorded version.
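For concreteness, the suggestion amounts to something like the sketch below (the magic string and helper names are hypothetical, not an agreed-upon format):

#include <dmlc/io.h>
#include <cstring>

const char kMagic[] = "xgboost";  // hypothetical marker written before the payload

void SaveWithHeader(dmlc::Stream* fo, int major_version, int minor_version) {
  fo->Write(kMagic, sizeof(kMagic));
  fo->Write(&major_version, sizeof(major_version));
  fo->Write(&minor_version, sizeof(minor_version));
  // ... followed by the model payload ...
}

bool IsNewFormat(dmlc::Stream* fi) {
  char buf[sizeof(kMagic)] = {0};
  size_t nread = fi->Read(buf, sizeof(kMagic));
  // Legacy files carry no such marker, so the bytes here will not match.
  return nread == sizeof(kMagic) && std::memcmp(buf, kMagic, sizeof(kMagic)) == 0;
}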
@khotilov
some distinct header bytes pattern may be inserted at the beginning
Yes. There are other things too. For example, I want to free the Learner class from doing all of the configuration. And there is the endianness issue (#4072). Besides, JSON is supposed to make everything "easier".
I don't know if some downstream projects that consume the model could chime in here, about any issues they might have with the binary models, or potential benefits from the JSON format. For example @hcho3 wearing his Treelite hat, or SHAP by @slundberg.
These projects support importing models from multiple tree learners like LightGBM so perhaps they have a better view of the differences between the approaches.
One comment I should toss into this discussion is to be careful about making sure the model saved in JSON format is actually the same as what is in memory once reloaded (the exact same bits). I recently started loading the binary version of the model for SHAP instead of the current JSON dump functionality because they behaved differently due to precision issues that came up from parsing the JSON dump (@hcho3 has already looked at this in #4060).
@thvasilo I would love to hear feedback.
@slundberg Yes. I made a proposal in https://github.com/hcho3/xgboost-json-schema/pull/5. It should be able to represent the internal model as closely as possible.
@slundberg I wrote a small C++ program to show that 32-bit floating point values can be stored as text without loss:
#include <cmath>
#include <cstdio>
#include <cstdint>
#include <chrono>
#include <iostream>
#include <sstream>
#include <limits>
#include <string>
#include <iomanip>
#include <stdexcept>
#include <fmt/format.h>

void test_round_trip(float x) {
  std::string buf = fmt::format("{:.{}g}", x, std::numeric_limits<double>::max_digits10);
  float y = static_cast<float>(std::stod(buf));
  if (x != y) {
    std::ostringstream oss;
    union {
      float f;
      uint32_t u;
    } f2u;
    oss << "Round trip failed to preserve a number. ";
    f2u.f = x;
    oss << std::hex << f2u.u << " vs ";
    f2u.f = y;
    oss << std::hex << f2u.u;
    throw std::runtime_error(oss.str());
  }
}

int main(void) {
  auto tstart = std::chrono::high_resolution_clock::now();
  float x = std::numeric_limits<float>::lowest();
  const float max = std::numeric_limits<float>::max();
  uint64_t count = 0;
  while (x < max) {
    test_round_trip(x);
    ++count;
    x = std::nextafter(x, max);
  }
  auto tend = std::chrono::high_resolution_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(tend - tstart).count() << " ms elapsed" << std::endl;
  std::cout << "Count: " << count << std::endl;
  return 0;
}
It took 55 minutes to finish on a c5.2xlarge instance.
@hcho3 that's great! Then my concern is satisfied (I am now curious what caused the mismatch I found from the JSON output, but can't go back and debug that right now). Overall having a nice readable serialization of the model sounds nice.
@slundberg I revised the snippet so that we don't have to convert float into double when serializing. This improves performance by 27% (55 min -> 43 min).
#include <cmath>
#include <cstdio>
#include <cstdint>
#include <cerrno>
#include <chrono>
#include <iostream>
#include <sstream>
#include <limits>
#include <string>
#include <cstring>
#include <iomanip>
#include <stdexcept>
#include <fmt/format.h>

void test_round_trip(float x) {
  std::string buf = fmt::format("{:.{}g}", x, std::numeric_limits<float>::max_digits10);
  errno = 0;  // clear any stale error before calling strtof
  float y = std::strtof(buf.c_str(), NULL);
  if (errno == ERANGE) {
    const float huge_val = std::numeric_limits<float>::infinity();
    if (x > -huge_val && x < huge_val) {
      // Ignore ERANGE for denormals.
      // strtof may set ERANGE when the result is subnormal,
      // but we want to preserve denormals too
      errno = 0;
    } else if (errno) {
      std::cout << std::strerror(errno) << std::endl;
      throw std::invalid_argument(fmt::format("strtof failed. x = {}", buf));
    }
  }
  if (x != y) {
    std::ostringstream oss;
    union {
      float f;
      uint32_t u;
    } f2u;
    oss << "Round trip failed to preserve a number. ";
    f2u.f = x;
    oss << std::hex << f2u.u << " vs ";
    f2u.f = y;
    oss << std::hex << f2u.u;
    throw std::runtime_error(oss.str());
  }
}

int main(void) {
  auto tstart = std::chrono::high_resolution_clock::now();
  float x = std::numeric_limits<float>::lowest();
  const float max = std::numeric_limits<float>::max();
  uint64_t count = 0;
  while (x < max) {
    test_round_trip(x);
    ++count;
    x = std::nextafter(x, max);
  }
  auto tend = std::chrono::high_resolution_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(tend - tstart).count() << " ms elapsed" << std::endl;
  std::cout << "Count: " << count << std::endl;
  return 0;
}
Hi, is this functionality available in Python now? Thanks.
@LeZhengThu No, this thread is just a proposal. It will be a while until this feature is implemented.
@hcho3 Thanks. Hope this feature can go live soon. I like the idea very much.
@hcho3 The last time we talked about JSON, it was suggested that dmlc::Json* should be used instead of a 3rd-party implementation or a self-designed JSON implementation. And I remember that you had some thoughts about how to use dmlc::Json*?
One issue that might arise with JSON serialization is the treatment of serialized float values. Presumably xgboost will be aware that all decimal values in the JSON should be cast as floats; however I believe most typical JSON parsers will treat the values as doubles, leading to discrepancies for downstream users who don't implement a post-parsing step to cast doubles to floats in the resulting object.
As @hcho3 shows, it is possible to do the round trip using max_digits10, which guarantees that the float will be reproducible from the decimal text; but all parsed decimal values must then be cast to float for the resulting object to represent the state of the model correctly. That is the step that is not guaranteed by a simple JSON parse.

As @thvasilo points out, any downstream users like SHAP by @slundberg and Treelite will need to be aware of this treatment and implement a post-parsing step to cast any doubles to floats (and then treat them as floats in any further calculations, i.e. expf instead of exp, etc.).
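To illustrate the point with a minimal sketch (the AsModelValue helper is made up): a generic JSON parser hands back a double, and the consumer must narrow it to float before using it, so that it matches the 32-bit value XGBoost used internally.

#include <cassert>

// Hypothetical post-parsing step: narrow a parsed double back to the 32-bit
// float mandated by the schema.
float AsModelValue(double parsed) {
  return static_cast<float>(parsed);
}

int main() {
  // "0.300000012" is the float 0.3f printed with float max_digits10;
  // a double-based JSON parser will hand back the nearest double instead.
  double parsed_threshold = 0.300000012;
  float threshold = AsModelValue(parsed_threshold);
  assert(threshold == 0.3f);  // narrowing recovers the exact float XGBoost used
  return 0;
}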
As this is my first comment in this issue, I hope it is helpful!
@ras44 The JSON schema clearly says that all floating-point values are 32-bit: https://xgboost-json-schema.readthedocs.io/en/latest/#notations
@hcho3 Thanks for your note. My main point is that if the JSON dump is to be used by anything other than xgboost (which will presumably need to implement a custom JSON parser that encodes/decodes floats according to the schema), the user will have to consider multiple aspects in order to produce the same results as xgboost. It appears this is already happening in issues such as #3960, #4060, #4097, #4429. I am providing PR https://github.com/dmlc/xgboost/pull/4439 explaining some of the issues and considerations, in case it is of use for people who aim to work with a JSON dump.
@hcho3 I will go ahead and create an initial draft implementation based on the schema. I will try to use dmlc::JSON, but will add abstractions wherever needed.
Thanks for the update. Sorry I haven't had a chance to tackle the implementation yet. Let me know if you need any help or feedback.
I want to revisit some of this discussion around using external dependencies as our JSON engine. After some discussion with @trivialfis and looking at #5046, it seems clear to me that a simple, partial JSON implementation is possible in dmlc-core or xgboost, but not one that is feature-complete or has good performance. We are looking at thousands of lines of code for a custom implementation with good enough performance.
Achieving better performance than the naive implementation may turn out to be critical as our distributed implementations serialize during training.
I am normally strongly against dependencies, but in this case using a header-only JSON dependency may solve more problems than it creates.
@hcho3 @CodingCat @chenqin @tqchen
I'm inclined to use a customized JSON writer/reader as long as it is well tested; introducing a dependency outside our control could be a liability, with many unused features. Having JSON could also help simplify the rabit recovery protocol: a failed worker could recover from a flexibly sized key/value JSON payload from an adjacent host.
using a header-only JSON dependency
nlohmann/JSON is not especially performant.
@KOLANICH Indeed, it has a nice interface, but neither its memory usage nor its computation is efficient.
Closing, as experimental support with a schema is now mainlined. Further improvements will come as separate PRs.