In this RFC document, I propose that the XGBoost project eventually migrate to JSON for serializing decision tree models.
Special thanks to @trivialfis for initially coming up with the idea of using JSON.
The current method of serialization used in XGBoost is a binary dump of model parameters. For instance, the decision tree class (RegTree) is serialized as follows:
/* Adapted from include/xgboost/tree_model.h; simplified for presentation */
void Save(dmlc::Stream* fo) {
  // Save tree-wide parameters
  fo->Write(&param, sizeof(TreeParam));
  // Save tree nodes
  fo->Write(nodes_.data(), sizeof(Node) * nodes_.size());
  // Save per-node statistics
  fo->Write(stats_.data(), sizeof(NodeStat) * stats_.size());
  // Save optional leaf vector
  if (param.size_leaf_vector != 0) {
    fo->Write(leaf_vector_);
  }
}
void Load(dmlc::Stream* fi) {
  // Load tree-wide parameters
  fi->Read(&param, sizeof(TreeParam));
  nodes_.resize(param.num_nodes);
  stats_.resize(param.num_nodes);
  // Load tree nodes
  fi->Read(nodes_.data(), sizeof(Node) * nodes_.size());
  // Load per-node statistics
  fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
  // Load optional leaf vector
  if (param.size_leaf_vector != 0) {
    fi->Read(&leaf_vector_);
  }
}
The binary serialization method has a few benefits: it is straightforward to implement, has low memory overhead (since we can read from and write to file streams directly), is fast to run (no pre-processing required), and produces compact model binaries.
Unfortunately, the method has one significant shortcoming: it is not possible to add new fields without breaking backward compatibility. Backward compatibility refers to the ability on the part of a new version of XGBoost to read model files produced by older versions of XGBoost. To see why we can't add new fields, let's refer to the decision tree class example shown above. The current NodeStat class has four fields:
/* NodeStat: Class to store statistics for a node in a decision tree
 * Version 1 */
struct NodeStat {
  // loss change caused by current split
  float loss_chg;
  // sum of hessian values, used to measure coverage of data
  float sum_hess;
  // weight of current node
  float base_weight;
  // number of leaf nodes among the children
  int leaf_child_cnt;
};
The total byte size of NodeStat is 16 bytes. Now imagine a fictitious scenario where we want to add a new field called instance_cnt. This field would store the number of instances (data points) that are associated with the node:
/* NodeStat: Class to store statistics for a node in a decision tree
 * Version 2 */
struct NodeStat {
  // loss change caused by current split
  float loss_chg;
  // sum of hessian values, used to measure coverage of data
  float sum_hess;
  // weight of current node
  float base_weight;
  // number of leaf nodes among the children
  int leaf_child_cnt;
  // NEW: number of instances associated with the node
  int64_t instance_cnt;
};
Note that the new version of NodeStat is now 24 bytes. We have already broken backward compatibility: when the latest version of XGBoost runs the snippet
// Load per-node statistics
fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
it will attempt to read 24 * M bytes, where M is the number of nodes in the decision tree. However, if the saved model was serialized by an old version of XGBoost, there would be only 16 * M bytes to read in the serialized file! The program will either crash or exhibit some kind of undefined behavior.
What would be the work-around? We can add extra logic in the snippet above to check which version of XGBoost produced the serialized file:
if ( /* the model was generated with the old version */ ) {
  for (size_t i = 0; i < stats_.size(); ++i) {
    NodeStatOld tmp;  // Version 1 of NodeStat
    fi->Read(&tmp, sizeof(NodeStatOld));
    stats_[i].loss_chg = tmp.loss_chg;
    stats_[i].sum_hess = tmp.sum_hess;
    stats_[i].base_weight = tmp.base_weight;
    stats_[i].leaf_child_cnt = tmp.leaf_child_cnt;
    // instance count is missing
    stats_[i].instance_cnt = -1;
  }
} else {  /* the model was generated with the new version */
  fi->Read(stats_.data(), sizeof(NodeStat) * stats_.size());
}
That's a lot of lines for reading a single C++ vector. In general, the extra logic for backward compatibility accumulates over time and inflicts a maintenance burden on future contributors. To make matters worse, the current XGBoost codebase does not save the version number, so there is no way to query the version of the serialized file. The proposed work-around is therefore not only messy but also not actually feasible.
JSON (JavaScript Object Notation) is a light-weight serialization format built on two fundamental structures: JSON objects (sets of key-value pairs) and JSON arrays (ordered lists of values). In this section, I will argue that it is possible to use JSON as a future-proof method for ensuring backward and forward compatibility, defined as follows:

- Backward compatibility: a new version of XGBoost can read model files produced by older versions.
- Forward compatibility: an old version of XGBoost can read model files produced by newer versions.

We will make use of two useful characteristics of JSON objects:

- Missing keys can be reliably detected and handled.
- Extra, unrecognized keys can simply be ignored.
Let's return to the NodeStat example from the previous section. Version 1 of NodeStat can be expressed as a JSON object as follows:
{
  "loss_chg": -1.5,
  "sum_hess": 12.4,
  "base_weight": 1.0,
  "leaf_child_cnt": 2
}
(The values are made up as an example.) Version 2 would be something like
{
  "loss_chg": -1.5,
  "sum_hess": 12.4,
  "base_weight": 1.0,
  "leaf_child_cnt": 2,
  "instance_cnt": 3230
}
Let's first check whether backward compatibility holds. The latest version of XGBoost will attempt to read the keys loss_chg, sum_hess, base_weight, leaf_child_cnt, and instance_cnt from the given JSON file. If the JSON file was produced by an old version, it will only have loss_chg, sum_hess, base_weight, and leaf_child_cnt. The JSON parser can reliably detect that instance_cnt is missing. This is much better than having the program crash upon encountering missing bytes, as the binary parser would do. Upon detecting missing keys, we can mark the corresponding values as missing. For the NodeStat example, we'd put in -1 to indicate the missing value, since -1 cannot be a valid instance count. For optional fields such as instance_cnt, this response is sufficient.
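To make the missing-key handling concrete, here is a minimal sketch (not the actual XGBoost implementation; the LoadNodeStat helper and the struct defaults are assumptions) of how a reader built on dmlc::JSONReader could fill a NodeStat while tolerating an absent instance_cnt:

#include <dmlc/json.h>
#include <cstdint>
#include <string>

struct NodeStat {
  float loss_chg{0.0f};
  float sum_hess{0.0f};
  float base_weight{0.0f};
  int leaf_child_cnt{0};
  int64_t instance_cnt{-1};  // -1 marks "missing" when reading an old model file
};

// Hypothetical helper: read one NodeStat object, key by key.
void LoadNodeStat(dmlc::JSONReader* reader, NodeStat* out) {
  std::string key;
  reader->BeginObject();
  while (reader->NextObjectItem(&key)) {
    if (key == "loss_chg") {
      reader->Read(&out->loss_chg);
    } else if (key == "sum_hess") {
      reader->Read(&out->sum_hess);
    } else if (key == "base_weight") {
      reader->Read(&out->base_weight);
    } else if (key == "leaf_child_cnt") {
      reader->Read(&out->leaf_child_cnt);
    } else if (key == "instance_cnt") {
      reader->Read(&out->instance_cnt);
    }
    // A production parser would also need to skip unrecognized keys here,
    // which is what gives us forward compatibility.
  }
  // Keys never seen (e.g. "instance_cnt" in an old file) keep the
  // "missing" defaults set in the struct definition above.
}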
On the other hand, some fields are not optional. For instance, a brand-new feature could be added that makes use of new key-value pairs. In this case, we will need to throw an error whenever the user attempts to use that particular feature. We are still doing better than before, however, since the JSON parser does not have to perform this kind of error handling: it simply marks missing key-value pairs as missing, and the later parts of the codebase that use the missing pairs may choose to report errors. We are providing what is known as graceful degradation, where users can bring old model files and still use basic functionality.
How about forward compatibility? A JSON file produced by a later version of XGBoost will contain the extra key instance_cnt, which can simply be ignored by the older version. So forward compatibility holds as well.
Freed from future compatibility issues, developers will be able to focus on more interesting development work.
Every JSON file produced by XGBoost should have major and minor version fields at the root level:
{
  "major_version": 1,
  "minor_version": 0,
  ... other key-value pairs ...
}
All versions that share the same major version are compatible with each other. For example, a program using format version 1.1 can read a model with any version of the form 1.x. The minor version should be increased whenever a new field is added to a JSON object.
The major version should be kept to 1, unless a major revision occurs that breaks backward and/or forward compatibility. So version 2.x will not be compatible with 1.x.
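As an illustration only (the function name and error text here are hypothetical, not part of the proposal), a loader could enforce this rule with a check like the following:

#include <sstream>
#include <stdexcept>

// Major version this build of XGBoost understands (per the proposal, kept at 1
// until a breaking revision occurs).
constexpr int kSupportedMajorVersion = 1;

void CheckFormatVersion(int major_version, int minor_version) {
  if (major_version != kSupportedMajorVersion) {
    std::ostringstream oss;
    oss << "Model format version " << major_version << "." << minor_version
        << " is not compatible with this build, which understands major version "
        << kSupportedMajorVersion;
    throw std::runtime_error(oss.str());
  }
  // Differing minor versions are fine: extra keys are ignored and
  // missing keys are treated as missing values.
}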
See Appendix A for the full schema of Version 1.0.
Compatibility is the largest motivator for me to propose JSON as the next-generation serialization format. However, there are other benefits too. For example, each parameter object (dmlc::Parameter) is represented neatly as a JSON object, since parameter objects are just collections of key-value pairs. We are now free to add more fields to DMLC parameter objects.

There are many excellent 3rd-party JSON libraries for C++, e.g. Tencent/RapidJSON. However, we will eschew 3rd-party libraries. As @tqchen pointed out, extra dependencies make it more difficult to port XGBoost to various platforms and targets. In addition, most JSON libraries assume free-form, schema-less objects, where JSON objects and arrays may nest each other arbitrarily (thanks @KOLANICH and @tqchen for pointing this out). This assumption adds unnecessary overhead, whereas our proposed JSON format for XGBoost has a regular schema.
Fortunately, the DMLC-Core repository comes with a built-in JSON parser and serializer. What's more, the DMLC JSON parser (dmlc::JSONReader) lets us make assumptions about how JSON objects and arrays are nested. For instance, if we expect a JSON object containing a single key-value pair whose key is foo and whose value is an array of integers, we'd write:
dmlc::JSONReader reader(&input_stream);
std::string key;
reader.BeginObject();                  // Assume: the top-level value is a JSON object, not an array
assert(reader.NextObjectItem(&key));   // Assume: there is at least one key
assert(key == "foo");                  // Assume: the key is "foo"
reader.BeginArray();                   // Assume: the value for "foo" is an array
std::vector<int> foo_values;
while (reader.NextArrayItem()) {
  int val;
  reader.Read(&val);                   // Assume: each element in the array is an integer
  foo_values.push_back(val);
}
assert(!reader.NextObjectItem(&key));  // Assume: there is only one key
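For symmetry, here is a sketch of writing the same structure with the DMLC JSON serializer (dmlc::JSONWriter), under the assumption that its BeginObject / WriteObjectKeyValue / EndObject interface behaves as in DMLC-Core:

#include <dmlc/json.h>
#include <sstream>
#include <string>
#include <vector>

std::string WriteFoo(const std::vector<int>& foo_values) {
  std::ostringstream os;
  dmlc::JSONWriter writer(&os);
  writer.BeginObject();
  // std::vector<int> is serialized as a JSON array of integers.
  writer.WriteObjectKeyValue("foo", foo_values);
  writer.EndObject();
  return os.str();  // e.g. {"foo": [1, 2, 3]}
}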
One tricky part is that the built-in helper class for reading structures will throw errors upon encountering missing or extra key-value pairs. I will write a custom helper class as part of the draft implementation of the RFC. The custom helper class will handle missing key-value pairs in a consistent way. Appendix B describes how the missing values are to be represented for each data type.
Q. How about space efficiency? JSON will result in bigger model files than the current binary serialization method.
A. There are a few 3rd-party libraries (e.g. MessagePack) that let us encode JSON into a more compact binary form. We will consider adding a plug-in for generating such a binary encoding, so that 3rd-party dependencies remain optional. There will still be some storage overhead for integers and floats, but the cost is, in my opinion, worthwhile; see the previous sections for the compatibility benefits of JSON.
Q. How about read/write performance? All the text operations will slow things down.
A. I admit that it is hard to beat the performance of binary serialization; how can you do better than simply dumping bit-patterns from memory? We are really making a trade-off here, trading read/write speed for future extensibility. However, we can soften the blow by serializing decision trees in parallel; see Microsoft/LightGBM#1083 for an example.
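As a rough sketch of that idea (the Tree type and ToJSON method below are placeholders, not existing XGBoost APIs): render each tree to its own string in parallel, then concatenate the pieces in order.

#include <string>
#include <vector>

// Placeholder tree type; in reality each RegTree would produce its own JSON.
struct Tree {
  std::string ToJSON() const { return "{}"; }
};

std::string SerializeForest(const std::vector<Tree>& trees) {
  std::vector<std::string> pieces(trees.size());
  // Each tree is independent, so the expensive text conversion parallelizes well.
  // (The pragma is a no-op if the compiler is not built with OpenMP support.)
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < static_cast<int>(trees.size()); ++i) {
    pieces[i] = trees[i].ToJSON();
  }
  std::string out = "[";
  for (size_t i = 0; i < pieces.size(); ++i) {
    out += pieces[i];
    if (i + 1 < pieces.size()) out += ",";
  }
  out += "]";
  return out;
}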
Q. Binary formats are not necessarily closed to future extension. With some care, it is possible to design a new binary format with future extensibility in mind.
A. It is certainly possible to design a binary format to enable future extensions. In fact, had the current version of XGBoost stored the version number in model artifacts, we could have added extra logic for handling multiple formats. However, I still think JSON is superior, for two reasons: 1) designing an extensible binary format takes great precision and care to get right; and 2) extra logic for managing multiple versions in a custom parser can be messy, as we saw in the first section.
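For illustration only (purely hypothetical, not a proposal): an extensible binary layout typically prefixes each section with an identifier and a byte length, so that readers can skip sections they do not recognize.

#include <dmlc/io.h>
#include <cstdint>
#include <vector>

// One length-prefixed section: readers that don't recognize `id` can skip
// `length` bytes and continue with the next section.
void WriteSection(dmlc::Stream* fo, uint32_t id, const std::vector<char>& payload) {
  uint64_t length = payload.size();
  fo->Write(&id, sizeof(id));
  fo->Write(&length, sizeof(length));
  fo->Write(payload.data(), payload.size());
}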
The full schema can be accessed at https://xgboost-json-schema.readthedocs.io/en/latest/. The source is hosted at https://github.com/hcho3/xgboost-json-schema.
""
nan
std::numeric_limits<int_type>::max()
null
(null object)I have a few more problems need to address, mostly related to parameters. Introducing JSON is kind of part of the plan. It won't have any conflict with current proposal(nice job). Will add proper comments as soon as possible. :) And it's really nice to have a organized RFC, it's much better than my original attempt.
(e.g. MessagePack)
... and CBOR
@hcho3 can you please elaborate a bit more on how we store the tree structure, and give a minimal example?
@tqchen I'll put up the full schema soon, and as part of that will describe how tree structures will be stored.
migrate to JSON for serializing decision tree models.
Linear models are also supported; they are actually much easier to handle than tree models. And I'm keen on adding new, interesting algorithms.
The RFC explicitly addresses adding new parameters. But in reality we might also deprecate parameters in the future, due to deprecated algorithm features, duplicated parameters across components (n_gpus and gpu_id come to mind), or broken features disabled temporarily. Luckily, handling removed parameters is the reverse of adding new parameters, which is addressed in this RFC.

Reading an old model with a newer version of XGBoost: extra parameters that were removed in the newer version but are present in the old model file can simply be ignored. This corresponds to the forward-compatibility case of adding new parameters.

Reading a new model with an older version of XGBoost: this corresponds to the backward-compatibility case of adding new parameters. Here the removed parameters, which are not present in the model file, show up as missing values. Since most parameters have default values, parameters not present in the model file (missing) can also simply be ignored.
| | Forward | Backward |
|:------:|:-----------------------------------------------------:|:---------------------------------------------:|
| Add | Extra keys can simply be ignored by the older version. | Use default values to handle missing values, or degrade gracefully. |
| Remove | Use default values to handle missing values, or degrade gracefully. | Extra keys can be ignored by the newer version. |
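Since the "use default values" cells in the table above rely on dmlc::Parameter defaults, here is a small sketch (the ExampleParam struct is hypothetical) of how keys absent from a model file fall back to their declared defaults:

#include <dmlc/parameter.h>
#include <string>
#include <utility>
#include <vector>

// Hypothetical parameter struct: fields absent from the model file simply
// keep their declared defaults.
struct ExampleParam : public dmlc::Parameter<ExampleParam> {
  int max_depth;
  float learning_rate;
  DMLC_DECLARE_PARAMETER(ExampleParam) {
    DMLC_DECLARE_FIELD(max_depth).set_default(6);
    DMLC_DECLARE_FIELD(learning_rate).set_default(0.3f);
  }
};

DMLC_REGISTER_PARAMETER(ExampleParam);

int main() {
  ExampleParam param;
  // Only one key is present (as if read from an old model file);
  // the other falls back to its default of 0.3.
  std::vector<std::pair<std::string, std::string>> kwargs = {{"max_depth", "8"}};
  param.InitAllowUnknown(kwargs);
  return 0;
}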
Most of the classes (in C++) in XGBoost have their own parameters, for example objectives and split-evaluators. Currently these parameters are not saved in a unified way and, as a matter of fact, are rarely saved at all. My hope is that we add an IO interface to these classes and recurse over all of them when saving/loading XGBoost. For example, the learner class can call the Save/Load methods of the updater, then the updater calls Save/Load on its split-evaluators, and so on. Once finished and the call stack returns, we will have a complete snapshot of XGBoost's state.
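A minimal sketch of that recursion, using hypothetical interfaces (these are not the current XGBoost classes): each component saves its own parameters and then delegates to its children, so the call stack mirrors the object graph.

#include <memory>
#include <vector>

class JSONWriter;  // stand-in for whichever JSON writer we settle on

class SplitEvaluator {
 public:
  virtual ~SplitEvaluator() = default;
  virtual void Save(JSONWriter* writer) const = 0;  // each evaluator knows its own fields
};

class TreeUpdater {
 public:
  void Save(JSONWriter* writer) const {
    // ... save the updater's own parameters first ...
    for (const auto& ev : split_evaluators_) {
      ev->Save(writer);  // delegate to each split-evaluator
    }
  }
 private:
  std::vector<std::unique_ptr<SplitEvaluator>> split_evaluators_;
};

class Learner {
 public:
  void Save(JSONWriter* writer) const {
    // ... save learner-level parameters first ...
    updater_->Save(writer);  // recurse into the updater, and so on down the graph
  }
 private:
  std::unique_ptr<TreeUpdater> updater_;
};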
This way we might actually need a schema-less representation. For example, MonotonicConstraint is one of the split-evaluators, and it contains array (std::vector) parameters; and we have a list of split-evaluators. It's not yet clear to me how to achieve this with dmlc::JSON* without a lot of boilerplate code. Please note that every added helper class is a fragment of code that's hard to integrate.
This is a future plan. If there's a typo or an unused parameter in the user-specified model, XGBoost simply ignores it, which might lead to degraded performance or unexpected behavior. @RAMitchell and I want to add extra checks for wrong/unused parameters. It's not yet clear to me how to achieve this, but the basic idea is to make each component responsible for checking/marking its own unique parameters, and let a manager class (maybe the learner) check for unused (not marked by any other component) parameters after all components are done with their configuration. Maybe this logic can be incorporated into handling missing/extra parameters during model IO?
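A rough sketch of that mark-then-sweep idea, using hypothetical types and function names (not the actual XGBoost configuration code):

#include <iostream>
#include <map>
#include <set>
#include <string>

using ParamMap = std::map<std::string, std::string>;

// Each component marks the keys it actually consumed and returns them.
std::set<std::string> ConfigureTreeParams(const ParamMap& params) {
  std::set<std::string> used;
  auto it = params.find("max_depth");
  if (it != params.end()) {
    // ... apply the parameter to the component ...
    used.insert(it->first);
  }
  return used;
}

// A manager class (e.g. the learner) sweeps for keys that nobody claimed.
void WarnUnusedParameters(const ParamMap& params, const std::set<std::string>& used) {
  for (const auto& kv : params) {
    if (used.count(kv.first) == 0) {
      std::cerr << "Warning: parameter '" << kv.first << "' was not used by any component\n";
    }
  }
}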
@hcho3 Feel free to let me know if I can help in any way.
@trivialfis I like the idea of a complete snapshot, having performed a messy hack to preserve GPU predictor parameters (#3856). That said, we want to do it in a way that minimizes boilerplate code. I will re-read your comment when I complete the draft of the 1.0 schema. Mainly, I'm documenting what's currently being saved.
@trivialfis Also, I used to be wary of saving more things into the model, because of the future maintenance liability caused by binary serialization. I may be more comfortable with saving everything now that JSON serialization could forestall compatibility headaches.
IMHO we should make XGBoost itself save the hyperparameters used to create and evaluate the model (with the possibility to disable that), and the column names and types, keeping their order (with the possibility to disable that and/or to discard the stored names and use integers, starting from 0, without reloading the model). Column names are needed in order to reshape a pandas DataFrame so that the columns have the same order the model was trained on (currently I store this info in separate files). But I still insist on keeping the data strictly schemed; no arbitrary stuff should be saved there, since that would result in bloated models.
@KOLANICH I'm not so sure about that, since column names and types are external to the core C++ codebase. So far, we've had the Python / R / Java wrappers save this information.
@trivialfis @RAMitchell What do you think?
@KOLANICH @hcho3
and column names and types, keeping their order
It's a viable suggestion that we manage all of this in C++; it may lead to cleaner data management. But that's a topic for another day, since we have to first implement this management code and then implement the IO. Adding the feature later won't bring too many hassles, since we will have very good backward compatibility, as explained by @hcho3. A new issue might be desirable once this gets sorted out.
no arbitrary stuff should be saved there, since it would cause bloat.
You can suggest which specific items should not be saved into the model file once we have an initial draft (I imagine there will be a long reviewing process; you are welcome to join); otherwise it's hard to argue what goes into the category "arbitrary stuff".
@trivialfis @tqchen I've uploaded the full schema as a PDF attachment.
I may want to put up the schema doc in a GitHub repo, so that we can update the LaTeX source upon every schema update and have it automatically compiled into PDF.
@trivialfis The PDF doc didn't address your suggestion. Can you take a look and see how the complete snapshot idea can be implemented?
@hcho3 If you put the doc in a GitHub repo I can make PRs. Then you can comment on my changes. :)
Let us just use markdown for documenting schema, it should be in docs eventually
otherwise it's hard to argue what goes into the category "arbitrary stuff".
I really meant arbitrary stuff there, for example the weather on Venus or a full multi-GiB dataset. If we allowed devs to do such things effortlessly, they would, and this would result in heavily bloated model files.
Let us just use markdown for documenting schema, it should be in docs eventually
Why not use JSON Schema? It is a machine-readable spec, it can be automatically validated, and it can be rendered into documentation, including a nice interactive one.
@KOLANICH
I really meant arbitrary stuff there. For example weather on Venus or full multi-GiB dataset.
You can't really prevent users from saving arbitrary things in their JSON files. What we can do is enumerate which fields are recognized by XGBoost. My schema document does this. Stuff that's not recognized by the schema will be ignored.
Why not to use JSONSchema?
I'm not sure how we could support the StringKeyValuePairCollection class, which allows arbitrary key-value pairs.
For now, I'll be typing up the schema doc in RST.
You can't really prevent users from saving arbitrary things in their JSON files.
Yes, we cannot. But we can discourage users from storing arbitrary data by not providing an API for adding and editing it. Then, if someone wants to do it, they have to parse the resulting blob, add the data manually, and serialize it back, so it may be easier for them to store it in a separate file.
@tqchen I've typeset the schema in RST: https://xgboost-json-schema.readthedocs.io/en/latest/
@trivialfis You can submit a pull request at https://github.com/hcho3/xgboost-json-schema
Update: See hcho3/xgboost-json-schema#3 for a discussion on serializing a more complete snapshot of XGBoost.
Let me try to keep the threads together. Copied from https://github.com/hcho3/xgboost-json-schema/issues/3 .
Before I start working on it, please consider deprecating the current way we save the model. Per the specified schema, the JSON file is a UTF-8 version of the current binary format. Can we open up the possibility of introducing a schema that matches the complete XGBoost itself, rather than the old binary format?
The // comments in the JSON snippets below are for demonstration purposes and should not show up in an actual model file. Let's take the Learner class in the draft as an example:
{
  "learner_model_param" : LearnerModelParam,
  "predictor_param" : PredictorParam,
  "name_obj" : string,
  "name_gbm" : string,
  "gbm" : GBM,
  "attributes" : StringKeyValuePairCollection,
  "eval_metrics" : [ array of string ],
  "count_poisson_max_delta_step" : floating-point
}
Here the draft specifies that we save predictor_param and count_poisson_max_delta_step, which don't belong to Learner itself. Instead, I propose we save something like this for Learner:
{
  // This belongs to learner, hence handled by learner
  "LearnerTrainParam": { LearnerTrainParam },
  // No `LearnerModelParameter`, we don't need it since JSON can save complete model.
  "predictor" : "gpu_predictor",
  "gpu_predictor" : {
    "GPUPredictionParam": { ... }
  },
  "gbm" : "gbtree",
  "gbtree" : { GBM },
  // This can also be an array, I won't argue which one is better.
  "eval_metrics" : {
    "merror": {...},
    "mae": {...}
  }
}
Update: Actually predictor should be handled inside gbm, but let's keep it here for the sake of demonstration. For the actual IO of GPUPredictionParam, we leave it to gpu_predictor. The same goes for GBM. As for how to do that, we can implement something like this in the Learner class:
void Learner::Load(KVStore& kv_store) {
  std::string predictor_name = kv_store["predictor"].ToString();  // say "gpu_predictor" or "cpu_predictor"
  auto p_predictor = Predictor::Create(predictor_name);
  p_predictor->Load(kv_store[predictor_name]);
  // (.. other io ...)
  KVStore& eval_metrics = kv_store["eval_metrics"];
  std::vector<Metric> metrics(eval_metrics.ToMap().size());
  for (auto& m : metrics) {
    m.Load(eval_metrics);  // each metric looks itself up by name
  }
}
Inside Metric, let's say mae:
void Load(KVStore& kv_store) {
  KVStore self = kv_store["mae"];  // Look up itself by name.
  // load parameters from `self`
  // load other stuff if needed.
}
The reasons I want to do it this way are:

1. extra_attributes: that was a remedy for not being able to add new fields. Now we are able to do so.
2. Learner doesn't get bloated.
3. Configuration: inside XGBoost, functions like Configure and Init are just another way of loading the class itself from parameters.

The most important one is (2).
As pointed out by @hcho3, we should draft a list of what goes into the final dump file. I will start working on it once these points are approved by the participants. I used split_evaluator as an example in https://github.com/dmlc/xgboost/issues/3980#issuecomment-445525072. It's possible that we (@RAMitchell) ~~remove~~ replace it with something simpler, due to it not being compatible with GPU. So we should still have a schema, just a slightly more complicated one than the current schema.

@hcho3 @KOLANICH @tqchen @thvasilo @RAMitchell Can you take a look if time allows?
@trivialfis I cast my vote +1 for your proposal. I think it is possible to use dmlc::JSONReader directly, without the intermediate KVStore structure, as long as we use a schema in a form similar to my current proposal. I will upload a trial implementation once I'm done getting ready for the holidays.
@hcho3 Let me take a shot at drafting the schema. :)
While there are many good ideas here, I don't really get the primary motivation for all of this:
Unfortunately, the method has one significant shortcoming: it is not possible to add new fields without breaking backward compatibility... Current XGBoost codebase does not save the version number. So there is no way to query the version of the serialized file.
Why is there no way? E.g., some distinct header bytes pattern may be inserted at the beginning (e.g., just the characters "xgboost") which would help to distinguish a new format (that would in addition have a saved version number) from the old legacy format that has no recorded version.
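For concreteness, the suggestion amounts to something like the sketch below (the magic string and helper names are hypothetical, not an agreed-upon format):

#include <dmlc/io.h>
#include <cstring>

const char kMagic[] = "xgboost";  // hypothetical marker written before the payload

void SaveWithHeader(dmlc::Stream* fo, int major_version, int minor_version) {
  fo->Write(kMagic, sizeof(kMagic));
  fo->Write(&major_version, sizeof(major_version));
  fo->Write(&minor_version, sizeof(minor_version));
  // ... followed by the model payload ...
}

bool IsNewFormat(dmlc::Stream* fi) {
  char buf[sizeof(kMagic)] = {0};
  size_t nread = fi->Read(buf, sizeof(kMagic));
  // Legacy files carry no such marker, so the bytes here will not match.
  return nread == sizeof(kMagic) && std::memcmp(buf, kMagic, sizeof(kMagic)) == 0;
}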
@khotilov
some distinct header bytes pattern may be inserted at the beginning
Yes. There are other things too. For example, I want to free the Learner class from doing all of the configuration. And there is the endianness issue (#4072). Besides, JSON is supposed to make everything "easier".
I don't know if some downstream projects that consume the model could chime in here, about any issues they might have with the binary models, or potential benefits from the JSON format. For example @hcho3 wearing his Treelite hat, or SHAP by @slundberg.
These projects support importing models from multiple tree learners like LightGBM so perhaps they have a better view of the differences between the approaches.
One comment I should toss into this discussion is to be careful about making sure the model saved in JSON format is actually the same as what is in memory once reloaded (the exact same bits). I recently started loading the binary version of the model for SHAP instead of the current JSON dump functionality because they behaved differently due to precision issues that came up from parsing the JSON dump (@hcho3 has already looked at this in #4060).
@thvasilo I would love to hear feedback.
@slundberg Yes. I made a proposal in https://github.com/hcho3/xgboost-json-schema/pull/5. It should be able to represent the internal model as closely as possible.
@slundberg I wrote a small C++ program to show that 32-bit floating point values can be stored as text without loss:
#include <cmath>
#include <cstdio>
#include <cstdint>
#include <chrono>
#include <iostream>
#include <sstream>
#include <limits>
#include <string>
#include <iomanip>
#include <stdexcept>
#include <fmt/format.h>

void test_round_trip(float x) {
  std::string buf = fmt::format("{:.{}g}", x, std::numeric_limits<double>::max_digits10);
  float y = static_cast<float>(std::stod(buf));
  if (x != y) {
    std::ostringstream oss;
    union {
      float f;
      uint32_t u;
    } f2u;
    oss << "Round trip failed to preserve a number. ";
    f2u.f = x;
    oss << std::hex << f2u.u << " vs ";
    f2u.f = y;
    oss << std::hex << f2u.u;
    throw std::runtime_error(oss.str());
  }
}

int main(void) {
  auto tstart = std::chrono::high_resolution_clock::now();
  float x = std::numeric_limits<float>::lowest();
  const float max = std::numeric_limits<float>::max();
  uint64_t count = 0;
  while (x < max) {
    test_round_trip(x);
    ++count;
    x = std::nextafter(x, max);
  }
  auto tend = std::chrono::high_resolution_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(tend - tstart).count() << " ms elapsed" << std::endl;
  std::cout << "Count: " << count << std::endl;
  return 0;
}
It took 55 minutes to finish on a c5.2xlarge instance.
@hcho3 that's great! Then my concern is satisfied (I am now curious what caused the mismatch I found from the JSON output, but can't go back and debug that right now). Overall having a nice readable serialization of the model sounds nice.
@slundberg I revised the snippet so that we don't have to convert float into double when serializing. This improves performance by 27% (55 min -> 43 min).
#include <cmath>
#include <cstdio>
#include <cstdint>
#include <cerrno>
#include <chrono>
#include <iostream>
#include <sstream>
#include <limits>
#include <string>
#include <cstring>
#include <iomanip>
#include <stdexcept>
#include <fmt/format.h>

void test_round_trip(float x) {
  std::string buf = fmt::format("{:.{}g}", x, std::numeric_limits<float>::max_digits10);
  errno = 0;  // clear any stale error before calling strtof
  float y = std::strtof(buf.c_str(), NULL);
  if (errno == ERANGE) {
    const float huge_val = std::numeric_limits<float>::infinity();
    if (x > -huge_val && x < huge_val) {
      // Ignore ERANGE for denormals.
      // strtof may set ERANGE when the result is subnormal,
      // but we want to preserve denormals too
      errno = 0;
    } else if (errno) {
      std::cout << std::strerror(errno) << std::endl;
      throw std::invalid_argument(fmt::format("strtof failed. x = {}", buf));
    }
  }
  if (x != y) {
    std::ostringstream oss;
    union {
      float f;
      uint32_t u;
    } f2u;
    oss << "Round trip failed to preserve a number. ";
    f2u.f = x;
    oss << std::hex << f2u.u << " vs ";
    f2u.f = y;
    oss << std::hex << f2u.u;
    throw std::runtime_error(oss.str());
  }
}

int main(void) {
  auto tstart = std::chrono::high_resolution_clock::now();
  float x = std::numeric_limits<float>::lowest();
  const float max = std::numeric_limits<float>::max();
  uint64_t count = 0;
  while (x < max) {
    test_round_trip(x);
    ++count;
    x = std::nextafter(x, max);
  }
  auto tend = std::chrono::high_resolution_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(tend - tstart).count() << " ms elapsed" << std::endl;
  std::cout << "Count: " << count << std::endl;
  return 0;
}
Hi, is this functionality available in Python now? Thanks.
@LeZhengThu No, this thread is just a proposal. It will be a while until this feature is implemented.
@hcho3 Thanks. Hope this feature can go live soon. I like the idea very much.
@hcho3 The last time we talked about JSON, it was suggested that dmlc::Json* should be used instead of a 3rd-party implementation or a self-designed JSON implementation. And I remember that you had some thoughts about how to use dmlc::Json*?
One issue that might arise with JSON serialization is the treatment of serialized float values. Presumably xgboost will be aware that all decimal values in the JSON should be cast as floats; however I believe most typical JSON parsers will treat the values as doubles, leading to discrepancies for downstream users who don't implement a post-parsing step to cast doubles to floats in the resulting object.
As @hcho3 shows, it is possible to do the round trip using max_digits10, which guarantees that the float will be reproducible from the decimal text; but all parsed decimal values must then be cast to float for the resulting object to represent the state of the model correctly. That is the step that is not guaranteed by a simple JSON parse.

As @thvasilo points out, any downstream users like SHAP by @slundberg and Treelite will need to be aware of this treatment and implement a post-parsing step to cast any doubles to floats (and then treat them as floats in any further calculations, i.e. expf instead of exp, etc.).
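To illustrate the point with a minimal sketch (the AsModelValue helper is made up): a generic JSON parser hands back a double, and the consumer must narrow it to float before using it, so that it matches the 32-bit value XGBoost used internally.

#include <cassert>

// Hypothetical post-parsing step: narrow a parsed double back to the 32-bit
// float mandated by the schema.
float AsModelValue(double parsed) {
  return static_cast<float>(parsed);
}

int main() {
  // "0.300000012" is the float 0.3f printed with float max_digits10;
  // a double-based JSON parser will hand back the nearest double instead.
  double parsed_threshold = 0.300000012;
  float threshold = AsModelValue(parsed_threshold);
  assert(threshold == 0.3f);  // narrowing recovers the exact float XGBoost used
  return 0;
}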
As this is my first comment in this issue, I hope it is helpful!
@ras44 The JSON schema clearly says that all floating-point values are 32-bit: https://xgboost-json-schema.readthedocs.io/en/latest/#notations
@hcho3 Thanks for your note. My main point is that if the JSON dump is to be used by anything other than xgboost (which will presumably need to implement a custom JSON parser that encodes/decodes floats according to the schema), the user will have to consider multiple aspects in order to produce the same results as xgboost. It appears this is already happening in issues such as #3960, #4060, #4097, #4429. I am providing PR https://github.com/dmlc/xgboost/pull/4439 explaining some of the issues and considerations, in case it is of use for people who aim to work with a JSON dump.
@hcho3 I will go ahead and create an initial draft implementation based on the schema. I will try to use dmlc::JSON, but will add abstractions wherever needed.
Thanks for the update. Sorry I haven't had a chance to tackle the implementation yet. Let me know if you need any help or feedback.
I want to revisit some of this discussion around using external dependencies as our JSON engine. After some discussion with @trivialfis and looking at #5046, it seems clear to me that a simple, partial JSON implementation is possible in dmlc-core or xgboost, but not one that is feature-complete or has good performance. We are looking at thousands of lines of code for a custom implementation with good enough performance.
Achieving better performance than the naive implementation may turn out to be critical as our distributed implementations serialize during training.
I am normally strongly against dependencies, but in this case using a header-only JSON dependency may solve more problems than it creates.
@hcho3 @CodingCat @chenqin @tqchen
I'm inclined to use a customized JSON writer/reader as long as it is well tested; introducing a dependency outside our control could be a liability, with many unused features. Having JSON could also help simplify the rabit recovery protocol: a failed worker could recover from a flexibly sized key/value JSON payload from an adjacent host.
using a header-only JSON dependency
nlohmann/JSON is not especially performant.
@KOLANICH Indeed, it has a nice interface, but neither its memory usage nor its computation is efficient.
Closing, as experimental support with a schema is now mainlined. Further improvements will come as separate PRs.