As we add different types of algorithms to xgboost, and as the performance of these algorithms improves, the DMatrix class may also need to evolve in order to improve performance and memory usage for these particular cases.
Current DMatrix
The current DMatrix class has two subclasses. The first is the primary in-memory representation, where the entire dataset is ingested and converted into CSR format with values stored as 32-bit floats. The entire dataset may also be transposed in memory into a column format for exact-style algorithms.
The second is the external-memory representation, which constructs a number of binary page files (32MB in size by default) that are streamed from disk. If a column format is requested, new transposed pages are generated and these are also streamed from disk.
The end user does not directly choose the type of DMatrix, and these subclasses are not exposed via external APIs. The type of underlying DMatrix is automatically selected according to whether the user supplies a cache prefix in the constructor.
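For illustration, a minimal sketch using the public C API (file names are placeholders and error checking is omitted): supplying a cache prefix after '#' in the URI selects the external-memory implementation, otherwise the in-memory one is built.

#include <xgboost/c_api.h>

int main() {
  // In-memory DMatrix: the whole file is parsed into an in-memory CSR representation.
  DMatrixHandle dmat_in_memory;
  XGDMatrixCreateFromFile("train.libsvm", 1, &dmat_in_memory);

  // External-memory DMatrix: the '#dtrain.cache' cache prefix switches the
  // implementation to binary page files streamed from disk.
  DMatrixHandle dmat_external;
  XGDMatrixCreateFromFile("train.libsvm#dtrain.cache", 1, &dmat_external);

  XGDMatrixFree(dmat_in_memory);
  XGDMatrixFree(dmat_external);
  return 0;
}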
Currently a batch of data is stored inside the DMatrix as a 'SparsePage' object, which uses HostDeviceVector to expose the data to both the CPU and the GPU.
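Conceptually, a SparsePage is just a CSR slice of the data; a simplified sketch of the layout (not the actual xgboost classes) looks like this:

#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified sketch of one CSR batch. The real SparsePage keeps both arrays
// in HostDeviceVector so the same buffers can be accessed from CPU and GPU.
struct Entry {
  uint32_t index;  // feature (column) index
  float fvalue;    // feature value stored as a 32-bit float
};

struct SimpleSparsePage {
  std::vector<size_t> offset;  // row i spans data[offset[i], offset[i+1])
  std::vector<Entry> data;     // concatenated non-zero entries of all rows
};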
Current fast histogram algorithms
The current fast histogram algorithms (hist, gpu_hist) convert the DMatrix object on the fly inside their respective TreeUpdater implementations, not as part of the DMatrix implementation, although I believe it was eventually intended to be integrated with DMatrix (#1673). 'hist' converts the data into a CSR matrix with integers instead of floats. 'gpu_hist' converts the data into an ELLPACK format with integers instead of floats and additionally applies bitwise compression at run time to both the matrix elements and the indices, commonly resulting in 4-5x compression over the standard CSR matrix. In gpu_hist we avoid ever copying the training DMatrix to the GPU if prediction caching is used.
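As a rough illustration of the quantisation and ELLPACK layout described above (hypothetical code, not the gpu_hist implementation):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map a float feature value to the index of the first quantile cut that is
// not smaller than it; with a small number of cuts per feature the result
// fits in 8 or 16 bits instead of the 32 bits of the raw float value.
uint32_t SearchBin(const std::vector<float>& cuts, float fvalue) {
  return static_cast<uint32_t>(
      std::lower_bound(cuts.begin(), cuts.end(), fvalue) - cuts.begin());
}

// ELLPACK layout: every row is padded to the same length (row_stride), so the
// whole matrix is one dense array of bin indices; a further bit-packing pass
// can then store each index with only as many bits as the bin count requires.
struct EllpackSketch {
  size_t row_stride = 0;          // maximum number of entries per row
  std::vector<uint32_t> bin_idx;  // size = n_rows * row_stride, padded
};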
Some desirable features for DMatrix
Here I will list some features that we would like to have. Not all of these will be practical under the same design.
Notes on possible implementation
I realise the above is somewhat rambling and does not propose a concrete solution, but I hope to start discussion on these ideas.
@tqchen @hcho3 @trivialfis @CodingCat
On the silent performance penalty: there's always LOG(WARNING).
On the downside of introducing complicated state: I would suggest implementing lazy initialization. Even if we were to provide an explicit interface for specifying the type of DMatrix, ideally we would still need to add a check for users; I doubt that anyone would first try to understand what this jargon means before using the library, so there would be lots of mistakes. To implement this check, those states are necessary. The state can be encoded in the Updater, with the following code built into Learner:
// Collect the preferred data format from each configured updater, then ask
// the DMatrix to (lazily) build those representations.
std::vector<DMatrixType> types;
for (auto const& up : updaters) {
  types.emplace_back(up->PreferredFormat());
}
dmat->BuildData(types);
In an updater, say gpu_hist:
DMatrixType PreferredFormat() const { return DMatrixType::Quantile; }
And lastly, in DMatrix::BuildData, it's just dispatching on the requested types.
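A rough sketch of what that dispatch could look like (hypothetical enumerators and helper names such as BuildColumnPages and BuildQuantilePages, not existing xgboost code):

#include <vector>

// Hypothetical enumerators and helpers; not existing xgboost code.
enum class DMatrixType { CSR, Column, Quantile };

class DMatrix {
 public:
  void BuildData(const std::vector<DMatrixType>& types) {
    for (auto type : types) {
      switch (type) {
        case DMatrixType::CSR:       // default representation, nothing to do
          break;
        case DMatrixType::Column:    // requested by exact-style updaters
          BuildColumnPages();        // placeholder
          break;
        case DMatrixType::Quantile:  // requested by hist / gpu_hist
          BuildQuantilePages();      // placeholder
          break;
      }
    }
  }

 private:
  void BuildColumnPages() {}    // placeholder: transpose into column format
  void BuildQuantilePages() {}  // placeholder: sketch quantiles and bin data
};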
On 'DMatrix as a thin wrapper over an already-allocated data matrix': I echo this, as I am facing several scenarios that really need low-latency prediction. Making DMatrix a wrapper, instead of actually allocating memory to store the data, would reduce the overhead involved in the whole code path.
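To make the wrapper idea concrete, here is a purely hypothetical sketch (not the current xgboost API) of a non-owning view that a DMatrix-like wrapper could iterate during prediction instead of copying the caller's data into its own SparsePage storage:

#include <cstddef>

// Hypothetical zero-copy view over a caller-owned, row-major dense matrix.
class DenseMatrixView {
 public:
  DenseMatrixView(const float* data, size_t n_rows, size_t n_cols)
      : data_(data), n_rows_(n_rows), n_cols_(n_cols) {}

  // No ownership, no allocation: element access goes straight to the
  // external buffer supplied by the caller.
  float operator()(size_t row, size_t col) const {
    return data_[row * n_cols_ + col];
  }
  size_t Rows() const { return n_rows_; }
  size_t Cols() const { return n_cols_; }

 private:
  const float* data_;
  size_t n_rows_;
  size_t n_cols_;
};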
There could be a significant performance penalty here from virtual function lookup.
Isn't it very hard to eliminate virtual functions in this case, either in the iterator layer or in the lower memory-format layer? Do you have other suggestions?
What happens if these iterators are called but the underlying data is stored as a float
I would vote for stopping training in this case; silently doing something for the user (even with a warning that no one will read) seems kind of risky to me.
Thanks for your feedback. I think the action on this is to produce a prototype demonstrating DMatrix as a thin wrapper over an existing data structure, and possibly to explore the idea of lazy DMatrix construction; then we will re-evaluate.
It may be some time before I get to this.
Hi @hcho3 Could you post some references for feature grouping when time allows? For example: the column_matrix implementation, the algorithm itself, or somewhere else? Any other related issues, such as difficulties you encountered before, would also help.
Are you working on it? ;-)
@trivialfis Feature grouping is a heuristic to cut down the number of feature columns in a sparse, high-dimensional dataset. It reduces memory cost and improves performance on multi-core CPUs (by improving work balance among worker threads). Addressing your questions:
Reference for grouping algorithm: See the LightGBM paper. The term they use is Exclusive Feature Bundling (EFB). The description given is as follows:
Usually in real applications, although there are a large number of features, the feature space is quite sparse, which provides us a possibility of designing a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include the one-hot features (e.g., one-hot word representation in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (by taking features as vertices and adding edges for every two features if they are not mutually exclusive), and solving it by a greedy algorithm with a constant approximation ratio.
Bottleneck in #2326: It's both time and space (as in memory consumption). Some parts of the initialization logic are single-threaded, so they become a bottleneck when the number of threads is high. I think @SmirnovEgorRu's pull request (#4310) addresses this. Memory consumption remains an open problem, however: feature grouping improves performance (better work balance among OpenMP threads) at the cost of increased memory consumption. See the discussion in #2543.
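For reference, a very small sketch of the greedy bundling step described in the quoted paragraph above (hypothetical code; LightGBM's actual EFB also allows a bounded number of conflicts per bundle):

#include <cstddef>
#include <vector>

// Greedily assign features to bundles so that no two features in a bundle
// conflict (i.e. take nonzero values in the same row). conflicts[i][j] is
// true if features i and j are not mutually exclusive. This is the first-fit
// greedy colouring of the conflict graph.
std::vector<std::vector<int>> GreedyBundle(
    const std::vector<std::vector<bool>>& conflicts) {
  const size_t n_features = conflicts.size();
  std::vector<std::vector<int>> bundles;
  for (size_t f = 0; f < n_features; ++f) {
    bool placed = false;
    for (auto& bundle : bundles) {
      bool ok = true;
      for (int other : bundle) {
        if (conflicts[f][other]) { ok = false; break; }
      }
      if (ok) { bundle.push_back(static_cast<int>(f)); placed = true; break; }
    }
    if (!placed) bundles.push_back({static_cast<int>(f)});
  }
  return bundles;
}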
I haven't had a chance to look at this for a while. Can we come back to this after merging #4433?
Can we come back to this after merging #4433?
Absolutely. Thanks for the information.
Just a note on ongoing work for refactoring:
We currently have a flow of data that looks like this:
External data (e.g. numpy 2d array) -> DMatrix CSR -> DMatrix other (column or quantile matrix)
It has been proposed that we do the following instead, to save memory in the intermediate step:
External data -> DMatrix other
A consequence of this is that we must implement combinatorially many constructors, one for every possible pair of external data format and internal representation, and there are a lot of combinations.
In the first flow, the CSR DMatrix works as an intermediate format, so we only need one constructor per external data format and one per internal representation.
This doesn't make the second flow impossible to implement, but it is something to consider. Maybe it's possible to use iterators in the middle step to generalise data copying without extra storage?
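One way to avoid the combinatorial explosion, sketched below with hypothetical names, is to put a common streaming adapter interface between the external data and the internal representations: each external format then needs exactly one adapter, and each internal representation needs exactly one constructor that consumes any adapter.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical streaming element and adapter interface. Each external format
// (numpy array, scipy CSR, cuDF columns, ...) implements exactly one adapter;
// each internal representation implements exactly one constructor that
// consumes any adapter, so the total cost is #formats + #representations
// rather than #formats * #representations.
struct COOEntry {
  size_t row;
  uint32_t column;
  float value;
};

class DataAdapter {
 public:
  virtual ~DataAdapter() = default;
  virtual bool Next(COOEntry* out) = 0;  // stream one entry at a time
};

// Example adapter over a caller-owned dense row-major buffer.
class DenseAdapter : public DataAdapter {
 public:
  DenseAdapter(const float* data, size_t rows, size_t cols)
      : data_(data), rows_(rows), cols_(cols) {}
  bool Next(COOEntry* out) override {
    if (pos_ >= rows_ * cols_) return false;
    *out = {pos_ / cols_, static_cast<uint32_t>(pos_ % cols_), data_[pos_]};
    ++pos_;
    return true;
  }
 private:
  const float* data_;
  size_t rows_, cols_, pos_ = 0;
};

// Example internal representation built from any adapter.
class SimplePage {
 public:
  explicit SimplePage(DataAdapter* adapter) {
    COOEntry e;
    while (adapter->Next(&e)) {
      values_.push_back(e.value);  // a real page would bin/compress here
    }
  }
 private:
  std::vector<float> values_;
};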
@rongou @sriramch @trivialfis
Another thing I'm worried about is changing the updater during an experiment.
My goal is to make external memory work for GPUs. I think we likely need something along the lines of https://arxiv.org/abs/1901.09047, but tackling it directly may be too risky/disruptive to the existing code base, hence the need for refactoring.
Possible next steps, in no particular order: add a new page class (EllpackPage?) that can hold the binned/compressed features, which basically means moving quantile sketch/compression out of the tree updater.
@rongou Thanks for the link. The paper looks quite interesting; let me read it.
Just to add one thing to the list: is it possible to turn the field qid_ into a local variable? It was added by #2749; glancing through the diff, it doesn't seem to be used by any algorithm.
@trivialfis Looks like the qids_ are used by the ranking objectives.
Groups are used by ranking; qid seems to be intermediate storage that is used to construct the groups.
Added here as a convenient reference: we need to consider removing the strong reference between the booster and the DMatrix, see #4482.
Probably this line (and its effects): https://github.com/dmlc/xgboost/blob/master/src/learner.cc#L646
Hmm, sorry, currently I can't look into that. Putting it here as a reminder.
See #5044 for a proof of concept on re-factoring DMatrix construction from external data.
@RAMitchell Do you have a roadmap or checklist for this?