Xgboost: [RFC] Possible DMatrix refactor

Created on 10 Apr 2019 · 18 Comments · Source: dmlc/xgboost

As we add different types of algorithms to xgboost, and as the performance of these algorithms improves, the DMatrix class may also need to evolve in order to improve performance and memory usage for these particular cases.

Current DMatrix
The current DMatrix class has two subclasses. The first is the primary in-memory representation, where the entire dataset is ingested and converted into a CSR format with values stored as 32-bit floats. The entire dataset may be transposed in memory into a column format for exact-style algorithms.
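For reference, here is a minimal sketch of the CSR layout; this is illustrative only, not the actual in-memory class (SparsePage), which stores (column index, value) entries plus row offsets in HostDeviceVector:

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative CSR container, not the real SparsePage class.
struct CsrMatrix {
  std::vector<std::size_t> row_ptr;    // length = n_rows + 1
  std::vector<std::uint32_t> col_idx;  // length = number of stored entries
  std::vector<float> values;           // 32-bit floats, one per stored entry
};

// The dense matrix
//   [ 1.0  0.0  2.0 ]
//   [ 0.0  3.0  0.0 ]
// becomes:
//   row_ptr = {0, 2, 3}
//   col_idx = {0, 2, 1}
//   values  = {1.0f, 2.0f, 3.0f}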

The second is the external memory representation, which constructs a number of binary page files (32 MB each by default) that are streamed from disk. If a column format is requested, new transposed pages are generated and streamed from disk as well.

The end user does not directly choose the type of DMatrix, and these subclasses are not exposed via external APIs. The underlying DMatrix type is selected automatically according to whether the user supplies a cache prefix in the constructor.

Currently a batch of data is stored inside the DMatrix as a 'SparsePage' class. This uses HostDeviceVector in order to expose data to both the CPU and GPU.

Current fast histogram algorithms
The current fast histogram algorithms (hist, gpu_hist) convert the DMatrix object on the fly inside their respective TreeUpdater implementations, not as a part of the DMatrix implementation, although I believe this was eventually intended to be integrated with DMatrix (#1673). 'hist' converts the data into a CSR matrix with integers instead of floats. 'gpu_hist' converts the data into an ELLPACK format with integers instead of floats and additionally applies bitwise compression at run time to both the matrix elements and indices, commonly resulting in 4-5x compression over the standard CSR matrix. In gpu_hist we avoid ever copying the training DMatrix to the GPU if prediction caching is used.
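To illustrate the quantisation step, here is a hypothetical sketch (not the actual hist/gpu_hist code): each float feature value is replaced by the index of the quantile bin it falls into, and it is these small integer indices that make the bitwise compression effective.

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of quantisation: map a raw feature value to the index of
// the quantile bin it falls into. The per-feature cut points come from a
// quantile sketch over the training data.
std::uint32_t BinIndex(float value, const std::vector<float>& cuts) {
  // Index of the first cut point that is >= value.
  auto it = std::lower_bound(cuts.begin(), cuts.end(), value);
  return static_cast<std::uint32_t>(it - cuts.begin());
}

With, say, 256 bins per feature, each element needs at most 8 bits rather than a 32-bit float, which is where most of the 4-5x compression comes from.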

Some desirable features for DMatrix
Here I will list some features that we would like to have. Not all of these will be practical under the same design.

  • The ability to train entirely on a compressed, quantised DMatrix. We could save roughly 4x memory and be much more competitive with LightGBM on memory usage.
  • Support for dense matrices. We would directly halve memory usage by using dense formats where appropriate.
  • DMatrix as a thin wrapper over an already allocated data matrix, e.g. a numpy ndarray. Given a dense numpy input, we currently copy it into DMatrix format, resulting in roughly 3n memory usage; simply using the underlying numpy array as our storage would use 1n.
  • Support for ingest of data from GPU memory (see #3997)
  • Support for external memory for all of the above

Notes on possible implementation

  • Current iterators used to access DMatrix could be made into virtual functions, abstracting away the underlying memory format (see the sketch after this list). There could be a significant performance penalty here from virtual function lookup.
  • The question is how the histogram learning algorithms would then access the underlying integer representation, if it exists. Do we provide another set of iterators specifically for quantised data? What happens if these iterators are called but the underlying data is stored as floats? We can either raise an error and tell the user to construct a different type of DMatrix for this particular learning algorithm, or generate the required representation on the fly with some silent performance penalty.
  • Another possible solution is to have the DMatrix as a completely 'lazy' object that contains a pointer to an external data source but defers constructing an internal representation until it knows the optimal format, based on whatever learning algorithm is chosen. This solution has the downside of introducing complicated state to the DMatrix object.
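To make the first point concrete, here is a rough sketch of what a virtual batch interface might look like; all names here are hypothetical, not existing xgboost classes:

#include <cstddef>
#include <cstdint>

// Hypothetical view of a single row, independent of how the data is actually
// stored (CSR, dense, quantised ELLPACK, ...).
struct RowView {
  const std::uint32_t* col_idx;
  const float* values;
  std::size_t length;
};

// Hypothetical abstract batch: every concrete storage format implements this,
// so learning algorithms can iterate rows without knowing the layout. The cost
// is an indirect (virtual) call on a very hot path.
class RowBatch {
 public:
  virtual ~RowBatch() = default;
  virtual std::size_t Size() const = 0;
  virtual RowView Row(std::size_t i) const = 0;
};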

I realise the above is somewhat rambling and does not propose a concrete solution, but I hope to start discussion on these ideas.

@tqchen @hcho3 @trivialfis @CodingCat

RFC

All 18 comments

silent performance penalty

There's always LOG(WARNING).
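For example, the on-the-fly fallback could at least make the conversion visible; the DMatrix methods below are hypothetical:

// Hypothetical sketch: if an updater asks for the quantised representation and
// the DMatrix only holds floats, build it lazily but warn about the extra cost
// instead of converting silently.
void EnsureQuantised(DMatrix* dmat, int max_bins) {
  if (!dmat->HasQuantisedPages()) {       // hypothetical query
    LOG(WARNING) << "Quantised representation not available; "
                 << "building it on the fly (extra time and memory).";
    dmat->BuildQuantisedPages(max_bins);  // hypothetical builder
  }
}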

downside of introducing complicated state

I would suggest implementing lazy initialization. Even if we were to provide an explicit interface for specifying the type of DMatrix, we would ideally still need a check on behalf of users; I doubt anyone would first try to understand this jargon before using the library, so there would be plenty of mistakes. To implement that check, those states are necessary. The state can be encoded in the Updater, with something like the following built into Learner:

// Collect the data format each configured updater prefers ...
std::vector<DMatrixType> types;
for (auto& up : updaters) {
  types.emplace_back(up->PreferredFormat());
}
// ... then let the DMatrix lazily build the matching representation(s).
dmat->BuildData(types);

In an updater, say gpu_hist:

DMatrixType PreferredFormat() const { return DMatrixType::Quantile; }

And lastly, DMatrix::BuildData just dispatches on the requested types.
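A minimal sketch of that dispatch, reusing the hypothetical names from the snippet above:

// Hypothetical dispatch inside DMatrix: build each representation the chosen
// updaters prefer, exactly once, and skip formats nobody asked for.
void DMatrix::BuildData(const std::vector<DMatrixType>& types) {
  for (auto type : types) {
    switch (type) {
      case DMatrixType::CSR:      this->BuildSparsePages();   break;
      case DMatrixType::Column:   this->BuildColumnPages();   break;
      case DMatrixType::Quantile: this->BuildQuantilePages(); break;
    }
  }
}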

DMatrix as a thin wrapper over an already allocated data matrix.

Echoing this: I am facing several scenarios that really need low-latency prediction, and making DMatrix a wrapper, instead of actually allocating memory to store the data, would reduce the overhead along the whole code path.
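A rough sketch of such a zero-copy wrapper over a dense, row-major buffer; all names are hypothetical, this is not an existing xgboost class:

#include <cstddef>

// Hypothetical zero-copy view over an already allocated dense, row-major
// buffer (e.g. the data pointer of a numpy ndarray). Nothing is copied; the
// caller must keep the buffer alive for the lifetime of the view.
class DenseMatrixView {
 public:
  DenseMatrixView(const float* data, std::size_t n_rows, std::size_t n_cols)
      : data_(data), n_rows_(n_rows), n_cols_(n_cols) {}

  float At(std::size_t row, std::size_t col) const {
    return data_[row * n_cols_ + col];
  }
  std::size_t Rows() const { return n_rows_; }
  std::size_t Cols() const { return n_cols_; }

 private:
  const float* data_;
  std::size_t n_rows_;
  std::size_t n_cols_;
};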

There could be a significant performance penalty here from virtual function look up.

Isn't it very hard to eliminate virtual functions in this case, whether at the iterator layer or at the lower memory-format layer? Do you have other suggestions?

What happens if these iterators are called but the underlying data is stored as a float

I would vote for stopping training in this case; silently doing something on the user's behalf (even with a warning no one will read) seems risky to me.

Thanks for your feedback. I think the action on this is to produce a prototype demonstrating DMatrix as a thin wrapper over an existing data structure and possibly exploring the idea of lazy DMatrix construction, then we will reevaluate.

It may be some time before I get to this.

Hi @hcho3, could you post some references for feature grouping when time allows? Like:

  • Reference for grouping algorithm.
  • What's the bottleneck in #2326 (time, space, or both)?
  • Where does the bottleneck come from: the column_matrix implementation, the algorithm itself, or somewhere else?
  • Any other related issues, like difficulties you encountered before.

  • Are you working on it? ;-)

@trivialfis Feature grouping is a heuristic to cut down the number of feature columns in a sparse, high-dimensional dataset. It reduces memory cost and improves performance on multi-core CPUs (by improving work balance among worker threads). Addressing your questions:

  • Reference for grouping algorithm: See the LightGBM paper. The term they use is Exclusive Feature Bundling (EFB); a toy sketch of the greedy bundling idea appears after this list. The description given is as follows:

    Usually in real applications, although there are a large number of features, the feature space is quite sparse, which provides us a possibility of designing a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include the one-hot features (e.g., one-hot word representation in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (by taking features as vertices and adding edges for every two features if they are not mutually exclusive), and solving it by a greedy algorithm with a constant approximation ratio.

  • Bottleneck in #2326: It's both time and space (as in memory consumption). Some parts of the initialization logic are single-threaded, so they become a bottleneck when the number of threads is high. I think @SmirnovEgorRu's pull request (#4310) addresses this. Memory consumption remains an open problem, however: feature grouping improves performance (better work balance among OpenMP threads) at the cost of increased memory consumption. See the discussion in #2543.

  • I haven't had a chance to look at this for a while. Can we come back to this after merging #4433 ?
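For concreteness, a toy sketch of the greedy bundling idea quoted above; this is purely illustrative (simplified conflict counting, no approximation guarantees), not LightGBM's or xgboost's actual implementation:

#include <cstddef>
#include <vector>

// Two features "conflict" on a row if both are nonzero there. Greedily place
// each feature into the first bundle whose conflicts with it stay under a
// small budget, otherwise start a new bundle.
std::vector<std::vector<int>> GreedyBundle(
    const std::vector<std::vector<char>>& nonzero,  // nonzero[feature][row]
    std::size_t max_conflicts) {
  std::vector<std::vector<int>> bundles;
  for (int f = 0; f < static_cast<int>(nonzero.size()); ++f) {
    bool placed = false;
    for (auto& bundle : bundles) {
      std::size_t conflicts = 0;
      for (int g : bundle) {
        for (std::size_t r = 0; r < nonzero[f].size(); ++r) {
          conflicts += (nonzero[f][r] && nonzero[g][r]) ? 1 : 0;
        }
      }
      if (conflicts <= max_conflicts) {
        bundle.push_back(f);
        placed = true;
        break;
      }
    }
    if (!placed) {
      bundles.push_back({f});
    }
  }
  return bundles;
}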

Can we come back to this after merging #4433 ?

Absolutely. Thanks for the information.

Just a note on ongoing work for refactoring:

We currently have a flow of data that looks like this:
External data (e.g. a numpy 2D array) -> DMatrix CSR -> DMatrix other (column or quantile matrix)

It has been proposed that we do the following to save memory in the intermediate step:
External data -> DMatrix other

A consequence of this is that we must implement a constructor for every possible pair of external data format and internal representation, and there are a lot of combinations.

In the first flow, the CSR DMatrix acts as an intermediate format, so we only need one constructor per external data format and one per internal representation.

This doesn't make the second flow impossible to implement, but it is something to consider. Maybe it's possible to use iterators in the middle step to generalise data copying without extra storage?
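One way to picture the iterator idea: each external source implements a single row/entry adapter, and each internal representation has a single construction path that consumes any adapter, so the number of conversion paths grows as sources + formats rather than sources × formats. A rough sketch, with hypothetical names:

#include <cstddef>
#include <cstdint>

// Hypothetical element produced while iterating over external data.
struct AdapterEntry {
  std::size_t row;
  std::uint32_t column;
  float value;
};

// Hypothetical adapter interface: each external source (numpy array, scipy CSR
// matrix, cuDF columns, ...) implements this once.
class DataAdapter {
 public:
  virtual ~DataAdapter() = default;
  virtual bool Next(AdapterEntry* out) = 0;  // returns false when exhausted
};

// Each internal representation then needs only one construction path that
// consumes any adapter, e.g. (hypothetically):
//   QuantileDMatrix::QuantileDMatrix(DataAdapter* adapter, int max_bins);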

@rongou @sriramch @trivialfis

Another thing I'm worried about is changing the updater during an experiment.

My goal is to make external memory work for GPUs. I think we likely need something along the lines of https://arxiv.org/abs/1901.09047, but tackling it directly may be too risky/disruptive to the existing code base, thus the need for refactoring.

Possible next steps, in no particular order:

  • The existing code handling SparsePage/CSCPage/SortedCSCPage could use some cleanup, especially the external memory portion.
  • Create a new page format (EllpackPage?) that can hold the binned/compressed features. This basically means moving the quantile sketch/compression out of the tree updater.
  • Enable reading/writing EllpackPages from/to disk. To start with, we can stick with the current approach and keep the whole final dataset in GPU memory.
  • Stream in the data during tree construction. This may be just too slow to make sense for GPUs.
  • Implement stratified storage and importance sampling as in https://arxiv.org/abs/1901.09047.

@rongou Thanks for the link. The paper looks quite interesting, let me read it.

Just to add one thing to the list: is it possible to turn the field qid_ into a local variable? It was added by #2749; glancing through the diff, it doesn't seem to be used by any algorithm.

@trivialfis looks like the qids_ are used by ranking objectives.

Groups are used by ranking; qid seems to be intermediate storage used to construct the groups.

Added here as a convenient reference. We need to consider removing the strong reference between the booster and the DMatrix, see #4482.

Sorry, I can't look into that at the moment. Putting it here as a reminder.

See #5044 for a proof of concept on re-factoring DMatrix construction from external data.

@RAMitchell Do you have a roadmap or check lists for this?
