Hi, @RAMitchell and @trivialfis
I just came across the blog post by @Laurae2 (hope I didn't misattribute it):
https://medium.com/data-design/xgboost-gpu-performance-on-low-end-gpu-vs-high-end-cpu-a7bc5fcd425b
The blog post confirms that we have a super-fast GPU algorithm, but it also raises two issues:

- the GPU implementation is subject to crashes with high-dimensional feature vectors;
- there are reproducibility issues with the results of the GPU hist algorithm.
Would you please share some insights about (1) whether these are known issues (or whether they really exist), and (2) the reasons for these issues (if they do exist)?
Thanks
Nan
@CodingCat
For the first one, my guess is a memory limitation. For the second one, there is an open issue: #3921
That's a very detailed benchmark. Huge thanks, @Laurae2. I plan to address these problems one by one in the future.
How does the GPU updater obtain random numbers? Does it use the same mechanism as the CPU updater?
@trivialfis Thanks for the response.

For the first one, I see @Laurae2 pointed out that adding more features has a 5-15% higher weight than adding samples to the dataset. Is that also related to the parallelization mechanism in the GPU implementation?
> Does it use the same mechanism as the CPU updater?

For feature sampling, it uses `ColumnSampler` from `/common/random.h`, so it should be the same.
> adding more features has a 5-15% higher weight than adding samples to your dataset

The GPU implementation doesn't use the CSR format; ELLPACK is chosen instead. So it's not surprising.
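To illustrate why the storage format makes adding features more expensive (this is a generic sketch of CSR vs. ELLPACK, not xgboost's actual internals, which operate on quantized histogram bins): ELLPACK pads every row to the width of the longest row, so memory scales with the number of features per row even when most rows are sparse, while CSR stores only the entries that are actually present.

```python
# Toy sparse matrix: 3 rows with varying numbers of (column, value) entries.
rows = [
    [(0, 1.0), (2, 3.0)],           # row 0: 2 entries
    [(1, 2.0)],                     # row 1: 1 entry
    [(0, 5.0), (1, 6.0), (3, 7.0)], # row 2: 3 entries
]

# CSR stores exactly nnz values + nnz column indices + (n_rows + 1) row offsets.
nnz = sum(len(r) for r in rows)
csr_cells = 2 * nnz + len(rows) + 1

# ELLPACK pads every row to the widest row: n_rows * max_row_len values plus
# the same number of column indices, regardless of how sparse other rows are.
max_len = max(len(r) for r in rows)
ellpack_cells = 2 * len(rows) * max_len

print(csr_cells, ellpack_cells)  # 16 18
```

The trade-off is that ELLPACK's fixed row width gives coalesced, branch-free memory access on the GPU, at the cost of padding overhead that grows as rows get wider, i.e., as features are added.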
@Laurae2 Thanks for the useful feedback!
In summary, here are the things we need to improve:
@RAMitchell Do you also know why xgboost on GPU crashes when using too large a depth, even when GPU RAM is still available?
Not sure. If you have a reproducible example, that would be greatly appreciated.
@RAMitchell It seems not reproducible on newer commits of xgboost.

The following used to crash on a GPU with 4 GB of RAM, but no longer does:
```r
library(xgboost)
set.seed(1)
N <- 10000000
p <- 100
pp <- 25
X <- matrix(runif(N * p), ncol = p)
betas <- 2 * runif(pp) - 1
sel <- sort(sample(p, pp))
m <- X[, sel] %*% betas - 1 + rnorm(N)
y <- rbinom(N, 1, plogis(m))
format(object.size(X), units = "Mb")
dtrain <- xgboost::xgb.DMatrix(X, label = y)
gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(list(objective = "binary:logistic", nthread = 1, eta = 0.10,
                        max_depth = 13, max_bin = 255, tree_method = "hist"),
                   dtrain, nrounds = 3, verbose = 1, watchlist = list(train = dtrain))
rm(dtrain, model)
gc(verbose = FALSE)
```
However, the following still hangs on 2 GPUs when using `nthread = 1`:
```r
library(xgboost)
set.seed(1)
N <- 1000
p <- 50
pp <- 25
X <- matrix(runif(N * p), ncol = p)
betas <- 2 * runif(pp) - 1
sel <- sort(sample(p, pp))
m <- X[, sel] %*% betas - 1 + rnorm(N)
y <- rbinom(N, 1, plogis(m))
format(object.size(X), units = "Mb")
dtrain <- xgboost::xgb.DMatrix(X, label = y)
gc(verbose = FALSE)
set.seed(11111)
model <- xgb.train(list(objective = "binary:logistic", nthread = 1, eta = 0.10,
                        max_depth = 5, max_bin = 255, tree_method = "gpu_hist", n_gpus = 2),
                   dtrain, nrounds = 3, verbose = 1, watchlist = list(train = dtrain))
rm(dtrain, model)
gc(verbose = FALSE)
```
Closing this for now, as the main purpose of filing the issue (raising awareness of the blog post and gathering insights into the issues mentioned there) has been achieved, and there is ongoing work to fix them.