TLDR and ask
I've got a tricky issue that I only see when training on data sets above a certain size. I'm currently unable to provide a clean repro, since the problem only appears once my data exceeds a certain threshold (around 10 billion entries in the feature matrix). I'm asking for help on how to debug this further; for example, since I don't know the code nearly as well as a contributor, perhaps you could suggest a watchpoint I can set to break on and gather more data about the program state at the onset of the instability.
Symptoms:
See my comment here for an example: https://github.com/dmlc/xgboost/pull/4095#issuecomment-468079355
Unreasonably large leaf weights appear in the tree dumps, e.g.:
7:leaf=43151.1016 8:leaf=-0.147368416
Data details
Experiments tried
Setting max_delta_step limits the blow-up, however learning is significantly impaired. Setting max_delta_step is really just capping the potential impact of the instability, and I find that it is still impossible to learn a model that is competitive with what I've seen training the same data set with the single-node CPU exact method.
Sample hyperparameters (I've tried many experiments but all are generally similar to these):
"n_gpus": -1,
"num_class": 3,
"nthread": "96",
"eta": 0.1,
"max_bin": 31,
"objective": "multi:softprob",
"num_boost_round": 100,
"tree_method": "gpu_hist",
"max_depth": 10,
"predictor": "cpu_predictor"
Environment and build info
XGBoost runs in a Docker container (using the nvidia runtime). CUDA and NCCL were installed via these deb packages from nvidia's repository: cuda-toolkit-10-0, libnccl2=2.4.2-1+cuda10.0, libnccl-dev=2.4.2-1+cuda10.0. I'm training on a p3dn.24xlarge instance; a smaller number of GPUs is unable to fit the size of my data. The testxgboost suite and other tests for tree updaters with GPU pass.
Please let me know what other information I can provide that would be useful; any help is appreciated! Thanks all!
-Jeff
Seems like it could be an overflow issue? Can you please post the text dump of the first couple of trees using some small depth?
The smallest depth I have on hand right now is 5. This model's instability was less pronounced but you can see a leaf weight of over 20 present early on (search the file for "leaf=2" from the start):
12th tree from the top (considering a 3-class model, this is the 4th round): bzktf.model.txt
28:[f19939<0.000236599997] yes=57,no=58,missing=58
57:leaf=0.100771725
58:leaf=20.1690426
Hey @jeffdk,
I'm wondering if you observed this with other objective functions as well. Did you try reg:linear?
How many workers do you launch? I'm guessing just one from the parameters.
Have you tried not setting the predictor to CPU?
I believe setting nthread is also unnecessary, as the code will use the system maximum if not set.
Hey @mt-jones, thanks for checking in.
I haven't tried reg:linear yet, but the example from here https://github.com/dmlc/xgboost/pull/4095#issuecomment-468079355 shows I get the issue with binary:logistic in addition to multi:softprob. Good to know about nthread! I'm always just setting it to the number of system cores anyway.
It would be helpful to know if the issue persists with all objective functions, etc. Could you try reg:linear and let me know?
It would also be helpful if we had a small reproducer. Is there a way for you to mock up some kind of data?
@RAMitchell
I have a repro of a run using a depth of 2 where you can see extremely large leaf values in the first few trees (magnitude > 100):
booster[0]:
0:[f8023<24.9200001] yes=1,no=2,missing=1,gain=187926.812,cover=2666666.75
1:[f3945<1.5] yes=3,no=4,missing=3,gain=53575.875,cover=2511679.75
3:leaf=-0.675734818,cover=0
4:leaf=0.620877981,cover=0
2:[f879<-9.99999975e-06] yes=5,no=6,missing=5,gain=94983.2188,cover=154987.109
5:leaf=-0.65048486,cover=0
6:leaf=1.01979089,cover=0
booster[1]:
0:[f12895<0.770099998] yes=1,no=2,missing=1,gain=917343.75,cover=2666666.75
1:[f10619<18] yes=3,no=4,missing=4,gain=198044.25,cover=2339213
3:leaf=0.195236623,cover=0
4:leaf=1.37889802,cover=0
2:[f19999<68.5] yes=5,no=6,missing=5,gain=234804.484,cover=327453.75
5:leaf=-318.269745,cover=0
6:leaf=-0.482374966,cover=0
booster[2]:
0:[f8023<25.2299995] yes=1,no=2,missing=2,gain=492543.938,cover=2666666.75
1:[f8327<-9.99999975e-06] yes=3,no=4,missing=3,gain=51327.4375,cover=198128.438
3:leaf=1.21191311,cover=0
4:leaf=-0.374824882,cover=0
2:[f10619<2] yes=5,no=6,missing=5,gain=207495.062,cover=2468538.25
5:leaf=-0.698607564,cover=0
6:leaf=0.354673982,cover=0
booster[3]:
0:[f879<-9.99999975e-06] yes=1,no=2,missing=1,gain=97287.6719,cover=1241213.88
1:[f2815<0.0188999996] yes=3,no=4,missing=3,gain=32373.4688,cover=1021803.25
3:leaf=-122.004761,cover=0
4:leaf=-0.437717617,cover=0
2:[f19990<6.9829998] yes=5,no=6,missing=6,gain=63918.6016,cover=219410.609
5:leaf=2.32276249,cover=0
6:leaf=0.152193889,cover=0
...
booster[105]:
0:leaf=-nan,cover=232730.797
booster[106]:
0:leaf=-nan,cover=337007.531
booster[107]:
0:leaf=-nan,cover=236074.344
booster[108]:
0:leaf=0,cover=5.99999994e-10
...
Full dump here: vqqhu.model.txt
Edit: Added stats vqqhu.model.wstats.txt
Hyperparameters used for this run:
{
"n_gpus": -1,
"num_class": 3,
"nthread": "96",
"eta": 1.0,
"max_bin": 31,
"objective": "multi:softprob",
"num_boost_round": 100,
"tree_method": "gpu_hist",
"max_depth": 2
}
@mt-jones
I also see the issue with the reg:linear objective (run on the same data, where the labels are [0, 1, 2]).
booster[0]:
0:[f8023<24.9200001] yes=1,no=2,missing=2,gain=208415.25,cover=6000000
1:[f12917<-9.99999975e-06] yes=3,no=4,missing=3,gain=43086.1875,cover=404180
3:leaf=1.32514524,cover=0
4:leaf=-0.00346823013,cover=0
2:[f7703<-9.99999975e-06] yes=5,no=6,missing=5,gain=85603.375,cover=5595820
5:leaf=0.520056725,cover=0
6:leaf=-0.134979904,cover=0
booster[1]:
0:[f10619<4] yes=1,no=2,missing=1,gain=69084.2578,cover=6000000
1:[f19981<-9.99999975e-06] yes=3,no=4,missing=3,gain=62528.5664,cover=5490956
3:leaf=-176.849716,cover=0
4:leaf=-0.0325979777,cover=0
2:[f3<-9.99999975e-06] yes=5,no=6,missing=5,gain=40972.1836,cover=509044
5:leaf=117.21711,cover=0
6:leaf=0.351738989,cover=0
booster[2]:
0:[f19981<-9.99999975e-06] yes=1,no=2,missing=1,gain=69850,cover=6000000
1:[f84<-9.99999975e-06] yes=3,no=4,missing=3,gain=75682.5234,cover=3
3:leaf=215.036545,cover=0
4:leaf=-58.2910652,cover=0
2:[f3<-9.99999975e-06] yes=5,no=6,missing=5,gain=198506.359,cover=5999997
5:leaf=134.321472,cover=0
6:leaf=-0.0141801583,cover=0
...
booster[7]:
0:[f13606<0.00220249989] yes=1,no=2,missing=2,gain=1427965.62,cover=6000000
1:leaf=-844.975464,cover=0
2:[f13606<0.00220249989] yes=3,no=4,missing=4,gain=1427966.62,cover=5999999
3:leaf=-844.975464,cover=0
4:leaf=-6.83024211e-07,cover=0
... <Remainder of trees nearly identical from here out> ...
Full dump: zgdjw.model.txt
Edit: Added stats zgdjw.model.wstats.txt
Hyperparameters:
{
"n_gpus": -1,
"nthread": "96",
"eta": 1.0,
"max_bin": 31,
"objective": "reg:linear",
"num_boost_round": 100,
"tree_method": "gpu_hist",
"max_depth": 2
}
It would also be helpful if we had a small reproducer. Is there a way for you to mock up some kind of data?
I took a stab at mocking some data, trying the simplest thing first: random values in a dense np array loaded into a DMatrix. However, this runs OOM on the p3dn.24xlarge node (750 GB). I'm sure it's possible with some finesse or workarounds (e.g. writing a libsvm format file to disk and reading it back in, as I currently read directly from a libsvm file), and I'll keep working at it. I agree a small repro is king, but I thought I'd get the ball rolling in case contributors had any other ideas. :) tyty
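For concreteness, the sparse route I plan to try next looks roughly like this (just a sketch; the shape and density here are placeholders, not my real data):
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

rng = np.random.RandomState(0)

# Placeholder shape/density -- the real matrix is far larger and very sparse.
n_rows, n_cols, density = 1_000_000, 21_000, 0.001

# Build a random CSR matrix directly so a dense array is never materialized.
X = sp.random(n_rows, n_cols, density=density, format="csr",
              random_state=rng, dtype=np.float32)
y = rng.randint(0, 2, size=n_rows)

# DMatrix accepts scipy CSR, so no dense conversion (and no OOM) is needed.
dtrain = xgb.DMatrix(X, label=y)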
Thanks @jeffdk. When you dump the tree, can you add the option 'with_stats' for a bit more info? It seems like you are using a lot of features. Does the problem occur if you reduce the number of features?
I have two theories:
@RAMitchell
I've edited my above comments to show the splitting statistics. I also did a trial where I truncated the feature space at 1/2 the features, though I increased the number of rows to have a comparable number of entries in the feature matrix. For this case, the instability builds less quickly, but it is still present; see less_feats.model.txt
@mt-jones
I did investigate mocking data using random values in a sparse matrix with columns, rows, and sparsity comparable to the data set on which I see the instability. Unfortunately, I did not see the issue there. My theory is that if a mistake is made due to numerical issues and an incorrect split/weight is computed, it doesn't matter, since there is nothing to learn from purely random data anyway. I could work on some 'learnable' mock data; that will take some more time.
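What I mean by 'learnable' is roughly the following (a sketch; the columns carrying the signal and the noise level are arbitrary choices for illustration, not anything from my real data):
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file

rng = np.random.RandomState(0)
X = sp.random(1_000_000, 21_000, density=0.001, format="csr",
              random_state=rng, dtype=np.float32)

# Plant a weak signal in a couple of arbitrary columns so there is something to learn.
signal = X[:, 0].toarray().ravel() - 0.5 * X[:, 1].toarray().ravel()
y = (signal + 0.1 * rng.randn(X.shape[0]) > 0).astype(int)

# Round-trip through a libsvm file to match how I load the real data.
dump_svmlight_file(X, y, "mock.libsvm", zero_based=True)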
@RAMitchell
I don't think we can rule out either an integer overflow (one guess: perhaps something is off in the indexing of the list of potential splits by gain) or a classic double-precision summation instability. I did explicitly check your theory about the left/right gradient/hessian sums not adding up to the parent sums by debugging in gdb; it doesn't appear to be the case. Here is an example from the model I posted above (less_feats.model.txt) where the statistics do add up:
Thread 1 "python" hit Breakpoint 3, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x227f3c0, candidate=..., p_tree=0x62c0b10) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231candidate.nid, candidate.split.findex,
1: left_stats = {sum_grad = -425.618408203125, sum_hess = 5.188568115234375}
2: left_weight = 68.774940490722656
3: right_stats = {sum_grad = 25.152973175048828, sum_hess = 59.612892150878906}
4: right_weight = -0.41497728228569031
5: parent_sum = {sum_grad = -400.46543502807617, sum_hess = 64.801460266113281}
6: base_weight = 6.085965633392334
(gdb)
Continuing.
Thread 1 "python" hit Breakpoint 3, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x227f3c0, candidate=..., p_tree=0x62c0d50) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231candidate.nid, candidate.split.findex,
1: left_stats = {sum_grad = -2169.78759765625, sum_hess = 452151.53125}
2: left_weight = 0.0047987955622375011
3: right_stats = {sum_grad = 2455.43359375, sum_hess = 55.732139587402344}
4: right_weight = -43.281173706054688
5: parent_sum = {sum_grad = 285.64599609375, sum_hess = 452207.2633895874}
6: base_weight = -0.00063166912877932191
(gdb)
Continuing.
Thread 1 "python" hit Breakpoint 3, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x227f3c0, candidate=..., p_tree=0x62c0d50) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231candidate.nid, candidate.split.findex,
1: left_stats = {sum_grad = 4986.55810546875, sum_hess = 14944.9482421875}
2: left_weight = -0.3336394727230072
3: right_stats = {sum_grad = -7156.345703125, sum_hess = 437206.59375}
4: right_weight = 0.016368301585316658
5: parent_sum = {sum_grad = -2169.78759765625, sum_hess = 452151.5419921875}
Continuing.
Thread 1 "python" hit Breakpoint 3, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x227f3c0, candidate=..., p_tree=0x62c0d50) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231
1231 tree.ExpandNode(candidate.nid, candidate.split.findex,
1: left_stats = {sum_grad = 2407.80615234375, sum_hess = 32.859725952148438}
2: left_weight = -71.111213684082031
3: right_stats = {sum_grad = 47.62744140625, sum_hess = 22.872413635253906}
4: right_weight = -1.9950827360153198
5: parent_sum = {sum_grad = 2455.43359375, sum_hess = 55.732139587402344}
6: base_weight = -43.281173706054688
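As a side note, the numbers above are consistent with the usual leaf-weight formula w = -G / (H + lambda), assuming the default reg_lambda = 1; a quick check:
# Sanity check of the gdb output above, assuming the default reg_lambda = 1.
def leaf_weight(sum_grad, sum_hess, reg_lambda=1.0):
    return -sum_grad / (sum_hess + reg_lambda)

print(leaf_weight(-425.618408203125, 5.188568115234375))    # ~68.775, matches left_weight
print(leaf_weight(25.152973175048828, 59.612892150878906))  # ~-0.41498, matches right_weight
print(leaf_weight(-400.46543502807617, 64.801460266113281)) # ~6.086, matches base_weight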
I also have an example from debugging (original 21k feature set) where I see the instability very quickly, in the second tree; perhaps you can pull a clue out of the state I've captured here?
Thread 1 "python" hit Breakpoint 4, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x393e2f0, candidate=..., p_tree=0x831430f0) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231andidate.nid, candidate.split.findex,
1: left_weight = -122.0047607421875
2: right_weight = -0.43771761655807495
(gdb) p candidate
$17 = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbe0: {nid = 1, depth = 1, split = {loss_chg = 32373.4688, dir = xgboost::tree::kLeftDir, fvalue = 0.0188999996, findex = 2815, left_sum = {grad_ = 267.262177, hess_ = 1.19058812}, right_sum = {grad_ = 447261.188, hess_ = 1021802.06}}, timestamp = 1}
(gdb) p p_tree
$18 = (xgboost::RegTree *) 0x831430f0
(gdb) p *p_tree
$19 = {param = {<dmlc::Parameter<xgboost::TreeParam>> = {<No data fields>}, num_roots = 1, num_nodes = 3, num_deleted = 0, max_depth = 0, num_feature = 21692, size_leaf_vector = 0, reserved = {0 <repeats 31 times>}}, nodes_ = std::vector of length 3, capacity 4 = {{parent_ = -1, cleft_ = 1, cright_ = 2, sindex_ = 2147484527, info_ = {leaf_value = -9.99999975e-06, split_cond = -9.99999975e-06}}, {parent_ = -2147483648, cleft_ = -1, cright_ = 0, sindex_ = 0, info_ = {leaf_value = -0.437978655, split_cond = -0.437978655}}, {parent_ = 0, cleft_ = -1, cright_ = 0, sindex_ = 0, info_ = {leaf_value = 0.29592514, split_cond = 0.29592514}}}, deleted_nodes_ = std::vector of length 0, capacity 0, stats_ = std::vector of length 3, capacity 4 = {{loss_chg = 97287.6719, sum_hess = 1241213.88, base_weight = -0.308245599, leaf_child_cnt = 0}, {loss_chg = 0, sum_hess = 0, base_weight = 0, leaf_child_cnt = 0}, {loss_chg = 0, sum_hess = 0, leaf_child_cnt = 0}}, node_mean_values_ = std::vector of length 0, capacity 0}
The breakpoint is here. Looks like some of the indexes in the data structure I printed are close to min/max int32.
I can't see anything definitive from these dumps. The indices that are close to int32 in this case are not a problem.
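For what it's worth, sindex_ packs the default-direction flag into the top bit of the split feature index (if I'm reading RegTree::Node correctly), so values just above 2^31 are expected:
# Decoding sindex_ = 2147484527 from the dump above, assuming the top bit is the
# default-left flag and the low 31 bits are the split feature index.
sindex = 2147484527
default_left = bool(sindex >> 31)    # True
split_feature = sindex & 0x7FFFFFFF  # 879, matching the f879 split in the dump
print(default_left, split_feature)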
We can definitely see the gradients increasing over boosting iterations. This should basically never happen in gradient boosting. The problem will be pinpointing exactly where the incorrect calculation is occurring. Using the 'reg:linear' objective helps because the hessian is always 1.0 and therefore we can see how many training instances are in each node from 'cover' in the text dumps.
One thing I notice is that in some of the trees the child nodes have higher gain than the root node. This almost never happens in gradient boosting because we tend to greedily select the best splits early in the tree.
booster[2]:
0:[f19981<-9.99999975e-06] yes=1,no=2,missing=1,gain=69850,cover=6000000
1:[f84<-9.99999975e-06] yes=3,no=4,missing=3,gain=75682.5234,cover=3
3:leaf=215.036545,cover=0
4:leaf=-58.2910652,cover=0
2:[f3<-9.99999975e-06] yes=5,no=6,missing=5,gain=198506.359,cover=5999997
5:leaf=134.321472,cover=0
6:leaf=-0.0141801583,cover=0
Whatever statistics were used to calculate the split of node 2 in the above were almost certainly incorrect. It's also very unusual to split with only three instances on the left-hand side. You might be able to find more inconsistencies by drilling down into this particular example.
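Something along these lines (a rough sketch; "train.libsvm" is a placeholder for your training file, and you'd want to double-check 0- vs 1-based feature indexing) would let you compare the raw instance counts for that root split against the reported covers:
import numpy as np
from sklearn.datasets import load_svmlight_file

# Placeholder path; point this at the actual training file.
X, y = load_svmlight_file("train.libsvm")

threshold = -9.99999975e-06
col = X[:, 19981]                # f19981 from booster[2] above
values = col.toarray().ravel()   # missing (unstored) entries show up as 0 here

present = col.nnz                                  # explicitly stored, i.e. non-missing
below = int(np.count_nonzero(values < threshold))  # stored values below the split threshold
print("rows:", X.shape[0], "present:", present, "below threshold:", below)
# Compare against cover=3 (node 1) and cover=5999997 (node 2) in booster[2] above.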
Thanks for taking a look and giving me some direction, @RAMitchell!
I dug into a very similar example (details below), but I'm at a loss as to where to point the debugger next; do you have a suggestion?
I'm also open to any other thoughts you might have about tools or routines you generally use to track down these sorts of issues (additional compiler settings, xgb logging modes, or just throw me some links to guides you find helpful). Also, let me know if at any point you're interested in connecting over a higher-bandwidth channel, if you feel it would take less of your time than back-and-forth over issues here (regardless, I'm extremely grateful for all your help).
Script & Parameters
import sys
import xgboost as xgb
dtrain = xgb.DMatrix(sys.argv[1])
params = {
    "max_depth": 2,
    "eta": 1.0,
    "verbosity": 3,
    "objective": "reg:linear",
    "tree_method": "gpu_hist",
    "n_gpus": 8,
    "max_bin": 31,
    "eval_metric": ['mae', 'rmse']
}
eval_result = {}
bst = xgb.train(params, dtrain, num_boost_round=100, evals_result=eval_result, evals=[(dtrain, 'dtrain')])
print(eval_result)
bst.save_model("repro.model")
bst.dump_model("repro.model.txt", with_stats=True)
I ran with reg:linear as you suggested (though on my multiclass-labeled dataset with labels in [0, 1, 2]).
Debugging state
The first tree seemed to build fine, but the second one was immediately unstable; here is some debugging output (second and third tree); more detailed output here:
debug_info_reg-linear.txt
[0] dtrain-mae:0.148228 dtrain-rmse:0.35644
2: parent_sum = {sum_grad = -54.15625, sum_hess = 6000000}
3: base_weight = 9.0260400611441582e-06
4: left_stats = {sum_grad = 179347.765625, sum_hess = 5490956}
5: left_weight = -0.032662387937307358
6: right_stats = {sum_grad = -179401.921875, sum_hess = 509044}
7: right_weight = 0.35242840647697449
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 0,
depth = 0, split = {loss_chg = 69084.2578, dir = xgboost::tree::kLeftDir, fvalue = 4, findex = 10619, left_sum = {grad_ = 179347.766, hess_ = 5490956},
right_sum = {grad_ = -179401.922, hess_ = 509044}}, timestamp = 0}
(gdb)
(gdb) c
Continuing.
2: parent_sum = {sum_grad = 179347.76193237305, sum_hess = 5490956}
3: base_weight = -0.032662387937307358
4: left_stats = {sum_grad = 353.69943237304688, sum_hess = 1}
5: left_weight = -176.84971618652344
6: right_stats = {sum_grad = 178994.0625, sum_hess = 5490955}
7: right_weight = -0.032597977668046951
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 1,
depth = 1, split = {loss_chg = 62528.5664, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 19981, left_sum = {grad_ = 353.699432,
hess_ = 1}, right_sum = {grad_ = 178994.062, hess_ = 5490955}}, timestamp = 1}
(gdb)
(gdb) c
Continuing.
2: parent_sum = {sum_grad = -179401.91696166992, sum_hess = 509044}
3: base_weight = 0.35242840647697449
4: left_stats = {sum_grad = -351.65133666992188, sum_hess = 2}
5: left_weight = 117.21710968017578
6: right_stats = {sum_grad = -179050.265625, sum_hess = 509042}
7: right_weight = 0.35173898935317993
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 2,
depth = 1, split = {loss_chg = 40972.1836, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 3, left_sum = {grad_ = -351.651337,
hess_ = 2}, right_sum = {grad_ = -179050.266, hess_ = 509042}}, timestamp = 2}
(gdb) c
Continuing.
[1] dtrain-mae:0.155413 dtrain-rmse:1.35858
2: parent_sum = {sum_grad = 83074.714721679688, sum_hess = 6000000}
3: base_weight = -0.013845783658325672
4: left_stats = {sum_grad = -528.5274658203125, sum_hess = 3}
5: left_weight = 132.13186645507812
6: right_stats = {sum_grad = 83603.2421875, sum_hess = 5999997}
7: right_weight = -0.013933878391981125
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 0,
depth = 0, split = {loss_chg = 69850, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 19981, left_sum = {grad_ = -528.527466,
hess_ = 3}, right_sum = {grad_ = 83603.2422, hess_ = 5999997}}, timestamp = 0}
(gdb) c
Continuing.
2: parent_sum = {sum_grad = -528.52748870849609, sum_hess = 3}
3: base_weight = 132.13186645507812
4: left_stats = {sum_grad = -645.109619140625, sum_hess = 2}
5: left_weight = 215.03654479980469
6: right_stats = {sum_grad = 116.58213043212891, sum_hess = 1}
7: right_weight = -58.291065216064453
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 1,
depth = 1, split = {loss_chg = 75682.5234, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 84, left_sum = {grad_ = -645.109619,
hess_ = 2}, right_sum = {grad_ = 116.58213, hess_ = 1}}, timestamp = 1}
(gdb) c
Continuing.
2: parent_sum = {sum_grad = 83603.2451171875, sum_hess = 5999997}
3: base_weight = -0.013933878391981125
4: left_stats = {sum_grad = -1477.5361328125, sum_hess = 10}
5: left_weight = 134.32147216796875
6: right_stats = {sum_grad = 85080.78125, sum_hess = 5999987}
7: right_weight = -0.014180158264935017
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 2,
depth = 1, split = {loss_chg = 198506.359, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 3, left_sum = {grad_ = -1477.53613,
hess_ = 10}, right_sum = {grad_ = 85080.7812, hess_ = 5999987}}, timestamp = 2}
(gdb) c
Continuing.
[2] dtrain-mae:0.357891 dtrain-rmse:5.60447
The model is identical to the one posted above (i.e. the results here are consistent and deterministic).
I have managed to reproduce the instability with a relatively small data set which I can share. Here is an archive of the data, a repro script, and the model output: xgboost-issue-4204.tar.gz
Updates from above
- The instability reproduces with the cpu hist method (no GPU required)
- The repro uses a small max_bins value (see the hyperparameters below)
Symptoms of the instability
Loss and error increasing over iterations:
[6] dtrain-error:0.140089
[7] dtrain-error:0.140464
[8] dtrain-error:0.141727
[9] dtrain-error:0.136664
[10] dtrain-error:0.138864
[11] dtrain-error:0.143077
[12] dtrain-error:0.138251
Unreasonably large leaf weights (only starting in the second iteration):
booster[0]:
0:[f1<-9.99999975e-06] yes=1,no=2,missing=1,gain=24160.4473,cover=19998
1:[f4<0.958100021] yes=3,no=4,missing=3,gain=548.710938,cover=6389
3:leaf=1.93459618,cover=5305.5
4:leaf=1.15366864,cover=1083.5
2:[f2<24.2000008] yes=5,no=6,missing=5,gain=16520.0605,cover=13609
5:leaf=0.636635423,cover=6271.25
6:leaf=-1.57371128,cover=7337.75
booster[1]:
0:[f1<-9.99999975e-06] yes=1,no=2,missing=1,gain=2211.51538,cover=12981.4521
1:[f3<30.7999992] yes=3,no=4,missing=3,gain=635.600464,cover=3131.0188
3:leaf=0.852509975,cover=3060.12256
4:leaf=-2.17617011,cover=70.896431
2:[f2<24.9500008] yes=5,no=6,missing=5,gain=2562.04663,cover=9850.43359
5:leaf=0.179850996,cover=6565.78125
6:leaf=-0.901914954,cover=3284.6521
booster[2]:
0:[f2<27.9200001] yes=1,no=2,missing=1,gain=1546.1908,cover=9669.37793
1:[f3<30.3500004] yes=3,no=4,missing=3,gain=1597.83191,cover=9472.5332
3:leaf=-0.190983862,cover=8581.21387
4:leaf=1.21573615,cover=891.319031
2:[f1<-9.99999975e-06] yes=5,no=6,missing=5,gain=2776.03711,cover=196.845535
5:leaf=1.06381559,cover=163.066742
6:leaf=11.0240746,cover=33.7788048
Hyperparameters for repro
While with the gpu_hist method I was able to see the instability for the reg:linear objective, I could only reproduce it for classification (binary:logistic) with the cpu-based hist method.
{
"nthread": 1,
"max_depth": 2,
"eta": 1.0,
"objective": "binary:logistic",
"num_boost_round": 125,
"tree_method": "hist",
"max_bin": 31,
"lambda": 0.0,
"eval_metrics": ["logloss", "error"]
}
Note: The high learning rate and lack of L2 regularization in these parameters are there to surface the instability as quickly as possible for the repro. Setting a small learning rate or adding L2 regularization slows, but does not stop, the instability.
Best guess
My best guess is that there is a pathological interaction between the data distribution and the histogram method. It seems to me that I'm hitting some edge case in histogram binning, histogram gradient statistics calculation/updating, histogram split finding, or histogram prediction caching that is resulting in abnormally large leaf weights. I've tried sifting through the statistics, splits, and predictions in gdb, but I'm simply not familiar enough with the code to narrow things any further. @hcho3 @RAMitchell I hope you can use the data I've provided to shed some light on the issue. Many thanks!
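In case it saves anyone time, the comparison I keep coming back to looks roughly like this (a sketch; the libsvm filename is a placeholder for the file in the attached archive):
import xgboost as xgb

# Placeholder filename -- substitute the libsvm file from the attached archive.
dtrain = xgb.DMatrix("repro-data.libsvm")

base = {"max_depth": 2, "eta": 1.0, "objective": "binary:logistic",
        "max_bin": 31, "lambda": 0.0, "eval_metric": ["logloss", "error"]}

for method in ("hist", "exact"):
    results = {}
    xgb.train({**base, "tree_method": method}, dtrain, num_boost_round=125,
              evals=[(dtrain, "dtrain")], evals_result=results, verbose_eval=False)
    # With 'hist' the training error climbs after a few rounds; with 'exact' it does not.
    print(method, results["dtrain"]["error"][-5:])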
I had a look today. There is nothing wrong with prediction or prediction caching as far as I can tell.
The problem is identical in 'gpu_hist' and 'hist' but doesn't occur in 'exact'. Reducing the learning rate to 0.5 resolves the problem in this case. @jeffdk can you confirm what learning rates you are using on your larger dataset?
One theory is that, since xgboost uses a Newton-style gradient descent algorithm, it is not guaranteed to converge at learning rate 1.0, due to the approximation of the loss function. But you said above that you also tried the squared error objective, which would not fail if this were the case.
The dataset looks very sparse so maybe there is a problem with missing values?
Hi @RAMitchell, thanks for taking a look!
The problem is identical in 'gpu_hist' and 'hist' but doesn't occur in 'exact'.
That is correct
Reducing the learning rate to 0.5 resolves the problem in this case.
Actually, I claim there is still an issue at a learning rate of 0.5: the dtrain error increases on iteration 10. Here is the corresponding tree:
booster[10]:
0:[f2<25.0499992] yes=1,no=2,missing=1,gain=158.67746,cover=9094.04004
1:[f3<30.3500004] yes=3,no=4,missing=3,gain=352.901093,cover=7533.38428
3:leaf=-0.00136313098,cover=6965.0625
4:leaf=0.408399105,cover=568.321716
2:[f4<0.557699978] yes=5,no=6,missing=5,gain=233.026367,cover=1560.6554
5:leaf=0.464131415,cover=142.393295
6:leaf=-0.206838489,cover=1418.26208
@jeffdk can you confirm what learning rates you are using on your larger dataset?
I usually use rates in the range [0.09, 0.2]. I've only been cranking the rate up to make the instability easier to see.
One theory is that xgboost uses a Newton style gradient descent algorithm. Therefore it is not guaranteed to converge for learning rate 1.0 due to approximation of the loss function.
I still see a divergence using the attached example and a learning rate of 0.9. Regardless, my impression is that for the binomial logloss, there should be no problems with convergence _mathematically_ given full Newton steps (well, at least, I claim the solution shouldn't _diverge_ right?).
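For reference, the full Newton step I have in mind is just the standard second-order boosting update for binomial logloss (nothing specific to this bug):
g_i = p_i - y_i, \qquad h_i = p_i (1 - p_i), \qquad
w^{*}_{\text{leaf}} = -\frac{\sum_{i \in \text{leaf}} g_i}{\sum_{i \in \text{leaf}} h_i + \lambda}
where p_i is the predicted probability, y_i the label, and lambda the L2 term (0 in my repro parameters).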
But you said above that you also tried the squared error objective, which would not fail if this were the case.
I do see the instability on the full data set with gpu_hist using squared error. I'm not able to repro an instability using squared-error loss with the cpu hist method.
The dataset looks very sparse so maybe there is a problem with missing values?
Sparsity in some way may be related. What sort of problems with missing values do you think could manifest in this way?
Closing in favour of https://github.com/dmlc/xgboost/issues/5023