XGBoost: Can you pass sample weights to eval_sets?

Created on 23 Nov 2016 · 15 comments · Source: dmlc/xgboost

Hello, I am using XGBoost 0.6 for Python, and XGBClassifier to train some models. I have sample weights for my data and pass them into fit and into the eval_set. I don't get an error when I include the weights in eval_set, but this isn't mentioned in the documentation, so I am not sure they are actually being used.

eval_set = [(train[clf.signal_names].values, train['label'], train.weight),
            (testing[clf.signal_names].values, testing['label'], testing.weight)]

model2.fit(train[clf.signal_names].values, train['label'], sample_weight=train.weight,
           eval_set=eval_set, eval_metric='logloss', verbose=True)

Also, is there any data format I can pass into XGBClassifier that will lower the memory consumption? The first stage of training takes a large amount of memory even though the data is already in memory. Can I use a DMatrix with XGBClassifier instead?
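For context, this is the kind of native-API alternative I am wondering about (just a sketch: xgb.DMatrix accepts a weight argument and xgb.train evaluates on DMatrix objects directly; the X_*, y_*, w_* names are placeholders, not from my real pipeline):

    import xgboost as xgb

    # Placeholder arrays: X_train/X_test are feature matrices,
    # y_* are labels, w_* are per-row sample weights.
    dtrain = xgb.DMatrix(X_train, label=y_train, weight=w_train)
    dtest  = xgb.DMatrix(X_test,  label=y_test,  weight=w_test)

    params = {'objective': 'binary:logistic', 'eval_metric': 'logloss'}

    # Metrics computed on a DMatrix respect its weights, so early stopping
    # here optimises the weighted logloss.
    bst = xgb.train(params, dtrain,
                    num_boost_round=500,
                    evals=[(dtrain, 'train'), (dtest, 'test')],
                    early_stopping_rounds=20)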

Thanks,
Damien

All 15 comments

In XGBModel your weights are definitely not being used: the third item of each eval_set tuple is simply ignored - sklearn.py#L226

In XGBClassifier the weights are not used either, but there is a TODO asking whether sample_weight should also be used for eval_sets - sklearn.py#L421

So, to answer your question: there is currently no way of passing weights for the eval_set through fit() (yet).

Not sure whether a DMatrix can be used or not (I have never used it myself).

Thanks, yeah, that is a pretty definitive answer. I suppose I just won't be able to use early stopping. I am currently stepping through a completed model using model.predict_proba(data, ntree_limit=a), which is pretty slow. Do you know of a faster way to measure the performance per tree, given that the eval_set error won't reflect the weighted data?

Hm, nothing that I can think of.
I just created https://github.com/dmlc/xgboost/issues/1806 as I think an "extend" feature would be useful for a scenario like that (or other scenarios, as I described in the issue)

Sorry for the noise. The extend feature already exists in XGBoost. No idea how to make it faster.
A simple (and possibly faster) workaround could be to copy each row multiple times so that every weight becomes 1, i.e. if a row's weight is 10, just copy that row 10 times in the testing dataset (see the sketch below).
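Something along these lines (a sketch with numpy, only practical for small integer weights and modest data sizes; X_test, y_test, w_test are placeholder names):

    import numpy as np

    # Repeat each row according to its (integer) weight so that an
    # unweighted eval metric behaves like the weighted one.
    w_int = np.asarray(w_test, dtype=int)
    X_rep = np.repeat(X_test, w_int, axis=0)
    y_rep = np.repeat(np.asarray(y_test), w_int)

    eval_set = [(X_rep, y_rep)]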

No worries, I appreciate the info. It looks like the extend feature exists if I don't use the sklearn wrapper, which might be one way to go; it just requires more work to fit into the pipeline. That is a good idea for getting around the issue with the weights, but unfortunately my data is already quite large. Even without early stopping, and with the slow stepping through trees, it is still much faster than the sklearn version!

It appears that the predict function is much faster if you don't use the XGBClassifier wrapper. My guess is that the data gets converted on every prediction call, which is what makes it take longer.
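For example, something like this avoids the per-call conversion by building the DMatrix once and calling the underlying booster directly (a sketch, assuming get_booster() is available in the installed version — older releases expose it as booster() — and that ntree_limit is still supported; n_trees is a placeholder for the number of boosting rounds):

    import xgboost as xgb

    # Convert the evaluation data to a DMatrix once and reuse it.
    dtest = xgb.DMatrix(testing[clf.signal_names].values)
    bst = model2.get_booster()

    # Predictions using the first n trees, for each n.
    per_tree_preds = [bst.predict(dtest, ntree_limit=n) for n in range(1, n_trees + 1)]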

Hello,

My team and I (we are an insurance company) are also very interested in being able to pass sample weights in the eval_set.

I understand that XGBRegressor and XGBClassifier now both handle sample_weight in the fit method: https://github.com/dmlc/xgboost/pull/1874

Being able to have sample weights in eval_set as well is crucial: without them, early stopping does not optimise the same weighted loss, which leads to biased estimators...

Indeed, weights are often used in insurance problems to handle the duration of contracts at risk. For instance, when estimating claim frequencies with a Poisson regression, regressing on the number of observed claims normalised by duration, AND using duration as sample_weight, is the way to obtain unbiased frequency estimates.

Early stopping on a validation set of contracts that omits their duration is not good. A sketch of this setup follows.
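Something like this (a sketch only; df is a hypothetical pandas DataFrame and the 'claim_count', 'exposure' and feature_cols names are illustrative):

    from xgboost import XGBRegressor

    # Target is the observed claim frequency; exposure (duration at risk)
    # is passed as the sample weight, as described above.
    freq = df['claim_count'] / df['exposure']

    model = XGBRegressor(objective='count:poisson', n_estimators=500)
    model.fit(df[feature_cols], freq, sample_weight=df['exposure'])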

I have just created a PR solving this issue: https://github.com/dmlc/xgboost/pull/2354

@sebconort, @damienrj, could you confirm if the aforementioned PR (https://github.com/dmlc/xgboost/pull/2354) works for you?

Hello, sorry for the very late answer. Yes, the PR works for me.
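For anyone landing here later: with that change the sklearn wrapper accepts one weight array per eval_set entry. A sketch, assuming the parameter is named sample_weight_eval_set as in later releases (the placement of eval_metric and early_stopping_rounds in fit follows the releases contemporary with this thread; newer versions move them to the constructor; X_*, y_*, w_* are placeholders):

    from xgboost import XGBClassifier

    model = XGBClassifier(n_estimators=500)
    model.fit(X_train, y_train,
              sample_weight=w_train,
              eval_set=[(X_valid, y_valid)],
              sample_weight_eval_set=[w_valid],  # one weight array per eval_set entry
              eval_metric='logloss',
              early_stopping_rounds=20,
              verbose=True)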

Hi @pdavalo,
was #2354 ever merged into master? This looks very useful, and I'm confused why it was not propagated into v0.71.

The PR sat awaiting review/approval for seven months, and was then closed because it needed to be rebased onto master: https://github.com/dmlc/xgboost/pull/2354. I will re-open the PR to see if it makes it into the next version :)

@pdavalo I went ahead and rebased the PR on top of the current master. The issue was mainly that the sklearn interface had a few new parameters (nthread, xgb_model).

@hcho3 Thanks!

#2354 has been merged into master. Thanks!
