Xgboost: Sklearn API Enhancements Needed

Created on 19 May 2017 · 24 comments · Source: dmlc/xgboost

Operating System:
Windows 7 (build 7601, Service Pack 1) 64bit
CUDA 8.0
cuDNN 5.1

Hardware:
CPU: Xeon E5-1620 V4
GPU: Asus GTX 1080 ti FE

Compiler:
Downloaded pre-compiled version from Guido Tapia on 5/12/17

Package used (python/R/jvm/C++):
Python

xgboost version used:
0.6

The python version and distribution
Anaconda Python 2.7

@jseabold and @terrytangyuan, it appears there are a number of parameters available for the Booster class that are not available in the Sklearn API (e.g. param['updater'] = 'grow_gpu'). It also looks like there's some interest in the community in getting these things added to the Sklearn API (since this is such a familiar API to so many people). #1885 mentions adding GPU support to the Sklearn API. I think it would be fairly easy to make these alterations, and I'd be happy to work on this.

Would you be interested in a PR for these things?

Also, there are two parameters in the Sklearn API that do not follow Sklearn conventions - "seed" should be "random_state" and "nthread" should be "n_jobs". I would be happy to submit a PR for these things as well.

Please let me know if you're interested in this. Thanks!
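For readers less familiar with the two interfaces, here is a minimal sketch of the gap being described. The parameter names come from the native Booster API, the data is synthetic, and the 'grow_gpu' updater assumes an XGBoost build compiled with GPU support.

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)
y = np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)

# Native API: arbitrary booster parameters can be passed through the params dict.
params = {'objective': 'binary:logistic', 'updater': 'grow_gpu'}
bst = xgb.train(params, dtrain, num_boost_round=10)

# Sklearn API (0.6): only a fixed set of constructor arguments is exposed,
# so there is no documented way to set 'updater' here.
clf = xgb.XGBClassifier(max_depth=3, n_estimators=10)
clf.fit(X, y)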


All 24 comments

Yes, PRs are always welcome!

Great. I am thinking it's probably best to do these in two separate PRs - one that updates the Sklearn API to match Sklearn conventions, and another that exposes the remaining parameters available to the train function.

Does that seem reasonable?

Yes, though don't remove the old args just yet, for backward compatibility purposes. A deprecation warning would be good.

Right, I'll just add the new args and a deprecation warning. I might be able to get the first PR in today. If not, I'll submit it on Monday.
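One possible shape for the rename-plus-deprecation approach, as a sketch only (the actual constructor signature and warning text would be settled in the PR):

import warnings

class SklearnParamsSketch(object):
    # Illustrative only: accept the new sklearn-style names while still
    # honouring the old ones, warning whenever the old names are used.
    def __init__(self, n_jobs=1, random_state=0, nthread=None, seed=None):
        if nthread is not None:
            warnings.warn("'nthread' is deprecated; use 'n_jobs' instead.",
                          DeprecationWarning)
            n_jobs = nthread
        if seed is not None:
            warnings.warn("'seed' is deprecated; use 'random_state' instead.",
                          DeprecationWarning)
            random_state = seed
        self.n_jobs = n_jobs
        self.random_state = random_state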

I just filed the issue linked below: with large-ish data, tree_method switches to 'approx', and I can't seem to switch it back to 'exact' using the Sklearn/Python API. Do you think my issue is a subset of this issue?

https://github.com/dmlc/xgboost/issues/2316
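For anyone hitting the same thing in the meantime, the native API does accept tree_method directly; the sketch below assumes synthetic data and a build that supports the exact method, since the 0.6 sklearn wrapper exposes no constructor argument for it.

import numpy as np
import xgboost as xgb

X = np.random.rand(200000, 20)            # large-ish synthetic data
y = np.random.randint(2, size=200000)
dtrain = xgb.DMatrix(X, label=y)

# Force the exact algorithm through the native params dict.
params = {'objective': 'binary:logistic', 'tree_method': 'exact'}
bst = xgb.train(params, dtrain, num_boost_round=10)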

Please also make it so nthread (or soon, n_jobs) is 1 by default. It can be problematic for users on high-performance computing clusters when a job unexpectedly uses the maximum number of threads possible on a machine. Default behavior should assume that the user does not want multithreading/multiprocessing.

Randy, I had also thought of that but wasn't sure what the maintainers would think. If there's no objection from anyone, I will go ahead and add that in as well.


For old parameters, keep the way it is. For new parameters, do whatever makes more sense, e.g. default to 1 in this case.

@terrytangyuan nthread is not a new parameter to the sklearn API (although a name change is being discussed), so default=1 would be a change.

In this case I actually have the new parameter overriding the old parameter by default, so whatever is set as the default for n_jobs will be the default overall.


n_jobs is 1 by default in scikit-learn.

@rhiever, my thoughts exactly.

@terrytangyuan, what do you think?

"n_jobs is 1 by default in scikit-learn."

...hey sorry guys if I'm being really dense here. But the xgboost Python docs and the Sklearn-API portion of those docs list nthread=-1 as the default in the class definitions of both xgboost.XGBClassifier and xgboost.XGBRegressor. Neither lists an n_jobs parameter.

Also when I run

import xgboost as xgb
clf = xgb.XGBClassifier()
print(clf)  # the repr lists the estimator's current parameter settings

...I'm shown an xgb clf object listing nthread=-1 as a parameter, with no n_jobs parameter. I have version 0.6, the latest on PyPI.

Are both my xgboost and the documentation out of date and n_jobs is actually 1 by default? Or am I misunderstanding something?

@MaxPowerWasTaken, you are correct that presently the parameter is nthread and it defaults to -1. We are talking about bringing the API completely up to sklearn standards, which means an n_jobs parameter that defaults to 1 (instead of nthread). I hope that clears it up.

Thanks for the clarification.

I've been responding mainly to @terrytangyuan's quote

"For old parameters, keep the way it is."

... I've been trying to point out that renaming nthread to n_jobs and defaulting it to 1 would be a change to an old parameter, not just a rename, and not a new parameter.

But maybe that was clear to everyone but me. I'll leave you guys to it. Thanks to everyone for your work on such an awesome library.

@MaxPowerWasTaken: Yes, it would certainly be changing an old parameter, but in an important way (IMO). Many users on my end have become angry with XGBoost because it defaults to nthread=-1, which is not typical behavior elsewhere in the Python ecosystem.

Sounds good to me

@rhiever, the parameter is now n_jobs and set to 1 by default for the sklearn API. We've also renamed the "seed" parameter to "random_state", in keeping with the Sklearn standard.
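For reference, with a build that includes these changes, construction would look something like the following (the parameter values here are arbitrary):

import xgboost as xgb

# sklearn-style names after the rename: n_jobs instead of nthread,
# random_state instead of seed.
clf = xgb.XGBClassifier(n_estimators=100, n_jobs=1, random_state=42)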

Thanks so much. What version will these changes be released in?

@rhiever Good point. @gaw89 Could you increment the version number as well?

Sure, to what number?


Actually don't worry about it. We'll ask @phunterlau to bump that for us when he submits the next version. @rhiever Just stay tuned for the next (major/minor/TBD) version.

Great. While we're at it, can we add a verbose parameter (default 0) that controls XGBoost's verbosity? We've been having issues with TPOT where XGBoost prints error and warning messages when TPOT is in silent mode.

@rhiever, fixing the verbosity setup of the Sklearn API is definitely on my to-do list. I'll probably work on it sometime in the next week or so. I'll post back here with any updates.
