There is some confusion in naming: the non-Dask API uses "sample_weight", but the Dask API uses "sample_weights", even though the code docs for the Dask case show "sample_weight" like the non-Dask case.
```python
class DaskScikitLearnBase(XGBModel):
    '''Base class for implementing scikit-learn interface with Dask'''

    _client = None

    # pylint: disable=arguments-differ
    def fit(self, X, y,
            sample_weights=None,
            eval_set=None,
            sample_weight_eval_set=None,
            verbose=True):
        '''Fit the regressor.

        Parameters
        ----------
        X : array_like
            Feature matrix
        y : array_like
            Labels
        sample_weight : array_like
            instance weights
        eval_set : list, optional
            A list of (X, y) tuple pairs to use as validation sets, for which
            metrics will be computed.
            Validation metrics will help us track the performance of the model.
        sample_weight_eval_set : list, optional
            A list of the form [L_1, L_2, ..., L_n], where each L_i is a list
            of group weights on the i-th validation set.
        verbose : bool
            If `verbose` and an evaluation set is used, writes the evaluation
            metric measured on the validation set to stderr.'''
        raise NotImplementedError
```
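To make the mismatch concrete, here is a minimal sketch of how it bites; the setup is made up, and the behavior is inferred from the signatures quoted in this thread (xgboost around 1.2):

```python
from dask.distributed import Client, LocalCluster
import dask.array as da
import xgboost as xgb

with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((1000, 10), chunks=(100, 10))
    y = da.random.random(1000, chunks=100)
    w = da.random.random(1000, chunks=100)

    # No explicit client set: the wrapper is expected to discover the
    # current one (see the get_client() discussion below).
    model = xgb.dask.DaskXGBRegressor(n_estimators=10)
    # Matches XGBRegressor.fit and the docstring above, but is rejected by
    # the Dask signature quoted above:
    #   model.fit(X, y, sample_weight=w)   # TypeError: unexpected keyword
    model.fit(X, y, sample_weights=w)      # what the Dask signature accepts
```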
In the same file one also has:

```
sample_weights: list of arrays
    The weight vector for validation data.
```
As if it is a list of arrays, but what is the meaning of that? I think it should be just like the non-Dask case and be a single array.
Perhaps there is some confusion between the fact that the sample_weight for (X, y) is a single array, while the weights for eval_set need one array per validation set provided.
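For comparison, the single-node API already draws exactly this distinction; a short runnable illustration (shapes and data are made up):

```python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X, y = rng.random_sample((100, 5)), rng.random_sample(100)
Xv, yv = rng.random_sample((40, 5)), rng.random_sample(40)

xgb.XGBRegressor(n_estimators=10).fit(
    X, y,
    sample_weight=np.ones(100),            # one array for the training data
    eval_set=[(Xv, yv)],
    sample_weight_eval_set=[np.ones(40)],  # a list: one weight array per eval set
    verbose=False,
)
```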
A related confusion: the normal xgboost scikit-learn API (https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit) has fit take early_stopping_rounds, etc.
Is it correct that the Dask API has no such option and so cannot do early stopping? That would be a critical missing feature.
I understand the docs say callbacks are not tested with Dask: https://xgboost.readthedocs.io/en/latest/tutorials/dask.html#limitations
But does this mean no eval_metric or early stopping is possible?
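For reference, this is what the thread is comparing against: in the 1.x single-node sklearn API, eval_metric and early_stopping_rounds are accepted by fit. A minimal runnable sketch:

```python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X, y = rng.random_sample((1000, 10)), rng.random_sample(1000)

model = xgb.XGBRegressor(n_estimators=500)
model.fit(
    X[:800], y[:800],
    eval_set=[(X[800:], y[800:])],      # held-out validation split
    eval_metric="rmse",
    early_stopping_rounds=10,           # stop when rmse stalls for 10 rounds
    verbose=False,
)
print(model.best_iteration)  # the question is how to get this with Dask
```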
Are we supposed to use Dask for distributed multi-GPU xgboost training? Or is that all deprecated, so one can just pass a dask_cudf frame into the normal model?
Also, in general, it seems like API overhead to maintain separate APIs for Dask and non-Dask. You know whether the incoming frame is dask_cudf, dask, or neither, so you should be able to use a single API. This would make it a lot easier to use.
@pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).
> it seems like API overhead to maintain separate APIs for Dask and non-Dask. You know whether the incoming frame is dask_cudf, dask, or neither, so you should be able to use a single API. This would make it a lot easier to use.
Not quite, since the user has to pass the Dask client object into the Dask API. This is because the user has wide latitude in configuring a Dask cluster.
Hmm, this suggests it's optional:
https://github.com/dmlc/xgboost/blob/master/demo/dask/sklearn_gpu_training.py#L22
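Paraphrasing what the linked demo appears to do (a sketch from the demo's layout, not a verbatim copy; whether the client assignment is required is exactly the question here):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask.array as da
import xgboost as xgb

with LocalCUDACluster() as cluster, Client(cluster) as client:
    X = da.random.random((10000, 20), chunks=(1000, 20))
    y = da.random.random(10000, chunks=1000)

    model = xgb.dask.DaskXGBRegressor(tree_method='gpu_hist')
    model.client = client  # the assignment under discussion; looks optional
    model.fit(X, y)
```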
I would still say that a single API, just calling model.fit()/model.predict() regardless, is highly beneficial. If "client" must be passed, it can be an extra optional kwarg (default None) that is asserted on in case a Dask frame is passed.
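A hypothetical sketch of that dispatch; the class and helper names (UnifiedRegressor, _fit_dask, _fit_local) are invented for illustration and are not part of the xgboost API:

```python
import dask.array as da
import dask.dataframe as dd

DASK_TYPES = (da.Array, dd.DataFrame, dd.Series)

class UnifiedRegressor:
    """Sketch: one fit() regardless of backend."""

    def fit(self, X, y, client=None, **kwargs):
        if isinstance(X, DASK_TYPES):
            # Dask input: require (or later, lazily discover) a client.
            assert client is not None, "pass a dask.distributed.Client for Dask inputs"
            return self._fit_dask(X, y, client=client, **kwargs)
        return self._fit_local(X, y, **kwargs)

    def _fit_dask(self, X, y, client, **kwargs):
        raise NotImplementedError  # distributed path

    def _fit_local(self, X, y, **kwargs):
        raise NotImplementedError  # single-node path
```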
> @pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).

That's great. I see it's merged, so it will be in 1.3.0?
@pseudotensor So DaskXGBRegressor can automatically fetch get_client() from dask.distributed if no client is given by the user.
In general, I agree that users should be able to use DaskXGBRegressor just like XGBRegressor. Any major API mismatch should be avoided.
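A minimal sketch of that fallback, assuming get_client() from dask.distributed (not the exact xgboost implementation):

```python
from dask.distributed import get_client

class DaskScikitLearnBase:
    """Sketch of the fallback; the attribute name is assumed from the snippet above."""
    _client = None

    @property
    def client(self):
        # Use the explicitly assigned client if there is one, otherwise
        # fall back to the client bound to the current Dask context.
        return self._client if self._client is not None else get_client()

    @client.setter
    def client(self, value):
        self._client = value
```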
> That's great. I see it's merged, so it will be in 1.3.0?

Yes, we aim to have a working callback mechanism for Dask in the 1.3.0 release.
Ya, and I mean even further: there really only needs to be XGBRegressor/XGBClassifier; I can't see why we need Dask-specialized versions. From a usability perspective it's extra confusion. Setting up the Dask client and the frames involves specialized options, but fit/predict/etc. do not need those, I think.
(For context, the primary "sample_weight(s)" issue is still there.)
We can consider merging DaskXGBRegressor with XGBRegressor once we have feature parity between the two. The Dask API must support the callback mechanism first, and that's currently in progress.
Assigning myself to this issue, to address any mismatch between the DaskXGBRegressor and XGBRegressor APIs.
Thanks. I'm not even sure things function as they are. At least PyCharm does not like me passing sample_weights, probably because one of the base classes uses sample_weight. There is no way to get PyCharm to accept it, plural or not.
@hcho3 BTW, a reason why it's not a good idea to have the model accept "client" is that "client" is not serializable, so this will break anything that relies on pickle. Using the context-manager approach as above, or a lazy getter from Dask, is probably best.
That is, I experienced this just now: following https://github.com/dmlc/xgboost/blob/master/demo/dask/sklearn_gpu_training.py#L22 and adding the client attribute leads to a model that cannot be pickled.
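What that failure looks like, with get_booster as a workaround (a sketch; the exact exception raised when pickling a live Client may vary):

```python
import pickle
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((100, 5), chunks=(50, 5))
    y = da.random.random(100, chunks=50)

    model = xgb.dask.DaskXGBRegressor(n_estimators=5)
    model.client = client  # the assignment that breaks pickling
    model.fit(X, y)

    try:
        pickle.dumps(model)            # fails: the attached Client is not picklable
    except Exception:                  # exact error type varies by distributed version
        booster = model.get_booster()  # the plain Booster carries no client
        pickle.dumps(booster)          # and round-trips fine
```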
For the skl interface, accepting an optional client parameter might be necessary, as the client object carries out most of the async switches. If a wrong client is returned by get_client, the rest will be completely broken.
The typo, however, should be fixed. @hcho3 I will submit a PR for that after getting around the overflow issue.
@trivialfis What should we do about client not being serializable?
If "client" must be passed, it can be an extra optional kwarg (default None) that is asserted on in case dask frame is passed.
I tried that before settling on the current interface and went through some discussions; the result is that it's not possible at the moment. But I agree that's something we can work toward in the future: a unified interface with different backends. Right now, getting the Dask interface to feature parity with single-node is the priority. Sorry for the inconvenience.
> What should we do about client not being serializable?
Users can use the functional train and predict interfaces. Or avoid setting the client object on the skl interface when it's not needed. Or use get_booster to obtain the real model. Or ...
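One library-side option implied above is to drop the client during pickling. A standalone sketch of the idea (invented class name; not the actual xgboost code):

```python
class SerializableDaskModel:
    """Sketch: make a client-holding model picklable by dropping the client."""

    def __init__(self, client=None):
        self._client = client

    def __getstate__(self):
        # Drop the unpicklable Client; the user (or a get_client() fallback)
        # can re-attach one after unpickling.
        state = self.__dict__.copy()
        state['_client'] = None
        return state
```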