There is some confusion in naming: the non-Dask API uses "sample_weight", but the Dask API uses "sample_weights", even though the code docs for the Dask case show "sample_weight" like the non-Dask case.
```python
class DaskScikitLearnBase(XGBModel):
    '''Base class for implementing scikit-learn interface with Dask'''

    _client = None

    # pylint: disable=arguments-differ
    def fit(self, X, y,
            sample_weights=None,
            eval_set=None,
            sample_weight_eval_set=None,
            verbose=True):
        '''Fit the regressor.

        Parameters
        ----------
        X : array_like
            Feature matrix
        y : array_like
            Labels
        sample_weight : array_like
            instance weights
        eval_set : list, optional
            A list of (X, y) tuple pairs to use as validation sets, for which
            metrics will be computed.
            Validation metrics will help us track the performance of the model.
        sample_weight_eval_set : list, optional
            A list of the form [L_1, L_2, ..., L_n], where each L_i is a list
            of group weights on the i-th validation set.
        verbose : bool
            If `verbose` and an evaluation set is used, writes the evaluation
            metric measured on the validation set to stderr.'''
        raise NotImplementedError
```
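To make the mismatch concrete, here is a minimal sketch of how it bites; the setup is made up, and the behavior is inferred from the signatures quoted in this thread (xgboost around 1.2):

```python
from dask.distributed import Client, LocalCluster
import dask.array as da
import xgboost as xgb

with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((1000, 10), chunks=(100, 10))
    y = da.random.random(1000, chunks=100)
    w = da.random.random(1000, chunks=100)

    # No explicit client set: the wrapper is expected to discover the
    # current one (see the get_client() discussion below).
    model = xgb.dask.DaskXGBRegressor(n_estimators=10)
    # Matches XGBRegressor.fit and the docstring above, but is rejected by
    # the Dask signature quoted above:
    #   model.fit(X, y, sample_weight=w)   # TypeError: unexpected keyword
    model.fit(X, y, sample_weights=w)      # what the Dask signature accepts
```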
In the same file one also has:

```
sample_weights: list of arrays
    The weight vector for validation data.
```
As if it is a list of arrays, but what is the meaning of that? I think it should be just like the non-Dask case and be a single array.
Perhaps there is some confusion between the fact that the sample_weight for (X, y) is a single array, while the weights for eval_set need one array per validation set provided.
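For comparison, the single-node API already draws exactly this distinction; a short runnable illustration (shapes and data are made up):

```python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X, y = rng.random_sample((100, 5)), rng.random_sample(100)
Xv, yv = rng.random_sample((40, 5)), rng.random_sample(40)

xgb.XGBRegressor(n_estimators=10).fit(
    X, y,
    sample_weight=np.ones(100),            # one array for the training data
    eval_set=[(Xv, yv)],
    sample_weight_eval_set=[np.ones(40)],  # a list: one weight array per eval set
    verbose=False,
)
```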
A related confusion: the normal xgboost scikit-learn API (https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit) has fit take early_stopping_rounds, etc.
Is it correct that the Dask API has no such option and so cannot do early stopping? That would be a critical missing feature.
I understand the docs say callbacks are not tested with Dask: https://xgboost.readthedocs.io/en/latest/tutorials/dask.html#limitations
But does this mean no eval_metric or early stopping is possible?
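For reference, this is what the thread is comparing against: in the 1.x single-node sklearn API, eval_metric and early_stopping_rounds are accepted by fit. A minimal runnable sketch:

```python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X, y = rng.random_sample((1000, 10)), rng.random_sample(1000)

model = xgb.XGBRegressor(n_estimators=500)
model.fit(
    X[:800], y[:800],
    eval_set=[(X[800:], y[800:])],      # held-out validation split
    eval_metric="rmse",
    early_stopping_rounds=10,           # stop when rmse stalls for 10 rounds
    verbose=False,
)
print(model.best_iteration)  # the question is how to get this with Dask
```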
Are we supposed to use Dask for distributed multi-GPU xgboost training? Or is that all deprecated, so one can just pass a dask_cudf frame into the normal model?
Also, in general, it seems like API overhead to maintain separate APIs for Dask and non-Dask. You know whether the incoming frame is dask_cudf, dask, or neither, so you should be able to use a single API. This would make it a lot easier to use.
@pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).
> it seems like API overhead to maintain separate APIs for Dask and non-Dask. You know whether the incoming frame is dask_cudf, dask, or neither, so you should be able to use a single API. This would make it a lot easier to use.
Not quite, since the user has to pass the Dask client object into the Dask API. This is because the user has wide latitude in configuring a Dask cluster.
Hmm, this suggests it's optional:
https://github.com/dmlc/xgboost/blob/master/demo/dask/sklearn_gpu_training.py#L22
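Paraphrasing what the linked demo appears to do (a sketch from the demo's layout, not a verbatim copy; whether the client assignment is required is exactly the question here):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask.array as da
import xgboost as xgb

with LocalCUDACluster() as cluster, Client(cluster) as client:
    X = da.random.random((10000, 20), chunks=(1000, 20))
    y = da.random.random(10000, chunks=1000)

    model = xgb.dask.DaskXGBRegressor(tree_method='gpu_hist')
    model.client = client  # the assignment under discussion; looks optional
    model.fit(X, y)
```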
I would still say that a single API, just calling model.fit()/model.predict() regardless, is highly beneficial. If "client" must be passed, it can be an extra optional kwarg (default None) that is asserted on in case a Dask frame is passed.
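A hypothetical sketch of that dispatch; the class and helper names (UnifiedRegressor, _fit_dask, _fit_local) are invented for illustration and are not part of the xgboost API:

```python
import dask.array as da
import dask.dataframe as dd

DASK_TYPES = (da.Array, dd.DataFrame, dd.Series)

class UnifiedRegressor:
    """Sketch: one fit() regardless of backend."""

    def fit(self, X, y, client=None, **kwargs):
        if isinstance(X, DASK_TYPES):
            # Dask input: require (or later, lazily discover) a client.
            assert client is not None, "pass a dask.distributed.Client for Dask inputs"
            return self._fit_dask(X, y, client=client, **kwargs)
        return self._fit_local(X, y, **kwargs)

    def _fit_dask(self, X, y, client, **kwargs):
        raise NotImplementedError  # distributed path

    def _fit_local(self, X, y, **kwargs):
        raise NotImplementedError  # single-node path
```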
> @pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).

That's great. I see it's merged, so it will be in 1.3.0?
@pseudotensor So DaskXGBRegressor can automatically fetch get_client() from dask.distributed if no client is given by the user.
In general, I agree that users should be able to use DaskXGBRegressor just like XGBRegressor. Any major API mismatch should be avoided.
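A minimal sketch of that fallback, assuming get_client() from dask.distributed (not the exact xgboost implementation):

```python
from dask.distributed import get_client

class DaskScikitLearnBase:
    """Sketch of the fallback; the attribute name is assumed from the snippet above."""
    _client = None

    @property
    def client(self):
        # Use the explicitly assigned client if there is one, otherwise
        # fall back to the client bound to the current Dask context.
        return self._client if self._client is not None else get_client()

    @client.setter
    def client(self, value):
        self._client = value
```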
> That's great. I see it's merged, so it will be in 1.3.0?

Yes, we aim to have a working callback mechanism for Dask in the 1.3.0 release.
Ya, and I mean even further: there really only needs to be XGBRegressor/XGBClassifier; I can't see why we need Dask-specialized versions. From a usability perspective it's extra confusion. Setting up the Dask client and the frames involves specialized options, but fit/predict/etc. do not need those, I think.
(For context, the primary "sample_weight(s)" issue is still there.)
We can consider merging DaskXGBRegressor with XGBRegressor once we have feature parity between the two. The Dask API must support the callback mechanism first, and that's currently in progress.
Assigning myself to this issue, to address any mismatch between the DaskXGBRegressor and XGBRegressor APIs.
Thanks. I'm not even sure things function as they are. At least PyCharm does not like me passing sample_weights, probably because one of the base classes uses sample_weight. There is no way to get PyCharm to accept it, plural or not.
@hcho3 BTW, a reason why it's not a good idea to have the model accept "client" is that "client" is not serializable, so this will break anything that relies on pickle. Using the context-manager approach as above, or a lazy getter from Dask, is probably best.
That is, I experienced this just now: following https://github.com/dmlc/xgboost/blob/master/demo/dask/sklearn_gpu_training.py#L22 and adding the client attribute leads to a model that cannot be pickled.
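What that failure looks like, with get_booster as a workaround (a sketch; the exact exception raised when pickling a live Client may vary):

```python
import pickle
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((100, 5), chunks=(50, 5))
    y = da.random.random(100, chunks=50)

    model = xgb.dask.DaskXGBRegressor(n_estimators=5)
    model.client = client  # the assignment that breaks pickling
    model.fit(X, y)

    try:
        pickle.dumps(model)            # fails: the attached Client is not picklable
    except Exception:                  # exact error type varies by distributed version
        booster = model.get_booster()  # the plain Booster carries no client
        pickle.dumps(booster)          # and round-trips fine
```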
For the skl interface, accepting an optional client parameter might be necessary, as the client object carries out most of the async switches. If a wrong client is returned by get_client, the rest will be completely broken.
The typo, however, should be fixed. @hcho3 I will submit a PR for that after getting around the overflow issue.
@trivialfis What should we do about client not being serializable?
If "client" must be passed, it can be an extra optional kwarg (default None) that is asserted on in case dask frame is passed.
I tried that before settling on the current interface and went through some discussions; the result is that it's not possible at the moment. But I agree that's something we can work toward in the future: a unified interface with different backends. Right now, getting the Dask interface to feature parity with single-node is the priority. Sorry for the inconvenience.
> What should we do about client not being serializable?
Users can use the functional train and predict interfaces. Or avoid setting the client object on the skl interface when it's not needed. Or use get_booster to obtain the real model. Or ...
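One library-side option implied above is to drop the client during pickling. A standalone sketch of the idea (invented class name; not the actual xgboost code):

```python
class SerializableDaskModel:
    """Sketch: make a client-holding model picklable by dropping the client."""

    def __init__(self, client=None):
        self._client = client

    def __getstate__(self):
        # Drop the unpicklable Client; the user (or a get_client() fallback)
        # can re-attach one after unpickling.
        state = self.__dict__.copy()
        state['_client'] = None
        return state
```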