Some data transformations -- including over/under-sampling (#1454), outlier removal, instance reduction, and other forms of dataset compression, like that used in BIRCH (#3802) -- entail altering a dataset at training time, but leaving it unaltered at prediction time. (In some cases, such as outlier removal, it makes sense to reapply a fitted model to new data, while in others model reuse after fitting seems less applicable.)
As noted elsewhere, transformers that change the number of samples are not currently supported, certainly in the context of `Pipeline`s, where a transformation is applied both at `fit` and `predict` time (although a hack might abuse `fit_transform` to make this not so). `Pipeline`s of `Transformer`s also would not cope with changes in the sample size at fit time for supervised problems, because `Transformer`s do not return a modified `y`, only `X`.
To handle this class of problems, I propose introducing a new category of estimator, called a `Resampler`. It must define at least a `fit_resample` method, which `Pipeline` will call at `fit` time, passing the data unchanged at other times. (For this reason, a `Resampler` cannot also be a `Transformer`, or else we need to define their precedence.)
For many models, `fit_resample` need only return `sample_weight`. For sample compression approaches (e.g. that in BIRCH), this is not sufficient, as the representative centroids are modified from the input samples. Hence I think `fit_resample` should return altered data directly, in the form of a dict with keys `X`, `y` and `sample_weight` as required. (It still might be appropriate for many `Resampler`s to only modify `sample_weight`; if necessary, another `Resampler` can be chained that realises the weights as replicated or deleted entries in `X` and `y`.)
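For concreteness, this is roughly what a resampler might look like under this proposal; the class name and its balancing rule are illustrative inventions, not part of the proposal itself:

```python
import numpy as np
from sklearn.base import BaseEstimator


class RandomUnderSampler(BaseEstimator):
    """Hypothetical resampler: balance classes by downsampling the majority."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.RandomState(self.random_state)
        X, y = np.asarray(X), np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        # Keep n_min randomly chosen samples of every class.
        keep = np.concatenate([
            rng.permutation(np.flatnonzero(y == c))[:n_min] for c in classes
        ])
        # Return only the keys this resampler alters, per the dict proposal.
        return {'X': X[keep], 'y': y[keep]}
```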
I hear this positively after discussing this very same problem with @MechCoder. Can you write a few lines of code showing the way you would like to pipe something like Birch with an estimator that supports `sample_weight`?
I'm not sure about piping Birch with sample weights, but BIRCH could be implemented as `make_pipeline(BirchResampler, PredictorToResampler(SomeClusterer), KNeighborsClassifier)`. Not that it's so neat, but it gives an example of the power of the approach. (`PredictorToResampler` simply takes the predictions of a method and returns them as the `y` for the input `X`.)
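`PredictorToResampler` does not exist anywhere; a rough sketch of the semantics just described (fit a predictor, then emit its predictions as the new `y`), using the dict form proposed above, might be:

```python
from sklearn.base import BaseEstimator, clone


class PredictorToResampler(BaseEstimator):
    """Hypothetical adapter: a predictor's output becomes the y for the input X."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit_resample(self, X, y=None):
        # Fit the wrapped estimator, then relabel X with its predictions.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return {'X': X, 'y': self.estimator_.predict(X)}
```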
I think we should list a few use cases to come up with an API that does the job. The code seems a bit too generic for a single use case, though again I acknowledge its relevance given our work on Birch.
I think that this issue is a core API issue, and a blocker for 1.0.
Thanks for opening the debate.
> To handle this class of problems, I propose introducing a new category of estimator, called a Resampler. It must define at least a fit_resample method, which Pipeline will call at fit time, passing the data unchanged at other times. (For this reason, a Resampler cannot also be a Transformer, or else we need to define their precedence.)
Why conflate fit and resample? I can see use cases for a separate fit and resample.

Also, IMHO, the fact that transform does not modify y is a design failure (mine). I would be happier to define a new method, similar to transform, that modifies y (I am looking for a good name), and to progressively phase out 'transform'. That way we avoid introducing a new class of object, and a new concept. The more concepts and classes of objects there are in a library, the harder it is to understand.
Finally, I don't really like the name 'resample'. I find that it is too specific, and that there are other use cases for the method than resampling (semi-supervised learning to propagate labels to unlabelled data, for instance).

As for suggestions of names: the name transform is just too good, IMHO. In the long run, we could come back to it, after a couple of years of deprecation of the old behavior. The new behavior would be that it always returns the same number of arrays as it is given (and raises an error if only X is given for a supervised method that needs y).
Modifying `y` is not the fundamental issue here. Yes, that's something else that needs to be handled. The issue here is that the set of samples passed out of resample is not necessarily the set passed in. This sort of operation (of which resampling is emblematic, but I am happy to find it a better name) is frequently required for training, and is rarely the right thing to do at test time, when you want the predictions to correspond to the inputs.

It is not just the "mostly happens [in a pipeline context] at fit time" (and yes, as above, there are cases where a fitted model will be reapplied, especially outlier detection) that sets this apart from transformers, which must apply equally at fit and runtime; it is also the idea that the sample size can change.
So never mind modifying `y`. A transformer that allows the sample size to change cannot be used in a `FeatureUnion`. A transformer that allows the sample size to change cannot be used in a `Pipeline` unless it also modifies `y`, because `score` will break; but even so, it seems a strange definition of scoring a dataset if it is modified as such.
So as much as redesigning the transformer API may be desirable, there is value IMO in a distinct type of estimator that: (a) has effect in a `Pipeline` during training and none otherwise; (b) is allowed to change the sample size, where `Transformer`s or their successors should continue not to.

The idea of the name "resample" is that the most important job of this class of estimators is to change the sample size in some way, by oversampling, otherwise re-weighting, compressing, or incorporating unlabelled instances from elsewhere.
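To make the `FeatureUnion` point concrete: a union horizontally stacks each branch's output, so every branch must return the same number of rows, and a sample-dropping branch breaks the stacking. An illustrative snippet (shapes invented):

```python
import numpy as np

X = np.random.rand(100, 5)
out_a = X[:, :2]           # an ordinary transformer keeps all 100 samples
out_b = X[:80]             # a sample-dropping branch leaves only 80 rows
np.hstack([out_a, out_b])  # ValueError -- this is what FeatureUnion does
```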
> A transformer that allows the sample size to change cannot be used in a FeatureUnion.

That's the argument that I was missing. Thanks! Are there other cases?
> The idea of the name "resample" is that the most important job of this class of estimators is to change the sample size in some way, by oversampling, otherwise re-weighting, compressing, or incorporating unlabelled instances from elsewhere.

Based on your arguments justifying the need for the new class, I've been thinking about the name. And indeed, it should revolve around the notion of sample, and maybe even the term "sample", as this is what we use in scikit-learn. The most explicit term would be "transform_samples", but I think that this is too long (we might need things like "fit_transform_samples").
One thing that I am worried about, however, is that if we introduce a "resample" and keep the old "transform" method, it will be ambiguous what a pipeline means. Of course, we can introduce an argument to the pipeline, or create a new pipeline variant. However, I am worried that the added complexity for users and developers does not justify creating the extra object compared to Transformers. In other terms, I think that we would be better off saying that some transformers change the number of samples (and we can create an extra sub-class for that).
And would this subclass of transformers also only operate at fit time? I think this is different enough to motivate a different family of estimators, but I might be wrong.
This type of estimator pipelining can also easily be modelled with meta-estimators. The only real problems there are the uncomfortable nesting of param names (although I did once play with some magic that allows a nested structure to be wrapped so that parameters can be renamed, or their values tied), and that flat is better than nested.
> And would this subclass of transformers also only operate at fit time?

Is there a reason why a fit_transform wouldn't solve that problem?
`fit_transform` solves that component if `fit_transform` and `fit().transform` are allowed to have different results. I think transformers are confusing enough to many users even while more-or-less promising the functional equivalence of `fit_transform` and `fit().transform`.
> fit_transform solves that component if fit_transform and fit().transform are allowed to have different results. I think transformers are confusing enough to many users even while more-or-less promising the functional equivalence of fit_transform and fit().transform.

Quite clearly I agree with you that breaking this equivalence would be a very bad idea. But I am not sure why it would be necessary (although I am starting to get your point, I am not yet convinced that it is not possible to implement a transform method that has the logic necessary to have a match between fit().transform and fit_transform).
I'm preparing some examples so that we have something to point at in discussion, but it takes longer than writing quick responses!
> I'm preparing some examples so that we have something to point at in discussion, but it takes longer than writing quick responses!

Thank you. This is very useful!
Thanks for restarting the discussion on this. So with implementing something that, say, resamples the classes to equal sizes during training, there are three distinct problems:

1) It changes y.
2) It resamples during training, but we want to have predictions for all samples during test time.
3) This estimator could not be `FeatureUnion`'ed with anything else.

The first one might be solved by changing the behavior of transformers; for the other two it is not as obvious what to do.
I think we might still get away with the transformer interface, though.
I would not worry too much about 3). I think raising a sensible error when someone tries that would be fine. This should be pretty easy to detect.
3) is maybe the most tricky one, as it will definitely require some new mechanism, and we should be careful about whether it is worth adding this complexity.
How about adding `transform(X, y, fit=False)` or `transform(X, y, filter=False)`, or some such keyword that controls whether dropping samples is allowed or not? In a pipeline, the option could then depend on whether someone called "fit" or not.
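A sketch of how that keyword might behave; the class and its outlier rule are hypothetical, and note that the two branches return differently shaped things, which is part of what is being debated:

```python
import numpy as np
from sklearn.base import BaseEstimator


class OutlierFilter(BaseEstimator):
    """Hypothetical transformer following the keyword suggestion above."""

    def fit(self, X, y=None):
        # Toy rule: flag anything beyond 3 standard deviations as an outlier.
        z = (X - X.mean(axis=0)) / X.std(axis=0)
        self.inlier_mask_ = (np.abs(z) < 3).all(axis=1)
        return self

    def transform(self, X, y=None, filter=False):
        if not filter:
            return X  # predict/score time: samples pass through untouched
        mask = self.inlier_mask_  # fit time: the pipeline would pass filter=True
        return X[mask], (y[mask] if y is not None else None)
```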
That makes me think: what are the cases when we want different behavior during fitting and predicting? Do we always want to resample during learning, but not during prediction? What if we want to do some visualization and want to filter out outliers for both?
@jnothman As far as I understand this discussion (sorry if I missed something, I just quickly skimmed through, especially just the parts that say Birch :P), you mean to subclass Birch (and other instance reduction methods) from a new class of estimators, called Resamplers, whose `fit_resample` method we call during the `fit` of Pipeline, right? Some naive questions for starters:

1. If `n_clusters` is large enough, how do we draw a line between whether a `fit_resample` or a `fit_transform` should be called?
2. `brc.subcluster_centers_` might be much more useful than transforming the input data into the `subcluster_centers_` space, especially when piped with `AgglomerativeClustering` et al., which is what is being done internally.

@MechCoder
Firstly, I'm not sure that reimplementing BIRCH is what I intend here. It's more that this type of algorithm can be framed as a pipeline of reduction, clustering, etc. There should be a _right way_ to cobble together estimators into these sorts of things in scikit-learn, to whatever extent the API facilitates it. As for reimplementing BIRCH itself, the resampler could be pulled out as a separate component, and the full clusterer can be offered as well.
Yes, using MiniBatchKMeans for the instance reduction is equally applicable; the fact that it happens to define `transform` with some different semantics means that however it is wrapped as a resampler needs to appear as a separate class (somewhat like how `WardAgglomeration` and `Ward` are distinct classes).
Classifiers or clusterers or regressors that happen to implement `transform` are a little problematic in general because, as you suggest, the semantics of the associated transformation are not necessarily inherent to the predictor, are not necessarily described in the same reference texts as the predictor, etc. For instance, despite suggesting in #2160 that for consistency all estimators with `coef_` or `feature_importances_` should also have `_LearntSelectorMixin` to act as a feature selector, I later thought the approach of the now-stale #3011 would be more appropriate, where we replace this mixin with a way to wrap a classifier/regressor so that it acts as a feature selector; alternatively, a method of a classifier/regressor like `.as_feature_selector()` could perform the same magic. The idea is to more clearly separate model and function.
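Such a wrapper might look roughly like the following sketch (the class name is invented, and today's `SelectorMixin` is borrowed for brevity; compare scikit-learn's later `SelectFromModel`):

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.feature_selection import SelectorMixin


class WrapAsFeatureSelector(SelectorMixin, BaseEstimator):
    """Hypothetical wrapper: use a fitted model's coefficients for selection."""

    def __init__(self, estimator, threshold=1e-5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def _get_support_mask(self):
        # Aggregate per-feature magnitudes from coef_ or feature_importances_.
        coef = getattr(self.estimator_, 'coef_',
                       getattr(self.estimator_, 'feature_importances_', None))
        scores = np.abs(np.atleast_2d(coef)).sum(axis=0)
        return scores > self.threshold
```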
@amueller
> 3) is maybe the most tricky one
Did you mean (2)?
> what are the cases when we want different behavior during fitting and predicting? Do we always want to resample during learning, but not during prediction? What if we want to do some visualization and want to filter out outliers for both?
I think this is a key question. Certainly there must be a way to reapply the fitted resampling where appropriate; visualisation is a good example of such. Yet perhaps this is no big deal to expect users to do without the pipeline magic.
@jnothman yes, I meant (2).
Sorry, I'm not sure I understand your reply. What do you mean by "without the pipeline magic"? That users should not be able to use pipeline in this case? Or that the heuristic of not applying resampling for `predict`, `score` or `transform` should be the default, but there should be an option to not use this heuristic?
Btw, this heuristic gives me no option to compute the score on the training set that was used, which is a bit odd.
I'm not entirely happy with it, but I've mocked up some examples (not plots, just usage code) at https://gist.github.com/jnothman/274710f945e311697466
> What do you mean by "without the pipeline magic"? That users should not be able to use pipeline in this case?

I mean that currently there are cases where Pipeline can't reasonably be used. It's particularly useful for grid searches, etc., where cloning and parameter setting are involved, while requiring the visualisation of inliers to not use a Pipeline object probably doesn't hurt the user that much.
I agree it's a bit upsetting that this model would not provide a way to compute the training score.
To summarize a discussion with @GaelVaroquaux: we both thought that breaking the equivalence of `fit().transform()` and `fit_transform` might be a viable way forward. `fit_transform` would subsample, but `fit().transform()` would not.
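Concretely, the semantics being floated would legitimise behaviour like this sketch (the class and its row-dropping rule are invented for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator


class DropIncompleteRows(BaseEstimator):
    """Illustrative only: fit_transform subsamples; fit().transform does not."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Test time: identity on the samples; output rows match input rows.
        return X

    def fit_transform(self, X, y=None):
        # Fit time: drop rows containing NaN, along with the matching y entries.
        keep = ~np.isnan(X).any(axis=1)
        return (X[keep], y[keep]) if y is not None else X[keep]
```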
I think it's time to resolve this. We are already breaking fit_transform and transform equivalence elsewhere.
But are you sure we want to allow fit_transform to return (X, y, props) sometimes and only X at others? Do we then require transform to return only X, or is it also allowed to change y? (I think we should not allow it to change y; it is a bad idea for evaluation.)
We also have a small problem in pipeline's handling of fit_params: any fit_params downstream of a resampler cannot be used and should raise an error. (Any props returned by the resampler need to be interpreted with the pipeline's routing strategy.) Indeed maybe it is a design fault in pipeline, but the handling of sample props and y there assumes that fit_transform's output is aligned with the input, sample for sample.
I find these arguments together compelling to suggest that this deserves a separate method, e.g. fit_resample, not just an option for a transformer to return a tuple that results in very different handling. I do not, however, think we should have a corresponding sample method (and find imblearn's Pipeline.sample method quite problematic). At test time, transform should be called, or else we could consider all resamplers to perform the identity transform at test time. (On objects supporting fit_resample, fit_transform should be forbidden.)
Let's make this happen.
I think for now we should forbid resamplers from implementing transform, as the common use cases are identity transforms, and allowing transform is then possible in the future without breaking backwards compatibility.
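In Pipeline terms, the behaviour proposed here might look roughly like the following sketch (hypothetical helper functions with props routing elided; not actual Pipeline code):

```python
def _fit_steps(steps, X, y):
    # Fit time: resamplers may change the sample set.
    for name, step in steps[:-1]:
        if hasattr(step, 'fit_resample'):
            X, y, props = step.fit_resample(X, y)  # props routing elided
        else:
            X = step.fit_transform(X, y)
    return X, y


def _transform_steps(steps, X):
    # Predict/score time: resamplers act as the identity on the samples.
    for name, step in steps[:-1]:
        if hasattr(step, 'fit_resample'):
            continue
        X = step.transform(X)
    return X
```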
Proposal of work on this and #9630:

Implementation:

1. Implement `OutlierDetectorMixin` as a concrete example of an estimator with `fit_resample`. `fit_resample` is defined to return `(X, y, props)` corresponding to only the inliers. This way outlier detectors will act as outlier removers in a Pipeline once the rest of the work is complete (see #9630). They are here only as a tangible example of a resampler. `props` is merely a dict of params that would be passed to fit: `{'sample_weight': [...]}` or `{}` most often.
2. Add `OutlierDetectorMixin` where appropriate. Test it.
3. Common tests:
   - output of `fit_resample` is of correct structure
   - output of `fit_resample` has consistent lengths
   - `fit_resample` is consistent for repeated calls
   - having `fit_resample` means no `fit_transform` or `transform`
4. Handle `fit_resample` in `Pipeline.fit`, making sure that props are handled correctly (no props should already be set for downstream pipeline steps; returned props should be interpreted as `Pipeline.fit`'s `fit_params` are).
5. Add a `fit_resample` method to Pipelines whose last step has `fit_resample`.

Documentation
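A minimal sketch of the mixin in step 1 (assuming the scikit-learn convention that an outlier detector's `fit_predict` returns +1 for inliers and -1 for outliers, and assuming array inputs):

```python
import numpy as np


class OutlierDetectorMixin:
    """Sketch: let an outlier detector act as an outlier remover when fitting."""

    def fit_resample(self, X, y=None, **props):
        # Outlier detectors in scikit-learn label inliers +1 and outliers -1.
        inliers = self.fit_predict(X) == 1
        X = np.asarray(X)[inliers]
        y = None if y is None else np.asarray(y)[inliers]
        # Subset every sample-aligned prop the same way.
        return X, y, {k: np.asarray(v)[inliers] for k, v in props.items()}
```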
I'm happy to open this to a contributor (or a GSoC) if others think this is the right way to go. Ping @glemaitre - perhaps implement other resamplers (e.g. an oversampler), perhaps based on #1454.
Do you think that `OutlierDetectorMixin` will be a good naming for resamplers?
An outlier detector is a kind of resampler that removes outliers. Not all resamplers are outlier detectors.
The idea is to start by implementing something tangible, rather than an abstract API or Pipeline that cannot be tested.
I've clarified above.
OK. Could you link to the props PR or issue? Otherwise the plan seems good. Maybe we should add that the different methods need to pass the common tests in estimator_checks.
By props I just mean a dict of params that would be passed to fit: `{'sample_weight': [...]}` or `{}` most often.
This looks like a very interesting project. I would like to explore more, and will ask for help if there's a doubt.
I've marked this help wanted and would really like to see someone pursuing https://github.com/scikit-learn/scikit-learn/issues/3855#issuecomment-357949997, if only so that we have a concrete implementation to reach consensus on.
I would name the mixin `ResamplerMixin` or `SamplerMixin`. Also, I would name the method `fit_sample` to be in line with imbalanced-learn.
I think doing this in line with imbalanced-learn would be good and we should just adopt their stuff. cc @NicolasHug also ;)
I could work on that
go for it, I'd say!
I would be happy to see this moving forward. Ping me if I can help :)
I recall some weird behaviours in imblearn that I don't want to replicate here. Generative estimators support a `sample` method, so `fit_sample` sounds like something else. It also sounds to me like the naive user would expect it to mean "fit on a sample", i.e. partial fit. So I'm against `fit_sample`.
I recall `fit_resample` being from your hand, if I am not wrong?

I don't mind what else, I just don't like `fit_sample`.
> It also sounds to me like the naive user would expect it to mean "fit on a sample", i.e. partial fit. So I'm against fit_sample

+1
It's a bit of a kludge, but this was my implementation for an sklearn pipeline that can resample:
https://github.com/dmbee/seglearn/blob/master/seglearn/pipe.py
I don't think we can reuse the transform verb. And I don't think we can transform y at test time. Why did you think it was appropriate to do so in seglearn?
I agree with you Joel that most use cases will not need or even prohibit resampling during test.
seglearn deals with time series / sequences and resampling is still required at test for segmenting the data. I am not sure if any other applications would also require resampling at test. None come to mind at the moment.
In any case, you could use the transform verb for the resampler (as I have) and call it only during fit / transform. However, the implementation you proposed also makes sense. I just posted my implementation as a working example.
Happy to help out if needed.
@dmbee, you're welcome to help out with a contribution here. We need someone capable and dedicated to push this through with mentoring from the core dev team.
@jnothman - ok I'll get started and aim to complete it over the winter holidays.
@jnothman - I have one question about your proposed implementation: `sample_weight` is passed to the pipeline during fit / fit_transform / fit_predict as part of `**fit_params`, and has to be prefixed with the label for the final estimator (e.g. `'logreg__sample_weight'`).

I think the cleanest solution is to have pipeline check whether `*__sample_weight` is in fit_params and pass this as `sample_weight` to the resampler. The resampler will then return `(X, y, sample_weight)`, which can be used to overwrite `sample_weight` for the final estimator.

I am not sure why you want a dict of props to be returned by `fit_resample`, since each step in the pipeline only receives the fit_params assigned to it using the prefixing API.
Cheers,
David
Roughly, here is the template I was thinking could work well for resamplers:

```python
class ResamplerMixin(object):

    def fit_resample(self, X, y, **fit_params):
        # Pull sample_weight out so it can be resampled alongside X and y.
        sample_weight = None
        if 'sample_weight' in fit_params:
            sample_weight = fit_params.pop('sample_weight')
        return self.fit(X, y, **fit_params).resample(X, y, sample_weight)

    def fit(self, X, y):
        return self

    def resample(self, X, y, sample_weight):
        # Default: identity resampling.
        return X, y, sample_weight
```
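As a quick sanity check, the template's defaults behave as an identity resampler that routes `sample_weight` straight through (illustrative usage of the class above):

```python
import numpy as np

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])
w = np.ones(6)

X_r, y_r, w_r = ResamplerMixin().fit_resample(X, y, sample_weight=w)
assert X_r is X and y_r is y and w_r is w  # nothing resampled by default
```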
You're asking the right questions, so this is very encouraging :)
> I am not sure why you want a dict of props to be returned by fit_resample, since each step in the pipeline only receives the fit_params assigned to it using the prefixing API.
Well, we do have a problem that we've not really agreed on the design of a sample property routing mechanism, and Pipeline's current one is simply not designed to work where the properties cannot be specified for each step when fit is called.
But if we want `fit_resample` to be able to modify or generate `sample_weight`, then it needs to be able to modify or generate other things (unspecified) that align to each sample; alignment to the original samples doesn't make sense. Hence a dict.
However, the handling of this in a Pipeline is open for debate.

> no props should already be set for downstream pipeline steps

I think this must be the case.

> returned props should be interpreted as Pipeline.fit's fit_params are

For now I would consider raising a NotImplementedError if in a pipeline and the dict is not empty, and we can work out what the appropriate behaviour should be once we've worked out sample prop routing (may I have the time to consider it!)...
Thanks Joel. I'm pretty sure I understand what you are after. Essentially: some fit_params (e.g. `sample_weight` and perhaps others) are tied to the samples and must accordingly be modified by the resamplers for any downstream estimators. Let's call these sample_props.

A backwards-compatible implementation that would involve the least changes to the API (existing transformers, estimators, etc.) could be as follows: add an optional parameter to the pipeline fit routines called `sample_props`, a list of strings corresponding to the keys of the fit_params that are sample_props and need to be modified by the resamplers. The relevant parameters can then be passed and updated by the resamplers. The pipeline `_fit` routine is modified to get the relevant fit_params at each step as follows:
```python
for step_idx, (name, transformer) in enumerate(self.steps[:-1]):
    # Pop this step's own params; iterate over a copy since the dict is mutated.
    fit_params_step = {key.split('__', 1)[1]: fit_params.pop(key)
                       for key in list(fit_params) if key.startswith(name + '__')}
    if isinstance(transformer, ResamplerMixin):
        props = {key: fit_params[key] for key in sample_props}
        X, y, props = transformer.fit_resample(X, y, props=props, **fit_params_step)
        fit_params.update(props)  # replace sample props with resampled versions
    else:
        ...  # a regular transformer
```
No rush on making this architecture decision, but I'd like to have a plan before moving forward with writing the code.
(Edited to use pop to avoid recalculating sample_props for upstream transformers.)
Don't worry about the Pipeline fit routine in particular. Worry about the fit_resample API. In general, fit methods can take additional parameters beyond (X, y) that are aligned with X, i.e. sample props. I believe fit_resample needs to be able to return such things as well as take them as input. How they are then handled in a Pipeline is an open question, and IMO we don't need to commit on that yet to create the resampling API.
Fair enough. I suppose modifying the pipeline was the part I found most interesting. In any case, are you OK with this as the rough outline for the API?
```python
class ResamplerMixin(object):

    def fit_resample(self, X, y, props=None, **fit_params):
        return self.fit(X, y, **fit_params).resample(X, y, props)


class TakeOneSample(BaseEstimator, ResamplerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def resample(self, X, y, props):
        # Return the first sample of X, y and of every sample-aligned prop.
        return X[0], y[0], {k: props[k][0] for k in props}
```
I don't know any use case that requires a resample method, really, and it certainly complicates things in Pipeline: why would we change the set of samples at test time, and how would that affect scoring a pipeline's predictions?

That is, I propose supporting fit_resample, but not resample.
> why would we change the set of samples at test time, and how would that affect scoring a pipeline's predictions?

Yes, changing the number of samples at test time is a semantic that is unclear to me. I would rather steer away from it.
+1 for only having `fit_resample` defined in the mixin. I don't recall any use case having only `resample`.

Regarding the `Pipeline` implementation, I think that the changes made in the imblearn implementation should do the trick, or at least be a good start. The issue regarding the `props` handling will remain.
Regarding the handling of the `sample_props` in the resampler itself, it looks like @dmbee would go in the same direction as what we thought of to handle `sample_weight`: https://github.com/scikit-learn-contrib/imbalanced-learn/pull/463/files

@dmbee, do not hesitate to reuse some of the code/tests of imblearn. You can also ping me to review the PR. I'm going to become active again in scikit-learn from next week.
> I don't know any use case that requires a resample method, really, and it certainly complicates things in Pipeline: why would we change the set of samples at test time, and how would that affect scoring a pipeline's predictions?
Above, it was not my intent to support resampling at test time. I understand that is out of scope here / niche.
Allowing separate fit / resample methods (instead of just fit_resample) does not affect how the resampler is used in a pipeline or the complexity of the pipeline, in my view (see my pipeline code above). It came to mind that it might be useful to separate the fit and resample methods for some potential use cases (e.g. generative resampling).

However, if this capability is not desirable, we can use an API very similar to imblearn's, adding handling of sample_props, which is straightforward if the resampler is just indexing the data. See the rough example below.
Let me know your thoughts.
```python
import numpy as np
from abc import abstractmethod
from sklearn.base import BaseEstimator


class ResamplerMixin(object):

    def fit_resample(self, X, y, props=None, **fit_params):
        # Gets called by pipeline._fit()
        self._check_data(X, y, props)
        return self._fit_resample(X, y, props, **fit_params)

    @abstractmethod  # must be implemented in derived classes
    def _fit_resample(self, X, y, props=None, **fit_params):
        return X, y, props

    def _check_data(self, X, y, props):
        # To be expanded upon: every sample-aligned prop must match len(y).
        if props is not None:
            if not np.all(np.array([len(props[k]) for k in props]) == len(y)):
                raise ValueError("props entries must have the same length as y")

    def _resample_from_indices(self, X, y, props, indices):
        if props is not None:
            return X[indices], y[indices], {k: props[k][indices] for k in props}
        return X[indices], y[indices], None


class TakeOneSample(BaseEstimator, ResamplerMixin):

    def __init__(self, index=0):
        self.index = index

    def _fit_resample(self, X, y, props=None, **fit_params):
        return self._resample_from_indices(X, y, props, self.index)
```
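Illustrative usage of the sketch above (passing a list index keeps the arrays two-dimensional):

```python
import numpy as np

X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 0, 1, 0])
props = {'sample_weight': np.array([1., 2., 3., 4., 5.])}

X_r, y_r, props_r = TakeOneSample(index=[0]).fit_resample(X, y, props=props)
# X_r == [[0, 1]]; y_r == [0]; props_r == {'sample_weight': [1.]}
```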
@glemaitre - thank you. I certainly have no desire to replicate what has been done in imblearn (I like and use that package, by the way!). It seems the main thing lacking atm in imblearn is the pipeline / resampler changes required to support sample_props. Otherwise, it seems very compatible with sklearn.
I don't really like that we're committing to a format for the sample props here, in a sense, but I guess it's not that different from the handling of `fit_params` in the cross-validation code right now. So I think it should be good to go.
> I certainly have no desire to replicate what has been done in imblearn

Actually, you should take whatever works for scikit-learn from imblearn. Our idea is to contribute whatever is good upstream and remove it from our code base. We are at a stage where the API is starting to be more stable, and we recently made some changes to reflect some discussions with @jnothman. Bottom line: take whatever is beneficial for scikit-learn ;)
> So I think it should be good to go.

Well, in theory, we would need to work on the SLEPs. However, it may be a good exercise to move this issue forward, in order to be able to write a concise SLEP.
> Actually, you should take whatever works for scikit-learn from imblearn. Our idea is to contribute whatever is good upstream and remove it from our code base.

OK, good to know - thanks. I should have said re-implement rather than replicate. It seems most things from imblearn can be readily ported.
> Well, in theory, we would need to work on the SLEPs.

Not sure what a SLEP is...
A SLEP is an (under-used) Scikit-learn enhancement proposal. https://github.com/scikit-learn/enhancement_proposals/
Regarding the API proposal in https://github.com/scikit-learn/scikit-learn/issues/3855#issuecomment-446690557: yes, that approach to resampling looks good... Not all estimators supporting `fit_resample` can do so from indices, but I think you are aware of that; sample reduction techniques will not, for instance.
> Regarding the API proposal in #3855 (comment), yes, that approach to resampling looks good... Not all estimators supporting fit_resample can do so from indices, but I think you are aware of that; sample reduction techniques will not, for instance.
OK, good stuff. Yes - I am aware that not all resamplers will sample from indices. Those that do not will have to either implement their own method of dealing with props (if there is a sensible option - hard to know for sure without knowing what will be in props) or otherwise raise a NotImplementedError if props is not None.
Is anyone in Paris working on this? I'd be happy to help (and the API would be useful for fitting semi-supervised classifiers, as discussed in this review).
Yep, I started to port the imblearn implementation. Do you want to take over, and I can help with the review instead?
Sure, where can I find you?
I started working on this (I'm not in Paris), but got too busy over the last couple of months. I am happy to share what I have done already, and continue working on it.
> I started working on this (I'm not in Paris), but got too busy over the last couple of months. I am happy to share what I have done already, and continue working on it.
I'm starting work on this now for the sprint. If you have existing work that you think could be useful, I'd be more than happy to build on what you've done.
Here it is - look at the resample folder in sklearn: https://github.com/dmbee/scikit-learn/tree/dmbee-resampling