Some data transformations -- including over/under-sampling (#1454), outlier removal, instance reduction, and other forms of dataset compression, like that used in BIRCH (#3802) -- entail altering a dataset at training time, but leaving it unaltered at prediction time. (In some cases, such as outlier removal, it makes sense to reapply a fitted model to new data, while in others model reuse after fitting seems less applicable.)
As noted elsewhere, transformers that change the number of samples are not currently supported, certainly in the context of `Pipeline`s, where a transformation is applied both at `fit` and `predict` time (although a hack might abuse `fit_transform` to make this not so). `Pipeline`s of `Transformer`s also would not cope with changes in the sample size at fit time for supervised problems, because `Transformer`s do not return a modified `y`, only `X`.
To handle this class of problems, I propose introducing a new category of estimator, called a `Resampler`. It must define at least a `fit_resample` method, which `Pipeline` will call at `fit` time, passing the data unchanged at other times. (For this reason, a `Resampler` cannot also be a `Transformer`, or else we need to define their precedence.)
For many models, `fit_resample` need only return `sample_weight`. For sample compression approaches (e.g. that in BIRCH), this is not sufficient, as the representative centroids are modified from the input samples. Hence I think `fit_resample` should return altered data directly, in the form of a dict with keys `X`, `y` and `sample_weight` as required. (It still might be appropriate for many `Resampler`s to only modify `sample_weight`; if necessary, another `Resampler` can be chained that realises the weights as replicated or deleted entries in `X` and `y`.)
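For concreteness, this is roughly what a resampler might look like under this proposal; the class name and its balancing rule are illustrative inventions, not part of the proposal itself:

```python
import numpy as np
from sklearn.base import BaseEstimator


class RandomUnderSampler(BaseEstimator):
    """Hypothetical resampler: balance classes by downsampling the majority."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.RandomState(self.random_state)
        X, y = np.asarray(X), np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        # Keep n_min randomly chosen samples of every class.
        keep = np.concatenate([
            rng.permutation(np.flatnonzero(y == c))[:n_min] for c in classes
        ])
        # Return only the keys this resampler alters, per the dict proposal.
        return {'X': X[keep], 'y': y[keep]}
```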
I hear this positively after discussing this very same problem with @MechCoder. Can you write a few lines of code showing the way you would like to pipe something like Birch with an estimator that supports `sample_weight`?
I'm not sure about piping Birch with sample weights, but BIRCH could be implemented as `make_pipeline(BirchResampler, PredictorToResampler(SomeClusterer), KNeighborsClassifier)`. Not that it's so neat, but it gives an example of the power of the approach. (`PredictorToResampler` simply takes the predictions of a method and returns them as the `y` for the input `X`.)
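`PredictorToResampler` does not exist anywhere; a rough sketch of the semantics just described (fit a predictor, then emit its predictions as the new `y`), using the dict form proposed above, might be:

```python
from sklearn.base import BaseEstimator, clone


class PredictorToResampler(BaseEstimator):
    """Hypothetical adapter: a predictor's output becomes the y for the input X."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit_resample(self, X, y=None):
        # Fit the wrapped estimator, then relabel X with its predictions.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return {'X': X, 'y': self.estimator_.predict(X)}
```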
I think we should list a few use cases to come up with an API that does the job. The code seems a bit too generic for a single use case, though again I acknowledge its relevance given our work on Birch.
I think that this issue is a core API issue, and a blocker for 1.0.
Thanks for opening the debate.
> To handle this class of problems, I propose introducing a new category of estimator, called a Resampler. It must define at least a fit_resample method, which Pipeline will call at fit time, passing the data unchanged at other times. (For this reason, a Resampler cannot also be a Transformer, or else we need to define their precedence.)
Why conflate fit and resample? I can see use cases for a separate fit and resample.

Also, IMHO, the fact that transform does not modify y is a design failure (mine). I would be happier to define a new method, similar to transform, that modifies y (I am looking for a good name), and to progressively phase out 'transform'. That way we avoid introducing a new class of object, and a new concept. The more concepts and classes of objects there are in a library, the harder it is to understand.
Finally, I don't really like the name 'resample'. I find that it is too specific, and that there are other use cases for the method than resampling (semi-supervised learning to propagate labels to unlabelled data, for instance).

As for suggestions of names: the name transform is just too good, IMHO. In the long run, we could come back to it, after a couple of years of deprecation of the old behavior. The new behavior would be that it always returns the same number of arrays as it is given (and raises an error if only X is given for a supervised method that needs y).
Modifying `y` is not the fundamental issue here. Yes, that's something else that needs to be handled. The issue here is that the set of samples passed out of resample is not necessarily the set passed in. This sort of operation (of which resampling is emblematic, but I am happy to find it a better name) is frequently required for training, and is rarely the right thing to do at test time, when you want the predictions to correspond to the inputs.

It is not just the "mostly happens [in a pipeline context] at fit time" (and yes, as above, there are cases where a fitted model will be reapplied, especially outlier detection) that sets this apart from transformers, which must apply equally at fit and runtime; it is also the idea that the sample size can change.
So never mind modifying `y`. A transformer that allows the sample size to change cannot be used in a `FeatureUnion`. A transformer that allows the sample size to change cannot be used in a `Pipeline` unless it also modifies `y`, because `score` will break; but even so, it seems a strange definition of scoring a dataset if it is modified as such.
So as much as redesigning the transformer API may be desirable, there is value IMO in a distinct type of estimator that: (a) has effect in a `Pipeline` during training and none otherwise; (b) is allowed to change the sample size, where `Transformer`s or their successors should continue not to.

The idea of the name "resample" is that the most important job of this class of estimators is to change the sample size in some way, by oversampling, otherwise re-weighting, compressing, or incorporating unlabelled instances from elsewhere.
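To make the `FeatureUnion` point concrete: a union horizontally stacks each branch's output, so every branch must return the same number of rows, and a sample-dropping branch breaks the stacking. An illustrative snippet (shapes invented):

```python
import numpy as np

X = np.random.rand(100, 5)
out_a = X[:, :2]           # an ordinary transformer keeps all 100 samples
out_b = X[:80]             # a sample-dropping branch leaves only 80 rows
np.hstack([out_a, out_b])  # ValueError -- this is what FeatureUnion does
```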
> A transformer that allows the sample size to change cannot be used in a FeatureUnion.

That's the argument that I was missing. Thanks! Are there other cases?
> The idea of the name "resample" is that the most important job of this class of estimators is to change the sample size in some way, by oversampling, otherwise re-weighting, compressing, or incorporating unlabelled instances from elsewhere.

Based on your arguments justifying the need for the new class, I've been thinking about the name. And indeed, it should revolve around the notion of sample, and maybe even the term "sample", as this is what we use in scikit-learn. The most explicit term would be "transform_samples", but I think that this is too long (we might need things like "fit_transform_samples").
One thing that I am worried about, however, is that if we introduce a "resample" and keep the old "transform" method, it will be ambiguous what a pipeline means. Of course, we can introduce an argument to the pipeline, or create a new pipeline variant. However, I am worried that the added complexity for users and developers does not justify creating the extra object compared to Transformers. In other terms, I think that we would be better off saying that some transformers change the number of samples (and we can create an extra sub-class for that).
And would this subclass of transformers also only operate at fit time? I think this is different enough to motivate a different family of estimators, but I might be wrong.
This type of estimator pipelining can also easily be modelled with meta-estimators. The only real problems there are the uncomfortable nesting of param names (although I did once play with some magic that allows a nested structure to be wrapped so that parameters can be renamed, or their values tied), and that flat is better than nested.
> And would this subclass of transformers also only operate at fit time?

Is there a reason why a fit_transform wouldn't solve that problem?
`fit_transform` solves that component if `fit_transform` and `fit().transform` are allowed to have different results. I think transformers are confusing enough to many users even while more-or-less promising the functional equivalence of `fit_transform` and `fit().transform`.
> fit_transform solves that component if fit_transform and fit().transform are allowed to have different results. I think transformers are confusing enough to many users even while more-or-less promising the functional equivalence of fit_transform and fit().transform.

Quite clearly I agree with you that breaking this equivalence would be a very bad idea. But I am not sure why it would be necessary (although I am starting to get your point, I am not yet convinced that it is not possible to implement a transform method that has the logic necessary to have a match between fit().transform and fit_transform).
I'm preparing some examples so that we have something to point at in discussion, but it takes longer than writing quick responses!
> I'm preparing some examples so that we have something to point at in discussion, but it takes longer than writing quick responses!

Thank you. This is very useful!
Thanks for restarting the discussion on this. So with implementing something that, say, resamples the classes to equal sizes during training, there are three distinct problems:

1) It changes y.
2) It resamples during training, but we want to have predictions for all samples during test time.
3) This estimator could not be `FeatureUnion`'ed with anything else.

The first one might be solved by changing the behavior of transformers; for the other two it is not as obvious what to do.
I think we might still get away with the transformer interface, though.
I would not worry too much about 3). I think raising a sensible error when someone tries that would be fine. This should be pretty easy to detect.
3) is maybe the most tricky one, as it will definitely require some new mechanism, and we should be careful about whether it is worth adding this complexity.
How about adding `transform(X, y, fit=False)` or `transform(X, y, filter=False)`, or some such keyword that controls whether dropping samples is allowed or not? In a pipeline, the option could then depend on whether someone called "fit" or not.
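A sketch of how that keyword might behave; the class and its outlier rule are hypothetical, and note that the two branches return differently shaped things, which is part of what is being debated:

```python
import numpy as np
from sklearn.base import BaseEstimator


class OutlierFilter(BaseEstimator):
    """Hypothetical transformer following the keyword suggestion above."""

    def fit(self, X, y=None):
        # Toy rule: flag anything beyond 3 standard deviations as an outlier.
        z = (X - X.mean(axis=0)) / X.std(axis=0)
        self.inlier_mask_ = (np.abs(z) < 3).all(axis=1)
        return self

    def transform(self, X, y=None, filter=False):
        if not filter:
            return X  # predict/score time: samples pass through untouched
        mask = self.inlier_mask_  # fit time: the pipeline would pass filter=True
        return X[mask], (y[mask] if y is not None else None)
```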
That makes me think: what are the cases when we want different behavior during fitting and predicting? Do we always want to resample during learning, but not during prediction? What if we want to do some visualization and want to filter out outliers for both?
@jnothman As far as I understand this discussion (sorry if I missed something, I just quickly skimmed through, especially just the parts that say Birch :P), you mean to subclass Birch (and other instance reduction methods) from a new class of estimators, called Resamplers, whose `fit_resample` method we call during the `fit` of Pipeline, right? Some naive questions for starters:

1. If `n_clusters` is large enough, how do we draw a line between whether a `fit_resample` or a `fit_transform` should be called?
2. `brc.subcluster_centers_` might be much more useful than transforming the input data into the `subcluster_centers_` space, especially when piped with `AgglomerativeClustering` et al., which is what is being done internally.

@MechCoder
Firstly, I'm not sure that reimplementing BIRCH is what I intend here. It's more that this type of algorithm can be framed as a pipeline of reduction, clustering, etc. There should be a _right way_ to cobble together estimators into these sorts of things in scikit-learn, to whatever extent the API facilitates it. As for reimplementing BIRCH itself, the resampler could be pulled out as a separate component, and the full clusterer can be offered as well.
Yes, using MiniBatchKMeans for the instance reduction is equally applicable; the fact that it happens to define `transform` with some different semantics means that however it is wrapped as a resampler needs to appear as a separate class (somewhat like how `WardAgglomeration` and `Ward` are distinct classes).
Classifiers or clusterers or regressors that happen to implement `transform` are a little problematic in general because, as you suggest, the semantics of the associated transformation are not necessarily inherent to the predictor, are not necessarily described in the same reference texts as the predictor, etc. For instance, despite suggesting in #2160 that for consistency all estimators with `coef_` or `feature_importances_` should also have `_LearntSelectorMixin` to act as a feature selector, I later thought the approach of the now-stale #3011 would be more appropriate, where we replace this mixin with a way to wrap a classifier/regressor so that it acts as a feature selector; alternatively, a method of a classifier/regressor like `.as_feature_selector()` could perform the same magic. The idea is to more clearly separate model and function.
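Such a wrapper might look roughly like the following sketch (the class name is invented, and today's `SelectorMixin` is borrowed for brevity; compare scikit-learn's later `SelectFromModel`):

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.feature_selection import SelectorMixin


class WrapAsFeatureSelector(SelectorMixin, BaseEstimator):
    """Hypothetical wrapper: use a fitted model's coefficients for selection."""

    def __init__(self, estimator, threshold=1e-5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def _get_support_mask(self):
        # Aggregate per-feature magnitudes from coef_ or feature_importances_.
        coef = getattr(self.estimator_, 'coef_',
                       getattr(self.estimator_, 'feature_importances_', None))
        scores = np.abs(np.atleast_2d(coef)).sum(axis=0)
        return scores > self.threshold
```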
@amueller
> 3) is maybe the most tricky one
Did you mean (2)?
> what are the cases when we want different behavior during fitting and predicting? Do we always want to resample during learning, but not during prediction? What if we want to do some visualization and want to filter out outliers for both?
I think this is a key question. Certainly there must be a way to reapply the fitted resampling where appropriate; visualisation is a good example of such. Yet perhaps this is no big deal to expect users to do without the pipeline magic.
@jnothman yes, I meant (2).
Sorry, I'm not sure I understand your reply. What do you mean by "without the pipeline magic"? That users should not be able to use pipeline in this case? Or that the heuristic of not applying resampling for `predict`, `score` or `transform` should be the default, but there should be an option to not use this heuristic?
Btw, this heuristic gives me no option to compute the score on the training set that was used, which is a bit odd.
I'm not entirely happy with it, but I've mocked up some examples (not plots, just usage code) at https://gist.github.com/jnothman/274710f945e311697466
> What do you mean by "without the pipeline magic"? That users should not be able to use pipeline in this case?

I mean that currently there are cases where Pipeline can't reasonably be used. It's particularly useful for grid searches, etc., where cloning and parameter setting are involved, while requiring the visualisation of inliers to not use a Pipeline object probably doesn't hurt the user that much.
I agree it's a bit upsetting that this model would not provide a way to compute the training score.
To summarize a discussion with @GaelVaroquaux: we both thought that breaking the equivalence of `fit().transform()` and `fit_transform` might be a viable way forward. `fit_transform` would subsample, but `fit().transform()` would not.
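Concretely, the semantics being floated would legitimise behaviour like this sketch (the class and its row-dropping rule are invented for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator


class DropIncompleteRows(BaseEstimator):
    """Illustrative only: fit_transform subsamples; fit().transform does not."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Test time: identity on the samples; output rows match input rows.
        return X

    def fit_transform(self, X, y=None):
        # Fit time: drop rows containing NaN, along with the matching y entries.
        keep = ~np.isnan(X).any(axis=1)
        return (X[keep], y[keep]) if y is not None else X[keep]
```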
I think it's time to resolve this. We are already breaking fit_transform and transform equivalence elsewhere.
But are you sure we want to allow fit_transform to return (X, y, props) sometimes and only X at others? Do we then require transform to return only X, or is it also allowed to change y? (I think we should not allow it to change y; it is a bad idea for evaluation.)
We also have a small problem in pipeline's handling of fit_params: any fit_params downstream of a resampler cannot be used and should raise an error. (Any props returned by the resampler need to be interpreted with the pipeline's routing strategy.) Indeed maybe it is a design fault in pipeline, but the handling of sample props and y there assumes that fit_transform's output is aligned with the input, sample for sample.
I find these arguments together compelling to suggest that this deserves a separate method, e.g. fit_resample, not just an option for a transformer to return a tuple that results in very different handling. I do not, however, think we should have a corresponding sample method (and find imblearn's Pipeline.sample method quite problematic). At test time, transform should be called, or else we could consider all resamplers to perform the identity transform at test time. (On objects supporting fit_resample, fit_transform should be forbidden.)
Let's make this happen.
I think for now we should forbid resamplers from implementing transform, as the common use cases are identity transforms, and allowing transform is then possible in the future without breaking backwards compatibility.
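In Pipeline terms, the behaviour proposed here might look roughly like the following sketch (hypothetical helper functions with props routing elided; not actual Pipeline code):

```python
def _fit_steps(steps, X, y):
    # Fit time: resamplers may change the sample set.
    for name, step in steps[:-1]:
        if hasattr(step, 'fit_resample'):
            X, y, props = step.fit_resample(X, y)  # props routing elided
        else:
            X = step.fit_transform(X, y)
    return X, y


def _transform_steps(steps, X):
    # Predict/score time: resamplers act as the identity on the samples.
    for name, step in steps[:-1]:
        if hasattr(step, 'fit_resample'):
            continue
        X = step.transform(X)
    return X
```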
Proposal of work on this and #9630:

Implementation:

1. Implement `OutlierDetectorMixin` as a concrete example of an estimator with `fit_resample`. `fit_resample` is defined to return `(X, y, props)` corresponding to only the inliers. This way outlier detectors will act as outlier removers in a Pipeline once the rest of the work is complete (see #9630). They are here only as a tangible example of a resampler. `props` is merely a dict of params that would be passed to fit: `{'sample_weight': [...]}` or `{}` most often.
2. Add `OutlierDetectorMixin` where appropriate. Test it.
3. Common tests:
   - output of `fit_resample` is of correct structure
   - output of `fit_resample` has consistent lengths
   - `fit_resample` is consistent for repeated calls
   - having `fit_resample` means no `fit_transform` or `transform`
4. Handle `fit_resample` in `Pipeline.fit`, making sure that props are handled correctly (no props should already be set for downstream pipeline steps; returned props should be interpreted as `Pipeline.fit`'s `fit_params` are).
5. Add a `fit_resample` method to Pipelines whose last step has `fit_resample`.

Documentation
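A minimal sketch of the mixin in step 1 (assuming the scikit-learn convention that an outlier detector's `fit_predict` returns +1 for inliers and -1 for outliers, and assuming array inputs):

```python
import numpy as np


class OutlierDetectorMixin:
    """Sketch: let an outlier detector act as an outlier remover when fitting."""

    def fit_resample(self, X, y=None, **props):
        # Outlier detectors in scikit-learn label inliers +1 and outliers -1.
        inliers = self.fit_predict(X) == 1
        X = np.asarray(X)[inliers]
        y = None if y is None else np.asarray(y)[inliers]
        # Subset every sample-aligned prop the same way.
        return X, y, {k: np.asarray(v)[inliers] for k, v in props.items()}
```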
I'm happy to open this to a contributor (or a GSoC) if others think this is the right way to go. Ping @glemaitre - perhaps implement other resamplers (e.g. an oversampler), perhaps based on #1454.
Do you think that `OutlierDetectorMixin` will be a good naming for resamplers?
An outlier detector is a kind of resampler that removes outliers. Not all resamplers are outlier detectors.
The idea is to start by implementing something tangible, rather than an abstract API or Pipeline that cannot be tested.
I've clarified above.
OK. Could you link to the props PR or issue? Otherwise the plan seems good. Maybe we should add that the different methods need to pass the common tests in estimator_checks.
By props I just mean a dict of params that would be passed to fit: `{'sample_weight': [...]}` or `{}` most often.
This looks like a very interesting project. I would like to explore more, and will ask for help if there's a doubt.
I've marked this help wanted and would really like to see someone pursuing https://github.com/scikit-learn/scikit-learn/issues/3855#issuecomment-357949997, if only so that we have a concrete implementation to reach consensus on.
I would name the mixin `ResamplerMixin` or `SamplerMixin`. Also, I would name the method `fit_sample` to be in line with imbalanced-learn.
I think doing this in line with imbalanced-learn would be good and we should just adopt their stuff. cc @NicolasHug also ;)
I could work on that
go for it, I'd say!
I would be happy to see this moving forward. Ping me if I can help :)
I recall some weird behaviours in imblearn that I don't want to replicate here. Generative estimators support a `sample` method, so `fit_sample` sounds like something else. It also sounds to me like the naive user would expect it to mean "fit on a sample", i.e. partial fit. So I'm against `fit_sample`.
I recall `fit_resample` being from your hand, if I am not wrong?

I don't mind what else, I just don't like `fit_sample`.
> It also sounds to me like the naive user would expect it to mean "fit on a sample", i.e. partial fit. So I'm against fit_sample

+1
It's a bit of a kludge, but this was my implementation for an sklearn pipeline that can resample:
https://github.com/dmbee/seglearn/blob/master/seglearn/pipe.py
I don't think we can reuse the transform verb. And I don't think we can transform y at test time. Why did you think it was appropriate to do so in seglearn?
I agree with you Joel that most use cases will not need or even prohibit resampling during test.
seglearn deals with time series / sequences and resampling is still required at test for segmenting the data. I am not sure if any other applications would also require resampling at test. None come to mind at the moment.
In any case, you could use the transform verb for the resampler (as I have) and call it only during fit / transform. However, the implementation you proposed also makes sense. I just posted my implementation as a working example.
Happy to help out if needed.
@dmbee, you're welcome to help out with a contribution here. We need someone capable and dedicated to push this through with mentoring from the core dev team.
@jnothman - ok I'll get started and aim to complete it over the winter holidays.
@jnothman - I have one question about your proposed implementation: `sample_weight` is passed to the pipeline during fit / fit_transform / fit_predict as part of `**fit_params`, and has to be prefixed with the label for the final estimator (e.g. `'logreg__sample_weight'`).

I think the cleanest solution is to have pipeline check whether `*__sample_weight` is in fit_params and pass this as `sample_weight` to the resampler. The resampler will then return `(X, y, sample_weight)`, which can be used to overwrite `sample_weight` for the final estimator.

I am not sure why you want a dict of props to be returned by `fit_resample`, since each step in the pipeline only receives the fit_params assigned to it using the prefixing API.
Cheers,
David
Roughly, here is the template I was thinking could work well for resamplers:

```python
class ResamplerMixin(object):

    def fit_resample(self, X, y, **fit_params):
        # Pull sample_weight out so it can be resampled alongside X and y.
        sample_weight = None
        if 'sample_weight' in fit_params:
            sample_weight = fit_params.pop('sample_weight')
        return self.fit(X, y, **fit_params).resample(X, y, sample_weight)

    def fit(self, X, y):
        return self

    def resample(self, X, y, sample_weight):
        # Default: identity resampling.
        return X, y, sample_weight
```
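As a quick sanity check, the template's defaults behave as an identity resampler that routes `sample_weight` straight through (illustrative usage of the class above):

```python
import numpy as np

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])
w = np.ones(6)

X_r, y_r, w_r = ResamplerMixin().fit_resample(X, y, sample_weight=w)
assert X_r is X and y_r is y and w_r is w  # nothing resampled by default
```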
You're asking the right questions, so this is very encouraging :)
> I am not sure why you want a dict of props to be returned by fit_resample, since each step in the pipeline only receives the fit_params assigned to it using the prefixing API.
Well, we do have a problem that we've not really agreed on the design of a sample property routing mechanism, and Pipeline's current one is simply not designed to work where the properties cannot be specified for each step when fit is called.
But if we want `fit_resample` to be able to modify or generate `sample_weight`, then it needs to be able to modify or generate other things (unspecified) that align to each sample; alignment to the original samples doesn't make sense. Hence a dict.
However, the handling of this in a Pipeline is open for debate.

> no props should already be set for downstream pipeline steps

I think this must be the case.

> returned props should be interpreted as Pipeline.fit's fit_params are

For now I would consider raising a NotImplementedError if in a pipeline and the dict is not empty, and we can work out what the appropriate behaviour should be once we've worked out sample prop routing (may I have the time to consider it!)...
Thanks Joel. I'm pretty sure I understand what you are after. Essentially: some fit_params (e.g. `sample_weight` and perhaps others) are tied to the samples and must accordingly be modified by the resamplers for any downstream estimators. Let's call these sample_props.

A backwards-compatible implementation that would involve the least changes to the API (existing transformers, estimators, etc.) could be as follows: add an optional parameter to the pipeline fit routines called `sample_props`, a list of strings corresponding to the keys of the fit_params that are sample_props and need to be modified by the resamplers. The relevant parameters can then be passed and updated by the resamplers. The pipeline `_fit` routine is modified to get the relevant fit_params at each step as follows:
```python
for step_idx, (name, transformer) in enumerate(self.steps[:-1]):
    # Pop this step's own params; iterate over a copy since the dict is mutated.
    fit_params_step = {key.split('__', 1)[1]: fit_params.pop(key)
                       for key in list(fit_params) if key.startswith(name + '__')}
    if isinstance(transformer, ResamplerMixin):
        props = {key: fit_params[key] for key in sample_props}
        X, y, props = transformer.fit_resample(X, y, props=props, **fit_params_step)
        fit_params.update(props)  # replace sample props with resampled versions
    else:
        ...  # a regular transformer
```
No rush on making this architecture decision, but I'd like to have a plan before moving forward with writing the code.
(Edited to use pop to avoid recalculating sample_props for upstream transformers.)
Don't worry about the Pipeline fit routine in particular. Worry about the fit_resample API. In general, fit methods can take additional parameters beyond (X, y) that are aligned with X, i.e. sample props. I believe fit_resample needs to be able to return such things as well as take them as input. How they are then handled in a Pipeline is an open question, and IMO we don't need to commit on that yet to create the resampling API.
Fair enough. I suppose modifying the pipeline was the part I found most interesting. In any case, are you OK with this as the rough outline for the API?
```python
class ResamplerMixin(object):

    def fit_resample(self, X, y, props=None, **fit_params):
        return self.fit(X, y, **fit_params).resample(X, y, props)


class TakeOneSample(BaseEstimator, ResamplerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def resample(self, X, y, props):
        # Return the first sample of X, y and of every sample-aligned prop.
        return X[0], y[0], {k: props[k][0] for k in props}
```
I don't know any use case that requires a resample method, really, and it certainly complicates things in Pipeline: why would we change the set of samples at test time, and how would that affect scoring a pipeline's predictions?

That is, I propose supporting fit_resample, but not resample.
> why would we change the set of samples at test time, and how would that affect scoring a pipeline's predictions?

Yes, changing the number of samples at test time is a semantic that is unclear to me. I would rather steer away from it.
+1 for only having `fit_resample` defined in the mixin. I don't recall any use case having only `resample`.

Regarding the `Pipeline` implementation, I think that the changes made in the imblearn implementation should do the trick, or at least be a good start. The issue regarding the `props` handling will remain.
Regarding the handling of the `sample_props` in the resampler itself, it looks like @dmbee would go in the same direction as what we thought of to handle `sample_weight`: https://github.com/scikit-learn-contrib/imbalanced-learn/pull/463/files

@dmbee, do not hesitate to reuse some of the code/tests of imblearn. You can also ping me to review the PR. I'm going to become active again in scikit-learn from next week.
> I don't know any use case that requires a resample method, really, and it certainly complicates things in Pipeline: why would we change the set of samples at test time, and how would that affect scoring a pipeline's predictions?
Above, it was not my intent to support resampling at test time. I understand that is out of scope here / niche.
Allowing separate fit / resample methods (instead of just fit_resample) does not affect how the resampler is used in a pipeline or the complexity of the pipeline, in my view (see my pipeline code above). It came to mind that it might be useful to separate the fit and resample methods for some potential use cases (e.g. generative resampling).

However, if this capability is not desirable, we can use an API very similar to imblearn's, adding handling of sample_props, which is straightforward if the resampler is just indexing the data. See the rough example below.
Let me know your thoughts.
```python
import numpy as np
from abc import abstractmethod
from sklearn.base import BaseEstimator


class ResamplerMixin(object):

    def fit_resample(self, X, y, props=None, **fit_params):
        # Gets called by pipeline._fit()
        self._check_data(X, y, props)
        return self._fit_resample(X, y, props, **fit_params)

    @abstractmethod  # must be implemented in derived classes
    def _fit_resample(self, X, y, props=None, **fit_params):
        return X, y, props

    def _check_data(self, X, y, props):
        # To be expanded upon: every sample-aligned prop must match len(y).
        if props is not None:
            if not np.all(np.array([len(props[k]) for k in props]) == len(y)):
                raise ValueError("props entries must have the same length as y")

    def _resample_from_indices(self, X, y, props, indices):
        if props is not None:
            return X[indices], y[indices], {k: props[k][indices] for k in props}
        return X[indices], y[indices], None


class TakeOneSample(BaseEstimator, ResamplerMixin):

    def __init__(self, index=0):
        self.index = index

    def _fit_resample(self, X, y, props=None, **fit_params):
        return self._resample_from_indices(X, y, props, self.index)
```
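Illustrative usage of the sketch above (passing a list index keeps the arrays two-dimensional):

```python
import numpy as np

X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 0, 1, 0])
props = {'sample_weight': np.array([1., 2., 3., 4., 5.])}

X_r, y_r, props_r = TakeOneSample(index=[0]).fit_resample(X, y, props=props)
# X_r == [[0, 1]]; y_r == [0]; props_r == {'sample_weight': [1.]}
```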
@glemaitre - thank you. I certainly have no desire to replicate what has been done in imblearn (I like and use that package, by the way!). It seems the main thing lacking atm in imblearn is the pipeline / resampler changes required to support sample_props. Otherwise, it seems very compatible with sklearn.
I don't really like that we're committing to a format for the sample props here, in a sense, but I guess it's not that different from the handling of `fit_params` in the cross-validation code right now. So I think it should be good to go.
> I certainly have no desire to replicate what has been done in imblearn

Actually, you should take whatever works for scikit-learn from imblearn. Our idea is to contribute whatever is good upstream and remove it from our code base. We are at a stage where the API is starting to be more stable, and we recently made some changes to reflect some discussions with @jnothman. Bottom line: take whatever is beneficial for scikit-learn ;)
> So I think it should be good to go.

Well, in theory, we would need to work on the SLEPs. However, it may be a good exercise to move this issue forward, in order to be able to write a concise SLEP.
> Actually, you should take whatever works for scikit-learn from imblearn. Our idea is to contribute whatever is good upstream and remove it from our code base.

OK, good to know - thanks. I should have said re-implement rather than replicate. It seems most things from imblearn can be readily ported.
> Well, in theory, we would need to work on the SLEPs.

Not sure what a SLEP is...
A SLEP is an (under-used) Scikit-learn enhancement proposal. https://github.com/scikit-learn/enhancement_proposals/
Regarding the API proposal in https://github.com/scikit-learn/scikit-learn/issues/3855#issuecomment-446690557: yes, that approach to resampling looks good... Not all estimators supporting `fit_resample` can do so from indices, but I think you are aware of that; sample reduction techniques will not, for instance.
> Regarding the API proposal in #3855 (comment), yes, that approach to resampling looks good... Not all estimators supporting fit_resample can do so from indices, but I think you are aware of that; sample reduction techniques will not, for instance.
OK, good stuff. Yes - I am aware that not all resamplers will sample from indices. Those that do not will have to either implement their own method of dealing with props (if there is a sensible option - hard to know for sure without knowing what will be in props) or otherwise raise a NotImplementedError if props is not None.
Is anyone in Paris working on this? I'd be happy to help (and the API would be useful for fitting semi-supervised classifiers, as discussed in this review).
Yep, I started to port the imblearn implementation. Do you want to take over, and I can help with the review instead?
Sure, where can I find you?
I started working on this (I'm not in Paris), but got too busy over the last couple of months. I am happy to share what I have done already, and continue working on it.
> I started working on this (I'm not in Paris), but got too busy over the last couple of months. I am happy to share what I have done already, and continue working on it.
I'm starting work on this now for the sprint. If you have existing work that you think could be useful, I'd be more than happy to build on what you've done.
Here it is - look at the resample folder in sklearn: https://github.com/dmbee/scikit-learn/tree/dmbee-resampling