I get this error when I try to use LabelBinarizer and LabelEncoder in a Pipeline:
sklearn/pipeline.pyc in fit_transform(self, X, y, **fit_params)
    141         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    142         if hasattr(self.steps[-1][-1], 'fit_transform'):
--> 143             return self.steps[-1][-1].fit_transform(Xt, y, **fit_params)
    144         else:
    145             return self.steps[-1][-1].fit(Xt, y, **fit_params).transform(Xt)

TypeError: fit_transform() takes exactly 2 arguments (3 given)
It seems like this is because the classes' fit and transform signatures are different from most other estimators and only accept a single argument.
I think this is a pretty easy fix (just change the signature to def(self, X, y=None)) that I'd be happy to send a pull request for, but I wanted to check if there were any other reasons that the signatures are the way they are that I didn't think of.
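For anyone who wants to reproduce the mismatch outside a Pipeline, here is a minimal sketch. The toy data (X, y) is made up for illustration, and it assumes a scikit-learn release where LabelBinarizer.fit_transform still takes a single positional argument:

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy data, only for illustration.
X = ["red", "green", "blue", "green"]
y = [0, 1, 0, 1]

binarizer = LabelBinarizer()

# One argument works: the labels themselves are what gets binarized.
encoded = binarizer.fit_transform(X)

# A Pipeline calls fit_transform(Xt, y), i.e. two positional arguments,
# which the single-argument signature rejects.
try:
    binarizer.fit_transform(X, y)
    raised = False
except TypeError:
    raised = True
```

The exact wording of the TypeError depends on the Python version, so the sketch only checks that one is raised.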
I think you're right to fix that.
In #3113 we have decided this is not to be fixed because label encoding doesn't really belong in a Pipeline.
@jnothman, just to know: what should I be doing instead if I happen to need to vectorize a categorical feature in a pipeline?
You might be best off writing your own Pipeline-like code (perhaps inheriting from the existing one) to handle your specific case.
Instead of using LabelBinarizer in a pipeline I just implemented my own transformer:
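from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class CustomBinarizer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        # Nothing is learned here; the binarizer is fitted inside transform().
        return self
    def transform(self, X):
        # Note: refits on every call, so the output columns depend on the labels in X.
        return LabelBinarizer().fit(X).transform(X)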
Seems to do the trick!
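One caveat with the transformer above: it refits LabelBinarizer on whatever is passed to transform, so training and test data can end up with different columns. A possible variant that learns the classes in fit and reuses them in transform is sketched below. The class name FittedBinarizer and the toy data are hypothetical, and this is only one way to do it, not something proposed in this thread:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


class FittedBinarizer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper that fits LabelBinarizer once, in fit()."""

    def fit(self, X, y=None):
        # Learn the set of classes from the training data only.
        self.binarizer_ = LabelBinarizer().fit(X)
        return self

    def transform(self, X):
        # Reuse the classes learned in fit(), so the columns stay consistent.
        return self.binarizer_.transform(X)


# Hypothetical toy data: a single categorical feature.
train = ["a", "b", "c", "a"]
test = ["c", "a"]

pipe = Pipeline([("binarize", FittedBinarizer())])
pipe.fit(train)
out = pipe.transform(test)
```

Here the test rows are encoded against the three classes seen during fit, even though the test data only contains two of them.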
edit:
this is a better solution:
https://github.com/scikit-learn/scikit-learn/pull/7375/files#diff-1e175ddb0d84aad0a578d34553f6f9c6
I see that there have been a lot of negative reactions on this page. I think there has been a long misunderstanding of the purpose of LabelBinarizer and LabelEncoder. These are for targets, not features. Although admittedly they were designed (and poorly named) before my time.
Although I think users could have been using CountVectorizer (or DictVectorizer with dataframe.to_dict(orient='records') if you're coming from a dataframe) for this purpose for a long time, we have recently merged a CategoricalEncoder (#9151) into master, although this may be rolled into OneHotEncoder and a new OrdinalEncoder before release (#10521).
I hope this satisfies the needs of a clearly disgruntled populace.
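As a rough illustration of the options mentioned above, the sketch below encodes one categorical feature with OneHotEncoder, OrdinalEncoder and DictVectorizer. It assumes a scikit-learn release in which OneHotEncoder accepts string categories and OrdinalEncoder exists (0.20 or later). The data, the "color" column name and the LogisticRegression step are made up for the example, and the records list stands in for what dataframe.to_dict(orient='records') would return:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one categorical feature column and a binary target.
X = np.array([["red"], ["green"], ["blue"], ["green"]])
y = [0, 1, 0, 1]

# One-hot encoding of a feature column (sparse by default, hence toarray()).
onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(X).toarray()

# Integer codes, one column per input feature.
ordinal = OrdinalEncoder().fit_transform(X)

# DictVectorizer works on a list of dicts, one dict per row.
records = [{"color": value} for value in X[:, 0]]
vectorizer = DictVectorizer(sparse=False)
from_dicts = vectorizer.fit_transform(records)

# Unlike LabelBinarizer, these encoders follow the fit(X, y=None) convention,
# so they can be used as Pipeline steps.
pipe = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Categories are sorted, so the one-hot columns here come out in the order blue, green, red.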
I must say that as someone who has been volunteering enormous quantities of free time for the development of this project for nearly five years now (and recently has been employed to work on it too), seeing the magnitude of negative reactions, rather than constructive contributions to the library is quite saddening. Although admittedly my response above that you should write a new Pipeline-like thing, rather than a new transformer for categorical inputs was a misunderstanding on my part (and should/could have been corrected by others), which I hope is understandable while working through the enormous workload that is maintaining this project.