I am getting an error when label encoding features. To reproduce my case (originally, I imported a CSV file into a Dask DataFrame, and after cleaning it has 28 columns left), I created a Dask DataFrame as follows:
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
from dask_ml.preprocessing import LabelEncoder
country = np.random.choice(['US','UK','IN'],1700000)
df = pd.DataFrame({'A':country,'B':range(1700000)})
ddf = dd.from_pandas(df,npartitions=2,sort=False)
Then I tried to label encode the categorical column as follows:
le = LabelEncoder()
ddf = ddf.assign(A=dd.from_dask_array(le.fit_transform(ddf['A'])))
which raised the following error:
ValueError Traceback (most recent call last)
<ipython-input-106-480a5e12886a> in <module>()
10 type(le.fit_transform(ddf['A']))
11 #ddf['A'] = dd.from_array(le.fit_transform(ddf['A']))
---> 12 ddf = ddf.assign(A=dd.from_dask_array(le.fit_transform(ddf['A'])))
/opt/conda/lib/python3.6/site-packages/dask/dataframe/core.py in assign(self, **kwargs)
2698 # Figure out columns of the output
2699 df2 = self._meta.assign(**_extract_meta(kwargs))
-> 2700 return elemwise(methods.assign, self, *pairs, meta=df2)
2701
2702 @derived_from(pd.DataFrame, ua_args=['index'])
/opt/conda/lib/python3.6/site-packages/dask/dataframe/core.py in elemwise(op, *args, **kwargs)
3277
3278 from .multi import _maybe_align_partitions
-> 3279 args = _maybe_align_partitions(args)
3280 dasks = [arg for arg in args if isinstance(arg, (_Frame, Scalar, Array))]
3281 dfs = [df for df in dasks if isinstance(df, _Frame)]
/opt/conda/lib/python3.6/site-packages/dask/dataframe/multi.py in _maybe_align_partitions(args)
145 divisions = dfs[0].divisions
146 if not all(df.divisions == divisions for df in dfs):
--> 147 dfs2 = iter(align_partitions(*dfs)[0])
148 return [a if not isinstance(a, _Frame) else next(dfs2) for a in args]
149 return args
/opt/conda/lib/python3.6/site-packages/dask/dataframe/multi.py in align_partitions(*dfs)
101 raise ValueError("dfs contains no DataFrame and Series")
102 if not all(df.known_divisions for df in dfs1):
--> 103 raise ValueError("Not all divisions are known, can't align "
104 "partitions. Please use `set_index` "
105 "to set the index.")
This is because dd.from_dask_array(le.fit_transform(ddf['A'])) loses the pandas index information, so Dask DataFrame cannot tell how to align the two datasets. In the short term, I suspect you can resolve your problem with the following:
dd.from_dask_array(le.fit_transform(ddf['A']), index=ddf.index)
@TomAugspurger I'm curious to hear your thoughts on this. I can see two approaches to making this easier on users:
fit_transform convert dataframes to dataframes, although this conflicts with the sparse array situation that we had before
Other thoughts?
I think adding a preserve_dataframe keyword to LabelEncoder should be OK here.
@BParesh89 does dask_ml.preprocessing.OneHotEncoder work for your use-case? http://ml.dask.org/modules/api.html#dask_ml.preprocessing.OneHotEncoder
@mrocklin, thanks for the workaround, it's working!
@TomAugspurger, I got an error when passing preserve_dataframe=True to LabelEncoder. I don't think it is a valid argument for it.
I also tried the code below:
from dask_ml.preprocessing import OneHotEncoder
country = np.random.choice(['US','UK','IN'],1700000)
df = pd.DataFrame({'A':country,'B':range(1700000)})
ddf = dd.from_pandas(df,npartitions=2,sort=False)
ohe = OneHotEncoder()
le = LabelEncoder()
ddf = ddf.assign(A=dd.from_dask_array(ohe.fit_transform(ddf['A']),index=ddf.index))
and got the error below:
ValueError: Expected 2D array, got 1D array instead:
array=['US' 'UK' 'UK' ... 'IN' 'US' 'US'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
preserve_dataframe would be a new keyword for LabelEncoder. It hasn't been implemented yet.
OneHotEncoder expects a 2-D array, so ohe.fit_transform(ddf[['A']]) ought to work.
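To illustrate the 1-D versus 2-D distinction, here is a small sketch using plain pandas (the same indexing rules apply to a Dask DataFrame). The variable names one_d and two_d are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["US", "UK", "IN"], "B": range(3)})

# Single brackets select a Series, which is 1-D.
one_d = df["A"]
print(type(one_d).__name__, one_d.ndim, one_d.shape)

# Double brackets select a one-column DataFrame, which is 2-D.
two_d = df[["A"]]
print(type(two_d).__name__, two_d.ndim, two_d.shape)
```

The "Expected 2D array, got 1D array" error above comes from passing the first form where the second is required.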
@TomAugspurger, I tried that as well, as shown below:
country = np.random.choice(['US','UK','IN'],1700000)
df = pd.DataFrame({'A':country,'B':range(1700000)})
ddf = dd.from_pandas(df,npartitions=2,sort=False)
ohe = OneHotEncoder()
ddf = ddf.assign(A=dd.from_dask_array(ohe.fit_transform(ddf[['A']])),index=ddf.index)
ddf.head()
but got the ValueError below:
ValueError Traceback (most recent call last)
<ipython-input-22-8ca893cc6c15> in <module>()
8 # le = LabelEncoder()
9 #ddf = ddf.assign(A=dd.from_dask_array(le.fit_transform(ddf['A']),index=ddf.index))
---> 10 ddf = ddf.assign(A=dd.from_dask_array(ohe.fit_transform(ddf[['A']])),index=ddf.index)
11 ddf.head()
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in fit_transform(self, X, y)
499 self._categorical_features, copy=True)
500 else:
--> 501 return self.fit(X).transform(X)
502
503 def _legacy_transform(self, X):
/opt/conda/lib/python3.6/site-packages/dask_ml/preprocessing/_encoders.py in fit(self, X, y)
126
127 if isinstance(X, (pd.Series, pd.DataFrame)) or dask.is_dask_collection(X):
--> 128 self._fit(X, handle_unknown=self.handle_unknown)
129 else:
130 super(OneHotEncoder, self).fit(X, y=y)
/opt/conda/lib/python3.6/site-packages/dask_ml/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
170 else:
171 if not (X.dtypes == "category").all():
--> 172 raise ValueError("All columns must be Categorical dtype.")
173 if self.categories == "auto":
174 for col in X.columns:
ValueError: All columns must be Categorical dtype.
You'll need to convert them to categorical dtype (or store them as categorical) first.