Dask: error while assigning label encoded values to column in dask dataframe

Created on 19 Nov 2018 · 6 Comments · Source: dask/dask

I am getting an error when label encoding features. (Originally, I imported a CSV file into a dask dataframe; after cleaning, it has 28 columns.) To reproduce my case, I created a dask dataframe as below:

import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.preprocessing import LabelEncoder

country = np.random.choice(['US','UK','IN'],1700000)
df = pd.DataFrame({'A':country,'B':range(1700000)})
ddf = dd.from_pandas(df,npartitions=2,sort=False)

Then I tried to label encode the categorical column as below:

le = LabelEncoder()
ddf = ddf.assign(A=dd.from_dask_array(le.fit_transform(ddf['A'])))

which threw the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-106-480a5e12886a> in <module>()
     10 type(le.fit_transform(ddf['A']))
     11 #ddf['A'] = dd.from_array(le.fit_transform(ddf['A']))
---> 12 ddf = ddf.assign(A=dd.from_dask_array(le.fit_transform(ddf['A'])))

/opt/conda/lib/python3.6/site-packages/dask/dataframe/core.py in assign(self, **kwargs)
   2698         # Figure out columns of the output
   2699         df2 = self._meta.assign(**_extract_meta(kwargs))
-> 2700         return elemwise(methods.assign, self, *pairs, meta=df2)
   2701 
   2702     @derived_from(pd.DataFrame, ua_args=['index'])

/opt/conda/lib/python3.6/site-packages/dask/dataframe/core.py in elemwise(op, *args, **kwargs)
   3277 
   3278     from .multi import _maybe_align_partitions
-> 3279     args = _maybe_align_partitions(args)
   3280     dasks = [arg for arg in args if isinstance(arg, (_Frame, Scalar, Array))]
   3281     dfs = [df for df in dasks if isinstance(df, _Frame)]

/opt/conda/lib/python3.6/site-packages/dask/dataframe/multi.py in _maybe_align_partitions(args)
    145     divisions = dfs[0].divisions
    146     if not all(df.divisions == divisions for df in dfs):
--> 147         dfs2 = iter(align_partitions(*dfs)[0])
    148         return [a if not isinstance(a, _Frame) else next(dfs2) for a in args]
    149     return args

/opt/conda/lib/python3.6/site-packages/dask/dataframe/multi.py in align_partitions(*dfs)
    101         raise ValueError("dfs contains no DataFrame and Series")
    102     if not all(df.known_divisions for df in dfs1):
--> 103         raise ValueError("Not all divisions are known, can't align "
    104                          "partitions. Please use `set_index` "
    105                          "to set the index.")

BParesh89

All 6 comments

This is because dd.from_dask_array(le.fit_transform(ddf['A'])) loses the pandas index information, so Dask DataFrame cannot tell how to align the two datasets. In the short term, I suspect you can resolve your problem with the following:

dd.from_dask_array(le.fit_transform(ddf['A']), index=ddf.index)

@TomAugspurger I'm curious to hear your thoughts on this. I can see two approaches to making this easier on users:

1. We have fit_transform convert dataframes to dataframes, although this conflicts with the sparse array situation that we had before.
2. We allow Dask DataFrame to blindly combine two dataframes with known and unknown divisions if they have the same number of partitions. We raise a warning rather than an error.

Other thoughts?

mrocklin on 19 Nov 2018

I think adding a preserve_dataframe keyword to LabelEncoder should be OK here.

@BParesh89 does dask_ml.preprocessing.OneHotEncoder work for your use-case? http://ml.dask.org/modules/api.html#dask_ml.preprocessing.OneHotEncoder

TomAugspurger on 19 Nov 2018

@mrocklin, thanks for the workaround, it's working!
@TomAugspurger, I got an error when passing preserve_dataframe=True to LabelEncoder. I think it's not a valid argument for it.
Also, I tried the code below:

country = np.random.choice(['US','UK','IN'],1700000)
df = pd.DataFrame({'A':country,'B':range(1700000)})
ddf = dd.from_pandas(df,npartitions=2,sort=False)
ohe = OneHotEncoder()
le = LabelEncoder()

ddf = ddf.assign(A=dd.from_dask_array(ohe.fit_transform(ddf['A']),index=ddf.index))

and got the error below:

ValueError: Expected 2D array, got 1D array instead:
array=['US' 'UK' 'UK' ... 'IN' 'US' 'US'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

BParesh89 on 21 Nov 2018

preserve_dataframe would be a new keyword for LabelEncoder. It hasn't been implemented yet.

OneHotEncoder expects a 2-D array, so ohe.fit_transform(ddf[['A']]) ought to work.

TomAugspurger on 21 Nov 2018

@TomAugspurger, I tried that as well, as below:

country = np.random.choice(['US','UK','IN'],1700000)
df = pd.DataFrame({'A':country,'B':range(1700000)})
ddf = dd.from_pandas(df,npartitions=2,sort=False)

ohe = OneHotEncoder()

ddf = ddf.assign(A=dd.from_dask_array(ohe.fit_transform(ddf[['A']])),index=ddf.index)
ddf.head()

but got the ValueError below:

ValueError                                Traceback (most recent call last)
<ipython-input-22-8ca893cc6c15> in <module>()
      8 # le = LabelEncoder()
      9 #ddf = ddf.assign(A=dd.from_dask_array(le.fit_transform(ddf['A']),index=ddf.index))
---> 10 ddf = ddf.assign(A=dd.from_dask_array(ohe.fit_transform(ddf[['A']])),index=ddf.index)
     11 ddf.head()

/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in fit_transform(self, X, y)
    499                 self._categorical_features, copy=True)
    500         else:
--> 501             return self.fit(X).transform(X)
    502 
    503     def _legacy_transform(self, X):

/opt/conda/lib/python3.6/site-packages/dask_ml/preprocessing/_encoders.py in fit(self, X, y)
    126 
    127         if isinstance(X, (pd.Series, pd.DataFrame)) or dask.is_dask_collection(X):
--> 128             self._fit(X, handle_unknown=self.handle_unknown)
    129         else:
    130             super(OneHotEncoder, self).fit(X, y=y)

/opt/conda/lib/python3.6/site-packages/dask_ml/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
    170         else:
    171             if not (X.dtypes == "category").all():
--> 172                 raise ValueError("All columns must be Categorical dtype.")
    173             if self.categories == "auto":
    174                 for col in X.columns:

ValueError: All columns must be Categorical dtype.

BParesh89 on 22 Nov 2018

You'll need to convert them to / store as categorical first.
