Pandas: API/ENH: from_dummies

Created on 6 Nov 2014 · 32 comments · Source: pandas-dev/pandas

Motivating example from SO

This is the inverse of pd.get_dummies. So maybe invert_dummies is better?
I think this name makes more sense though.

This seems a reasonable way to do it. Am I missing anything?

In [46]: s = Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]: 
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

NB. this is buggy ATM.

In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)),categories=df.categories)
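Presumably df.categories here was meant to be the dummy frame's column index; a sketch of the intended call (illustrative only, not run):

# pass the dummy columns explicitly as the categories
Series(pd.Categorical(x[x != 0].index.get_level_values(1), categories=df.columns))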
Labels: API Design, Bug, Categorical, Enhancement, Reshaping

All 32 comments

We'll need to handle the case of a DataFrame with dummy columns and non-dummy columns.

@TomAugspurger Can't we say that it is up to the user to provide the correct selection of columns? (and so error on non-dummy columns?)

I am not really sold on get_categories (as this could also mean a lot of other things; you can get categories from other types of data than dummies), so something with 'dummies' in the name feels better (invert_dummies, from_dummies, ... or something with the meaning of 'condense/melt dummies').

@jorisvandenbossche, yeah, by "handle" I meant think about, and I think raising is the best solution, sorry.
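For concreteness, a minimal sketch of such a check (hypothetical helper, not part of any pandas API), raising unless the selected columns form a clean dummy matrix:

import pandas as pd

def _validate_dummies(data: pd.DataFrame):
    # every value must be 0 or 1 ...
    if not data.isin([0, 1]).all().all():
        raise ValueError("from_dummies expects only 0/1 (dummy-encoded) columns")
    # ... and each row must select exactly one category
    if not (data.sum(axis=1) == 1).all():
        raise ValueError("each row must have exactly one column set to 1")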

What to do with NaNs? pd.get_dummies(['a', 'b', np.nan], dummy_na=True) adds a NaN column, so we should probably have a symmetrical argument for from_dummies. (I'm not sure how Categorical handles a NaN as a category.)
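A rough illustration of how such a NaN column could map back (illustrative only; from_dummies does not exist here):

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan])
dummies = pd.get_dummies(s, dummy_na=True)   # columns 'a', 'b' and a NaN column

# taking the label of the column holding the 1 maps the NaN column back to a
# missing value, but a Categorical cannot carry NaN as a category, so a
# dummy_na-style flag on from_dummies would still be needed
recovered = pd.Series(dummies.columns[np.argmax(dummies.values, axis=1)])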

I like from_dummies

+1

Should the milestone be modified from 0.16.0 to 0.18.0?

Here's a function for DataFrames (again from SO):

import numpy as np
import pandas as pd
from collections import defaultdict

def reverse_dummy(df_dummies):
    pos = defaultdict(list)
    vals = defaultdict(list)

    # group dummy columns by prefix; anything without "_" passes through untouched
    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            pos["_"].append(i)

    # for each prefix, the position of the 1 in a row is the category code
    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                      for k in vals})

    # re-attach the non-dummy columns
    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df
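A hypothetical usage sketch for the function above (column names are just examples); note that it assumes "_" only ever appears as the prefix separator:

df = pd.DataFrame({'cat': pd.Categorical(['a', 'b', 'a']), 'num': [1, 2, 3]})
dummies = pd.get_dummies(df)        # columns: 'num', 'cat_a', 'cat_b'
recovered = reverse_dummy(dummies)  # 'cat' is rebuilt as a Categorical, 'num' passes through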

What kind of roundtrip-ability can we hope for here? Ideally we have

x == pd.from_dummies(pd.get_dummies(x))

The problem is we lose the Categorical information when calling get_dummies.
In order to fully reconstruct a Categorical we would need to include the categories (if any, remember get_dummies will work on non-categorical) and the ordering when calling from_dummies.

def from_dummies(data, categories, ordered):
   ...

Additionally it could be that data came from a DataFrame, so there might be multiple sets of dummy columns and non-dummy columns. In this case we have something like

def from_dummies(data, categories, ordered, prefixes):
    pass

Where all of prefixes, categories and ordered are scalars or lists of the same length (special case for categories and ordered as scalars and prefixes=None to handle inverting pd.get_dummies(Series)).

Thoughts? That's kind of messy, but I don't see any way around it and I think we should shoot for perfect roundtrip-ability.
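For concreteness, a hypothetical round trip under that kind of signature (from_dummies as sketched here never made it into pandas):

s = pd.Series(pd.Categorical(list('abca'), categories=list('abcd'), ordered=True))
dummies = pd.get_dummies(s)   # columns a, b, c, d -- the unused 'd' shows up as an all-zero column

# the category labels can be read off the columns, but orderedness (and whether
# the input was categorical at all) has to be supplied by the caller
recovered = from_dummies(dummies, categories=s.cat.categories, ordered=True)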

you can simply infer the categories (as they are the labels of the matrix).

Categories you can get, but not whether it's ordered and what the ordering is if they are ordered.

EDIT: Oh, you can't necessarily infer the categories either, since pd.get_dummies(['a', 'a', 'b']) is the same as pd.get_dummies(pd.Series(pd.Categorical(['a', 'a', 'b'])))
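A quick illustration of that ambiguity (the point is only that the dummy frame alone does not reveal the input dtype):

import pandas as pd

plain = pd.get_dummies(['a', 'a', 'b'])
cat = pd.get_dummies(pd.Series(pd.Categorical(['a', 'a', 'b'])))
plain.equals(cat)   # True in this simple case -- both inputs produce the same dummy frame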

@TomAugspurger What does the signature look like in the version you are working on?
Is the purpose to detect the different sets of dummies based on the column names (as the output of get_dummies looks like)?
Would it return object or category columns?

Current signature

def from_dummies(data, categories=None, ordered=None, prefixes=None):
    '''
    The inverse transformation of ``pandas.get_dummies``.

    Parameters
    ----------
    data : DataFrame
    categories : Index or list of Indexes
    ordered : boolean or list of booleans
    prefixes : str or list of str

    Returns
    -------
    transformed : Series or DataFrame

    Notes
    -----
    To recover a Categorical, you must provide the categories and,
    optionally, whether it is ordered (default False). To invert a DataFrame
    that includes either multiple sets of dummy-encoded columns or a mixture
    of dummy-encoded columns and regular columns, you must specify ``prefixes``.
    '''

The default will be to return a regular Series where the values are the column labels (so int or str probably). To return a Categorical you pass in the categories. If I switched to returning a Categorical by default, we would need to provide a flag like return_categorical to disable that.

Is the purpose to detect the different sets of dummies based on the column names

That's what my prefixes argument is for. If you have multiple dummy-encoded sets you use prefixes=["first_dummy_set", "second_set", ...] and that will find all the columns with that prefix. This will maybe fail (or succeed silently!) if you have a column name that happens to share a prefix... This is beginning to look pretty complicated.
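For illustration, a hypothetical call with that signature on a mixed frame (all names invented for the example):

df = pd.DataFrame({'color': pd.Categorical(['red', 'blue']),
                   'size': pd.Categorical(['S', 'L']),
                   'price': [1.0, 2.0]})
dummies = pd.get_dummies(df)   # columns: price, color_blue, color_red, size_L, size_S

# hypothetical: invert both dummy-encoded sets, leave 'price' untouched
recovered = from_dummies(dummies,
                         prefixes=['color', 'size'],
                         categories=[['blue', 'red'], ['L', 'S']],
                         ordered=[False, False])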

This is exactly what I'm looking for... any progress? Beta?

Thanks!

@jpgrossman I have a branch at https://github.com/TomAugspurger/pandas/tree/from_dummies, though it's been a while since I've looked at that. There are several changes I would make to that, so if you're interested you could use that as a starting point (maybe just the tests).

Thank you Tom – will have a look at this soon.

This is exactly what I am looking for. Definitely a feature I'd use all the time.

A valuable addition that I would be glad to see.

pull requests are welcome!

Any update here?
@TomAugspurger Your link doesn't work anymore

@liorshk I haven't had time. Would you have a chance to submit a PR?

Here is a quick-and-dirty solution for the easiest case, using no prefix.

import numpy as np
import pandas as pd

def from_dummies(data, categories, prefix_sep='_'):
    out = data.copy()
    for l in categories:
        # dummy columns for this category (e.g. 'color_red') and their bare labels (e.g. 'red')
        cols = [c for c in data.columns if c.startswith(l + prefix_sep)]
        labs = [c[len(l + prefix_sep):] for c in cols]
        # the label of the column holding the 1 in each row becomes the value
        out[l] = pd.Categorical(np.array(labs)[np.argmax(data[cols].to_numpy(), axis=1)])
        out.drop(cols, axis=1, inplace=True)
    return out

Usage:

categorical_cols = df.columns[df.dtypes.astype(str) == "category"]
dummies = pd.get_dummies(df)
original_df = from_dummies(dummies, categories=categorical_cols)

Please note that the transformed columns are appended at the end, so the resulting DataFrame's columns will not be in the original order. I hope that helps some of you!
Cheers!

Would it make more sense to provide an option in get_dummies to also output a map between the original column names, the new column names and the categories? This could then be used to feed the reverse from_dummies function to recreate the old DataFrame.
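One possible shape for such a map, plus a from_dummies-style consumer (purely illustrative; no such option exists in get_dummies):

import numpy as np
import pandas as pd

# hypothetical mapping that get_dummies could emit alongside the dummy frame
dummy_map = {
    'color': {'columns': ['color_blue', 'color_red'],
              'categories': ['blue', 'red'],
              'ordered': False},
}

def from_dummies_with_map(dummies, dummy_map):
    out = dummies.copy()
    for name, info in dummy_map.items():
        # the position of the 1 in each row doubles as the category code
        codes = np.argmax(dummies[info['columns']].to_numpy(), axis=1)
        out[name] = pd.Categorical.from_codes(codes, info['categories'],
                                              ordered=info['ordered'])
        out = out.drop(columns=info['columns'])
    return out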

I have edited @kevin-winter's code in case someone has drop_first=True in pd.get_dummies():
i.e., dummies = pd.get_dummies(df, drop_first=True)

import numpy as np
import pandas as pd

def from_dummies(data, categorical_cols, categorical_cols_first, prefix_sep='_'):
    out = data.copy()

    for col_parent in categorical_cols:

        filter_col = [col for col in data if col.startswith(col_parent)]
        # position of the 1 in each row (argmax also returns 0 for an all-zero row)
        cols_with_ones = np.argmax(data[filter_col].values, axis=1)

        org_col_values = []
        for row, col in enumerate(cols_with_ones):
            if (col == 0) and (data[filter_col].iloc[row, col] < 1):
                # all-zero row: this observation carried the dropped first level
                org_col_values.append(categorical_cols_first.get(col_parent))
            else:
                org_col_values.append(data[filter_col].columns[col].split(col_parent + prefix_sep, 1)[1])

        out[col_parent] = pd.Series(org_col_values).values
        out.drop(filter_col, axis=1, inplace=True)

    return out

categorical_cols_first is a dictionary of the first level of each categorical variable, i.e. the one that will be dropped by pd.get_dummies():

# first level (in sorted order) of each categorical column
categorical_cols_first = []
for col in categorical_cols:
    categorical_cols_first.append(df[col].value_counts().sort_index().keys()[0])
categorical_cols_first = dict(zip(categorical_cols, categorical_cols_first))
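Putting the pieces of this comment together (hypothetical usage, reusing categorical_cols from the earlier usage snippet):

dummies = pd.get_dummies(df, drop_first=True)
recovered = from_dummies(dummies, categorical_cols, categorical_cols_first)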

Wrote it quickly, so please comment if there is any bug. It worked for me though.
Hope this helps!

I would raise an exception in @kevin-winter's function in case data[cols] is empty, explaining that one of the provided cols is incorrect.
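A sketch of that check, using the names from the snippet above (data, l, prefix_sep), placed at the top of the loop body:

cols = [c for c in data.columns if c.startswith(l + prefix_sep)]
if not cols:
    raise ValueError("no dummy columns found for category %r "
                     "(expected columns prefixed with %r)" % (l, l + prefix_sep))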

Seems like a popular request, I'll start working on this

I failed to find this on a search, and so created a duplicate issue.

My approach was to add from_dummies as an alternate constructor for Categorical: that way it's clear what it creates, it's easy to discover and to find documentation for, and the additional arguments are passed straight to that object. And let's not forget, "Namespaces are one honking great idea -- let's do more of those!".

This implementation minimises loops in python (although there are a couple of whole-dataframe copies), but doesn't do a lot of nannying for incorrect inputs:

import numpy as np 
import pandas as pd

class Categorical:
    ...

    @classmethod
    def from_dummies(cls, df: pd.DataFrame, **kwargs):
        onehot = df.astype(bool)

        if (onehot.sum(axis=1) > 1).any():
            raise ValueError("Some rows belong to >1 category")

        # position 0 is reserved for "no category set" and maps back to NaN
        index_into = pd.Series([np.nan] + list(onehot.columns))
        mult_by = np.arange(1, len(index_into))

        # each row collapses to the 1-based position of its set column (0 if none)
        indexes = (onehot.astype(int) * mult_by).sum(axis=1)
        values = index_into[indexes]

        return cls(values, df.columns, **kwargs)
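Hypothetical usage, assuming the classmethod above were added to pandas.Categorical:

dummies = pd.get_dummies(pd.Series(list('aabc')))

cat = pd.Categorical.from_dummies(dummies)                        # values a, a, b, c
ordered_cat = pd.Categorical.from_dummies(dummies, ordered=True)  # extra kwargs pass through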

Think I'm taking this on, should be able to have a go tomorrow. For the sake of symmetry, I'd also like to give Categorical a to_dummies. If we go down that route, it might be nice to eventually deprecate the get_dummies free function so as to keep categorical-related functionality on the Categorical class and not duplicate API surface.
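A minimal sketch of what that to_dummies could look like, extending the class sketch above and just delegating to the existing get_dummies (not an actual pandas method):

class Categorical:
    ...

    def to_dummies(self, prefix=None, prefix_sep='_', dummy_na=False):
        # thin wrapper over the existing free function, so the round trip lives on the class
        return pd.get_dummies(pd.Series(self), prefix=prefix,
                              prefix_sep=prefix_sep, dummy_na=dummy_na)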

Also just to check - strictly, dummy variables are of float type, and valued 0 and 1, whereas one-hot encoded variables are of binary type? Is that a distinction we want to keep here? Users can always .astype(bool) on it.

Also just to check - strictly, dummy variables are of float type, and valued 0 and 1, whereas one-hot encoded variables are of binary type? Is that a distinction we want to keep here?

Why do you say they're float dtype?

In [4]: pd.get_dummies(pd.Series([1, 2, 3])).dtypes
Out[4]:
1    uint8
2    uint8
3    uint8
dtype: object

I just had a look through some docs and it looked like the term "dummy variable" is used mainly in regression, in cases where you have a categorical variable but need to encode it as continuous (i.e. floating point) for the purposes of that regression. The term "one-hot encoding" seems more commonly used in applications which deal in actual booleans. For both of them, the information itself is binary, of course.

I may be completely making up that distinction, though.

In my experience "one-hot encoding" and "dummy variables" are synonymous.

In my experience "one-hot encoding" and "dummy variables" are synonymous.

Seems the scikit-learn docs would agree

The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter)

take
