Pandas: API/ENH: from_dummies

Created on 6 Nov 2014 · 32 comments · Source: pandas-dev/pandas

Motivating example from SO

This is the inverse of pd.get_dummies. So maybe invert_dummies is better?
I think this name makes more sense though.

This seems a reasonable way to do it. Am I missing anything?

In [46]: s = Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]: 
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

NB. this is buggy ATM.

In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)),categories=df.categories)
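Presumably df.categories here was meant to be the dummy frame's column index; a sketch of the intended call (illustrative only, not run):

# pass the dummy columns explicitly as the categories
Series(pd.Categorical(x[x != 0].index.get_level_values(1), categories=df.columns))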
Labels: API Design, Bug, Categorical, Enhancement, Reshaping

All 32 comments

We'll need to handle the case of a DataFrame with dummy columns and non-dummy columns.

@TomAugspurger Can't we say that it is up to the user to provide the correct selection of columns? (and so error on non-dummy columns?)

I am not really sold on get_categories (as this could also mean a lot of other things; you can get categories from other types of data than dummies), so something with 'dummies' in the name feels better (invert_dummies, from_dummies, ... or something with the meaning of 'condense/melt dummies').

@jorisvandenbossche, yeah, by "handle" I meant think about, and I think raising is the best solution, sorry.
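For concreteness, a minimal sketch of such a check (hypothetical helper, not part of any pandas API), raising unless the selected columns form a clean dummy matrix:

import pandas as pd

def _validate_dummies(data: pd.DataFrame):
    # every value must be 0 or 1 ...
    if not data.isin([0, 1]).all().all():
        raise ValueError("from_dummies expects only 0/1 (dummy-encoded) columns")
    # ... and each row must select exactly one category
    if not (data.sum(axis=1) == 1).all():
        raise ValueError("each row must have exactly one column set to 1")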

What to do with NaNs? pd.get_dummies(['a', 'b', np.nan], dummy_na=True) adds a NaN column, so we should probably have a symmetrical argument for from_dummies. (I'm not sure how Categorical handles a NaN as a category.)
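A rough illustration of how such a NaN column could map back (illustrative only; from_dummies does not exist here):

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan])
dummies = pd.get_dummies(s, dummy_na=True)   # columns 'a', 'b' and a NaN column

# taking the label of the column holding the 1 maps the NaN column back to a
# missing value, but a Categorical cannot carry NaN as a category, so a
# dummy_na-style flag on from_dummies would still be needed
recovered = pd.Series(dummies.columns[np.argmax(dummies.values, axis=1)])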

I like from_dummies

+1

Should the milestone be modified from 0.16.0 to 0.18.0?

Here's a function for DataFrames (again from SO):

import numpy as np
import pandas as pd
from collections import defaultdict

def reverse_dummy(df_dummies):
    pos = defaultdict(list)
    vals = defaultdict(list)

    # group dummy columns by prefix; anything without "_" passes through untouched
    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            pos["_"].append(i)

    # for each prefix, the position of the 1 in a row is the category code
    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                      for k in vals})

    # re-attach the non-dummy columns
    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df
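A hypothetical usage sketch for the function above (column names are just examples); note that it assumes "_" only ever appears as the prefix separator:

df = pd.DataFrame({'cat': pd.Categorical(['a', 'b', 'a']), 'num': [1, 2, 3]})
dummies = pd.get_dummies(df)        # columns: 'num', 'cat_a', 'cat_b'
recovered = reverse_dummy(dummies)  # 'cat' is rebuilt as a Categorical, 'num' passes through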

What kind of roundtrip-ability can we hope for here? Ideally we have

x == pd.from_dummies(pd.get_dummies(x))

The problem is we lose the Categorical information when calling get_dummies.
In order to fully reconstruct a Categorical we would need to include the categories (if any, remember get_dummies will work on non-categorical) and the ordering when calling from_dummies.

def from_dummies(data, categories, ordered):
   ...

Additionally it could be that data came from a DataFrame, so there might be multiple sets of dummy columns and non-dummy columns. In this case we have something like

def from_dummies(data, categories, ordered, prefixes):
    pass

Where all of prefixes, categories and ordered are scalars or lists of the same length (special case for categories and ordered as scalars and prefixes=None to handle inverting pd.get_dummies(Series)).

Thoughts? That's kind of messy, but I don't see any way around it and I think we should shoot for perfect roundtrip-ability.
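For concreteness, a hypothetical round trip under that kind of signature (from_dummies as sketched here never made it into pandas):

s = pd.Series(pd.Categorical(list('abca'), categories=list('abcd'), ordered=True))
dummies = pd.get_dummies(s)   # columns a, b, c, d -- the unused 'd' shows up as an all-zero column

# the category labels can be read off the columns, but orderedness (and whether
# the input was categorical at all) has to be supplied by the caller
recovered = from_dummies(dummies, categories=s.cat.categories, ordered=True)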

you can simply infer the categories (as they are the labels of the matrix).

Categories you can get, but not whether it's ordered and what the ordering is if they are ordered.

EDIT: Oh, you can't necessarily infer the categories either, since pd.get_dummies(['a', 'a', 'b']) is the same as pd.get_dummies(pd.Series(pd.Categorical(['a', 'a', 'b'])))
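A quick illustration of that ambiguity (the point is only that the dummy frame alone does not reveal the input dtype):

import pandas as pd

plain = pd.get_dummies(['a', 'a', 'b'])
cat = pd.get_dummies(pd.Series(pd.Categorical(['a', 'a', 'b'])))
plain.equals(cat)   # True in this simple case -- both inputs produce the same dummy frame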

@TomAugspurger What does the signature look like in the version you are working on?
Is the purpose to detect the different sets of dummies based on the column names (as the output of get_dummies looks like)?
Would it return object or category columns?

Current signature

def from_dummies(data, categories=None, ordered=None, prefixes=None):
    '''
    The inverse transformation of ``pandas.get_dummies``.

    Parameters
    ----------
    data : DataFrame
    categories : Index or list of Indexes
    ordered : boolean or list of booleans
    prefixes : str or list of str

    Returns
    -------
    transformed : Series or DataFrame

    Notes
    -----
    To recover a Categorical, you must provide the categories and,
    optionally, whether it is ordered (default False). To invert a DataFrame
    that includes either multiple sets of dummy-encoded columns or a mixture
    of dummy-encoded columns and regular columns, you must specify ``prefixes``.
    '''

The default will be to return a regular Series where the values are the column labels (so int or str probably). To return a Categorical you pass in the categories. If I switched to returning a Categorical by default, we would need to provide a flag like return_categorical to disable that.

Is the purpose to detect the different sets of dummies based on the column names

That's what my prefixes argument is for. If you have multiple dummy-encoded sets you use prefixes=["first_dummy_set", "second_set", ...] and that will find all the columns with that prefix. This will maybe fail (or succeed silently!) if you have a column name that happens to share a prefix... This is beginning to look pretty complicated.
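For illustration, a hypothetical call with that signature on a mixed frame (all names invented for the example):

df = pd.DataFrame({'color': pd.Categorical(['red', 'blue']),
                   'size': pd.Categorical(['S', 'L']),
                   'price': [1.0, 2.0]})
dummies = pd.get_dummies(df)   # columns: price, color_blue, color_red, size_L, size_S

# hypothetical: invert both dummy-encoded sets, leave 'price' untouched
recovered = from_dummies(dummies,
                         prefixes=['color', 'size'],
                         categories=[['blue', 'red'], ['L', 'S']],
                         ordered=[False, False])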

This is exactly what I'm looking for... any progress? Beta?

Thanks!

@jpgrossman I have a branch at https://github.com/TomAugspurger/pandas/tree/from_dummies, though it's been a while since I've looked at that. There are several changes I would make to that, so if you're interested you could use that as a starting point (maybe just the tests).

Thank you Tom – will have a look at this soon.

This is exactly what I am looking for. Definitely a feature I'd use all the time.

A valuable addition that I would be glad to see.

pull requests are welcome!

Any update here?
@TomAugspurger Your link doesn't work anymore

@liorshk I haven't had time. Would you have a chance to submit a PR?

Here is a quick-and-dirty solution for the easiest case, using no prefix.

import numpy as np
import pandas as pd

def from_dummies(data, categories, prefix_sep='_'):
    out = data.copy()
    for l in categories:
        # dummy columns for this category (e.g. 'color_red') and their bare labels (e.g. 'red')
        cols = [c for c in data.columns if c.startswith(l + prefix_sep)]
        labs = [c[len(l + prefix_sep):] for c in cols]
        # the label of the column holding the 1 in each row becomes the value
        out[l] = pd.Categorical(np.array(labs)[np.argmax(data[cols].to_numpy(), axis=1)])
        out.drop(cols, axis=1, inplace=True)
    return out

Usage:

categorical_cols = df.columns[df.dtypes.astype(str) == "category"]
dummies = pd.get_dummies(df)
original_df = from_dummies(dummies, categories=categorical_cols)

Please note that the transformed columns are appended at the end, so the resulting DataFrame's columns will not be in the original order. I hope that helps some of you!
Cheers!

Would it make more sense to provide an option in get_dummies to also output a map between the original column names, the new column names and the categories? This could then be used to feed the reverse from_dummies function to recreate the old DataFrame.
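One possible shape for such a map, plus a from_dummies-style consumer (purely illustrative; no such option exists in get_dummies):

import numpy as np
import pandas as pd

# hypothetical mapping that get_dummies could emit alongside the dummy frame
dummy_map = {
    'color': {'columns': ['color_blue', 'color_red'],
              'categories': ['blue', 'red'],
              'ordered': False},
}

def from_dummies_with_map(dummies, dummy_map):
    out = dummies.copy()
    for name, info in dummy_map.items():
        # the position of the 1 in each row doubles as the category code
        codes = np.argmax(dummies[info['columns']].to_numpy(), axis=1)
        out[name] = pd.Categorical.from_codes(codes, info['categories'],
                                              ordered=info['ordered'])
        out = out.drop(columns=info['columns'])
    return out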

I have edited @kevin-winter's code in case someone has drop_first=True in pd.get_dummies():
i.e., dummies = pd.get_dummies(df, drop_first=True)

import numpy as np
import pandas as pd

def from_dummies(data, categorical_cols, categorical_cols_first, prefix_sep='_'):
    out = data.copy()

    for col_parent in categorical_cols:

        filter_col = [col for col in data if col.startswith(col_parent)]
        # position of the 1 in each row (argmax also returns 0 for an all-zero row)
        cols_with_ones = np.argmax(data[filter_col].values, axis=1)

        org_col_values = []
        for row, col in enumerate(cols_with_ones):
            if (col == 0) and (data[filter_col].iloc[row, col] < 1):
                # all-zero row: this observation carried the dropped first level
                org_col_values.append(categorical_cols_first.get(col_parent))
            else:
                org_col_values.append(data[filter_col].columns[col].split(col_parent + prefix_sep, 1)[1])

        out[col_parent] = pd.Series(org_col_values).values
        out.drop(filter_col, axis=1, inplace=True)

    return out

categorical_cols_first is a dictionary of the first level of each categorical variable, i.e. the one that will be dropped by pd.get_dummies():

# first level (in sorted order) of each categorical column
categorical_cols_first = []
for col in categorical_cols:
    categorical_cols_first.append(df[col].value_counts().sort_index().keys()[0])
categorical_cols_first = dict(zip(categorical_cols, categorical_cols_first))
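Putting the pieces of this comment together (hypothetical usage, reusing categorical_cols from the earlier usage snippet):

dummies = pd.get_dummies(df, drop_first=True)
recovered = from_dummies(dummies, categorical_cols, categorical_cols_first)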

Wrote it quickly, so please comment if there is any bug. It worked for me though.
Hope this helps!

I would raise an exception in @kevin-winter's function in case data[cols] is empty, explaining that one of the provided cols is incorrect.
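A sketch of that check, using the names from the snippet above (data, l, prefix_sep), placed at the top of the loop body:

cols = [c for c in data.columns if c.startswith(l + prefix_sep)]
if not cols:
    raise ValueError("no dummy columns found for category %r "
                     "(expected columns prefixed with %r)" % (l, l + prefix_sep))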

Seems like a popular request, I'll start working on this

I failed to find this on a search, and so created a duplicate issue.

My approach was to add from_dummies as an alternate constructor for Categorical: that way it's clear what it creates, it's easy to discover and to find documentation for, and the additional arguments are passed straight to that object. And let's not forget, "Namespaces are one honking great idea -- let's do more of those!".

This implementation minimises loops in python (although there are a couple of whole-dataframe copies), but doesn't do a lot of nannying for incorrect inputs:

import numpy as np 
import pandas as pd

class Categorical:
    ...

    @classmethod
    def from_dummies(cls, df: pd.DataFrame, **kwargs):
        onehot = df.astype(bool)

        if (onehot.sum(axis=1) > 1).any():
            raise ValueError("Some rows belong to >1 category")

        # position 0 is reserved for "no category set" and maps back to NaN
        index_into = pd.Series([np.nan] + list(onehot.columns))
        mult_by = np.arange(1, len(index_into))

        # each row collapses to the 1-based position of its set column (0 if none)
        indexes = (onehot.astype(int) * mult_by).sum(axis=1)
        values = index_into[indexes]

        return cls(values, df.columns, **kwargs)
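Hypothetical usage, assuming the classmethod above were added to pandas.Categorical:

dummies = pd.get_dummies(pd.Series(list('aabc')))

cat = pd.Categorical.from_dummies(dummies)                        # values a, a, b, c
ordered_cat = pd.Categorical.from_dummies(dummies, ordered=True)  # extra kwargs pass through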

Think I'm taking this on, should be able to have a go tomorrow. For the sake of symmetry, I'd also like to give Categorical a to_dummies. If we go down that route, it might be nice to eventually deprecate the get_dummies free function so as to keep categorical-related functionality on the Categorical class and not duplicate API surface.
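A minimal sketch of what that to_dummies could look like, extending the class sketch above and just delegating to the existing get_dummies (not an actual pandas method):

class Categorical:
    ...

    def to_dummies(self, prefix=None, prefix_sep='_', dummy_na=False):
        # thin wrapper over the existing free function, so the round trip lives on the class
        return pd.get_dummies(pd.Series(self), prefix=prefix,
                              prefix_sep=prefix_sep, dummy_na=dummy_na)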

Also just to check - strictly, dummy variables are of float type, and valued 0 and 1, whereas one-hot encoded variables are of binary type? Is that a distinction we want to keep here? Users can always .astype(bool) on it.

Also just to check - strictly, dummy variables are of float type, and valued 0 and 1, whereas one-hot encoded variables are of binary type? Is that a distinction we want to keep here?

Why do you say they're float dtype?

In [4]: pd.get_dummies(pd.Series([1, 2, 3])).dtypes
Out[4]:
1    uint8
2    uint8
3    uint8
dtype: object

I just had a look through some docs and it looked like the term "dummy variable" is used mainly in regression, in cases where you have a categorical variable but need to encode it as continuous (i.e. floating point) for the purposes of that regression. The term "one-hot encoding" seems more commonly used in applications which deal in actual booleans. For both of them, the information itself is binary, of course.

I may be completely making up that distinction, though.

In my experience "one-hot encoding" and "dummy variables" are synonymous.

In my experience "one-hot encoding" and "dummy variables" are synonymous.

Seems the scikit-learn docs would agree

The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter)

take
