Pandas: Drop duplicate columns from DataFrame based on column values

Created on 6 Oct 2015 · 14 Comments · Source: pandas-dev/pandas

Here is sample code that finds duplicate columns in a DataFrame based on their values (useful for cleaning data):

def duplicate_columns(frame):
    # group column labels by dtype so only same-typed columns are compared
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")

        # materialise the views: dict.keys()/values() aren't indexable in Python 3
        vs = list(dcols.values())
        ks = list(dcols.keys())
        lvs = len(vs)

        for i in range(lvs):
            for j in range(i + 1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break

    return dups

I've seen others suggest something like df.T.drop_duplicates().T. However, transposing is a bad idea when working with large DataFrames: transposing a mixed-dtype frame upcasts everything to object, which is slow and memory-hungry.
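(For reference, the transpose one-liner looks like this on a toy frame; fine for small data, but it pays the transpose cost twice on anything large:)

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 5, 6]})

# transpose, drop duplicate rows (i.e. former columns), transpose back
deduped = df.T.drop_duplicates().T
print(list(deduped.columns))  # ['a', 'c']
```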

I would add a pull request, but I'm not sure I even know what that means.

Labels: Enhancement, Missing-data, Performance, Reshaping


All 14 comments

This one is more memory efficient:

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        for i in range(lcs):
            iv = vs.iloc[:,i].tolist()
            for j in range(i+1, lcs):
                jv = vs.iloc[:,j].tolist()
                if iv == jv:
                    dups.append(cs[i])
                    break

    return dups

@121onto what is the purpose of this issue?

you would need to benchmark this, with potentially mixed data, pass tests, etc.

Thanks @jreback. Here is modified code with a test:

from __future__ import print_function

import numpy as np
import pandas as pd  # needed for pd.concat in the profiling code below
from pandas import DataFrame
from pandas.core.common import array_equivalent  # NaN-aware array equality
from pandas.util.testing import assert_equal


def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        for i in range(lcs):
            ia = vs.iloc[:,i].values
            for j in range(i+1, lcs):
                ja = vs.iloc[:,j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break

    return dups


def test_duplicate_columns():

        data = {
            'objects1' : ['A', 'B', 'C', 'C', 'B', 'A'],
            'objects2' : ['A', 'B', 'C', 'B', 'B', 'A'],
            'objects_dup' : ['A', 'B', 'C', 'C', 'B', 'A'],
            'bools1' : [True, False, True, True, False, False],
            'bools2' : [True, False, True, True, True, False],
            'bools_dup' : [True, False, True, True, False, False],
            'floats1' : [1., 2., 3., 3., 2., 1.],
            'floats2' : [1., 3., 3., 5., 2., 1.],
            'floats_dup' : [1., 2., 3., 3., 2., 1.],
            'floats_withnan1' : [1., 2., 3., 3., np.nan, 1.],
            'floats_withnan2' : [1., np.nan, 3., 3., np.nan, 1.],
            'floats_withnan3' : [1., 2., 3., 3., 3., 1.],
            'floats_withnan_dup' : [1., 2., 3., 3., np.nan, 1.],
            'integers1' : [1, 2, 3, 3, 2, 1],
            'integers2' : [1, 2, 2, 2, 2, 1],
            'integers_dup' : [1, 2, 3, 3, 2, 1],
        }

        frame = DataFrame(data)
        frame = frame.sort_index(axis=1)  # DataFrame.sort() was removed; sort columns by label

        dups = duplicate_columns(frame)
        expected = [
            'objects1',
            'bools1',
            'floats1',
            'floats_withnan1',
            'integers1',
        ]

        assert_equal(sorted(dups), sorted(expected))

so what are the timings / memory figures?

A timing comparison with .T

def transpose_duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():
        tframe = frame[v].T
        dups.extend(list(tframe.loc[tframe.duplicated(),:].index))

    return dups

def profile_duplicate_columns(func, size=1):
        data = {
            'objects1' : ['A', 'B', 'C', 'C', 'B', 'A'],
            'objects2' : ['A', 'B', 'C', 'B', 'B', 'A'],
            'objects_dup' : ['A', 'B', 'C', 'C', 'B', 'A'],
            'bools1' : [True, False, True, True, False, False],
            'bools2' : [True, False, True, True, True, False],
            'bools_dup' : [True, False, True, True, False, False],
            'floats1' : [1., 2., 3., 3., 2., 1.],
            'floats2' : [1., 3., 3., 5., 2., 1.],
            'floats_dup' : [1., 2., 3., 3., 2., 1.],
            'floats_withnan1' : [1., 2., 3., 3., np.nan, 1.],
            'floats_withnan2' : [1., np.nan, 3., 3., np.nan, 1.],
            'floats_withnan3' : [1., 2., 3., 3., 3., 1.],
            'floats_withnan_dup' : [1., 2., 3., 3., np.nan, 1.],
            'integers1' : [1, 2, 3, 3, 2, 1],
            'integers2' : [1, 2, 2, 2, 2, 1],
            'integers_dup' : [1, 2, 3, 3, 2, 1],
        }

        frame = DataFrame(data)
        frame = frame.sort_index(axis=1)  # DataFrame.sort() was removed; sort columns by label

        for i in range(size):
                frame = pd.concat([frame,frame]).reset_index(drop=True)

        return func(frame)

Using timeit:

In [71]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=3)
10 loops, best of 3: 46.1 ms per loop

In [72]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=4)
10 loops, best of 3: 80.2 ms per loop

In [73]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=5)
10 loops, best of 3: 149 ms per loop

In [74]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=6)
1 loops, best of 3: 336 ms per loop

In [75]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=7)
1 loops, best of 3: 1.19 s per loop

In [76]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=8)
1 loops, best of 3: 6.35 s per loop

In [81]: %timeit profile_duplicate_columns(duplicate_columns, size=3)
10 loops, best of 3: 21.3 ms per loop

In [82]: %timeit profile_duplicate_columns(duplicate_columns, size=4)
10 loops, best of 3: 23.8 ms per loop

In [83]: %timeit profile_duplicate_columns(duplicate_columns, size=5)
10 loops, best of 3: 26.5 ms per loop

In [84]: %timeit profile_duplicate_columns(duplicate_columns, size=6)
10 loops, best of 3: 28.8 ms per loop

In [85]: %timeit profile_duplicate_columns(duplicate_columns, size=7)
10 loops, best of 3: 31.5 ms per loop

In [86]: %timeit profile_duplicate_columns(duplicate_columns, size=8)
10 loops, best of 3: 34.1 ms per loop

In [87]: %timeit profile_duplicate_columns(duplicate_columns, size=20)
1 loops, best of 3: 4.75 s per loop

Not sure how to profile memory usage, but I would be open to suggestions.
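(One stdlib way to get a memory figure is tracemalloc, which reports peak allocations between start() and the measurement; the frame shape and the planted duplicate here are made up for illustration:)

```python
import tracemalloc

import numpy as np
import pandas as pd

def transpose_dups(frame):
    # transpose-based variant, measured here for its memory footprint
    t = frame.T
    return list(t.index[t.duplicated()])

frame = pd.DataFrame(np.random.rand(10_000, 10),
                     columns=[f"c{i}" for i in range(10)])
frame["c_dup"] = frame["c0"]  # plant one duplicate column

tracemalloc.start()
found = transpose_dups(frame)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(found)                                   # ['c_dup']
print(f"peak allocation: {peak / 1e6:.1f} MB")
```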

still not clear what you are actually proposing

what you are timing and what drop_duplicates does are not the same thing

pls give a simple example

Ah, perhaps my naming convention is not the best. Maybe I should call it redundant_columns?

The functions are the same except each implements a distinct convention for picking out redundant columns: given a data frame with two identical columns 'first' and 'second', duplicate_columns will return 'first' while transpose_duplicate_columns will return 'second'.

duplicate_columns solves a practical problem. Suppose my client hands me a data set that was created by joining several tables. They may be sloppy about joining, in which case the data will contain redundant columns with potentially different labels. duplicate_columns solves this problem by creating a list of duplicative columns.

For example:

def build_frame():
        data = {
            'objects1' : ['A', 'B', 'C', 'C', 'B', 'A'],
            'objects2' : ['A', 'B', 'C', 'B', 'B', 'A'],
            'objects_dup1' : ['A', 'B', 'C', 'C', 'B', 'A'],
            'objects_dup2' : ['A', 'B', 'C', 'C', 'B', 'A']
        }

        frame = DataFrame(data)
        return frame.sort_index(axis=1)  # DataFrame.sort() was removed; sort columns by label

frame = build_frame()

frame has several redundant column pairs ('objects1', 'objects_dup1', and 'objects_dup2' are all the same). duplicate_columns finds these pairs and returns a list of redundant columns, treating the last instance as non-redundant. transpose_duplicate_columns does the same thing, except it treats the first instance as non-redundant:

In [10]: frame = build_frame()

In [11]: frame
Out[11]:
  objects1 objects2 objects_dup1 objects_dup2
0        A        A            A            A
1        B        B            B            B
2        C        C            C            C
3        C        B            C            C
4        B        B            B            B
5        A        A            A            A

In [13]: duplicate_columns(frame)
Out[13]: ['objects1', 'objects_dup1']

In [14]: transpose_duplicate_columns(frame)
Out[14]: ['objects_dup1', 'objects_dup2']
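(Either list can then be fed to drop to do the actual cleaning; a minimal sketch using the same toy data, with the labels from duplicate_columns hard-coded:)

```python
import pandas as pd

frame = pd.DataFrame({
    'objects1':     ['A', 'B', 'C', 'C', 'B', 'A'],
    'objects2':     ['A', 'B', 'C', 'B', 'B', 'A'],
    'objects_dup1': ['A', 'B', 'C', 'C', 'B', 'A'],
    'objects_dup2': ['A', 'B', 'C', 'C', 'B', 'A'],
})

# drop whichever labels the chosen function flagged as redundant
cleaned = frame.drop(columns=['objects1', 'objects_dup1'])
print(list(cleaned.columns))  # ['objects2', 'objects_dup2']
```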

@121onto

so you are taking advantage of segregated dtypes, and using array_equivalent, which is a quick way of determining equality, whereas .T.duplicated() needs to factorize things first.

Ok, so this would be ok as an axis=1 parameter for .duplicated() (and equivalently for .drop_duplicates()). Note that care must be taken with processing of the keep parameter. Further, this should return a boolean Series (like the current .duplicated()). Also, you don't need to group by dtypes; simply using array_equivalent will work.
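(A user-level sketch of the semantics being described: a boolean Series over the columns, with keep handled the way .duplicated() handles it for rows. duplicated_columns is a hypothetical helper, not a pandas API, and Series.equals stands in for array_equivalent since it is also NaN-aware:)

```python
import pandas as pd

def duplicated_columns(frame, keep="first"):
    # hypothetical sketch of df.duplicated(axis=1) semantics:
    # returns a boolean Series indexed by column label
    n = len(frame.columns)
    dup = [False] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Series.equals is NaN-aware, like array_equivalent
            if frame.iloc[:, i].equals(frame.iloc[:, j]):
                if keep == "first":
                    dup[j] = True
                elif keep == "last":
                    dup[i] = True
                else:  # keep=False marks every member of the group
                    dup[i] = dup[j] = True
    return pd.Series(dup, index=frame.columns)

df = pd.DataFrame({"a": [1, 2], "b": [1, 2], "c": [3, 4]})
print(duplicated_columns(df).tolist())                      # [False, True, False]
print(df.loc[:, ~duplicated_columns(df)].columns.tolist())  # ['a', 'c']
```

Mirroring .drop_duplicates(), the negated mask selects the surviving columns, as in the last line above.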

Further, this would need some asv perf tests.

pls submit a pull-request

Do you think it makes sense to compute and compare column sums before doing array_equivalent for data frames with many numeric columns?
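(For illustration, that pre-check might look like the following hypothetical sketch: column sums are a cheap necessary condition for equality, so the full element-wise comparison only runs within equal-sum buckets. Identical columns always produce identical sums, so no duplicates are missed:)

```python
import numpy as np
import pandas as pd

def duplicate_numeric_columns(frame):
    # hypothetical sketch: bucket numeric columns by their sum, then
    # do the expensive element-wise comparison only inside each bucket
    num = frame.select_dtypes(include=[np.number])
    sums = num.sum()
    dups = []
    for _, bucket in num.columns.to_series().groupby(sums.values):
        cols = list(bucket)
        for i in range(len(cols)):
            for j in range(i + 1, len(cols)):
                if num[cols[i]].equals(num[cols[j]]):  # NaN-aware comparison
                    dups.append(cols[i])
                    break
    return dups

df = pd.DataFrame({"x": [1., 2.], "y": [1., 2.], "z": [5., 6.]})
print(duplicate_numeric_columns(df))  # ['x']
```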

May I ask if there's any chance of this being implemented? (This thread is actually one of the top results when googling this issue / removing columns by duplicated values.) Thanks!

@ddofer this would take a community pull request. It wouldn't be very difficult, as you can start with the code above. The tricky parts would be a) writing the benchmarks (done somewhat above), and b) assuring it only minimally affects the common case (no dups).

Hi all,

any chance to resurrect this issue? What do you mean by b)?
As array_equivalent is marked as deprecated, will there be any implementation of duplicate columns? For me, this is a very common use case.

this is an open issue - would consider a pull request to implement
