Pandas: Feature Request: Keep only these columns (vs. dropping all the ones you don't want)

Created on 8 Nov 2016 · 4 Comments · Source: pandas-dev/pandas

Apologies if this has been submitted or considered in the past. I searched through the GitHub issues and couldn't find anything about it.

The idea is that instead of specifying all of the columns that you wish to delete from a DataFrame via the .drop method, you specify instead the columns you wish to keep through a .keep_cols method - all other columns are deleted. This would save typing in cases where there are many columns, and we only want to keep a small subset of columns. The prime use case here is method chaining, where using [[ doesn't really work in the middle of many methods being chained together.

A small, complete example of the issue

import pandas as pd

# Create an example DataFrame
data = [
    [1, 'ABC', 4, 10, 6.3],
    [2, 'BCD', 10, 9, 11.6],
    [3, 'CDE', 7, 4, 10.0],
    [4, 'DEF', 7, 10, 5.4],
    [5, 'EFG', 2, 9, 5.3],
]
data = pd.DataFrame(data, 
    columns = ['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'])

# Just want columns Id and Rating2
new_data = data.drop(['Name', 'Rating1', 'ThisIsANumber'], axis = 1)
new_data.head()

# ** It would be nice to be able to only specify the columns we want 
# ** to keep to save typing - similar to dplyr in R             

def keep_cols(DataFrame, keep_these):
    """Keep only the columns [keep_these] in a DataFrame, delete
    all other columns. 
    """
    drop_these = list(set(list(DataFrame)) - set(keep_these))
    return DataFrame.drop(drop_these, axis = 1)

new_data = data.pipe(keep_cols, ['Id', 'Rating2'])
new_data.head()

# In this specific example there was not much more typing between
# `.drop` and the `keep_cols` function, but often when a `DataFrame`
# has many columns this is not the case!

In this contrived example I created a keep_cols function as a rough draft of a .keep_cols method on the DataFrame object, and used the .pipe method to apply that function to the DataFrame as if it were a method.
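To make the method-chaining use case concrete, here is a minimal sketch of the same keep_cols helper used in the middle of a longer chain. The filtering step and the Rating2Doubled column are hypothetical, added only to give the chain something to do before and after the column selection.

```python
import pandas as pd


def keep_cols(df, keep_these):
    """Keep only the columns listed in keep_these; drop all others."""
    wanted = set(keep_these)
    drop_these = [col for col in df.columns if col not in wanted]
    return df.drop(drop_these, axis=1)


data = pd.DataFrame(
    [
        [1, 'ABC', 4, 10, 6.3],
        [2, 'BCD', 10, 9, 11.6],
        [3, 'CDE', 7, 4, 10.0],
        [4, 'DEF', 7, 10, 5.4],
        [5, 'EFG', 2, 9, 5.3],
    ],
    columns=['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'],
)

result = (
    data
    .query('Rating1 >= 7')                # hypothetical row filter
    .pipe(keep_cols, ['Id', 'Rating2'])   # column selection mid-chain
    .assign(Rating2Doubled=lambda d: d['Rating2'] * 2)  # hypothetical follow-up step
)
print(result)
```

Because the helper drops the unwanted columns rather than selecting the wanted ones, the surviving columns keep the order they had in the original DataFrame, not the order given in keep_these.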

I don't think using [[ cuts it here. Yes, doing new_data[['Id', 'Rating2']] would work, but when method chaining, people often want to drop columns somewhere in the middle of a bunch of methods.

Just in case it's helpful, here's a good article demonstrating the power/beauty of method chaining in Pandas: https://tomaugspurger.github.io/modern-1.html.

Thanks!

jakesherman

All 4 comments

I don't think using [[ cuts it here. Yes, doing new_data[['Id', 'Rating2']] would work, but when method chaining, people often want to drop columns somewhere in the middle of a bunch of methods.

so adding another method helps how?
you can certainly chain selection

jreback on 8 Nov 2016

I'm -1 on adding another indexing method (we have enough as is 😆). @jakesherman can you give an example where .loc or even __getitem__ doesn't work in a method chain? I think they should be valid anywhere. Note that method chaining for .loc and friends was added in 0.18.1: http://pandas.pydata.org/pandas-docs/version/0.19.0/whatsnew.html#whatsnew-0181-enhancements-method-chain

TomAugspurger on 8 Nov 2016
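As an illustration of the point about .loc in chains, the sketch below selects columns mid-chain with .loc, once with a plain list of labels and once with a callable row mask. The RatingSum column and the Rating2 threshold are hypothetical, chosen only to make the chains non-trivial.

```python
import pandas as pd

data = pd.DataFrame(
    [
        [1, 'ABC', 4, 10, 6.3],
        [2, 'BCD', 10, 9, 11.6],
        [3, 'CDE', 7, 4, 10.0],
        [4, 'DEF', 7, 10, 5.4],
        [5, 'EFG', 2, 9, 5.3],
    ],
    columns=['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'],
)

# Select columns in the middle of a chain with .loc and a list of labels.
chained = (
    data
    .assign(RatingSum=lambda d: d['Rating1'] + d['Rating2'])  # hypothetical derived column
    .loc[:, ['Id', 'RatingSum']]
    .sort_values('RatingSum', ascending=False)
)

# The callable form receives the intermediate DataFrame, so rows and
# columns can be selected together without naming a temporary variable.
high = (
    data
    .loc[lambda d: d['Rating2'] >= 9, ['Id', 'Rating2']]
    .reset_index(drop=True)
)

print(chained)
print(high)
```

Plain [[...]] indexing can be placed in the same position as the .loc step in the first chain; the callable form is mainly useful when the selection depends on columns created earlier in the chain.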

Note that this function already exists:

In [214]: data.filter(['Id', 'Rating2'])
Out[214]: 
   Id  Rating2
0   1       10
1   2        9
2   3        4
3   4       10
4   5        9

(but it's not a well-publicized method, and some are arguing to remove it)

jorisvandenbossche on 8 Nov 2016

👍9
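For reference, a small sketch of filter used inside a chain, alongside one behavioral difference from [[...]] that seems worth knowing: filter skips requested labels that are not present, whereas [[...]] is expected to raise a KeyError. The row filter and the NoSuchColumn label are hypothetical and exist only for the demonstration.

```python
import pandas as pd

data = pd.DataFrame(
    [
        [1, 'ABC', 4, 10, 6.3],
        [2, 'BCD', 10, 9, 11.6],
        [3, 'CDE', 7, 4, 10.0],
        [4, 'DEF', 7, 10, 5.4],
        [5, 'EFG', 2, 9, 5.3],
    ],
    columns=['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'],
)

# filter keeps only the listed columns and fits naturally in a chain.
kept = (
    data
    .query('Rating2 >= 9')          # hypothetical row filter
    .filter(['Id', 'Rating2'])
    .reset_index(drop=True)
)

# filter ignores labels that do not exist in the DataFrame ...
lenient = data.filter(['Id', 'NoSuchColumn'])

# ... while [[...]] with a missing label raises KeyError.
try:
    data[['Id', 'NoSuchColumn']]
except KeyError:
    strict_raises = True
else:
    strict_raises = False

print(kept)
print(list(lenient.columns), strict_raises)
```

The lenient behavior is convenient in pipelines where a column may or may not be present, but it can also hide a misspelled column name, so [[...]] or .loc may be the safer choice when every requested column is required.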

Ahh, I didn't realize that there was a filter method! Sorry about that. I will close this ticket.

jakesherman on 8 Nov 2016
