Apologies if this has been submitted or considered in the past; I searched through the GitHub issues and couldn't find anything on it.
The idea is that instead of specifying all of the columns that you wish to delete from a DataFrame via the .drop method, you specify instead the columns you wish to keep through a .keep_cols method - all other columns are deleted. This would save typing in cases where there are many columns, and we only want to keep a small subset of columns. The prime use case here is method chaining, where using [[ doesn't really work in the middle of many methods being chained together.
import pandas as pd
# Create an example DataFrame
data = [
    [1, 'ABC', 4, 10, 6.3],
    [2, 'BCD', 10, 9, 11.6],
    [3, 'CDE', 7, 4, 10.0],
    [4, 'DEF', 7, 10, 5.4],
    [5, 'EFG', 2, 9, 5.3],
]
data = pd.DataFrame(data,
                    columns = ['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'])
# Just want columns Id and Rating2
new_data = data.drop(['Name', 'Rating1', 'ThisIsANumber'], axis = 1)
new_data.head()
# ** It would be nice to be able to only specify the columns we want
# ** to keep to save typing - similar to dplyr in R
def keep_cols(DataFrame, keep_these):
    """Keep only the columns [keep_these] in a DataFrame, delete
    all other columns.
    """
    drop_these = list(set(list(DataFrame)) - set(keep_these))
    return DataFrame.drop(drop_these, axis = 1)
new_data = data.pipe(keep_cols, ['Id', 'Rating2'])
new_data.head()
# In this specific example there was not much more typing between
# `.drop` and the `keep_cols` function, but often when a `DataFrame`
# has many columns this is not the case!
In this contrived example I created a keep_cols function as a rough draft of a .keep_columns method to the DataFrame object, and used the .pipe method to pipe that function to the DataFrame as if it were a method.
I don't think using [[ cuts it here. Yes, doing new_data[['Id', 'Rating2']] would work, but when method chaining, people often want to drop columns somewhere in the middle of a bunch of methods.
Just in case it's helpful, here's a good article demonstrating the power/beauty of method chaining in Pandas: https://tomaugspurger.github.io/modern-1.html.
Thanks!
I don't think using [[ cuts it here. Yes, doing new_data[['Id', 'Rating2']] would work, but when method chaining, people often want to drop columns somewhere in the middle of a bunch of methods.
so adding another method helps how?
you can certainly chain selection
I'm -1 on adding another indexing method (we have enough as is 😆). @jakesherman can you give an example where .loc or even __getitem__ doesn't work in a method chain? I think they should be valid anywhere. Note that method chaining for .loc and friends was added in 0.18.1: http://pandas.pydata.org/pandas-docs/version/0.19.0/whatsnew.html#whatsnew-0181-enhancements-method-chain
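For reference, here is a minimal sketch of selecting columns with .loc in the middle of a chain, including the callable form of .loc mentioned above. The shortened DataFrame, the Total column, and the threshold of 12 are illustrative assumptions and not part of the original example.

```python
import pandas as pd

# Shortened version of the example frame from the original post
data = pd.DataFrame(
    [
        [1, 'ABC', 4, 10, 6.3],
        [2, 'BCD', 10, 9, 11.6],
        [3, 'CDE', 7, 4, 10.0],
    ],
    columns=['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'],
)

result = (
    data
    # 'Total' is a hypothetical derived column, added only for illustration
    .assign(Total=lambda df: df['Rating1'] + df['Rating2'])
    # Keep a subset of columns mid-chain with a plain label list
    .loc[:, ['Id', 'Rating2', 'Total']]
    # Callable indexer: receives the intermediate frame, returns a boolean mask
    .loc[lambda df: df['Total'] > 12]
    .reset_index(drop=True)
)
print(result)
```

Since .loc accepts both label lists and callables, the selection step can sit anywhere between the other methods without breaking the chain.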
Note that this function already exists:
In [214]: data.filter(['Id', 'Rating2'])
Out[214]:
   Id  Rating2
0   1       10
1   2        9
2   3        4
3   4       10
4   5        9
(but it's not a much publicized method, and some are arguing to remove it)
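As a rough sketch of how .filter could stand in for the proposed keep_cols inside a chain (the shortened frame, the sort step, and the NoSuchColumn label are assumptions added for illustration):

```python
import pandas as pd

# Shortened version of the example frame from the original post
data = pd.DataFrame(
    [
        [1, 'ABC', 4, 10, 6.3],
        [2, 'BCD', 10, 9, 11.6],
        [3, 'CDE', 7, 4, 10.0],
    ],
    columns=['Id', 'Name', 'Rating1', 'Rating2', 'ThisIsANumber'],
)

kept = (
    data
    # Keep only the listed columns; all others are dropped
    .filter(items=['Id', 'Rating2'])
    # Any further method can follow in the chain
    .sort_values('Rating2')
)
print(kept)

# Labels that do not exist are ignored rather than raising an error.
# 'NoSuchColumn' is a made-up label used to show this behaviour.
partial = data.filter(items=['Id', 'NoSuchColumn'])
print(list(partial.columns))
```

One thing to watch: because .filter ignores missing labels instead of raising, a typo in a column name goes unnoticed, whereas .loc[:, [...]] raises a KeyError.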
Ahh, I didn't realize that there was a filter method! Sorry about that. I will close this ticket.