Pandas: DEPR: filter & select

Created on 20 Feb 2016  ·  15 Comments  ·  Source: pandas-dev/pandas

Do we need label selectors? We should for sure have just a single method for this. Maybe call it query_labels, to be consistent with .query as the workhorse for data selection?

  • [x] .select (#17633)
  • [ ] .filter

xref #6599

API Design  ·  Deprecate  ·  Indexing  ·  Needs Discussion

All 15 comments

I personally find filter a useful function (at least I have used it to good purpose in my own work) to select certain columns. See also the examples added in https://github.com/pandas-dev/pandas/pull/12399. Although it should rather be called select ...

I'm less sure about select. That seems less useful, certainly now that loc accepts a function.

I think I have revised my thoughts here.

We should promote (in the docs, as the one way to do it) .select as the main label-filtering function, and deprecate .filter (which at the moment serves the same purpose). This may need some API tweaks.

.filter is traditionally a data selection / filtering function.

They are quite different at the moment:

  • filter:

    • acts on columns by default (for dataframe)

    • can select based on list, or simple 'like'/more advanced regex

  • select:

    • acts on index by default

    • selects based on function applied to index labels

further: .filter uses .select for regex matching in its implementation.

further: we use .filter() in .groupby() to allow a filter for group inclusion (boolean return)
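To make the differences listed above concrete, here is a small sketch of the three DataFrame.filter modes. The frame and its labels are made up for illustration. Since .select is deprecated, the select-style call is approximated with a boolean mask built from the index labels, which is an assumption about equivalent behaviour rather than the original implementation.

```python
import pandas as pd

# Illustrative data; the column and index labels are invented for this sketch.
df = pd.DataFrame(
    {"price_usd": [1.0, 2.0], "price_eur": [0.9, 1.8], "qty": [3, 4]},
    index=["apple", "banana"],
)

# filter works on column labels by default for a DataFrame.
by_items = df.filter(items=["qty", "price_usd"])  # explicit list of labels
by_like = df.filter(like="price")                 # substring match
by_regex = df.filter(regex=r"_eur$")              # regular expression match

# filter can also target the index when axis=0 is passed explicitly.
rows_like = df.filter(like="ban", axis=0)

# Select-style behaviour: apply a function to each index label.
# Written with a plain boolean mask because .select itself is deprecated.
mask = [label.startswith("a") for label in df.index]
rows_by_function = df.loc[mask]
```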

I have found DataFrame.filter to be useful, especially with like or regex. I have never used DataFrame.select, which feels very non-idiomatic to me.

So I would be happy to deprecate select. It's also highly confusing how GroupBy.filter works like DataFrame.select, not .filter.

It's also highly confusing how GroupBy.filter works like DataFrame.select, not .filter.

I agree this is highly confusing. Is renaming one of those out of the question? filter is a common name for a higher-order function which filters elements based on the result of a Boolean-valued function that was passed in, exactly like GroupBy.filter, so that seems like an appropriate name for what is currently DataFrame.select. There's also Python's builtin filter function.
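A short sketch of the naming clash described here, using invented data: Python's builtin filter and GroupBy.filter both take a Boolean-valued function, while DataFrame.filter takes no predicate and matches labels instead.

```python
import pandas as pd

# Illustrative data; the column names are invented for this sketch.
df = pd.DataFrame(
    {"team": ["a", "a", "b"], "score": [1, 2, 10], "team_size": [2, 2, 1]}
)

# Builtin filter: keep the elements for which the predicate is true.
evens = list(filter(lambda n: n % 2 == 0, [1, 2, 3, 4]))

# GroupBy.filter: keep the groups for which the predicate is true.
big_groups = df.groupby("team").filter(lambda group: len(group) >= 2)

# DataFrame.filter: no predicate at all; it matches column labels.
team_columns = df.filter(like="team")
```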

Another option might be merging the functionality of select and filter under one name, so it supports both list-like and function arguments.

So the problem, as highlighted by @jorisvandenbossche, is that .select acts on the index (which is what groupby.filter and boolean selection do), so it is a highly confusing name.

.filter is also a confusing name as it acts on the labels of columns.

We need the combined functionality of the current DataFrame.select/filter (in other words, a way to select labels from an axis that accepts a list-like, a scalar, or a callable, like most other functions).

signature should be something like this (default for most functions is axis=0)

def select_labels(arraylike or scalar or callable, axis=0, regex=False)
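As a rough sketch of what such a combined method could look like, here is a free function with the proposed signature. It is hypothetical: select_labels is not a pandas API, the parameter name selector is invented, and the handling of each input type is only one possible reading of the proposal.

```python
import re

import pandas as pd


def select_labels(obj, selector, axis=0, regex=False):
    """Hypothetical helper, NOT part of pandas: select labels along an axis.

    selector may be a callable applied to each label, a regex pattern
    (when regex=True), a list-like of labels, or a single scalar label.
    """
    labels = obj.index if axis == 0 else obj.columns
    if callable(selector):
        mask = [bool(selector(label)) for label in labels]
    elif regex:
        pattern = re.compile(selector)
        mask = [pattern.search(str(label)) is not None for label in labels]
    else:
        # Assumption: tuples count as list-likes here, so MultiIndex
        # scalar labels are not handled by this sketch.
        wanted = list(selector) if pd.api.types.is_list_like(selector) else [selector]
        mask = list(labels.isin(wanted))
    return obj.loc[mask] if axis == 0 else obj.loc[:, mask]


# Illustrative usage with invented labels.
df = pd.DataFrame({"ab": [1, 2], "cd": [3, 4]}, index=["x", "y"])
one_row = select_labels(df, "x")
some_cols = select_labels(df, lambda name: name.startswith("a"), axis=1)
regex_cols = select_labels(df, "^c", axis=1, regex=True)
```

Note that this sketch keeps the axis order of the matched labels, whereas filter(items=...) follows the order of the items passed in; which one the real method should do would be part of the API discussion.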

now as to what to do:

  • select_labels is a nice name, I think (open to suggestions), though other systems (Spark and SQL) use .select to mean label/column selection.
  • deprecate .select in favor of .select_labels
  • deprecate .filter in favor of select_labels

@dkasak interested in taking this on?

I would suggest simply deprecating/removing select without making a replacement. Indexing is a fine alternative.

DataFrame.filter() is useful. I wish it were called select instead, both because that matches SQL and filter suggests filtering rows with a boolean expression (like filter in dplyr or Ibis), but I don't think changing the name is worth the hassle.

In general, I think we should avoid making small changes in the API for the basic grammar of data manipulation in pandas, unless we rethink things more broadly for a larger, breaking change (e.g., in pandas2).

but I don't think changing the name is worth the hassle

sure it is - pandas is going to exist for 1.x for quite some time

better to make changes to the right spelling sooner rather than later

I am all for deprecating filter and calling it select (or select_labels)

I don't have time to handle this at the moment, but I may be interested in doing it when time permits if it hasn't been done already by then.

FWIW, upon some thought, I still think changing the name of .filter to .select* would be best. I don't feel strongly about .select vs .select_labels. I generally prefer shorter names, but the added verbosity here might make things clearer. Calling it .select has the benefit that only one name is deprecated, not two.

I'm not so sure about dropping the current behaviour of .select entirely because I have a use case which I'm not sure how to implement without it (and without resorting to things like .reset_index() to regain the ability to select by using a function).

In particular, I have a MultiIndex with 2 levels, each of which has elements of type str. In other words, each index value is conceptually a pair of strings. Currently I'm doing something like

df.select(lambda x: condition1(x[0]) and condition2(x[1]))

and similar to select particular rows. How could this be implemented without current .select functionality?
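For reference, one possible way to express this kind of selection without .select is to map the function over the index tuples and use the result as a boolean mask. The data and the bodies of condition1 and condition2 below are invented stand-ins, so this is a sketch of the pattern rather than the actual use case.

```python
import pandas as pd

# Illustrative two-level index of string pairs.
index = pd.MultiIndex.from_tuples(
    [("north", "red"), ("north", "blue"), ("south", "red")],
    names=["region", "color"],
)
df = pd.DataFrame({"value": [1, 2, 3]}, index=index)


def condition1(region):
    # Hypothetical stand-in for the first-level condition.
    return region.startswith("n")


def condition2(color):
    # Hypothetical stand-in for the second-level condition.
    return color == "red"


# Each index entry is a tuple, so x[0] and x[1] are the two level values.
mask = df.index.map(lambda x: condition1(x[0]) and condition2(x[1]))
selected = df.loc[mask]
```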

Can you show a complete example of how you are using select?

On the pandas-dev mailing list, concern was raised about the deprecation of select; see https://mail.python.org/pipermail/pandas-dev/2017-November/000649.html

I think the example makes a point. For me, an alternative like .loc[:, lambda df: complex_fxn_that_selects_a_few_cols(df.columns)] is harder to read and to teach than .select(complex_fxn_that_selects_a_few_cols), which makes the deprecation of select a step backwards for those cases.
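A runnable sketch of the .loc spelling mentioned above. The body of complex_fxn_that_selects_a_few_cols is a made-up placeholder that returns a boolean mask over the column labels; the real function from the mailing list example is not shown in this thread.

```python
import pandas as pd

# Illustrative data; the column names are invented for this sketch.
df = pd.DataFrame({"a_x": [1], "a_y": [2], "b_x": [3]})


def complex_fxn_that_selects_a_few_cols(columns):
    # Placeholder body (an assumption): keep columns whose name ends in "_x".
    return columns.str.endswith("_x")


# .loc accepts a callable; it receives the frame and returns the selector.
selected = df.loc[:, lambda frame: complex_fxn_that_selects_a_few_cols(frame.columns)]
```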

The deprecation message currently only suggests a replacement for the case axis=0.

I suggest expanding this to:

use df.loc[df.index.map(crit)] to select labels, df.loc(axis=1)[df.columns.map(crit)] to select columns.
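The two suggested spellings can be tried on a toy frame like the one below. The data and the two crit functions are invented for illustration.

```python
import pandas as pd

# Illustrative data with integer row labels and string column labels.
df = pd.DataFrame({"alpha": [1, 2, 3], "beta": [4, 5, 6]}, index=[10, 11, 12])


def crit_rows(label):
    # Hypothetical criterion on index labels: keep even labels.
    return label % 2 == 0


def crit_cols(label):
    # Hypothetical criterion on column labels: keep names starting with "a".
    return label.startswith("a")


rows = df.loc[df.index.map(crit_rows)]            # axis=0 replacement
cols = df.loc(axis=1)[df.columns.map(crit_cols)]  # axis=1 replacement
```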

I only just found out about this change, and the docs still don't give guidance. For actual selection by column value, people also use the NumPy functions np.select(condlist, choicelist, ...) (for multiple values) and np.where(cond, valTrue, valFalse) (for two values). Is that good, bad, or just another alternative? Witness the confusion on SO. I think the root of the issue is that the pandas select verb disagreed with what NumPy and SQL select do, and hence created confusion.
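For comparison, here is a small sketch of the two NumPy functions mentioned. Note that both build new values elementwise from conditions, which is a different operation from selecting rows or columns by label; the data and category names are invented.

```python
import numpy as np
import pandas as pd

# Illustrative data.
df = pd.DataFrame({"score": [35, 60, 85]})

# np.where: choose between two values per element.
passed = np.where(df["score"] >= 50, "pass", "fail")

# np.select: choose among several values; the first matching condition wins.
grade = np.select(
    [df["score"] >= 80, df["score"] >= 50],
    ["high", "mid"],
    default="low",
)
```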

There's still a docbug needed on this, but first we need to know what you actually recommend.
