do we need label selectors? we should for sure just have a single method for this. maybe call it query_labels? to be consistent with .query as the workhorse for data selection.
.select (#17633)
.filter
xref #6599
I personally find filter a useful function (at least I have used it to good purpose in my own work) to select certain columns. See also the examples added in https://github.com/pandas-dev/pandas/pull/12399. Although it should rather be called select ...
Less sure about select. That seems less useful, certainly now loc accepts a function.
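For reference, a minimal sketch of what "loc accepts a function" looks like in practice. The frame and its labels are made up for illustration; the callable receives the DataFrame itself and returns a boolean mask.

```python
import pandas as pd

# Toy frame; labels are illustrative only.
df = pd.DataFrame(
    {"alpha": [1, 2, 3], "beta": [4, 5, 6], "gamma": [7, 8, 9]},
    index=["x1", "y1", "x2"],
)

# Row selection: the callable gets the DataFrame and returns a boolean mask.
rows = df.loc[lambda d: d.index.str.startswith("x")]

# Column selection: same idea on the second axis.
cols = df.loc[:, lambda d: d.columns.str.startswith("b")]
```

Note that the callable is handed the whole frame rather than individual labels, which is the main difference from the label-wise function that select takes.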
I think I have revised my thoughts here.
we should promote (in the doc / the-one-way-to-do-it), .select as the main label filtering function, and deprecate .filter (which ATM serves the same purpose). Maybe needs some API tweaks.
.filter is traditionally a data selection / filtering function.
They are quite different at the moment:
filter:
select:
further: .filter uses .select for regex matching in its implementation.
further: we use .filter() in .groupby() to allow a filter for group inclusion (boolean return)
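To make the groupby point concrete, here is a small sketch (toy data, made-up column names) of .filter() on a groupby: the function is called once per group and returns a single boolean deciding whether that group's rows are kept.

```python
import pandas as pd

# Toy data; "key" and "val" are illustrative names.
df = pd.DataFrame(
    {"key": ["a", "a", "b", "c", "c", "c"], "val": [1, 2, 3, 4, 5, 6]}
)

# Keep only the rows that belong to groups with at least two members.
big_groups = df.groupby("key").filter(lambda g: len(g) >= 2)
```

So here the predicate decides on data (whole groups of rows), not on axis labels.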
I have found DataFrame.filter to be useful, especially with like or regex. I have never used DataFrame.select, which feels very non-idiomatic to me.
So I would be happy to deprecate select. It's also highly confusing how GroupBy.filter works like DataFrame.select, not .filter.
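A short sketch of the three DataFrame.filter modes mentioned here (items, like, regex), using made-up column names. All three match against labels, never against the data.

```python
import pandas as pd

# One-row toy frame; column names are illustrative only.
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=["price_usd", "price_eur", "volume", "id"],
)

by_items = df.filter(items=["id", "volume"])  # exact label match
by_like = df.filter(like="price")             # substring match on each label
by_regex = df.filter(regex=r"_(usd|eur)$")    # regular-expression search on each label
```

For a DataFrame the default axis is the columns; pass axis=0 to match against the row index instead.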
> It's also highly confusing how GroupBy.filter works like DataFrame.select, not .filter.
I agree this is highly confusing. Is renaming one of those out of the question? filter is a common name for a higher-order function which filters elements based on the result of a Boolean-valued function that was passed in, exactly like GroupBy.filter, so that seems like an appropriate name for what is currently DataFrame.select. There's also Python's builtin filter function.
Another option might be merging the functionality of select and filter under one name, so it supports both list-like and function arguments.
so the problem as highlighted by @jorisvandenbossche is that .select acts on the index (which is what groupby.filter and boolean selection do). so it is a highly confusing name.
.filter is also a confusing name as it acts on the labels of columns.
We need a combined functionality of the current DataFrame.select/filter (IOW to select labels from an axis and should accept a list-like, scalar and callable, like most other functions)
signature should be something like this (default for most functions is axis=0)
def select_labels(arraylike or scalar or callable, axis=0, regex=False)
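A rough sketch of how such a function could behave, written as a free function since no select_labels method exists in pandas. The name, the parameters and every detail of the behavior below are assumptions for the sake of discussion, not an agreed design.

```python
import re

import numpy as np
import pandas as pd
from pandas.api.types import is_list_like


def select_labels(obj, labels, axis=0, regex=False):
    """Hypothetical helper: pick labels along `axis` by list-like, scalar, callable or regex."""
    index = obj.axes[axis]
    if callable(labels):
        # Callable is applied to each label and must return something truthy/falsy.
        mask = np.array([bool(labels(label)) for label in index], dtype=bool)
    elif regex:
        # With regex=True, `labels` is treated as a pattern searched in str(label).
        pattern = re.compile(labels)
        mask = np.array(
            [pattern.search(str(label)) is not None for label in index], dtype=bool
        )
    else:
        # Scalar or list-like of exact labels.
        wanted = list(labels) if is_list_like(labels) else [labels]
        mask = np.asarray(index.isin(wanted), dtype=bool)
    return obj.loc(axis=axis)[mask]


df = pd.DataFrame({"aa": [1, 2], "ab": [3, 4], "b": [5, 6]}, index=["r1", "r2"])

by_list = select_labels(df, ["aa", "b"], axis=1)
by_scalar = select_labels(df, "r2")
by_callable = select_labels(df, lambda c: c.startswith("a"), axis=1)
by_regex = select_labels(df, r"^a", axis=1, regex=True)
```

Because this sketch selects with a boolean mask, results keep the original axis order rather than the order of the requested list, and a tuple is always treated as a list of labels (which would need more thought for a MultiIndex).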
now as to what to do:
select_labels I think is a nice name (open to suggestions), though other systems (spark & sql), use .select to mean label/column selection.
.select in favor of .select_labels
.filter in favor of select_labels
@dkasak interested in taking this on?
I would suggest simply deprecating/removing select without making a replacement. Indexing is a fine alternative.
DataFrame.filter() is useful. I wish it were called select instead, both because that matches SQL and filter suggests filtering rows with a boolean expression (like filter in dplyr or Ibis), but I don't think changing the name is worth the hassle.
In general, I think we should avoid making small changes in the API for the basic grammar of data manipulation in pandas, unless we rethink things more broadly for a larger, breaking change (e.g., in pandas2).
> but I don't think changing the name is worth the hassle
sure it is - pandas is going to exist for 1.x for quite some time
better to make changes to the right spelling sooner rather than later
I am all for deprecating filter and calling it select (or select_labels)
I don't have time to handle this at the moment, but I may be interested in doing it when time permits if it hasn't been done already by then.
FWIW, upon some thought, I still think changing the name of .filter to .select* would be best. I don't feel strongly about .select vs .select_labels. I generally prefer shorter names, but the added verbosity here might make things clearer. Calling it .select has the benefit that only one name is deprecated, not two.
I'm not so sure about dropping the current behaviour of .select entirely because I have a use case which I'm not sure how to implement without it (and without resorting to things like .reset_index() to regain the ability to select by using a function).
In particular, I have a MultiIndex with 2 levels, each of which has elements of type str. In other words, each index value is conceptually a pair of strings. Currently I'm doing something like
df.select(lambda x: condition1(x[0]) and condition2(x[1]))
and similar to select particular rows. How could this be implemented without current .select functionality?
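One possible way to do this without .select, sketched under the assumption that the two conditions can be evaluated per level: build a boolean mask from each level with get_level_values and combine them. The data and the bodies of condition1 / condition2 are stand-ins, since the real ones are not shown in this thread.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("apple", "red"), ("apple", "green"), ("banana", "yellow"), ("avocado", "green")],
    names=["fruit", "color"],
)
df = pd.DataFrame({"qty": [1, 2, 3, 4]}, index=index)


# Stand-ins for the real per-level conditions.
def condition1(first):
    return first.startswith("a")


def condition2(second):
    return second == "green"


# Evaluate each condition on its own level, then combine the masks.
level0_ok = np.asarray(df.index.get_level_values(0).map(condition1), dtype=bool)
level1_ok = np.asarray(df.index.get_level_values(1).map(condition2), dtype=bool)
result = df[level0_ok & level1_ok]
```

This keeps the index intact (no .reset_index() round trip), at the cost of being wordier than the one-line .select call.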
can you show a complete example of how you are using select?
On the pandas-dev mailing list concern was raised about the deprecation of select, see https://mail.python.org/pipermail/pandas-dev/2017-November/000649.html
I think the example makes a point. For me the alternative like .loc[:, lambda df: complex_fxn_that_selects_a_few_cols(df.columns)] is harder to read and to teach than .select(complex_fxn_that_selects_a_few_cols). Which makes the deprecation of select a step backwards for those cases.
The deprecation message currently only suggests a replacement for the case axis=0.
I suggest to expand this to:
use df.loc[df.index.map(crit)] to select labels, df.loc(axis=1)[df.columns.map(crit)] to select columns.
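A minimal sketch of the two spellings from the suggested message, on a toy frame. The crit functions are made-up examples of a label-wise criterion; each one receives a single label and returns a bool.

```python
import pandas as pd

# Toy frame; labels are illustrative only.
df = pd.DataFrame(
    {"a_x": [1, 2, 3], "a_y": [4, 5, 6], "b_x": [7, 8, 9]},
    index=["r1", "s1", "r2"],
)


def row_crit(label):
    return label.startswith("r")


def col_crit(label):
    return label.endswith("_x")


# axis=0: map the criterion over the row labels and use the result as a mask.
rows = df.loc[df.index.map(row_crit)]

# axis=1: same idea over the column labels.
cols = df.loc(axis=1)[df.columns.map(col_crit)]
```

Both forms rely on Index.map returning a boolean result that loc then treats as a mask, so crit should return a plain bool for every label.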
I only just found out about this change and the doc still doesn't give guidance. For actual selection by column value, people also use numpy operators np.select(condlist, choicelist, ...) (for multiple values) and np.where(cond, valTrue, valFalse) for two values. Is that good/bad/another alternative? Witness the confusion on SO. I think the root of the issue is that pandas' select verb disagreed with what numpy and SQL select do, hence created confusion.
There's still a docbug needed on this, but first we need to know what you actually recommend.
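To illustrate the difference being asked about, a small sketch with toy data: np.where and np.select compute a new value per element from conditions on the data; they do not pick rows or columns by label, so they are not a replacement for the deprecated select.

```python
import numpy as np
import pandas as pd

# Toy data; the column name and thresholds are illustrative only.
df = pd.DataFrame({"score": [35, 60, 85]})

# np.where: choose between two values per element.
df["passed"] = np.where(df["score"] >= 50, "yes", "no")

# np.select: choose among several values; the first matching condition wins.
conditions = [df["score"] < 50, df["score"] < 80]
choices = ["low", "mid"]
df["band"] = np.select(conditions, choices, default="high")
```

The frame keeps all of its rows; only new columns of derived values are added.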