xref #18202
We have some cruft, let's deprecate it (I have noted some which already have an issue associated).
Series.compress() (#21930)
Series.imag / Series.real (#27106)
Series.nonzero() (#24048)
Series.put() (#27106)
Series.itemsize
Series.flags
Series.strides
Series.first()
Series.last()
Series.swapaxes()
MultiIndex.to_hierarchical (only used for Panel) (#21613)
Series/DataFrame.compound() (#26405)
Series.ptp() (#21614)
Series.from_array (#18213)
Series.valid() (#18800)
Series/DataFrame.slice_shift() (#37601)
Series/DataFrame.tshift() (https://github.com/pandas-dev/pandas/pull/34545)
Series/DataFrame.get_values() (https://github.com/pandas-dev/pandas/issues/19617)
Index.dtype_str (#27106)
Index.summary() (#18217)
.get_ftype_counts (#18243) (#20404)
.get_dtype_counts #27145
Index/Series.asobject (#18237) (#18572)
Index.to_native_types() (make private) (https://github.com/pandas-dev/pandas/pull/36418)
DataFrame/Series.as_matrix (#18458)
.clip_upper/.clip_lower (replace by .clip) (#24203)
.ftypes (#18243) (#26744)
.xs() (#6249)
.iat/.at
.take
.lookup (#35224)
DataFrame.from_items
Series/DataFrame.add_prefix/add_suffix (https://github.com/pandas-dev/pandas/pull/18347)
NDFrame.filter
cc @jorisvandenbossche @TomAugspurger if any comments / objections pls note and I will update the top section.
I think I'm -1 on deprecating xs.
-0 on deprecating ptp
What's the alternative to tshift? I think that's sometimes useful when a shift won't quite work.
I was just writing up a similar issue :-) (but only for Series, as by fixing the api docs and docstrings, I bumped into quite some methods unknown to me)
Overview of methods that could be considered for removal (note that this list is very long, and are methods that I would not miss if they are gone, which does not mean that they are not useful to others, it's just for stirring discussion):
Related to the original ndarray subclassing:
Series.compress()
Series.flags
Series.imag / Series.real
Series.item()
Series.itemsize
Series.nonzero()
Series.put()
Series.strides
Series.ptp
Time series specific ones (the question here is if they are all worth it as method, while very specific in application):
Series.at_time()
Series.first()
Series.last()
Series.between_time()
Series.tshift()
Finance? specific
Series.compound()
Other:
Series.asobject
Series.as_matrix
Series.between()
Series.first_valid_index() / Series.last_valid_index()
Series.from_array()
Series.slice_shift()
Series.swapaxes()
Series.truncate()
Series.valid()
What's the alternative to tshift? I think that's sometimes useful when a shift won't quite work.
shift already seems to have a freq keyword as well, and it dispatches to tshift if freq is specified
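A minimal sketch of the freq point, using a small daily series made up for illustration: with freq, shift moves the index and keeps every value, which is the tshift-style behavior being asked about.

```python
import pandas as pd

# Illustrative daily series (not from the thread).
s = pd.Series([1, 2, 3], index=pd.date_range("2017-01-01", periods=3, freq="D"))

# Plain shift moves the values and leaves the index alone,
# so the first slot becomes NaN.
shifted_values = s.shift(1)

# Passing freq moves the index by one day instead and keeps all values.
shifted_index = s.shift(1, freq="D")
```

The second form loses no data at the edges, which is probably the case where "a shift won't quite work".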
Note that my above list is very long. The more obvious ones to me that are not yet in the list in the top post are: as_matrix (and maybe swapaxes?)
Could .add_prefix and .add_suffix be added to the deprecation list?
The dataframe/Series namespace is huge and cutting down can make the API easier to grasp. I would also think it more logical and idiomatic to operate directly on the columns, rather than on the dataframe.
@topper-123 added, feel free to submit PR's for any of these!
@jorisvandenbossche @jreback I would love to submit PRs for this issue. How can I go about deprecating these APIs?
see for example https://github.com/pandas-dev/pandas/pull/18258
@ManrajGrover Best first post a comment here with which one you would start doing, as I think not all those listed above are uncontroversial.
I think .compound, while not very useful ATM, could be more useful if it was cumulative, i.e. just use .cumprod instead of .prod and return a series:
>>> s = pd.Series([0.2, 0.2, 0.2])
>>> s.compound()  # proposed behavior, essentially the same as (s + 1).cumprod() - 1
0    0.200
1    0.440
2    0.728
dtype: float64
The above would play excellently together with .pct_change, so data.pct_change().compound() would read really well and be very useful in many use cases.
Any opinions on whether .compound could be changed like above rather than deprecated? If .compound returns a scalar as today, I agree it should be deprecated.
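For comparison, a rough sketch of the scalar and cumulative forms built from existing methods only. The price series is hypothetical and chosen so that each period returns about 20%.

```python
import pandas as pd

# Hypothetical prices; each step is roughly a 20% gain.
prices = pd.Series([100.0, 120.0, 144.0, 172.8])
returns = prices.pct_change().dropna()

# Scalar form (the .prod variant): one total return for the window.
total = (returns + 1).prod() - 1

# Cumulative form proposed above (the .cumprod variant): one value per period.
cumulative = (returns + 1).cumprod() - 1
```

The last element of the cumulative series matches the scalar, so the cumulative form carries strictly more information.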
I've started a PR for add_prefix and add_suffix.
I will take on Series.asobject and NDFrame.as_matrix next, unless @ManrajGrover wants to do them, in which case you'll be welcome.
And yes, I like to remove superfluous methods that start with a, as these are so visible when tab-completing in the REPL :-)
Although I never use add_prefix / add_suffix, I think they are used quite a bit (looking at the number of Stack Overflow questions), so I am not yet fully convinced they are ok to deprecate.
So I would rather start with the others like asobject, as_matrix, valid, tshift, ...
as_matrix is also used a bit (more than the others mentioned in the list above), but because it is a very confusing name for what it does, I think it would be good to deprecate.
@jorisvandenbossche I can start with Index.summary() for now and pick the next one from the following list:
Index.dtype_str
.ftypes/.get_ftype_counts (#18243)
Index/Series.asobject (#18237)
Index.to_native_types() (make private)
@jorisvandenbossche, what about removing a prefix/suffix or doing any other transformation you'd want to do on an index? My point is that .add_prefix/.add_suffix are way too specialized methods, and pandas should have methods that are more generally useful. .rename is great in that respect, and should be the canonical method for changing axis values.
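To make the .rename argument concrete, here is a small sketch with an illustrative two-column frame. Adding a prefix and removing it again are both just a callable passed to rename.

```python
import pandas as pd

# Illustrative frame; the column names and the "x_" prefix are arbitrary.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Same column labels as df.add_prefix("x_").
prefixed = df.rename(columns=lambda name: "x_" + name)

# The reverse transformation, which add_prefix/add_suffix cannot express.
unprefixed = prefixed.rename(
    columns=lambda name: name[2:] if name.startswith("x_") else name
)
```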
I've already made a proposal for .add_prefix/.add_suffix (#18347), so that can wait to see what the agreement will be on that. I would appreciate input though, if you see anything obvious, as that is the first deprecation PR I've made.
In the same vein, the difference between .as_matrix and .values is minuscule, to the point where df[columns].values is the same as df.as_matrix(columns). pandas will be cleaner and leaner with only one way to achieve such a common result (obviously so, IMO...).
I'll make a PR for as_matrix, as there seems to be agreement on that.
My point is that .add_prefix/.add_suffix are way too specialized methods, and pandas should have methods that are more generally useful. .rename is great in that respect, and should be the canonical method for changing axis values.
I completely agree with this. But, you also have the fact that people are using it and thus a removal will cause inconvenience / break code. So it is always a balance between both.
I am certainly +1 on deprecating as_matrix. As you say this is almost exactly the same as .values, and although I think this method is also used quite a bit (the argument I use for add_prefix ..), it's an awfully confusing name, so that's for me an extra reason to deprecate it.
Here are my deprecation suggestions:
read_table deprecated. It's the exact same as read_csv with tab delimiter
get_dtype_counts/get_ftype_counts - these are just convenience for DataFrame.dtypes.value_counts
remove the indexers iat/at. They give a small performance boost for an increase in API complexity
iterrows/itertuples to iterate over rows
lookup and take - other indexers do the same thing
combine - never used it and almost no use on SO. Looks to do nothing more than DataFrame.add
applymap - should do the same thing with apply and then map inside of it
agg not aggregate
clip_upper and clip_lower and keep DataFrame.clip for both
add_prefix/add_suffix into one method
One of the biggest issues are the methods that work only with DataFrames with a DatetimeIndex - first, last, truncate, at_time, between_time, to_period, to_timestamp - These could be removed or put in an accessor
Remove reindex_axis and reindex_like in favor of just reindex
remove isna - it's an alias to isnull
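As an illustration of the get_dtype_counts item, a short sketch of the DataFrame.dtypes.value_counts spelling. The frame is made up, and the dtypes are set explicitly so the counts are predictable.

```python
import pandas as pd

# Illustrative frame with explicit dtypes.
df = pd.DataFrame({
    "a": pd.Series([1, 2], dtype="int64"),
    "b": pd.Series([1.5, 2.5], dtype="float64"),
    "c": pd.Series([3, 4], dtype="int64"),
})

# Number of columns per dtype, the information get_dtype_counts provided.
counts = df.dtypes.value_counts()

# Plain dict keyed by dtype name, for easier inspection.
by_name = {str(dtype): int(n) for dtype, n in counts.items()}
```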
remove the indexers iat/at. They give a small performance boost for an increase in API complexity
these are convenience methods, not sure they add much to API burden
take
this is a very common notion and is a very array-like method
remove isna - its an alias to isnull
this was just added for compat with dropna, fillna, see the pattern :>, so if anything we would remove isnull, but that has been in the API so long that it may well nigh be impossible to actually remove (and more to the point very annoying).
Remove reindex_axis and reindex_like in favor of just reindex
reindex_axis is already deprecated.
One of the biggest issues are the methods that work only with DataFrames with a DatetimeIndex - first, last, truncate, at_time, between_time, to_period, to_timestamp - These could be removed or put in an accessor
to_period and to_timestamp on a series do something else than the methods in the .dt accessor. The former work on the index, the latter on the values. So it's not possible to just move them.
But on the other datetime-related I agree, I also find it a bit unfortunate that those exist (certainly first and last are very confusing in naming)
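The index-versus-values distinction for to_period can be seen in a small sketch. The series below is invented for illustration and has datetimes in both its index and its values.

```python
import pandas as pd

# Illustrative series: datetime index AND datetime values.
s = pd.Series(
    pd.date_range("2017-01-31", periods=2, freq="D"),
    index=pd.date_range("2017-01-01", periods=2, freq="D"),
)

# Series.to_period converts the index; the values stay timestamps.
on_index = s.to_period("M")

# The .dt accessor converts the values; the index stays a DatetimeIndex.
on_values = s.dt.to_period("M")
```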
@jreback There are only a total of 13 occurrences of df.take in all of Stack Overflow and in my opinion should never be used.
df.iat/.at are probably too entrenched in legacy code to remove but they provide no extra functionality. Indexing is the most confusing aspect to pandas and the less the better. Maybe a better design would have been to do df.loc(type='scalar')['row', 'col']
I guess there is no going back on isna/isnull but I really dislike having methods that are aliases of one another.
@jorisvandenbossche I wasn't being clear, but all those DataFrame/Series methods that only work on DatetimeIndexes could be put in their own accessor (not .dt), but I don't think even that would be a good idea. Perhaps just deprecating all of them would be best.
@jreback There are only a total of 13 occurrences of df.take in all of Stack Overflow and in my opinion should never be used.
.take() is a common name for array-like things.
df.iat/.at are probably too entrenched in legacy code to remove but they provide no extra functionality. Indexing is the most confusing aspect to pandas and the less the better. Maybe a better design would have been to do df.loc(type='scalar')['row', 'col']
your suggestion is much less readable
I guess there is no going back on isna/isnull but I really dislike having methods that are aliases of one another.
sure, but isnull is even more entrenched than anything else.
df.iat/.at are probably too entrenched in legacy code to remove but they provide no extra functionality. Indexing is the most confusing aspect to pandas and the less the better. Maybe a better design would have been to do df.loc(type='scalar')['row', 'col']
As indexing is simplified & improved, the speed diff between .loc and .at should fall (a lot of the time it's a very similar function). Then deprecating .at will cause less strife
DataFrames have a method .boxplot. I would assume this should be deprecated and people should use .plot.box instead?
yes i think there is an issue about this
The two methods aren’t quite equivalent iirc.
Does anyone have any objections to a PR to deprecate from_items? There is already an issue discussing its deprecation #17320. It seems that any of its functionality can be replicated in DataFrame(dict(..)) or similar.
Also xref #17312, #4916
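A minimal sketch of the DataFrame(dict(..)) replacement, assuming the input is a list of (column name, values) pairs with unique names. The data is made up.

```python
import pandas as pd

# Illustrative (name, values) pairs of the kind from_items accepted.
items = [("a", [1, 2]), ("b", [3, 4])]

# dict() keeps insertion order on Python 3.7+, so the column order survives.
# Assumption: names are unique; duplicates would collapse in the dict.
df = pd.DataFrame(dict(items))
```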
boxplot has some differences to df.plot.box. One issue is here - #15079. boxplot also defaults to making a matplotlib grid (those background dotted lines on the tick marks).
Also, clip_upper and clip_lower have to be up there on easiest and least likely to affect others.
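For reference, a short sketch of how .clip covers both one-sided cases through its lower and upper keywords. The values are arbitrary.

```python
import pandas as pd

s = pd.Series([-2, 0, 5])

# Cap from above only (what clip_upper did).
capped = s.clip(upper=1)

# Cap from below only (what clip_lower did).
floored = s.clip(lower=0)

# Both bounds in one call.
both = s.clip(lower=0, upper=1)
```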
@reidy-p DataFrame.from_items looks like a good candidate. I only see 6 occurrences on SO.
This might be unpopular, but I don't like the fact that the indexing operator selects rows when given a slice but columns when given a single label or list of labels.
Selects columns
>>> df['col1']
>>> df[['col1', 'col2']]
All of a sudden selects rows?
>>> df[:5]
Lots of users will find this behavior confusing.
I think it actually might be simplest to deprecate the indexing operator altogether and only have .loc and .iloc.
@tdpetrou let's keep this issue focused on discussing "potential methods that can be deprecated because they are hardly used / very awkward / internals".
Indexing is a whole other can of worms (see eg https://github.com/pandas-dev/pandas/issues/9595), the behaviour you mention can maybe be confusing but is well established and changing this will break a gigantic amount of user code (but feel free to open a separate issue if you want)
Thanks @jorisvandenbossche. I added my comment there.
Another source of confusion for users is the ability of Series.map to take a function. Series.apply can also accept a function as well as other arguments. There is no reason for Series.map to accept a function, and it's better served to do exactly one thing - and that is to literally map one value to another.
@tdpetrou, Series.map(func) is similar to the Python built-in map(func, iterable), except it's object-oriented. It's as intended and expected.
@topper-123 Yes, I know it has a similar functionality to the built-in map function, but it also accepts a dictionary. This is a bad idea and cause for lots of confusion. Users see map and apply and have no idea what the difference between them is. It makes much more sense to have it literally map one value to another with a dictionary/Series as its only functionality.
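The overlap being debated looks roughly like this on a toy series (values chosen for illustration): map accepts either a lookup table or a function, and with a function it gives the same result as apply.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])

# Dictionary lookup: keys missing from the dict become NaN.
by_dict = s.map({"a": 1, "b": 2})

# Function form: applied element by element.
by_func = s.map(str.upper)

# Series.apply with the same function gives the same values.
by_apply = s.apply(str.upper)
```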
IMO DataFrame.filter is confusingly named, and is easily confused with the similarly named DataFrame.groupby(...).filter when googling etc.
I propose that DataFrame.filter be deprecated and a similarly functioning DataFrame.select_filter be added. By doing this rename, the relation to select_dtypes is emphasized.
It could also be named just select (shorter), but that means that the change will have to wait until the current deprecated select method is removed.
I found the OP from @TomAugspurger in #21894 quite important, and since that issue is closed now, I quote it here:
As a reminder, the plan is to have no new deprecations in 0.25.x and 1.0.0. So this [v0.24] is the last round of deprecations before 1.0.
In this context, I'd like to bring up for discussion the following two issues: #21950 #21951
Finally, I'll repeat a comment I made in the other thread:
Most likely too late to the game, but for completeness I'd like to add: if #21855 #21858 are solved for v0.24, then combine_first could be deprecated at the same time, see #21859.
Hi, I am also -1 for deprecating xs().
I read most of the discussion in the related issue.
But as someone who's using pandas from time to time, the sheer amount of capacity of loc is kind of confusing me.
Especially when I am using a multi index with different levels. I tend to remember the levels not in their order of hierarchy but by their names. Selection by loc becomes a hassle when I want to slice third and fifth level in the hierarchy, because most of the time I confuse the third with the second one, or fourth with the fifth etc.
I don't lose much time on it, but still it is a little inefficient compared to where I can simply pass a value and the name of the level to a function.
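A sketch of the difference described here, on a made-up three-level index with hypothetical level names: xs selects by level name, while the loc spelling depends on knowing each level's position.

```python
import pandas as pd

# Illustrative MultiIndex; the level names are hypothetical.
idx = pd.MultiIndex.from_product(
    [["x", "y"], [1, 2], ["p", "q"]],
    names=["first", "second", "third"],
)
df = pd.DataFrame({"v": range(8)}, index=idx)

# By name: no need to remember where "third" sits in the hierarchy.
# The selected level is dropped from the result by default.
by_name = df.xs("q", level="third")

# By position: one slice(None) per preceding level, in the right order.
by_position = df.loc[(slice(None), slice(None), "q"), :]
```

Both select the same rows. The loc form keeps all three levels in the result, and it needs a sorted index for this kind of slicing.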
Why on earth would you deprecate read_table? That makes no damn sense.
The suggested change is to call read_csv to read things that are not comma-separated? This is 100% backwards.
One could argue why not deprecate read_csv() instead of read_table() since table sounds more flexible.
Edit:
I have to agree with @jimmywan here, and if they are basically the same, why not at least keep it as an alias? One could always wrap it, but people would not be confused or avoid updating.
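For anyone weighing the two spellings, a small sketch using an in-memory tab-separated string (made up for illustration): read_table and read_csv with sep="\t" parse it to the same frame.

```python
import io

import pandas as pd

# Tiny tab-separated payload, invented for this example.
data = "a\tb\n1\t2\n3\t4\n"

from_table = pd.read_table(io.StringIO(data))
from_csv = pd.read_csv(io.StringIO(data), sep="\t")
```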
DataFrame.where and DataFrame.mask are duals, but their names don't indicate that. Perhaps deprecate mask? since mask is just where(~cond), IIUC. Alternatively rename to where_not.
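The duality can be checked on a toy frame (values are arbitrary): masking where a condition holds gives the same result as keeping where its negation holds.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, -2], "b": [-3, 4]})
cond = df > 0

# mask replaces entries where cond is True with NaN ...
masked = df.mask(cond)

# ... which is the same as keeping entries where cond is False.
kept = df.where(~cond)
```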
cc @jbrockmendel here is my original list :->
Would it be possible to get a rationale for why .item() has been deprecated, or a suggested alternative? I really appreciated this feature as it would assume one and only one match to a query/column combo, i.e. df.query("Country=='%s'" % country)['CapitalCity'].unique().item().
Using the next(iter(...), None) approach suggested in the link below is very readable, but it breaks the assumption that there are not multiple matches to a query and returns only the first value, meaning I'd need to add a check that the query length is 1 prior to extracting the value.
https://stackoverflow.com/questions/57390363/pandas-item-has-been-deprecated
.unique() currently returns an ndarray so .item() would still be valid
this is for .item() on Series (and Index)
Thanks! I'll put in unique() to all df.query("row_name=='var'")[col_name] as a midfix to resolve the deprecation warning.
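A sketch of that workaround on a made-up lookup table. The columns are forced to object dtype here so that unique() returns a NumPy array, which is the case the reply above relies on.

```python
import pandas as pd

# Toy lookup table; dtype=object keeps the columns as plain Python strings.
df = pd.DataFrame(
    {"Country": ["France", "Japan"], "CapitalCity": ["Paris", "Tokyo"]},
    dtype=object,
)

country = "France"
capital = df.query("Country == @country")["CapitalCity"].unique().item()

# With more than one distinct value, ndarray.item() raises instead of
# silently returning the first match.
try:
    df["CapitalCity"].unique().item()
    raised = False
except ValueError:
    raised = True
```

This keeps the "exactly one match" guarantee that the next(iter(...), None) suggestion gives up.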
Do we want to deprecate iat / at? These are under "potentially" now. I would be +1, iloc and loc can be used.
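A quick sketch of the equivalence for scalar access, on an illustrative frame: at and iat return the same values as loc and iloc with scalar keys.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# Label-based scalar access.
via_at = df.at["r2", "a"]
via_loc = df.loc["r2", "a"]

# Position-based scalar access.
via_iat = df.iat[1, 1]
via_iloc = df.iloc[1, 1]
```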