Pandas: DEPR: let's deprecate

Created on 13 Nov 2017 · 42 Comments · Source: pandas-dev/pandas

xref #18202

We have some cruft, let's deprecate it (I have noted some which already have an issue associated).

From ndarray

[x] Series.compress() (#21930)
[x] Series.imag / Series.real (#27106)
[x] Series.nonzero() (#24048)
[x] Series.put() (#27106)
[x] Series.itemsize
[x] Series.flags
[x] Series.strides
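
For anyone migrating off the ndarray-inherited methods ticked off above, here is a rough sketch of replacements. As far as I know these match what the deprecation messages suggested, but treat them as assumptions rather than the official mapping; the sample Series is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 0, 5], index=["a", "b", "c"])

# Series.compress(cond) -> plain boolean indexing
compressed = s[s != 0]

# Series.nonzero() -> go through the underlying NumPy array
nonzero_positions = s.to_numpy().nonzero()[0]

# Series.ptp() -> NumPy's peak-to-peak function
value_range = np.ptp(s.to_numpy())

# Series.real / Series.imag -> NumPy functions on the values
real_part = np.real(s.to_numpy())
```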

timeseries specific

[ ] Series.first()
- let head / tail take a timedelta
[ ] Series.last()

other

[ ] Series.swapaxes()
- No reason once panel is gone.

non-controversial

[x] MultiIndex.to_hierarchical (only used for Panel) (#21613)
[x] Series/DataFrame.compound() (#26405)
[x] Series.ptp() (#21614)
[x] Series.from_array (#18213)
[x] Series.valid() (#18800)
[x] Series/DataFrame.slice_shift() (#37601)
[x] Series/DataFrame.tshift() (https://github.com/pandas-dev/pandas/pull/34545)
[x] Series/DataFrame.get_values() (https://github.com/pandas-dev/pandas/issues/19617)
[x] Index.dtype_str (#27106)
[x] Index.summary() (#18217)
[x] .get_ftype_counts (#18243) (#20404)
[x] .get_dtype_counts #27145
[x] Index/Series.asobject (#18237) (#18572)
[x] Index.to_native_types() (make private) (https://github.com/pandas-dev/pandas/pull/36418)
[x] DataFrame/Series.as_matrix (#18458)
[x] .clip_upper/.clip_lower (replace by .clip) (#24203)

Potentially

[x] .ftypes (#18243) (#26744)
[ ] .xs() (#6249)
[ ] .iat/.at
[ ] .take
[x] .lookup (#35224)
[x] DataFrame.from_items
[ ] ~Series/DataFrame.add_prefix/add_suffix~ (https://github.com/pandas-dev/pandas/pull/18347)
- Maybe add suffix / prefix to concat?
[ ] NDFrame.filter

Deprecate Master Tracker

jreback

All 42 comments

cc @jorisvandenbossche @TomAugspurger if any comments / objections pls note and I will update the top section.

jreback on 13 Nov 2017

I think I'm -1 on deprecating xs.

-0 on deprecating ptp

What's the alternative to tshift? I think that's sometimes useful when a shift won't quite work.

TomAugspurger on 13 Nov 2017

I was just writing up a similar issue :-) (but only for Series, as by fixing the api docs and docstrings, I bumped into quite some methods unknown to me)

Overview of methods that could be considered for removal (note that this list is very long, and are methods that I would not miss if they are gone, which does not mean that they are not useful to others, it's just for stirring discussion):

Related to the original ndarray subclassing:
- Series.compress()
- Series.flags
- Series.imag / Series.real
- Series.item()
- Series.itemsize
- Series.nonzero()
- Series.put()
- Series.strides
- Series.ptp
Time series specific ones (the question here is if they are all worth it as method, while very specific in application):
- Series.at_time()
- Series.first()
- Series.last()
- Series.between_time()
- Series.tshift()
Finance? specific
- Series.compound()
Other:
- Series.asobject
- Series.as_matrix
- Series.between()
- Series.first_valid_index() / Series.last_valid_index()
- Series.from_array()
- Series.slice_shift()
- Series.swapaxes()
- Series.truncate()
- Series.valid()

jorisvandenbossche on 13 Nov 2017

What's the alternative to tshift? I think that's sometimes useful when a shift won't quite work.

shift already seems to have a freq keyword as well, and it dispatches to tshift if freq is specified
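
A small sketch of the difference, using only shift (the dates and values are invented for illustration): without freq the values move and the index stays put, with freq the index moves and the values stay put.

```python
import pandas as pd

idx = pd.date_range("2017-11-13", periods=3, freq="D")
s = pd.Series([1, 2, 3], index=idx)

# Without freq: values move down one slot, index is unchanged,
# and a missing value appears at the start.
shifted_values = s.shift(1)

# With freq: the index moves forward by one day, values are untouched.
shifted_index = s.shift(1, freq="D")
```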

Note that my above list is very long. The more obvious ones to me that are not yet in the list in the top post are: as_matrix (and maybe swapaxes ?)

jorisvandenbossche on 13 Nov 2017

Could .add_prefix and .add_suffix be added to the deprecation list?

The dataframe/Series namespace is huge and cutting down can make the API easier to grasp. I would also think it more logical and idiomatic to operate directly on the columns, rather than on the dataframe.

topper-123 on 16 Nov 2017

@topper-123 added, feel free to submit PR's for any of these!

jreback on 16 Nov 2017

@jorisvandenbossche @jreback I would love to submit PRs for this issue. How do I go about deprecating these APIs?

manrajgrover on 16 Nov 2017

see for example https://github.com/pandas-dev/pandas/pull/18258
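
For readers who do not want to dig through the PR, the general shape of a deprecation can be sketched with the standard library alone. This is not the code from the linked PR: the Frame class below is a hypothetical stand-in for a DataFrame, and the message wording is an assumption.

```python
import warnings


class Frame:
    """Toy container, a hypothetical stand-in for a DataFrame."""

    def __init__(self, rows):
        self._rows = [list(r) for r in rows]

    @property
    def values(self):
        # The preferred, non-deprecated accessor.
        return self._rows

    def as_matrix(self):
        # Old name: keep it working, but tell callers what to use instead.
        warnings.warn(
            "as_matrix is deprecated and will be removed in a future "
            "version. Use .values instead.",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return self.values


frame = Frame([[1, 2], [3, 4]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = frame.as_matrix()
```

The old method keeps returning the same result for a release cycle or more, so user code breaks only after the warning has had time to be seen.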

jreback on 16 Nov 2017

@ManrajGrover It's best to first post a comment here saying which one you would start with, as I think not all of those listed above are uncontroversial.

jorisvandenbossche on 17 Nov 2017

I think .compound, while not very useful ATM, could be more useful if it was cumulative, ie. just use .cumprod instead of .prod and return a series:

>>> s = pd.Series([0.2, 0.2, 0.2])
>>> s.compound()  # essentially the same as (s+1).cumprod() - 1
0    0.200
1    0.440
2    0.728
dtype: float64

The above would play excellently together with .pct_change, so data.pct_change().compound() would read really well and be very useful in many use cases.

Any opinions on whether .compound could be changed as above rather than deprecated? If .compound returns a scalar as today, I agree it should be deprecated.
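
To make the two behaviours concrete without calling .compound itself, here is a sketch that spells both out by hand. It assumes the existing scalar version is equivalent to (1 + s).prod() - 1, which is my understanding rather than something stated in this thread.

```python
import pandas as pd

s = pd.Series([0.2, 0.2, 0.2])

# Scalar result: total compounded return over the whole series.
total = (1 + s).prod() - 1

# Cumulative variant proposed above: running compounded return.
running = (1 + s).cumprod() - 1
```

The last element of the running version equals the scalar total, so the cumulative form carries strictly more information.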

topper-123 on 17 Nov 2017

I've started a PR for add_prefix and add_suffix.

I will take on Series.asobject and NDFrame.as_matrix next, unless @ManrajGrover wants to do them, in which case you're welcome to.

And yes, I like to remove superfluous methods that start with a, as these are so visible when tab-completing in the REPL :-)

topper-123 on 17 Nov 2017

Although I never use add_prefix / add_suffix, I think they are used quite a bit (judging by the number of Stack Overflow questions), so I am not yet fully convinced they are OK to deprecate.
So I would rather already start with the others like asobject, as_matrix, valid, tshift, ..

as_matrix is also used a bit (more than the others mentioned in the list above), but because it is a very confusing name for what it does, I think it would be good to deprecate.

jorisvandenbossche on 18 Nov 2017

@jorisvandenbossche I can start with Index.summary() for now and pick the next one from the following list:

Index.dtype_str
.ftypes/.get_ftype_counts (#18243)
Index/Series.asobject (#18237)
Index.to_native_types() (make private)

manrajgrover on 18 Nov 2017

@jorisvandenbossche, what about removing a prefix/suffix or doing any other transformation you'd want to do on a index? My point is that .add_prefix/.add_suffix are way too specialized methods, and pandas should have methods that are more generally useful. .rename is great in that respect, and should be the canonical method for changing axis values.
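
A quick sketch of the point about .rename being the more general tool (the column names and the "col_" prefix are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Specialised helper.
prefixed = df.add_prefix("col_")

# Same result through the general-purpose rename.
renamed = df.rename(columns=lambda name: "col_" + name)

# rename also covers the reverse direction, which add_prefix cannot do.
stripped = renamed.rename(columns=lambda name: name[len("col_"):])
```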

I've already made a proposal for .add_prefix/.add_suffix (#18347), so that can wait to see what the agreement will be on that. I would appreciate input though, if you see anything obvious, as that is the first deprecation PR I've made.

In the same vein, the difference between .as_matrix and .values is minuscule, to the point where df[columns].values is the same as df.as_matrix(columns). pandas will be cleaner and leaner with only one way to achieve such a common result (obviously so, IMO...).

I'll make a PR for as_matrix, as there seems to be agreement on that.
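
For reference, the .values spelling of the as_matrix call mentioned above looks like this (a sketch with made-up data; as_matrix itself is not called so the snippet does not depend on it still existing):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
columns = ["a", "c"]

# What df.as_matrix(columns) returned, written with .values instead.
subset = df[columns].values
```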

topper-123 on 18 Nov 2017

My point is that .add_prefix/.add_suffix are way too specialized methods, and pandas should have methods that are more generally useful. .rename is great in that respect, and should be the canonical method for changing axis values.

I completely agree with this. But, you also have the fact that people are using it and thus a removal will cause inconvenience / break code. So it is always a balance between both.

I am certainly +1 on deprecating as_matrix. As you say this is almost exactly the same as .values, and although I think this method is also used quite a bit (the argument I use for add_prefix ..), it's an awfully confusing name, so that's for me an extra reason to deprecate it.

jorisvandenbossche on 22 Nov 2017

Here are my deprecation suggestions:

- Deprecate read_table. It's exactly the same as read_csv with a tab delimiter.
- Remove get_dtype_counts/get_ftype_counts. These are just a convenience for DataFrame.dtypes.value_counts.
- Remove the indexers iat/at. They give a small performance boost for an increase in API complexity.
- Keep only one of iterrows/itertuples for iterating over rows.
- Remove lookup and take. Other indexers do the same thing.
- Remove combine. I've never used it and there is almost no use of it on SO. It looks to do nothing more than DataFrame.add.
- Probably get rid of applymap. You can do the same thing with apply and then map inside of it.
- Use only agg, not aggregate.
- Remove clip_upper and clip_lower and keep DataFrame.clip for both.
- Combine add_prefix/add_suffix into one method.
- One of the biggest issues is the methods that work only with DataFrames with a DatetimeIndex: first, last, truncate, at_time, between_time, to_period, to_timestamp. These could be removed or put in an accessor.
- Remove reindex_axis and reindex_like in favor of just reindex.
- Remove isna. It's an alias of isnull.
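
A few of the equivalences claimed in the list above can be checked directly. This is only a sketch with invented data, and it covers just three items: the dtype counts, the clip variants, and read_table versus read_csv with a tab separator.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 10], "b": [0.5, 2.5, 7.5], "c": ["x", "y", "z"]})

# get_dtype_counts() -> count the dtypes directly
dtype_counts = df.dtypes.value_counts()

# clip_upper(6) / clip_lower(2) -> clip with one bound each
numeric = df[["a", "b"]]
capped = numeric.clip(upper=6)
floored = numeric.clip(lower=2)

# read_table(source) -> read_csv with an explicit tab separator
text = "x\ty\n1\t2\n3\t4\n"
from_csv = pd.read_csv(io.StringIO(text), sep="\t")
from_table = pd.read_table(io.StringIO(text))
```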

tdpetrou on 23 Nov 2017

👍5 👎1

remove the indexers iat/at. They give a small performance boost for an increase in API complexity

these are convenience methods, not sure they add much to API burden

take

this is a very common notion and is a very array-like method

remove isna - it's an alias to isnull

this was just added for compat with dropna, fillna, see the pattern :>, so if anything we would remove isnull, but that has been in the API so long that it may well nigh be impossible to actually remove (and more to the point very annoying).

jreback on 23 Nov 2017

Remove reindex_axis and reindex_like in favor of just reindex

reindex_axis is already deprecated.

One of the biggest issues are the methods that work only with DataFrames with a DatetimeIndex - first, last, truncate, at_time, between_time, to_period, to_timestamp - These could be removed or put in an accessor

to_period and to_timestamp on a Series do something different from the methods in the .dt accessor. The former work on the index, the latter on the values. So it's not possible to just move them.
But on the other datetime-related methods I agree; I also find it a bit unfortunate that those exist (first and last certainly have very confusing names).
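
The index-versus-values distinction is easy to miss, so here is a small sketch (the dates are arbitrary) with a Series that has datetimes in both places:

```python
import pandas as pd

index = pd.date_range("2017-01-31", periods=3, freq="D")
values = pd.date_range("2017-06-01", periods=3, freq="D")
s = pd.Series(values, index=index)

# Method on the Series: converts the *index* to periods.
by_index = s.to_period("M")

# Method on the .dt accessor: converts the *values* to periods.
by_values = s.dt.to_period("M")
```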

jorisvandenbossche on 23 Nov 2017

@jreback There are only a total of 13 occurrences of df.take in all of Stack Overflow and in my opinion should never be used.

df.iat/.at are probably too entrenched in legacy code to remove but they provide no extra functionality. Indexing is the most confusing aspect to pandas and the less the better. Maybe a better design would have been to do df.loc(type='scalar')['row', 'col']

I guess there is no going back on isna/isnull but I really dislike having methods that are aliases of one another.

@jorisvandenbossche I wasn't being clear, but all those DataFrame/Series methods that only work on DatetimeIndexes could be put in their own accessor (not .dt), though I don't think even that would be a good idea. Perhaps just deprecating all of them would be best.

tdpetrou on 23 Nov 2017

@jreback There are only a total of 13 occurrences of df.take in all of Stack Overflow and in my opinion should never be used.

.take() is a common name for array-like things.

df.iat/.at are probably too entrenched in legacy code to remove but they provide no extra functionality. Indexing is the most confusing aspect to pandas and the less the better. Maybe a better design would have been to do df.loc(type='scalar')['row', 'col']

your suggestion is much less readable

I guess there is no going back on isna/isnull but I really dislike having methods that are aliases of one another.

sure, but isnull is even more entrenched than anything else.

jreback on 23 Nov 2017

df.iat/.at are probably too entrenched in legacy code to remove but they provide no extra functionality. Indexing is the most confusing aspect to pandas and the less the better. Maybe a better design would have been to do df.loc(type='scalar')['row', 'col']

As indexing is simplified & improved, the speed diff between .loc and .at should fall (a lot of the time it's a very similar function). Then deprecating .at will cause less strife
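
For scalar lookups the two pairs of indexers already return the same thing, which is what makes the overlap debatable. A minimal sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({"col": [10, 20]}, index=["row1", "row2"])

# Label-based scalar access: both return the same value.
via_at = df.at["row2", "col"]
via_loc = df.loc["row2", "col"]

# Position-based scalar access.
via_iat = df.iat[1, 0]
via_iloc = df.iloc[1, 0]
```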

max-sixty on 23 Nov 2017

DataFrames have a method .boxplot. I would assume this should be deprecated and people should use .plot.box instead?

topper-123 on 26 Nov 2017

yes i think there is an issue about this

jreback on 26 Nov 2017

The two methods aren’t quite equivalent iirc.

TomAugspurger on 26 Nov 2017

Does anyone have any objections to a PR to deprecate from_items? There is already an issue discussing its deprecation, #17320. It seems that all of its functionality can be replicated with DataFrame(dict(..)) or similar.

Also xref #17312, #4916
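
A sketch of the dict-based replacement for the simple column-wise case (the item list is invented; an OrderedDict is used here as an assumption, to keep the column order explicit on any Python version):

```python
from collections import OrderedDict

import pandas as pd

items = [("A", [1, 2, 3]), ("B", [4, 5, 6])]

# Instead of DataFrame.from_items(items): build from an ordered mapping.
df = pd.DataFrame(OrderedDict(items))
```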

reidy-p on 26 Nov 2017

boxplot has some differences to df.plot.box. One issue is here - #15079. boxplot also defaults to making a matplotlib grid (those background dotted lines on the tick marks).

Also, clip_upper and clip_lower have to be up there among the easiest to deprecate and least likely to affect others.

@reidy-p DataFrame.from_items looks like a good candidate. I only see 6 occurrences on SO.

tdpetrou on 26 Nov 2017

This might be unpopular, but I don't like the fact that the indexing operator selects rows when given a slice but columns when given a single label or list of labels.

Selects columns

>>> df['col1'] 
>>> df[['col1', 'col2']]

All of a sudden selects rows?

>>> df[:5]

Lots of users will find this behavior confusing.

I think it actually might be simplest to deprecate the indexing operator altogether and only have .loc and .iloc.
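
To show the axis switch side by side with the explicit spellings, here is a short sketch (column names as in the snippets above, data invented):

```python
import pandas as pd

df = pd.DataFrame({"col1": range(10), "col2": range(10, 20)})

# The bare indexing operator switches axis depending on the key.
one_column = df["col1"]              # label -> column (Series)
two_columns = df[["col1", "col2"]]   # list of labels -> columns
first_rows = df[:5]                  # slice -> rows

# The explicit spellings say which axis is meant.
same_column = df.loc[:, "col1"]
same_rows = df.iloc[:5]
```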

tdpetrou on 26 Nov 2017

@tdpetrou let's keep this issue focused on discussing "potential methods that can be deprecated because they are hardly used / very awkward / internals".
Indexing is a whole other can of worms (see eg https://github.com/pandas-dev/pandas/issues/9595), the behaviour you mention can maybe be confusing but is well established and changing this will break a gigantic amount of user code (but feel free to open a separate issue if you want)

jorisvandenbossche on 27 Nov 2017

Thanks @jorisvandenbossche. I added my comment there.

Another source of confusion for users is the ability of Series.map to take a function. Series.apply can also accept a function as well as other arguments. There is no reason for Series.map to accept a function, and it would be better served by doing exactly one thing: literally mapping one value to another.

tdpetrou on 28 Nov 2017

@tdpetrou , Series.map(func) is similar to the python built-in map(func, iterable), except it's object-oriented. It's as intended and expected.

topper-123 on 28 Nov 2017

@topper-123 Yes, I know it has a similar functionality to the built-in map function, but it also accepts a dictionary. This is a bad idea and cause for lots of confusion. Users see map and apply and have no idea what the difference between them is. It makes much more sense to have it literally map one value to another with a dictionary/Series as its only functionality.
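
The overlap being argued about looks like this in practice (a sketch with made-up data):

```python
import pandas as pd

s = pd.Series(["cat", "dog", "cat"])

# map with a dict: a literal value-to-value lookup.
by_dict = s.map({"cat": "feline", "dog": "canine"})

# map with a function: applied element by element.
by_func = s.map(str.upper)

# apply with the same function gives the same values here.
by_apply = s.apply(str.upper)
```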

tdpetrou on 28 Nov 2017

IMO DataFrame.filter is confusingly named, and is easily confused with the similarly named DataFrame.groupby(...).filter when googling etc.

I propose that DataFrame.filter be deprecated and a similarly functioning DataFrame.select_filter be added. By doing this rename, the relation to select_dtypes is emphasized.

It could also be named just select (shorter), but that means that the change will have to wait until the current deprecated select method is removed.
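
The naming clash is easier to see with both methods next to each other. In this sketch (invented data), DataFrame.filter picks columns by label while GroupBy.filter drops whole groups of rows:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "x_1": [1, 2, 3], "x_2": [4, 5, 6]})

# DataFrame.filter selects *labels* (here: columns whose name contains "x_").
by_label = df.filter(like="x_")

# GroupBy.filter keeps or drops whole *groups of rows* based on a predicate.
by_group = df.groupby("key").filter(lambda group: len(group) > 1)
```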

topper-123 on 16 Apr 2018

I found the OP from @TomAugspurger in #21894 quite important, and since that issue is closed now, I quote it here:

As a reminder, the plan is to have no new deprecations in 0.25.x and 1.0.0. So this [v0.24] is the last round of deprecations before 1.0.

In this context, I'd like to bring up for discussion the following two issues: #21950 #21951

Finally, I'll repeat a comment I made in the other thread:

Most likely too late to the game, but for completeness I'd like to add: if #21855 #21858 are solved for v0.24, then combine_first could be deprecated at the same time, see #21859.

h-vetinari on 17 Jul 2018

Hi, I am also -1 for deprecating xs().
I read most of the discussion in the related issue.
But as someone who uses pandas from time to time, I find the sheer number of things loc can do kind of confusing.
Especially when I am using a MultiIndex with different levels. I tend to remember the levels not by their order in the hierarchy but by their names. Selection by loc becomes a hassle when I want to slice the third and fifth levels in the hierarchy, because most of the time I confuse the third with the second one, or the fourth with the fifth, etc.
I don't lose much time on it, but it is still a little inefficient compared to simply passing a value and the name of the level to a function.
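
A sketch of the comparison being made, with a three-level index whose level names are invented for illustration:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2], ["x", "y"]], names=["first", "second", "third"]
)
df = pd.DataFrame({"value": range(8)}, index=index)

# xs: name the level, no need to remember its position.
by_xs = df.xs("y", level="third")

# loc: one slot per level, in hierarchy order.
by_loc = df.loc[(slice(None), slice(None), "y"), :]
```

Both select the same rows; xs drops the selected level from the result by default, while the loc version keeps all three levels.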

D-K-E on 29 Jul 2018

👍1

Why on earth would you deprecate read_table? That makes no damn sense.
The suggested change is to call read_csv to read things that are not comma-separated? This is 100% backwards.

jimmywan on 25 Jan 2019

👍13

One could argue why not deprecate read_csv() instead of read_table() since table sounds more flexible.

Edit:
I have to agree with @jimmywan here, and if they are basically the same, why not at least keep it as an alias? One could always wrap it, but people would not be confused or avoid updating.

st-bender on 1 Feb 2019

👍5

DataFrame.where and DataFrame.mask are duals, but their names don't indicate that. perhaps deprecate mask? since mask is just where(~cond), IIUC. alternatively rename to where_not.
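
The duality can be checked in a few lines (sketched on a Series with made-up values; the same holds for a DataFrame):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
cond = s > 2

# where keeps values where cond is True; mask hides them instead.
kept = s.where(cond)
hidden = s.mask(cond)
hidden_via_where = s.where(~cond)
```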

ghost on 26 Jun 2019

cc @jbrockmendel here is my original list :->

jreback on 11 Sep 2019

Would it be possible to get a rationale for why .item() has been deprecated, or a suggested alternative? I really appreciated this feature, as it assumes one and only one match for a query/column combo, i.e. df.query("Country=='%s'" % country)['CapitalCity'].unique().item().

Using next(iter(...), None) as suggested in the link below is very readable, but it breaks the assumption that there are not multiple matches to a query and returns only the first value, meaning I'd need to add a check that the result has length 1 before extracting the value.

https://stackoverflow.com/questions/57390363/pandas-item-has-been-deprecated
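
To illustrate the difference in behaviour, here is a sketch with invented data. Two things are assumptions on my part: a boolean mask is used in place of query, and the matches are converted with to_numpy() so that the NumPy array's own item() method is what gets called. The helper names capital and first_capital are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Country": ["France", "Japan", "Bolivia", "Bolivia"],
        "CapitalCity": ["Paris", "Tokyo", "Sucre", "La Paz"],
    }
)


def _matches(country):
    # Distinct capital names for one country, as a NumPy array.
    column = df.loc[df["Country"] == country, "CapitalCity"]
    return column.drop_duplicates().to_numpy()


def capital(country):
    # ndarray.item() raises ValueError unless there is exactly one element.
    return _matches(country).item()


def first_capital(country):
    # Returns the first match (or None) and hides any extra matches.
    return next(iter(_matches(country)), None)
```

With this data, capital("France") returns a single value, capital("Bolivia") raises because there are two matches, and first_capital("Bolivia") quietly returns only the first one.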