Pandas: Un-deprecate DataFrame.from_items()

Created on 11 Jul 2018  Â·  21Comments  Â·  Source: pandas-dev/pandas

Under #18262 a FutureWarning was added suggesting that existing code like this:

pd.DataFrame.from_items(x)

Should be changed to this:

import collections
pd.DataFrame.from_dict(collections.OrderedDict(x))

The fact that from_items() appeared only 6 times (now 8 times) in a Stack Overflow search was used as partial justification for removing it. But if you search on GitHub, pd.DataFrame.from_items() appears more than 15,000 times in Python--almost half as many as from_records()!

We should celebrate the fact that this function doesn't cause enough confusion to appear often on Stack Overflow. But it does occur (a lot!) in real code, and deprecating it is a mistake.

If constructing a temporary OrderedDict around items is the best way to construct a DataFrame, Pandas should implement that as a short function called DataFrame.from_items(), rather than asking thousands of people to busy themselves to accommodate this unnecessary API change.

I recommend removing the FutureWarning, and retaining this widely-used, longstanding function.

For reference, the FutureWarning starts in 0.23 and looks like this:

FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.

API Design

Most helpful comment

I haven't seen any mention that the from_dict(OrderedDict(items)... doesn't work in cases where there would be duplicates in the index. Here is a test that would fail:

import pandas
from collections import OrderedDict
def test_from_dict_replacing_from_items():
    rows = [(1, (2,)), (1, (2,))]
    df1 = pandas.DataFrame.from_items(rows, columns=('a', ), orient='index')
    df2 = pandas.DataFrame.from_dict(OrderedDict(rows), columns=('a', ), orient='index')
    pandas.testing.assert_frame_equal(df1, df2)

All 21 comments

cc @jreback @chris-b1

@jreback I would appreciate your review of this topic.

cc @reidy-p @chris-b1

I disgree, this is a good deprecation. Keeping around old code is not helpful. So -1 on adding back.

@jreback It is clear that you think this is a good deprecation, otherwise you would not have done it. However, the initial justification ("only 6 hits on Stack Overflow") completely ignored the larger picture, which is that this function is widely used in a huge number of applications and libraries.

It might be the case that the original implementation was unwieldy or costly to maintain. But what could be the reason not to maintain backward compatibility by a 3-line function that delegates the work to OrderedDict and the main DataFrame constructor?

In other words, maybe the old code was not helpful. Would you accept new code to implement this widely-used function in a single place, rather than having thousands of users sprinkle the same into their local codebases?

there is a burden to maintain apis and code
why should this be any different from other ‘highly used code’

I agree with @jzwinck that "old code" and "burden to maintain apis and code" is not a good reason for deprecating from_items. from_items is really not that of a complicated function (compared to many other things in pandas) and as @jzwinck shows can be easily cleaned-up.

If that was the original reason to deprecate it, I think the PR of @jzwinck is a perfectly reasonable solution.

However, often when we deprecate one of the public methods, the main reasons are related to the "User API" design: inconsistent apis, duplicate functionality, hardly useful functionality that clobbers the API, etc are all reasons to consider deprecating a functionality.
We have a huge number of methods on the DataFrame object, and this huge number results in an overload for users. Therefore for having something as a method on the DataFrame, there should be a clear added value that is big enough to balance the added complexity / API load.

So also for from_items: is it worth having this as a DataFrame method given the relatively limited use case / relatively easy alternative?

And to be clear, such a trade-off is difficult and subjective.
And when it is about removing existing functionality (instead of adding a new method), the balance should typically be more strongly against having the method (because then there is also the disruption of the existing use cases).


That said, I personally never had a use case myself where from_items would have been fitting. That is not saying that those do not exist, but for me it is very hard to judge its usefulness.

If this would have been a new method proposal, I would probably be -1 to add it. But given that this is about removing an already existing method, I personally don't know.

Thanks for the insight @jorisvandenbossche. If this were a brand new API, I would not promote it either. But as you point out, it is an API Pandas has had for a long time.

To me the largest concerns are

  1. Existing uses of from_items() in the wild.
  2. The inadequacy (even bugs, see below) of the refactoring proposed by the deprecation warning. Reading #4916, it was not totally clear even to the experts how to replace from_items() in some cases.

On 24 August 2017, @bashtage asked:

What about `orient='index'? I used this just last week.

This was a prescient question: it is exactly the case which triggers #8425 if you can't use from_items(). But no one replied. Evidently no one tried it either, because one of the existing Pandas test cases fails if you try to replace from_items(orient='index') with from_dict().

I could see making the docs say from_items() is deprecated, but when the difference between user code working and not working comes down to 15 lines of Python, I don't see why we'd break those apps or force them all to copy-paste an incomplete and buggy workaround.

I haven't seen any mention that the from_dict(OrderedDict(items)... doesn't work in cases where there would be duplicates in the index. Here is a test that would fail:

import pandas
from collections import OrderedDict
def test_from_dict_replacing_from_items():
    rows = [(1, (2,)), (1, (2,))]
    df1 = pandas.DataFrame.from_items(rows, columns=('a', ), orient='index')
    df2 = pandas.DataFrame.from_dict(OrderedDict(rows), columns=('a', ), orient='index')
    pandas.testing.assert_frame_equal(df1, df2)

@PatrickDRusk Good point. There are several cases where the replacement for from_items() is not obvious. Pandas maintainers simply don't care about backward compatibility--deleting working code from Pandas is more important than providing a smooth experience for long-time Pandas users like you and me.

Pandas maintainers simply don't care about backward compatibility--deleting working code from Pandas is more important than providing a smooth experience for long-time Pandas users like you and me.

@jzwinck : I should point out that @jorisvandenbossche was willing to reopen the issue, but you promptly closed it. From the looks of it, your comments are in fact somewhat hypocritical.

I would caution your tone before you make such broad statements about us like that. We don't delete code on a whim, and with so many users of this codebase it is difficult to please everyone when it comes to making a library like this work.

If you want to have a discussion again about this topic, then please re-open the issue.

@gfyoung OK, I've reopened this issue. The current state of play is that there are a number of users of Pandas who can't easily upgrade to 0.23 or later because from_items() is deprecated and the suggested workaround only covers some fraction of use cases (maybe 30% by cardinality and 70% by popularity). My PR #22094 eliminates most of the burdensome implementation code for from_items() and covers maybe 90% of the use cases (but not the one from @PatrickDRusk, I think).

What should we do next?

Okay, let's take stock of what we have here:

I haven't seen any mention that the from_dict(OrderedDict(items)... doesn't work in cases where there would be duplicates in the index.

Hmmm...duplicate indices generally have not been a fan-favorite for us as maintainers because such support can be very hard to maintain, not just for one case, but as a "feature" across the entire interface of pandas.

@PatrickDRusk : So admittedly I'm somewhat torn: the statement is indeed inaccurate with regards to duplicate indices, but committing to supporting such behavior could send the wrong message that we intend to support duplicates full-throttle, when that is far from the case in our codebase.

@pandas-dev/pandas-core : Thoughts on this?

the suggested workaround only covers some fraction of use cases (maybe 30% by cardinality and 70% by popularity)

@jzwinck : This statement unfortunately is too vague for us to work with. Especially since we already made the decision to deprecate, we will need more than words and GitHub search results to make a convincing argument. Meet us halfway and present us with more examples to illustrate that the suggested change is incompatible.

Caveat: if they fall into same category of duplicates, I personally wouldn't count it unless it is somehow different (I'll leave it to your discretion to judge appropriately) from what @PatrickDRusk has given.

@pandas-dev/pandas-core : Thoughts on this?

I'm +0.5 on keeping the method. In general, I don't see "easy to replace with user code" as an argument to deprecate - maybe the opposite, if the code is so simple and so widely used...
I'm -1.5 on "we don't want to encourage duplicated labels" as an argument. Pandas supports them, let's face it. It is perfectly fine to not support them - and document it - in those cases when it's just a mess. But not supporting them "just to give a message to users", in cases where it's extremely simple to support them, is not nice.

Anyway, if we _really_ want to discourage users from using duplicated labels, the the correct way to do it is to introduce a general warning for it, inside the Index (and subclasses) constructor. Maybe even a DeprecationWarning, giving users several releases to adapt, would allow us to eventually simplify our code. I just don't think it's worth the pain (but this is very IMHO).

By the way: deprecating DataFrame.from_items because the API is too large is particularly funny because we just introduced the very specific MultiIndex.from_frame, which is a very simple wrapper.

@gfyoung I don't personally need support for duplicate values in the index, and I wrote #22094 without considering that. For people who like reduced implementation code and don't want duplicate values in the index, #22094 provides both.

You said:

Meet us halfway and present us with more examples to illustrate that the suggested change is incompatible.

Please look at the logic in #22094. That's what I had to write to make the preexisting unit tests pass. In other words, the unit tests I restored in that PR are the examples you are looking for. Any code using from_items() with the columns and/or orient parameters requires more than the suggested workaround using OrderedDict (and that additional code is what my PR implements). I wrote that PR for two reasons: to solve the problem, and to demonstrate that the trivial-seeming workaround code using OrderedDict is not sufficient.

"we don't want to encourage duplicated labels" as an argument. Pandas supports them, let's face it. It is perfectly fine to not support them - and document it

@toobaz : I beg to differ to the extent in which duplicates of labels, whether it be in columns or indices is supported as well as non-duplicates, but let's agree to disagree on this one for the moment. My point is not about "sending a message" by not supporting it. It's actually about "not sending the wrong message" that's more important IMO.

Please look at the logic in #22094. That's what I had to write to make the preexisting unit tests pass.

@jzwinck : I read it, and I'm really confused here...what pre-existing unit test was failing before your change? You seemed to have removed several tests + removed deprecation warnings that we had in place that we were checking for...

@gfyoung I should have said "to make the tests pass as they did before the deprecation warnings were added." In other words, I put the unit tests back to the way they were before the deprecation, and then implemented the logic in from_items() to make them all work again without the "burdensome" logic (replacing that with the OrderedDict solution plus a bit more logic to cover all the test cases).

I would also lean towards keeping from_items around. It may not be an optimal API, but it solves a real use case and it isn't broken. I don't see it really adding much to the pandas maintenance burden.

If .from_items isn't going to be un-deprecated then Rpy2 is going to need to update their code:

@ri2py.register(DataFrame)
def ri2py_dataframe(obj):
    items = tuple((k, ri2py(v)) for k, v in obj.items())
    res = PandasDataFrame.from_items(items)
    return res

I set up a GitHub issue to let them know the change breaks their code.
https://github.com/rpy2/rpy2/issues/680

Was this page helpful?
0 / 5 - 0 ratings