Pandas: cross section coercion with output iterating

Created on 11 Apr 2016 · 9 Comments · Source: pandas-dev/pandas

I am trying to call the to_dict function on the following DataFrame:

import pandas as pd

data = {"a": [1,2,3,4,5], "b": [90,80,40,60,30]}

df = pd.DataFrame(data)

df.reset_index().to_dict("r")
[{'a': 1, 'b': 90, 'index': 0},
 {'a': 2, 'b': 80, 'index': 1},
 {'a': 3, 'b': 40, 'index': 2},
 {'a': 4, 'b': 60, 'index': 3},
 {'a': 5, 'b': 30, 'index': 4}]

However, my problem occurs if I perform a float operation on the DataFrame, which turns the index values into floats:

(df*1.0).reset_index().to_dict("r")
[{'a': 1.0, 'b': 90.0, 'index': 0.0},
 {'a': 2.0, 'b': 80.0, 'index': 1.0},
 {'a': 3.0, 'b': 40.0, 'index': 2.0},
 {'a': 4.0, 'b': 60.0, 'index': 3.0},
 {'a': 5.0, 'b': 30.0, 'index': 4.0}]

Can anyone explain the above behaviour or recommend a workaround, or verify whether or not this could be a pandas bug? None of the other outtypes in the to_dict method mutates the index as shown above.

I've replicated this on both pandas 0.14 and 0.18 (latest)

Many thanks!

link to stackoverflow: http://stackoverflow.com/questions/36548151/pandas-to-dict-changes-index-type-with-outtype-records

API Design Dtypes Needs Discussion good first issue

tsu-shiuan

All 9 comments

Nothing to do with the index, just the fact that you have any float dtypes in the data

data = {"a": [1.0,2,3,4,5], "b": [90,80,40,60,30]}
df = pd.DataFrame(data)
In [19]: df.to_dict("records")
Out[19]:
[{'a': 1.0, 'b': 90.0},
 {'a': 2.0, 'b': 80.0},
 {'a': 3.0, 'b': 40.0},
 {'a': 4.0, 'b': 60.0},
 {'a': 5.0, 'b': 30.0}]

If you look at the code, we use DataFrame.values, which returns a NumPy array, which must have a single dtype (float64 in this case).

We probably don't need to use .values here.
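To illustrate the point about .values, here is a minimal sketch (not taken from the pandas source) showing that a frame with one float column and one int column collapses to a single float64 array. The variable name combined is only for illustration.

```python
import pandas as pd

# One float column and one int column, as in the example above.
df = pd.DataFrame({"a": [1.0, 2, 3, 4, 5], "b": [90, 80, 40, 60, 30]})

# Each column keeps its own dtype inside the DataFrame.
print(df.dtypes)

# .values has to pick one dtype for the whole 2-D array,
# so the integer column is upcast to float.
combined = df.values
print(combined.dtype)
```

Any code that builds records from this array will therefore see floats in every column.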

TomAugspurger on 11 Apr 2016

Thanks for your response. Is there a possible workaround that I can use in the meantime?

tsu-shiuan on 11 Apr 2016

Something like

In [28]: [x._asdict() for x in df.itertuples()]
Out[28]:
[OrderedDict([('Index', 0), ('a', 1.0), ('b', 90)]),
 OrderedDict([('Index', 1), ('a', 2.0), ('b', 80)]),
 OrderedDict([('Index', 2), ('a', 3.0), ('b', 40)]),
 OrderedDict([('Index', 3), ('a', 4.0), ('b', 60)]),
 OrderedDict([('Index', 4), ('a', 5.0), ('b', 30)])]

That's an OrderedDict because it uses namedtuple._asdict; you can write a dict comprehension if you want a regular one.
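A minimal sketch of the dict-comprehension variant, assuming the same df as above. It pairs each namedtuple's _fields with its values, so each column keeps its own type. The name records is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2, 3, 4, 5], "b": [90, 80, 40, 60, 30]})

# itertuples() yields one namedtuple per row, built column by column,
# so the integer column is not upcast to float.
records = [
    {field: value for field, value in zip(row._fields, row)}
    for row in df.itertuples()
]

print(records[0])
```

Pass index=False to itertuples() if the 'Index' key is not wanted.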

TomAugspurger on 11 Apr 2016

Thanks :)

tsu-shiuan on 11 Apr 2016

Though one could argue that this result is correct as we don't support mixed types in int-float when doing a cross-section, IOW:

In [10]: (df*1.0).reset_index().iloc[1]
Out[10]: 
index     1.0
a         2.0
b        80.0
Name: 1, dtype: float64

This is somewhat related to #12532, meaning that we should be iterating directly over (which already does the proper coercion), rather than doing a specific coercion in .to_dict().

jreback on 11 Apr 2016

Going to mark this as an API issue that needs discussion. Fixing this correctly would actually be a fairly large change (though to be honest I think the current behavior is fine).

jreback on 11 Apr 2016

Note for future seekers - I'm trying to combine multiple pandas objects into one nested json structure.

Since to_json doesn't work in this case (manipulating json strings is hard), you might try to do to_dict(orient="records"), and combine the results of the to_dict()s into a bigger object, and do json.dumps on that. But because of this bug, you can't do that without screwing with the types of everything.

So then you might try doing @TomAugspurger's solution but you might find that for some reason it won't convert numpy types to python types, like to_dict() does, which makes json.dumps() fail.

My workaround solution is to do to_json() which gives you a correct json string with correct types, then do json.loads() on that to get python objects corresponding to that string, which you then put together whichever way you want (e.g. big_obj = {"a": df_a_json, "b": df_b_json}) and then run json.dumps on the whole thing. It's roundabout but it's the closest general solution I found without having to muck about with type conversions myself!

def to_records(df):
    """Replacement for pandas' to_dict(orient="records") which has trouble with
    upcasting ints to floats in the case of other floats being there.

    https://github.com/pandas-dev/pandas/issues/12859
    """
    import json
    return json.loads(df.to_json(orient="records"))
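
A usage sketch of the nested-JSON case described above. The frames df_a and df_b and their contents are made up for illustration, and to_records is repeated here only so the snippet runs on its own.

```python
import json

import pandas as pd


def to_records(df):
    # Round-trip through pandas' own JSON writer so ints stay ints.
    return json.loads(df.to_json(orient="records"))


# Hypothetical frames standing in for the objects being combined.
df_a = pd.DataFrame({"a": [1.0, 2.5], "b": [90, 80]})
df_b = pd.DataFrame({"name": ["x", "y"], "n": [1, 2]})

# Plain Python lists and dicts, so they nest and serialize without extra work.
big_obj = {"a": to_records(df_a), "b": to_records(df_b)}
payload = json.dumps(big_obj)
print(payload)
```

The cost is one extra serialize/parse pass per frame, which may matter for large data.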