Pandas: Feature Request: pd.DataFrame() should flatten nested dicts when given list of dicts

Created on 5 Oct 2014 · 9 Comments · Source: pandas-dev/pandas

Nested dictionaries are commonly emitted by web APIs that speak JSON. Getting this sort of data into pandas isn't very easy right now without manual data-structure munging, as the dicts remain objects rather than being converted into a flat naming hierarchy.

Here's a common example of data:

In [95]: data=[dict(user=dict(uid=123,full_name='Alice'),followers=1),
    ...:  dict(user=dict(uid=456,full_name='Bob'),followers=2)]
    ...: data
Out[95]: 
[{'followers': 1, 'user': {'full_name': 'Alice', 'uid': 123}},
 {'followers': 2, 'user': {'full_name': 'Bob', 'uid': 456}}]

Pandas keeps the dicts as objects:

In [96]: df=pd.DataFrame(data)
    ...: df
Out[96]: 
   followers                                   user
0          1  {u'uid': 123, u'full_name': u'Alice'}
1          2    {u'uid': 456, u'full_name': u'Bob'}

But I'd love to see something along the lines of:

In [91]:  df=pd.DataFrame(data,flatten_dicts=True)
    ...: df
Out[91]: 
   followers  user.uid user.full_name
0          1       123          Alice
1          2       456            Bob

API Design Enhancement IO JSON Reshaping

kay1793
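
Until something like the requested behavior exists, the flat naming hierarchy can be produced by hand before the records reach the constructor. The following is a minimal sketch, not pandas code: `flatten_record` is a hypothetical helper, and the "." separator is an assumption chosen to match the column names in the request.

```python
def flatten_record(record, parent_key="", sep="."):
    """Recursively flatten nested dicts into a single-level dict.

    Hypothetical helper for illustration; nested keys are joined with `sep`.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Descend into the nested dict, carrying the prefix along.
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


data = [
    {"user": {"uid": 123, "full_name": "Alice"}, "followers": 1},
    {"user": {"uid": 456, "full_name": "Bob"}, "followers": 2},
]

rows = [flatten_record(record) for record in data]
print(rows)
```

The resulting list of flat dicts can then be passed to pd.DataFrame as usual. Note that this sketch silently drops empty nested dicts, which is one of the edge cases a built-in version would have to decide on.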

All 9 comments

Well, this is in general a non-trivial problem and that's why it's not done by default. You can do the below.

In [10]: from pandas.io.json import json_normalize

In [11]: json_normalize(data)
Out[11]: 
   followers user.full_name  user.uid
0          1          Alice       123
1          2            Bob       456

see docs here: http://pandas-docs.github.io/pandas-docs-travis/io.html#normalization

I suppose a 2 level dict could be examined and this done.

Don't want additional context-specific keywords. But you could try this type of solution and see if it breaks anything. If it works OK, would consider it.

jreback on 5 Oct 2014

👍14 🎉10 ❤3
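
For readers on newer releases: since pandas 1.0 the same function is exposed at the top level as pd.json_normalize, and the pandas.io.json import path used in the comment above is deprecated. A short sketch, assuming pandas 1.0 or later is installed, which also shows the `sep` parameter for choosing the separator in the flattened column names:

```python
import pandas as pd

data = [
    {"user": {"uid": 123, "full_name": "Alice"}, "followers": 1},
    {"user": {"uid": 456, "full_name": "Bob"}, "followers": 2},
]

# Top-level entry point (pandas >= 1.0); nested keys are joined with "." by default.
flat = pd.json_normalize(data)

# `sep` changes the separator used when joining nested keys.
flat_underscore = pd.json_normalize(data, sep="_")

print(sorted(flat.columns))
print(sorted(flat_underscore.columns))
```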

I don't understand, you don't want new keywords but would be ok with a flatten keyword?
In any case, json_normalize works fine for me - I just didn't know it was hidden there. Thanks!

kay1793 on 5 Oct 2014

no new keywords

but u maybe could inspect a 2-level dict and flatten

why don't u give it a shot

jreback on 5 Oct 2014

Wouldn't that break existing code? If no new keywords are allowed (I can see why), I think it's better to leave it as-is. json_normalize works fine for me, I'll add any extra features I need there.

kay1793 on 5 Oct 2014

@kay1793 here's a couple of things to try (and can see what works best):

1. have pd.read_json interpret this (it normally takes a string / file handle), and essentially call json_normalize if it's a nested dict-of-dicts (we might be bending the definition a bit though)
2. have the DataFrame constructor deal with this and see if it can do unambiguous interpretation (e.g. you have a dict of dict / scalar mix, instead of the current behavior, actually call json_normalize) - might break things, but unknown until you try
3. provide a pd.JSON which is essentially a wrapper around json_normalize (and is a more expandable way of doing things like this)

jreback on 5 Oct 2014
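
To make the "unambiguous interpretation" question in the second option concrete, here is a rough sketch of the kind of inspection a constructor would need to do. `classify_keys` is a hypothetical helper, not part of pandas, and the "flat" / "nested" / "mixed" grouping is an assumption made for illustration. A key that holds a dict in some records and a scalar in others is the case where guessing becomes ambiguous.

```python
from collections.abc import Mapping


def classify_keys(records):
    """Group top-level keys by whether their values are mappings.

    Hypothetical helper: 'nested' keys always hold a mapping, 'flat' keys
    never do, and 'mixed' keys hold a mapping in only some records.
    """
    seen = {}  # key -> set of booleans: was the value a mapping?
    for record in records:
        for key, value in record.items():
            seen.setdefault(key, set()).add(isinstance(value, Mapping))

    groups = {"flat": [], "nested": [], "mixed": []}
    for key, kinds in seen.items():
        if kinds == {True}:
            groups["nested"].append(key)
        elif kinds == {False}:
            groups["flat"].append(key)
        else:
            groups["mixed"].append(key)
    return groups


data = [
    {"user": {"uid": 123, "full_name": "Alice"}, "followers": 1},
    {"user": {"uid": 456, "full_name": "Bob"}, "followers": 2},
]

# Same data plus one record where "user" is a plain string.
ambiguous = data + [{"user": "anonymous", "followers": 0}]

print(classify_keys(data))
print(classify_keys(ambiguous))
```

In the first call every "user" value is a dict, so flattening is at least well defined; in the second the "user" key is mixed, and any automatic choice would surprise some callers.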

1. The read_json data schema isn't wonderful but it is what it is; I don't think making it as mysterious and full of special cases as the DataFrame constructor is a good idea.
2. As for making the DataFrame constructor silently guess what the user wants, there's nothing unambiguous about it breaking someone's code. Currently it keeps the dictionary as an object; doing something else will break code. Without a keyword, I don't think this should be done; pandas already second-guesses the user too much in certain places.
3. Adding pd.JSON isn't reasonable either. JSON isn't really the point; any nested dictionary could be serialized as JSON. What matters is the actual structure, and how to deal with it. What you're suggesting is to take a special case of the DataFrame constructor's existing functionality (list of dicts) and turn it into a different DataFrame constructor. That's not right either IMO.

json_normalize works fine for me, and when it comes to API design I'm kind of conservative, so none of those options seem acceptable to me. I vote "do nothing".

Edit: clarified my 2nd point.

kay1793 on 5 Oct 2014

@kay1793 I think you missed my point on these three. I was suggesting you actually try it and see if it's a problem to try to infer it.

jreback on 5 Oct 2014

I did disagree with you, but I'm not sure what I've misunderstood. Can you be more specific?

kay1793 on 7 Oct 2014

@kay1793

I was trying to find a nice way to promote json_normalize; it's kind of buried.

That's why I suggested making read_json try it (e.g. if a string/filename is not presented). E.g. you have a 'json'-like object; read_json can internally call json_normalize and try to figure it out.

Doing this by default is problematic on many levels in the DataFrame constructor (though I wanted you to try it and see, maybe it IS possible to infer these types of multi-level dicts).

jreback on 7 Oct 2014
